Yahoo DC Welcomes the 2014 YALI Fellows

We’re thrilled to have taken part in this wonderful endeavor!

yahoopolicy:

By Tekedra Mawakana, Vice President of Global Public Policy, Deputy General Counsel


(Tekedra Mawakana, Yahoo VP of Global Public Policy, together with 2014 YALI Fellows and Dr. Sonya Smith of Howard University)

There’s no doubt that young people are the future of Africa: nearly 1 in…

Science Powering Product: Yahoo News Digest

Yahoo is focused on making the world’s daily habits inspiring and entertaining. To make this idea a reality, Yahoo has recently released a number of beautiful and innovative new or revamped products, including Yahoo Weather, Yahoo Mail, Yahoo Sports, and Yahoo News Digest. Often, people experience the elegant simplicity of such award-winning apps only through what they see on their mobile devices or desktops. However, making these products effective for hundreds of millions of people requires not only outstanding design and engineering, but also advanced scientific research.

This blog post is the first in our new series called “Science Powering Product” where we will discuss the science that helps make each Yahoo product a rich and enjoyable experience. Our goal is to offer you a deeper understanding of some of your daily habits. Today, we begin with one of the most important - news.


Imagine you are driving home from work and you notice the traffic slowing down to a crawl. The nearest exit is far away, and then you hear sirens, and you see police cars and ambulances. You turn on the car radio, play with all your traffic apps, but have no idea what is happening. You’re stuck for twenty minutes in one place, with no relief in sight. Yes, you could breathe deeply, and listen to the latest NPR news, but wouldn’t it be nice if information about the cause of this traffic snarl was automatically pushed to your phone? Then you could hear a brief spoken summary of what had transpired.

Methods for automatic summarization have been explored since the 1950s, drawing on various disciplines including artificial intelligence, information retrieval, and natural language processing. Research over the years has attempted to summarize individual documents, collections of documents (‘multi-document summaries’), and even, in some cases, books; summarization systems have been implemented for scientific articles, news, email threads, meeting transcripts, etc. Given a target length for the summary or a ‘compression’ rate, these systems have for the most part extracted snippets from the source document(s), but in some cases the summaries have attempted to revise or reformulate the input in various ways. In some situations, such as the scenario above, summaries may have to take into account user context, including what topics the user is interested in and what earlier information the user may have seen. When integrated with location-aware geo-referencing and mapping services, these summarization algorithms can produce summaries that form the basis for the news update in the traffic scenario described above.
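
To make the idea concrete, here is a minimal sketch of the classic extractive approach (a toy illustration, not the algorithm behind any Yahoo product): score each sentence by how frequent its content words are in the document, then keep the top-scoring sentences in their original order. The stopword list and scoring function are placeholder choices.

```python
import re
from collections import Counter

# Tiny placeholder stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "that", "on", "for", "it"}

def content_words(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def extractive_summary(text, max_sentences=3):
    """Score each sentence by the document-wide frequency of its content words
    and return the top-scoring sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in chosen)
```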

The last decade has seen the emergence of new summarization methods and frameworks, and also some progress on the thorny question of how to evaluate summaries. Summarization evaluation methods have factored in both informativeness (how representative the summary is of the content in the source) as well as coherence (ensuring the summary flows together and is readable, without redundancy or ill-formedness). Addressing these evaluation criteria can be a challenge for systems, especially when dealing with more informal genres of natural language.
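
As a rough illustration of the informativeness side of evaluation, one widely used family of metrics (ROUGE-style scores) measures how much of a human-written reference summary’s vocabulary the system summary recovers. A minimal unigram-recall sketch, not a full ROUGE implementation:

```python
from collections import Counter

def unigram_recall(system_summary, reference_summary):
    """Fraction of the reference summary's word occurrences that also appear
    in the system summary (a ROUGE-1-style recall score, ignoring case)."""
    sys_counts = Counter(system_summary.lower().split())
    ref_counts = Counter(reference_summary.lower().split())
    overlap = sum(min(count, sys_counts[word]) for word, count in ref_counts.items())
    return overlap / max(1, sum(ref_counts.values()))
```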

These difficulties, along with the task constraints above, mean that developing a commercial summarization product requires serious research expertise in the area. As the science lead on the team at Summly, founded by Nick D’Aloisio, Inderjeet Mani helped develop the world’s first commercial-scale mobile news summarization app. One of the crucial challenges the Summly team faced was scaling up to handle the sheer heterogeneity of styles and genres of newsfeeds across the world. Fortunately, they were able to address these challenges using highly robust and noise-tolerant machine learning methods along with a unique architecture. The Summly architecture tapped into both unsupervised and supervised machine learning algorithms, and relied on different types of features engineered for different components, along with rigorous evaluations on datasets in multiple languages. In an early interview, Nick (who was 17 at the time) summarized Summly’s approach succinctly: “We worked very hard to create the user interface … but equally, the technology is very robust…we’ve hired the best people in the world to create this algorithm that can take any news article, determine whether or not it’s summarizable, and then produce a coherent paragraph of text automatically with no human intervention that’s very scalable.” The end result was that mobile users got to read and offer feedback on about 90 million summaries of individual news documents in the few months between launch and Summly’s acquisition by Yahoo in April 2013!

One of the team’s first efforts at Yahoo was to help shape the technology into a product for the Yahoo mobile news app, which went live in May 2013. Soon after, Nick and Inderjeet started brainstorming about how to effectively communicate and package a roundup of key stories in the news to mobile users in an engaging manner. They agreed that summarization was too document-centric, and they had learned from Summly that people were willing to consume more content when it was boiled down to the most important bits. Nick, who not only conceived this new project but was now managing it, was also pushing hard for the bits, or ‘atoms’, to be presented with a clean and minimalist look and feel, which he began wireframing and then implementing in collaboration with Yahoo’s design team in Mobile and Emerging Products (MEP). In the months that followed, extensive multi-document summarization experiments and evaluations were carried out in Yahoo Labs, along with intensive engineering, algorithm refinement, UI design, and further evaluations within the MEP team.

These synergistic efforts culminated in the launch, in January 2014, of the Yahoo News Digest, which delivers twice a day to your phone a definitive summary of a dozen or fewer need-to-know news stories. Each story corresponds to an automatic cluster of documents on a particular event in the news, and the summarization algorithm takes the cluster and assembles a short multi-document summary (or atom) of the content by selecting sentences within those documents. These textual summaries are integrated with other atoms that include maps, infographics, Wikipedia extracts, videos, photos and more. Instead of having the machine alone determine which of many stories are the ones you need to know, human editors help curate the content by selecting from a ranked list of stories. However, users who want even more stories are offered additional, uncurated content.
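
As a simplified illustration of cluster-level (multi-document) summarization, and not the News Digest algorithm itself, the sketch below greedily picks informative sentences from a cluster of documents while skipping sentences that overlap too heavily with those already chosen, which keeps redundancy down:

```python
import re
from collections import Counter

def tokens(text):
    """Set of lowercased word tokens in a string."""
    return set(re.findall(r"[a-z']+", text.lower()))

def cluster_summary(documents, max_sentences=4, redundancy_threshold=0.5):
    """Greedily select high-scoring sentences from a cluster of documents,
    skipping sentences whose Jaccard overlap with an already-chosen sentence
    exceeds the redundancy threshold."""
    sentences = []
    for doc in documents:
        sentences.extend(re.split(r"(?<=[.!?])\s+", doc.strip()))

    # Score sentences by how many cluster-wide frequent words they contain.
    freq = Counter(w for s in sentences for w in tokens(s))
    ranked = sorted(sentences, key=lambda s: sum(freq[w] for w in tokens(s)), reverse=True)

    summary = []
    for sentence in ranked:
        if len(summary) >= max_sentences:
            break
        too_similar = any(
            len(tokens(sentence) & tokens(chosen)) / (len(tokens(sentence) | tokens(chosen)) or 1)
            > redundancy_threshold
            for chosen in summary
        )
        if not too_similar:
            summary.append(sentence)
    return summary
```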

Such a cool capability only touches the tip of the iceberg of information that deserves to be summarized! In addition to being able to provide summaries for the initial traffic scenario above, it would be great to factor in additional sources of information such as social media chatter and even multimedia information, especially information found in transcribed speech as well as buried in images, videos, etc.  Further downstream may come summarization of movies, fiction, etc. The sheer diversity of such data and the challenges of working across media types can be daunting, but Yahoo Labs is well-positioned to address such problems with robust and rigorous science.

 

For more on the science of summarization, please see:

Inderjeet Mani. Automatic Summarization. John Benjamins (2001).

Ani Nenkova and Kathleen McKeown. Automatic Summarization. Foundations and Trends in Information Retrieval 5(2-3): 103-233 (2011).

One Hundred Million Creative Commons Flickr Images for Research

by David A. Shamma

Today the photograph has transformed again. Twenty years ago, unprocessed rolls of C-41 sat in a fridge; ten years ago, we shared photos on the 1.5” screen of a point-and-shoot camera. Today the photograph is something different. Photos automatically leave their capture (and formerly captive) devices for many sharing services. There are a lot of photos. A back-of-the-envelope estimate suggests that 10% of all photos in the world were taken in the last 12 months, and that was calculated three years ago. And among these services, Flickr has been a great repository of images that are free to share via Creative Commons.

On Flickr, photos, their metadata, their social ecosystem, and the pixels themselves make for a vibrant environment for answering many research questions at scale. However, scientific efforts outside of industry have relied on various sized efforts of one-off datasets for research. At Flickr and at Yahoo Labs, we set out to provide something more substantial for researchers around the globe.

YFCC100M: Data, data, data… A glimpse of a small piece of the dataset. Creative Commons License: YFCC100M by aymanshamma on Flickr.

Today, we are announcing the Flickr Creative Commons dataset as part of Yahoo Webscope’s datasets for researchers. The dataset, we believe, is one of the largest public multimedia datasets that has ever been released—99.3 million images and 0.7 million videos, all from Flickr and all under Creative Commons licensing.

The dataset (about 12GB) consists of a photo_id, a jpeg url or video url, and some corresponding metadata such as the title, description, camera type, and tags. Plus, about 49 million of the photos are geotagged! What’s not there, like comments, favorites, and social network data, can be queried from the Flickr API.
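
For a sense of how you might work with the metadata once you have it, here is a hedged sketch that assumes a tab-separated file; the column order and field names below are hypothetical placeholders, so check the Webscope documentation for the actual layout of the release you download:

```python
import csv

# Hypothetical column layout, for illustration only; the real file's field
# order is documented on the Yahoo Webscope site.
FIELDS = ["photo_id", "url", "title", "description", "camera", "tags",
          "longitude", "latitude"]

def read_records(path):
    """Yield one dict per line of a (hypothetical) tab-separated metadata file."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            record = dict(zip(FIELDS, row))
            record["geotagged"] = bool(record.get("longitude"))
            yield record

# Example use: count the geotagged photos in a local copy of the metadata file.
# n_geotagged = sum(1 for r in read_records("yfcc100m_dataset.tsv") if r["geotagged"])
```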

One Million Creative Commons Geo-tagged Photos: A 1 million photo sample of the 48 million geotagged photos from the dataset, plotted around the globe. Creative Commons License: One Million Creative Commons Geo-tagged Photos by aymanshamma on Flickr.

But of course, processing 100 million images takes a fair bit of processing power, time, and resources that not every research institute has. To help here, we’ve worked with the International Computer Science Institute (ICSI) at Berkeley and Lawrence Livermore National Laboratory to compute many open, standardized computer vision and audio features†, which we plan to host in a shared Amazon instance, since the collection is somewhere north of 50TB, for researchers around the world to use. It’s a pretty intense computation, and they brought in a first-of-its-kind supercomputer, the Cray Catalyst, to make the calculations.

The dataset is available now!

The dataset can host a variety of research studies and challenges. One of the first challenges we are issuing is the MediaEval Placing Task, where the task is to build a system capable of accurately predicting where in the world the photos and videos were taken without using the longitude and latitude coordinates. This is just the start. We plan to create new challenges through expansion packs that will widen the scope of the dataset with new tasks like object localization, concept detection, and social semantics.

Interested? Head over to the Yahoo Webscope site to request the dataset. If you have any questions, you can get those answered there as well.

Happy Researching!


† In case you’re curious: SIFT, GIST, Auto Color Correlogram, Gabor Features, CEDD, Color Layout, Edge Histogram, FCTH, Fuzzy Opponent Histogram, Joint Histogram, Kaldi Features, MFCC, SACC_Pitch, and Tonality.

Faculty Research And Engagement Program 2014 Recipients Selected


It is in the spirit of collaboration and desire to discover answers to complex problems that we are excited to announce the recipients of the 2014 Yahoo Faculty Research and Engagement Program (FREP) award. This academic outreach initiative is designed to produce the highest quality scientific collaborations and outcomes by engaging with faculty and students conducting research in areas of mutual interest. The FREP awards hundreds of thousands of dollars in unrestricted gifts to support new, exciting Internet research studies and experiments between academics across the globe and their Yahoo research scientist counterparts.

Over the course of the next year and beyond, FREP award recipients and Yahoo Labs scientists will work closely to further research in their mutual areas of interest. Yahoo Labs Research Scientist Mihajlo Grbovic will be working with Stanford University Assistant Professor Jure Leskovec on the network-based detection of potentially compromised accounts in Yahoo Mail and Tumblr. Mihajlo and Jure further discuss their research and the FREP award here:

We were extremely impressed with all of the submissions and would like to thank each professor who applied. Congratulations to the following recipients of the Yahoo 2014 Faculty Research and Engagement Program:

[Images: the 2014 FREP award recipients]

At Yahoo Labs, we’re committed to forging strong alliances with top faculty by collaborating on cutting-edge research to advance Web Science. These collaborations will tackle shared problems with measurable outcomes, such as joint papers and advances in algorithm design, systems research, digital media studies, and marketplace design. FREP supports all areas of Yahoo Labs research.

If you have questions about the Faculty Research and Engagement Program, please contact Kim Capps.

Goalr! The Science of Predicting the World Cup on Tumblr

By Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, and Narayan Bhamidipati

With the 2014 FIFA World Cup kicking off on June 12, billions of fans across the world are turning their attention toward host country Brazil to root for their favorite teams. Soccer (or, if you prefer, football) fans are loud; you need only remember the last World Cup’s infamous vuvuzelas for a demonstration. But fans aren’t only loud in stadiums. They also make their voices heard across social media. And though you may assume these fans are just blowing their vuvuzelas into the social abyss, if you listen closely, you’ll discover a treasure trove of data — including possibly an answer to the most important question of all: Who will win?

As soccer fans and Yahoo Labs scientists with access to Tumblr data, we wanted to find out if we could take advantage of our unique insight to comb through an ocean of posts to predict a World Cup winner. And we have! But before we share our prediction on which nation will get to revel in World Cup glory, we’ll tell you how we figured it out.

Sifting through 188.9 million Tumblr blogs comprising 83.1 billion posts to find World Cup-related content wasn’t easy. To begin, we used two main parameters to determine which content was relevant: posts with hashtags referencing #WorldCup, #World Cup, #Copa do mundo (or other variants outlined in our technical report), and posts with hashtags referencing #soccer, #football, #futbol, etc.

However, using these parameters alone proved too broad. So once we isolated #WorldCup-related posts, we checked the bodies of the posts for mentions of country names. Then we did the same for #soccer-related posts. (For Team USA, we counted only mentions in #soccer posts to avoid confusion with American football. For Team Brazil, we discounted a percentage of posts because the country is hosting the event and thus receives extra mentions; this percentage was calculated based on an editorial evaluation of a sample of posts.)
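
A simplified sketch of this filtering step might look like the following; the post structure (a dict with 'tags' and 'body' keys) is hypothetical, and the special handling for Team USA and Team Brazil described above is omitted for brevity:

```python
WORLD_CUP_TAGS = {"worldcup", "world cup", "copa do mundo"}
SOCCER_TAGS = {"soccer", "football", "futbol"}

def count_country_mentions(posts, countries):
    """Count, for each country, how often it is mentioned in the bodies of
    World Cup-related and soccer-related posts.

    `posts` is an iterable of dicts with 'tags' (a list of hashtags) and
    'body' (the post text); this structure is illustrative only."""
    counts = {country: {"worldcup": 0, "soccer": 0} for country in countries}
    for post in posts:
        tags = {t.lower().lstrip("#") for t in post["tags"]}
        body = post["body"].lower()
        if tags & WORLD_CUP_TAGS:
            bucket = "worldcup"
        elif tags & SOCCER_TAGS:
            bucket = "soccer"
        else:
            continue  # not a relevant post
        for country in countries:
            if country.lower() in body:
                counts[country][bucket] += 1
    return counts
```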


To get even more representative results, we checked the bodies of posts in both hashtag categories for mentions of any national team player according to FIFA’s official list of players for each nation.


Upon completion of our filtering, we were left with 27.3 million relevant posts from February through May. The fun (read: science-y) part came next.

In order to figure out how the countries would stack up against one another, we needed to assign values of strength to each team. These values were calculated for each matchup and provided a representative game score. More specifically, when two teams are positioned to play against each other, we estimated the number of goals scored by each team using a Poisson distribution with four differently-weighted parameters learned via maximum likelihood estimation on prior games (qualifications, friendlies, etc.). The four parameters were: (1) team mentions in #WorldCup-related posts, (2) team mentions in #soccer-related posts, (3) the average number of player mentions per team, and (4) the standard deviation of player mentions per team.
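
The sketch below shows the general shape of such a model rather than the exact one used in the study: it maps the four Tumblr-derived features for a team to a Poisson goal rate (the weights are placeholders here; in the study they were learned by maximum likelihood on prior games) and then simulates a single matchup by repeated sampling:

```python
import math
import random

# Placeholder weights for illustration; in practice they would be learned by
# maximum likelihood estimation on prior games (qualifiers, friendlies, etc.).
WEIGHTS = [0.8, 0.5, 0.3, -0.1]

def expected_goals(features):
    """Map the four features (World Cup mentions, soccer mentions, mean player
    mentions, std. dev. of player mentions) to a positive Poisson rate.
    Assumes the features have been normalized to comparable scales."""
    return math.exp(sum(w * f for w, f in zip(WEIGHTS, features)))

def sample_goals(rate):
    """Draw a goal count from a Poisson distribution (Knuth's algorithm,
    fine for the small rates typical of soccer scores)."""
    threshold, k, p = math.exp(-rate), 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1

def simulate_match(features_a, features_b, n=10000):
    """Estimate (win A, draw, win B) probabilities for one matchup by simulation."""
    rate_a, rate_b = expected_goals(features_a), expected_goals(features_b)
    wins_a = wins_b = draws = 0
    for _ in range(n):
        goals_a, goals_b = sample_goals(rate_a), sample_goals(rate_b)
        if goals_a > goals_b:
            wins_a += 1
        elif goals_b > goals_a:
            wins_b += 1
        else:
            draws += 1
    return wins_a / n, draws / n, wins_b / n
```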


Finally, we were left with a statistical model predicting the outcome of each successive matchup based on our calculations. Taking into account the 27.3 million relevant posts, we had a complete bracket and a winner: Team Brazil.


Do you agree with our prediction? Think some other team will win? Make sure to check back to see how well the World Cup social frenzy on Tumblr predicted the outcome.

Rashmi Mohan Elected to ACM India Council


We are proud to announce that our Senior Manager of Research Engineering, Rashmi Mohan, has been elected to the Association for Computing Machinery’s (ACM) India Council. Rashmi will begin her four-year term as a Member-at-Large on July 1.

The ACM describes the India Council as an “effort of ACM aimed at increasing the level and visibility of ACM activities across India.” They say that “the ACM community in India is growing in membership, number of chapters, sponsored conferences and symposia.”

According to their website, the ACM in India focuses on the following range of activities that comprise a “cross section of the computer science and information technology community”:

  • serving as a professional network for individuals who are involved with the science and technology of computing
  • encouraging students to take an active interest in the emerging and exciting world of computing
  • facilitating the organization of high-quality ACM conferences in India
  • providing logistical support to grow more ACM professional and student chapters
  • enhancing access to the ACM Digital Library and publications for ACM members in India
  • increasing the participation of ACM members in India across all dimensions of ACM

Rashmi is based in our Bangalore office and has spent the past thirteen years in various technical and management roles at Yahoo. She has a background in front end engineering and leads a group of research engineers and scientists within Yahoo Labs. Rashmi is also the president of the Women in Tech (WIT) Bangalore chapter and is an active volunteer with the Grace Hopper Conference in India.

Dr. Jiawei Han Discusses How “Big Data Needs Big Structure”


On Thursday, we were honored to have Dr. Jiawei Han, Abel Bliss Professor of Computer Science at the University of Illinois at Urbana-Champaign, present a Big Thinkers talk at Yahoo entitled, “Construction, Exploration and Mining of Semi-Structured, Heterogeneous Information Networks.” Dr. Han’s presentation focused on the necessity of mining typed, heterogeneous information networks in order to uncover considerable knowledge from interconnected data. Dr. Han talked about how, when mining heterogeneous information networks, one needs to treat each term as a “first-class citizen” (just of a different type); how, in heterogeneous information networks, different meta-paths carry rather different semantics (and give different results); and how meta-path relationships among similar-typed links share similar semantics and are comparable and inferable.

The event was broadcast live on our labs.yahoo.com homepage and viewers had the opportunity to ask questions and comment on our Twitter stream @YahooLabs as well as our Facebook page.

You can view Dr. Han’s full presentation here:

Getting to Know You and Your Flickr Photos Better with Research


By Saeideh Bakhshi

Mobile phone photography has risen dramatically in popularity in recent years. For example, the various iPhone models have been Flickr’s most popular cameras for years. Part of the success of mobile camera phone sharing is attributed to the use of on-camera visual effects. These effects, or filters, provide a quick preset path to an artistic rendering of the photo. Mobile photo sharing sites, such as Flickr and Instagram, provide several filter options; yet, despite their widespread use and the Human-Computer Interaction (HCI) community’s interest in mobile photography, there is little work, scholarly or otherwise, on filters, their use, and their effect on photo-sharing communities.

Understanding how people create photos, share them, and engage with visual content can create opportunities for better interface design, and suggest insights into social interactions via social media sites.

When conducting research, scientists have many tools at their disposal. However, there are often questions that cannot be answered without direct feedback from users. In an ongoing research study, we aim to understand the effect of filters on the popularity and social engagement of Flickr photos. And while we can estimate some of these effects with large scale data analysis, understanding motivations and perceptions of the users requires direct communication with them. That is why we are calling on Flickr users to help us in better understanding how they use filters.

If you use the Flickr mobile app, we want to talk to you. If you are interested in helping us help you make your Flickr photo experience better (or know someone who is), please get back to us by completing the following survey: https://www.surveymonkey.com/s/P67VSCG.