Yahoo is focused on making the world’s daily habits inspiring and entertaining. To make this idea a reality, Yahoo has recently released a number of beautiful and innovative new or revamped products including Yahoo Weather, Yahoo Mail, Yahoo Sports, and Yahoo News Digest. Often, people experience the elegant simplicity of such award-winning apps only as far as they can see it on their mobile devices or desktops. However, making these products effective for hundreds of millions of people requires not only outstanding design and engineering, but also advanced scientific research.
This blog post is the first in our new series called “Science Powering Product,” in which we will discuss the science that helps make each Yahoo product a rich and enjoyable experience. Our goal is to offer you a deeper understanding of some of your daily habits. Today, we begin with one of the most important: news.
Imagine you are driving home from work and you notice the traffic slowing down to a crawl. The nearest exit is far away, and then you hear sirens, and you see police cars and ambulances. You turn on the car radio and try all your traffic apps, but have no idea what is happening. You’re stuck for twenty minutes in one place, with no relief in sight. Yes, you could breathe deeply, and listen to the latest NPR news, but wouldn’t it be nice if information about the cause of this traffic snarl were automatically pushed to your phone? Then you could hear a brief spoken summary of what had transpired.
Methods for automatic summarization have been explored since the 1950s, drawing on various disciplines including artificial intelligence, information retrieval, and natural language processing. Research over the years has attempted to summarize individual documents, collections of documents (‘multi-document summaries’), and even, in some cases, books; summarization systems have been implemented for scientific articles, news, email threads, meeting transcripts, and more. Given a target length for the summary or a ‘compression’ rate, these systems have for the most part extracted snippets from the source document(s), but in some cases they have attempted to revise or reformulate the input in various ways. In some situations, such as the scenario above, summaries may have to take into account user context, including what topics the user is interested in and what earlier information the user may have seen. When integrated with location-aware geo-referencing and mapping services, these summarization algorithms can produce summaries that form the basis for the news update in the traffic scenario described above.
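To make the idea of extraction under a compression rate concrete, here is a minimal sketch in Python. It is a generic, frequency-based illustration and not the method of any particular system mentioned in this post. The function names, the tiny stopword list, and the scoring rule (average word frequency per sentence) are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on", "for", "was", "are"}


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase words with stopwords removed.
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def extract_summary(text, compression=0.3):
    """Return roughly `compression` * (number of sentences) sentences, in source order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    # Word frequencies over the whole document stand in for "importance".
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    budget = max(1, math.ceil(compression * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # Keep the top-scoring sentences, then restore their original order.
    return [sentences[i] for i in sorted(ranked[:budget])]


if __name__ == "__main__":
    article = (
        "Traffic stopped on the highway after a crash. "
        "Police closed two lanes of the highway. "
        "A dog barked nearby. "
        "Drivers on the highway waited while police cleared the crash."
    )
    for line in extract_summary(article, compression=0.5):
        print(line)
```

A real system would need far more than this, such as robust sentence splitting, handling of different genres, and a check on whether the input is worth summarizing at all. The sketch only shows how a compression rate turns into a sentence budget.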
The last decade has seen the emergence of new summarization methods and frameworks, and also some progress on the thorny question of how to evaluate summaries. Summarization evaluation methods have factored in both informativeness (how representative the summary is of the content in the source) and coherence (whether the summary flows together and is readable, without redundancy or ill-formedness). Addressing these evaluation criteria can be a challenge for systems, especially when dealing with more informal genres of natural language.
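As a rough illustration of these two criteria, the Python sketch below computes two simple proxies. The first is unigram recall against a reference text (a human-written summary, or the source itself), in the spirit of overlap-based metrics. The second is the highest word overlap between any two summary sentences, as a crude redundancy signal. Both function names and both measures are assumptions made for this example. Real evaluations are considerably more careful, and coherence in particular is hard to capture with word counts.

```python
import re
from collections import Counter


def words(text):
    # Lowercased word tokens; punctuation is dropped.
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_recall(summary, reference):
    """Fraction of reference word occurrences that also appear in the summary."""
    ref = Counter(words(reference))
    cand = Counter(words(summary))
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[w]) for w, count in ref.items())
    return overlap / sum(ref.values())


def redundancy(summary_sentences):
    """Highest Jaccard word overlap between any two sentences (0.0 = none, 1.0 = same word set)."""
    sets = [set(words(s)) for s in summary_sentences]
    worst = 0.0
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            union = sets[i] | sets[j]
            if union:
                worst = max(worst, len(sets[i] & sets[j]) / len(union))
    return worst


if __name__ == "__main__":
    reference = "police closed the road after a crash"
    print(unigram_recall("police closed the road", reference))
    print(redundancy(["police closed the road", "the road was closed by police"]))
```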
These difficulties along with the above task constraints mean that developing a commercial summarization product requires a background of serious research expertise in the area. As the science lead on the team at Summly, founded by Nick D’Aloisio, Inderjeet Mani helped develop the world’s first commercial-scale mobile news summarization app. One of the crucial challenges the Summly team faced was scaling up to the sheer heterogeneity of styles and genres of newsfeeds across the world. Fortunately, they were able to address these challenges using highly robust and noise-tolerant machine learning methods along with a unique architecture. The Summly architecture tapped into both unsupervised and supervised machine learning algorithms, and relied on different types of features engineered for different components, along with rigorous evaluations on datasets in multiple languages. In an early interview, Nick (who was 17 at the time) summarized Summly’s approach succinctly: “We worked very hard to create the user interface … but equally, the technology is very robust…we’ve hired the best people in the world to create this algorithm that can take any news article, determine whether or not it’s summarizable, and then produce a coherent paragraph of text automatically with no human intervention that’s very scalable.” The end result was that mobile users got to read and offer feedback on about 90 million summaries of individual news documents in the few months between launch and Summly’s acquisition by Yahoo in April 2013!
One of the team’s first efforts at Yahoo was to help shape the technology into a product for the Yahoo mobile news app, which went live in May 2013. Soon after, Nick and Inderjeet started brainstorming about how to effectively communicate and package a roundup of key stories in the news to mobile users in an engaging manner. They agreed that summarization was too document-centric, and they had learned from Summly that people were willing to consume more content when it was boiled down to the most important bits. Nick, who had not only conceived this new project but was now managing it, was also pushing hard for the bits, or ‘atoms’, to be presented with a clean and minimalist look-and-feel, which he began wireframing and then implementing in collaboration with Yahoo’s design team in Mobile and Emerging Products (MEP). In the months that followed, extensive multi-document summarization experiments and evaluations were carried out in Yahoo Labs, along with intensive engineering, algorithm refinement, UI design, and further evaluations within the MEP team.
These combined efforts culminated in the launch, in January 2014, of the Yahoo News Digest, which delivers a definitive summary of a dozen or fewer need-to-know news stories to your phone twice a day. Each story corresponds to an automatic cluster of documents on a particular event in the news, and the summarization algorithm takes the cluster and assembles a short multi-document summary (or atom) of the content by selecting sentences within those documents. These textual summaries are integrated with other atoms that include maps, infographics, Wikipedia extracts, videos, photos and more. Instead of having the machine alone determine which of many stories are the ones you need to know, human editors help curate the content by selecting from a ranked list of stories. However, users who want even more stories are offered additional, uncurated content.
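The step of selecting sentences from a cluster of documents can be sketched as a greedy loop that rewards sentences whose words recur across the cluster and penalizes overlap with sentences already chosen. The Python below is a generic illustration of that idea, not the algorithm behind the Yahoo News Digest. The function names, the Jaccard overlap measure, and the `penalty` parameter are assumptions made for the example.

```python
import re
from collections import Counter


def tokens(text):
    # Set of lowercased word tokens.
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def summarize_cluster(documents, max_sentences=3, penalty=1.0):
    """Greedily pick sentences that are central to the cluster but not redundant."""
    # Candidate pool: every distinct sentence from every document in the cluster.
    candidates = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            sentence = sentence.strip()
            if sentence and sentence not in candidates:
                candidates.append(sentence)

    # In how many documents of the cluster does each word occur?
    doc_freq = Counter(w for doc in documents for w in tokens(doc))

    def relevance(sentence):
        t = tokens(sentence)
        # Average share of documents that contain each word of the sentence.
        return sum(doc_freq[w] for w in t) / (len(t) * len(documents)) if t else 0.0

    selected = []
    while candidates and len(selected) < max_sentences:
        def gain(sentence):
            overlap = max((jaccard(tokens(sentence), tokens(p)) for p in selected), default=0.0)
            return relevance(sentence) - penalty * overlap

        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    cluster = [
        "A crash closed the highway. Police are on the scene.",
        "A crash closed the highway. Two lanes remain blocked.",
        "The highway crash closed lanes. Drivers should expect delays.",
    ]
    for line in summarize_cluster(cluster, max_sentences=2):
        print(line)
```

The penalty is what distinguishes multi-document from single-document extraction here: several articles on the same event tend to repeat the same lead sentence, so without it the summary would restate one fact several times.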
Such a cool capability is only the tip of the iceberg of information that deserves to be summarized! In addition to being able to provide summaries for the initial traffic scenario above, it would be great to factor in additional sources of information such as social media chatter and even multimedia information, especially information found in transcribed speech or buried in images and videos. Further downstream may come summarization of movies, fiction, and other long-form content. The sheer diversity of such data and the challenges of working across media types can be daunting, but Yahoo Labs is well-positioned to address such problems with robust and rigorous science.
For more on the science of summarization, please see:
Inderjeet Mani. Automatic Summarization. John Benjamins (2001).
Ani Nenkova and Kathleen McKeown. Automatic Summarization. Foundations and Trends in Information Retrieval 5(2-3): 103-233 (2011).