Great work by our Research Scientist Karen Church from our Mobile Sensing and User Behavior Research group!
What if there were a smarter way to expand your social network? That is to say, what if you were offered a reason to follow someone beyond “people you may know” on Flickr, Facebook, and LinkedIn, “recommended blogs” on Tumblr, and “who to follow” on Twitter, for example?
The recommendation of users to other users is one of the most fundamental and important features of online social-networking platforms. This capability helps people get a faster start in building their networks, which in turn drives engagement and loyalty. Given that growing the user base and maintaining a high level of engagement are key factors in the success (or failure) of these billion-dollar businesses, the importance of user-recommendation systems is clear.
These systems usually exploit the structure of a given social graph in order to predict which new connections, or social links, are likely to appear in the future; this task is known as “link prediction” in the machine-learning literature. As an example, the most basic rule for link prediction is “triadic closure”: If person A follows person B, and person B follows person C, then it is likely that A might be interested in following C as well. Of course, much more sophisticated techniques and algorithms are behind the real-world systems we mention above. In fact, a social-networking platform maintains much more information than just a social graph (who is connected to whom). The system may also know which school you attended, the places where you live and work, and the music and movies you like. While this wealth of information is surely utilized to recommend connections, it is not capitalized upon to explain a recommendation.
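As a concrete illustration, the triadic-closure rule can be sketched as a simple common-intermediary count over a follow graph. The toy graph and function below are invented for illustration and are not the algorithm behind any production system:

```python
def triadic_closure_scores(follows):
    """Count, for every unlinked pair (a, c), how many intermediaries b
    satisfy a -> b -> c. Higher counts suggest likelier future links.
    `follows` maps each user to the set of users they follow."""
    scores = {}
    for a, bs in follows.items():
        for b in bs:
            for c in follows.get(b, set()):
                if c != a and c not in bs:  # skip self and existing links
                    scores[(a, c)] = scores.get((a, c), 0) + 1
    return scores
```

For a toy graph where A follows B and B follows C and D, the rule surfaces (A, C) and (A, D) as candidate links, each supported by one intermediary.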
Enhancing recommendations with explanations adds an important layer of trust for someone using a social network. And when someone has a more compelling explanation for a suggestion, they are more likely to click “follow” and expand their network. While this premise is well understood in classic collaborative-filtering recommender systems, providing explanations in the context of the user-recommendation systems we describe is still largely underdeveloped. In fact, in most real-world systems, the explanations given for user recommendations are something along the lines of, “You should follow person Z because your contacts X and Y do the same.”
Based on this consideration, in our recent research paper at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), we introduce “Link Prediction with Explanation,” a novel machine learning task for which we devised a model dubbed, “Who to Follow and Why.” Our foundational observation is rooted in sociology, where it is known as “common identity and common bond theory.” In simple terms, people create connections either due to common interests (topical links) or based upon common social settings such as family, work, school, etc. (social links). Our framework not only recommends new social connections, but for each suggested link it decides whether it is topical or social, and depending on this decision, the system produces a different type of explanation.
Based on this premise, our Who to Follow and Why framework works as follows: A topical link is usually recommended to person A when person B is authoritative in a topic in which A has demonstrated interest. In this case the explanation takes the form of keywords (e.g., tags on Flickr or hashtags on Twitter) in which A shows interest and B is authoritative.
A social link is recommended when A and B are already part of the same social community, i.e., they have many common contacts. In this instance, the explanation is the standard for current user-recommender systems: A should follow B because they have common acquaintances.
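Putting the two cases together, a recommender in the spirit of Who to Follow and Why might classify each candidate link and pick the explanation accordingly. Every function name and threshold below is hypothetical, a sketch of the idea rather than the paper's actual model:

```python
def explain_link(a, b, follows, interests, authority,
                 min_mutual=2, min_authority=0.5):
    """Classify a suggested link a -> b as social or topical and build
    the matching explanation. `interests[u]` and `authority[u]` map
    users to tag -> weight dicts; all thresholds are illustrative."""
    mutual = follows.get(a, set()) & follows.get(b, set())
    if len(mutual) >= min_mutual:
        # social link: a and b already sit in the same community
        return "social", f"You and {b} both follow {sorted(mutual)}"
    # topical link: b is authoritative on tags that a cares about
    tags = sorted(t for t in interests.get(a, {})
                  if authority.get(b, {}).get(t, 0.0) >= min_authority)
    if tags:
        return "topical", f"{b} is an authority on {tags}"
    return None, None
```

In practice the interest and authority weights would come from the model itself (e.g., learned from tag usage), not be hand-specified as here.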
As an important by-product, Who to Follow and Why also implicitly detects communities of users and whether they are social or topical. “Community detection” is arguably the most important topic of study in the analysis of social networks. There are obvious associated sociological interests, but there are also many concrete applications, for example in advertising. Furthermore, our model can describe the detected communities with the most relevant keywords, and for topical ones, can also identify the most influential users.
Though our research is still in its early stages, our framework for the recommendation of topical connections and their correlating explanations could help users discover other users that produce more interesting content than they would have otherwise come across; that type of discovery is much more difficult via current social links. And of course, the benefits for the social platforms themselves could be huge.
We all know how busy the world is today. People race around from place to place trying to shave off minutes from their commutes in order to squeeze in more time for other things. But what if you had a happier, more pleasant journey?
In a previous Tumblr post called “Can Cities Make us Happy?”, we summarized our preliminary work on which urban elements make people happy. We found that in London, for example, people associate public gardens and Victorian and red brick houses with beauty and happiness, and that cars and fortress-like buildings are associated with sadness.
In our latest research we put those insights to practical use in the form of maps and routes. Existing mapping technologies return the shortest directions. Now, imagine a new mapping tool that, instead of suggesting the shortest walking course from A to B, is able to suggest a route that is both short and pleasant. Based on our previous work, we designed algorithms that automatically map out the most beautiful, quiet, and happy routes between two points. Averaging the results of the three algorithms, our study showed that respondents preferred the more scenic, quiet, and happy routes despite their being 12% longer and taking roughly seven and a half minutes more.
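One plausible way to realise such a route planner is to run Dijkstra's algorithm with an edge cost that blends length with an "unpleasantness" score (which could come from crowdsourced beauty ratings, as in the study). This is a sketch under those assumptions, not necessarily the algorithms used in the research; the street graph and weights are invented:

```python
import heapq

def pleasant_route(graph, start, goal, alpha=0.5):
    """Dijkstra over a street graph whose edges carry (metres,
    unpleasantness in [0, 1]). `alpha` trades distance against
    pleasantness: alpha=0 recovers the plain shortest path."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, (metres, ugly) in graph.get(u, {}).items():
            cost = d + metres * (1 + alpha * ugly)
            if cost < dist.get(v, float("inf")):
                dist[v], prev[v] = cost, u
                heapq.heappush(heap, (cost, v))
    path, node = [], goal
    while node != start:  # walk predecessors back to the start
        path.append(node)
        node = prev[node]
    return [start] + path[::-1]
```

With a positive `alpha`, a slightly longer street lined with gardens beats a marginally shorter one past fortress-like buildings, which mirrors the 12%-longer trade-off the study observed.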
More interestingly perhaps, our study participants in London and Boston loved to attach memories to places: both personal memories (e.g., “This is the street I gave my first kiss.”) and shared memories (e.g., “That’s where the old BBC building was.”) In “Remembrance of Things Past,” French novelist Marcel Proust described how a small bite of a madeleine cake unleashed a cascade of memories from childhood. In a similar way, our participants found places to be pleasant (or not) and memorable depending on the way they smelled and sounded. It turns out that these smells and sounds also play a role in the paths people take from one place to another. This point raises a new question with fascinating implications for the research community: What if we had a mapping tool that suggested pleasant routes based not only on aesthetics, but also on memories, smells, and sounds?
Our study produced one other compelling point worth mentioning — participants pointed out that the experience of a place changes during the course of a day. For example, one of our London participants commented, “Fleet street is beautiful because of its history. However, depending on the time of day, it can be colorless and busy leading to the opposite results.” The idea that the pleasantness of routes differs depending on the daily course of the sun, variance in temperature, and noise level is extremely insightful and nuanced.
As we continue to research the shortest path to happiness, we’re thinking about all these questions. If you find this concept as interesting as we do and live in Berlin, Boston, London, or Turin, then we’d love for you to share your memories around a few paths in your city here. You’ll be helping us with our research, and hopefully making people’s paths happier.
Mark your calendars! Dr. Ben Shneiderman, Professor of Computer Science and Founding Director of the Human-Computer Interaction Laboratory at the University of Maryland, is coming to Yahoo - and your computer screen - on September 18.
His talk is entitled: “Information Visualization for Knowledge Discovery: Big Insights from Big Data”
Interactive information visualization tools provide researchers with remarkable capabilities to support discovery from Big Data resources. Users can begin with an overview, zoom in on areas of interest, filter out unwanted items, and then click for details-on-demand. The Big Data initiatives and commercial success stories such as Spotfire and Tableau, plus widespread use by prominent sites such as The New York Times, have made visualization a key technology.
The central theme is the integration of statistics with visualization as applied for time series data, temporal event sequences such as electronic health records (http://www.cs.umd.edu/hcil/eventflow), and social network data (http://www.codeplex.com/nodexl). By temporal pattern search & replace and network motif simplification, complex data streams can be analyzed to find meaningful patterns and important exceptions. The talk closes with 8 Golden Rules for Big Data.
Ben Shneiderman (http://www.cs.umd.edu/~ben) is a Distinguished University Professor in the Department of Computer Science and Founding Director (1983-2000) of the Human-Computer Interaction Laboratory (http://www.cs.umd.edu/hcil/) at the University of Maryland. He is a Fellow of the AAAS, ACM, and IEEE, and a Member of the National Academy of Engineering, in recognition of his pioneering contributions to human-computer interaction and information visualization. His contributions include the direct manipulation concept, clickable web-link, touchscreen keyboards, dynamic query sliders for Spotfire, development of treemaps, innovative network visualization strategies for NodeXL, and temporal event sequence analysis for electronic health records.
Ben is the co-author with Catherine Plaisant of Designing the User Interface: Strategies for Effective Human-Computer Interaction (5th ed., 2010) http://www.awl.com/DTUI/. With Stu Card and Jock Mackinlay, he co-authored Readings in Information Visualization: Using Vision to Think (1999). His book Leonardo’s Laptop appeared in October 2002 (MIT Press) and won the IEEE book award for Distinguished Literary Contribution. His latest book, with Derek Hansen and Marc Smith, is Analyzing Social Media Networks with NodeXL (http://www.codeplex.com/nodexl, 2010).
YAHOO LABS BIG THINKERS SPEAKER SERIES
Yahoo Labs is proud to bring you its 2014 Big Thinkers Speaker Series. Each year, some of the most influential, accomplished experts from the research community visit our campus to share their insights on topics that are significant to Yahoo. These distinctive speakers are shaping the future of the new sciences underlying the Web and are guaranteed to inform, enlighten, and inspire.
By Don McGillen
Have you ever wondered what happens to computer servers if you shake them really hard? The loss of functionality of data and telecommunication centers could have a disastrous impact on emergency operations and on the ability of communities to respond and recover when an earthquake hits. That’s why Howard University Associate Professor Claudia Marin-Artieda thinks about it all the time. In fact, Dr. Marin-Artieda, who works in Howard’s Civil and Environmental Engineering Department, received a National Science Foundation (NSF) Career Award to study seismic protection systems for equipment and components in multi-story facilities that include data centers and computer-based communication centers. And now we’re helping her find the answer to that question.
Our Academic Relations (AR) team has been working hard to develop a strong and rich relationship with Howard University. Of the Historically Black Colleges and Universities (HBCUs), Howard is arguably the most elite in Computer Science, and the only HBCU to offer a PhD in the subject. Over the past year our AR team – along with Yahoo colleagues in our Washington, D.C. office – has partnered with Howard on a number of exciting initiatives, including recently hosting 25 young future leaders from various African nations as part of the Obama administration’s Young African Leaders Initiative.
As part of another program called Yahoo Servers to Academic Researchers (YSTAR), we donated 125 servers from our data centers to Howard last fall. The gift is enabling education initiatives never before possible at the university, and is spurring research with partnering institutions like the State University of New York at Buffalo through the Network for Earthquake Engineering Simulation (NEES) sponsorship.
It is at the University at Buffalo that full-scale laboratory tests are currently being conducted on a frame and 40 servers donated by Yahoo. The seismic performance of the Yahoo frame will be tested on its own and supported on seismic isolated platforms under three-directional earthquake shaking. Dr. Marin-Artieda says, “The studies will provide valuable information regarding the validation of seismic solutions to achieve a desired protection level in essential facilities that are currently lacking. These studies are relevant since they will provide data on 1) deformation levels under severe earthquake shaking that are imposed to equipment-systems in essential facilities to achieve functionality requirements, 2) experimental data on systems characterization that is currently lacking, 3) validation of seismic solutions to achieve a desired protection level, etc.”
Marin adds that, “Implementing the seismic protective options emerging from this research will reduce the vulnerabilities during and after an earthquake of data centers and telecommunication centers. The research is directly addressing critical needs of the earthquake engineering community by validating high-performance options to protect equipment and components of essential facilities.”
At Yahoo Labs, through the engagement of our Academic Relations team, we are thrilled to support such crucial research with such high-stakes, real-world impact. And since our headquarters is in one of the world’s most earthquake-prone locations, this study holds a special place in our heart!
For more on Dr. Claudia Marin-Artieda’s work, please see http://www.howard.edu/seismicpps/.
By Suju Rajan
Replay the day of the World Cup final. Were you a super-fan engaging in the collective sounds of despair and hope, egging on your favorite team? Or were you “Neymar or nothing” when Neymar got injured, becoming indifferent to the outcome? Maybe you were “show me the scores” and just wanted to know the end-result… or perhaps even that didn’t matter. If you placed yourself in any of the above categories, where did you go for your World Cup news fix? With so many options, was it difficult for you to make a decision?
Traditional news sites attempt to cater to their specific audiences. For instance, the two portals SF Gate and San Jose Mercury News in the San Francisco Bay Area focused more on local events of that day while giving a nod to the World Cup.
Portals with a larger demographic such as The New York Times placed more importance on the tournament.
Over the years, such sites have learned what attracts their target demographic and perhaps even how to optimize for it.
Knowing how best to cater to a specific target demographic is a well-defined problem. Personalizing for a global audience so each and every user has a stellar experience, like we do at Yahoo, is much more difficult (and measuring user satisfaction and optimizing for it requires a separate blog post in itself). But why is this problem so hard? Wouldn’t a traditional recommendation system work in this context?
Recommendation systems work best in the space of items (e.g. movies on Netflix, products on Amazon, songs on Pandora, etc.) that have a long shelf life. These items accumulate enough feedback over time to continue to be valid recommendation candidates in the future. News articles, on the other hand, have a much shorter shelf life, which makes it necessary for them to be recommended in a timely fashion. Typical solutions recommend news articles that are similar to ones you have already read. However, such “nearest-neighbors solutions” are obviously not optimal.
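A minimal version of that nearest-neighbors baseline ranks candidate articles by cosine similarity to the one just read. The toy articles below are made up; real systems would use richer features than raw word counts:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[t] for t, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest_articles(just_read, candidates, k=2):
    """Baseline 'more like what you read' recommender: rank candidate
    articles (title -> word Counter) by similarity to the last read."""
    ranked = sorted(candidates.items(),
                    key=lambda item: cosine(just_read, item[1]),
                    reverse=True)
    return [title for title, _ in ranked[:k]]
```

The weakness is visible even in this sketch: the recommender can only echo the vocabulary of what you already read, which is exactly why context and intent (discussed next) matter.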
At Yahoo Labs, we want to answer why you read an article and in what context, so that when you go anywhere on the Yahoo site, we will know how to show you what you are most interested in seeing. And to make your Yahoo experience more robust and delightful, our system doesn’t just recommend news articles, but also mobile apps, videos, and blogs.
So how does such a system work? To begin, we attempt to understand — or at least form a hypothesis on — why you read a given article in the first place. Then we continually test and refine that hypothesis. This user-understanding component needs not only to nail the space in which to represent all the interests of the users, but also to figure out the temporal aspects: Is your fascination with Argentina’s Lionel Messi a short-term or a long-term interest? Do you want to know if Messi sneezes, or are you oblivious to his existence until the next World Cup in 2018?
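One common way to capture that short-term versus long-term distinction is to decay each interaction's contribution by its age, with the half-life controlling how quickly the profile forgets. This is a sketch of the general technique, not Yahoo's actual user model:

```python
import math

def interest_weight(ages_in_days, half_life_days=14.0):
    """Collapse a user's clicks on one topic into a single weight,
    decaying each click exponentially by its age in days. A short
    half-life favours recent bursts (short-term interests); a long
    one preserves stable habits (long-term interests)."""
    decay = math.log(2.0) / half_life_days
    return sum(math.exp(-decay * age) for age in ages_in_days)
```

With a two-week half-life, two clicks this week outweigh ten clicks spread over the past year; with a one-year half-life, the steady habit wins. Running both profiles in parallel is one way to tell a World Cup fling from a lifelong Messi obsession.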
Besides modeling user interests, this complex personalization system needs to rank the right articles in the right context. What do you prefer when you use your mobile phone to scan news: text or video? What do you like to read in the morning: local news or celebrity gossip? Do you prefer reading local news on a specific app? How do your preferences vary between your tablet and desktop? How can we also account for your news-feed scroll behavior on your tablet as opposed to a desktop?
The Personalization Science team at Yahoo Labs is working hard to answer all of these questions and many more in order to provide our users with the best possible online experience. If the problem of identifying subsets of users who want to know about Messi’s sneezing (or not!) piques your interest, then stay tuned for our team’s future blog posts as we explain the science behind personalizing Yahoo.
We’re thrilled to have taken part in this wonderful endeavor!
By Tekedra Mawakana, Vice President of Global Public Policy, Deputy General Counsel
(Tekedra Mawakana, Yahoo VP of Global Public Policy, together with 2014 YALI Fellows and Dr. Sonya Smith of Howard University)
There’s no doubt that young people are the future of Africa: nearly 1 in…
Yahoo is focused on making the world’s daily habits inspiring and entertaining. In making this idea a reality, Yahoo has recently released a number of beautiful and innovative new or revamped products including Yahoo Weather, Yahoo Mail, Yahoo Sports, and Yahoo News Digest. Often, people only get to experience the elegant simplicity of such award-winning apps in as much as they can see them on their mobile devices or desktops. However, making these products effective for hundreds of millions of people requires not only outstanding design and engineering, but also advanced scientific research.
This blog post is the first in our new series called “Science Powering Product” where we will discuss the science that helps make each Yahoo product a rich and enjoyable experience. Our goal is to offer you a deeper understanding of some of your daily habits. Today, we begin with one of the most important - news.
Imagine you are driving home from work and you notice the traffic slowing down to a crawl. The nearest exit is far away, and then you hear sirens, and you see police cars and ambulances. You turn on the car radio, play with all your traffic apps, but have no idea what is happening. You’re stuck for twenty minutes in one place, with no relief in sight. Yes, you could breathe deeply, and listen to the latest NPR news, but wouldn’t it be nice if information about the cause of this traffic snarl was automatically pushed to your phone? Then you could hear a brief spoken summary of what had transpired.
Methods for automatic summarization have been explored since the 1950’s, drawing on various disciplines including artificial intelligence, information retrieval, and natural language processing. Research over the years has attempted to summarize individual documents, collections of documents (‘multi-document summaries’), and even, in some cases, books; summarization systems have been implemented for scientific articles, news, email threads, meeting transcripts, etc. Given a target length for the summary or a ‘compression’ rate, these systems have for the most part extracted snippets from the source document(s), but in some cases the summaries have attempted to revise or reformulate the input in various ways. In some situations such as the scenario above, summaries may have to take into account user context, including what topics the user is interested in and what earlier information the user may have seen. When integrated with location-aware geo-referencing and mapping services, these summarization algorithms can produce summaries that form the basis for the news update in the traffic scenario described above.
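The snippet-extraction approach described above can be sketched with the classic word-frequency heuristic (in the spirit of Luhn's 1958 method): score each sentence by how frequent its words are in the document, then keep the top-scoring sentences in their original order. A minimal sketch, far simpler than any production summarizer:

```python
import re
from collections import Counter

def extractive_summary(text, max_sentences=2):
    """Frequency-based extractive summarizer: sentences whose words
    are common in the document are assumed to carry its gist."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)
    keep = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    # emit kept sentences in their original order for readability
    return " ".join(s for s in sentences if s in keep)
```

Note how this already exposes the two evaluation axes mentioned below: picking high-frequency sentences targets informativeness, while preserving original order is a crude stab at coherence.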
The last decade has seen the emergence of new summarization methods and frameworks, and also some progress on the thorny question of how to evaluate summaries. Summarization evaluation methods have factored in both informativeness (how representative the summary is of the content in the source) as well as coherence (ensuring the summary flows together and is readable, without redundancy or ill-formedness). Addressing these evaluation criteria can be a challenge for systems, especially when dealing with more informal genres of natural language.
These difficulties along with the above task constraints mean that developing a commercial summarization product requires a background of serious research expertise in the area. As the science lead on the team at Summly, founded by Nick D’Aloisio, Inderjeet Mani helped develop the world’s first commercial-scale mobile news summarization app. One of the crucial challenges the Summly team faced was scaling up to the sheer heterogeneity of styles and genres of newsfeeds across the world. Fortunately, they were able to address these challenges using highly robust and noise-tolerant machine learning methods along with a unique architecture. The Summly architecture tapped into both unsupervised and supervised machine learning algorithms, and relied on different types of features engineered for different components, along with rigorous evaluations on datasets in multiple languages. In an early interview, Nick (who was 17 at the time) summarized Summly’s approach succinctly: “We worked very hard to create the user interface … but equally, the technology is very robust…we’ve hired the best people in the world to create this algorithm that can take any news article, determine whether or not it’s summarizable, and then produce a coherent paragraph of text automatically with no human intervention that’s very scalable.” The end result was that mobile users got to read and offer feedback on about 90 million summaries of individual news documents in the few months between launch and Summly’s acquisition by Yahoo in April 2013!
One of the team’s first efforts at Yahoo was to help shape the technology into a product for the Yahoo mobile news app, which went live in May 2013. Soon after, Nick and Inderjeet started brainstorming about how to effectively communicate and package a roundup of key stories in the news to mobile users in an engaging manner. They agreed that summarization was too document-centric, and they had learned from Summly that people were willing to consume more content when it was boiled down to the most important bits. Nick, who not only conceived, but was now managing this new project, was also pushing hard for the bits, or ‘atoms’, to be presented with a clean and minimalist look-and-feel, which he began wireframing and then implementing in collaboration with Yahoo’s design team in Mobile and Emerging Products (MEP). In the months that followed, extensive multi-document summarization experiments and evaluations were carried out in Yahoo Labs, along with intensive engineering, algorithm refinement, UI design, and further evaluations within the MEP team.
These synergistic efforts culminated in the launch, in January 2014, of the Yahoo News Digest, which delivers twice a day to your phone a definitive summary of a dozen or fewer need-to-know news stories. Each story corresponds to an automatic cluster of documents on a particular event in the news, and the summarization algorithm takes the cluster and assembles a short multi-document summary (or atom) of the content by selecting sentences within those documents. These textual summaries are integrated with other atoms that include maps, infographics, Wikipedia extracts, videos, photos and more. Instead of having the machine alone determine which of many stories are the ones you need to know, human editors help curate the content by selecting from a ranked list of stories. However, users who want even more stories are offered additional, uncurated content.
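The cluster-then-select step can be sketched as a greedy, redundancy-penalized sentence picker, an MMR-style heuristic rather than the Digest's actual algorithm: each round chooses the sentence most representative of the cluster's vocabulary, discounted by its overlap with sentences already chosen. All parameters here are illustrative:

```python
import re
from collections import Counter

def summarize_cluster(docs, max_sentences=3, lam=0.7):
    """Greedy MMR-style multi-document selection over a cluster of
    documents about one event. `lam` trades representativeness
    against redundancy with already-chosen sentences."""
    split = lambda d: [s for s in re.split(r"(?<=[.!?])\s+", d.strip()) if s]
    bag = lambda s: Counter(re.findall(r"[a-z']+", s.lower()))
    pool = [s for d in docs for s in split(d)]
    cluster = Counter(w for d in docs for w in re.findall(r"[a-z']+", d.lower()))
    chosen = []
    while pool and len(chosen) < max_sentences:
        def mmr(s):
            words = bag(s)
            rep = sum(cluster[w] for w in words) / (len(words) or 1)
            red = max((sum((words & bag(c)).values()) / (sum(words.values()) or 1)
                       for c in chosen), default=0.0)
            return lam * rep - (1 - lam) * red * rep
        best = max(pool, key=mmr)
        chosen.append(best)
        pool.remove(best)
    return chosen
```

The redundancy penalty is what keeps near-duplicate sentences (common when many outlets cover the same event) from crowding out secondary facts in the atom.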
Such a cool capability only touches the tip of the iceberg of information that deserves to be summarized! In addition to being able to provide summaries for the initial traffic scenario above, it would be great to factor in additional sources of information such as social media chatter and even multimedia information, especially information found in transcribed speech as well as buried in images, videos, etc. Further downstream may come summarization of movies, fiction, etc. The sheer diversity of such data and the challenges of working across media types can be daunting, but Yahoo Labs is well-positioned to address such problems with robust and rigorous science.
For more on the science of summarization, please see:
Inderjeet Mani. Automatic Summarization. John Benjamins (2001).
Ani Nenkova and Kathleen McKeown. Automatic Summarization. Foundations and Trends in Information Retrieval 5(2-3): 103-233 (2011).
Today the photograph has transformed again. Twenty years ago, photos were unprocessed rolls of C-41 sitting in a fridge; ten years ago, they were shared on the 1.5” screen of a point-and-shoot camera. Today the photograph is something different: photos automatically leave their capture (and formerly captive) devices for many sharing services. There are a lot of photos. A back-of-the-envelope estimate reported that 10% of all photos in the world were taken in the last 12 months, and that was calculated three years ago. And among these services, Flickr has been a great repository of images that are free to share via Creative Commons.
On Flickr, photos, their metadata, their social ecosystem, and the pixels themselves make for a vibrant environment for answering many research questions at scale. However, scientific efforts outside of industry have relied on various sized efforts of one-off datasets for research. At Flickr and at Yahoo Labs, we set out to provide something more substantial for researchers around the globe.
Today, we are announcing the Flickr Creative Commons dataset as part of Yahoo Webscope’s datasets for researchers. The dataset, we believe, is one of the largest public multimedia datasets that has ever been released—99.3 million images and 0.7 million videos, all from Flickr and all under Creative Commons licensing.
The dataset (about 12GB) consists of a jpeg url or video url and some corresponding metadata, such as the tags. Plus, about 49 million of the photos are geotagged! What’s not there, like comments, favorites, and social network data, can be queried from the Flickr API.
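Working with a metadata dump like this typically means streaming and filtering tab-separated records. The column layout below (id, url, tags, longitude, latitude) is hypothetical, chosen only to illustrate the pattern; the actual Webscope release documents its own schema:

```python
import csv

def geotagged_entries(lines):
    """Stream a tab-separated metadata dump and keep only geotagged
    entries. `lines` is any iterable of records; the column positions
    here are illustrative, not the real dataset schema."""
    for row in csv.reader(lines, delimiter="\t"):
        pid, url, tags, lon, lat = row[:5]
        if lon and lat:  # skip the non-geotagged majority
            yield pid, url, tags.split(","), (float(lat), float(lon))
```

Because the function takes any iterable of lines, the same code works on a local file handle or on records streamed from a Hadoop job.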
But of course, processing 100 million images takes a fair bit of processing power, time, and resources that not every research institute has. To help here, we’ve worked with the International Computer Science Institute (ICSI) at Berkeley and Lawrence Livermore National Laboratory to compute many open standardized computer vision and audio features, which we plan to host in a shared Amazon Instance, as it’s somewhere north of 50TB, for researchers around the world to use. It’s pretty intense, and they brought in a first-of-its-kind supercomputer, the Cray Catalyst, to make the calculations.
The dataset can host a variety of research studies and challenges. One of the first challenges we are issuing is the MediaEval Placing Task, where the task is to build a system capable of accurately predicting where in the world the photos and videos were taken without using the longitude and latitude coordinates. This is just the start. We plan to create new challenges through expansion packs that will widen the scope of the dataset with new tasks like object localization, concept detection, and social semantics.
Interested? Head over to the Yahoo Webscope site to request the dataset. If you have any questions, you can get those answered there as well.
It is in the spirit of collaboration and desire to discover answers to complex problems that we are excited to announce the recipients of the 2014 Yahoo Faculty Research and Engagement Program (FREP) award. This academic outreach initiative is designed to produce the highest quality scientific collaborations and outcomes by engaging with faculty and students conducting research in areas of mutual interest. The FREP awards hundreds of thousands of dollars in unrestricted gifts to support new, exciting Internet research studies and experiments between academics across the globe and their Yahoo research scientist counterparts.
Over the course of the next year and beyond, FREP award recipients and Yahoo Labs scientists will work closely to further research in their mutual areas of interest. Yahoo Labs Research Scientist Mihajlo Grbovic will be working with Stanford University Assistant Professor Jure Leskovec on the network-based detection of potentially compromised accounts in Yahoo Mail and Tumblr. Mihajlo and Jure further discuss their research and the FREP award here:
We were extremely impressed with all of the submissions and would like to thank each professor who applied. Congratulations to the following recipients of the Yahoo 2014 Faculty Research and Engagement Program:
At Yahoo Labs, we’re committed to forging strong alliances with top faculty by collaborating on cutting edge research to advance Web Science. These collaborations will solve shared problems with measurable outcomes such as joint papers, advances in algorithm design, systems research, digital media studies, and marketplace design. FREP supports all areas of Yahoo Labs research.
If you have questions about the Faculty Research and Engagement Program, please contact Kim Capps.