A Snapshot of WWW 2014

Last week, more than a dozen Yahoo scientists and technical experts attended the 23rd International World Wide Web Conference (WWW 2014), which Yahoo Labs was proud to sponsor. Yahoo Labs scientists presented their latest Web Science research, gave talks, spoke on panels, and collaborated with academic peers.

You can find all of our photos and videos from the week-long conference on our Flickr page. Here’s a short video blog from Communications Manager Mike Sefanov in Seoul, S. Korea, where WWW 2014 was hosted:

Bad Weather Equals Negative Restaurant Reviews

A new study from Georgia Tech and Yahoo Labs researchers shows that weather and demographics play a significant role in determining how positively or negatively someone will review a restaurant online. In short, bad weather equals negative reviews.

Saeideh Bakhshi (a Yahoo Labs intern and Georgia Tech PhD candidate), Partha Kanuparthy (a Yahoo Labs scientist who was a student at Georgia Tech while conducting the research), and Georgia Tech professor Eric Gilbert co-authored the study, which will be presented at the 23rd International World Wide Web Conference in Seoul, South Korea, on April 10.

The Georgia Tech team analyzed data from Foursquare, Citysearch, and TripAdvisor, including 840 thousand restaurants and their 1.1 million associated reviews from 2002 to 2011, spread across every U.S. state. They found that “while endogenous factors such as restaurant attributes (e.g., meal, price, service) affect recommendations, surprisingly, exogenous factors such as demographics (e.g., neighborhood diversity, education) and weather (e.g., temperature, rain, snow, season) also exert a significant effect on reviews.”

Restaurant reviews written while the temperature is between 70°F and 100°F tend to be the most positive. Reviews written in temperatures below 40°F and above 100°F tend to be more negative. Additionally, reviews are more negative when it is raining or snowing.

The authors’ research also finds that demographics surrounding a restaurant are perhaps as influential as the weather. In more racially diverse neighborhoods and/or in those with residents who have college-level education, the average restaurant rating is higher.

The study is the first to look at exogenous factors and how they relate to online restaurant reviews. The work carries possible implications for designing online recommendation sites and for restaurants in general.

You can read the full paper, “Demographics, Weather and Online Reviews: A Study of Restaurant Recommendations,” on the Yahoo Labs website.

Some Research on Tumblr (on Tumblr)


A just-published article from MIT Technology Review, “The Anatomy of a Forgotten Social Network,” discusses research on Tumblr by research scientists on our Search and Anti-Abuse team.

Find out what the average Tumblr post length is, how long the “average distance” between two users is, and many more fascinating insights.

Read the article here and get the full paper, “What is Tumblr: A Statistical Overview and Comparison,” here.

This kid is about to get so many Instagram likes. A new study led by researchers at the Georgia Institute of Technology and Yahoo Labs (disclosure: we are also owned by Yahoo. What’s up, Yahoo Labs?) found that Instagram photos with a face in them are 38 percent more likely to be liked and 32 percent more likely to spur conversation than images without a face. The study evaluated a random sample of 1 million Instagram photos to determine if human faces (and their age and gender) influenced how people responded on the social network. According to Saeideh Bakhshi, the Georgia Tech College of Computing Ph.D. who led the study, her team anticipated that photos of people, rather than things, would receive a more enthusiastic response.

Dr. Rafail Ostrovsky Offers Techniques for Boosting Cloud Security

We are fortunate to have had Dr. Rafail Ostrovsky, Professor of Computer Science and Mathematics at UCLA, present a talk on “Cloud Security: Threats, Challenges and Solutions” today at Yahoo. Professor Ostrovsky discussed the importance of cryptography and some of the many capabilities the discipline brings to cloud security. In his talk, Ostrovsky gave examples of how traditional encryption of data in the cloud can be breached. He then offered recent techniques for boosting security and enabling new functionalities.

The event was broadcast live on our labs.yahoo.com homepage and viewers had the opportunity to ask questions and comment on our Twitter stream @YahooLabs as well as our Facebook page.

You can view Dr. Ostrovsky’s full presentation here:

Welcome to Webscope

Are you an academic in need of data for your research? How about a LOT of data? If your answer is ‘yes’ (and ‘yes’), then you’ve come to the right place.

Yahoo is one of the largest Internet destinations on the planet. So it goes without saying that we have an immense amount of varied data. At Yahoo Labs, we constantly strive to advance the state of knowledge and understanding in web sciences. We very much believe that one of the best ways to progress is by being collaborative and open. That is why we are very pleased to now share our 50th dataset in our Webscope program.

The Yahoo Webscope Program is a reference library of interesting and scientifically useful datasets for non-commercial use by academics and other scientists. All datasets have been reviewed to conform to Yahoo’s data protection standards, including strict controls on privacy. We offer data in the following categories: Graph and Social Data, Ratings and Classification Data, Advertising and Market Data, Competition Data, Computing Systems Data, Image Data, and Language Data.

Our newest dataset is Yahoo Search Query Log To Entities. With this dataset you can train, test, and benchmark entity linking systems on the task of linking web search queries – within the context of a search session – to entities. Entities are a key enabling component for semantic search, as many information needs can be answered by returning a list of entities, their properties, and/or their relations. A first step in any such scenario is to determine which entities appear in a query – a process commonly referred to as named entity resolution, named entity disambiguation, or semantic linking. 

The Yahoo Search Query Log To Entities dataset allows researchers and other practitioners to evaluate their systems for linking web search engine queries to entities. The dataset contains manually identified links to entities in the form of Wikipedia articles and provides the means to train, test, and benchmark such systems using manually created, gold standard data. By releasing this dataset publicly, we aim to foster research into entity linking systems for web search queries.

To date, we have accommodated nearly 12,000 requests for datasets at over 1,300 universities in 94 countries. You can learn more about Webscope and request our datasets here. We hope you find this resource useful.

Yahoo Academic Relations Joins Yahoo Labs on Tumblr

After two years of posting Academic Relations news on our Yahoo On Campus Blog, we’re happy to share upcoming academic-related posts with Yahoo Labs on Tumblr. All of our previous content is still on the old blog, so if you’re feeling nostalgic, you can always visit. But from now on, we’ll bring you the latest on our campus connections here. 

So keep this page bookmarked — we’ll be bringing you lots of good stuff soon!

Our Exciting New Partnership with Carnegie Mellon University

By Ron Brachman, Chief Scientist and Head of Yahoo Labs

Last year was a wonderful year for Yahoo Labs. Among our most exciting accomplishments was our success in hiring more than 50 new PhDs to join our research teams around the world. But that was just a first step for us; we’re always looking for outstanding people to complement our team of cutting-edge computer scientists and research engineers, and work with us to push the boundaries of research and innovation at Yahoo. Today, we’re thrilled to announce a new five-year, $10 million partnership with Carnegie Mellon University (CMU) that aims to do just that. There are many things that are novel about this new collaboration, and we couldn’t be more excited.

All of us here at Yahoo Labs can’t wait for the opportunity to work with the exceptional faculty and students at Carnegie Mellon, which has established itself as a premier institution for machine learning and human-computer interaction research, and these are the main focal areas of our partnership. As part of this partnership, we’re creating a way for CMU researchers to work directly with Yahoo’s software and infrastructure. This should allow us to speed up the pace of personalization research, especially in a mobile setting, and ultimately create a better user experience for our hundreds of millions of users.

One highlight of the partnership is an industry-first mobile toolkit that will enable CMU researchers to easily experiment with Yahoo’s real-time data services, letting them test new ways that machine learning and interface technologies can improve personalized user experiences. We like to think of this as part of a grand-scale living laboratory where researchers can explore new approaches to understanding human behavior through machine learning and interface technologies. Members of the CMU community who opt in to test the experimental mobile software will provide researchers access to real user data and the opportunity to iterate rapidly on key technologies.

We are also unveiling a new Yahoo-sponsored fellowship program that will provide financial and research support to CMU computer science students and faculty members. Yahoo Fellows will have the opportunity to conduct research across a variety of advanced computer science disciplines with annual financial support from Yahoo and mentorship from world-class computer scientists at Yahoo Labs and CMU. Among other things, we’re looking forward to highly productive visits by the Yahoo Fellows to our Sunnyvale and New York research labs, as well as lots of interactions during Yahoo visits to Pittsburgh.

CMU President Dr. Subra Suresh is equally enthusiastic: “This partnership is a clear demonstration, in the tradition of CMU, of how scholarly scientific research combined with industry relevance and perspectives could advance technologies that have a global social impact.”

The partnership, called “InMind,” will be directed at CMU by Dr. Tom Mitchell, Fredkin University Professor of Computer Science and Machine Learning and head of the Machine Learning Department, and by Dr. Justine Cassell, the Charles M. Geschke Director of the Human-Computer Interaction Institute.

Finally, a more personal note: I am extremely happy to be working again with my long-time friend and colleague, Tom Mitchell. Tom and I are a couple of hard-core AI guys who go way back. We’ve had wonderful opportunities to work together in the past, including serving together in leadership positions in AAAI (Tom was President from 2001 to 2003, when I served as President-Elect, and my term as President started as he stepped down and became Past President). Tom was also a major leader and intellectual influence in our DARPA PAL (Perceptive Assistant that Learns) program, on both the highly successful CALO and RADAR projects. Tom has been a worldwide thought leader in machine learning for many years, and his leadership in creating the Machine Learning Department and running it since its inception has been spectacular. We’ve been brainstorming this new collaboration together for many months, and we are both incredibly excited to get it off the ground.

A Linguistic Analysis of the State of the Union Address

By Yuval Pinter 

Reading Mark Liberman’s analysis of Obama’s SOTU addresses versus other presidents’, my thirst remained unquenched. Word-counts are fun, sure, but the real fun comes in when looking at longer phrases – two (bigrams) or three (trigrams) words long.

After waiting for it to be breakfast time in Philadelphia, I engaged in an experiment (Legal has advised me against explicit use of MYL’s trademark phrase) to analyze the 228 addresses (found here) and see what Obama’s favorite (and least-favorite) phrases are.

Since I worked with raw data, I handled it a bit differently than previous analyses just for the sake of getting results fast. To begin with, I did not weed out the non-orally-delivered addresses or any other “special” cases. Next, I used an unsophisticated tokenization algorithm where all apostrophes break words into tokens (so “Congress’s” is split in two, as in Liberman’s analysis, but same goes for “i’m” and “he’s”). Lastly, I used a comparison algorithm which only takes into account Obama’s speeches and all addresses (1790-2014) as “background”: the KL measure, which purports to tell us how “informative” the phrase is in the Obama corpus relative to the background corpus.
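For readers who want to try something similar, here is a minimal sketch of the steps just described. It is not the script used for this post. The function names are hypothetical, and the scoring formula is an assumption: the post only names "the KL measure", so the sketch uses one common pointwise form, p * log(p / q), where p is a phrase's relative frequency in the Obama corpus and q is its relative frequency in the background. It also assumes the background texts include the foreground texts, as they do in the setup above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it so that an apostrophe starts a new token."""
    text = text.lower().replace("\u2019", "'")  # treat curly apostrophes like straight ones
    return re.findall(r"[a-z0-9]+|'[a-z]*", text)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kl_scores(foreground_texts, background_texts, n=2):
    """Score each foreground n-gram by p * log(p / q) (assumed form of the KL measure)."""
    fg = Counter(g for t in foreground_texts for g in ngrams(tokenize(t), n))
    bg = Counter(g for t in background_texts for g in ngrams(tokenize(t), n))
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    scores = {}
    for gram, count in fg.items():
        if bg[gram] == 0:
            # Cannot happen when the background contains the foreground; skipped defensively.
            continue
        p = count / fg_total
        q = bg[gram] / bg_total
        scores[gram] = p * math.log(p / q)
    return scores
```

Sorting the returned dictionary by score in descending order would put the unexpectedly frequent phrases first; the phrases with the most negative scores are the unexpectedly infrequent ones. Calling it with n=3 scores trigrams instead of bigrams.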

Let’s get to it: here are Obama’s most unexpectedly frequent bigrams:

We see many stylistic markers here, such as the contracted forms “’s”, “’re” and “’ll”, which will probably re-appear in any modern president’s lingo (with not much to support either the egocentric-Obama or collective-Obama hypotheses), but these expected bigrams greatly emphasize the magnitude of the more content-swayed ones: “our economy”, “middle class”, “health care” and the number one issue on Obama’s plate (at least according to Kullback and Leibler): “clean energy”.

Obama’s most unexpectedly infrequent bigrams: (for these, I still only took phrases which appeared somewhere in Obama’s addresses)

And the rest is just as boring. We’ve seen “the” is on the decline, and it drags down all its associated bigrams with it.

Moving on. Favorite trigrams: (“PAR” marks the beginning of a paragraph)

So the top three are explanation starters, but check out “democrats and republicans” creeping in to a bipartisan content-lead. And you may take what you will from number 25, beginning paragraphs with “of course”.

Least favorite trigrams:

A bit more interesting than the lost bigram table. “the american people” made it to the top, but “the people of” are on the bottom, suggesting nothing but a stylistic anomaly (or shift) in denoting what is probably the group which is most referred to in these addresses. How “the united states” and “states of america” got to opposite ends is beyond me, though. Much to look into, perhaps during some breakfast after next year’s SOTU.

You can also find this post on the Language Log.