Yahoo Donates 200 Servers to Georgia Tech


By Seth Tropper

The Georgia Tech Yellow Jackets have something new to buzz about this week. On Monday, Yahoo gifted Georgia Tech 200 servers! These servers, donated through our Y-STAR program (Yahoo Servers To Academic Researchers), will significantly contribute to Georgia Tech’s education and research activities by enabling a large-scale, hands-on education experience for undergraduate and graduate students, and opening up new possibilities for data-intensive research.

Supporting the academic community is one of the top priorities at Yahoo Labs, and the vision of the Labs’ Academic Relations group is to help Yahoo be one of the first companies that academics think of for new research or business collaborations, employment opportunities, inspiring and entertaining online products, open publications, tools, and datasets.

This donation represents yet another milestone in the collaboration between Yahoo Labs and Georgia Tech. Yahoo continues to maximize the impact of its partnership with the scientific community by building deep relationships with Georgia Tech and other select research institutions, and fostering collaborative research through multiple dynamic mechanisms – including unique world-class datasets, diversity and event support, bilateral visits and guest talks, project grants and merit-based awards.

These servers will support and enable increased research and education efforts at Georgia Tech, in the College of Computing, College of Engineering, and several other schools. Georgia Tech has several courses on parallel computing, high-performance computing, data and visual analytics, and these servers donated by Yahoo will significantly broaden the scope of these courses, and attract more students from different backgrounds to take them. Many students have already expressed strong interest in using the servers for research and development, and for high-volume data analysis in machine learning and graph mining.

Georgia Tech has already targeted some potential applications for these servers, which may include:

·      parallel, distributed computing problems, projects and platforms (e.g., Hadoop, Spark, HBase, etc.)

·      data-intensive research on sustainability, bioinformatics, healthcare analytics, simulation and modeling, and numerical computing

·      power modeling of energy-efficient, large-scale computing

·      hybrid computation approaches that bridge the fields of high-performance computing and data-intensive computing

The Y-STAR Program has been made possible by a partnership between two Yahoo organizations: Labs Academic Relations and Technology Operations. Rather than dispose of older Yahoo servers, Y-STAR was established to refurbish and donate them to university researchers as enablers of cutting-edge academic research. This initiative began just three years ago, and including this donation of 200 servers to Georgia Tech, Y-STAR has donated more than 3,400 servers globally (~2,900 in the US) valued at just under $1.4M. This program has been very well received by faculty and students; has enhanced collaboration between top academic researchers and Yahoo scientists and engineers; and has greatly enhanced Yahoo’s presence and visibility on campus.

The Ins and Outs of the Yahoo Flickr Creative Commons 100 Million Dataset

One Million Creative Commons Geo-tagged Photos: a 1 million photo sample of the 48 million geotagged photos from the dataset, plotted around the globe. (Creative Commons license; photo by aymanshamma on Flickr.)

By Bart Thomée and David A. Shamma

This past summer we released the largest and most ambitious collection of Flickr photos and videos ever, containing an incredible 99.2 million photos and 0.8 million videos. If you missed our earlier announcement, you can read all about it here. We’re super-excited about the dataset, because it is a reflection of how Flickr has evolved from its inception to today. Since its release we’ve received a lot of emails and tweets asking for more details about the dataset, so to satisfy your curiosity we’ve just published a new blog post in which we describe all of the ins and outs: we look at the photos, videos, tags, cameras, users, locations and licenses included in the dataset. Please head to our post on the Flickr Code blog to dive into the details!

Science Powering Product and Personalization: Going Beyond Clicks

By Liangjie Hong and Suju Rajan

At Yahoo, we invest a lot of effort in generating, retrieving, and presenting content that is engaging to our users. Our hypothesis is that the more users truly engage with the content on a site, the more they will return to it. Now, what is user engagement on media articles, for example, and how can one measure it? More importantly, how do we design learning algorithms that optimize for such user engagement?

When consuming frequently updated online content, such as news feeds, users rarely provide explicit ratings or direct feedback. Further, clicks and related metrics such as item-level click-through rate (CTR), used as implicit signals of user interest, do not always effectively capture post-click user engagement. For example, users may have clicked on an item by mistake or because of link bait without truly engaging with the content being presented (e.g., a user may immediately go back to the previous page or may not scroll down the page). It is debatable whether leveraging the noisy click-based user engagement signal for recommendation can achieve the best long-term user experience. Thus, it becomes critical to identify signals and metrics that truly capture user satisfaction and to optimize for these accordingly.

In a just-published paper at the ACM Conference Series on Recommender Systems (RecSys 2014), we present an important yet simple metric to measure user engagement: the amount of time that users spend on content items, or dwell time. Our results indicate that it is a proxy for user satisfaction with recommended content, and that it complements or even replaces click-based signals. However, utilizing dwell time in a personalized recommender system introduces a number of challenges.

In order to understand the nature of dwell time, we analyzed user interaction data from traffic on the stream on the Yahoo home page. Our findings appear to match intuition. Here are a few:

  • Users have less dwell time per article on mobile or tablet devices than on desktops.

  • Users spend less time on slideshows than on articles.

  • Users dwell more on longer articles, up to 1,000 words. Beyond that limit, there is very little correlation to article length.

  • Users dwell the most on articles in the topics of politics or science and the least on articles on food or entertainment.

Now, different users exhibit different content consumption behaviors even for the same piece of content on the same device. In order to extract comparable user-engagement signals that are agnostic to the context of both the user and the item, we introduce the concept of normalized user dwell time. In essence, we normalize out the variance in dwell time that is due to differences in context. For more details on how we measure dwell time and on its normalization, please refer to our paper.

The relationship between average dwell time and article length. The x-axis is the binned article length and the y-axis is the binned average dwell time.
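To make the idea of normalizing dwell time by context more concrete, here is a minimal sketch. It assumes that a context is a (device, content type) pair and that normalization means taking a z-score of the log dwell time within each context. Both choices are assumptions made for illustration, as is the function name `normalize_dwell_times`; the exact procedure is described in the paper and may differ.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def normalize_dwell_times(events):
    """Return one z-score per event, computed within the event's context.

    Each event is a (context, dwell_seconds) pair with dwell_seconds > 0.
    Taking the log first is an assumption made here, because dwell times
    tend to be heavy-tailed.
    """
    logs_by_context = defaultdict(list)
    for context, seconds in events:
        logs_by_context[context].append(math.log(seconds))

    # Mean and (population) standard deviation of log dwell time per context.
    stats = {
        context: (mean(logs), pstdev(logs))
        for context, logs in logs_by_context.items()
    }

    normalized = []
    for context, seconds in events:
        mu, sigma = stats[context]
        # A context with no spread carries no signal, so map it to zero.
        z = 0.0 if sigma == 0 else (math.log(seconds) - mu) / sigma
        normalized.append(z)
    return normalized


# Hypothetical data: raw dwell times are longer on desktop than on mobile.
events = [
    (("desktop", "article"), 120.0),
    (("desktop", "article"), 30.0),
    (("mobile", "article"), 60.0),
    (("mobile", "article"), 15.0),
]
scores = normalize_dwell_times(events)
```

In this toy data, 120 seconds on desktop and 60 seconds on mobile receive the same normalized score, because each is equally far above what is typical for its own context.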

More interestingly, when we used the normalized dwell time instead of clicks as the optimization target in our recommendation algorithms, we managed to improve performance. This performance gain was observed in both the learning-to-rank and the matrix factorization formulations. From online tests, we also found that optimizing for dwell time not only achieves better user engagement metrics but also improves CTR. One plausible reason is that when optimizing towards dwell-based signals rather than high CTR, users may like the recommended content better, come back to the site, and click some more.

This research from the Personalization Science team at Yahoo Labs shows that content recommendation models perform better when using dwell-based metrics. Whether these metrics in turn drive long-term user engagement, such as the number of days a user returns to our site, is another interesting research question that we have studied as well. So, stay tuned for our follow-on post on whether optimizing for dwell time truly drives long-term user engagement.

Work done by Xing Yi, Liangjie Hong, Erheng Zhong, Nathan Liu, and Suju Rajan

2014 Yahoo ACE Award Recipients Selected

At Yahoo Labs, we highly value collaboration and an open research environment. For that reason, our Academic Relations team creates deep relationships with leading universities and professors to nurture strong scientific partnerships. It is our experience that these meaningful joint efforts lead to pioneering innovations that improve the Internet generally, and the Yahoo experience more specifically, in both evolutionary and revolutionary ways.

With the goal of furthering our academic collaborations and their productive outcomes, it is our pleasure to announce our 2014 Yahoo ACE (Academic Career Enhancement) Award recipients for the 2014-2015 academic year. These are five top young professors at leading research universities around the world who are competitively selected among many promising first- and second-year faculty members conducting Yahoo-relevant academic research. The award includes an unrestricted monetary gift which may be used in any way the recipients see fit to help get their academic careers off to a great start. Previously, funds have been used toward purchasing research-related hardware and software, as well as hiring students to work on research projects.

This year’s ACE recipients include:

Professor Daniel Hsu, Columbia University

Daniel Hsu is an Assistant Professor in the Department of Computer Science and an affiliated member of the Institute for Data Sciences and Engineering, both at Columbia University. Previously, he was a postdoc at Microsoft Research New England, and the Departments of Statistics at Rutgers University and the University of Pennsylvania. He holds a PhD in Computer Science from UC San Diego, and a B.S. in Computer Science and Engineering from UC Berkeley.

His research interests are in algorithmic statistics, machine learning, and privacy.

Professor Hongning Wang, University of Virginia

Hongning Wang is an Assistant Professor in the Department of Computer Science at the University of Virginia. His research interests include data mining, information retrieval, and machine learning, with a particular focus on computational user modeling and knowledge discovery. He has published over 20 research papers on these topics in top data mining and information retrieval venues, including KDD, WWW, SIGIR and WSDM. He is the recipient of the 2012 Google PhD Fellowship in Search and Information Retrieval, and 2012 Yahoo Key Scientific Challenges Award in Web Information Management. He has served on program committees for several major conferences such as ICML, ECML/PKDD, and ECIR, and reviewed for multiple journals, including IEEE TKDE, ACM TOIS, Neurocomputing and BMC Bioinformatics.

Professor Jia Deng, University of Michigan

Jia Deng is an Assistant Professor of Computer Science and Engineering at the University of Michigan. His research in computer vision focuses on image and video understanding through big visual data, human computation, and large-scale machine learning. He has built datasets and tools used by over 1,000 researchers around the world. His work has won the ICCV Marr Prize and the ECCV Best Paper Award, and has been featured in popular press such as the New York Times and MIT Technology Review. He received his PhD from Princeton University and his B.Eng. from Tsinghua University, both in computer science. He has been co-organizing the ImageNet Large Scale Visual Recognition Challenges (ILSVRC) since 2010. He was also the lead organizer of the BigVision workshops at NIPS 2012 and CVPR 2014.

Professor Jinho Choi, Emory University

Jinho Choi is an Assistant Professor at the Department of Mathematics and Computer Science and an Assistant Professor at the Institute of Quantitative Theory and Methods at Emory University. Jinho’s research focuses on the optimization of natural language processing for “robustness” on various data and “scalability” on large data. The goal is to develop NLP components that are readily available for higher-end research. All the NLP components (e.g., dependency parser, semantic role labeler) are developed in ClearNLP, an open source project that has been widely used for academic and industrial research.

Another part of Jinho’s research focuses on NLP applications such as question answering, information extraction, and dialog management. These applications are often domain specific; the goal is to develop applications that work well enough to be practical for certain domains (e.g., FAQ for a company, entities in social media, topics in news), and to keep expanding these domains as needed. Constructing meaning representation from texts is a big part of this research.

Professor Theophilus Benson, Duke University

Theophilus Benson is an Assistant Professor in the Computer Science Department of Duke University. His research interests include solving practical networking and systems problems, with a focus on Software Defined Networking, data centers, clouds, and configuration management. In the past, Theophilus has conducted large scale measurement studies of data centers and enterprise networks; and he has developed several networked and distributed systems — one of which was purchased in 2012. To date, his study on data center traffic characteristics has been used by over 15 groups to evaluate their designs and architectures.

Ben Shahshahani Returns to Yahoo Labs as VP of Advertising Sciences

By Ron Brachman

I am pleased to announce the return of Ben Shahshahani to Yahoo Labs as our new Vice President of Advertising Sciences in the United States.

Yahoo Labs is home to the company’s most forward-looking thinkers, providing deep technical expertise on scientific and technical topics of critical importance to Yahoo’s future. Advertising sciences is a crucial area for Yahoo, and Ben will lead our efforts to understand fundamental principles and create innovative technology essential to connecting advertisers to the right audiences at the right time. With Ben’s guidance, the team will focus on many key advertising-related scientific subjects, including, for example, efficiency, relevancy, engagement, ad effectiveness, marketplaces, and increasing advertiser ROI.

An accomplished engineer and researcher, Ben has years of experience that will serve him well in his new role. Most recently he served as an Engineering Director in Google’s Display Advertising team. There he managed the Display Campaign Optimization team.

Until mid-2012, Ben held the position of Vice President of Search and Media Sciences within Yahoo Labs. Among his many other responsibilities, Ben was at the helm of the development of algorithms that powered Yahoo’s search and media products including user modeling/profiling, data mining and recommendation systems, query and content processing, relevance ranking for vertical search, search assist, and page layout optimization. Ben was a long-time Yahoo, starting back in 2006.

Before Yahoo, Ben was a Research Scientist and Director of Natural Language Processing at Nuance, and part of the Speech Processing Group at IBM. Ben holds a PhD in Electrical Engineering from Purdue University and has over a dozen issued and filed patents related to online advertising, search, natural language, and speech processing.

On a personal note, I am thrilled Ben is coming home to Yahoo. Those who worked with Ben share my enthusiasm, and our many new faces ready to greet him will benefit from his leadership and expertise. We have had an incredibly exciting year at Yahoo Labs, and I couldn’t be happier to continue that momentum with Ben’s arrival.

Tackling Natural Language Generation… at Scale


By Amanda Stent

If you have used a smartphone personal assistant then you would probably agree a computer has talked to you in a “natural language” like English or Spanish.  However, it may surprise you to learn that the same is true if you have checked your email, used a shopping website, checked the weather online, tweeted with a company, or looked up directions on the Web.  In fact, the Internet today is full of a mishmash of human- and computer-generated language.

How do computers generate language? Modern natural language generation (NLG) systems operate over raw numerical data, structured databases, or text input. They generate language for a great variety of useful applications, including weather forecasting, financial and healthcare report generation [1, 2, 3], and review summarization [4]. They produce output using one of three basic methods. The first, and by far most widely used, is template-based generation: a human writes natural language text with gaps, and the computer fills the gaps in from dictionaries. If you’ve received a form letter from a company, that was template-based generation. The second type of natural language generation is grammar-based: a human writes a set of rules covering the structure of a natural language, and the computer processes the rules to produce natural language. Example grammar-based NLG systems are the open source SimpleNLG and OpenCCG systems. The third approach to natural language generation is statistical: the computer “reads” a lot of text (e.g., from the Web) and learns the patterns with which people write or speak. Then it can produce those patterns. A variation on statistical natural language generation that allows for more control uses a simple set of rules specifying the structure of the language to produce many possible outputs, and then a statistical model of text to rank those outputs so the most “human-like” one can be selected.
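As a toy illustration of two of these methods, the sketch below fills a template from a dictionary, and then ranks several candidate outputs with a very crude statistical model (the fraction of each candidate's word pairs that were seen in a small corpus). The function names, the templates, and the tiny corpus are all made up for this example; a real system would use a far larger corpus and a properly smoothed language model.

```python
from collections import Counter


def fill_template(template, slots):
    """Template-based generation: fill the named gaps from a dictionary."""
    return template.format(**slots)


def train_bigram_counts(corpus):
    """Count adjacent word pairs in a list of example sentences."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def rank_candidates(candidates, bigram_counts):
    """Order candidates by the fraction of their word pairs seen in the corpus."""

    def score(sentence):
        words = sentence.lower().split()
        pairs = list(zip(words, words[1:]))
        if not pairs:
            return 0.0
        seen = sum(1 for pair in pairs if bigram_counts[pair] > 0)
        return seen / len(pairs)

    return sorted(candidates, key=score, reverse=True)


# Step 1: produce several candidate outputs from hypothetical templates.
slots = {"city": "Atlanta", "high": 31}
templates = [
    "The high in {city} today is {high} degrees.",
    "{city} today high {high} degrees is.",
]
candidates = [fill_template(t, slots) for t in templates]

# Step 2: rank the candidates against text "read" by the computer.
corpus = [
    "the high in boston today is 20 degrees.",
    "the low in denver today is 5 degrees.",
]
best = rank_candidates(candidates, train_bigram_counts(corpus))[0]
```

Here the first template wins, because more of its word pairs resemble the sentences in the corpus.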

Now let’s imagine that you wanted to make a system that talked with human users using natural language.  For example, you might want to make a mobile app that recommended restaurants, that helped users change their bad habits, that compared the stats of football players for a fantasy football league, or that played a character in a mobile game.  What would you want the NLG system to do in each case?  At a minimum, you would probably want the system to produce correct, grammatical and natural prompts and responses, in an efficient manner; that is, you would want the system’s output to capture the content of the input accurately, to be easily understandable by a human, and to appear in a reasonable amount of time. These are standard NLG evaluation metrics.

One could argue that there is more than enough natural language on the Web to give any computer a correct, grammatical and natural output for almost any input, i.e. if we can learn the mappings from language inputs to knowledge representations, we never have to build an NLG system again.  However, if you wanted to use the NLG system in an interactive context, such as encouraging users while they exercise or playing a character in a game, you would probably also want several other, less obvious, things from your NLG system.  For example, you might want the system to exhibit controlled variation.  Specifically, you might want the system to adapt its output to the context and to the user (e.g. not keep saying ‘Peyton Manning, the quarterback’ when ‘Manning’ would work for a football fan, or not say ‘Rob’s Bistro, 234 Main Street, Madison’ when the user is right across the street and it could just say ‘Rob’s Bistro, in front of you’).  In addition, if the system is representing a company or a character in a game, you might want it to exhibit personality; a villain interacts differently than a hero, and different companies have different corporate personalities. And finally, if the system is very interactive, you would want it to have good ways to manage the interaction – for example, good ways to handle errors and ambiguities. Several of these new metrics arise directly from the interactive nature of the application – essentially, you want users to be sufficiently engaged with the system that they continue the interaction. The problem is that these additional desiderata are easy for humans to understand but hard to quantify and model in a computer program, and especially so in the absence of user feedback.  We need methods for NLG that allow us to model the complexities of interaction as well as take advantage of the many sources of language data on the Web.

What is the big goal for NLG systems for interaction? What would allow us to say this AI problem had been ‘solved’? And what about the science of NLG - how can we use NLG systems to further understand human intelligence?  The famous Turing test is a test of an interactive NLG system, but in some ways the test is oddly limited – the system and human are not co-present and can interact only through text, so the interaction does not take into account physical context or the user’s history; the task is a sort of trivia quiz, so the user may not care deeply about success; and there is no social or emotional engagement element, so only a small aspect of human intelligence is examined.  At the same time, the famous experiments with the Eliza chatbot showed how easily humans can be fooled about human intelligence. What if we proposed new tests, e.g. a computer system that could convince a user to buy a product, or a virtual standup comedian?  Both of these applications involve task-related intelligence, conversational intelligence and social intelligence.  Or how about an interactive system that could be so helpful and engaging that a user would choose it over a human personal assistant?

At Yahoo, we are all about creating fun and personalized interactions to support users’ daily habits, and consequently we care deeply about issues of adaptation and engagement.  Our applications run the gamut from asynchronous interaction (e.g. Yahoo Answers, Yahoo Groups) to situated interaction (e.g. Yahoo mobile search, Aviate).  Furthermore, at Yahoo Labs we have the ability to run experiments at scale, allowing us to automatically identify the subtle features of language use that correspond, for example, to ‘helpful’ adaptation, to ‘informative’ answers or to a ‘fun’ personality.  If you are a graduate student or faculty researcher interested in questions around NLG for interaction, we invite you to contact us – we would love to collaborate. Help us design interactive systems for the future that are engaging (e.g. fun, dramatic, beautiful) as well as useful.


Di Fabbrizio, G., Stent, A., & Gaizauskas, R. (2013). Summarizing opinion-related information for mobile devices. In Neustein, A. & Markowitz, J. (eds.), Mobile Speech and Advanced Natural Language Solutions. Springer.

Dr. Ben Shneiderman Engages With Data Visualization In Big Thinkers Talk


Last week we were honored to have had Dr. Ben Shneiderman, Professor of Computer Science and Founding Director of the Human-Computer Interaction Laboratory at the University of Maryland, present a Big Thinkers talk at Yahoo entitled, “Information Visualization for Knowledge Discovery: Big Insights from Big Data.” During his presentation, Dr. Shneiderman focused on the importance of visualization tools in answering Big Data questions and solving Big Data problems. Shneiderman enthusiastically stated that “visualization is a way of engaging people,” and that “visualizations give you answers to questions you didn’t know you had.”

Professor Shneiderman also covered his “8 Golden Rules of Data Science”:

  • Choose actionable problems and compelling theories
  • Open your mind: domain experts and statisticians
  • If you don’t have questions, you’re not ready
  • Clean, clean, clean… your data (gently on the screen)
  • Know thy data: ranges, patterns, clusters, gaps, outliers, missing values, uncertainty
  • Evaluate your efficacy, refine your theory
  • Take responsibility, reveal your failures
  • Work is complex, proceed with humility

The event was broadcast live on our homepage and viewers had the opportunity to ask questions and comment on our Twitter stream @YahooLabs as well as our Facebook page.

You can view Dr. Shneiderman’s full presentation here:

Machine Learning for (Smart) Dummies

By Aryeh Kontorovich

So how do you tell a cat from a dog? It’s something a three-year-old does with near-perfect accuracy, but not something we can formalize in simple rules or easily write code for.

When searching for “cats that look like dogs,” here’s what comes up:


Traditional Artificial Intelligence (AI) attempts to produce clean, interpretable decision rules. Modern machine learning takes this a step further: Rather than trying to feed computers man-made rules, we hope computers will discover their own rules based on examples so that many tasks requiring human input will become fully automated. In other words, we need machines to learn.

Yahoo scientists and engineers are faced with solving numerous learning-related problems. As a visiting scientist from Ben Gurion University who has, by now, spent some time working alongside Yahoos, I have come to respect the amount of hands-on experience my colleagues have with standard machine learning algorithms: SVM, boosting, nearest neighbors, decision trees…. Of course, many of these algorithms are simple and intuitive (such as, “Given a test point, predict the label of the closest training point”). But, their mathematical underpinning is not always well understood.
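The nearest-neighbor rule quoted above is simple enough to write down in a few lines. The sketch below is only an illustration: the two features (weight and ear length) and the data points are hypothetical, and Euclidean distance is an assumed choice of what "closest" means.

```python
import math


def nearest_neighbor_predict(train_points, train_labels, test_point):
    """Return the label of the training point closest to test_point."""
    best_index = min(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], test_point),
    )
    return train_labels[best_index]


# Hypothetical features: (weight in kg, ear length in cm).
train_points = [(4.0, 6.0), (5.0, 7.0), (20.0, 10.0), (30.0, 12.0)]
train_labels = ["cat", "cat", "dog", "dog"]

prediction = nearest_neighbor_predict(train_points, train_labels, (6.0, 6.5))
```

The code is intuitive, but it says nothing about why or when this rule generalizes well to new examples, which is exactly the kind of question the theory addresses.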

Using my background in theoretical machine learning research, I recently taught a seven-week course at Yahoo with the aim of providing the theoretical foundation on which the aforementioned algorithms are based. Why does a large margin guarantee good generalization? How does one avoid overfitting? What are the “no free lunch” results in learning? What is the best learning rate one could hope for? Using rigorous mathematical tools, the course provides answers to these questions.

I am firmly of the conviction that “there is nothing so practical as a good theory.” My hope is that deep insight into common learning algorithms will give practitioners a better sense of which ones are more applicable in any given situation, and perhaps even guide other scientists, engineers, etc. in designing novel approaches.

One of the benefits of the open academic collaborations that Yahoo Labs encourages, including mine, is the knowledge transfer each party brings to the table. It is in the same spirit of collaboration and open discourse that we are offering all of the seven classes below for your professional and/or personal enrichment. I hope you find them useful.

Week 1:

Week 2:

Week 3:

Week 4:

Week 5:

Week 6:

Week 7:

Are you ready for some Tumblr data-driven football?

By Nemanja Djuric, Vladan Radosavljevic, and Mihajlo Grbovic

The summer of soccer is behind us, and sports fans across the U.S. can finally turn their attention to real football (that is, American football). After more than seven months of silence, the NFL stage is set for a new season of blood, sweat, and data. Yes, data.

Everybody hopes to see his or her favorite team clinch the coveted Vince Lombardi Trophy, but data-driven predictions are another matter. And predictions are all the more fun when you add social media to the mix.

Following the success of our World Cup predictor where we correctly forecasted three out of four semifinalists using specific Tumblr chatter, Yahoo Labs is once again using the power of data science to bring you an answer to the only question that really matters this season: Who will win? Our statistical analysis includes Tumblr posts from May through August, which we used to create a machine learning predictor based on the popularity of each team and its players according to Tumblr’s 200+ million blogs.


The first step in creating our predictor was to isolate NFL-related Tumblr posts using NFL-related hashtags, including #nfl, #american_football, #offseason, and #football, found through state-of-the-art tag-clustering technology. Then, we counted the number of team mentions in those posts using only their short names (e.g., Eagles or 49ers) as a measure of popularity of the given team on the social network. In addition, we searched all the Tumblr content for full team names (e.g., Philadelphia Eagles or San Francisco 49ers). The popularity of the teams computed in this way is represented by the following two graphs:
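The counting step for short team names can be sketched roughly as follows. This is not our production pipeline: the post format (a dictionary with "tags" and "text"), the function name `team_popularity`, and the sample posts are assumptions made for the example, and the hashtag set is simply the four tags named above.

```python
import re
from collections import Counter

# The four hashtags named in the post, without the leading "#".
NFL_TAGS = {"nfl", "american_football", "offseason", "football"}


def team_popularity(posts, team_names):
    """Count mentions of each team name in posts carrying an NFL-related tag."""
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in team_names
    }
    counts = Counter()
    for post in posts:
        tags = {tag.lower().lstrip("#") for tag in post["tags"]}
        if not tags & NFL_TAGS:
            continue  # skip posts without any NFL-related hashtag
        for name, pattern in patterns.items():
            counts[name] += len(pattern.findall(post["text"]))
    return counts


# Hypothetical posts for illustration only.
posts = [
    {"tags": ["#nfl"], "text": "Eagles look sharp this preseason"},
    {"tags": ["football", "#offseason"], "text": "49ers and Eagles trade rumors"},
    {"tags": ["#cooking"], "text": "Eagles-themed cupcakes"},
]
counts = team_popularity(posts, ["Eagles", "49ers"])
```

The third post mentions a team name but has no NFL-related tag, so it does not contribute to the counts.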

Further, we took the players from each team and computed each player’s individual popularity on Tumblr. Finally, we combined the aforementioned calculations with NFL game outcomes from 2013 and trained two statistical models that separately predicted the number of touchdowns and field goals each team would score against its opponent, factoring in whether a team plays at home or away. For more details about the mathematics behind our approach, please see “Goalr! The Science of Predicting the World Cup on Tumblr” and our associated technical paper.
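One simple way to turn predicted touchdowns and field goals into a predicted winner is sketched below. It is an illustration under stated assumptions rather than a description of our models: the predicted counts are taken as given, each touchdown is assumed to be worth 7 points (6 plus a made extra point), each field goal is worth 3, and the team names and numbers are hypothetical.

```python
TOUCHDOWN_POINTS = 7  # assumption: touchdown (6) plus a made extra point (1)
FIELD_GOAL_POINTS = 3


def expected_points(touchdowns, field_goals):
    """Convert predicted touchdowns and field goals into points."""
    return TOUCHDOWN_POINTS * touchdowns + FIELD_GOAL_POINTS * field_goals


def predict_winner(home, away):
    """Pick the team with more expected points, or None for a tie.

    Each argument is (team_name, predicted_touchdowns, predicted_field_goals).
    Any home or away effect is assumed to be already reflected in the
    predicted counts.
    """
    home_name, home_td, home_fg = home
    away_name, away_td, away_fg = away
    home_points = expected_points(home_td, home_fg)
    away_points = expected_points(away_td, away_fg)
    if home_points == away_points:
        return None
    return home_name if home_points > away_points else away_name


# Hypothetical model outputs for a single game.
winner = predict_winner(("Team A", 2.8, 1.6), ("Team B", 2.1, 2.0))
```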

When we put this plethora of data together, we were able to calculate the winner of every game in the 2014 season, as well as the overall Super Bowl champion. And, in answer to the initial question, we determined the Tumblr community believes the Denver Broncos will reign victorious. Don’t agree? Then make your voice heard on Tumblr and you could change the outcome. Let the games begin!


Week 1 schedule and predicted results: