Goalr! The Science of Predicting the World Cup on Tumblr

By Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, and Narayan Bhamidipati

With the 2014 FIFA World Cup kicking off on June 12, billions of fans across the world are turning their attention toward host country Brazil to root for their favorite teams. Soccer (or, if you prefer, football) fans are loud; you need only remember the last World Cup’s infamous vuvuzelas for a demonstration. But fans aren’t only loud in stadiums. They also make their voices heard across social media. And though you may assume these fans are just blowing their vuvuzelas into the social abyss, if you listen closely, you’ll discover a treasure trove of data — including possibly an answer to the most important question of all: Who will win?

As soccer fans and Yahoo Labs scientists with access to Tumblr data, we wanted to find out if we could take advantage of our unique insight to comb through an ocean of posts to predict a World Cup winner. And we have! But before we share our prediction on which nation will get to revel in World Cup glory, we’ll tell you how we figured it out.

Sifting through 188.9 million Tumblr blogs comprising 83.1 billion posts to find World Cup-related content wasn’t easy. To begin, we used two main parameters to determine which content was relevant: posts with hashtags referencing #WorldCup, #World Cup, #Copa do mundo (or other variants outlined in our technical report), and posts with hashtags referencing #soccer, #football, #futbol, etc.

However, using these parameters alone proved too broad. So once we isolated #WorldCup-related posts, we checked the bodies of the posts for mentions of country names, and then did the same for #soccer-related posts. For Team USA, we counted only mentions in #soccer posts to avoid confusion with American football. For Team Brazil, we discounted a percentage of posts because the country is hosting the event and thus receives extra mentions; we calculated this percentage from an editorial evaluation of a sample of posts.
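A minimal sketch of this filtering step, assuming each post is a dictionary with a list of `tags` and a text `body`. The tag sets, the country list, the function names, and the `brazil_discount` value are illustrative placeholders rather than the actual pipeline, and the Team USA rule reflects one reading of the description above.

```python
import re

# Illustrative tag sets; the full list of variants is in the technical report.
WORLD_CUP_TAGS = {"worldcup", "world cup", "copa do mundo"}
SOCCER_TAGS = {"soccer", "football", "futbol"}
COUNTRIES = ["Brazil", "Germany", "USA"]  # small illustrative subset


def normalize(tag):
    """Strip the leading '#' and lowercase a hashtag."""
    return tag.lstrip("#").strip().lower()


def count_country_mentions(posts, brazil_discount=0.0):
    """Count posts mentioning each country, per hashtag category."""
    counts = {"worldcup": {}, "soccer": {}}
    for post in posts:
        tags = {normalize(t) for t in post["tags"]}
        categories = []
        if tags & WORLD_CUP_TAGS:
            categories.append("worldcup")
        if tags & SOCCER_TAGS:
            categories.append("soccer")
        for country in COUNTRIES:
            pattern = r"\b" + re.escape(country) + r"\b"
            if not re.search(pattern, post["body"], re.IGNORECASE):
                continue
            for category in categories:
                # Assumed reading of the USA rule: in the soccer category,
                # count a USA mention only if the post carries #soccer itself.
                if country == "USA" and category == "soccer" and "soccer" not in tags:
                    continue
                counts[category][country] = counts[category].get(country, 0) + 1
    # Scale down Brazil's counts by a hypothetical host-country discount.
    for category in counts:
        if "Brazil" in counts[category]:
            counts[category]["Brazil"] *= 1.0 - brazil_discount
    return counts
```

Posts without any matching hashtag fall into no category and are ignored, which mirrors the idea of isolating relevant posts before looking at their bodies.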

image

To get even more representative results, we checked the bodies of posts in both hashtag categories for mentions of any national team player according to FIFA’s official list of players for each nation.
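The player-mention check can be sketched along the same lines. This example assumes a hypothetical `rosters` mapping from team to player names (a tiny subset here, not FIFA's full list) and summarizes each team by the mean and standard deviation of its per-player mention counts. Whether a population or sample standard deviation was used is not stated, so the population version below is an assumption.

```python
import re
from statistics import mean, pstdev

# Hypothetical roster subset, for illustration only.
ROSTERS = {
    "Brazil": ["Neymar", "Oscar", "Hulk"],
    "Argentina": ["Messi", "Aguero"],
}


def player_mention_stats(bodies, rosters):
    """Return {team: (mean, std)} of per-player mention counts."""
    stats = {}
    for team, players in rosters.items():
        per_player = []
        for player in players:
            pattern = r"\b" + re.escape(player) + r"\b"
            # Count the number of posts whose body mentions this player.
            mentions = sum(
                1 for body in bodies if re.search(pattern, body, re.IGNORECASE)
            )
            per_player.append(mentions)
        stats[team] = (mean(per_player), pstdev(per_player))
    return stats
```

A high mean suggests a team whose players are widely discussed, while a high standard deviation suggests attention concentrated on one or two stars.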

image

Upon completion of our filtering, we were left with 27.3 million relevant posts from February through May. The fun (read: science-y) part came next.

To figure out how the teams would stack up against one another, we needed to assign a strength value to each team. These values were calculated for each matchup and provided a representative game score. More specifically, when two teams are positioned to play against each other, we estimated the number of goals scored by each team using a Poisson distribution with four differently weighted parameters, with the weights learned by maximum likelihood estimation on prior games (qualifications, friendlies, etc.). The four parameters were: 1. Team mentions in #WorldCup-related posts, 2. Team mentions in #soccer-related posts, 3. The average number of player mentions per team, and 4. The standard deviation of player mentions per team.
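To make the idea concrete, here is a rough sketch of a Poisson goal model fitted by maximum likelihood. The exact model form is not given above, so several things here are assumptions: the log-linear link (expected goals = exp(bias + weights · features)), the extra bias term, the use of plain gradient ascent, the simplification that a team's rate depends only on its own features, and all function names.

```python
import math


def fit_poisson(features, goals, lr=0.05, steps=5000):
    """Fit weights and bias by gradient ascent on the Poisson log-likelihood."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(goals)
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, goals):
            rate = math.exp(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = y - rate  # derivative of the log-likelihood w.r.t. log(rate)
            grad_b += err
            for k, xi in enumerate(x):
                grad_w[k] += err * xi
        bias += lr * grad_b / n
        weights = [w + lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def expected_goals(x, weights, bias):
    """Expected number of goals for a team with feature vector x."""
    return math.exp(bias + sum(w * xi for w, xi in zip(weights, x)))


def poisson_pmf(k, rate):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def win_probability(rate_a, rate_b, max_goals=10):
    """Probability that team A outscores team B, with independent goal counts."""
    return sum(
        poisson_pmf(a, rate_a) * poisson_pmf(b, rate_b)
        for a in range(max_goals + 1)
        for b in range(a)
    )
```

In practice each feature vector would hold the four quantities listed above, scaled to comparable ranges, and `goals` would hold the goals scored in prior games. Truncating the sum at `max_goals` is a convenience that loses very little probability mass for typical soccer scores.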

image

Finally, we were left with a statistical model predicting the outcome of each successive matchup based on our calculations. Taking into account the 27.3 million relevant posts, we had a complete bracket and a winner: Team Brazil.
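Filling in a bracket from matchup predictions can be sketched as repeatedly pairing adjacent teams and advancing the predicted winner. The sketch below covers only knockout-style rounds and assumes a hypothetical `predict_winner` callback that wraps whatever matchup model is in use. The team labels in the usage note are placeholders, not predictions.

```python
def play_bracket(teams, predict_winner):
    """Advance predicted winners round by round; return the list of rounds."""
    if not teams or len(teams) & (len(teams) - 1) != 0:
        raise ValueError("number of teams must be a power of two")
    current = list(teams)
    rounds = [current]
    while len(current) > 1:
        # Adjacent teams in the list face each other in this round.
        current = [
            predict_winner(current[i], current[i + 1])
            for i in range(0, len(current), 2)
        ]
        rounds.append(current)
    return rounds
```

For example, with made-up strengths `{"A": 1.2, "B": 0.8, "C": 1.5, "D": 1.1}` and a callback that picks the stronger team, `play_bracket(["A", "B", "C", "D"], pick)` returns the rounds `[["A", "B", "C", "D"], ["A", "C"], ["C"]]`.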

image

Do you agree with our prediction? Think some other team will win? Make sure to check back to see how well the World Cup social frenzy on Tumblr predicted the outcome.