Who’s Better and Who’s the Best?

by Keren Gu

In 2009, Netflix gave out $1,000,000 to the group BellKor’s Pragmatic Chaos, for designing the best algorithm that predicts an individual’s movie preferences in the Netflix Challenge.

Starting in 2006, Netflix initiated a contest in search of an algorithm that would break the record – the record of a 10% improvement over Netflix’s current algorithm that matches people to the movies they love. After three years of intensive competition, thousands of algorithm submissions, and hundreds of thousands of dollars spent on the annual “best-so-far” algorithm, the 10% minimum line was finally broken. With an improvement of 10.6% over Netflix’s algorithm, BellKor’s Pragmatic Chaos took home 1 million dollars.

The goal of the Netflix Challenge was essentially to answer the question: is a given person more of an American Pie person, a Pride and Prejudice person, or something other? (Given substantial background input, of course.) Thus Netflix seeks the ideal algorithm for ranking movies to each user’s preferences.

People have been asking similar ranking questions for centuries. Ever since the beginning of sports and games, people have found ways to compare skills by ranking each other in terms of strength and intelligence.

In the game of Go, for example, a simple rating method was officially implemented and spread throughout China and Japan in as early as the 2nd century. In the ancient Go ranking system, a difference in rank represents the number of pieces one must give up in order to achieve a fair game. By fair game, the final score of the two players must not differ by more than 20. This Go ranking method is used in many areas of China even today.

Advances in mathematics led to new kinds of ranking algorithms—including, for example, one that is formulated for matching pools of males and females based on their rankings of the other pool, called the “dating algorithm”. This algorithm was subsequently adopted in the field of medical education, to match medical students with prospective residency programs.

These kinds of questions, which have long been of interest to mathematicians and more recently to computer scientists, are known as ranking problems in statistics.

Elo’s Rating System

Among the variety of ranking algorithms, there is one rating system that is so popular that almost everyone has encountered a ranking produced by this system.

This algorithm originated in the world of chess. If you are ever around chess players, you will hear them discuss and compare their ratings, sometimes referring to their “Elo’s”. The “Elo” is a chess player’s rating; depending on which chess federation the rating is within, it represents a comparison of his or her skill to the rest of the players in the same federation.

The chess rating system, a set of algorithms that computes and changes the player’s rating after each game and tournament, was devised by chess master and physicist Arpad Elo in the 1960s and adopted by both the United State Chess Federation and the World Chess Federation (FIDE). Elo’s ranking system has been modified in many different ways throughout the years into other ranking systems in sports such as soccer and baseball, and in many two-player online games, including Starcraft II and Yahoo! Games, as indicated by Yahoo! Help. According to the World Football Elo Ratings, a version of the Elo’s Rating system is being implemented to rate football (soccer, in American parlance) teams, providing up-to-date ratings of worldwide football teams on a daily basis. Similarly, according to a prolific author from LoreHound, an online game-news and commentary community, the game matching and rating algorithm behind the popular role-playing game Starcraft II is also based on Elo’s rating system.

Why is Elo’s rating system so widely adopted? Simplicity. Elo’s rating system needs only one piece of information – win, loss, or draw of the game. Elo does not care what specific moves the player made throughout the tournament, how long he or she took between moves, or how close each move was relative to the “best” move, etc. One byte of information per game per player simplifies the computation–a simplification that was imposed by the lack of computation power in the 1960s. Elo’s rating system also never mentions the name “chess”, making it available to any other two-player game in a large arena. More importantly, Elo’s ranking system makes numerous assumptions: for example, that a player’s performance is uniformly distributed around his or her own true performance level. This assumption reasonably tolerates rare performances that are exceptionally outstanding or upsetting. Elo’s assumptions produce an algorithm that allows any chess player to calculate his or her rating by simply using a hand-held calculator immediately after the game.

Any rating system consists of two basic and logical areas – what the ratings mean and how the ratings are computed. Here we begin by addressing the former. Given the ratings of two players, Albert and Bob, we can enter a few numbers into our calculator and obtain a fraction between 0 and 1. This number represents the chance of either Albert or Bob winning. The equation is actually rather simple, as Elo intended it to be. Let’s take a look at it:

E of Albert = 1/[1+10^((Bob-Albert)/400)],

E of Bob = 1/[1+10^((Albert-Bob)/400)]

If the ratings of Albert and Bob are the same, then (Bob-Albert) is zero, 10^0 is 1, and E = 0.5. This result simply means that the chance of either Albert or Bob winning is 50%. On the other hand, if Albert is a grand chess master, while Bob is a newbie, then the chance of Albert winning should be almost 99%. Thus let’s check using the equation: a typical rating of a chess grandmaster is anything above 2200, and a newbie can have a rating of 50. In this case, (Bob-Albert)/400 is a large negative number, which makes the denominator slightly greater than 1. This gives Albert a winning chance of 99.99%.

Incidentally, if you’ve seen the movie The Social Network, then this formula should look familiar to you. When Eduardo Savarin wrote the equation on the Harvard dorm window that gives the rating of girls, it was this Elo’s equation, except Andrew Garfield—the actor playing Savarin–seemed to have forgotten the exponential sign.

The latter area of Elo’s rating system addresses the change in rating after the game. A little more complicated than the previous section, the after-ranking of the player is defined by the follow equation:

After-Rating = Rating + K (S-(E of A))

… where S is 1 if A won and 0 if A lost, and K is a constant factor depending on the level of A, usually decreasing as their rating increases. In theory, the after-rating of a player is increased or decreased by a multiple of the difference of expected win and actual win, where the multiplier varies to make it more difficult to increase a person’s rating as he or she moves higher in the rating.

Half a century after Elo invented his rating algorithm for chess, one organization stood up to challenge it. In 2010, Chessmetrics organized a contest, called “Elo versus the Rest of the World” on Kaggle.com, a “platform for data prediction competitions”. Teams and individuals were invited to implement their own rating algorithm given a large set of win/loss data from a four-year period. To test their accuracy, their algorithms were run on one year of games that were played succeeding the four-year period, and their predictions were compared with the actual results. Each algorithm, including the original Elo’s algorithm, was ranked based on its deviation from the actual data. In this ranking of accuracy, Elo’s benchmark algorithm ranked in the lower 50 percentile.

What happened?

Elo’s rating system is simple, but not accurate. Much more complex and extensive algorithms were implemented during the contest, ones that require much more computational power. The theory behind the winning submission has not been published, but it was inevitable that it would be much more accurate than Elo’s prediction.

Three months later, the World Chess Federation (FIDE) organized the exact same contest on Kaggle.com, called the “FIDE Chess Rating Challenge”. This time, data from the past 11 years are published for the programmers, and $10,000 is up for grabs. As of April 2011, the competition has one month left, and the FIDE Benchmark is currently ranked as number 68 among 126 entries.

Supervised Ranking

While Elo’s system is simple and useful under many environments, as noted above, it has its shortcomings in its accuracy. One limiting factor for Elo’s system is its simplicity. Knowing only the win and loss of a game is not nearly enough to determine the rating of players. A chess coach would be able to predict much more accurately the wins and losses among his own students.

In supervised ranking, a different section of ranking, the rank of each object is determined by all of its features that may contribute to the measurement. Supervised ranking is also applied when there is no single attribute that can be considered. One especially intriguing application of supervised ranking, and one that has saved many lives, is that of the underground cable system in New York City.

The infrastructure of Manhattan’s underground cable system was laid decades ago, and continuous repairs are necessary to prevent cable failure. Even so, hundreds of explosions occur and fires often erupt through manholes every year. According to news outlets such as Fox News and Associated Press, reasons for these explosions vary from leakage of water onto electrical cables to the buildup of carbon monoxide. To address the problem of sudden explosions and fire, MIT Sloan School of Management Professor Cynthia Rudin led a team that used supervised ranking to “predict” the manholes that are most likely to explode, allowing maintenance crews to establish priorities in scheduling maintenance work.

How does Professor Rudin come up with the ranking algorithms? To find out, I met with her at the Sloan School to learn more about the process of supervised ranking.

As part of supervised ranking for manhole maintenance in New York City, Rudin needed all the information available on every manhole in the city. Her predictions were based on tens of thousands of “trouble tickets” from the past decade – providing a thorough history and a complete profile of every manhole. Then, using this set of data and a series “regularities” that she assumes, Rudin and her team were able to come up with a software tool that generates a “report” for every manhole and its surrounding environment.

While she is forbidden from speaking of the actual algorithm and the details regarding the Manhattan manhole project, we discussed other projects she worked on to gain insights into the process of producing a ranking algorithm.

Consider, for example, the Netflix Challenge, in which we are asked to rank movies based on preferences for every user. We have a list of movies that the user has already rated (Toy Story 3 > King’s Speech > The Social Network, for example – a pre-ranking step). Based on the features of these movies and their relative rank to each other, we are able to compute how much the user emphasizes a specific feature.

In practice, the number of movies that the user pre-ranks needs to be greater than the number of features associated with each movie. This requirement stems from a simple mathematical reason: we will be solving an N dimensional system of equations and we need the number of equations to be more than the number of unknowns.

The score of a particular movie is a linear combination of the features of that movie, and the ranking is based purely on the score of each movie. At the end of the pre-ranking step, we will have n number of equations, where n is the number of movies that they have ranked, and m number of unknowns, where m is the number of features per movie. Once we solve this system of equations, we will have the “weight” of each feature, indicating how much the user values a particular feature while ranking his or her preferences.

In ranking, regularities are assumed in every problem, and people’s behavior is assumed to follow these regularities, which is the only reason that statistical rankings exist. If we did not assume there to be regularities in human behavior, machines would never be able to predict our preferences. Similarly for manholes and other objects, there exist regularities in their behavior. According to Professor Rudin, the winner of the Netflix Challenge made a daring assumption that made it stand out – it assumed that only a certain number of features matter to the user, and these features vary from user to user.

Machine Learning

In contrast to the simple movie-ranking algorithm, most algorithms require more complexity. Many require what is called machine learning. Indeed, for one of her projects, Professor Rudin made use of the Microsoft search engine known as Bing.

At the mention of search engines, the first thing that pops into most people’s minds is the widely used Google. Bing uses a completely different approach from Google. According to the published source of Google’s page ranking algorithm, web pages are ranked on the basis of their importance, and their importance is based on a number of features, such as the number of links from and to that specific web page. This is only the tip of the Page Ranking iceberg, however, with almost all of its details unpublished and unavailable to the public.

Bing, on the other hand, uses supervised ranking and machine learning. Still a newborn method of organizing search results, there is currently no published explanation of how Bing works. Even the query “How does Bing work” on Bing.com gives no information, where one of the top results is “Bing shopping”!

We do know, however, how machine learning works. A machine-learning algorithm works in a similar way to a binary search algorithm, or quick-sort algorithm, with the exception that new information is constantly being fed back to the existing data set of the machine. The changes in the machine’s database produce changes in the machine’s behavior. When machine learning is applied to the search engine, Bing, the choices that the users make when they look through the search results affect the future behavior of Bing. If every user clicks the first result, but later on clicks the second or the third result, the machine will eventually conclude that the first result is not as useful as it was before.

Sports Ranking

While Google and other search engines grant us access to incredible amounts of information in a matter of a second, this is far from pinpointing the exactly information that we look for. As information becomes more available to the public, better and more accurate algorithms are necessary to give us exactly what we want. Another factor that contributes to the demand for highly accurate algorithms is the increasing pace of living. As I mentioned earlier, ranking algorithms aren’t used exclusively for ranking information; they’re useful for almost everything in our lives. An algorithm that accurately predicts the style of clothing a buyer will like is crucial to e-commerce and will bring abundant wealth to whoever develops the best algorithm.

Another area with the potential to accrue huge profits from ranking algorithms is sports.

As described earlier, a great number of sports ratings are still based on Elo’s rating system, which takes account of only the wins or losses of games. While many sports organizations use a modified version of Elo’s rating system, the winning probability provided by the system isn’t very accurate. According to many published sources, a number of individuals have devised sports rankings using a similar method to Google’s Page Ranking algorithms, also called Power Rank. Ed Feng, a researcher in statistical mechanics from Sandia National Laboratory, first suggested the use of “Power Rank” in sports ranking. In the Power ranking system, each webpage or sports team is designated as a node with connections to various other nodes. In sports, connections represent games played with an assigned value of winning and losing scores. With this information and a large dataset, the Power Rank algorithm allows us to mathematically compute the ranking of all teams simultaneously.

Even when people are given the same dataset, the algorithms that they implement generate varying results. Given a history report of every player on every team, it is possible to create a ranking algorithm with stunning accuracy and the ability to predict the future.

That ranking algorithm’s ability to predict the future could bring its creator oodles of cash.

Works Cited

Bell, R. and Koren, Y. “Lessons from the Netflix Prize Challenge.” <http://research.yahoo.com/files.kddexp.pdf>

“BellKor’s Pragmatic Chaos.” AT&T Research. <http://www2.research.att.com/~volinsky/netflix/bpc.html>

Bisky, Tom. “When Manholes Explode, New Yorkers Shrug.” Huffington Post. July 29th, 2010. <http://www.huffingtonpost.com/tom-bisky/when-manholes-explode-new_b_663880.html&gt;

Buskirk, Eliot Van. “How the Netflix Prize Was Won.” Wired. September 22nd 2009. <http://www.wired.com/epicenter/2009/09/how-the-netflix-prize-was-won/&gt;

Carvajal, K. and Funk, L. “Explosion In Manhattan” February 11, 2010. < http://www.myfoxny.com/dpp/news/local_news/nyc/100211-manhole-explosion-starts-major-fire&gt;

“Chess Ratings – Elo Versus the Rest of the World.” Kaggle.com. <http://www.kaggle.com/c/chess&gt;

“Deloitte/FIDE Chess Rating Challenge.” Kaggle.com. <http://www.kaggle.com/c/ChessRatings2>

Heartbourne. “Starcraft 2’s Battle.net Leagues, Ladders, and Rankings Explained.” Lorehound. August 4, 2010. <http://lorehound.com/starcraft/starcraft-2s-battle-net-leagues-ladders-and-rankings-explained/&gt;

“Historical Ratings” Chessmetrics. <http://db.chessmetrics.com/CM2/HistoricalRatings.asp&gt;

“How Do the Ratings Systems Work?” Yahoo! Help. December 16, 2008. <http://help.yahoo.com/l/us/yahoo/games/play/play-03.html>

Miller, Steven C. “The USCF Elo Rating System.” 2003. <http://www.renaissanceknights.org/IL%20Scholastic/Handouts/Handouts%20PDFs/EloRatingSystem.pdf&gt;

“The Netflix Prize Rules.” Netflix. October 2006. <http://www.netflixprize.com//rules&gt;

Ross, Daniel. “Arpad Elo and the Elo Rating System.” Fall, 2007. <http://www.chessbase.com/newsdetail.asp?newsid=4326&gt;

Vovk, Bohdan. “Chess Elo Rating in Simple.” Chesselo.  <http://www.chesselo.com&gt;

“The World Football Elo Rating System.” <http://www.eloratings.net/system.html&gt;

Keren Gu is a member of the class of 2014, majoring in course 18C (Mathematics with Computer Science). She was born in Beijing, China and moved to the U.S. at the age of 12. She went to high school in Rockville, Maryland, where she discovered her interest in mathematics. Keren is a member of the MIT sailing team and spends most of her free time sailing on the Charles.

The topic of this essay was inspired by a friend who loves chess and math. After finding the algorithm challenge on Kaggle.com, she realized that it is the most suitable topic for her “science essay for the general public”. In writing this essay, Keren had much fun learning about the topic. She discovered the potential of research in ranking algorithms and thought that more people should be exposed to this topic.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s