As soon as you start looking at the data set it becomes obvious why it is so difficult to get good results. Databases don't have the linear algebra and other mathematical tools for taking a run at the prize but they are convenient for exploring data sets, so I loaded the data into a SQL Anywhere database (the developer edition is a free download, and I'll provide a Perl script to load the data if you really want it) and started poking around. Here are a few of the more obvious oddities (all these observations have been posted elsewhere - see the Netflix prize forum for more):
* Customer 2170930 has rated 1963 titles and given each and every one a rating of one (very bad). You would think they would have cancelled their subscription by now.
* Five customers have rated over 10,000 of the 17,770 titles selected - and presumably they also have rated some of the others among the 60,000 or so titles Netflix had available when they released the ratings. Are these real people?
* Customer 305344 had rated 17654 titles. Even though Netflix make it easy to rate titles that you have not rented from them (so they can get a handle on your preferences) can this be real?
* Customer 1664010 rated 5446 titles in a single day (October 12, 2005).
* Customer 2270619 has rated 1975 titles. 1931 were given a 5, 31 were given a 4, 10 given a 3, 2 given a 2 (Grumpy Old Men and Sex In Chains) and a single title was given a 1. That title? Gandhi, which has an average rating of over 4 and which less than 2% of those who watch it give a 1.
* The most often rated movie? Miss Congeniality with ratings by over 232,000 of the 480,000 customers. And which title is most similar to it in terms of ratings (using a slightly weighted Pearson formula)? Bloodfist 5: Human Target.
* Most highly rated - Lord of the Rings: Return of the King (Extended Edition), with 4.7.

There's lots more worthwhile analysis of the strengths and weaknesses of recommender systems in general in the post. (Via Marginal Revolution.)
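
For readers who want to try this sort of poking around without a database, here is a minimal Python sketch of two of the checks described in the list: finding customers who gave every title the same rating, and measuring how similar two titles are by the ratings of customers who rated both. Everything in it is an assumption made for illustration: the `RATINGS` rows are invented toy data rather than anything from the Netflix set, the function names are hypothetical, and the correlation is a plain unweighted Pearson, since the quoted post does not say how its "slightly weighted" variant works.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy (customer_id, title, rating) rows, invented purely for illustration.
RATINGS = [
    (1, "A", 1), (1, "B", 1), (1, "C", 1),
    (2, "A", 5), (2, "B", 4), (2, "C", 5),
    (3, "A", 3), (3, "B", 2), (3, "C", 4),
    (4, "A", 4), (4, "B", 3),
]


def customer_profiles(rows):
    """Map each customer to a Counter of rating value -> number of titles."""
    profiles = defaultdict(Counter)
    for customer, _title, rating in rows:
        profiles[customer][rating] += 1
    return profiles


def single_value_raters(rows, min_titles=3):
    """Customers with at least min_titles ratings, all of the same value."""
    return sorted(
        customer
        for customer, dist in customer_profiles(rows).items()
        if sum(dist.values()) >= min_titles and len(dist) == 1
    )


def pearson(rows, title_x, title_y):
    """Plain (unweighted) Pearson correlation over customers who rated both
    titles. Returns None when there is too little overlap or no variance."""
    by_title = defaultdict(dict)
    for customer, title, rating in rows:
        by_title[title][customer] = rating

    common = by_title[title_x].keys() & by_title[title_y].keys()
    n = len(common)
    if n < 2:
        return None

    xs = [by_title[title_x][c] for c in common]
    ys = [by_title[title_y][c] for c in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    print("single-value raters:", single_value_raters(RATINGS))
    print("similarity of A and B:", pearson(RATINGS, "A", "B"))
```

On the real data the same grouping would flag an account like the all-ones rater, and ranking every title by its correlation with one chosen title is the kind of calculation that can produce a surprising nearest neighbour when only a few customers rated both.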
Monday, August 06, 2007
"The Netflix Prize: 300 Days Later". One entertaining tidbit: