Created 16th October, 2006 12:43 (UTC), last edited 28th October, 2006 09:56 (UTC)
I find the Netflix Prize interesting on three main counts:
- It's an interesting computational problem which seems to have all sorts of avenues for attack.
- It has a million-dollar prize, which can't be anything but interesting.
- The dataset sizes and computation framework make it a good test of FOST.3™'s capabilities.
A more minor point is that I like this way for companies to conduct research. They get a great deal, with armies of people attacking the problem, and it allows armchair researchers (dilettantes, in fact) to play with data sets and algorithm design that we'd not normally get a chance to play with. You simply cannot make up this volume of data, so it's great that somebody else has taken the time and effort to prepare it for us.
So far I've started to read the data set into a FOST.3™ database, and I have ideas for a few simple algorithms that certainly aren't going to win any prizes but will help to refine the changes in the batch processing systems.
Data volume
It shouldn't be underestimated just how much static data Netflix have supplied.
- The supplied data files are around 2GB uncompressed.
- There are 17,700 movies.
- There are about 480,000 customers.
- The qualifying data set is around 2.8 million ratings. The algorithm's predicted ratings for this data set are what must be submitted to Netflix for scoring. The score that they present on the leader board is then for a subset of these ratings.
- The probe data set contains around 1.4 million customer ratings. Netflix selected it to be statistically similar to the qualifying data set, but unlike the qualifying set, the actual customer ratings have also been supplied.
- The training data set (with the probe removed) still contains around 100 million customer ratings.
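Because the probe set comes with its actual ratings, it can be used to score an algorithm locally before anything is submitted. The following is a minimal Python sketch of that idea, not the FOST.3™ implementation. It assumes ratings are held as (movie, customer, rating) triples, uses a per-movie mean as a stand-in for one of the simple algorithms, and measures error with RMSE, the metric the Netflix Prize was judged on. The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def movie_means(training):
    """Return ({movie: mean rating}, global mean) from (movie, customer, rating) triples.

    Assumes the training data is non-empty.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for movie, _customer, rating in training:
        totals[movie] += rating
        counts[movie] += 1
    global_mean = sum(totals.values()) / sum(counts.values())
    means = {movie: totals[movie] / counts[movie] for movie in totals}
    return means, global_mean


def rmse(predict, probe):
    """Root mean squared error of predict(movie, customer) over the probe triples."""
    squared_error = 0.0
    n = 0
    for movie, customer, actual in probe:
        squared_error += (predict(movie, customer) - actual) ** 2
        n += 1
    return math.sqrt(squared_error / n)


# Toy data only; the real training set has around 100 million triples.
training = [(1, 10, 4), (1, 11, 2), (2, 10, 5)]
probe = [(1, 12, 3), (2, 11, 4), (3, 10, 4)]

means, global_mean = movie_means(training)


def predict(movie, _customer):
    # Fall back to the global mean for a movie with no training ratings.
    return means.get(movie, global_mean)


score = rmse(predict, probe)
print(f"probe RMSE: {score:.4f}")
```

With the real data the training triples would have to be streamed rather than held in a list, but the shape of the calculation stays the same: train on the set with the probe removed, then score against the probe.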
Pages
- Design
- Static data
- Class diagram