
Netflix Prize: Predicting Ratings


Presentation Transcript


  1. Netflix Prize: Predicting Ratings

  2. Data • mv_00(movieID).txt — the first line is the movie ID followed by a colon (e.g. "1:"); each remaining line pairs a userID (1-2,649,429) with a rating (1-5) • Over 17,000 movie .txt files • Over 400,000 userIDs • Two gigabytes zipped
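A minimal Java sketch of reading one of these movie files, assuming the first line is the movie ID followed by a colon and every remaining line is a comma-separated userID and rating (any extra fields on a line are ignored); the class name and return type are illustrative, not from the slides:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative parser for one mv_00*.txt file: first line "movieID:",
// remaining lines "userID,rating[,...]" (trailing fields, if any, are ignored).
public class MovieFileParser {

    public static Map<Integer, Integer> parse(String path) throws IOException {
        Map<Integer, Integer> ratingsByUser = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String header = in.readLine();                       // e.g. "1:"
            int movieId = Integer.parseInt(header.replace(":", "").trim());
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",");
                int userId = Integer.parseInt(parts[0].trim());  // 1 .. 2,649,429
                int rating = Integer.parseInt(parts[1].trim());  // 1 .. 5
                ratingsByUser.put(userId, rating);
            }
            System.out.println("movie " + movieId + ": " + ratingsByUser.size() + " ratings");
        }
        return ratingsByUser;
    }
}
```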

  3. Overall Plan • Compute user similarity using: • termFrequency: # of movies in common • documentFrequency: 1/|rating1 – rating2| • tfdf = (# of movies in common) * 1/|rating1 – rating2|
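As written, 1/|rating1 - rating2| divides by zero whenever two users gave the same rating, and the slide does not say how per-movie values combine. The sketch below follows the tf * df idea with a guard for equal ratings and a per-movie sum; both of those choices are assumptions, not something the slides spell out:

```java
import java.util.Map;

// Illustrative tf*df similarity between two users, each represented as
// a map from movieID to rating.
public class Similarity {

    public static double tfdf(Map<Integer, Integer> a, Map<Integer, Integer> b) {
        int moviesInCommon = 0;      // termFrequency: # of movies in common
        double df = 0.0;             // documentFrequency: sum of 1/|rating1 - rating2|
        for (Map.Entry<Integer, Integer> e : a.entrySet()) {
            Integer otherRating = b.get(e.getKey());
            if (otherRating == null) continue;
            moviesInCommon++;
            int diff = Math.abs(e.getValue() - otherRating);
            // Assumption: treat identical ratings as the strongest agreement
            // instead of dividing by zero.
            df += 1.0 / Math.max(diff, 1);
        }
        return moviesInCommon * df;  // tfdf = tf * df
    }
}
```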

  4. Plan 1 • Store it all in memory (haha) in Java • Store a User class with: • userID • Array of Movie classes: • movieID • rating • Then have a matrix of users, each with an array of its top similar users (using tfdf) • Problem 1 - Memory issues
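A rough sketch of the Plan 1 in-memory layout; the field types and the nested class are assumptions inferred from the slide's wording:

```java
import java.util.List;

// Plan 1 data layout: everything held in memory.
public class User {
    int userId;
    List<Movie> movies;          // every movie this user rated
    List<User>  topSimilarUsers; // filled in later using the tfdf score

    // One (movieID, rating) pair for this user.
    static class Movie {
        int movieId;
        int rating;              // 1..5
    }
}
```

With over 400,000 users and 17,000+ movie files, holding every (movieID, rating) pair as an object plus a per-user list of similar users is where the memory issues come from.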

  5. Plan 2* • Step 1: store the data in text files on the hard drive using Java • one text file per user • Step 2: compute similarity (tfdf) • a text file of the top ten users for each user • Step 3: predictions • Run through the two directories of text files to compute an average movie rating as the prediction • Problem 2 - Very slow: • Step 1: 3 days – ~5000 movie text files processed so far • Step 2: 1 user every 35 mins | 1 user every 5 mins • Step 3: ~10 minutes currently
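For Step 3, a hedged sketch of the averaging pass described above: read the per-user text files of a user's most similar users and average the ratings they gave the target movie. The file naming, the one-rating-per-line format, and the 3.0 fallback are assumptions:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.List;

// Plan 2, Step 3: predict a rating as the average rating that the
// user's most similar users gave to the same movie.
public class Predictor {

    private final String userDir; // directory of per-user "movieID,rating" files (assumed layout)

    public Predictor(String userDir) {
        this.userDir = userDir;
    }

    public double predict(List<Integer> topSimilarUsers, int movieId) throws IOException {
        double sum = 0;
        int count = 0;
        for (int userId : topSimilarUsers) {
            String path = Paths.get(userDir, userId + ".txt").toString();
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split(",");
                    if (Integer.parseInt(parts[0].trim()) == movieId) {
                        sum += Integer.parseInt(parts[1].trim());
                        count++;
                    }
                }
            }
        }
        // Assumed fallback: predict 3.0 when no similar user rated the movie.
        return count == 0 ? 3.0 : sum / count;
    }
}
```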

  6. Plan 3 • Step 1: Store the text files' data in a database using PHP • Table: userID | movieID | rating • Primary keys: userID, movieID • Step 2: Compute similarity • Table: userID | 1st similar userID | 2nd similar userID | etc. • Primary key: userID • Step 3: Predictions • Problem 3 - Very slow: • Step 1: 4 days – 7000 movie text files loaded so far • Step 2: n/a • Step 3: n/a
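The slides do this loading step in PHP; purely to keep the examples in one language, here is the Step 1 ratings table expressed with Java/JDBC and a MySQL-style connection. Only the table and column names come from the slide; the connection string, driver, and example values are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

// Plan 3, Step 1 schema: one row per (userID, movieID) rating, keyed on both columns.
public class RatingsDb {

    public static void main(String[] args) throws SQLException {
        // Connection details are illustrative only; the slides used PHP for this step.
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost/netflix", "user", "password")) {

            try (Statement s = db.createStatement()) {
                s.execute("CREATE TABLE IF NOT EXISTS ratings ("
                        + " userID  INT NOT NULL,"
                        + " movieID INT NOT NULL,"
                        + " rating  TINYINT NOT NULL,"
                        + " PRIMARY KEY (userID, movieID))");
            }

            // Insert one parsed (userID, movieID, rating) triple (example values).
            try (PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO ratings (userID, movieID, rating) VALUES (?, ?, ?)")) {
                insert.setInt(1, 6);
                insert.setInt(2, 1);
                insert.setInt(3, 3);
                insert.executeUpdate();
            }
        }
    }
}
```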

  7. Results • Predicting 3.0 for everything: • RMSE = 1.3149 • With the similarities I have so far: • RMSE = 1.3149 | 384 users • RMSE = 1.3149 | 575 users • http://www.netflixprize.com/leaderboard • Grand Prize RMSE = 0.8563 • RMSE = sqrt(avg((actual_rating - predicted_rating)^2))
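The RMSE definition from the last bullet, written out as a small Java helper with a toy example:

```java
// RMSE = sqrt(avg((actual_rating - predicted_rating)^2)), as defined on the slide.
public class Rmse {

    public static double rmse(double[] actual, double[] predicted) {
        double sumSquaredError = 0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sumSquaredError += diff * diff;
        }
        return Math.sqrt(sumSquaredError / actual.length);
    }

    public static void main(String[] args) {
        // Predicting 3.0 for every rating, as in the first result above.
        double[] actual    = {1, 3, 5, 4};
        double[] predicted = {3, 3, 3, 3};
        System.out.println(rmse(actual, predicted)); // 1.5 on this toy example
    }
}
```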

  8. Future Idea
