Presentation Transcript
slide1

The Netflix Challenge: Parallel Collaborative Filtering

James Jolly

Ben Murrell

CS 387 Parallel Programming with MPI
Dr. Fikret Ercal

slide2

What is Netflix?

  • subscription-based movie rental
  • online frontend
  • over 100,000 movies to pick from
  • 8M subscribers
  • 2007 net income: $67M
slide3

What is the Netflix Prize?

  • attempt to increase Cinematch accuracy
  • predict how users will rate unseen movies
  • $1M for 10% improvement
slide4

The contest dataset…

  • contains 100,480,577 ratings
  • from 480,189 users
  • for 17,770 movies
slide5

Why is it hard?

  • user tastes difficult to model in general
  • movies tough to classify
  • large volume of data
slide6

Sounds like a job for collaborative filtering!

  • infer relationships between users
  • leverage them to make predictions
slide7

Why is it hard?

User       Movie            Rating
Dijkstra   Office Space     5
Knuth      Office Space     5
Turing     Office Space     5
Knuth      Dr. Strangelove  4
Turing     Dr. Strangelove  2
Boole      Titanic          5
Knuth      Titanic          1
Turing     Titanic          2

slide8

What makes users similar?

[Figure: users plotted by their ratings of Office Space, Titanic, and Dr. Strangelove]

slide9

What makes users similar? The Pearson Correlation Coefficient!

[Figure: users plotted by their ratings of Office Space, Titanic, and Dr. Strangelove]

pc = .813
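
To make the measure concrete, here is a small C sketch (ours, not from the slides) of the Pearson correlation between two users, computed over only the movies both have rated. The array layout and the 0-means-unrated sentinel are assumptions; run on the Knuth and Turing rows of the table above, it prints their correlation.

#include <math.h>
#include <stdio.h>

#define UNRATED 0   /* assumed sentinel: 0 means "this user has not rated this movie" */

/* Pearson correlation between two users, using only movies both have rated.
 * a[i] and b[i] hold the 1-5 star rating each user gave movie i, or UNRATED. */
double pearson(const int *a, const int *b, int num_movies)
{
    double sum_a = 0.0, sum_b = 0.0;
    int n = 0;
    for (int i = 0; i < num_movies; i++)
        if (a[i] != UNRATED && b[i] != UNRATED) {
            sum_a += a[i];
            sum_b += b[i];
            n++;
        }
    if (n < 2)
        return 0.0;                       /* not enough overlap to correlate */

    double mean_a = sum_a / n, mean_b = sum_b / n;
    double cov = 0.0, var_a = 0.0, var_b = 0.0;
    for (int i = 0; i < num_movies; i++)
        if (a[i] != UNRATED && b[i] != UNRATED) {
            double da = a[i] - mean_a, db = b[i] - mean_b;
            cov   += da * db;
            var_a += da * da;
            var_b += db * db;
        }
    if (var_a == 0.0 || var_b == 0.0)
        return 0.0;                       /* a constant rater has no defined correlation */
    return cov / sqrt(var_a * var_b);
}

int main(void)
{
    /* Knuth and Turing from the table above: Office Space, Dr. Strangelove, Titanic */
    int knuth[]  = {5, 4, 1};
    int turing[] = {5, 2, 2};
    printf("pc(Knuth, Turing) = %.3f\n", pearson(knuth, turing, 3));
    return 0;
}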

slide11

Predicting user ratings…

Would Chomsky like “Grammar Rock”?

  • approach:
    • use the correlation matrix to find users similar to Chomsky
    • drop ratings from those who haven’t seen it
    • take a weighted average of the remaining ratings
slide12

Predicting user ratings…

Suppose Turing, Knuth, and Boole rated it 5, 3, and 1.

Since .125 + .5 + .5 = 1.125, we predict…

rChomsky = ( (.125/1.125)·5 + (.5/1.125)·3 + (.5/1.125)·1 )

rChomsky ≈ 2.33
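
The same weighted average as a small C sketch (ours), using the correlation weights and ratings quoted above; predict() is just a name we chose:

#include <stdio.h>

/* Predict a rating as the similarity-weighted average of the ratings
 * given by correlated users who have seen the movie. */
double predict(const double *corr, const double *rating, int n)
{
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; i++) {
        num += corr[i] * rating[i];
        den += corr[i];
    }
    return den != 0.0 ? num / den : 0.0;
}

int main(void)
{
    /* Chomsky's correlations with Turing, Knuth, and Boole, and their ratings */
    double corr[]   = {0.125, 0.5, 0.5};
    double rating[] = {5, 3, 1};
    printf("rChomsky = %.2f\n", predict(corr, rating, 3));   /* prints 2.33 */
    return 0;
}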

slide13

So how is the data really organized?

movie file 1:  user 1, rating ‘5’;  user 13, rating ‘3’;  user 42, rating ‘2’;  …
movie file 2:  user 13, rating ‘1’;  user 42, rating ‘1’;  user 1337, rating ‘2’;  …
movie file 3:  user 13, rating ‘5’;  user 311, rating ‘4’;  user 666, rating ‘5’;  …

slide14

Training Data

  • 17,770 text files (one for each movie)
  • > 2 GB
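
For illustration, a minimal C loader for one movie file, assuming the layout sketched above of one "userID,rating" pair per line. The struct, function, and filename are ours, and the real contest files also carry a date column that this sketch ignores.

#include <stdio.h>

/* One (user, rating) pair from a movie file, following the layout sketched
 * above; the real contest files also carry a date column, ignored here. */
struct rating {
    int user_id;
    int stars;     /* 1-5 */
};

/* Read up to max_ratings "userID,rating" lines from one movie file.
 * Returns the number read, or -1 if the file cannot be opened. */
int load_movie_file(const char *path, struct rating *out, int max_ratings)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    int n = 0;
    while (n < max_ratings &&
           fscanf(f, "%d,%d", &out[n].user_id, &out[n].stars) == 2)
        n++;

    fclose(f);
    return n;
}

int main(void)
{
    struct rating buf[1024];
    /* hypothetical filename for movie 1; the dataset has 17,770 such files */
    int n = load_movie_file("mv_0000001.txt", buf, 1024);
    printf("read %d ratings\n", n);
    return 0;
}
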
slide15

Parallelization

  • Two-Step Process:
    • Learning Step
    • Prediction Step
  • Concerns:
    • Data Distribution
    • Task Distribution
slide18

Parallelizing the learning step…

  • store data as user[movie] = rating
  • each proc holds all rating data for n/p users
  • calculate each correlation c_ij
  • the calculation requires message passing: only 1/p of the correlations can be computed locally within a node (see the sketch below)
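
One way that message passing could be organized (our sketch, not the authors' implementation): keep each rank's n/p users as a dense block and circulate copies of the blocks around a ring, so every rank eventually computes its rows of the correlation matrix. It reuses the pearson() routine sketched earlier; block sizes and names are assumptions.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define NUM_MOVIES     17770
#define USERS_PER_PROC 64            /* n/p users per rank; toy value */

/* pearson() as sketched earlier */
double pearson(const int *a, const int *b, int num_movies);

/* Each rank owns a dense block of USERS_PER_PROC users.  Copies of the
 * blocks travel around a ring; after nprocs steps every rank has seen every
 * user block and has filled its USERS_PER_PROC rows of the correlation
 * matrix corr[i][global_j]. */
void learn(int rank, int nprocs,
           int local[USERS_PER_PROC][NUM_MOVIES],
           double *corr)             /* USERS_PER_PROC x (USERS_PER_PROC*nprocs) */
{
    int total_users = USERS_PER_PROC * nprocs;
    size_t block = (size_t)USERS_PER_PROC * NUM_MOVIES;

    int (*visit)[NUM_MOVIES] = malloc(block * sizeof(int));
    memcpy(visit, local, block * sizeof(int));   /* start with our own block */

    int left  = (rank - 1 + nprocs) % nprocs;
    int right = (rank + 1) % nprocs;

    for (int step = 0; step < nprocs; step++) {
        int owner = (rank + step) % nprocs;      /* whose users are visiting */

        for (int i = 0; i < USERS_PER_PROC; i++)
            for (int j = 0; j < USERS_PER_PROC; j++)
                corr[(size_t)i * total_users + owner * USERS_PER_PROC + j] =
                    pearson(local[i], visit[j], NUM_MOVIES);

        /* pass the visiting block to the left, receive the next from the right */
        MPI_Sendrecv_replace(visit, (int)block, MPI_INT, left, 0, right, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    free(visit);
}

The ring keeps only one extra user block in memory per rank at a time, at the cost of p communication steps.
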
slide19

Parallelizing the prediction step…

  • Data distribution directly affects task distribution
    • Method 1: Store all user information on each processor and stripe movie information (less communication)

[Diagram: predict(user, movie) → rating estimate]

slide20

Parallelizing the prediction step…

  • Data distribution directly affects task distribution
    • Method 2: Store all movie information on each processor and stripe user information (more communication)

[Diagram: predict(user, movie) → gather partial estimates from each processor]
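
A sketch (ours) of that gather of partial estimates, assuming each rank holds the target user's correlations with its own stripe of users and those users' ratings for the target movie: every rank forms a partial weighted sum and MPI_Reduce combines them on rank 0. Reducing the two partial sums rather than the raw ratings keeps the per-prediction traffic to a pair of doubles per rank.

#include <mpi.h>

/* Each rank holds, for its own stripe of users, the correlation between the
 * target user and each local user (corr[]) and each local user's rating for
 * the target movie (rating[], 0 if unseen).  Every rank contributes a
 * partial weighted sum; MPI_Reduce combines them into one estimate on rank 0. */
double predict_distributed(const double *corr, const int *rating,
                           int local_users, MPI_Comm comm)
{
    double part[2] = {0.0, 0.0};     /* {weighted sum of ratings, sum of weights} */
    for (int i = 0; i < local_users; i++)
        if (rating[i] != 0) {
            part[0] += corr[i] * rating[i];
            part[1] += corr[i];
        }

    double total[2] = {0.0, 0.0};
    MPI_Reduce(part, total, 2, MPI_DOUBLE, MPI_SUM, 0, comm);

    int rank;
    MPI_Comm_rank(comm, &rank);
    /* only rank 0 holds the combined estimate */
    return (rank == 0 && total[1] != 0.0) ? total[0] / total[1] : 0.0;
}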

slide21

Parallelizing the prediction step…

  • Data distribution directly affects task distribution
    • Method 3: hybrid approach (lots of communication, high number of nodes)

[Diagram: predict(user, movie)]

slide22

Our Present Implementation

  • operates on a trimmed-down dataset
  • stripes movie information and stores the similarity matrix on each processor
  • this won’t scale well!
  • storing all movie information on each node would be optimal, but nic.mst.edu can’t handle it
slide23

In summary…

  • tackling the Netflix Prize requires lots of data handling
  • we are working toward an implementation that can operate on the entire training set
  • simple collaborative filtering should get us close to the old Cinematch performance