Lecture 5

SVD178winter07.ppt - PowerPoint PPT Presentation

Lecture 5

Instructor: Max Welling

Squared Error Matrix Factorization

Remember User Ratings

[Figure: the ratings matrix of movies (+/- 17,770) by users (+/- 240,000), with a total of +/- 400,000,000 nonzero entries (99% sparse).]

  • We are going to “squeeze out” the noisy, uninformative content of the matrix.

  • In doing so, the algorithm will learn to retain only the most valuable information for predicting the ratings in the data matrix.

  • This can then be used to more reliably predict the entries that were not rated.

Squeezing out Information

[Figure: the ratings matrix of movies (+/- 17,770) by users (+/- 240,000), with a total of +/- 400,000,000 nonzero entries (99% sparse), is factorized into two much smaller matrices.]

  • “K” is the number of factors, or topics.
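As a concrete sketch of the shapes involved (a toy illustration in numpy; the names A, B, K follow the slides, everything else is my own):

```python
import numpy as np

# Netflix-scale dimensions from the slide: M movies, U users, K topics.
M, U, K = 17770, 240000, 5

rng = np.random.default_rng(0)
A = rng.standard_normal((M, K))   # one K-dimensional signature per movie
B = rng.standard_normal((K, U))   # one K-dimensional signature per user

# The product A @ B would reconstruct a full M x U ratings matrix,
# but it is parameterized by only (M + U) * K numbers.
print(M * U)        # ~4.3 billion entries in the full matrix
print((M + U) * K)  # ~1.3 million parameters in the factorization
```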

Interpretation

star-wars = [10, 3, -4, -1, 0.01]

  • Before, each movie was characterized by a signature over 240,000 user-ratings.

  • Now, each movie is represented by a signature of K “movie-genre” values (or topics, or factors).

  • Movies that are similar are expected to have similar values for their movie-genres.

  • Before, users were characterized by their ratings over all movies.

  • Now, each user is represented by his/her values over movie-genres.

  • Users that are similar have similar values for their movie-genres.

Dual Interpretation

  • Interestingly, we can also interpret the factors/topics as user-communities.

  • Each user is represented by his/her “participation” in K communities.

  • Each movie is represented by the ratings of these communities for the movie.

  • Pick whichever interpretation you like! I’ll call them topics or factors from now on.


The Algorithm



  • In an equation, the squeezing boils down to: R ≈ A·B, i.e. R_mu ≈ Σ_k A_mk B_ku.

  • We want to find A, B such that we can reproduce the observed ratings as well as we can, given that we are only allowed K topics.

  • This reconstruction of the ratings will not be perfect (otherwise we would over-fit).

  • To do so, we minimize the squared error (Frobenius norm): E = Σ_mu ( R_mu − Σ_k A_mk B_ku )²
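A minimal numpy sketch of this objective, summing only over the observed entries (the mask, sizes, and function name are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
M, U, K = 6, 8, 2
A = rng.standard_normal((M, K))
B = rng.standard_normal((K, U))

# mask[m, u] = True where user u rated movie m (about 30% observed here).
mask = rng.random((M, U)) < 0.3
R = A @ B + 0.1 * rng.standard_normal((M, U))   # noisy toy ratings

def squared_error(R, A, B, mask):
    """Squared (Frobenius-style) error over the observed entries only."""
    residual = (R - A @ B) * mask
    return float(np.sum(residual ** 2))

print(squared_error(R, A, B, mask))  # small, since A and B generated the data
```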

Prediction

  • We train A, B to reproduce the observed ratings only, and we ignore all the non-rated movie-user pairs.

  • But here comes the magic: after learning A, B, the product A×B will have filled in values for the non-rated movie-user pairs!

  • It has implicitly used the information from similar users and similar movies to generate ratings for movies that weren’t there before.

  • This is what “learning” is: we can predict things we haven’t seen before by looking at old data.

The Algorithm

  • We want to minimize the error. Let’s take the gradients again:

    ∂E/∂A_mk = −2 Σ_u ( R_mu − Σ_k' A_mk' B_k'u ) B_ku
    ∂E/∂B_ku = −2 Σ_m ( R_mu − Σ_k' A_mk' B_k'u ) A_mk

  • We could now follow the gradients again: A ← A − η ∂E/∂A, B ← B − η ∂E/∂B.

  • But wait:

    • how do we ignore the non-observed entries?

    • for Netflix this will actually be very slow; how can we make it practical?

Stochastic Gradient Descent

  • First we pick a single observed movie-user pair at random: R_mu.

  • Then we ignore the sums over u, m in the exact gradients and do an update:

    A_mk ← A_mk + η ( R_mu − Σ_k' A_mk' B_k'u ) B_ku
    B_ku ← B_ku + η ( R_mu − Σ_k' A_mk' B_k'u ) A_mk

  • The trick is that although we don’t follow the exact gradient, on average we do move in the correct direction.
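The updates above can be sketched as a toy loop (a fully observed small matrix with a step size I picked myself; the real Netflix version would sample from the observed pairs only):

```python
import numpy as np

rng = np.random.default_rng(2)
M, U, K, eta = 10, 12, 2, 0.05

R = rng.standard_normal((M, K)) @ rng.standard_normal((K, U))  # toy rank-K ratings

A = 0.1 * rng.standard_normal((M, K))
B = 0.1 * rng.standard_normal((K, U))

for _ in range(50000):
    m, u = rng.integers(M), rng.integers(U)   # pick one pair at random
    a_m = A[m].copy()                 # cache so both updates use the old values
    err = R[m, u] - a_m @ B[:, u]     # residual for this single pair
    A[m]    += eta * err * B[:, u]    # stochastic step on the movie signature
    B[:, u] += eta * err * a_m        # stochastic step on the user signature

print(np.mean((R - A @ B) ** 2))      # far below the variance of R
```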

Stochastic Gradient Descent

[Figure: trajectories of stochastic updates versus full updates (averaged over all data-items) near the minimum.]

  • Stochastic gradient descent does not converge to the minimum, but “dances” around it.

  • To get to the minimum, one needs to decrease the step-size as one gets closer to the minimum.

  • Alternatively, one can obtain a few samples and average predictions over them (similar to bagging).
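Decreasing the step-size can be as simple as the following schedule (a sketch; the constants are arbitrary choices of mine):

```python
# Hypothetical schedule: eta_t = eta0 / (1 + t / tau), so early updates are
# large and later updates (close to the minimum) become small.
eta0, tau = 0.05, 10_000.0

def step_size(t):
    return eta0 / (1.0 + t / tau)

print(step_size(0))        # 0.05 at the start
print(step_size(100_000))  # roughly ten times smaller after many updates
```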

Weight Decay

  • Often it is good to make sure the values in A, B do not grow too big.

  • We can make that happen by adding weight-decay terms which keep them small.

  • Small weights are often better for generalization, but this depends on how many topics, K, you choose.

  • The simplest weight decay results in:

    A_mk ← A_mk + η [ ( R_mu − Σ_k' A_mk' B_k'u ) B_ku − λ A_mk ]
    B_ku ← B_ku + η [ ( R_mu − Σ_k' A_mk' B_k'u ) A_mk − λ B_ku ]

  • But a more aggressive variant is to replace
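The simplest variant corresponds to a stochastic update with an extra shrinkage term (a sketch for one movie-user pair; λ and the toy values are my own choices):

```python
import numpy as np

eta, lam = 0.05, 0.02
rng = np.random.default_rng(3)
a_m = rng.standard_normal(3)   # K=3 signature for movie m
b_u = rng.standard_normal(3)   # K=3 signature for user u
r_mu = 4.0                     # the observed rating for this pair

# Gradient step plus weight decay: without the -lam terms this is plain SGD.
err = r_mu - a_m @ b_u
a_new = a_m + eta * (err * b_u - lam * a_m)
b_new = b_u + eta * (err * a_m - lam * b_u)
```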

Incremental Updates

  • One idea is to train one factor at a time:

    • it’s fast;

    • one can monitor performance on held-out test data.

  • A new factor is trained on the residuals of the model up till now.

  • Hence, we get: R'_mu = R_mu − Σ_k'<k A_mk' B_k'u as the target for factor k.

  • If you fit slowly (i.e. don’t run to convergence; use weight decay and shrinkage), results are better.
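A sketch of the residual idea on a toy rank-2 matrix (each rank-1 factor is fit with an SVD here purely for brevity; the lecture fits each factor with stochastic gradient descent instead):

```python
import numpy as np

rng = np.random.default_rng(4)
M, U = 15, 25
R = rng.standard_normal((M, 2)) @ rng.standard_normal((2, U))  # rank-2 toy ratings

residual = R.copy()
factors = []
for k in range(2):
    # Fit one rank-1 factor a * b^T to whatever the previous factors left over.
    u_mat, s, vh = np.linalg.svd(residual)
    a, b = s[0] * u_mat[:, 0], vh[0]
    factors.append((a, b))
    residual -= np.outer(a, b)        # the next factor trains on this residual

print(np.abs(residual).max())         # ~0: two factors explain a rank-2 matrix
```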

Pre-Processing

  • We are almost ready for implementation.

  • But we can get a head start if we first remove some “obvious” structure from the data, so the algorithm doesn’t have to search for it.

  • In particular, you can subtract the user and movie means, where you ignore missing entries.

  • N_mu is the total number of observed movie-user pairs.

  • Note that you can now predict missing entries by simply using the subtracted means (movie mean plus user mean).
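A numpy sketch of this preprocessing on hypothetical small data (movie means first, then user means of what remains, ignoring missing entries throughout):

```python
import numpy as np

rng = np.random.default_rng(5)
M, U = 10, 12
R = rng.uniform(1, 5, (M, U))              # toy ratings on a 1-5 scale
mask = rng.random((M, U)) < 0.5            # True where a rating is observed

n_per_movie = np.maximum(mask.sum(axis=1), 1)   # guard against empty rows
movie_mean = np.where(mask, R, 0.0).sum(axis=1) / n_per_movie

R1 = np.where(mask, R - movie_mean[:, None], 0.0)
n_per_user = np.maximum(mask.sum(axis=0), 1)
user_mean = R1.sum(axis=0) / n_per_user

R_centered = np.where(mask, R1 - user_mean[None, :], 0.0)

# Baseline prediction for an unobserved pair (m, u): just add the two means.
m, u = 0, 0
baseline = movie_mean[m] + user_mean[u]
```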

Homework

  • Subtract the movie and user means from the Netflix data.

  • Check theoretically and experimentally that both the user and movie means are 0.

  • Make a prediction for the quiz data and submit it to Netflix (your RMSE should be around 0.99).

  • Implement the stochastic matrix factorization and submit to Netflix (you should get around 0.93).