
Netflix


ER diagram: Movie(MID, Mname, Date) and User(UID, CNAME, AGE), connected by the relationship Rents(TranID, Rating, Date).

We classify using a TrainingTable on a column (possibly composite), called the class label column.

The Netflix Contest classification uses the Rents TrainingTable, Rents(MID, UID, Rating, Date), with class label Rating, to classify new (MID, UID, Date) tuples (i.e., predict ratings).

How, when there are no features except Date?

Nearest Neighbor User Voting: We won’t base “near” on a distance over features since there is only one feature, date.

uid votes if it is near enough to UID in its ratings of the movies M = {mid1, ..., midk}

(i.e., near is based on a User-User correlation over M ).

We need to select the User-User-Correlation (Pearson or Cosine or?) and the set M={mid1,…, midk }.

Nearest Neighbor Movie Voting:

mid votes if its ratings by U = {uid1, ..., uidk} are near enough to those of MID

(i.e., near is based on a Movie-Movie correlation over U).

We need to select the Movie-Movie-Correlation (Pearson or Cosine or?) and the set U={uid1,…, uidk }.
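To make the correlation choice concrete, here is a minimal sketch (C++, not taken from the project code) of a Pearson correlation between two users over their co-rated movies; the map-based ratings representation and the function name pearson are assumptions for illustration only.

// Sketch (illustrative, not the mpp code): Pearson correlation between two
// users over the movies both have actually rated. Ratings are 1..5; movies
// one of them has not rated are simply skipped.
#include <cmath>
#include <map>

double pearson(const std::map<int,int>& ratingsA,   // mid -> rating by user A
               const std::map<int,int>& ratingsB) { // mid -> rating by user B
    double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
    int n = 0;
    for (const auto& [mid, ra] : ratingsA) {
        auto it = ratingsB.find(mid);
        if (it == ratingsB.end()) continue;          // keep only co-rated movies
        double a = ra, b = it->second;
        sumA += a;  sumB += b;
        sumAA += a * a;  sumBB += b * b;  sumAB += a * b;
        ++n;
    }
    if (n == 0) return 0.0;                          // no co-rated movies
    double num = n * sumAB - sumA * sumB;
    double den = std::sqrt(n * sumAA - sumA * sumA) *
                 std::sqrt(n * sumBB - sumB * sumB);
    return (den == 0.0) ? 0.0 : num / den;           // in [-1, 1]; 1 = fully correlated
}

A cosine similarity over the same co-rated set differs only in that the rating vectors are not mean-centered first; the Movie-Movie case is symmetric, with users playing the role of the co-rated items.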

Netflix Cinematch predicts potential new ratings.

Using these predicted ratings, Netflix recommends next rentals (those that are predicted to be 5's or 4's?).


Netflix

The Program: Code Structure - the main modules

mpp-mpred.C

mpp-user.C

movie-vote.C

user-vote.C

prune.C

mpp-mpred.C reads a Netflix PROBE file and loops thru the (Mi, ProbeSupport(Mi)) records, passing each to mpp-user.C.

mpp-mpred.C can also call separate instances of mpp-user.C for many Us, to be processed in parallel (governed by the number of slots specified in the 1st code line.)

(Mi, ProbeSupport(Mi) = {Ui1, …, Uik})

mpp-user.C loops thru ProbeSupport(M), reads in the matched-up config file, and prints prediction(M,U) to the file predictions.

For the user-vote-approach, mpp-user.C calls user-vote.C

For the movie-vote-approach, mpp-user.C calls movie-vote.C

Loops thru ProbeSupport and, from the user-vote and/or movie-VOTE, calculates and writes Predict(Mi, Uik) to predictions, for all Uik ∈ ProbeSupport(Mi).

user-vote.C does the specified pruning (from the config, etc.) by calling prune.C, then loops thru the pruned set of user voters. For each V, it calculates a vote, combines those votes (weighted?) into one vote, and returns it.

Diagram labels: (Mi, Support(Mi), Uik, Support(Uik)) is passed to each voting module; vote(Mi, Uik) and VOTE(Mi, Uik) are returned.

movie-vote.C does similarly.
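The vote-combining step is described only loosely above ("combines those votes (weighted?) into one vote"); as a hedged sketch of one natural choice, a similarity-weighted average, with the hypothetical names Voter, vote, and weight standing in for whatever the real code uses:

// Sketch: combine each pruned voter V's vote into one prediction for (M, U),
// weighting each vote by V's similarity/correlation with U (assumed non-negative).
#include <vector>

struct Voter {
    double vote;    // V's vote for the rating of movie M (e.g., V's own rating of M)
    double weight;  // V's similarity to U, as produced by the chosen correlation
};

double combineVotes(const std::vector<Voter>& voters, double fallback = 3.0) {
    double weightedSum = 0.0, totalWeight = 0.0;
    for (const Voter& v : voters) {
        weightedSum += v.weight * v.vote;
        totalWeight += v.weight;
    }
    // If no voters survived pruning (or all weights are zero), fall back to a default rating.
    return (totalWeight > 0.0) ? weightedSum / totalWeight : fallback;
}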

Must we loop thru the V's (Vertical Processing of Horizontal Data, or VPHD) rather than Horizontally Processing across Vertical Data (HPVD) consisting of the MoviePtrees for mid1, ..., midk? Recall from slide 1 that V votes if it is near enough to UID in its ratings of the movies M = {mid1, ..., midk}.

The reason: the Horizontal Processing (Ptree processing) required of most correlation calculations is nearly impossible to formulate using AND/OR/COMP (can anyone do it???).


Netflix

What kind of pruning can be specified?

mpp-mpred.C

mpp-user.C

movie-vote.C

user-vote.C

prune.C

Again, all parameters are specified in a configuration file, and the values given there are consumed at runtime using, e.g., the call:

mpp -i Input_.txt_file -c config -n 16

where Input_.txt_file is the input Probe subset file and 16 is the number of parallel threads that mpp-mpred.C will generate (here, 16 movies are processed in parallel, each sent to a separate instantiation of mpp-user.C).
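The slides do not show how these parallel slots are implemented; as one hedged possibility (not the actual mpp-mpred.C code), a fork/exec loop that keeps at most nSlots instances of mpp-user alive at once could look like the following sketch, where the ./mpp-user executable name and its single argument are hypothetical:

// Sketch (assumed mechanism): dispatch one worker per movie, at most nSlots at a time.
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

void dispatch(const std::vector<std::string>& movieArgs, int nSlots) {
    int running = 0;
    for (const std::string& arg : movieArgs) {
        if (running == nSlots) { wait(nullptr); --running; }   // wait for a free slot
        pid_t pid = fork();
        if (pid == 0) {                                        // child: run one worker
            execlp("./mpp-user", "mpp-user", arg.c_str(), (char*)nullptr);
            _exit(1);                                          // reached only if exec failed
        }
        if (pid > 0) ++running;                                // parent: one more slot in use
    }
    while (running-- > 0) wait(nullptr);                       // drain remaining workers
}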

A sample config file is given later.

There are up to 3 types of pruning used (for pruning down support(M), the set of all users that rate M, or pruning down support(U), the set of all movies rated by U):

1. correlation or similarity threshold based pruning

2. count based pruning

3. ID window based pruning

Under correlation or similarity threshold based pruning, and using support(M) = supM for example (pruning support(U) is similar), we allow any function f: supM × supM → [0, HighValue] to be called a user correlation, provided only that f(u,u) = HighValue for every u in supM. Examples include Pearson_Correlation, Gaussian_of_Distance, 1_perp_Correlation (see the appendix of these notes), relative_exact_rating_match_count (which Tingda is using), dimension_of_common_cosupport, and functions based on standard deviations.

Under count based pruning, we usually order by one of the correlations above first (into a multimap) then prune down to a specified count of the most highly correlated.

Under ID window based pruning we prune down to a window of userIDs within supM (or movieIDs within supU) by specifying a leftside (number added to U, so leftside is relative to U as a userID) and a width.
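As a sketch of the count-based pruning described above (order candidate voters by a chosen correlation into a multimap, then keep only the most highly correlated), the following illustrates the idea; the names corrWithU and keepCount are illustrative, not the prune.C interface.

// Sketch: count-based pruning. Order users by correlation with U (descending),
// then keep only the keepCount most highly correlated user IDs.
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

std::vector<int> countPrune(const std::map<int,double>& corrWithU, // uid -> correlation with U
                            std::size_t keepCount) {
    std::multimap<double,int,std::greater<double>> byCorr;         // highest correlation first
    for (const auto& [uid, corr] : corrWithU) byCorr.emplace(corr, uid);

    std::vector<int> kept;
    for (const auto& [corr, uid] : byCorr) {
        if (kept.size() == keepCount) break;                       // stop at the requested count
        kept.push_back(uid);
    }
    return kept;
}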


Netflix

How does one specify prunings?

mpp-mpred.C

specifies the type of prune (3 types): UserPrune, with a full range of possibilities; UserFastPrune, with just PearsonCorrelation pruning; and CommonCoSupportPrune, which orders users V according to the size of their CommonCoSupport with U only (note that this is a correlation of sorts too).

mpp-user.C

movie-vote.C

user-vote.C

threshold "diff of vectors" population-based std_dev prune

specify leftside (from Uid) of an ID interval prune of supM

specify the width of an ID interval prune of supM

specify starting movie (intercept and slope) for N loop

specify starting movie (intercept and slope) for V loop

threshol for count based prune

specify PearsonCorr threshold (b=bill, meaning: use bill's formula - note if prior pruning this

will have a different value than Amal's)

specify PearsonCorr threshold (a=Amal, meaning: use Amal's table lookup)

threshold "vectorof diffs" population-based std_dev prune

threshold "vector of diffs"sample-based std_dev prune

threshold (Gaussian of) Euclidean distance based prune

threshold for (Gaussian of) 1perpendicular distance prune

exponent for (Gaussian of) 1perpendicular distance prune

threshold (Gaussian of) a variation based prune

threshold std_dev based prune

Picks odering for count-based prune below: 1=Amal_Pearson, 2=Bill_Pearson, etc.

threshold "diff of vectors"sample-based std_dev prune

prune.C

Again, in a file (this one is named config) there is a section for specifying the parameters for user-voting and a separate section for specifying parameters for movie-voting. E.g., for movie voting, at the bottom, there are 3 external prunings possible (0 or more can be chosen):

1. an initial pruning of dimensions to be used (since dimensions are users, it prunes supM)

2. a pruning of movie voters, N (in supU)

3. a final pruning of dimensions (CoSupport(M,N)) for the specific movie voter, N. E.g., parameters are specified for this final prune as follows:

[movie_voting Prune_Users_in_CoSupMN]
method = UserCommonCoSupportPrune
leftside = 0
width = 8000
mstrt = 0
mstrt_mult = 0.0
ustrt = 0
ustrt_mult = 0.0
TSa = -100
TSb = -100
Tdvp = -1
Tdvs = -1
Tvdp = -1
Tvds = -1
TD = -1
TP = -1
PPm = .1
TV = -1
TSD = -1
Ch = 1
Ct = 2

Note: all thresholds are for similarities, not distances; i.e., when we start with a distance we follow it with the Gaussian to make it a similarity or correlation.
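For example, a Euclidean (or other) distance d can be followed by a Gaussian to obtain a similarity in (0, 1], with identical points mapping to the maximum value 1; a minimal sketch, where the name gaussianSimilarity and the sigma width parameter are assumptions:

// Sketch: turn a distance into a similarity; d = 0 gives the highest value, 1.
#include <cmath>

double gaussianSimilarity(double distance, double sigma = 1.0) {
    return std::exp(-(distance * distance) / (2.0 * sigma * sigma));
}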


Netflix

APPENDIX: The Program (old)

mpp-mpred.C

mpp-user.C

movie-vote.C

user-vote.C

prune.C

mpp-mpred.C reads a Netflix PROBE file and loops thru the (Mi, ProbeSupport(Mi))'s, passing each to mpp-user.C, which calculates and prints PredictedRating(Mi, U) to the file "prediction", for all U ∈ ProbeSupport(Mi).

mpp-mpred.C can also call separate instances of mpp-user.C for many Us, to be processed in parallel (governed by the number of "slots" specified in 1st code line.)

(Mi, ProbeSupport(Mi))

From the votes, calculates and writes Predict(Mi, U) to predictions, for all U ∈ ProbeSupport(Mi).

mpp-user.C loops thru ProbeSupport(M), reads in the matched-up config file, and prints prediction(M,U) to the file predictions.

For the user-vote-approach, mpp-user.C calls user-vote.C, passing (M, Support(M), U, Support(U)).

For the movie-vote-approach, mpp-user.C calls movie-vote.C, passing (M, Support(M), U, Support(U)).

Diagram labels: (M, Support(M), U, Support(U)) is passed to each voting module; vote(M,U) and VOTE(M,U) are returned.

user-vote.C does the specified pruning (from the config, etc.) by calling prune.C, then loops thru the pruned set of user voters. For each V, it calculates a vote, then combines those votes and returns predictedRating(M,U).

Must we loop thru the V's (Vertical Processing of Horizontal Data, or VPHD) rather than calculate the vote by Horizontally Processing across the Vertical Data (HPVD) consisting of the MoviePtrees for mid1, ..., midk? Recall from slide 1 that V = uid votes if it is near enough to UID in its ratings of the movies M = {mid1, ..., midk}.

The reason: the Horizontal Processing (Ptree processing) required of most correlation calculations is nearly impossible to formulate using AND/OR/COMP (so far, anyway).

movie-vote.C is similar.


Netflix

APPENDIX: Collaborative Filtering is the prediction of likes and dislikes (retail or rental) from the history of previously expressed ratings (filtering new likes thru the historical filter of "collaborator" likes).

E.g., the $1,000,000 Netflix Contest was to develop a ratings prediction program that can beat the one Netflix currently uses (called Cinematch) by 10% in predicting what ratings users gave to movies. I.e., predict rating(M,U) where (M,U) ∈ QUALIFYING(MovieID, UserID).

Netflix uses Cinematch to decide which movies a user will probably like next (based on all past rating history). All ratings are "5-star" ratings (5 is highest. 1 is lowest. Caution: 0 means “did not rate”).

Unfortunately rating=0 does not mean that the user "disliked" that movie, but that it wasn't rated at all. Most “ratings” are 0. That’s the main reason we don’t want to use standard vector space distance for “near”.
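As an illustration with made-up numbers (not from the data set): suppose two users' rating vectors over four movies are (5, 0, 0, 4) and (5, 3, 4, 4). On the two movies both actually rated they agree exactly, yet the Euclidean distance over all four coordinates is sqrt(0 + 9 + 16 + 0) = 5, entirely because of the 0 = "did not rate" entries. This is why "near" is based on correlations over co-rated movies instead.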

A "history of ratings given by users to movies“, TRAINING(MovieID, UserID, Rating, Date) is provided, with which to train your predictor, which will predict the ratings given to QUALIFYING movie-user pairs (Netflix knows the rating given to Qualifying pairs, but we don't.)

Since TRAINING is very large, Netflix also provides a "smaller, but representative" subset of TRAINING, PROBE(MovieID, UserID) (about 2 orders of magnitude smaller than TRAINING).

Netflix gave 5 years to submit QUALIFYING predictions; the contest was won in the late summer of 2009, when the submission window was about half gone.

The Netflix Contest Problem is an example of the Collaborative Filtering Problem, which is ubiquitous in the retail business world (how do you filter out what a customer will want to buy or rent next, based on similar customers?).

