
Collaborative filtering with temporal dynamics


Presentation Transcript


  1. Collaborative filtering with temporal dynamics Max Naylor

  2. Background • Netflix Prize: In 2009, Netflix held an open competition for the best collaborative filtering algorithm to predict user ratings for films (grand prize: $1M), releasing 100+ million anonymized user-movie ratings, now called the Netflix Prize dataset • Collaborative filtering is the most popular approach to implementing recommender systems, leveraging past user behaviour to give customized recommendations • Concept drift is the phenomenon of user behaviour changing over time, either on a global scale or a local scale

  3. The problem: how do you model users’ preferences as they change over time? • Goal for modeling concept drift • Drop temporary effects with very low impact on future behaviour • Capture longer-term trends that reflect the inherent nature of the data • Modeling localized concept drift • Global drifts can affect the whole population • Seasons / holidays / etc. • What about the unique local drifts that affect specific users differently? • Change in a user’s music taste / family structure / etc.

  4. Three usual approaches to the concept drift problem 1. Instance selection (time-window approach) • Discard instances that are deemed less relevant to the current state • Reasonable with abrupt time shifts • Not great with gradual shifts

  5. Three usual approaches to the concept drift problem 2. Instance weighting • Use a time decay function to underweight instances as they occur deeper in the past
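Instance weighting can be sketched with a simple exponential decay (a minimal illustration; the `half_life_days` value is an arbitrary assumption, not from the slides):

```python
import math

def decay_weight(t_now, t_rating, half_life_days=90.0):
    """Exponential time-decay weight for a past rating.
    A rating loses half its weight every `half_life_days` days
    (the half-life here is an illustrative assumption)."""
    age = t_now - t_rating  # days since the rating was given
    return 0.5 ** (age / half_life_days)
```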

  6. Discarding or under-weighting the past trashes valuable signal The best time decay function turns out to be no decay at all • Previous preferences tend to linger • Or, previous preferences help establish cross-user/cross-product patterns that are indirectly useful in modeling other users

  7. Three usual approaches to the concept drift problem 3. Ensemble learning • Maintain a family of predictors that jointly produce the final outcome • Predictors that were more successful on recent instances get higher weights

  8. Ensemble learning... ... is not a good fit for collaborative filtering + temporal dynamics • Misses global patterns by using multiple models that each only consider a fraction of total behaviour • To keep track of independent, localized drifting behaviours, ensemble learning requires a separate ensemble for each user • Which complicates integrating information across users • Which is the cornerstone of how collaborative filtering works

  9. Model Goals • Explain user behaviour along the full time period • Capture multiple separate drifting concepts • User-dependent or item-dependent • Sudden or gradual • Combine all drifting concepts within a single framework • Model interactions that cross users and items, identifying higher-level patterns • Do not try to extrapolate future temporal dynamics • Could be helpful, but too difficult • BUT capturing past temporal dynamics helps predict future behaviour

  10. Two Collaborative Filtering Methods How to compare fundamentally different objects -- users and items? 1. Neighbourhood approach • Item-item • Transform users into item space by viewing them as baskets of rated items • Leverages the similarity between items to estimate a user’s preference for a new item • User-user • Transform items into user space by viewing items as baskets of user ratings • Leverages the similarity between users to estimate a user’s preference for a new item

  11. Two Collaborative Filtering Methods How to compare fundamentally different objects -- users and items? 2. Latent factor models • Transform users and items to the same latent factor space so they are directly comparable • Use singular value decomposition to automatically infer factors from user ratings that characterize movies and users • pu represents the user’s affinity for a factor • qi represents the movie’s relation to a factor

  12. Static latent factor model Vanilla model: captures interactions between users and items • Set f = number of factors/dimensions in the space • Find a vector pu ∈ ℝf for each user u and a vector qi ∈ ℝf for each item i • A rating is predicted as r̂ui = qiTpu ∈ ℝ • Learn pu and qi by minimizing the (L2-regularized) squared error: min ∑(u,i,t)∈K (rui(t) − qiTpu)² + λ(‖pu‖² + ‖qi‖²), where K = {(u,i,t) | rui(t) is known}
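The SGD training loop implied by this objective might look like the following sketch (learning rate, regularization strength, epoch count, and initialization scale are illustrative assumptions):

```python
import numpy as np

def train_static_lfm(ratings, n_users, n_items, f=10, lr=0.005, reg=0.02,
                     epochs=20, seed=0):
    """SGD for the vanilla latent factor model r̂_ui = q_i^T p_u.
    `ratings` is a list of (user, item, rating) triples; all
    hyperparameter values here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, f))  # user factors p_u
    Q = 0.1 * rng.standard_normal((n_items, f))  # item factors q_i
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - Q[i] @ P[u]                   # prediction error
            pu_old = P[u].copy()                  # keep old p_u for q_i's update
            P[u] += lr * (e * Q[i] - reg * P[u])  # gradient step + L2 shrinkage
            Q[i] += lr * (e * pu_old - reg * Q[i])
    return P, Q
```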

  13. Adding baseline predictors to LFM r̂ui? Baseline predictors: absorb user or item biases • Let µ = overall average rating • Let bu and bi be the observed biases of user u and item i • Then the baseline predictor for an unknown rating rui is bui = µ + bu + bi • Adding the user-item interaction term, ratings are predicted as r̂ui = µ + bu + bi + qiTpu Example: Jordan rates Black Panther Average rating over all movies: µ = 3.7 Black Panther tends to be rated higher than average: bi = 0.5 Jordan tends to be more critical than average: bu = -0.3 Baseline estimate: bui = µ + bu + bi = 3.7 + 0.5 - 0.3 = 3.9
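The baseline arithmetic on this slide is trivially checkable in code:

```python
def baseline(mu, b_u, b_i):
    """Baseline estimate b_ui = µ + b_u + b_i (no user-item interaction)."""
    return mu + b_u + b_i

# The slide's example: µ = 3.7, b_u = -0.3 (Jordan), b_i = +0.5 (Black Panther)
estimate = baseline(3.7, -0.3, 0.5)  # ≈ 3.9, up to float rounding
```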

  14. Modeling time dynamics of baseline predictors Time-dependent baseline predictors: absorbing user and item biases as they change over time • Let bu = bu(t) and bi = bi(t) be functions of time • Now the baseline predictor for an unknown rating by user u of item i at time t is bui(t) = µ + bu(t) + bi(t) • Movie likeability doesn’t usually fluctuate significantly over time • User biases can change daily, so the model requires finer time resolution

  15. Modeling time dynamics of item biases Modeling item bias bi(t) over time using time-based bins • How big to make the bins? • Want finer resolution → smaller bins • Need enough ratings per bin → larger bins • Authors’ choice: 30 bins total, 10 consecutive weeks per bin • For any day t, Bin(t) ∈ [1,30] represents which bin t belongs to • So the bias of an item i on day t is bi(t) = bi + bi,Bin(t) • Baseline predictor using time-based bins to absorb item bias over time: bui(t) = µ + bu(t) + bi(t) = µ + bu(t) + bi + bi,Bin(t)
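A possible implementation of the bin mapping (the `first_day` reference point is an assumption; the slides only fix 30 bins of 10 weeks each):

```python
def bin_index(day, first_day, days_per_bin=70, n_bins=30):
    """Map a day to its bin, Bin(t) ∈ {1, …, 30}.
    10 consecutive weeks = 70 days per bin (authors' choice);
    `first_day` is the first day of the dataset (an assumption)."""
    idx = (day - first_day) // days_per_bin + 1
    return min(max(idx, 1), n_bins)  # clamp out-of-range days to the edge bins
```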

  16. Modeling time dynamics of user biases Capturing user biases bu(t) over time with a linear model • Let tu = overall average date of ratings by user u • Then |t − tu| measures the number of days between day t and tu • Define the time deviation of a rating by user u on day t to be devu(t) = sign(t − tu) · |t − tu|^β, where β = 0.4 by cross-validation • Find αu = regression coefficient of devu(t), so the bias of a user u at time t is bu(t) = bu + αu · devu(t) • Now the baseline predictor with linear user bias looks like this: bui(t) = µ + bu(t) + bi + bi,Bin(t) = µ + bu + αu · devu(t) + bi + bi,Bin(t)
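The deviation function and the linear user bias translate directly to code:

```python
import math

def dev_u(t, t_u, beta=0.4):
    """Time deviation dev_u(t) = sign(t − t_u) · |t − t_u|^β
    (β = 0.4 per the slides)."""
    d = t - t_u
    return math.copysign(abs(d) ** beta, d)

def user_bias(b_u, alpha_u, t, t_u):
    """Linear time-dependent user bias b_u(t) = b_u + α_u · dev_u(t)."""
    return b_u + alpha_u * dev_u(t, t_u)
```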

  17. Modeling time dynamics of user biases Capturing user biases bu(t) over time with a more flexible splines model instead of a simple linear model • User u gives nu ratings • Choose ku time points t1u, …, tkuu, spaced uniformly across the total rating time of user u • Learn a coefficient btlu for each control point from the data • The number of control points balances flexibility and computational efficiency • Authors’ choice: ku = nu^0.25, which grows with the number of available ratings (some users rate more movies) • The resulting user bias is bu(t) = bu + ∑l e^(−γ|t − tlu|) btlu / ∑l e^(−γ|t − tlu|) • Parameter γ determines the smoothness of the spline • Authors’ choice: γ = 0.3 by cross-validation
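The spline bias above, sketched as a kernel-weighted average of the control-point coefficients:

```python
import math

def spline_user_bias(b_u, t, control_times, control_biases, gamma=0.3):
    """Spline user bias
    b_u(t) = b_u + Σ_l e^(−γ|t − t_l|) b_{t_l} / Σ_l e^(−γ|t − t_l|).
    `control_times` are the user's control points, `control_biases`
    the learned coefficients b_{t_l}; γ = 0.3 per the slides."""
    weights = [math.exp(-gamma * abs(t - tl)) for tl in control_times]
    return b_u + sum(w * b for w, b in zip(weights, control_biases)) / sum(weights)
```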

  18. But what about sudden drifts? • The previous smooth functions model gradual concept drift • However, sudden concept drifts emerge as “spikes” associated with a single day or session • To address short-lived effects, assign a single parameter bu,t per user u and day t to absorb day-specific variability • Linear model: bu(t) = bu + αu · devu(t) + bu,t • Splines model: bu(t) = bu + ∑l e^(−γ|t − tlu|) btlu / ∑l e^(−γ|t − tlu|) + bu,t

  19. Find parameters and build baseline models Select a baseline predictor model (e.g., linear): bui(t) = µ + bu + αu · devu(t) + bu,t + bi + bi,Bin(t) Learn bu, αu, bu,t, bi and bi,Bin(t) by minimizing the regularized squared error over all known ratings: min ∑(u,i,t)∈K (rui(t) − bui(t))² + λ(bu² + αu² + bu,t² + bi² + bi,Bin(t)²)
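One stochastic gradient step on the linear baseline's parameters might look like this (the learning rate and regularization values are hypothetical):

```python
def sgd_step_baseline(r, mu, b_u, alpha_u, dev, b_ut, b_i, b_ibin,
                      lr=0.005, reg=0.02):
    """One SGD step for the linear baseline
    b_ui(t) = µ + b_u + α_u·dev_u(t) + b_u,t + b_i + b_i,Bin(t).
    Each parameter moves along the error gradient with L2 shrinkage;
    lr and reg are illustrative assumptions."""
    e = r - (mu + b_u + alpha_u * dev + b_ut + b_i + b_ibin)
    b_u     += lr * (e - reg * b_u)
    alpha_u += lr * (e * dev - reg * alpha_u)
    b_ut    += lr * (e - reg * b_ut)
    b_i     += lr * (e - reg * b_i)
    b_ibin  += lr * (e - reg * b_ibin)
    return b_u, alpha_u, b_ut, b_i, b_ibin
```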

  20. Baseline model variants • static: bui(t) = µ + bu + bi • mov: bui(t) = µ + bu + bi + bi,Bin(t) • linear: bui(t) = µ + bu + αu · devu(t) + bi + bi,Bin(t) • spline: bui(t) = µ + spline user bias + bi + bi,Bin(t) • linear+: bui(t) = linear + bu,t • spline+: bui(t) = spline + bu,t

  21. Results (baseline only!) • Add a time-dependent scaling feature cu(t) = cu + cu,t per user to the item bias • cu = average rating of user u over time • cu,t = day-specific variability from the average rating • The baseline predictor is now bui(t) = µ + bu + αu · devu(t) + bu,t + (bi + bi,Bin(t)) · cu(t) • RMSE = 0.9555 for the baseline model; even before capturing any user-item interactions, it can explain almost as much variability as the commercial Netflix Cinematch recommender system (RMSE = 0.9514 on the same test set)

  22. Adding back user-item interaction... • Temporal dynamics affect user preferences ⇒ temporal dynamics affect user-item interactions • e.g., a “psychological thrillers” fan → a “crime dramas” fan • Similarly, define the latent factor vector pu as a function of time, pu(t). So for a user u and a factor k, the element puk(t) of pu(t) becomes puk(t) = puk + αuk · devu(t) + puk,t

  23. Putting it all together (baseline + user-item interaction) SVD: r̂ui = qiTpu SVD++: r̂ui = qiT( pu + |R(u)|^(-1/2) ∑j∈R(u) yj ) timeSVD++: r̂ui(t) = µ + bu(t) + bi(t) + qiT( pu(t) + |R(u)|^(-1/2) ∑j∈R(u) yj ) (f = number of factors)
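The timeSVD++ prediction rule, assuming the time-dependent biases and factors have already been evaluated at t (a sketch; it also assumes the user has rated at least one item, so |R(u)| > 0):

```python
import numpy as np

def predict_timesvdpp(mu, b_u_t, b_i_t, q_i, p_u_t, y_R):
    """timeSVD++ prediction:
    r̂_ui(t) = µ + b_u(t) + b_i(t) + q_i^T ( p_u(t) + |R(u)|^(-1/2) Σ_{j∈R(u)} y_j ).
    `y_R` is the |R(u)| × f matrix of implicit-feedback factors y_j
    for the items the user has rated (assumed non-empty)."""
    implicit = y_R.sum(axis=0) / np.sqrt(len(y_R))
    return mu + b_u_t + b_i_t + q_i @ (p_u_t + implicit)
```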

  24. Example: What can we learn from our results? Question: Why do ratings rise as movies become older? Two hypotheses: • People watch new movies indiscriminately, but only watch an older movie after a more careful selection process. An improved user-to-movie match would be captured by the interaction part of the model rising with movies’ age. • Older movies are just inherently better than newer ones. This would be captured by the baseline part of the model.

  25. Example: Neighbourhood approach Question: Why do ratings rise as movies become older? Answer:

  26. Takeaways • Addressing temporal dynamics in the data can have a more significant impact on accuracy than designing more complex learning algorithms • (Yielded the best results published so far on a widely-analyzed high-quality movie rating dataset) • Modeling time as a dimension of the data can help uncover interesting inherent patterns in the data • Even past behaviour that is entirely different from current behaviour is still useful for predicting future behaviour

  27. Questions?

  28. More effects: Periodic • Some items more popular in specific seasons or near certain holidays • e.g., period(t) = {fall, winter, spring, summer} • Users may have different attitudes or buying patterns during weekend vs working week • e.g., period(t) = {Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday}
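A periodic bias needs only a mapping from time to period; a coarse sketch with illustrative season boundaries (the cut-off days are assumptions, not from the slides):

```python
def period_season(day_of_year):
    """Map a day of year (1-365) to a coarse season label, so a
    periodic item bias b_i,period(t) can be indexed by season.
    The boundary days are illustrative assumptions."""
    if day_of_year < 80 or day_of_year >= 355:
        return "winter"
    if day_of_year < 172:
        return "spring"
    if day_of_year < 266:
        return "summer"
    return "fall"

# A periodic bias then adds one parameter per (item, period):
# b_i(t) = b_i + b_i,period(t)
```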

  29. Neighbourhood approach • Less accurate than factor models • But popular for explaining the reasoning behind computed recommendations and for seamlessly accounting for newly entered ratings No temporal dynamics: With temporal dynamics:
