CoBaFi : Collaborative Bayesian Filtering



Presentation Transcript


  1. CoBaFi: Collaborative Bayesian Filtering Alex Beutel Joint work with Kenton Murray, Christos Faloutsos, Alex Smola April 9, 2014 – Seoul, South Korea

  2. Online Recommendation: a sparse Users × Movies matrix of observed star ratings

  3. Online Rating Models

  4. Online Rating Models: Reality vs. normal collaborative filtering, which fits a Gaussian to minimize the error. Minimizing error isn't good enough: understanding the shape matters!

  5. Online Rating Models: normal collaborative filtering fits a Gaussian (minimize the error) vs. our model

  6. Our Goals and Challenges • Given: A matrix of user ratings • Find: A model that best fits and predicts user preferences • Goals: • G1. Fit the recommender distribution • G2. Understand users who rate few items • G3. Detect abnormal spam behavior

  7. 1. Background Outline 2. Model Formulation 3. Inference 4. Catching Spam 5. Experiments

  8. Collaborative Filtering [Background]: approximate the Users × Movies rating matrix X as X ≈ U Vᵀ, where the latent dimensions correspond to genres
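The factorization on this slide can be sketched in a few lines. This is an illustrative toy, not the paper's code: the latent dimension plays the role of "genres", and a predicted rating is the inner product of a user's and a movie's latent vectors.

```python
import numpy as np

# Toy sketch of X ≈ U Vᵀ: users and movies share a small latent
# "genre" space, and a rating is predicted by an inner product.
rng = np.random.default_rng(0)

n_users, n_movies, n_genres = 4, 3, 2
U = rng.normal(size=(n_users, n_genres))   # user -> genre preferences
V = rng.normal(size=(n_movies, n_genres))  # movie -> genre loadings

X_hat = U @ V.T                            # predicted Users × Movies ratings
print(X_hat.shape)                         # (4, 3)
```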

  9. Bayesian Probabilistic Matrix Factorization (Salakhutdinov & Mnih, ICML 2008) [Background] μU ~ …

  10. 1. Background Outline 2. Our Model 3. Inference 4. Catching Spam 5. Experiments

  11. Our Model Cluster users (& items) Share preferences within clusters Use user preferences to predict ratings

  12. The Recommender Distribution: first introduced by Tan et al., 2013. Linear normalization vs. quadratic normalization. With θ1 = 0, vary θ2: θ2 = 0.4 vs. θ2 = -1.0

  13. The Recommender Distribution: ui encodes genre preferences, the user's general leaning, and how polarized their ratings are. • Goal 1: Fit the recommender distribution
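A minimal sketch of the flexible rating distribution these two slides describe, assuming the exponential-family form p(r) ∝ exp(θ1·r + θ2·r²) over the discrete ratings 1–5; centering ratings at 3 is my simplification to make the two shapes from the slide easy to see (the exact parameterization follows the paper and Tan et al., 2013).

```python
import numpy as np

# Hedged sketch: discrete "recommender distribution" over ratings 1..5,
# p(r) ∝ exp(theta1*r + theta2*r^2). Positive theta2 polarizes mass
# toward 1 and 5; negative theta2 concentrates it near the middle.
def recommender_dist(theta1, theta2, ratings=np.arange(1, 6)):
    r = ratings - 3.0                      # assumed centering at rating 3
    logits = theta1 * r + theta2 * r ** 2
    p = np.exp(logits - logits.max())      # subtract max for stability
    return p / p.sum()

polarized = recommender_dist(theta1=0.0, theta2=0.4)    # mass at 1 and 5
gaussian = recommender_dist(theta1=0.0, theta2=-1.0)    # mass near 3
print(np.round(polarized, 3))
print(np.round(gaussian, 3))
```

With θ1 = 0 the distribution is symmetric; a nonzero θ1 would tilt it toward high or low ratings (the "general leaning" on slide 13).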

  14. Understanding varying preferences (example matrix of user ratings)

  15. Resulting Co-clustering V U

  16. Finding User Preferences μU μU’ • Goal 2: Understand users who rate few items

  17. Chinese Restaurant Process μ1 μ3 μ2
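The Chinese Restaurant Process on this slide can be sketched directly: customer n sits at an existing table with probability proportional to its size, or opens a new table with probability proportional to a concentration parameter α (the α value below is an arbitrary choice for illustration).

```python
import random

# Minimal CRP sketch: rich-get-richer clustering with an unbounded
# number of clusters, as used to group users (and items) in the model.
def crp_assignments(n_customers, alpha, seed=42):
    rng = random.Random(seed)
    tables = []                    # tables[k] = number of customers at table k
    assignments = []
    for _ in range(n_customers):
        weights = tables + [alpha]             # existing sizes, then "new table"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)                   # open a new cluster
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments, tables

assignments, tables = crp_assignments(100, alpha=1.0)
print(len(tables), sum(tables))    # number of clusters, total customers (100)
```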

  18. 1. Background Outline 2. Our Model 3. Inference 4. Catching Spam 5. Experiments

  19. Gibbs Sampling - Clusters [Details] Probability of picking a cluster = Probability of a cluster based on size (CRP) × Probability ui would come from the cluster
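The product on this slide (CRP size prior × likelihood of ui under each cluster) can be sketched as follows. This is a deliberately simplified 1-D version with unit-variance Gaussian clusters, which is my assumption; the paper's clusters have full mean/precision parameters.

```python
import numpy as np

# Hedged sketch of the Gibbs step for one user's cluster assignment:
#   p(cluster a) ∝ (size of a, or alpha for a new cluster)  [CRP prior]
#               × p(u_i | cluster a's parameters)           [likelihood]
def cluster_posterior(u_i, cluster_sizes, cluster_means, alpha, prior_mean=0.0):
    means = np.append(cluster_means, prior_mean)   # last slot = new cluster
    prior = np.append(cluster_sizes, alpha).astype(float)
    loglik = -0.5 * (u_i - means) ** 2             # unit-variance Gaussian (assumed)
    w = prior * np.exp(loglik - loglik.max())      # stabilized product
    return w / w.sum()

p = cluster_posterior(u_i=2.1,
                      cluster_sizes=np.array([10, 3]),
                      cluster_means=np.array([2.0, -1.0]),
                      alpha=1.0)
print(np.round(p, 3))   # the large cluster near u_i dominates
```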

  20. Sampling user parameters [Details] Probability of user preferences ui = Probability of preferences ui given cluster parameters × Probability of predicting ratings ri,j using new preferences. The recommender distribution is non-conjugate, so we can't sample directly!

  21. 1. Background Outline 2. Our Model 3. Inference 4. Catching Spam 5. Experiments

  22. Review Spam and Fraud 1 5 5 5 1 1 5 1 1 5 1 1 5 1 1 5 Image from http://sinovera.deviantart.com/art/Cute-Devil-117932337

  23. Clustering Fraudsters μ3 μ1 μ2 New Spam Cluster Previous “Real” Cluster

  24. Clustering Fraudsters μ3 μ1 μ2 Too much spam gets an account separated into a "fraud" cluster. Trying to "hide" just means (a) very little spam or (b) camouflage reinforcing realistic reviews.

  25. Clustering Fraudsters μ4 μ1 μ3 μ2 μ5 Naïve Spammers Spam + Noise Hijacked Accounts • Goal 3: Detect abnormal spam behavior

  26. 1. Background Outline 2. Our Model 3. Inference 4. Catching Spam 5. Experiments

  27. Does it work? Better Fit

  28. Catching Naïve Spammers Injection 83% are clustered together

  29. Clustered Hijacked Accounts Clustered hijacked accounts Clustered “attacked” movies Injection

  30. Real world clusters

  31. Shape of real world data

  32. Shape of Netflix reviews More Skewed More Gaussian

  33. Shape of Amazon Clothing reviews Nearly all are heavily polarized!

  34. Shape of Amazon Electronics reviews Nearly all are heavily polarized!

  35. Shape of BeerAdvocate reviews Nearly all are Gaussian!

  36. Hypotheses on the shape of the data: • Hard to evaluate beyond binary • Selection bias – only committed viewers watch Season 4 of a TV series • Hard to compare value across very different items • Lots of beers and movies to compare • Fewer TV shows • Even fewer jeans or hard drives

  37. Key Points • Modeling: Fit real data with flexible recommender distribution • Prediction: Predict user preferences • Anomaly Detection: When does a user not match the normal model?

  38. Questions? Alex Beutel abeutel@cs.cmu.edu http://alexbeutel.com

  39. Sampling Cluster Parameters: hyperparameters μα, λα, Wα, ν; priors on μα, λα, Wα

  40. Gibbs Sampling - Clusters [Details] Probability ui would be sampled from cluster a × Probability of a cluster (CRP)

  41. Sampling user parameters [Details] Probability of ui given cluster parameters × Probability of predicting ratings ri,j. The recommender distribution is non-conjugate, so we can't sample directly! Use a Laplace approximation and perform Metropolis-Hastings sampling.

  42. Sampling user parameters [Details] Use a candidate normal distribution centered at the mode of p(ui) with the "variance" of p(ui). Metropolis-Hastings sampling: sample a candidate, keep the new value with the acceptance probability.
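The Laplace-plus-Metropolis-Hastings recipe on these slides can be sketched in 1-D. This is an illustrative stand-in, not the paper's posterior: `log_p` below is an arbitrary skewed non-Gaussian target playing the role of the non-conjugate user-parameter posterior; the proposal is a Gaussian centered at the mode with variance from the curvature there, and draws are accepted with the independence-sampler ratio.

```python
import numpy as np

# Hedged sketch: Laplace approximation as an MH proposal for a
# non-conjugate 1-D target (a stand-in for p(u_i)).
def log_p(u):                      # unnormalized log target (assumed shape)
    return -0.25 * u ** 4 + u      # peaked at u = 1, non-Gaussian

def log_q(u, mode, var):           # Laplace proposal log-density (up to const.)
    return -0.5 * (u - mode) ** 2 / var

# Mode via grid search; curvature via a finite-difference second derivative.
grid = np.linspace(-3, 3, 6001)
mode = grid[np.argmax(log_p(grid))]
h = 1e-4
curv = (log_p(mode + h) - 2 * log_p(mode) + log_p(mode - h)) / h ** 2
var = -1.0 / curv                  # Laplace "variance" of p(u)

rng = np.random.default_rng(0)
u = mode                           # current state
for _ in range(1000):
    u_new = rng.normal(mode, np.sqrt(var))          # sample the candidate
    log_ratio = (log_p(u_new) - log_p(u)) \
              + (log_q(u, mode, var) - log_q(u_new, mode, var))
    if np.log(rng.random()) < log_ratio:            # keep with accept prob.
        u = u_new
print(round(float(mode), 2))       # mode of the stand-in target: 1.0
```

Because the Laplace proposal closely matches a smooth unimodal target, most candidates are accepted, which is consistent with the high acceptance rates reported on slide 45.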

  43. Sampling Cluster Parameters [Details] Users/Items in the cluster Priors

  44. Inferring Hyperparameters [Details] Solved directly – no sampling needed! Prior hidden as additional cluster

  45. Does Metropolis-Hastings work? • We have to use a non-standard sampling procedure: • 99.12% acceptance rate for Amazon Electronics • 77.77% acceptance rate for Netflix 24k

  46. Does it work? Compare on Predictive Probability (PP) to see how well our model fits the data

  47. Handling Spammers Random naïve spammers in Amazon Electronics dataset Random hijacked accounts in Netflix 24k dataset

  48. Clustered Naïve Spammers 83% are clustered together

  49. Clustered Hijacked Accounts Clustered hijacked accounts Clustered “attacked” movies
