Introductory Demo

Presentation Transcript

1. Introductory Demo • http://www.caffeinatedrook.com/MovieRec/MovieRecServlet

2. Problem Statement • Data: User-Movie Ratings • Input: User number and Movie number • Output: Predicted Rating • Goal: Predict ratings with the smallest RMSD (root-mean-square deviation) possible. (Make the customer happy.) • Feature values shown on the slide: -0.07283353 0.11694841 -0.0622078 0.081832446 0.049034953 -0.008441236 0.004925302 0.001412398 0.05334269 0.001291469 0.06918105 -0.08339876 0.12175138 0.17088805 -0.1085485 -0.07531176 -0.083747916 0.12860337, shown with the resulting predicted rating 4.312 / 5
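The RMSD goal above can be made concrete with a minimal sketch. The function name `rmsd` and the sample ratings are illustrative assumptions, not taken from the talk.

```python
import math

def rmsd(predicted, actual):
    """Root-mean-square deviation between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))

# Hypothetical predictions versus the ratings users actually gave.
score = rmsd([4.0, 3.0, 5.0], [4, 2, 5])
print(score)
```

A lower value means the predictions sit closer to the true ratings; a perfect predictor scores 0.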

3. Motivation – Netflix • “The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. The Netflix Prize improves our ability to connect people to the movies they love.” www.netflix.com

4. Impact to Field • Better Recommendations = Happy Customers • Happy Customers = More Money = Larger Market Share

5. Motivation - Personal • One million of them • Feature Extraction • My uber-competitive nature (aka Justin’s Wife)

6. A problem with the problem statement • Tackling the Netflix Challenge requires many hundreds (thousands…more?) of hours of computation. • Ultimately, it will require the solution to many sub-problems. • Sparsity • Noise • Memory Requirements • Movie Similarity • User Similarity (more on these a little later)

7. The Problem Statement Redefined • Data: User-Movie Ratings • Goal: Discover the relationships between: • Movies and other movies • Users and other users • Movies and users

8. Related Work • Netflix Prize forum: http://www.netflixprize.com//community/ (lots of info on strategies people are trying) • www.netflix.com • www.blockbuster.com • www.amazon.com • www.spout.com • Singular value decomposition and least squares solutions, Numerische Mathematik, Springer Berlin / Heidelberg • Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing) • Predicting User Preference for Movies using NetFlix database, Dhiraj Goel and Dhruv Batra, Carnegie Mellon University • The Netflix Prize, James Bennett and Stan Lanning • Use of KNN for the Netflix Prize, Ted Hong and Dimitris Tsamis, Stanford University • How To Break Anonymity of the Netflix Prize Dataset, Arvind Narayanan and Vitaly Shmatikov

9. Domain Understanding • The success or failure of a retailer relies on matching the customer to the product. In the case of online retailers, like Netflix, recommender systems can be built to utilize the vast amounts of data generated online. Netflix keeps a record for each user, containing the rating (1-5) for each film the user has rated.

10. Data Selection • First, what the data does not contain: • Movie titles, directors, actors, studio, year • Customer age, sex, income, favorite color • Some contestants have written web crawlers to mine this information from the web.

11. Data Selection • 17,770 movies • 480,189 users • 100,480,507 ratings • 17,770 × 480,189 = 8,532,958,530 possible ratings • 100,480,507 / 8,532,958,530 ≈ 0.01178 (98.8% sparse!) • Sample of the raw data (movie number followed by a colon, then one user number, rating, and date per line):
6:
2031561,1,2004-07-26
1176140,1,2004-02-16
2336133,2,2004-09-05
1521836,1,2004-08-11
117277,3,2004-10-12
326587,3,2004-09-06
1961542,3,2004-04-20
1041552,3,2004-10-19
1678346,3,2005-04-11
643182,2,2004-07-18
2182301,5,2004-08-04
2502669,2,2004-02-10
2211030,4,2004-05-26
603277,3,2004-12-13
214166,2,2005-10-09
……..
……..
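As a rough illustration of the file layout and the sparsity arithmetic on this slide, here is a small sketch. The helper name `parse_movie_file` is hypothetical, and the parser assumes each file holds a single movie header followed by `user,rating,date` rows, as in the sample above.

```python
def parse_movie_file(text):
    """Parse one per-movie file: a 'movie_id:' header, then 'user_id,rating,date' rows."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    movie_id = int(lines[0].rstrip(":"))
    ratings = []
    for row in lines[1:]:
        user_id, rating, date = row.split(",")
        ratings.append((int(user_id), int(rating), date))
    return movie_id, ratings

# First three rows of the sample shown on the slide.
sample = """6:
2031561,1,2004-07-26
1176140,1,2004-02-16
2336133,2,2004-09-05
"""
movie_id, ratings = parse_movie_file(sample)

# Sparsity of the full movie-by-user matrix, using the counts from the slide.
MOVIES, USERS, RATINGS = 17_770, 480_189, 100_480_507
cells = MOVIES * USERS
density = RATINGS / cells
print(movie_id, len(ratings))
print(cells, f"{1 - density:.1%} sparse")
```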

12. Cleaning and Preprocessing • Transformed files from movie-view to user-view. • Normalized user ratings via Z-Score normalization.
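The two preprocessing steps can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the ratings are toy data, and the population standard deviation is used because the slide does not say which variant was applied.

```python
from collections import defaultdict
from statistics import mean, pstdev

def to_user_view(movie_view):
    """Regroup {movie_id: [(user_id, rating), ...]} as {user_id: {movie_id: rating}}."""
    user_view = defaultdict(dict)
    for movie_id, rows in movie_view.items():
        for user_id, rating in rows:
            user_view[user_id][movie_id] = rating
    return dict(user_view)

def z_normalize(ratings_by_movie):
    """Z-score one user's ratings: (rating - user mean) / user standard deviation."""
    values = list(ratings_by_movie.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # the user gave every movie the same rating
        return {m: 0.0 for m in ratings_by_movie}
    return {m: (r - mu) / sigma for m, r in ratings_by_movie.items()}

# Toy movie-view data: movie id -> list of (user id, rating).
movie_view = {1: [(10, 5), (11, 3)], 2: [(10, 3)], 3: [(10, 1), (11, 3)]}
user_view = to_user_view(movie_view)
normalized = {user: z_normalize(r) for user, r in user_view.items()}
print(normalized)
```

Normalizing per user removes the difference between generous and harsh raters, so a 4 from a user who usually gives 2s counts for more than a 4 from a user who gives 5s to everything.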

13. Discovering Patterns • Which software to use? SPSS, SAS, Weka? • 8,532,958,530 ratings × 4 bytes / rating = 34,131,834,120 bytes ≈ 33,331,869 kilobytes ≈ 32,551 megabytes ≈ 31.8 gigabytes • Too big to hold the entire matrix • Too big to hold the condensed matrix • Too “stupid” to manage memory without paging.
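The memory arithmetic can be checked with a few lines. The dense figure follows the slide; the sparse figure is only an assumption about one possible layout (a 4-byte index plus a 4-byte value per known rating) and is not claimed to be the condensed format the talk refers to.

```python
MOVIES, USERS, RATINGS = 17_770, 480_189, 100_480_507
BYTES_PER_RATING = 4

def gib(n_bytes):
    """Convert a byte count to gigabytes using 1024-based units, as the slide does."""
    return n_bytes / 1024 ** 3

# Dense matrix: one 4-byte cell for every movie/user pair.
dense_bytes = MOVIES * USERS * BYTES_PER_RATING

# Hypothetical sparse layout: 4-byte index + 4-byte value per known rating.
sparse_bytes = RATINGS * (4 + BYTES_PER_RATING)

print(dense_bytes, round(gib(dense_bytes), 1))
print(sparse_bytes, round(gib(sparse_bytes), 2))
```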

14. Discovering Patterns • Which feature selection method to use? • Principal Component Analysis • Singular Value Decomposition • Multifactor Dimensionality Reduction • Latent Semantic Analysis

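15. Discovering Patterns • D = 17,770 × 480,189 = 8,532,958,530 entries (full movie-by-user matrix) • M = 17,770 × 25 = 444,250 entries (movie features) • U = 25 × 480,189 = 12,004,725 entries (user features) • M + U = 12,448,975 entries • For movie a and user b: v_ab = ∑_i (M_ai × U_ib), summed over the 25 features i • Other figures on the slide: 1000 · .001 · 1c/5h · 25c / ~5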
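The slide gives the shapes of the two factor matrices and the prediction as a sum of feature products, but not the training procedure (that was done at the board). The sketch below assumes a plain stochastic gradient descent fit, an approach widely discussed among Netflix Prize participants; whether the talk used it is not stated. Here `M` is indexed `[movie][feature]` and `U` is indexed `[feature][user]`; the toy ratings, learning rate, and epoch count are all illustrative.

```python
import math
import random

def predict(M, U, movie, user):
    """Predicted rating: sum over features of movie factor times user factor."""
    return sum(M[movie][i] * U[i][user] for i in range(len(U)))

def train(ratings, n_movies, n_users, n_features=2, lrate=0.01, epochs=2000, seed=0):
    """Fit M (movies x features) and U (features x users) by stochastic gradient descent."""
    rng = random.Random(seed)
    M = [[rng.uniform(0.1, 1.0) for _ in range(n_features)] for _ in range(n_movies)]
    U = [[rng.uniform(0.1, 1.0) for _ in range(n_users)] for _ in range(n_features)]
    for _ in range(epochs):
        for movie, user, rating in ratings:
            err = rating - predict(M, U, movie, user)
            for i in range(n_features):
                m, u = M[movie][i], U[i][user]
                M[movie][i] += lrate * err * u  # step along the error gradient
                U[i][user] += lrate * err * m
    return M, U

def rmse(M, U, ratings):
    """Root-mean-square error of the factor model on the given ratings."""
    total = sum((r - predict(M, U, mv, us)) ** 2 for mv, us, r in ratings)
    return math.sqrt(total / len(ratings))

# Toy data: (movie index, user index, rating) for 3 movies and 2 users.
toy = [(0, 0, 5), (0, 1, 1), (1, 0, 1), (1, 1, 5), (2, 0, 4)]
M0, U0 = train(toy, n_movies=3, n_users=2, epochs=0)  # untrained starting point
M, U = train(toy, n_movies=3, n_users=2)
print(rmse(M0, U0, toy), rmse(M, U, toy))
```

The point of the factorization is the parameter count: the two factor matrices together hold about 12.4 million numbers instead of the 8.5 billion cells of the full matrix.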

16. A little board work to explain the algorithm

17. Interpretation: Feature 1-movie view
• Sweet Potato Pie • Legion of the Dead • Dark Town • Comedy Only in Da Hood • Predator Island • Bad Bizness • Vampiyaz • My Big Phat Hip Hop Family • Jack O'Lantern • Desperate Souls
• Trailer Park Boys: Season 3 • Trailer Park Boys: Season 4 • The Lord of the Rings: The Fellowship of the Ring: Extended Edition • Lord of the Rings: The Return of the King: Extended Edition • Lord of the Rings: The Two Towers: Extended Edition • Lost: Season 1 • Veronica Mars: Season 1 • House • As Time Goes By: Series 9 • Gilmore Girls: Season 4

18. Interpretation: Feature 2-movie view
• Lost in Translation • National Lampoon's Mr. Wong • Without You I'm Nothing • Punch-Drunk Love • Dogville • The Royal Tenenbaums • Whiteboyz • Pornografia • Spooks & Creeps • Kaaterskill Falls
• Dragon Ball Z: World Tournament • Dragon Ball: Piccolo Jr. Saga: Part 2 • Dragon Ball: Tien Shinhan Saga • Dragon Ball Z: Fusion • Dragon Ball: Red Ribbon Army Saga • Dragon Ball Z: Garlic Jr. • Dragon Ball: Piccolo Jr. Saga: Part 1 • Armageddon • Dragon Ball: The Path to Power • Pearl Harbor

19. Interpretation: Feature 3-movie view
• Nostradamus: A Voice from the Past • Absolution • Ozzy Osbourne: Double O: Unauthorized • Monster-a-Go-Go! • Dark Harvest 2: The Maize • Jessica: A Ghost Story • Vanilla Sky • Ivan Vasilievich: Back to the Future • American Beauty • Still Bout It
• WWE: Rebellion 2002 • Battle Athletes: Vol. 3: Go • Sailor Moon: Vol. 10: The Trouble With Rini • Battle Athletes Victory: Vol. 7: The Last Dance • Battle Athletes Victory: Vol. 1: Training • Battle Athletes Victory: Vol. 8: The Human Race! • ECW: Extreme Evolution: Extreme Championship Wrestling • Battle Athletes Victory: Vol. 6: Willpower • Fushigi Yugi: The Mysterious Play: Eikoden • Lupin the 3rd: Dead or Alive

20. Interpretation: Nearest Neighbors, 18 components • Find the nearest neighbors using Euclidean (q=2) distance
• q=1: 1. American Beauty (1999) 2. Fight Club (1999) 3. Reservoir Dogs (1992) 4. Mystic River (2003)
• q=2: 1. American Beauty (1999) 2. Fight Club (1999) 3. Mystic River (2003) 4. Reservoir Dogs (1992)
• q=3: 1. American Beauty (1999) 2. Mystic River (2003) 3. Fight Club (1999) 4. Traffic (2000)
• q=4: 1. American Beauty (1999) 2. Mystic River (2003) 3. Fight Club (1999) 4. Traffic (2000)
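The q values on this slide match the order parameter of the Minkowski distance, of which Euclidean distance is the q=2 case. The sketch below assumes that reading. The titles "A" to "D" and their 3-component vectors are made up for illustration (the talk used 18 components), and the query title is returned first at distance 0, which appears to be how the slide lists American Beauty.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q (q=1 is Manhattan, q=2 is Euclidean)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def nearest(features, target, q=2, k=3):
    """Return the k titles closest to `target`; the target itself comes first."""
    ranked = sorted(features, key=lambda t: minkowski(features[t], features[target], q))
    return ranked[:k]

# Hypothetical feature vectors keyed by title.
features = {
    "A": [0.0, 0.0, 0.0],
    "B": [0.3, 0.3, 0.3],
    "C": [0.7, 0.0, 0.0],
    "D": [2.0, 2.0, 2.0],
}

for q in (1, 2):
    print(q, nearest(features, "A", q=q))
```

As on the slide, the ranking can change with q: a title that differs a little on every feature looks farther away under q=1 and closer under q=2 than a title that differs a lot on a single feature.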

21. Demo – Name a movie! • http://www.caffeinatedrook.com/MovieRec/MovieRecServlet