Collaborative Filtering 101

Collaborative Filtering 101 Adnan Masood www.AdnanMasood.com

About Me, a.k.a. Shameless Self Promotion • Sr. Software Engineer / Tech Lead for Green Dot Corp. (Financial Institution) • Design and Develop Connected Systems • Involved with SoCal Dev community, co-founded San Gabriel Valley .NET Developers Group. Published author and speaker. • MS Computer Science, MCPD (Enterprise Developer), MCT, MCSD.NET • Doctoral Student - Areas of Interest: Machine Learning, Bayesian Inference, Data Mining, Collaborative Filtering, Recommender Systems. • Contact at adnanmasood@acm.org • Read my Blog at www.AdnanMasood.com • Doing a session at IASA 2008 in San Francisco on Aspect Oriented Programming; for details visit http://www.iasaconnections.com

Agenda: What does this Presentation Cover? • Defines Collaborative Filtering and its use in Recommendation Systems. • Background and current state of the applications of Collaborative Filtering algorithms and their feature sets. • Illustrative implementation of the algorithms with examples. • Results on a large dataset with different algorithms. • Recommendations on what to use when doing collaborative filtering on large-scale datasets. • Overview of SQL Server BI and Prediction Engine

Recommender Systems Zeitgeist

What is Collaborative Filtering and What problem does it solve? • “Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. Such reactions may be that a document was particularly interesting (or particularly uninteresting). These reactions, more generally called annotations, can be accessed by others’ filters.” -Communications of the ACM – Dec. 1992 • Collaborative Filtering (CF) finds items of interest to a user based on the preferences of other similar users. It assumes that human behavior is predictable. • Recommender Systems (or recommenders) suggest items of interest based on a user’s preferences, behavior and information about the items themselves. -Recommenders Everywhere – WikiSym ’07, ACM • With the large amounts of data generated in e-commerce systems, the classical methods of recommendation are insufficient and cannot handle information overload. Modern automated recommendation systems are built using collaborative filtering to help deal with large-scale datasets. • Information overload problem - 20K movies on Netflix, 250K songs on Yahoo Music, total number of books on Amazon? • First ACM Recommender System Conference on October 19-20, 2007 -- Minneapolis, Minnesota, USA, by SIGCHI

Types of Recommendation Systems • Recommender systems use the opinions of a community of users to help individuals in that community more effectively identify content of interest from a potentially overwhelming set of choices [Resnick and Varian 1997].

Applications • Search • Social Networking • Product Recommendations • Demographic Targeted Advertisements • Fraud Detection • Pattern Detection / Clustering • Security • Firewall outlier analysis • Text Mining Outliers

Applications in Information Security (AT&T Hancock)

Major Challenges in Recommender System Design • Scalability • Real-time Analysis and Prediction • Performance • Accuracy • Robustness • Growing Area of Research in KDD, Machine Learning and AI

Issues and Future Research Directions • K-NN Optimization • Explainability (D. Billsus and M. Pazzani, “A Personal News Agent that Talks, Learns and Explains,” Proc. Third Ann. Conf. Autonomous Agents, 1999.) • Hybrid Algorithms between Memory-based and Model-based techniques. [Pennock, David M. and Horvitz, Eric 1999] • Cold Start Problems (A.I. Schein, A. Popescul, L.H. Ungar, and D.M. Pennock, “Methods and Metrics for Cold-Start Recommendations,” Proc. 25th Ann. Int’l ACM SIGIR Conf., 2002.) • Privacy (N. Ramakrishnan, B.J. Keller, B.J. Mirza, A.Y. Grama, and G. Karypis, “Privacy Risks in Recommender Systems,” IEEE Internet Computing, vol. 5, no. 6, pp. 54-62, Nov./Dec. 2001.) • Error Method with Look Ahead • Boltzmann Machines • Vertical Niche Markets

Popular Recommendation Systems • Lotus Notes [Turnbull, 1998] • Mosaic system [Turnbull, 1997] • PHOAKS (People Helping One Another Know Stuff) [Terveen et al., 1997] • Pointers [Maltz, 1995] • Siteseer [Turnbull, 1997] • Tapestry [Goldberg, 1992] • Yahoo [Turnbull, 1998] • The WebWatcher system [Joachims, 1996] • Do-I-Care [Turnbull, 1998; Collaborative Filtering workshop, 1996] • Fab recommendation system [Turnbull, 1998] • Firefly [Turnbull, 1997 and 1998] • GAB (group asynchronous browsing) [Wittenburg et al., 1998] • Grassroots system [Turnbull, 1998] • Resnick [Resnick et al., 1994] • Let's Browse / Letizia [Lieberman, 1996; Pryor, 1998]

Classification of Collaborative Filtering Algorithms • A popular classification of CF algorithms was proposed by Breese et al. (Convergent algorithms for collaborative filtering, Proceedings of the 4th ACM conference on Electronic commerce) into Memory-based and Model-based methods. • Memory-based methods work on the principle of aggregating the labeled data and attempt to match recommenders to those seeking recommendations. The most common memory-based methods are based on the notion of nearest neighbor, using a variety of distance metrics. • Use the entire database of user ratings to make predictions. • Find users with similar voting histories to the active user. • Use these users’ votes to predict ratings for products not voted on by the active user. • Model-based methods, on the other hand, try to learn a compact model from the training data, for example by learning the parameters of a parametric posterior distribution. From an operational point of view, memory-based methods potentially work with the entire training set and scale linearly with the amount of training data, while model-based methods are constant time. • Construct a model from the vote database. • Use the model to predict the active user’s ratings.
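
The three memory-based steps above (use the whole rating database, find users with similar voting histories, use their votes to predict) can be sketched in a few lines. This is only an illustration under stated assumptions: the ratings dictionary, the user and movie names, and the helper names (mean, pearson, predict) are hypothetical and not from the slides, and the similarity here is a Pearson correlation computed over co-rated items, which is one common variant rather than the exact formulation of any cited paper.

```python
import math

# Hypothetical vote database (user -> {item: rating}); not from the slides.
ratings = {
    "active": {"m1": 5, "m2": 3, "m3": 4},
    "u1": {"m1": 4, "m2": 2, "m3": 3, "m4": 4},
    "u2": {"m1": 1, "m2": 5, "m3": 2, "m4": 2},
}

def mean(votes):
    """Average of all votes cast by one user."""
    return sum(votes.values()) / len(votes)

def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    den = math.sqrt(sum((a[i] - ma) ** 2 for i in common)
                    * sum((b[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0

def predict(active, item, db):
    """Active user's mean plus the similarity-weighted, mean-centered
    votes of every other user who rated the item."""
    a = db[active]
    num = den = 0.0
    for user, votes in db.items():
        if user == active or item not in votes:
            continue
        w = pearson(a, votes)
        num += w * (votes[item] - mean(votes))
        den += abs(w)
    return mean(a) + (num / den if den else 0.0)

print(round(predict("active", "m4", ratings), 3))
```

Note that every prediction scans the whole database, which is the linear scaling mentioned on the slide; a negatively correlated user such as u2 still contributes, with the sign of its deviation flipped.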

Classification of Collaborative Filtering Algorithms • Memory-based Algorithms and Model-based Algorithms. (Breese et al., 1998) • Memory-based Algorithms • Mean Squared Differences • Pearson Correlation (Neighborhood-based interpolation k-NN) • Vector Similarity • Model-based Algorithms • Bayesian Network Models • Neural Network Models (Boltzmann Machines) • Other / Hybrid Algorithms • A hybrid memory- and model-based approach [Pennock, David M. and Horvitz, Eric 1999] • Singular Value Decomposition (SVD) • Probabilistic Latent Semantic Analysis

Algorithms and their Performance Reference: The Netflix Prize by James Bennett and Stan Lanning, KDDCup’07, August 12, 2007, San Jose, California, USA.

Data Sets: Netflix Database • There are 17770 movies. • There are 480189 users. • CustomerIDs range from 1 to 2649429, with gaps. • Ratings are on a five star (integral) scale from 1 to 5. • YearOfRelease ranges from 1890 to 2005. • Training set consists of 100 million records. • Qualifying dataset size is 2817131. It contains movie IDs from 1 to 9999. Predictions need to be submitted on this dataset. • Probe dataset size is 1408395. It contains movie IDs from 1 to 9999. This dataset is meant to be used for checking the RMSE before proceeding to the qualifying dataset prediction. • Download Link: http://www.netflixprize.com/download MovieLens Database • The first dataset consists of 100,000 ratings for 1682 movies by 943 users. • The second one consists of approximately 1 million ratings for 3900 movies by 6040 users. • Download Link: http://www.grouplens.org/node/73 • UC Irvine Datasets

Experiment Details and Methodologies • Hardware • Cluster of 3 P-IV machines with ~2 GB RAM along with a remote desktop laptop (controller) • ~1 TB storage (with backups) • DataSet • Netflix DataSet • “Netflix provides a large movie rating dataset consisting of over 100 million ratings (and their dates) from approximately 480,000 randomly-chosen users and 18,000 movies. The data were collected between October, 1998 and December, 2005 and represent the distribution of all ratings Netflix obtained during this time period. Given this dataset, the task is to predict the actual ratings of over 3 million unseen ratings from these same users over the same set of movies”. [Yew Jin Lim and Yee Whye Teh, “Variational Bayesian Approach to Movie Rating Prediction”, KDDCup’07, August 12, 2007, San Jose, California, USA] • Benchmarking • Matrix calculated on time and accuracy (RMSE) results.
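
Since accuracy is benchmarked with RMSE (root mean squared error), a minimal sketch of that metric may help. The function name rmse and the sample ratings below are illustrative assumptions, not the scoring code used in the experiments.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions versus held-out ratings on the 1-5 star scale.
print(rmse([3.5, 4.0, 2.0], [4, 4, 1]))
```

Lower is better: a perfect predictor scores 0, and because errors are squared before averaging, a few large misses cost more than many small ones.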

Averages and Mean Statistics Reference: Gillic et al, 2006 – Stanford University

K-Nearest Neighbor: How does it work? The technique uses individual user distributions to measure distance between users, then makes predictions r(u_i, m_j) based on the ratings given to m_j by users near u_i. The intuition here is that if many users rate two movies the same, the movies should be considered similar. Conversely, if many users rate two movies differently, the movies should be considered different. (Don Gillick, UC Berkeley) Calculate the "similarity" between each pair of users by comparing how each user has rated common content. If Frank has rated something 4/5 stars and Jane has also rated it 4/5 stars, then these users would be considered similar. These calculations are very time consuming, as this essentially becomes the "handshake" problem, i.e. the calculation has to be performed for each unique combination of users. The number of unique combinations is n (n - 1) / 2. For the Netflix challenge, the number of unique combinations is 115,290,497,766... yes, that's 115 billion.
First list: • 1. Monsters • 2. Shrek (Full-screen) • 3. Shrek 2 • 4. LOTR: The Two Towers • 5. Pirates of the Caribbean: The Curse of the Black Pearl • 6. The Incredibles • 7. The Sixth Sense • 8. The Shawshank Redemption: Special Edition • 9. LOTR: The Fellowship of the Ring • 10. Forrest Gump
Second list: • 1. LOTR: The Two Towers • 2. LOTR: The Return of the King • 3. LOTR: The Fellowship of the Ring: Extended Edition • 4. LOTR: The Two Towers: Extended Edition • 5. Raiders of the Lost Ark • 6. LOTR: The Return of the King: Extended Edition • 7. Pirates of the Caribbean: The Curse of the Black Pearl • 8. The Matrix • 9. The Shawshank Redemption: Special Edition • 10. Braveheart
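
The "handshake" count quoted above is easy to check. The sketch below just evaluates n (n - 1) / 2 for the 480189 users in the Netflix data; the function name unique_pairs is an illustrative choice.

```python
def unique_pairs(n):
    """Number of unordered pairs among n users: n * (n - 1) / 2."""
    return n * (n - 1) // 2

NETFLIX_USERS = 480189  # user count from the Data Sets slide

print(f"{unique_pairs(NETFLIX_USERS):,}")
```

Because the count grows quadratically, doubling the number of users roughly quadruples the similarity computations.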


K-Nearest Neighbor: How does it work? An Example Step 1: Content-based survey classification. Now the new user rates a new movie with X1 = 3 and X2 = 7. Without another expensive survey, can we guess what the classification of this new movie is? 1. Determine the parameter K = number of nearest neighbors. Suppose we use K = 3.

K-Nearest Neighbor: How does it work? An Example (cont.) Step 2: Calculate the distance between the query instance and all the training samples. The coordinate of the query instance is (3, 7). Instead of calculating the distance, we compute the squared distance, which is faster to calculate (no square root) and gives the same ranking.

K-Nearest Neighbor: How does it work? An Example (cont.) • Step 3: Sort the distances and determine the nearest neighbors based on the K-th minimum distance.

K-Nearest Neighbor: How does it work? An Example (cont.) Step 4: Gather the category of the nearest neighbors. Notice in the second row, last column, that the category of the nearest neighbor (Y) is not included because the rank of this data point is more than 3 (= K). Step 5: Use the simple majority of the categories of the nearest neighbors as the prediction value for the query instance. We have 2 “Not likely to be seen by an action fan” and 1 “Likely to be seen by an action fan”. Since 2 > 1, we conclude that a new movie with X1 = 3 and X2 = 7 belongs to the former category, “Not likely to be seen by an action fan”.
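
Steps 1 to 5 can be sketched as follows. The training table from the slides did not survive in this transcript, so the four training points below are hypothetical values chosen only to be consistent with the stated outcome (query (3, 7), K = 3, two "Not likely" votes against one "Likely" vote, second row excluded); the function names are illustrative as well.

```python
from collections import Counter

LIKELY = "Likely to be seen by an action fan"
NOT_LIKELY = "Not likely to be seen by an action fan"

# Hypothetical training samples ((X1, X2), category); the original
# survey table is not preserved in the transcript.
training = [
    ((7, 7), LIKELY),
    ((7, 4), LIKELY),
    ((3, 4), NOT_LIKELY),
    ((1, 4), NOT_LIKELY),
]

def squared_distance(p, q):
    """Squared Euclidean distance; same ranking as the true distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_classify(query, samples, k):
    """Rank samples by squared distance, keep the k nearest and
    return the majority category among them."""
    ranked = sorted(samples, key=lambda s: squared_distance(query, s[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

query = (3, 7)  # X1 = 3, X2 = 7, as on the slides
print(knn_classify(query, training, k=3))
```

With these assumed points the squared distances are 16, 25, 9 and 13, so the sample at distance 25 falls outside the three nearest neighbors and the remaining votes are 2 to 1.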

Singular Value Decomposition (SVD) • The user rating vectors can be represented by an m × n matrix A, with m users and n products, where a_ij is the rating of user i for product j. [Qu & Yang, 2000] • Through singular value decomposition, A can be factored into U S V^T, where U and V are orthogonal matrices and S is a zero matrix except for the diagonal entries, which are defined as the singular values of A. • U is representative of the response of each user to certain features. • V is representative of the amount of each feature present in each product. • S is a matrix related to the feature importance in the overall determination of the rating. [Pryor, H. Michael, 1998]

How does SVD work? An Example of the inner workings of the Algorithm • Movies • Pulp Fiction: The movie has excellent cinematic value and storyline but has long dialogues and conversation sequences. • From Dusk Till Dawn: The movie has lots of action and a decent storyline and gets to the point fairly quickly, but isn't cinematic magic. • The Big Lebowski: Low budget but with excellent dialogues and quite an artistic niche. Not the best cinema work and continuity. • Children of Men: Excellent cinematography but a rather long storyline, sometimes not keeping the viewer captivated. Not of artistic value. • Reviewers • Andrea the Action fan - Likes action and short, well put together movies. Long stories and artsy stuff do not typically attract her, but she always appreciates good cinematography. • Arthur the Art Lover - Loves niche movies but also appreciates action; does not mind long movies as long as they have good artistic value. • Dave the Director - A film school graduate who loves action, good camera work, storyline and dialogues. Not a big art fan. • Jim the average movie guy - Likes action and thrillers but detests long movies.

How does SVD work? The Reviewer - Movie - Rating Matrix
Reviewer   Movie 1   Movie 2   Movie 3   Movie 4
A          5         4         2         6
B          3         7         5         2
C          6         4         1         4
D          ?         ?         ?         ?
A = U * W * V^T

How does SVD work? Predicting what a new user would like
W is the main component for principal components; its diagonal entries are the singular values:
14.49   0.00   0.00   0.00
 0.00   4.93   0.00   0.00
 0.00   0.00   1.65   0.00
• Now imagine that Jim rated the first movie 2. With R_ij = U_i1 S_11 V_j1 (S_ff is the f-th diagonal entry of W), we have 2 = U_41 S_11 V_11. We solve for U_41. To predict R_2, R_3 and R_4, we substitute U_41 into the above equation and get P = [2 2.1554 1.1577 1.7312]
• Now he has rated the second movie 7.
R_1 = U_41 S_11 V_11 + U_42 S_22 V_12
R_2 = U_41 S_11 V_21 + U_42 S_22 V_22
By solving for both U_41 and U_42, we can recalculate the predictions. P = [2 7 5.3660 1.0166], similar to B [3 7 5 2]
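
The prediction steps above can be reproduced with a short sketch. It assumes the rating matrix for reviewers A, B and C from the previous slide, and it finds the leading right singular vectors by plain power iteration on A^T A, a simple stand-in for a library SVD routine that the slides do not specify. The function names (right_singular_vectors, fold_in) are illustrative.

```python
import math

# Ratings from the slides: rows are reviewers A, B, C; columns are movies 1-4.
A = [[5, 4, 2, 6],
     [3, 7, 5, 2],
     [6, 4, 1, 4]]

def right_singular_vectors(a, k, iters=200):
    """Top-k (singular value, right singular vector) pairs of a, found by
    power iteration on A^T A with deflation."""
    n = len(a[0])
    gram = [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]
    found = []
    for c in range(k):
        v = [1.0 + 0.1 * (i + c) for i in range(n)]  # arbitrary start vector
        for _ in range(iters):
            for _, u in found:  # project out the vectors already found
                d = sum(x * y for x, y in zip(v, u))
                v = [x - d * y for x, y in zip(v, u)]
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        gv = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        found.append((math.sqrt(sum(x * y for x, y in zip(v, gv))), v))
    return found

def fold_in(known, vectors):
    """Predict a full rating row for a new user from one or two ratings.

    known is a list of (movie_index, rating) with one entry per vector.
    Solves for c_f = U_4f * S_ff, then P_j = sum over f of c_f * V_jf."""
    if len(known) == 1:
        (j, r), = known
        coeffs = [r / vectors[0][j]]
    else:
        (j1, r1), (j2, r2) = known
        a, b = vectors[0][j1], vectors[1][j1]
        c, d = vectors[0][j2], vectors[1][j2]
        det = a * d - b * c  # 2x2 system solved with Cramer's rule
        coeffs = [(r1 * d - b * r2) / det, (a * r2 - c * r1) / det]
    n = len(vectors[0])
    return [sum(cf * v[j] for cf, v in zip(coeffs, vectors)) for j in range(n)]

pairs = right_singular_vectors(A, 2)
vs = [v for _, v in pairs]
p1 = fold_in([(0, 2)], vs[:1])        # Jim rated movie 1 with a 2
p2 = fold_in([(0, 2), (1, 7)], vs)    # then movie 2 with a 7
print([round(x, 4) for x in p1])
print([round(x, 4) for x in p2])
```

The products U_41 S_11 and U_42 S_22 are solved for as single coefficients, since only the product enters the prediction. With one known rating the row is a scaled copy of the first feature; the second rating adds the second feature, which is what moves Jim's predictions toward reviewer B.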

Recommendations for Large Scale Recommender Systems • There is no silver bullet. The BellKor solution to the Netflix Prize used modified k-NN, and the final solution (RMSE=0.8712) consists of blending 107 individual results. • Occam’s Razor – Simplicity is good on a smaller scale. • Algorithms' Performance on Accuracy (low to high) • Averages, Bayesian, Multinomial Distribution (Co-Variance), k-NN (Pearson Correlation), Singular Value Decomposition, Specialized Hybrid Techniques • Algorithms' Performance on Time-Space (low to high) • Averages, Singular Value Decomposition, Specialized Hybrid Techniques, Multinomial Distribution (Co-Variance), Bayesian, k-NN (Pearson Correlation) • Algorithms' Performance on Scalability (low to high) • Averages, k-NN (Pearson Correlation), Multinomial Distribution (Co-Variance), Specialized Hybrid Techniques, Bayesian, Singular Value Decomposition • Perform offline processing and cache the results for maximum performance and scalability. • Build a hybrid design to support cold start, privacy and content control. • Use adaptive models to improve recommendations progressively.

SQL Server Data Mining What's new in BI for SQL Server 2008 Lynn Langit Room: 107 • www.SQLServerDataMining.com • http://www.microsoft.com/sql/technologies/dm/default.mspx • http://scis.nova.edu/~adnan/

SQL Server DM Algorithms • Microsoft Association Algorithm • Microsoft Clustering Algorithm • Microsoft Decision Trees Algorithm • Microsoft Naive Bayes Algorithm • Microsoft Neural Network Algorithm (SSAS) • Microsoft Sequence Clustering Algorithm • Microsoft Time Series Algorithm • Microsoft Linear Regression Algorithm • Microsoft Logistic Regression Algorithm

SQL Server Prediction Queries The following query retrieves report data indicating which customers are likely to purchase a bicycle, and the probability that they will do so.
SELECT
  t.FirstName, t.LastName,
  (Predict([Bike Buyer])) AS [PredictedValue],
  (PredictProbability([Bike Buyer])) AS [Probability]
FROM [TM Decision Tree]
PREDICTION JOIN
  OPENQUERY([Adventure Works DW],
    'SELECT [FirstName], [LastName], [CustomerKey], [MaritalStatus], [Gender],
            [YearlyIncome], [TotalChildren], [NumberChildrenAtHome],
            [HouseOwnerFlag], [NumberCarsOwned], [CommuteDistance]
     FROM [dbo].[DimCustomer]') AS t
ON  [TM Decision Tree].[Marital Status] = t.[MaritalStatus]
AND [TM Decision Tree].[Gender] = t.[Gender]
AND [TM Decision Tree].[Yearly Income] = t.[YearlyIncome]
AND [TM Decision Tree].[Total Children] = t.[TotalChildren]
AND [TM Decision Tree].[Number Children At Home] = t.[NumberChildrenAtHome]
AND [TM Decision Tree].[House Owner Flag] = t.[HouseOwnerFlag]
AND [TM Decision Tree].[Number Cars Owned] = t.[NumberCarsOwned]
AND [TM Decision Tree].[Commute Distance] = t.[CommuteDistance]
WHERE (Predict([Bike Buyer])) = @Buyer
  AND (PredictProbability([Bike Buyer])) > @Probability

SQL Server Support for Prediction
SELECT FLATTENED
  TopCount(Predict([Invoice Detail], INCLUDE_STATISTICS), $AdjustedProbability, 5)
FROM [assoc1]
NATURAL PREDICTION JOIN
( SELECT 'Female' AS [Gender], 25 AS [Age],
    -- specify Gender, Marital Status, Income
    ( SELECT 'Mountain bottle cage' AS [Product Name]
      UNION
      SELECT 'Hydration pack -70oz' AS [Product Name]
    ) AS [Invoice Detail]
) AS t

Questions?
