Collaborative Filtering 101
Adnan Masood, www.AdnanMasood.com
About Me (a.k.a. Shameless Self-Promotion): Sr. Software Engineer / Tech Lead at Green Dot Corp. (a financial institution); designs and develops connected systems.
Applications in Information Security (AT&T Hancock)
Reference: The Netflix Prize, by James Bennett and Stan Lanning, KDD Cup '07, August 12, 2007, San Jose, California, USA.
Reference: Gillick et al., 2006, Stanford University
The technique uses individual user distributions to measure the distance between users, then makes predictions r(ui, mj) based on the ratings given to mj by users near ui. The intuition is that if many users rate two movies the same, the movies should be considered similar; conversely, if many users rate two movies differently, the movies should be considered different. (Don Gillick, UC Berkeley)
Calculate the "similarity" between each pair of users by comparing how they have rated content in common. If Frank has rated something 4/5 stars and Jane has also rated it 4/5 stars, these users would be considered similar. These calculations are very time-consuming because this is essentially the "handshake" problem: the comparison has to be performed for every unique pair of users, and the number of unique pairs is n(n - 1) / 2. For the Netflix challenge, that works out to 115,290,497,766 unique pairs... yes, 115 billion.
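The handshake arithmetic and a pairwise similarity can be sketched as follows. This is a minimal illustration: cosine similarity is one common choice of metric (the slides do not fix a specific one), and the 480,189-user count is taken from the published Netflix Prize data-set figures, not from these slides.

```python
def n_unique_pairs(n):
    """The 'handshake' count: number of unique user pairs, n(n - 1) / 2."""
    return n * (n - 1) // 2

def similarity(ratings_a, ratings_b):
    """Cosine similarity over the items both users rated.
    Each argument is a dict mapping item -> star rating."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = sum(ratings_a[i] ** 2 for i in common) ** 0.5
    norm_b = sum(ratings_b[i] ** 2 for i in common) ** 0.5
    return dot / (norm_a * norm_b)

# Frank and Jane both rated movie1 with 4/5 stars -> maximally similar
frank = {"movie1": 4, "movie2": 2}
jane = {"movie1": 4, "movie3": 5}
print(similarity(frank, jane))  # 1.0

# Netflix Prize scale: 480,189 users
print(n_unique_pairs(480_189))  # 115,290,497,766
```

Even though each individual comparison is cheap, the quadratic number of pairs is what makes the naive approach impractical at Netflix scale.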
Step 1: Content-based survey classification.
Now the new user rates a new movie: X1 = 3, X2 = 7. Without another expensive survey, can we guess the classification of this new movie?
1. Determine parameter K = number of nearest neighbors
Suppose we use K = 3.
Step 2: Calculate the distance between the query instance and all the training samples.
The coordinate of the query instance is (3, 7). Instead of the distance we compute the squared distance, which is faster to calculate (no square root needed); ranking by squared distance yields the same nearest neighbors.
Step 4: Gather the categories of the nearest neighbors. Notice in the second row, last column, that the category of the nearest neighbor (Y) is not included because the rank of that data point is greater than 3 (= K).
Step 5: Use a simple majority of the categories of the nearest neighbors as the prediction for the query instance. We have two votes for "Not likely to be seen by an action fan" and one for "Likely to be seen by an action fan"; since 2 > 1, we conclude that a new movie with X1 = 3 and X2 = 7 falls into the former category ("Not likely to be seen by an action fan").
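The five steps above can be sketched in a few lines. Note that the original survey table is not reproduced in these slides, so the training points below are hypothetical, chosen only so that the query (3, 7) with K = 3 reproduces the 2-to-1 majority described in Step 5.

```python
# Hypothetical training set: (X1, X2, category). These points are made up
# for illustration; the slides' actual survey data is not shown here.
train = [
    (7, 7, "Likely to be seen by an action fan"),
    (7, 4, "Likely to be seen by an action fan"),
    (3, 4, "Not likely to be seen by an action fan"),
    (1, 4, "Not likely to be seen by an action fan"),
]

def knn_classify(query, samples, k=3):
    """Majority vote among the k samples closest to the query.
    Squared distance is used: it preserves the ranking and skips the sqrt."""
    sq_dist = lambda p: (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2
    nearest = sorted(samples, key=sq_dist)[:k]
    votes = {}
    for _, _, category in nearest:
        votes[category] = votes.get(category, 0) + 1
    return max(votes, key=votes.get)

print(knn_classify((3, 7), train, k=3))
# -> "Not likely to be seen by an action fan" (two of the three
#    nearest hypothetical neighbors carry that label)
```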
The Reviewer - Movie - Rating Matrix

        1    2    3    4
   A    5    4    2    6
   B    3    7    5    2
   C    6    4    1    4
   D    ?    ?    ?    ?
A = U * W * V^T
W is the diagonal matrix of singular values; it is the main component of the principal-components view of the data, and its entries identify the relative importance of each component.
W =
   14.49   0.00   0.00   0.00
    0.00   4.93   0.00   0.00
    0.00   0.00   1.65   0.00
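As a sanity check, these singular values can be reproduced by running an off-the-shelf SVD on the A/B/C rating matrix. A minimal sketch, assuming NumPy:

```python
import numpy as np

# Rating matrix for reviewers A, B, C over the four movies
A = np.array([[5, 4, 2, 6],
              [3, 7, 5, 2],
              [6, 4, 1, 4]], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values largest first
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))  # the diagonal of W
```

The squared singular values sum to the total sum of squared ratings (237 here), which is a quick way to confirm the decomposition is consistent with the matrix.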
Rank-1 approximation of reviewer D's known first rating: 2 = U41 S11 V11
We solve for U1. To predict R2, R3, and R4, we substitute U1 into the above equation and get:
P = [2 2.1554 1.1577 1.7312]
R1 = U41 S11 V11 + U42 S22 V12
R2 = U41 S11 V21 + U42 S22 V22
By solving for both U1 and U2, we can recalculate the predictions:
P = [2 7 5.3660 1.0166]
This is similar to reviewer B's row: [3 7 5 2]
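The two-component fold-in above can be sketched with NumPy. This is an illustrative reconstruction, not the slides' exact code: given the A/B/C rating matrix and D's two known ratings R1 = 2 and R2 = 7, solve a 2x2 linear system for D's latent weights (the slides' U1 and U2) and recompute all four predictions.

```python
import numpy as np

# Rating matrix for reviewers A, B, C (reviewer D's row is unknown)
A = np.array([[5, 4, 2, 6],
              [3, 7, 5, 2],
              [6, 4, 1, 4]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reviewer D's first two ratings are known: R1 = 2, R2 = 7
known = np.array([2.0, 7.0])

# Keep the top two singular components and solve
#   known = w @ diag(s[:2]) @ Vt[:2, :2]
# for D's latent weights w (the slides' U1, U2).
M = np.diag(s[:2]) @ Vt[:2, :2]
w = np.linalg.solve(M.T, known)  # solves w @ M = known

# Fold D back in to predict all four ratings
pred = w @ np.diag(s[:2]) @ Vt[:2]
print(np.round(pred, 4))
```

By construction the first two entries of `pred` reproduce the known ratings 2 and 7, and the remaining entries are the model's predictions for R3 and R4 (which should land near the slide's P = [2 7 5.3660 1.0166]).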
What's new in BI for SQL Server 2008
Microsoft Association Algorithm
Microsoft Clustering Algorithm
Microsoft Decision Trees Algorithm
Microsoft Naive Bayes Algorithm
Microsoft Neural Network Algorithm (SSAS)
Microsoft Sequence Clustering Algorithm
Microsoft Time Series Algorithm
Microsoft Linear Regression Algorithm
Microsoft Logistic Regression Algorithm
The following query retrieves report data indicating which customers are likely to purchase a bicycle, and the probability that they will do so.