Presentation Transcript

  1. Data Mining Algorithms for Recommendation Systems Zhenglu Yang University of Tokyo

  2. Sample Applications

  3. Sample Applications

  4. Corporate Intranets Sample Applications

  5. System Inputs
  • Interaction data (users × items)
    • Explicit feedback: ratings, comments
    • Implicit feedback: purchases, browsing
  • User/item individual data
    • User side: structural attribute information, personal description, social network
    • Item side: structural attribute information, textual description/content information, taxonomy of items (categories)

  6. Interaction between Users and Items Observed preferences (Purchases, Ratings, page views, bookmarks, etc)

  7. Profiles of Users and Items
  User profile: (1) Attributes: nationality, sex, age, hobby, etc. (2) Text: personal description (3) Links: social network
  Item profile: (1) Attributes: price, weight, color, brand, etc. (2) Text: product description (3) Links: taxonomy of items (categories)

  8. All Information about Users and Items
  User profile: (1) Attributes: nationality, sex, age, hobby, etc. (2) Text: personal description (3) Links: social network
  Item profile: (1) Attributes: price, weight, color, brand, etc. (2) Text: product description (3) Links: taxonomy of items (categories)
  Observed preferences: purchases, ratings, page views, bookmarks, etc.

  9. KDD and Data Mining
  Data mining is a multi-disciplinary field, drawing on machine learning, artificial intelligence, databases, statistics, and natural language processing.

  10. Recommendation Approaches
  • Collaborative filtering
    • Uses interaction data (the user-item matrix)
    • Process: identify similar users, extrapolate from their ratings
  • Content-based strategies
    • Use profiles of users/items (features)
    • Process: generate rules/classifiers that are used to classify new items
  • Hybrid approaches

  11. A Brief Introduction • Collaborative filtering • Nearest neighbor based • Model based

  12. Recommendation Approaches • Collaborative filtering • Nearest neighbor based • User based • Item based • Model based

  13. User-based Collaborative Filtering • Idea: People who agreed in the past are likely to agree again • To predict a user’s opinion for an item, use the opinion of similar users • Similarity between users is decided by looking at their overlap in opinions for other items

  14. User-based CF (Ratings)
  [figure: user-item rating matrix on a scale from 1 (bad) to 10 (good)]

  15. Similarity between Users • Only consider items both users have rated • Common similarity measures: • Cosine similarity • Pearson correlation
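The two slides above can be sketched in code. Below is a minimal user-based CF predictor, assuming a small hypothetical ratings dictionary (all user/item names and values are illustrative). It uses cosine similarity over co-rated items; Pearson correlation would additionally subtract each user's mean rating before taking the dot product.

```python
import math

# Ratings: user -> {item: rating}. Hypothetical toy data.
ratings = {
    "alice": {"i1": 5, "i2": 3, "i3": 4},
    "bob":   {"i1": 4, "i2": 2, "i3": 5},
    "carol": {"i1": 1, "i2": 5},
}

def cosine_sim(u, v):
    """Cosine similarity over items both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    nv = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(user, item):
    """Similarity-weighted average of neighbours' ratings for `item`."""
    neighbours = [(cosine_sim(user, v), r[item])
                  for v, r in ratings.items()
                  if v != user and item in r]
    num = sum(s * rating for s, rating in neighbours)
    den = sum(abs(s) for s, _ in neighbours)
    return num / den if den else None
```

For example, `predict("carol", "i3")` blends alice's and bob's ratings for i3, weighted by how closely their opinions on i1 and i2 overlap with carol's.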

  16. Recommendation Approaches • Collaborative filtering • Nearest neighbor based • User based • Item based • Model based • Content based strategies • Hybrid approaches

  17. Item-based Collaborative Filtering • Idea: a user is likely to have the same opinion for similar items • Similarity between items is decided by looking at how other users have rated them

  18. Example: Item-based CF

  19. Similarity between Items • Only consider users who have rated both items • Common similarity measures: • Cosine similarity • Pearson correlation
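The item-based variant mirrors the user-based sketch, with the matrix transposed: similarity is computed between item columns, over users who rated both items. The data below is hypothetical.

```python
import math

# Ratings indexed by item: item -> {user: rating}. Hypothetical toy data.
item_ratings = {
    "i1": {"u1": 5, "u2": 4, "u3": 1},
    "i2": {"u1": 3, "u2": 2, "u3": 5},
    "i3": {"u1": 4, "u2": 5},
}

def item_sim(a, b):
    """Cosine similarity over users who have rated both items."""
    common = set(item_ratings[a]) & set(item_ratings[b])
    if not common:
        return 0.0
    dot = sum(item_ratings[a][u] * item_ratings[b][u] for u in common)
    na = math.sqrt(sum(item_ratings[a][u] ** 2 for u in common))
    nb = math.sqrt(sum(item_ratings[b][u] ** 2 for u in common))
    return dot / (na * nb)

def predict(user, item):
    """Similarity-weighted average over the items `user` has rated."""
    rated = [(j, r[user]) for j, r in item_ratings.items()
             if j != item and user in r]
    num = sum(item_sim(item, j) * rating for j, rating in rated)
    den = sum(item_sim(item, j) for j, _ in rated)
    return num / den if den else None
```

Item-based CF is often preferred in practice because item-item similarities tend to be more stable over time than user-user similarities and can be precomputed offline.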

  20. Recommendation Approaches • Collaborative filtering • Nearest neighbor based • Model based • Matrix factorization (i.e., SVD) • Content based strategies • Hybrid approaches

  21. Singular Value Decomposition (SVD)
  • A matrix factorization method applied to many problems
  • Given any m×n matrix R, find matrices U, Σ, and V such that R = UΣV^T
    • U is m×r and orthonormal
    • Σ is r×r and diagonal, with singular values σ1 ≥ … ≥ σr on the diagonal
    • V is n×r and orthonormal
  • Drop the smallest singular values to get R_k with k << r
  • R_k is the rank-k approximation of R based on the k most important latent features
  • Recommendations are then made from R_k
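The low-rank approximation described above is a few lines with NumPy. The rating matrix below is hypothetical (zeros standing in for unrated cells, purely for illustration):

```python
import numpy as np

# Hypothetical 4x3 user-item rating matrix (0 = unrated, for illustration).
R = np.array([[5., 3., 0.],
              [4., 0., 0.],
              [1., 1., 5.],
              [0., 1., 4.]])

# Thin SVD: R = U @ diag(s) @ Vt, singular values s in descending order,
# U and V with orthonormal columns.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep the k largest singular values: by the Eckart-Young theorem this is
# the best rank-k approximation of R in the Frobenius-norm sense.
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

In a real recommender one would typically factorize only the observed entries (e.g. via regularized alternating least squares or SGD) rather than treating missing ratings as zeros, but the truncation idea is the same.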

  22. Problems with Collaborative Filtering
  • Cold start: there need to be enough other users already in the system to find a match.
  • Sparsity: if there are many items to be recommended, the user/ratings matrix is sparse even with many users, and it is hard to find users that have rated the same items.
  • First rater: cannot recommend an item that has not been previously rated (new items, esoteric items).
  • Popularity bias: cannot recommend items to someone with unique tastes; tends to recommend popular items.

  23. Recommendation Approaches • Collaborative filtering • Content based strategies • Hybrid approaches

  24. Profiles of Users and Items User Profile: (1) Attribute Nationality,Sex,Age,Hobby,etc (2) Text Personal description (3) Link Social network Item Profile: (1) Attribute Price,Weight,Color,Brand,etc (2) Text Product description (3) link Taxonomy of item (category)

  25. Advantages of Content-Based Approach
  • No need for data on other users: no cold-start or sparsity problems.
  • Able to recommend to users with unique tastes.
  • Able to recommend new and unpopular items: no first-rater problem.
  • Can provide explanations of recommended items by listing the content features that caused an item to be recommended.

  26. Recommendation Approaches • Collaborative filtering • Content based strategies • Association Rule Mining • Text similarity based • Clustering • Classification • Hybrid approaches

  27. Traditional Data Mining Techniques • Association Rule Mining • Sequential Pattern Mining

  28. Example: Market Basket Data
  • Items frequently purchased together (e.g., beer and diapers)
  • Uses: recommendation, placement, sales, coupons
  • Objective: increase sales and reduce costs

  29. Association Rule Mining
  • Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction
  • Example association rules: {Beer} → {Diaper}, {Coke} → {Eggs}
  • Implication means co-occurrence, not causality!

  30. Some Definitions
  An itemset is supported by a transaction if it is contained in that transaction. In the market-basket transactions shown, <Beer, Diaper> is supported by transactions 1 and 3, so its support is 2/4 = 50%.

  31. Some Definitions
  If the support of an itemset meets a user-specified min_support threshold, the itemset is called a frequent itemset (pattern). With min_support = 50%, <Beer, Diaper> is a frequent itemset and <Beer, Milk> is not.
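The support computation can be sketched directly. The transaction table itself is not in the transcript, so the four transactions below are reconstructed to match the numbers the slides state (<Beer, Diaper> in transactions 1 and 3, support 50%; <Beer, Milk> infrequent):

```python
# Four transactions reconstructed to match the slides' stated supports
# (the original table is not in the transcript).
transactions = [
    {"Beer", "Diaper", "Milk"},   # 1
    {"Coke", "Eggs"},             # 2
    {"Beer", "Diaper", "Eggs"},   # 3
    {"Milk", "Bread"},            # 4
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

min_support = 0.5
# {Beer, Diaper} is frequent (support 50%); {Beer, Milk} is not (25%).
```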

  32. Outline • Association Rule Mining • Apriori • FP-growth • Sequential Pattern Mining

  33. Apriori Algorithm • Proposed by Agrawal et al. [VLDB’94] • First algorithm for Association Rule mining • Candidate generation-and-test • Introduced anti-monotone property

  34. Apriori Algorithm Market-Basket transactions

  35. A Naive Algorithm (Supmin = 2)
  [figure: support counts for every candidate itemset; the naive approach counts all of them]

  36. Apriori Algorithm (Supmin = 2)
  • Anti-monotone property: if an itemset is not frequent, then none of its supersets can be frequent
  [figure: candidate lattice with infrequent itemsets and all their supersets pruned]

  37. Apriori Algorithm (Supmin = 2)
  [figure: C1 → (1st scan) → L1 → C2 → (2nd scan) → L2 → C3 → (3rd scan) → L3]
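The level-wise candidate generation-and-test loop can be sketched as follows, on a small hypothetical transaction set with Supmin = 2 (the slide's own table is not in the transcript):

```python
from itertools import combinations

# Hypothetical market-basket transactions; minimum support count is 2.
transactions = [
    {"Beer", "Diaper", "Milk"},
    {"Coke", "Eggs"},
    {"Beer", "Diaper", "Eggs"},
    {"Beer", "Diaper", "Milk", "Eggs"},
]
min_count = 2

def count(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items (L1).
items = sorted({i for t in transactions for i in t})
level = {frozenset([i]) for i in items if count(frozenset([i])) >= min_count}
frequent = set(level)

k = 2
while level:
    # Join step: combine frequent (k-1)-itemsets into k-item candidates (Ck).
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # Prune step (anti-monotone): drop any candidate that has an
    # infrequent (k-1)-subset, without touching the database.
    candidates = {c for c in candidates
                  if all(frozenset(s) in level for s in combinations(c, k - 1))}
    # Scan step: one pass over the database keeps the truly frequent ones (Lk).
    level = {c for c in candidates if count(c) >= min_count}
    frequent |= level
    k += 1
```

Each loop iteration is one database scan, matching the C1 → L1 → C2 → L2 → … flow on the slide.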

  38. Drawbacks of Apriori
  • Multiple scans of the transaction database, which are costly
  • Huge number of candidates: to find the frequent itemset i1 i2 … i100
    • # of scans: 100
    • # of candidates: 2^100 − 1 ≈ 1.27 × 10^30

  39. Outline • Association Rule Mining • Apriori • FP-growth • Sequential Pattern Mining

  40. FP-Growth
  • Proposed by Han et al. [SIGMOD’00]
  • Uses the Apriori pruning principle
  • Scans the DB only twice
    • Once to find the frequent 1-itemsets (single-item patterns)
    • Once to construct the FP-tree (a prefix tree, or trie), the data structure of FP-growth

  41. FP-Growth
  TID  Items bought                 (ordered) frequent items
  100  {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
  200  {a, b, c, f, l, m, o}        {f, c, a, b, m}
  300  {b, f, h, j, o, w}           {f, b}
  400  {b, c, k, s, p}              {c, b, p}
  500  {a, f, c, e, l, p, m, n}     {f, c, a, m, p}
  Header table (item: support): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
  [figure: FP-tree after inserting TID 100: {} → f:1 → c:1 → a:1 → m:1 → p:1]
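The first scan and the per-transaction reordering shown above can be sketched as follows, using the slide's own five transactions and a minimum support count of 3 (this covers the two scans that prepare the FP-tree, not the full recursive mining):

```python
from collections import Counter

# The five transactions from the slide; minimum support count is 3.
db = {
    100: ["f", "a", "c", "d", "g", "i", "m", "p"],
    200: ["a", "b", "c", "f", "l", "m", "o"],
    300: ["b", "f", "h", "j", "o", "w"],
    400: ["b", "c", "k", "s", "p"],
    500: ["a", "f", "c", "e", "l", "p", "m", "n"],
}
min_count = 3

# Scan 1: count single items and keep the frequent ones (the header table).
counts = Counter(i for items in db.values() for i in items)
freq = {i: c for i, c in counts.items() if c >= min_count}

# Reorder each transaction's frequent items in header-table order
# (f, c, a, b, m, p on the slide) before FP-tree insertion, so shared
# prefixes overlap in the tree.
rank = {i: r for r, i in enumerate("fcabmp")}
ordered = {tid: sorted((i for i in items if i in freq), key=rank.get)
           for tid, items in db.items()}
```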

  42. FP-Growth
  (Same transaction table and header table as the previous slide.)
  [figure: complete FP-tree: {} → f:4 → c:3 → a:3 → m:2 → p:2, with side branches a:3 → b:1 → m:1, f:4 → b:1, and {} → c:1 → b:1 → p:1]

  43. FP-Growth
  [figure: the same FP-tree, with header-table links from each item (f, c, a, b, m, p) to its nodes in the tree]

  44. FP-Growth
  Conditional pattern bases:
  Item  cond. pattern base   frequent itemsets
  p     fcam:2, cb:1         cp
  m     fca:2, fcab:1        fm, cm, am, fcm, fam, cam, fcam
  b     fca:1, f:1, c:1      …
  a     fc:3                 …
  c     f:3                  …
  (In p’s conditional base only c reaches the support threshold of 3, so cp is the only frequent itemset ending in p.)

  45. Outline • Association Rule Mining • Apriori • FP-growth • Sequential Pattern Mining • GSP • SPADE, SPAM • PrefixSpan

  46. Applications
  • Customer shopping sequences
  [figure: e.g., 80% of customers who buy a notebook computer buy memory and a CD-ROM within 3 days; query → re-query sequences Q1 → P1, Q2 → P2]

  47. Some Definitions
  If the support of a sequence meets a user-specified min_support, the sequence is called a sequential pattern. With min_support = 50%, <bdb> is a sequential pattern and <adc> is not.
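Sequence support rests on subsequence containment: each element of the pattern must map, in order, into a distinct later element of the data sequence that contains it. The slide's own sequence table is not shown, so the database below is hypothetical, constructed so that <bdb> has support 50% and <adc> does not:

```python
# Hypothetical sequence database (the slide's own table is not shown).
# Each sequence is a list of elements (itemsets); e.g. {"b", "d"} is one
# element containing both b and d.
db = [
    [{"b"}, {"d"}, {"b"}],
    [{"a"}, {"b", "d"}, {"d"}, {"b"}, {"c"}],
    [{"a"}, {"c"}],
    [{"d"}, {"b"}],
]

def contains(seq, pattern):
    """True if `pattern` is a subsequence of `seq`: each pattern element
    is a subset of some element of `seq`, in order (greedy match)."""
    pos = 0
    for element in seq:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)

def support(pattern):
    """Fraction of sequences in the database containing `pattern`."""
    return sum(contains(s, pattern) for s in db) / len(db)

# Here <bdb> has support 50%; <adc> only 25%.
```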

  48. Outline • Association Rule Mining • Apriori • FP-growth • Sequential Pattern Mining • GSP • SPADE, SPAM • PrefixSpan

  49. GSP (Generalized Sequential Pattern Mining)
  • Proposed by Srikant et al. [EDBT’96]
  • Uses the Apriori pruning principle
  • Outline of the method
    • Initially, every item in the DB is a candidate of length 1
    • For each level (i.e., sequences of length k):
      • Scan the database to collect the support count for each candidate sequence
      • Generate candidate length-(k+1) sequences from the length-k frequent sequences, Apriori-style
    • Repeat until no frequent sequence or no candidate can be found

  50. Finding Length-1 Sequential Patterns (min_sup = 2)
  Seq. ID  Sequence
  10       <(bd)cb(ac)>
  20       <(bf)(ce)b(fg)>
  30       <(ah)(bf)abf>
  40       <(be)(ce)d>
  50       <a(bd)bcb(ade)>
  • Initial candidates: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
  • Scan the database once, counting support for each candidate
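The first GSP pass above can be sketched directly on the slide's sequence database: count, for each single item, how many sequences contain it in at least one element, and keep those meeting min_sup = 2.

```python
# The sequence database from the slide; each sequence is a list of
# elements (itemsets), e.g. (bd) becomes {"b", "d"}.
db = {
    10: [{"b", "d"}, {"c"}, {"b"}, {"a", "c"}],
    20: [{"b", "f"}, {"c", "e"}, {"b"}, {"f", "g"}],
    30: [{"a", "h"}, {"b", "f"}, {"a"}, {"b"}, {"f"}],
    40: [{"b", "e"}, {"c", "e"}, {"d"}],
    50: [{"a"}, {"b", "d"}, {"b"}, {"c"}, {"b"}, {"a", "d", "e"}],
}
min_sup = 2

def item_support(item):
    """Number of sequences containing `item` in at least one element."""
    return sum(1 for seq in db.values() if any(item in e for e in seq))

# Length-1 frequent sequences (L1): g and h fall below min_sup here.
items = sorted({i for seq in db.values() for e in seq for i in e})
L1 = [i for i in items if item_support(i) >= min_sup]
```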