
CSC-480 Data Mining

This lecture provides an introduction to association rule mining, including the basics, motivation, and real-life examples. It explains the Apriori algorithm and FP-Growth algorithm for mining frequent itemsets and generating association rules.







  1. CSC-480 Data Mining Lecture 03 – Association Rule Mining Muhammad Tariq Siddique https://sites.google.com/site/mtsiddiquecs/dm

  2. Gentle Reminder “Switch Off” your Mobile Phone Or Switch Mobile Phone to “Silent Mode”

  3. Agenda

  4. Agenda

  5. The Basics Which items are frequently purchased together by customers?

  6. The Basics • Motivation: Business transaction records • Discovery of interesting correlation relationships that help business decision-making processes (catalog design, cross-marketing, customer shopping behavior analysis, …)

  7. The Basics • How to place SW, HW, and Accessories?

  8. Future Store

  9. A Real Example

  10. The Basics Data Mining 2013 – Mining Frequent Patterns, Association, and Correlations

  11. The Basics - Frequent Itemsets Transaction Dataset Itemset Occurrence Frequency Frequent Itemset

  12. The Basics - Association Rules

  13. The Basics - Association Rules • If the frequency of itemset I satisfies the min_support count, then I is a frequent itemset • If a rule satisfies both min_support and min_confidence thresholds, it is said to be strong • The problem of mining association rules reduces to mining frequent itemsets • Association rule mining becomes a two-step process: • Find all frequent itemsets, i.e., those with frequency ≥ a predetermined min_support count (the most costly step) • Generate strong association rules from the frequent itemsets, i.e., rules that satisfy min_support and min_confidence
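The two quantities behind this slide, support and confidence, can be sketched in Python. The helper names are hypothetical (not from the lecture), and the nine-transaction dataset from the later Apriori slides is assumed:

```python
# Nine-transaction example dataset (as used on the Apriori slides).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def support_count(itemset, db):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def confidence(antecedent, consequent, db):
    """confidence(A => B) = support_count(A union B) / support_count(A)."""
    return support_count(antecedent | consequent, db) / support_count(antecedent, db)

# {I1, I2} occurs in 4 of the 9 transactions
print(support_count({"I1", "I2"}, transactions))           # 4
# confidence(I1 => I2) = 4 / 6
print(round(confidence({"I1"}, {"I2"}, transactions), 2))  # 0.67
```

A rule is then "strong" when its support count meets min_support and its confidence meets min_confidence.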

  14. Agenda

  15. Mining Frequent Itemsets – Apriori Algorithm • Finds frequent itemsets by exploiting prior knowledge of frequent itemset properties • Level-wise search, where frequent k-itemsets are used to explore (k+1)-itemsets • Goes as follows: • Find frequent 1-itemsets → L1 • Use L1 to find frequent 2-itemsets → L2 • … until no more frequent k-itemsets can be found • Computing each Lk requires a full dataset scan • To improve efficiency, use the Apriori property: • “All nonempty subsets of a frequent itemset must also be frequent” – if a set cannot pass a test, all of its supersets will fail the same test as well – if P(I) < min_support then P(I ∪ A) < min_support
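The level-wise search described above can be sketched as a short Python function. This is a minimal illustration, not the lecture's own code; `db` is assumed to be a list of item sets:

```python
from itertools import combinations

def apriori(db, min_support):
    """Level-wise frequent-itemset mining (minimal sketch)."""
    # L1: frequent 1-itemsets
    items = sorted({i for t in db for i in t})
    L = [frozenset([i]) for i in items
         if sum(1 for t in db if i in t) >= min_support]
    frequent = list(L)
    k = 2
    while L:
        # Join step: combine (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # One full dataset scan per level to count surviving candidates
        L = [c for c in candidates
             if sum(1 for t in db if c <= t) >= min_support]
        frequent += L
        k += 1
    return frequent
```

On the nine-transaction example with min_support = 2, this reproduces the levels on the following slides, ending with L3 = {{I1, I2, I3}, {I1, I2, I5}} and an empty C4.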

  16. Mining Frequent Itemsets – Apriori Algorithm Transactional data example: N = 9, min_supp count = 2 • Scan dataset for the count of each candidate → C1 • Compare candidate support with min_supp → L1

  17. Mining Frequent Itemsets – Apriori Algorithm • Generate C2 candidates from L1 by joining L1 ⋈ L1 • Scan dataset for the count of each candidate → C2 • Compare candidate support with min_supp → L2

  18. Mining Frequent Itemsets – Apriori Algorithm • Generate C3 candidates from L2 by joining L2 ⋈ L2: C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}} • Two joining (lexicographically ordered) k-itemsets must share their first k−1 items → {I1, I2} is not joined with {I2, I4} • Not all subsets are frequent → prune (Apriori property) → C3 • Scan dataset for the count of each candidate • Compare candidate support with min_supp → L3

  19. Mining Frequent Itemsets – Apriori Algorithm • Not all subsets are frequent → prune • C4 = ∅ → Terminate

  20. The Apriori Algorithm – Exercise Database TDB, Supmin = 2

  21. The Apriori Algorithm – Exercise (Solution) Supmin = 2 Database TDB 1st scan: C1 → L1 2nd scan: C2 → L2 3rd scan: C3 → L3 C4 = ∅ → Terminate

  22. Mining Frequent Itemsets – Apriori Algorithm

  23. Apriori Algorithm • Generate Ck using Lk−1 to find Lk: Join, then Prune

  24. Mining Frequent Itemsets – Generating Association Rules from Frequent Itemsets

  25. Mining Frequent Itemsets – Generating Association Rules from Frequent Itemsets For a min_confidence = 70%
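The rule-generation step can be sketched as follows: for each nonempty proper subset A of a frequent itemset F, emit A ⇒ (F − A) if its confidence meets the threshold. A hypothetical Python helper (names are mine, not the lecture's):

```python
from itertools import combinations

def gen_rules(F, db, min_conf):
    """Strong rules A => (F - A) from one frequent itemset F (sketch)."""
    def count(s):
        return sum(1 for t in db if s <= t)
    rules = []
    for r in range(1, len(F)):                 # all nonempty proper subsets
        for a in combinations(sorted(F), r):
            a = frozenset(a)
            conf = count(F) / count(a)         # support(F) / support(A)
            if conf >= min_conf:
                rules.append((set(a), set(F - a), conf))
    return rules
```

On the nine-transaction example with F = {I1, I2, I5} and min_conf = 0.7, this keeps {I5} ⇒ {I1, I2}, {I1, I5} ⇒ {I2}, and {I2, I5} ⇒ {I1} (each with confidence 100%), while e.g. {I1} ⇒ {I2, I5} (2/6) is rejected.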

  26. Mining Frequent Itemsets – FP-Growth • Avoids costly candidate generation • Divide-and-conquer strategy: • Compress the database representation of the frequent items into a frequent pattern tree (FP-tree) – 2 passes over the dataset • Divide the compressed database (FP-tree) into conditional databases, then mine each for frequent itemsets – traverse the FP-tree

  27. Mining Frequent Itemsets – FP-Growth Transactional data example: N = 9, min_supp count = 2 • Scan dataset for the count of each candidate → C1 • Compare candidate support with min_supp → L1 - Reordered

  28. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree: null { } L1 - Reordered

  29. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree: null { } I2:1 I1:1 I5:1 T100 L1 - Reordered • Order of items is kept throughout path construction, with common prefixes shared whenever applicable

  30. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree: null { } I2:1 I4:1 I1:1 I5:1 L1 - Reordered T200

  31. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree: null { } I2:2 I4:1 I1:1 I5:1 L1 - Reordered T200

  32. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree: null { } I2:2 I4:1 I1:1 I5:1 I3:1 L1 - Reordered T300

  33. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree: null { } I2:3 I4:1 I1:1 I5:1 I3:1 L1 - Reordered T300

  34. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree (complete): null { } I1:2 I2:7 I3:2 I4:1 I3:2 I1:4 I4:1 I3:2 I5:1 I5:1 L1 - Reordered (for tree traversal) • Trace the node-link path for each node entry to get that item’s support count
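The two-pass construction shown across these slides can be sketched in Python. This is a minimal illustration with a hypothetical `Node` class (no header table or node links, which a full FP-growth implementation would add):

```python
class Node:
    """One FP-tree node: an item, its count, a parent, and children by item."""
    def __init__(self, item, parent):
        self.item, self.count = item, 1
        self.parent, self.children = parent, {}

def build_fp_tree(db, min_support):
    """Pass 1: count items and build the ordered frequent-item list.
    Pass 2: insert each transaction along a shared-prefix path."""
    counts = {}
    for t in db:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    # Descending frequency; name as a deterministic tie-break
    flist = [i for i in sorted(counts, key=lambda i: (-counts[i], i))
             if counts[i] >= min_support]
    root = Node(None, None)
    for t in db:
        node = root
        for item in (i for i in flist if i in t):   # frequent items, in order
            if item in node.children:
                node.children[item].count += 1      # shared prefix: bump count
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root, flist
```

On the nine-transaction example with min_supp = 2, the root's I2 branch carries count 7 with an I1:4 child, matching the completed tree above.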

  35. Mining Frequent Itemsets – FP-Growth – Frequent Pattern Mining FP-tree • Bottom-up algorithm – start from the leaves and go up to the root – I5, for example, has two paths to the root: I5:1 I5:1 I3:2 I4:1 I1:4 I3:2 I3:2 I2:7 I1:2 null { } I4:1 L1 - Reordered • {I3, I5} frequency < min_support count threshold

  36. Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction FP-tree for I5: null { } L1 - Reordered • Eliminate transactions not including I5 • Eliminate I5

  37. Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction FP-tree for I5: null { } I2:1 I1:1 L1 - Reordered • Eliminate transactions not including I5 • Eliminate I5

  38. Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction FP-tree for I5: null { } I2:2 I1:2 I3:1 L1 - Reordered • Eliminate transactions not including I5 • Eliminate I5
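The "eliminate transactions not including the item, then drop the item" step amounts to collecting each transaction's prefix path (in f-list order) ahead of the item. A hypothetical sketch, working from the transactions rather than the tree for simplicity:

```python
def conditional_pattern_base(item, db, flist):
    """Prefix paths preceding `item` in f-list order (sketch)."""
    base = []
    for t in db:
        if item in t:                              # keep only transactions with the item
            ordered = [i for i in flist if i in t] # frequent items, f-list order
            base.append(ordered[: ordered.index(item)])  # drop the item and its suffix
    return base
```

For I5 on the nine-transaction example this yields the two paths [I2, I1] and [I2, I1, I3]; counting items in these paths, I2:2 and I1:2 survive the min_supp count of 2 while I3:1 is pruned, giving the conditional FP-tree shown above.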

  39. Mining Frequent Itemsets – FP-Growth • Paths for which the item is a suffix • Prefix paths to the item after eliminating infrequent items

  40. Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction FP-tree for I4: null { } I2:2 I1:1 L1 - Reordered • Eliminate transactions not including I4 • Eliminate I4

  41. Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction FP-tree for I3: null { } I1:2 I2:4 I1:2 L1 - Reordered • Eliminate transactions not including I3 • Eliminate I3

  42. Mining Frequent Itemsets – FP-Growth

  43. Construct FP-Tree Exercise TID Items bought 100 {f, a, c, d, g, i, m, p} 200 {a, b, c, f, l, m, o} 300 {b, f, h, j, o, w} 400 {b, c, k, s, p} 500 {a, f, c, e, l, p, m, n} min_support = 3

  44. Construct FP-Tree Exercise (Solution) {} Header Table: Item frequency head f 4 c 4 a 3 b 3 m 3 p 3 f:4 c:1 c:3 b:1 b:1 a:3 p:1 m:2 b:1 p:2 m:1 TID Items bought (ordered) frequent items 100 {f, a, c, d, g, i, m, p} {f, c, a, m, p} 200 {a, b, c, f, l, m, o} {f, c, a, b, m} 300 {b, f, h, j, o, w} {f, b} 400 {b, c, k, s, p} {c, b, p} 500 {a, f, c, e, l, p, m, n} {f, c, a, m, p} min_support = 3 • Scan DB once, find frequent 1-itemsets (single-item patterns) • Sort frequent items in frequency-descending order → f-list • Scan DB again, construct FP-tree • F-list = f-c-a-b-m-p
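The first scan of this exercise (count items, keep those with support ≥ 3, order by descending frequency) can be checked with a few lines of Python. Note that the order among equal counts is arbitrary; the slide's f-list breaks the f/c tie as f-c-a-b-m-p:

```python
# Pass 1 of the exercise: item counts and the f-list.
db = [set("facdgimp"), set("abcflmo"), set("bfhjow"),
      set("bcksp"), set("afcelpmn")]

counts = {}
for t in db:
    for i in t:
        counts[i] = counts.get(i, 0) + 1

# Keep items meeting min_support = 3, then order by descending frequency
frequent = {i: c for i, c in counts.items() if c >= 3}
flist = sorted(frequent, key=lambda i: -frequent[i])

print(sorted(frequent.items()))
# [('a', 3), ('b', 3), ('c', 4), ('f', 4), ('m', 3), ('p', 3)]
```

Items such as d, g, i, l, o (count < 3) drop out before the second scan, which is why they vanish from the "(ordered) frequent items" column above.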

  45. Agenda

  46. Pattern Evaluation Methods • Not all association rules are interesting • buys(X, “computer games”) ⇒ buys(X, “videos”) [support = 40%, confidence = 66%] • P(“videos”) is already 75% > 66% • The two items are negatively associated → buying one decreases the likelihood of buying the other • We need to measure the “real strength” of a rule • Correlation analysis

  47. Pattern Evaluation Methods • A and B are independent if P(A ∪ B) = P(A) × P(B) • Otherwise, their occurrence is dependent and correlated • lift(A, B) = P(A ∪ B) / (P(A) × P(B)) • If lift(A, B) < 1, A is negatively correlated with B • If lift(A, B) > 1, A is positively correlated with B → A’s occurrence “lifts” the occurrence of B • χ² → already discussed in a previous lecture
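The lift measure is a one-line computation over the transaction database. A hypothetical sketch (here P(A ∪ B) denotes, as on the slide, the fraction of transactions containing both itemsets):

```python
def lift(a, b, db):
    """lift(A, B) = P(A and B) / (P(A) * P(B)).
    > 1: positive correlation, < 1: negative, = 1: independent."""
    n = len(db)
    p_a = sum(1 for t in db if a <= t) / n
    p_b = sum(1 for t in db if b <= t) / n
    p_ab = sum(1 for t in db if (a | b) <= t) / n
    return p_ab / (p_a * p_b)
```

With the figures implied by the previous slide's games/videos example (60% buy games, 75% buy videos, 40% buy both), lift = 0.4 / (0.6 × 0.75) ≈ 0.89 < 1, confirming the negative correlation even though the rule's confidence looked high.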

  48. References • Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Third Edition, Elsevier, 2012 • Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011 • Markus Hofmann and Ralf Klinkenberg, RapidMiner: Data Mining Use Cases and Business Analytics Applications, CRC Press / Taylor & Francis Group, 2014 • Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005 • Ethem Alpaydin, Introduction to Machine Learning, 3rd ed., MIT Press, 2014 • Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011 • Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, Second Edition, Springer, 2010 • Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, World Scientific, 2007

  49. Q&A
