
CSC-480 Data Mining

This lecture provides an introduction to association rule mining, including the basics, motivation, and real-life examples. It explains the Apriori algorithm and FP-Growth algorithm for mining frequent itemsets and generating association rules.







  1. CSC-480 Data Mining Lecture 03 – Association Rule Mining Muhammad Tariq Siddique https://sites.google.com/site/mtsiddiquecs/dm

  2. Gentle Reminder “Switch Off” your Mobile Phone Or Switch Mobile Phone to “Silent Mode”

  3. Agenda

  4. Agenda

  5. The Basics Which items are frequently purchased together by customers?

  6. The Basics • Motivation: Business transaction records • Discovery of interesting correlation relationships that help business decision-making processes (catalog design, cross-marketing, customer shopping behavior analysis, …)

  7. The Basics • How to place SW, HW, and Accessories?

  8. Future Store

  9. A Real Example

  10. The Basics Data Mining 2013 – Mining Frequent Patterns, Association, and Correlations

  11. The Basics - Frequent Itemsets Transaction Dataset Itemset Occurrence Frequency Frequent Itemset

  12. The Basics - Association Rules

  13. The Basics - Association Rules • If the frequency of itemset I satisfies the min_support count, then I is a frequent itemset • If a rule satisfies both min_support and min_confidence thresholds, it is said to be strong • The problem of mining association rules reduces to mining frequent itemsets • Association rule mining becomes a two-step process: • Find all frequent itemsets, i.e., those with frequency ≥ a predetermined min_support count (the most costly step) • Generate strong association rules from the frequent itemsets, i.e., rules that satisfy min_support and min_confidence
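The two quantities behind this slide, support and confidence, can be sketched in Python. The helper names are hypothetical (not from the lecture), and the nine-transaction dataset from the later Apriori slides is assumed:

```python
# Nine-transaction example dataset (as used on the Apriori slides).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def support_count(itemset, db):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def confidence(antecedent, consequent, db):
    """confidence(A => B) = support_count(A union B) / support_count(A)."""
    return support_count(antecedent | consequent, db) / support_count(antecedent, db)

# {I1, I2} occurs in 4 of the 9 transactions
print(support_count({"I1", "I2"}, transactions))           # 4
# confidence(I1 => I2) = 4 / 6
print(round(confidence({"I1"}, {"I2"}, transactions), 2))  # 0.67
```

A rule is then "strong" when its support count meets min_support and its confidence meets min_confidence.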

  14. Agenda

  15. Mining Frequent Itemsets – Apriori Algorithm • Finds frequent itemsets by exploiting prior knowledge of frequent itemset properties • Level-wise search, where frequent k-itemsets are used to explore (k+1)-itemsets • Goes as follows: • Find frequent 1-itemsets → L1 • Use L1 to find frequent 2-itemsets → L2 • … until no more frequent k-itemsets can be found • Computing each Lk requires a full dataset scan • To improve efficiency, use the Apriori property: • “All nonempty subsets of a frequent itemset must also be frequent” – if a set cannot pass a test, all of its supersets will fail the same test as well – if P(I) < min_support then P(I ∪ A) < min_support
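The level-wise search described above can be sketched as a short Python function. This is a minimal illustration, not the lecture's own code; `db` is assumed to be a list of item sets:

```python
from itertools import combinations

def apriori(db, min_support):
    """Level-wise frequent-itemset mining (minimal sketch)."""
    # L1: frequent 1-itemsets
    items = sorted({i for t in db for i in t})
    L = [frozenset([i]) for i in items
         if sum(1 for t in db if i in t) >= min_support]
    frequent = list(L)
    k = 2
    while L:
        # Join step: combine (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # One full dataset scan per level to count surviving candidates
        L = [c for c in candidates
             if sum(1 for t in db if c <= t) >= min_support]
        frequent += L
        k += 1
    return frequent
```

On the nine-transaction example with min_support = 2, this reproduces the levels on the following slides, ending with L3 = {{I1, I2, I3}, {I1, I2, I5}} and an empty C4.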

  16. Mining Frequent Itemsets – Apriori Algorithm Transactional data example: N = 9, min_supp count = 2 • Scan dataset for the count of each candidate → C1 • Compare candidate support with min_supp → L1

  17. Mining Frequent Itemsets – Apriori Algorithm • Generate C2 candidates from L1 by joining L1 ⋈ L1 • Scan dataset for the count of each candidate → C2 • Compare candidate support with min_supp → L2

  18. Mining Frequent Itemsets – Apriori Algorithm • Generate C3 candidates from L2 by joining L2 ⋈ L2: C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}} • Two joining (lexicographically ordered) k-itemsets must share their first k−1 items → {I1, I2} is not joined with {I2, I4} • Not all subsets are frequent → prune (Apriori property) → C3 • Scan dataset for the count of each candidate • Compare candidate support with min_supp → L3

  19. Mining Frequent Itemsets – Apriori Algorithm • Not all subsets are frequent → prune • C4 = ∅ → Terminate

  20. The Apriori Algorithm – Exercise Database TDB, Supmin = 2

  21. The Apriori Algorithm – Exercise (Solution) Supmin = 2 Database TDB 1st scan: C1 → L1 2nd scan: C2 → L2 3rd scan: C3 → L3 C4 = ∅ → Terminate

  22. Mining Frequent Itemsets – Apriori Algorithm

  23. Apriori Algorithm • Generate Ck using Lk−1 to find Lk: Join, then Prune

  24. Mining Frequent Itemsets – Generating Association Rules from Frequent Itemsets

  25. Mining Frequent Itemsets – Generating Association Rules from Frequent Itemsets For a min_confidence = 70%
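The rule-generation step can be sketched as follows: for each nonempty proper subset A of a frequent itemset F, emit A ⇒ (F − A) if its confidence meets the threshold. A hypothetical Python helper (names are mine, not the lecture's):

```python
from itertools import combinations

def gen_rules(F, db, min_conf):
    """Strong rules A => (F - A) from one frequent itemset F (sketch)."""
    def count(s):
        return sum(1 for t in db if s <= t)
    rules = []
    for r in range(1, len(F)):                 # all nonempty proper subsets
        for a in combinations(sorted(F), r):
            a = frozenset(a)
            conf = count(F) / count(a)         # support(F) / support(A)
            if conf >= min_conf:
                rules.append((set(a), set(F - a), conf))
    return rules
```

On the nine-transaction example with F = {I1, I2, I5} and min_conf = 0.7, this keeps {I5} ⇒ {I1, I2}, {I1, I5} ⇒ {I2}, and {I2, I5} ⇒ {I1} (each with confidence 100%), while e.g. {I1} ⇒ {I2, I5} (2/6) is rejected.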

  26. Mining Frequent Itemsets – FP-Growth • Avoids costly candidate generation • Divide-and-conquer strategy: • Compress the database representation of the frequent items into a frequent pattern tree (FP-tree) – 2 passes over the dataset • Divide the compressed database (FP-tree) into conditional databases, then mine each for frequent itemsets – traverse the FP-tree

  27. Mining Frequent Itemsets – FP-Growth Transactional data example: N = 9, min_supp count = 2 • Scan dataset for the count of each candidate → C1 • Compare candidate support with min_supp → L1 - Reordered

  28. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree: null { } L1 - Reordered

  29. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree: null { } I2:1 I1:1 I5:1 T100 L1 - Reordered • Order of items is kept throughout path construction, with common prefixes shared whenever applicable

  30. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree: null { } I2:1 I4:1 I1:1 I5:1 L1 - Reordered T200

  31. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree: null { } I2:2 I4:1 I1:1 I5:1 L1 - Reordered T200

  32. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree: null { } I2:2 I4:1 I1:1 I5:1 I3:1 L1 - Reordered T300

  33. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree: null { } I2:3 I4:1 I1:1 I5:1 I3:1 L1 - Reordered T300

  34. Mining Frequent Itemsets – FP-Growth – FP-tree Construction FP-tree (complete): null { } I1:2 I2:7 I3:2 I4:1 I3:2 I1:4 I4:1 I3:2 I5:1 I5:1 L1 - Reordered (for tree traversal) • Trace the node-link path for each node entry to get that item’s support count
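The two-pass construction shown across these slides can be sketched in Python. This is a minimal illustration with a hypothetical `Node` class (no header table or node links, which a full FP-growth implementation would add):

```python
class Node:
    """One FP-tree node: an item, its count, a parent, and children by item."""
    def __init__(self, item, parent):
        self.item, self.count = item, 1
        self.parent, self.children = parent, {}

def build_fp_tree(db, min_support):
    """Pass 1: count items and build the ordered frequent-item list.
    Pass 2: insert each transaction along a shared-prefix path."""
    counts = {}
    for t in db:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    # Descending frequency; name as a deterministic tie-break
    flist = [i for i in sorted(counts, key=lambda i: (-counts[i], i))
             if counts[i] >= min_support]
    root = Node(None, None)
    for t in db:
        node = root
        for item in (i for i in flist if i in t):   # frequent items, in order
            if item in node.children:
                node.children[item].count += 1      # shared prefix: bump count
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root, flist
```

On the nine-transaction example with min_supp = 2, the root's I2 branch carries count 7 with an I1:4 child, matching the completed tree above.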

  35. Mining Frequent Itemsets – FP-Growth – Frequent Pattern Mining FP-tree • Bottom-up algorithm – start from the leaves and go up to the root – I5, for example, has two paths to the root: I5:1 I5:1 I3:2 I4:1 I1:4 I3:2 I3:2 I2:7 I1:2 null { } I4:1 L1 - Reordered • {I3, I5} frequency < min_support count threshold

  36. Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction FP-tree for I5: null { } L1 - Reordered • Eliminate transactions not including I5 • Eliminate I5

  37. Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction FP-tree for I5: null { } I2:1 I1:1 L1 - Reordered • Eliminate transactions not including I5 • Eliminate I5

  38. Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction FP-tree for I5: null { } I2:2 I1:2 I3:1 L1 - Reordered • Eliminate transactions not including I5 • Eliminate I5
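The "eliminate transactions not including the item, then drop the item" step amounts to collecting each transaction's prefix path (in f-list order) ahead of the item. A hypothetical sketch, working from the transactions rather than the tree for simplicity:

```python
def conditional_pattern_base(item, db, flist):
    """Prefix paths preceding `item` in f-list order (sketch)."""
    base = []
    for t in db:
        if item in t:                              # keep only transactions with the item
            ordered = [i for i in flist if i in t] # frequent items, f-list order
            base.append(ordered[: ordered.index(item)])  # drop the item and its suffix
    return base
```

For I5 on the nine-transaction example this yields the two paths [I2, I1] and [I2, I1, I3]; counting items in these paths, I2:2 and I1:2 survive the min_supp count of 2 while I3:1 is pruned, giving the conditional FP-tree shown above.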

  39. Mining Frequent Itemsets – FP-Growth • Paths for which the item is a suffix • Prefix paths to the item after eliminating infrequent items

  40. Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction FP-tree for I4: null { } I2:2 I1:1 L1 - Reordered • Eliminate transactions not including I4 • Eliminate I4

  41. Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction FP-tree for I3: null { } I1:2 I2:4 I1:2 L1 - Reordered • Eliminate transactions not including I3 • Eliminate I3

  42. Mining Frequent Itemsets – FP-Growth

  43. Construct FP-Tree Exercise TID Items bought 100 {f, a, c, d, g, i, m, p} 200 {a, b, c, f, l, m, o} 300 {b, f, h, j, o, w} 400 {b, c, k, s, p} 500 {a, f, c, e, l, p, m, n} min_support = 3

  44. Construct FP-Tree Exercise (Solution) {} Header Table: Item frequency head f 4 c 4 a 3 b 3 m 3 p 3 f:4 c:1 c:3 b:1 b:1 a:3 p:1 m:2 b:1 p:2 m:1 TID Items bought (ordered) frequent items 100 {f, a, c, d, g, i, m, p} {f, c, a, m, p} 200 {a, b, c, f, l, m, o} {f, c, a, b, m} 300 {b, f, h, j, o, w} {f, b} 400 {b, c, k, s, p} {c, b, p} 500 {a, f, c, e, l, p, m, n} {f, c, a, m, p} min_support = 3 • Scan DB once, find frequent 1-itemsets (single-item patterns) • Sort frequent items in frequency-descending order → f-list • Scan DB again, construct FP-tree • F-list = f-c-a-b-m-p
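The first scan of this exercise (count items, keep those with support ≥ 3, order by descending frequency) can be checked with a few lines of Python. Note that the order among equal counts is arbitrary; the slide's f-list breaks the f/c tie as f-c-a-b-m-p:

```python
# Pass 1 of the exercise: item counts and the f-list.
db = [set("facdgimp"), set("abcflmo"), set("bfhjow"),
      set("bcksp"), set("afcelpmn")]

counts = {}
for t in db:
    for i in t:
        counts[i] = counts.get(i, 0) + 1

# Keep items meeting min_support = 3, then order by descending frequency
frequent = {i: c for i, c in counts.items() if c >= 3}
flist = sorted(frequent, key=lambda i: -frequent[i])

print(sorted(frequent.items()))
# [('a', 3), ('b', 3), ('c', 4), ('f', 4), ('m', 3), ('p', 3)]
```

Items such as d, g, i, l, o (count < 3) drop out before the second scan, which is why they vanish from the "(ordered) frequent items" column above.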

  45. Agenda

  46. Pattern Evaluation Methods • Not all association rules are interesting • buys(X, “computer games”) ⇒ buys(X, “videos”) [support = 40%, confidence = 66%] • P(“videos”) is already 75% > 66% • The two items are negatively associated → buying one decreases the likelihood of buying the other • We need to measure the “real strength” of a rule • Correlation analysis

  47. Pattern Evaluation Methods • A and B are independent if P(A ∪ B) = P(A) × P(B) • Otherwise, their occurrence is dependent and correlated • lift(A, B) = P(A ∪ B) / (P(A) × P(B)) • If lift(A, B) < 1, A is negatively correlated with B • If lift(A, B) > 1, A is positively correlated with B → A’s occurrence “lifts” the occurrence of B • χ² → already discussed in a previous lecture
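The lift measure is a one-line computation over the transaction database. A hypothetical sketch (here P(A ∪ B) denotes, as on the slide, the fraction of transactions containing both itemsets):

```python
def lift(a, b, db):
    """lift(A, B) = P(A and B) / (P(A) * P(B)).
    > 1: positive correlation, < 1: negative, = 1: independent."""
    n = len(db)
    p_a = sum(1 for t in db if a <= t) / n
    p_b = sum(1 for t in db if b <= t) / n
    p_ab = sum(1 for t in db if (a | b) <= t) / n
    return p_ab / (p_a * p_b)
```

With the figures implied by the previous slide's games/videos example (60% buy games, 75% buy videos, 40% buy both), lift = 0.4 / (0.6 × 0.75) ≈ 0.89 < 1, confirming the negative correlation even though the rule's confidence looked high.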

  48. References • Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Third Edition, Elsevier, 2012 • Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011 • Markus Hofmann and Ralf Klinkenberg, RapidMiner: Data Mining Use Cases and Business Analytics Applications, CRC Press / Taylor & Francis Group, 2014 • Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005 • Ethem Alpaydin, Introduction to Machine Learning, 3rd ed., MIT Press, 2014 • Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011 • Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, Second Edition, Springer, 2010 • Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, World Scientific, 2007

  49. Q&A
