
Α. Νανόπουλος, Γ. Μανωλόπουλος: Εισαγωγή στην Εξόρυξη και τις Αποθήκες Δεδομένων

Α. Νανόπουλος & Γ. Μανωλόπουλος, Εισαγωγή στην Εξόρυξη & τις Αποθήκες Δεδομένων, Κεφάλαιο 8: Κανόνες Συσχέτισης (Chapter 8: Association Rules). http://delab.csd.auth.gr/books/grBooks/grBooks.html





Presentation Transcript


  1. Α. Νανόπουλος & Γ. Μανωλόπουλος, Εισαγωγή στην Εξόρυξη & τις Αποθήκες Δεδομένων, Κεφάλαιο 8: Κανόνες Συσχέτισης (Chapter 8: Association Rules). http://delab.csd.auth.gr/books/grBooks/grBooks.html

  2. Associations • Rules expressing relationships between items • Example: cereal, milk → fruit, i.e., "people who bought cereal and milk also bought fruit." Stores might want to offer specials on milk and cereal to get people to buy more fruit.

  3. Market Basket Analysis • Analyze tables of transactions • Can we hypothesize rules such as: • Chips → Salsa • Lettuce → Spinach

  4. Definitions • Let I = {i1, …, im} be a set of literals called items • Let D be a set of transactions • for each T ∈ D, T ⊆ I • Itemset X: any X ⊆ I • k-itemset: |X| = k • X is contained in T if X ⊆ T • support s(X): fraction of transactions containing X
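The support definition above can be sketched in a few lines of Python. The four-transaction database below is an illustrative assumption, not taken from the slides:

```python
def support(itemset, transactions):
    """s(X): fraction of transactions T with X contained in T."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Hypothetical toy database of four market baskets.
D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support({"A"}, D))       # 0.75
print(support({"A", "C"}, D))  # 0.5
```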

  5. Rule Measures: Support and Confidence • Find all rules Beer → Diaper with minimum confidence and support • support, s: probability that a transaction contains {Beer, Diaper} • confidence, c: conditional probability that a transaction containing {Beer} also contains {Diaper} • (Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both) • With minimum support 50% and minimum confidence 50%, we have: • A → C (50%, 66.6%) • C → A (50%, 100%)

  6. Association rules: Problem description • Association rule: X → Y, X ⊆ I, Y ⊆ I, X ∩ Y = ∅ • Support of rule = s(X ∪ Y) • Confidence of rule = s(X ∪ Y) / s(X) • Problem: find all association rules with confidence larger than MINCONF and support larger than MINSUP • Notice that if X → Y and Y → Z, then not necessarily X → Z, because X → Z may not have enough support or confidence

  7. Mining Association Rules: An Example • Min. support 50%, min. confidence 50% • For rule A → C: support = support({A, C}) = 50%, confidence = support({A, C}) / support({A}) = 66.6%
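The numbers on this slide can be reproduced with a short sketch. The example database itself is an assumption (a classic four-transaction toy set chosen so that A → C has support 50% and confidence 66.6%):

```python
def support(itemset, transactions):
    """s(X): fraction of transactions that contain every item of X."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """c(X -> Y) = s(X u Y) / s(X)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# Assumed example database.
D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support({"A", "C"}, D))                 # 0.5
print(round(confidence({"A"}, {"C"}, D), 3))  # 0.667
```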

  8. Association Rule Mining • Two basic steps • Find all frequent itemsets • Satisfying minimum support • Find all strong association rules • Generate association rules from frequent itemsets • Keep rules satisfying minimum confidence

  9. Generating Frequent Itemsets: Naive algorithm

n ← |D|
for each subset s of I do
    l ← 0
    for each transaction T in D do
        if s is a subset of T then l ← l + 1
    if minimum support ≤ l/n then
        add s to frequent subsets
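The naive algorithm above translates directly to a runnable Python sketch. The four-transaction database is an illustrative assumption:

```python
from itertools import combinations

def naive_frequent(items, transactions, minsup):
    """Count every non-empty subset of I with one full scan of D each."""
    n = len(transactions)
    frequent = []
    for k in range(1, len(items) + 1):
        for s in combinations(sorted(items), k):
            l = sum(1 for t in transactions if set(s) <= t)  # scan of D
            if l / n >= minsup:
                frequent.append(frozenset(s))
    return frequent

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = naive_frequent(set().union(*D), D, 0.5)
```

On this database with minimum support 50%, nine itemsets are frequent, the largest being {B, C, E}.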

  10. Analysis of naive algorithm • O(2^m) subsets of I • Scan n transactions for each subset • O(2^m · n) tests of s being a subset of T • Growth is exponential in the number of items! • Can we do better?

  11. Reducing Number of Candidates: Apriori • Apriori principle: • If an itemset is frequent, then all of its subsets must also be frequent • Apriori principle holds due to the following property of the support measure: • Support of an itemset never exceeds the support of its subsets • This is known as the anti-monotone property of support

  12. Illustrating Apriori Principle • (Figure: itemset lattice; an itemset found to be infrequent has all of its supersets pruned)

  13. Illustrating Apriori Principle • Items (1-itemsets), pairs (2-itemsets): no need to generate candidates involving Coke or Eggs • Minimum support = 3 • Triplets (3-itemsets) • If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 • With support-based pruning: 6 + 6 + 1 = 13

  14. The Apriori Algorithm • Join step: Ck is generated by joining Lk-1 with itself • Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset • Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

L1 = {frequent items}
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk
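The pseudo-code above can be sketched compactly in Python. Minimum support is given as an absolute count, and the join is done here by unioning pairs of (k−1)-itemsets rather than the ordered SQL-style join of the next slide; the example database is an assumption:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise mining: L_k seeds C_{k+1}; one scan of D per level."""
    items = sorted(set().union(*transactions))
    L = [frozenset([i]) for i in items
         if sum(1 for t in transactions if i in t) >= minsup_count]
    frequent = list(L)
    k = 2
    while L:
        Lset = set(L)
        # join: union of two (k-1)-itemsets that differ in exactly one item
        C = {a | b for a in L for b in L if len(a | b) == k}
        # prune: drop candidates having an infrequent (k-1)-subset
        C = {c for c in C
             if all(frozenset(s) in Lset for s in combinations(c, k - 1))}
        # one scan of the database counts all surviving candidates
        L = [c for c in C
             if sum(1 for t in transactions if c <= t) >= minsup_count]
        frequent.extend(L)
        k += 1
    return frequent

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(D, 2)  # minsup = 50% of 4 transactions
```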

  15. How to Generate Candidates? • Suppose the items in Lk-1 are listed in an order • Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

• Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck

  16. Example of Generating Candidates • L3 = {abc, abd, acd, ace, bcd} • Self-joining: L3 * L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4 = {abcd}
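The join and prune steps, applied to this slide's L3, can be checked with a short sketch:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Self-join L_{k-1} on the first k-2 items, then prune candidates
    that have a (k-1)-subset outside L_{k-1}."""
    prev = {tuple(sorted(s)) for s in L_prev}
    out = set()
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:  # join step
                c = p + (q[-1],)
                if all(s in prev for s in combinations(c, k - 1)):  # prune step
                    out.add(c)
    return out

L3 = ["abc", "abd", "acd", "ace", "bcd"]
print(apriori_gen(L3, 4))  # {('a', 'b', 'c', 'd')} -- acde pruned, ade not in L3
```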

  17. How to Count Supports of Candidates? • Candidate counting: • Scan the database of transactions to determine the support of each candidate itemset • To reduce the number of comparisons, store the candidates in a hash structure • Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

  18. Example • (Figure: hash-structure candidate counting example with b = 2, c = 3)

  19. The Apriori Algorithm — Example • (Figure: database D; a scan of D produces C1 and L1; C2 is generated and a second scan produces L2; C3 is generated and a third scan produces L3)

  20. Another example of Apriori • Minsup = 2

  21. Another example of Apriori

  22. Generating rules (2nd sub-problem) • Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement • If {A,B,C,D} is a frequent itemset, the candidate rules are: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB • If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
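This second sub-problem can be sketched as follows; `support_of` is assumed to hold the already-computed supports of all frequent itemsets, and the counts below are a hypothetical example for the itemset {B, C, E}:

```python
from itertools import combinations

def gen_rules(L, support_of, minconf):
    """Emit f -> L - f for every non-empty proper subset f of frequent
    itemset L, keeping rules with confidence >= minconf."""
    L = frozenset(L)
    rules = []
    for r in range(1, len(L)):                # 2^k - 2 candidates in total
        for f in combinations(sorted(L), r):
            f = frozenset(f)
            conf = support_of[L] / support_of[f]
            if conf >= minconf:
                rules.append((f, L - f, conf))
    return rules

# Hypothetical support counts for {B, C, E} and its subsets.
s = {frozenset(x): n for x, n in
     [("B", 3), ("C", 3), ("E", 3), ("BC", 2), ("BE", 3), ("CE", 2), ("BCE", 2)]}
print(len(gen_rules("BCE", s, 0.0)))  # 6 = 2^3 - 2 candidate rules
print(len(gen_rules("BCE", s, 1.0)))  # 2: BC -> E and CE -> B
```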

  23. Rule Generation with anti-monotone property • How to efficiently generate rules from frequent itemsets? • In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D) • But confidence of rules generated from the same itemset has an anti-monotone property • e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD) • Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

  24. Rule Generation: example of anti-monotonicity • (Figure: lattice of rules; once a low-confidence rule is found, the rules below it are pruned)

  25. Methods to Improve Apriori’s Efficiency • Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent • Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans • Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB • Sampling: mining on a subset of given data, lower support threshold + a method to determine the completeness • Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent

  26. Hash-based method • |C2| = C(|L1|, 2) = O(|L1|^2) • |L1| may be thousands, so |C2| is the bottleneck • While counting the support of C1: • find the 2-itemsets in each transaction • hash them into buckets, accumulating counts • if a 2-itemset belongs to a bucket with count less than MINSUP, then it cannot be large
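A minimal sketch of this idea (a DHP/PCY-style hash filter). The bucket-array size and example database are assumptions; note that hash collisions can only keep extra pairs, never drop a truly frequent one:

```python
from itertools import combinations

def hash_filtered_c2(transactions, n_buckets, minsup_count):
    """While counting 1-itemsets, hash every 2-itemset of each transaction
    into a small bucket array; a pair whose bucket count stays below
    MINSUP cannot be frequent, so it is excluded from C2."""
    item_count, buckets = {}, [0] * n_buckets
    for t in transactions:
        for i in t:
            item_count[i] = item_count.get(i, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    L1 = sorted(i for i, n in item_count.items() if n >= minsup_count)
    # candidate pairs from L1 x L1 whose bucket is frequent
    return {p for p in combinations(L1, 2)
            if buckets[hash(p) % n_buckets] >= minsup_count}

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
C2 = hash_filtered_c2(D, 128, 2)
```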

  27. Example of hash-based method • Transactions: 100 ACD, 200 BCE, 300 ABCE, 400 BE • C1: {A} (2), {B} (3), {C} (3), {D} (1), {E} (3) • L1: {A}, {B}, {C}, {E} • L1 × L1: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E} • Hash buckets (contents → count): {C,E}, {C,E}, {A,D} → 3; {B,C}, {B,C}, {A,E} → 3; (empty) → 0; {B,E}, {B,E}, {B,E} → 3; {A,B} → 1; {A,C}, {C,D}, {A,C} → 3 • {A,B} falls in a bucket with count 1 < MINSUP, so it is pruned • C2: {A,C}, {A,E}, {B,C}, {B,E}, {C,E}

  28. Transaction trimming • While counting Ck (phase k): • an item in a transaction t can be trimmed if it does not appear in at least k members of Ck (candidate k-itemsets) • example: ABC, ABD, and BCD are candidates, but ACD is not • A then appears in only two candidates, so A can be trimmed from t: it cannot participate in forthcoming passes (ABCD will not be generated, since its subset ACD is not a candidate) • if t contains no more than k items, discard it
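The trimming rule as stated on this slide can be sketched as follows; the criterion counts an item's occurrences across the whole candidate set Ck, and the example C2 is an assumption:

```python
def trim(t, Ck, k):
    """Keep an item of transaction t only if it occurs in at least k
    candidate k-itemsets of Ck; discard t if fewer than k items survive."""
    kept = {i for i in t if sum(1 for c in Ck if i in c) >= k}
    return kept if len(kept) >= k else None  # None = transaction discarded

C2 = [{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]
print(trim({"A", "C", "D"}, C2, 2))            # None: only C survives, too short
print(sorted(trim({"A", "B", "C", "E"}, C2, 2)))  # ['B', 'C', 'E']
```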

  29. Example of transaction trimming • Transactions: 100 ACD, 200 BCE, 300 ABCE, 400 BE • C2 = {A,C}, {B,C}, {B,E}, {C,E} • 100 ACD → only {C} survives, discarded due to length • 200 BCE → {B,C,E} • 300 ABCE → {B,C,E} • 400 BE → {B,E}

  30. Partitioning method • Logically partitions the database into a number of non-overlapping partitions • each partition fits entirely in main memory • Finds the large itemsets in each partition (local large itemsets) • multiple phases are performed in main memory, I/O only to read the partition once • Determine which of the local large are also global large itemsets • in one scan of the database • no global large is missed

  31. Partition Algorithm

partition_database(D)
n = number of partitions
// Phase I
for i = 1 to n do
    read i-th partition
    Li = gen_all_large_itemsets(i-th partition)
// Phase II
C = L1 ∪ … ∪ Ln
for i = 1 to n do
    read i-th partition
    for each candidate c ∈ C
        update c.count
L = {c | c ∈ C, c.count ≥ MINSUP}
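A two-phase sketch of the Partition algorithm. The naive in-memory miner below stands in for a real Apriori run over one partition, and the example database is an assumption:

```python
from itertools import combinations

def local_large(part, minsup):
    """Naive in-memory miner for one partition (stand-in for Apriori)."""
    items = sorted(set().union(*part))
    return {frozenset(s)
            for k in range(1, len(items) + 1)
            for s in combinations(items, k)
            if sum(1 for t in part if set(s) <= t) >= minsup * len(part)}

def partition_mine(transactions, n_parts, minsup):
    size = -(-len(transactions) // n_parts)            # ceil(|D| / n)
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]
    # Phase I: mine each partition in memory; the union of the local
    # large itemsets is the global candidate set
    C = set().union(*(local_large(p, minsup) for p in parts))
    # Phase II: one scan of D counts every candidate globally
    return {c for c in C
            if sum(1 for t in transactions if c <= t) >= minsup * len(transactions)}

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
L = partition_mine(D, 2, 0.5)  # no global large itemset is missed
```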

  32. Partitioning example

  33. Why is Partition correct? • We have to show that no global large itemset is missed • A global large itemset X is not missed if it is a candidate in the 2nd phase, i.e., when it is local large in at least one partition • Let X be a global large itemset that is not local large in any partition; with t(X, P) the count of X in P: • t(X, Pi) < MINSUP · |Pi| for all 1 ≤ i ≤ n ⇒ Σi t(X, Pi) < MINSUP · Σi |Pi| ⇒ t(X, D) < MINSUP · |D| ⇒ s(X) < MINSUP (contradiction)

  34. Size of candidate set for the 2nd phase • Based on the assumption that the size |C| of the union of all local large itemsets is small enough • comparable to |L| (global large) • Since the same MINSUP is used, one can expect roughly the same size as |L| • at worst case n · |L| • in practice, many overlap (the same itemsets are large in different partitions) • For small n, |C| is very close to |L| • For larger n, |C| is larger, but not the worst case

  35. Sampling method • Main idea • get a random sample s of database D • find the large itemsets S of s • verify which members of S are large in D (one scan) • identify potentially “missed” large itemsets (possibly a 2nd scan) • Gains • mining s is free in terms of I/O • one scan of D in the best case and two in the worst

  36. Negative border

  37. How not to miss anything? • Goal is to avoid missing any itemset that is frequent in the full set of baskets • If X is frequent in D but not frequent in S, then there exists Y ⊆ X with Y ∈ Bd⁻(S) • We check whether anything in the negative border is actually frequent in D • Try to choose the Minsup used in S so that the probability of failure is low, while the number of itemsets checked on the second pass fits in main memory

  38. Sampling algorithm

  39. Example • Domain = {A, B, C, D, E, F} • S = {A,B,C}, {A,C,F}, {A,D}, {B,D} • Bd⁻(S) = {B,F}, {C,D}, {D,F}, {E}: candidates not found large in the sample • L = {A,B}, {B,F}, {A,C,F} are large in D • {B,F} represents a “miss” • {A,B}, {B,F} and {A,F} generate {A,B,F}, which could possibly be large but was not counted

  40. Is Apriori Fast Enough? — Performance Bottlenecks • The core of the Apriori algorithm: • Use frequent (k − 1)-itemsets to generate candidate frequent k-itemsets • Use database scans and pattern matching to collect counts for the candidate itemsets • The bottleneck of Apriori: candidate generation • Huge candidate sets: • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates • Multiple scans of database: • Needs (n + 1) scans, where n is the length of the longest pattern

  41. Mining Frequent Patterns Without Candidate Generation • Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure • highly condensed, but complete for frequent pattern mining • avoid costly database scans • Develop an efficient, FP-tree-based frequent pattern mining method • A divide-and-conquer methodology: decompose mining tasks into smaller ones • Avoid candidate generation: sub-database test only!

  42. Construct FP-tree from a Transaction DB • min_support = 0.5

TID | Items bought | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

• Steps: • Scan DB once, find frequent 1-itemsets (single-item patterns) • Order frequent items in frequency-descending order • Scan DB again, construct FP-tree • (Figure: FP-tree with header table f 4, c 4, a 3, b 3, m 3, p 3; root {} with branch f:4 → c:3 → a:3, where a:3 has children m:2 → p:2 and b:1 → m:1, f:4 also has child b:1, and a second branch c:1 → b:1 → p:1)
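The two-scan construction can be sketched as follows (node layout only; the header table's node links are omitted for brevity, and ties in frequency are broken alphabetically here, so the root branch may differ from the slide's figure):

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fptree(transactions, minsup_count):
    """Scan 1: count items and fix a frequency-descending order.
    Scan 2: insert each transaction's ordered frequent items into the
    trie, sharing prefixes and bumping counts."""
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    freq = {i: n for i, n in counts.items() if n >= minsup_count}
    order = sorted(freq, key=lambda i: (-freq[i], i))
    root = Node(None, None)
    for t in transactions:
        node = root
        for i in (x for x in order if x in t):
            child = node.children.get(i)
            if child:
                child.count += 1
            else:
                child = node.children[i] = Node(i, node)
            node = child
    return root, freq

# The slide's five-transaction database; min_support 0.5 of 5 means count >= 3.
D = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, freq = build_fptree(D, 3)
```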

  43. Mining Frequent Patterns Using FP-tree • General idea (divide-and-conquer) • Recursively grow frequent pattern paths using the FP-tree • Method • For each item, construct its conditional pattern base, and then its conditional FP-tree • Repeat the process on each newly created conditional FP-tree • Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)

  44. Major Steps to Mine FP-tree • Construct conditional pattern base for each node in the FP-tree • Construct conditional FP-tree from each conditional pattern base • Recursively mine conditional FP-trees and grow frequent patterns obtained so far • If the conditional FP-tree contains a single path, simply enumerate all the patterns

  45. Step 1: From FP-tree to Conditional Pattern Base • Starting at the frequent-item header table in the FP-tree • Traverse the FP-tree by following the links of each frequent item • Accumulate all transformed prefix paths of that item to form its conditional pattern base • (Figure: the FP-tree of slide 42 with its header table) • Conditional pattern bases:

item | conditional pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1

  46. Step 2: Construct Conditional FP-tree • For each pattern base: • Accumulate the count for each item in the base • Construct the FP-tree for the frequent items of the pattern base • m-conditional pattern base: fca:2, fcab:1 • m-conditional FP-tree: {} → f:3 → c:3 → a:3 • (Figure: the full FP-tree with header table, reduced to the m-conditional tree)

  47. Mining Frequent Patterns by Creating Conditional Pattern Bases

Item | Conditional pattern base | Conditional FP-tree
p | {(fcam:2), (cb:1)} | {(c:3)}|p
m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}|m
b | {(fca:1), (f:1), (c:1)} | Empty
a | {(fc:3)} | {(f:3, c:3)}|a
c | {(f:3)} | {(f:3)}|c
f | Empty | Empty

  48. Single FP-tree Path Generation • Suppose an FP-tree T has a single path P • The complete set of frequent patterns of T can be generated by enumerating all combinations of the sub-paths of P • m-conditional FP-tree: {} → f:3 → c:3 → a:3 • All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
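The single-path enumeration can be sketched directly with itertools; the path and suffix below match the m-conditional tree on this slide:

```python
from itertools import chain, combinations

def single_path_patterns(path, suffix):
    """Every subset of the single path (including the empty one),
    combined with the suffix item, is a frequent pattern."""
    subsets = chain.from_iterable(
        combinations(path, r) for r in range(len(path) + 1))
    return {frozenset(s) | {suffix} for s in subsets}

pats = single_path_patterns(["f", "c", "a"], "m")
print(len(pats))  # 8: m, fm, cm, am, fcm, fam, cam, fcam
```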

  49. Another example • Minsup = 2

  50. Another example
