
Α. Νανόπουλος, Γ. Μανωλόπουλος: Εισαγωγή στην Εξόρυξη και τις Αποθήκες Δεδομένων

Α. Νανόπουλος & Γ. Μανωλόπουλος, Εισαγωγή στην Εξόρυξη & τις Αποθήκες Δεδομένων, Κεφάλαιο 8: Κανόνες Συσχέτισης (Chapter 8: Association Rules). http://delab.csd.auth.gr/books/grBooks/grBooks.html





Presentation Transcript


  1. Α. Νανόπουλος & Γ. Μανωλόπουλος, Εισαγωγή στην Εξόρυξη & τις Αποθήκες Δεδομένων, Κεφάλαιο 8: Κανόνες Συσχέτισης (Chapter 8: Association Rules). http://delab.csd.auth.gr/books/grBooks/grBooks.html

  2. Associations • Rules expressing relationships between items • Example: cereal, milk → fruit, i.e., "people who bought cereal and milk also bought fruit." Stores might want to offer specials on milk and cereal to get people to buy more fruit.

  3. Market Basket Analysis • Analyze tables of transactions • Can we hypothesize rules such as: • Chips → Salsa • Lettuce → Spinach

  4. Definitions • Let I = {i1, …, im} be a set of literals called items • Let D be a set of transactions • for each T ∈ D, T ⊆ I • Itemset X: any X ⊆ I • k-itemset: |X| = k • X is contained in T if X ⊆ T • support s(X): fraction of transactions containing X
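The support definition above can be sketched in a few lines of Python. The four-transaction database below is an illustrative assumption, not taken from the slides:

```python
def support(itemset, transactions):
    """s(X): fraction of transactions T with X contained in T."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Hypothetical toy database of four market baskets.
D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support({"A"}, D))       # 0.75
print(support({"A", "C"}, D))  # 0.5
```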

  5. Rule Measures: Support and Confidence • Find all rules Beer → Diaper with minimum confidence and support • support, s: probability that a transaction contains {Beer, Diaper} • confidence, c: conditional probability that a transaction containing {Beer} also contains {Diaper} • (Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both) • With minimum support 50% and minimum confidence 50%, we have: • A → C (50%, 66.6%) • C → A (50%, 100%)

  6. Association rules: Problem description • Association rule: X → Y, X ⊆ I, Y ⊆ I, X ∩ Y = ∅ • Support of rule = s(X ∪ Y) • Confidence of rule = s(X ∪ Y) / s(X) • Problem: find all association rules with confidence larger than MINCONF and support larger than MINSUP • Notice that if X → Y and Y → Z, then not necessarily X → Z, because X → Z may not have enough support or confidence

  7. Mining Association Rules: An Example • Min. support 50%, min. confidence 50% • For rule A → C: support = support({A, C}) = 50%, confidence = support({A, C}) / support({A}) = 66.6%
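The numbers on this slide can be reproduced with a short sketch. The example database itself is an assumption (a classic four-transaction toy set chosen so that A → C has support 50% and confidence 66.6%):

```python
def support(itemset, transactions):
    """s(X): fraction of transactions that contain every item of X."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """c(X -> Y) = s(X u Y) / s(X)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# Assumed example database.
D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support({"A", "C"}, D))                 # 0.5
print(round(confidence({"A"}, {"C"}, D), 3))  # 0.667
```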

  8. Association Rule Mining • Two basic steps • Find all frequent itemsets • Satisfying minimum support • Find all strong association rules • Generate association rules from frequent itemsets • Keep rules satisfying minimum confidence

  9. Generating Frequent Itemsets: Naive algorithm

n ← |D|
for each subset s of I do
    l ← 0
    for each transaction T in D do
        if s is a subset of T then l ← l + 1
    if minimum support ≤ l/n then
        add s to frequent subsets
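The naive algorithm above translates directly to a runnable Python sketch. The four-transaction database is an illustrative assumption:

```python
from itertools import combinations

def naive_frequent(items, transactions, minsup):
    """Count every non-empty subset of I with one full scan of D each."""
    n = len(transactions)
    frequent = []
    for k in range(1, len(items) + 1):
        for s in combinations(sorted(items), k):
            l = sum(1 for t in transactions if set(s) <= t)  # scan of D
            if l / n >= minsup:
                frequent.append(frozenset(s))
    return frequent

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = naive_frequent(set().union(*D), D, 0.5)
```

On this database with minimum support 50%, nine itemsets are frequent, the largest being {B, C, E}.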

  10. Analysis of naive algorithm • O(2^m) subsets of I • Scan n transactions for each subset • O(2^m · n) tests of s being a subset of T • Growth is exponential in the number of items! • Can we do better?

  11. Reducing Number of Candidates: Apriori • Apriori principle: • If an itemset is frequent, then all of its subsets must also be frequent • Apriori principle holds due to the following property of the support measure: • Support of an itemset never exceeds the support of its subsets • This is known as the anti-monotone property of support

  12. Illustrating Apriori Principle • (Figure: itemset lattice; an itemset found to be infrequent has all of its supersets pruned)

  13. Illustrating Apriori Principle • Items (1-itemsets), pairs (2-itemsets): no need to generate candidates involving Coke or Eggs • Minimum support = 3 • Triplets (3-itemsets) • If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 • With support-based pruning: 6 + 6 + 1 = 13

  14. The Apriori Algorithm • Join step: Ck is generated by joining Lk-1 with itself • Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset • Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

L1 = {frequent items}
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk
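The pseudo-code above can be sketched compactly in Python. Minimum support is given as an absolute count, and the join is done here by unioning pairs of (k−1)-itemsets rather than the ordered SQL-style join of the next slide; the example database is an assumption:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise mining: L_k seeds C_{k+1}; one scan of D per level."""
    items = sorted(set().union(*transactions))
    L = [frozenset([i]) for i in items
         if sum(1 for t in transactions if i in t) >= minsup_count]
    frequent = list(L)
    k = 2
    while L:
        Lset = set(L)
        # join: union of two (k-1)-itemsets that differ in exactly one item
        C = {a | b for a in L for b in L if len(a | b) == k}
        # prune: drop candidates having an infrequent (k-1)-subset
        C = {c for c in C
             if all(frozenset(s) in Lset for s in combinations(c, k - 1))}
        # one scan of the database counts all surviving candidates
        L = [c for c in C
             if sum(1 for t in transactions if c <= t) >= minsup_count]
        frequent.extend(L)
        k += 1
    return frequent

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(D, 2)  # minsup = 50% of 4 transactions
```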

  15. How to Generate Candidates? • Suppose the items in Lk-1 are listed in an order • Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

• Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck

  16. Example of Generating Candidates • L3 = {abc, abd, acd, ace, bcd} • Self-joining: L3 * L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4 = {abcd}
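The join and prune steps, applied to this slide's L3, can be checked with a short sketch:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Self-join L_{k-1} on the first k-2 items, then prune candidates
    that have a (k-1)-subset outside L_{k-1}."""
    prev = {tuple(sorted(s)) for s in L_prev}
    out = set()
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:  # join step
                c = p + (q[-1],)
                if all(s in prev for s in combinations(c, k - 1)):  # prune step
                    out.add(c)
    return out

L3 = ["abc", "abd", "acd", "ace", "bcd"]
print(apriori_gen(L3, 4))  # {('a', 'b', 'c', 'd')} -- acde pruned, ade not in L3
```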

  17. How to Count Supports of Candidates? • Candidate counting: • Scan the database of transactions to determine the support of each candidate itemset • To reduce the number of comparisons, store the candidates in a hash structure • Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

  18. Example • (Figure: hash-structure candidate counting example with b = 2, c = 3)

  19. The Apriori Algorithm — Example • (Figure: database D; a scan of D produces C1 and L1; C2 is generated and a second scan produces L2; C3 is generated and a third scan produces L3)

  20. Another example of Apriori • Minsup = 2

  21. Another example of Apriori

  22. Generating rules (2nd sub-problem) • Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement • If {A,B,C,D} is a frequent itemset, the candidate rules are: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB • If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
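This second sub-problem can be sketched as follows; `support_of` is assumed to hold the already-computed supports of all frequent itemsets, and the counts below are a hypothetical example for the itemset {B, C, E}:

```python
from itertools import combinations

def gen_rules(L, support_of, minconf):
    """Emit f -> L - f for every non-empty proper subset f of frequent
    itemset L, keeping rules with confidence >= minconf."""
    L = frozenset(L)
    rules = []
    for r in range(1, len(L)):                # 2^k - 2 candidates in total
        for f in combinations(sorted(L), r):
            f = frozenset(f)
            conf = support_of[L] / support_of[f]
            if conf >= minconf:
                rules.append((f, L - f, conf))
    return rules

# Hypothetical support counts for {B, C, E} and its subsets.
s = {frozenset(x): n for x, n in
     [("B", 3), ("C", 3), ("E", 3), ("BC", 2), ("BE", 3), ("CE", 2), ("BCE", 2)]}
print(len(gen_rules("BCE", s, 0.0)))  # 6 = 2^3 - 2 candidate rules
print(len(gen_rules("BCE", s, 1.0)))  # 2: BC -> E and CE -> B
```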

  23. Rule Generation with anti-monotone property • How to efficiently generate rules from frequent itemsets? • In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D) • But confidence of rules generated from the same itemset has an anti-monotone property • e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD) • Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

  24. Rule Generation: example of anti-monotonicity • (Figure: lattice of rules; once a low-confidence rule is found, the rules below it are pruned)

  25. Methods to Improve Apriori’s Efficiency • Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent • Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans • Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB • Sampling: mining on a subset of given data, lower support threshold + a method to determine the completeness • Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent

  26. Hash-based method • |C2| = C(|L1|, 2) = O(|L1|^2) • |L1| may be thousands, so |C2| is the bottleneck • While counting the support of C1: • find the 2-itemsets in each transaction • hash them into buckets, accumulating counts • if a 2-itemset belongs to a bucket with count less than MINSUP, then it cannot be large
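A minimal sketch of this idea (a DHP/PCY-style hash filter). The bucket-array size and example database are assumptions; note that hash collisions can only keep extra pairs, never drop a truly frequent one:

```python
from itertools import combinations

def hash_filtered_c2(transactions, n_buckets, minsup_count):
    """While counting 1-itemsets, hash every 2-itemset of each transaction
    into a small bucket array; a pair whose bucket count stays below
    MINSUP cannot be frequent, so it is excluded from C2."""
    item_count, buckets = {}, [0] * n_buckets
    for t in transactions:
        for i in t:
            item_count[i] = item_count.get(i, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    L1 = sorted(i for i, n in item_count.items() if n >= minsup_count)
    # candidate pairs from L1 x L1 whose bucket is frequent
    return {p for p in combinations(L1, 2)
            if buckets[hash(p) % n_buckets] >= minsup_count}

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
C2 = hash_filtered_c2(D, 128, 2)
```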

  27. Example of hash-based method • Transactions: 100 ACD, 200 BCE, 300 ABCE, 400 BE • C1: {A} (2), {B} (3), {C} (3), {D} (1), {E} (3) • L1: {A}, {B}, {C}, {E} • L1 × L1: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E} • Hash buckets (contents → count): {C,E}, {C,E}, {A,D} → 3; {B,C}, {B,C}, {A,E} → 3; (empty) → 0; {B,E}, {B,E}, {B,E} → 3; {A,B} → 1; {A,C}, {C,D}, {A,C} → 3 • {A,B} falls in a bucket with count 1 < MINSUP, so it is pruned • C2: {A,C}, {A,E}, {B,C}, {B,E}, {C,E}

  28. Transaction trimming • While counting Ck (phase k): • an item in a transaction t can be trimmed if it does not appear in at least k members of Ck (candidate k-itemsets) • example: ABC, ABD, and BCD are candidates, but ACD is not • A then appears in only two candidates, so A can be trimmed from t: it cannot participate in forthcoming passes (ABCD will not be generated, since its subset ACD is not a candidate) • if t contains no more than k items, discard it
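The trimming rule as stated on this slide can be sketched as follows; the criterion counts an item's occurrences across the whole candidate set Ck, and the example C2 is an assumption:

```python
def trim(t, Ck, k):
    """Keep an item of transaction t only if it occurs in at least k
    candidate k-itemsets of Ck; discard t if fewer than k items survive."""
    kept = {i for i in t if sum(1 for c in Ck if i in c) >= k}
    return kept if len(kept) >= k else None  # None = transaction discarded

C2 = [{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]
print(trim({"A", "C", "D"}, C2, 2))            # None: only C survives, too short
print(sorted(trim({"A", "B", "C", "E"}, C2, 2)))  # ['B', 'C', 'E']
```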

  29. Example of transaction trimming • Transactions: 100 ACD, 200 BCE, 300 ABCE, 400 BE • C2 = {A,C}, {B,C}, {B,E}, {C,E} • 100 ACD → only {C} survives, discarded due to length • 200 BCE → {B,C,E} • 300 ABCE → {B,C,E} • 400 BE → {B,E}

  30. Partitioning method • Logically partitions the database into a number of non-overlapping partitions • each partition fits entirely in main memory • Finds the large itemsets in each partition (local large itemsets) • multiple phases are performed in main memory, I/O only to read the partition once • Determine which of the local large are also global large itemsets • in one scan of the database • no global large is missed

  31. Partition Algorithm

partition_database(D)
n = number of partitions
// Phase I
for i = 1 to n do
    read i-th partition
    Li = gen_all_large_itemsets(i-th partition)
// Phase II
C = L1 ∪ … ∪ Ln
for i = 1 to n do
    read i-th partition
    for each candidate c ∈ C
        update c.count
L = {c | c ∈ C, c.count ≥ MINSUP}
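A two-phase sketch of the Partition algorithm. The naive in-memory miner below stands in for a real Apriori run over one partition, and the example database is an assumption:

```python
from itertools import combinations

def local_large(part, minsup):
    """Naive in-memory miner for one partition (stand-in for Apriori)."""
    items = sorted(set().union(*part))
    return {frozenset(s)
            for k in range(1, len(items) + 1)
            for s in combinations(items, k)
            if sum(1 for t in part if set(s) <= t) >= minsup * len(part)}

def partition_mine(transactions, n_parts, minsup):
    size = -(-len(transactions) // n_parts)            # ceil(|D| / n)
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]
    # Phase I: mine each partition in memory; the union of the local
    # large itemsets is the global candidate set
    C = set().union(*(local_large(p, minsup) for p in parts))
    # Phase II: one scan of D counts every candidate globally
    return {c for c in C
            if sum(1 for t in transactions if c <= t) >= minsup * len(transactions)}

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
L = partition_mine(D, 2, 0.5)  # no global large itemset is missed
```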

  32. Partitioning example

  33. Why is Partition correct? • We have to show that no global large itemset is missed • A global large itemset X is not missed if it is a candidate in the 2nd phase, i.e., when it is local large in at least one partition • Let X be a global large itemset that is not local large in any partition; with t(X, P) the count of X in P: • t(X, Pi) < MINSUP · |Pi| for all 1 ≤ i ≤ n ⇒ Σi t(X, Pi) < MINSUP · Σi |Pi| ⇒ t(X, D) < MINSUP · |D| ⇒ s(X) < MINSUP (contradiction)

  34. Size of candidate set for the 2nd phase • Based on the assumption that the size |C| of the union of all local large itemsets is small enough • comparable to |L| (global large) • Since the same MINSUP is used, one can expect roughly the same size as |L| • at worst case n · |L| • in practice, many overlap (the same itemsets are large in different partitions) • For small n, |C| is very close to |L| • For larger n, |C| is larger, but not the worst case

  35. Sampling method • Main idea • get a random sample s of database D • find the large itemsets S of s • verify which members of S are large in D (one scan) • identify potentially “missed” large itemsets (possibly a 2nd scan) • Gains • mining s is free in terms of I/O • one scan of D in the best case and two in the worst

  36. Negative border

  37. How not to miss anything? • Goal is to avoid missing any itemset that is frequent in the full set of baskets • If X is frequent in D but not frequent in S, then there exists Y ⊆ X with Y ∈ Bd⁻(S) • We check whether anything in the negative border is actually frequent in D • Try to choose the Minsup used in S so that the probability of failure is low, while the number of itemsets checked on the second pass fits in main memory

  38. Sampling algorithm

  39. Example • Domain = {A, B, C, D, E, F} • S = {A,B,C}, {A,C,F}, {A,D}, {B,D} • Bd⁻(S) = {B,F}, {C,D}, {D,F}, {E}: candidates not found large in the sample • L = {A,B}, {B,F}, {A,C,F} are large in D • {B,F} represents a “miss” • {A,B}, {B,F} and {A,F} generate {A,B,F}, which could possibly be large but was not counted

  40. Is Apriori Fast Enough? — Performance Bottlenecks • The core of the Apriori algorithm: • Use frequent (k − 1)-itemsets to generate candidate frequent k-itemsets • Use database scans and pattern matching to collect counts for the candidate itemsets • The bottleneck of Apriori: candidate generation • Huge candidate sets: • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates • Multiple scans of database: • Needs (n + 1) scans, where n is the length of the longest pattern

  41. Mining Frequent Patterns Without Candidate Generation • Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure • highly condensed, but complete for frequent pattern mining • avoid costly database scans • Develop an efficient, FP-tree-based frequent pattern mining method • A divide-and-conquer methodology: decompose mining tasks into smaller ones • Avoid candidate generation: sub-database test only!

  42. Construct FP-tree from a Transaction DB • min_support = 0.5

TID | Items bought | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

• Steps: • Scan DB once, find frequent 1-itemsets (single-item patterns) • Order frequent items in frequency-descending order • Scan DB again, construct FP-tree • (Figure: FP-tree with header table f 4, c 4, a 3, b 3, m 3, p 3; root {} with branch f:4 → c:3 → a:3, where a:3 has children m:2 → p:2 and b:1 → m:1, f:4 also has child b:1, and a second branch c:1 → b:1 → p:1)
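The two-scan construction can be sketched as follows (node layout only; the header table's node links are omitted for brevity, and ties in frequency are broken alphabetically here, so the root branch may differ from the slide's figure):

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fptree(transactions, minsup_count):
    """Scan 1: count items and fix a frequency-descending order.
    Scan 2: insert each transaction's ordered frequent items into the
    trie, sharing prefixes and bumping counts."""
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    freq = {i: n for i, n in counts.items() if n >= minsup_count}
    order = sorted(freq, key=lambda i: (-freq[i], i))
    root = Node(None, None)
    for t in transactions:
        node = root
        for i in (x for x in order if x in t):
            child = node.children.get(i)
            if child:
                child.count += 1
            else:
                child = node.children[i] = Node(i, node)
            node = child
    return root, freq

# The slide's five-transaction database; min_support 0.5 of 5 means count >= 3.
D = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, freq = build_fptree(D, 3)
```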

  43. Mining Frequent Patterns Using FP-tree • General idea (divide-and-conquer) • Recursively grow frequent pattern paths using the FP-tree • Method • For each item, construct its conditional pattern base, and then its conditional FP-tree • Repeat the process on each newly created conditional FP-tree • Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)

  44. Major Steps to Mine FP-tree • Construct conditional pattern base for each node in the FP-tree • Construct conditional FP-tree from each conditional pattern base • Recursively mine conditional FP-trees and grow frequent patterns obtained so far • If the conditional FP-tree contains a single path, simply enumerate all the patterns

  45. Step 1: From FP-tree to Conditional Pattern Base • Starting at the frequent-item header table in the FP-tree • Traverse the FP-tree by following the links of each frequent item • Accumulate all transformed prefix paths of that item to form its conditional pattern base • (Figure: the FP-tree of slide 42 with its header table) • Conditional pattern bases:

item | conditional pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1

  46. Step 2: Construct Conditional FP-tree • For each pattern base: • Accumulate the count for each item in the base • Construct the FP-tree for the frequent items of the pattern base • m-conditional pattern base: fca:2, fcab:1 • m-conditional FP-tree: {} → f:3 → c:3 → a:3 • (Figure: the full FP-tree with header table, reduced to the m-conditional tree)

  47. Mining Frequent Patterns by Creating Conditional Pattern Bases

Item | Conditional pattern base | Conditional FP-tree
p | {(fcam:2), (cb:1)} | {(c:3)}|p
m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}|m
b | {(fca:1), (f:1), (c:1)} | Empty
a | {(fc:3)} | {(f:3, c:3)}|a
c | {(f:3)} | {(f:3)}|c
f | Empty | Empty

  48. Single FP-tree Path Generation • Suppose an FP-tree T has a single path P • The complete set of frequent patterns of T can be generated by enumerating all combinations of the sub-paths of P • m-conditional FP-tree: {} → f:3 → c:3 → a:3 • All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
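The single-path enumeration can be sketched directly with itertools; the path and suffix below match the m-conditional tree on this slide:

```python
from itertools import chain, combinations

def single_path_patterns(path, suffix):
    """Every subset of the single path (including the empty one),
    combined with the suffix item, is a frequent pattern."""
    subsets = chain.from_iterable(
        combinations(path, r) for r in range(len(path) + 1))
    return {frozenset(s) | {suffix} for s in subsets}

pats = single_path_patterns(["f", "c", "a"], "m")
print(len(pats))  # 8: m, fm, cm, am, fcm, fam, cam, fcam
```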

  49. Another example • Minsup = 2

  50. Another example
