
Data Mining: Association Rule Mining



Presentation Transcript


  1. Data Mining: Association Rule Mining

  2. Mining Association Rules in Large Databases
  • Association rule mining
  • Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
  • Mining various kinds of association/correlation rules
  • Applications/extensions of frequent pattern mining
  • Summary

  3. What Is Association Mining?
  • Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
  • Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database
  • Motivation: finding regularities in data
  • What products were often purchased together? Beer and diapers?!
  • What are the subsequent purchases after buying a PC?
  • What kinds of DNA are sensitive to this new drug?
  • Can we automatically classify web documents?

  4. Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?
  • Foundation for many essential data mining tasks
  • Association, correlation, causality
  • Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  • Associative classification, cluster analysis, iceberg cube
  • Broad applications
  • Basket data analysis, cross-marketing, catalog design, sales campaign analysis
  • Web log (click stream) analysis, DNA sequence analysis, etc.

  5. Basic Concepts: Frequent Patterns and Association Rules
  (Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both)
  • Itemset X = {x1, …, xk}
  • Find all rules X → Y with minimum support and confidence
  • support, s: probability that a transaction contains X ∪ Y
  • confidence, c: conditional probability that a transaction having X also contains Y
  • Let min_support = 50%, min_conf = 50%:
  • A → C (50%, 66.7%)
  • C → A (50%, 100%)

  6. Mining Association Rules — an Example
  • Min. support 50%, min. confidence 50%
  • For rule A → C:
  • support = support({A} ∪ {C}) = 50%
  • confidence = support({A} ∪ {C}) / support({A}) = 66.7%
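
To make the definitions above concrete, here is a minimal Python sketch of support and confidence. The slide's own transaction table is not reproduced in the transcript, so the four transactions below are a hypothetical database chosen so that A → C comes out at 50% support and roughly 66.7% confidence.

```python
# Minimal sketch of support and confidence over a hypothetical transaction
# database (chosen so that A -> C has 50% support and ~66.7% confidence).
transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """support(X u Y) / support(X) for the rule X -> Y."""
    return support(set(x) | set(y), db) / support(x, db)

print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # 0.666...
```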

  7. Chapter 6: Mining Association Rules in Large Databases
  • Association rule mining
  • Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
  • Mining various kinds of association/correlation rules
  • Applications/extensions of frequent pattern mining
  • Summary

  8. Apriori: A Candidate Generation-and-Test Approach
  • Any subset of a frequent itemset must be frequent
  • if {beer, diaper, nuts} is frequent, so is {beer, diaper}
  • every transaction having {beer, diaper, nuts} also contains {beer, diaper}
  • Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested!
  • Method:
  • generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
  • test the candidates against the DB
  • Performance studies show its efficiency and scalability

  9. The Apriori Algorithm — An Example
  (Figure: example run on transaction database TDB: the 1st scan yields candidates C1 and frequent itemsets L1, the 2nd scan yields C2 and L2, the 3rd scan yields C3 and L3)

  10. The Apriori Algorithm
  • Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):
    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
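
As a concrete illustration of this loop, here is a compact, runnable Python sketch (my illustration, not the slide authors' code). Itemsets are represented as frozensets and, for simplicity, min_support is taken as an absolute count rather than a percentage.

```python
# A compact Apriori sketch following the pseudo-code above.
from itertools import combinations

def apriori(db, min_support):
    db = [frozenset(t) for t in db]
    # L1: frequent 1-itemsets
    items = {i for t in db for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in db) >= min_support}
    all_frequent = set(Lk)
    k = 1
    while Lk:
        # Candidate generation: join Lk with itself, then prune by the
        # Apriori property (every k-subset of a candidate must be in Lk).
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count the surviving candidates against the database.
        counts = {c: sum(c <= t for t in db) for c in Ck1}
        Lk = {c for c, n in counts.items() if n >= min_support}
        all_frequent |= Lk
        k += 1
    return all_frequent

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, min_support=2))
```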

  11. Important Details of Apriori
  • How to generate candidates?
  • Step 1: self-joining Lk
  • Step 2: pruning
  • Example of candidate generation:
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning:
  • acde is removed because ade is not in L3
  • C4 = {abcd}

  12. How to Generate Candidates?
  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
  • Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
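
The self-join and prune steps can be expressed directly over sorted tuples. The sketch below is an illustration of those two steps; run on the L3 example from slide 11, it reproduces C4 = {abcd}.

```python
# Ordered self-join + prune for candidate generation.
from itertools import combinations

def gen_candidates(Lk_1, k):
    """Generate Ck from the frequent (k-1)-itemsets Lk_1, given as sorted tuples."""
    Lk_1 = sorted(Lk_1)
    frequent = set(Lk_1)
    Ck = set()
    for p in Lk_1:
        for q in Lk_1:
            # Step 1: self-join -- first k-2 items equal, last item of p < last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Step 2: prune -- every (k-1)-subset of c must be frequent.
                if all(s in frequent for s in combinations(c, k - 1)):
                    Ck.add(c)
    return Ck

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(gen_candidates(L3, 4))   # {('a', 'b', 'c', 'd')}; acde is pruned since ade is not in L3
```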

  13. Challenges of Frequent Pattern Mining
  • Challenges:
  • Multiple scans of the transaction database
  • Huge number of candidates
  • Tedious workload of support counting for candidates
  • Improving Apriori: general ideas
  • Reduce the number of passes over the transaction database
  • Shrink the number of candidates
  • Facilitate support counting of candidates

  14. Bottleneck of Frequent-pattern Mining
  • Multiple database scans are costly
  • Mining long patterns needs many passes of scanning and generates lots of candidates
  • To find the frequent itemset i1 i2 … i100:
  • # of scans: 100
  • # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 !
  • Bottleneck: candidate generation and test
  • Can we avoid candidate generation?

  15. Mining Frequent Patterns Without Candidate Generation
  • Grow long patterns from short ones using local frequent items
  • "abc" is a frequent pattern
  • Get all transactions having "abc": DB|abc
  • "d" is a local frequent item in DB|abc → abcd is a frequent pattern

  16. Construct an FP-tree From a Transaction Database (min_support = 3)

  TID   Items bought                  (Ordered) frequent items
  100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
  200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
  300   {b, f, h, j, o, w}            {f, b}
  400   {b, c, k, s, p}               {c, b, p}
  500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

  • Scan the DB once, find the frequent 1-itemsets (single-item patterns)
  • Sort frequent items in frequency-descending order to obtain the f-list: f-c-a-b-m-p
  • Scan the DB again, construct the FP-tree
  (Figure: the resulting FP-tree with its header table (item frequencies f:4, c:4, a:3, b:3, m:3, p:3) and node-links from the header table into the tree)
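
Below is a minimal FP-tree construction sketch in Python, an illustration under the assumptions on this slide: two scans, an f-list in frequency-descending order, infrequent items dropped, and each transaction inserted in f-list order. Node-links are kept as simple per-item lists in a header table.

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(db, min_support):
    # First scan: item frequencies and the f-list.
    freq = Counter(i for t in db for i in t)
    flist = [i for i, n in freq.most_common() if n >= min_support]
    rank = {i: r for r, i in enumerate(flist)}
    root, header = Node(None, None), defaultdict(list)   # header: item -> list of nodes
    # Second scan: insert each transaction's frequent items in f-list order.
    for t in db:
        path = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, flist

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header, flist = build_fp_tree(db, 3)
print(flist)                       # frequency-descending f-list; tie order may differ from f-c-a-b-m-p
print(root.children["f"].count)    # 4
```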

  17. Benefits of the FP-tree Structure
  • Completeness
  • Preserves complete information for frequent pattern mining
  • Never breaks a long pattern of any transaction
  • Compactness
  • Reduces irrelevant info: infrequent items are gone
  • Items in frequency-descending order: the more frequently occurring, the more likely to be shared
  • Never larger than the original database (not counting node-links and the count fields)

  18. Partition Patterns and Databases
  • Frequent patterns can be partitioned into subsets according to the f-list
  • F-list = f-c-a-b-m-p
  • Patterns containing p
  • Patterns having m but no p
  • …
  • Patterns having c but none of a, b, m, p
  • Pattern f
  • This partitioning is complete and non-redundant

  19. Find Patterns Having p From p's Conditional Database
  • Start at the frequent-item header table of the FP-tree
  • Traverse the FP-tree by following the node-links of each frequent item p
  • Accumulate all transformed prefix paths of item p to form p's conditional pattern base

  Conditional pattern bases:
  item   conditional pattern base
  c      f:3
  a      fc:3
  b      fca:1, f:1, c:1
  m      fca:2, fcab:1
  p      fcam:2, cb:1

  20. From Conditional Pattern Bases to Conditional FP-trees
  • For each pattern base:
  • Accumulate the count for each item in the base
  • Construct the FP-tree for the frequent items of the pattern base
  • m-conditional pattern base: fca:2, fcab:1
  • m-conditional FP-tree: the single path {} → f:3 → c:3 → a:3
  • All frequent patterns relate to m: m, fm, cm, am, fcm, fam, cam, fcam
  (Figure: the full FP-tree with its header table shown next to the m-conditional FP-tree)

  21. Recursion: Mining Each Conditional FP-tree
  • Conditional pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
  • Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
  • Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3

  22. Mining Frequent Patterns With FP-trees
  • Idea: frequent pattern growth
  • Recursively grow frequent patterns by pattern and database partition
  • Method:
  • For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  • Repeat the process on each newly created conditional FP-tree
  • Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
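
As a concrete sketch of this recursion, the function below reuses the Node class and build_fp_tree function from the FP-tree sketch after slide 16 (an assumption: that sketch is in scope; this is an illustration, not the original FP-growth implementation). For each item in the header table it collects the conditional pattern base as count-weighted prefix paths, then mines that smaller database recursively.

```python
def fp_growth(db, min_support, suffix=()):
    _, header, flist = build_fp_tree(db, min_support)
    patterns = {}
    for item in reversed(flist):                     # least-frequent items first
        new_suffix = (item,) + suffix
        patterns[new_suffix] = sum(node.count for node in header[item])
        # Conditional pattern base: the prefix path of every node holding `item`,
        # weighted by that node's count.
        cond_db = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            cond_db.extend([path] * node.count)
        # Recurse: mine the conditional database with its own FP-tree.
        patterns.update(fp_growth(cond_db, min_support, new_suffix))
    return patterns

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
print(fp_growth(db, 3))   # frequent itemsets as tuples with their supports, e.g. ('c', 'p'): 3
```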

  23. FP-Growth vs. Apriori: Scalability With the Support Threshold
  (Figure: run time of FP-growth vs. Apriori as the support threshold decreases, on data set T25I20D10K)

  24. Why Is FP-Growth the Winner?
  • Divide-and-conquer:
  • decompose both the mining task and the DB according to the frequent patterns obtained so far
  • leads to focused search of smaller databases
  • Other factors:
  • no candidate generation, no candidate test
  • compressed database: FP-tree structure
  • no repeated scan of the entire database
  • basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching

  25. Visualization of Association Rules: Pane Graph
  (Figure: pane-graph visualization of association rules)

  26. Other Algorithms for Association Rule Mining
  • Partition algorithm
  • Sampling method
  • Dynamic itemset counting

  27. Partition Algorithm
  Executes in two phases:
  Phase I:
  • logically divide the database into a number of non-overlapping partitions
  • generate all large itemsets for each partition
  • merge all these large itemsets into one set of all potentially large itemsets
  Phase II:
  • compute the actual support of these itemsets and identify the large itemsets
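
A two-phase sketch of this idea in Python (an illustration; it reuses the apriori() function from the earlier Apriori sketch as the in-memory per-partition miner, and takes the minimum support as a fraction):

```python
import math

def partition_mine(db, min_support_frac, num_partitions):
    db = [frozenset(t) for t in db]
    size = math.ceil(len(db) / num_partitions)
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Phase I: mine each partition in memory, union the locally large itemsets.
    global_candidates = set()
    for part in parts:
        local_min = math.ceil(min_support_frac * len(part))
        global_candidates |= apriori(part, local_min)
    # Phase II: one more full scan to compute the actual (global) supports.
    counts = {c: sum(c <= t for t in db) for c in global_candidates}
    threshold = min_support_frac * len(db)
    return {c: n for c, n in counts.items() if n >= threshold}

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(partition_mine(db, min_support_frac=0.5, num_partitions=2))
```

Every globally large itemset must be locally large in at least one partition, which is why the union of the locally large itemsets is a safe candidate set for Phase II.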

  28. Partition Algorithm (cont.)
  • Partition sizes are chosen so that each partition fits in memory and the partitions are read only once in each phase
  • Assumptions:
  • transactions are of the form TID, ij, ik, …, in
  • items in a transaction are kept in lexicographical order
  • TIDs are monotonically increasing
  • items in an itemset are also kept in sorted lexicographical order
  • the approximate size of the database in blocks or pages is known in advance

  29. Partition Algorithm (cont.)
  • local support of an itemset: the fraction of transactions in a partition that contain the itemset
  • local large itemset: an itemset whose local support in a partition is at least the minimum support; it may or may not be large in the context of the entire database
  • local candidate itemset: an itemset that is being tested for minimum support within a given partition
  • global support, global large itemset, and global candidate itemset: defined as above, but in the context of the entire database

  30. Notation
  (Figure: notation table)

  31. The Algorithm
  (Figure: pseudo-code of the Partition algorithm)

  32. Partition Algorithm (cont.)
  • Discovering rules:
  • if l is a large itemset, then for every subset a of l, the ratio support(l) / support(a) is computed
  • if the ratio is at least the user-specified minimum confidence, then the rule a → (l − a) is output
  • Size of the global candidate set:
  • its size is bounded by n times the size of the largest set of locally large itemsets
  • for sufficiently large partition sizes, the number of locally large itemsets is comparable to the number of large itemsets generated for the entire database
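
A sketch of this rule-generation step (an illustration; `supports` is assumed to be a mapping from itemsets to support counts, for example the dictionary returned by the partition sketch above):

```python
from itertools import combinations

def gen_rules(supports, min_conf):
    """For every large itemset l and non-empty proper subset a, output
    a -> (l - a) when support(l) / support(a) >= min_conf."""
    rules = []
    for l, sup_l in supports.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for a in combinations(l, r):
                a = frozenset(a)
                conf = sup_l / supports[a]
                if conf >= min_conf:
                    rules.append((a, l - a, conf))
    return rules

supports = {frozenset("A"): 3, frozenset("C"): 3, frozenset("AC"): 2}
print(gen_rules(supports, min_conf=0.5))   # includes ({'A'}, {'C'}, 0.666...)
```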

  33. Partition Algorithm (cont.)
  If data is uniformly distributed across partitions, a large number of the itemsets generated for individual partitions may be common.

  34. Partition Algorithm (cont.)
  Data skew can be eliminated to a large extent by randomizing the data within the partitions.

  35. Comparison with Apriori
  (Figures: number of comparisons and execution times in seconds, Partition vs. Apriori)

  36. Comparison (cont.)
  (Figure: number of database read requests)

  37. Comparison (cont.)
  (Figure: further comparison results)

  38. Partition Algorithm Highlights
  • Achieves both CPU and I/O improvements over Apriori
  • Scans the database twice
  • Scales linearly with the number of transactions
  • The inherent parallelism in the algorithm can be exploited for implementation on a parallel machine

  39. Sampling Algorithm
  • Makes one full pass, two passes in the worst case
  • Pick a random sample, find all association rules, and then verify the results
  • In very rare cases not all the association rules are produced in the first scan, because the algorithm is probabilistic
  • Samples small enough to be handled entirely in main memory give reasonably accurate results
  • Trades off accuracy against efficiency

  40. Sampling Step
  • A superset of the frequent sets can be determined efficiently by applying the level-wise method to the sample in main memory, using a lowered frequency threshold
  • In terms of the Partition algorithm: discover the locally frequent sets from one partition only, and with a lower threshold

  41. Negative Border
  Given a collection S ⊆ P(R) of sets, closed with respect to the set-inclusion relation, the negative border Bd⁻(S) of S consists of the minimal itemsets X ⊆ R not in S
  • the collection of all frequent sets is closed w.r.t. set inclusion
  • Example:
  • R = {A, …, F}
  • F(r, min_fr) is {A}, {B}, {C}, {F}, {A,B}, {A,C}, {A,F}, {C,F}, {A,C,F}
  • the set {B,C} is not in the collection but all its subsets are
  • the whole negative border is Bd⁻(F(r, min_fr)) = {{B,C}, {B,F}, {D}, {E}}

  42. Sampling Method (cont.)
  • Intuition behind the negative border: given a collection S of sets that are frequent, the negative border contains the "closest" itemsets that could also be frequent
  • The negative border Bd⁻(F(r, min_fr)) needs to be evaluated in order to be sure that no frequent sets are missed
  • If F(r, min_fr) ⊆ S, then S ∪ Bd⁻(S) is a sufficient collection to be checked
  • Determining S ∪ Bd⁻(S) is easy: it consists of all sets that were candidates of the level-wise method in the sample
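
A small sketch of computing the negative border (an illustration; it assumes S is downward closed, as stated on slide 41, so checking only the immediate (k−1)-subsets suffices). With the collection from slide 41 it returns {B,C}, {B,F}, {D}, {E}.

```python
from itertools import combinations

def negative_border(S, items):
    """Minimal itemsets over `items` that are not in S but whose subsets all are."""
    S = {frozenset(x) for x in S}
    S.add(frozenset())                       # the empty set is trivially frequent
    max_len = max(len(s) for s in S)
    border = set()
    for k in range(1, max_len + 2):          # candidate sizes up to max_len + 1
        for c in combinations(sorted(items), k):
            c = frozenset(c)
            if c not in S and all(frozenset(s) in S for s in combinations(c, k - 1)):
                border.add(c)
    return border

R = set("ABCDEF")
F = ["A", "B", "C", "F", "AB", "AC", "AF", "CF", "ACF"]
print(negative_border(F, R))   # {B,C}, {B,F}, {D}, {E}, matching slide 41
```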

  43. The Algorithm
  (Figure: pseudo-code of the sampling algorithm)

  44. Sampling Method (cont.)
  • Search for frequent sets in the sample, but lower the frequency threshold so much that it is unlikely that any frequent sets are missed
  • Evaluate the frequent sets from the sample and their border in the rest of the database
  • A miss is a frequent set Y in F(r, min_fr) that is in Bd⁻(S)
  • There has been a failure in the sampling if not all frequent sets are found in one pass, i.e., if there is a frequent set X in F(r, min_fr) that is not in S ∪ Bd⁻(S)

  45. Sampling Method (cont.)
  • Misses themselves are not a problem; they do, however, indicate a potential problem: if there is a miss Y, then some superset of Y might be frequent but not in S ∪ Bd⁻(S)
  • A simple way to recognize a potential failure is thus to check whether there are any misses
  • In the fraction of cases where a possible failure is reported, all frequent sets can be found by making a second pass over the database
  • Depending on how randomly the rows have been assigned to the blocks, this method can give good or bad results

  46. Example
  • relation r has 10 million rows over attributes A, …, F
  • minimum support = 2%; the random sample s has 20,000 rows
  • lower the frequency threshold to 1.5% and find S = F(s, 1.5%)
  • let S be {A,B,C}, {A,C,F}, {A,D}, {B,D} and the negative border be {B,F}, {C,D}, {D,F}, {E}
  • after a database scan we discover F(r, 2%) = {A,B}, {A,C,F}
  • suppose {B,F} turns out to be frequent in r, i.e., {B,F} is a miss
  • what we have actually missed is the set {A,B,F}, which can be frequent in r since all its subsets are

  47. Misses
  (Figure: misses in the sampling method)

  48. Performance Comparison
  (Figure: performance comparison of the sampling method)

  49. Dynamic Itemset Counting
  • Partition the database into blocks marked by start points
  • New candidates are added at each start point, unlike in Apriori
  • Dynamic: estimates the support of all itemsets that have been counted so far, adding new candidate itemsets if all of their subsets are estimated to be frequent
  • Reduces the number of passes while keeping the number of itemsets counted relatively low
  • Fewer database scans than Apriori
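
Below is a deliberately simplified sketch of the DIC idea (an illustration only; the real algorithm keeps per-itemset start points and the dashed/solid circle/box bookkeeping, which are omitted here). The database is read cyclically in blocks; an itemset starts being counted as soon as all of its immediate subsets look frequent, and it is retired once it has been counted against every transaction exactly once.

```python
from itertools import combinations

def dic(db, min_support, block_size):
    db = [frozenset(t) for t in db]
    n = len(db)
    # Start by counting all 1-itemsets; state is [running count, transactions seen].
    counting = {frozenset([i]): [0, 0] for t in db for i in t}
    finished = {}                                   # itemset -> final count
    pos = 0
    while counting:
        block = [db[(pos + j) % n] for j in range(block_size)]
        pos = (pos + block_size) % n
        for c, st in counting.items():
            for t in block:
                if st[1] < n:                       # count each transaction once per itemset
                    st[0] += c <= t
                    st[1] += 1
        # Retire itemsets that have now seen the whole database.
        for c in [c for c, st in counting.items() if st[1] >= n]:
            finished[c] = counting.pop(c)[0]
        # "Looks frequent": confirmed frequent, or running count already >= min_support.
        looks_freq = {c for c, cnt in finished.items() if cnt >= min_support}
        looks_freq |= {c for c, st in counting.items() if st[0] >= min_support}
        # Start counting new candidates whose immediate subsets all look frequent.
        for a in looks_freq:
            for b in looks_freq:
                c = a | b
                if len(c) == len(a) + 1 and c not in counting and c not in finished:
                    if all(frozenset(s) in looks_freq for s in combinations(c, len(c) - 1)):
                        counting[c] = [0, 0]
    return {c: cnt for c, cnt in finished.items() if cnt >= min_support}

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(dic(db, min_support=2, block_size=2))
```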

  50. DIC: Reduce the Number of Scans
  • Once both A and D are determined frequent, the counting of AD begins
  • Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
  (Figure: the itemset lattice over {A, B, C, D} ({}; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD), alongside a transaction timeline showing that Apriori counts 1-itemsets, then 2-itemsets, and so on in separate passes, while DIC starts counting 2-itemsets and 3-itemsets partway through a pass)
