Data Mining Association Rules



  1. Data Mining Association Rules Jen-Ting Tony Hsiao Alexandros Ntoulas CS 240B May 21, 2002

  2. Outline • Overview of data mining association rules • Algorithm Apriori • FP-Trees

  3. Data Mining Association Rules • Data mining is motivated by the decision support problem faced by most large retail organizations • Bar-code technology made it possible for retail organizations to collect and store massive amounts of sales data, referred to as the basket data • Successful organizations view such databases as important pieces of the marketing infrastructure • Organizations are interested in instituting information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies

  4. Data Mining Association Rules • For example, 98% of customers that purchase tires and auto accessories also get automotive services done • Finding all such rules is valuable for cross-marketing and attached mailing applications • Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns • The databases involved in these applications are very large, so it is imperative to have fast algorithms for this task

  5. Formal Definition of Problem • Formal statement of the problem [AIS93]: • Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. • A transaction T contains X, a set of some items in I, if X ⊆ T. • An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = Ø. • The rule X ⇒ Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. • The rule X ⇒ Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y.

  6. Example of Confidence and Support • Given the following table of transactions (the table itself was not reproduced in this transcript): • 1 ⇒ 2 has 100% confidence, with 100% support • 3 ⇒ 4 has 100% confidence, with 33% support • 2 ⇒ 3 has 33% confidence, with 33% support
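Since the slide's transaction table did not survive the transcript, the short Python sketch below uses a hypothetical three-transaction database chosen to be consistent with the three rules quoted above; the data and function names are illustrative, not the original slide's.

    # A minimal sketch of the [AIS93] support/confidence definitions.
    # D is hypothetical: it is chosen to agree with the three rules above.
    D = [{1, 2}, {1, 2, 3, 4}, {1, 2, 5}]

    def support(itemset, transactions):
        # fraction of transactions that contain every item in `itemset`
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(x, y, transactions):
        # fraction of the transactions containing X that also contain Y
        return support(x | y, transactions) / support(x, transactions)

    print(confidence({1}, {2}, D), support({1, 2}, D))  # 1.0,  1.0    (1 => 2)
    print(confidence({3}, {4}, D), support({3, 4}, D))  # 1.0,  ~0.33  (3 => 4)
    print(confidence({2}, {3}, D), support({2, 3}, D))  # ~0.33, ~0.33 (2 => 3)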

  7. Problem Decomposition • Find all sets of items (itemsets) that have transaction support above minimum support. The support for an itemset is the number of transactions that contain the itemset. Itemsets with minimum support are called large itemsets, and all others small itemsets. • Use large itemsets to generate the desired rules. If ABCD and AB are large itemsets, then we can determine if rule AB ⇒ CD holds by computing the ratio conf = support(ABCD)/support(AB). If conf ≥ minconf, then the rule holds. (The rule will surely have minimum support, because ABCD is large.)

  8. Discovering Large Itemsets • Algorithms for discovering large itemsets make multiple passes over the data • In the first pass, we count the support of individual items and determine which of them are large (i.e., have minimum support) • In each subsequent pass, we start with a seed set of itemsets found to be large in the previous pass, use this seed set to generate new potentially large itemsets, called candidate itemsets, and count the actual support for these candidate itemsets during the pass over the data • At the end of the pass, we determine which of the candidate itemsets are actually large, and they become the seed for the next pass • This process continues until no new large itemsets are found

  9. Notations • The number of items in an itemset is its size, and we call an itemset of size k a k-itemset • Items within an itemset are kept in lexicographic order

  10. Algorithm Apriori

    L1 = {large 1-itemsets};
    for ( k = 2; Lk-1 ≠ Ø; k++ ) do begin
        Ck = apriori-gen(Lk-1);   // New candidates
        forall transactions t ∈ D do begin
            Ct = subset(Ck, t);   // Candidates contained in t
            forall candidates c ∈ Ct do
                c.count++;
        end
        Lk = {c ∈ Ck | c.count ≥ minsup}
    end
    Answer = ∪k Lk;
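A minimal Python sketch of this level-wise loop, assuming absolute support counts and itemsets kept as sorted tuples; for brevity the candidate generation here performs only the join step (the full apriori-gen, with pruning, is sketched after slide 12), and all names are illustrative.

    def apriori(transactions, minsup_count):
        # Pass 1: count single items and keep the large 1-itemsets.
        counts = {}
        for t in transactions:
            for item in t:
                counts[(item,)] = counts.get((item,), 0) + 1
        L = {c: n for c, n in counts.items() if n >= minsup_count}
        answer = dict(L)
        k = 2
        while L:
            # Join step only: combine (k-1)-itemsets sharing their first k-2 items.
            prev = sorted(L)
            C = {p[:k - 1] + (q[k - 2],) for p in prev for q in prev
                 if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]}
            counts = dict.fromkeys(C, 0)
            for t in transactions:          # one pass over the data per level
                for c in C:
                    if set(c) <= t:         # candidate contained in t
                        counts[c] += 1
            L = {c: n for c, n in counts.items() if n >= minsup_count}
            answer.update(L)
            k += 1
        return answer

    print(sorted(apriori([{"a", "b"}, {"a", "b", "c"}, {"a", "c"}], 2).items()))
    # [(('a',), 3), (('a', 'b'), 2), (('a', 'c'), 2), (('b',), 2), (('c',), 2)]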

  11. Apriori Candidate Generation • The apriori-gen function takes as argument Lk-1, the set of all large (k-1)-itemsets. It returns a superset of the set of all large k-itemsets. • First, in the join step, we join Lk-1 with Lk-1:

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

  • Next, in the prune step, we delete all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1:

    forall itemsets c ∈ Ck do
        forall (k-1)-subsets s of c do
            if (s ∉ Lk-1) then delete c from Ck;

  12. Apriori-gen Example • Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}} • In the join step: • {1 2 3} joins with {1 2 4} to produce {1 2 3 4} • {1 3 4} joins with {1 3 5} to produce {1 3 4 5} • After the join step, C4 will be {{1 2 3 4}, {1 3 4 5}} • In the prune step: • {1 2 3 4} is tested for existence of 3-item subsets within L3, thus for {1 2 3}, {1 3 4}, and {2 3 4} • {1 3 4 5} is tested for {1 3 4}, {1 4 5}, and {3 4 5}, with {1 4 5} not found, and thus this set is deleted • We then will be left with only {1 2 3 4} in C4
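A Python sketch of apriori-gen under the sorted-tuple representation used above; the function name and representation are assumptions, but the join and prune steps follow slides 11 and 12, and the demo reproduces this slide's result.

    from itertools import combinations

    def apriori_gen(L_prev, k):
        # Join step: p and q agree on their first k-2 items, p's last < q's last.
        prev = sorted(L_prev)
        C = {p[:k - 2] + (p[k - 2], q[k - 2]) for p in prev for q in prev
             if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]}
        # Prune step: drop c if any (k-1)-subset of c is not large.
        large = set(L_prev)
        return {c for c in C if all(s in large for s in combinations(c, k - 1))}

    L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
    print(apriori_gen(L3, 4))  # {(1, 2, 3, 4)} -- (1, 3, 4, 5) is pruned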

  13. Discovering Rules • For every large itemset l, we find all non-empty subsets of l • For every such subset a, we output a rule of the form a ⇒ (l - a) if the ratio of support(l) to support(a) is at least minconf • We consider all subsets of l to generate rules with multiple consequents
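A brute-force Python sketch of this procedure, assuming a dictionary that maps every large itemset (as a frozenset) to its support; by the Apriori property, every subset of a large itemset is itself in the dictionary. All names and the demo data are illustrative.

    from itertools import combinations

    def all_rules(large, minconf):
        # `large` maps each large itemset (a frozenset) to its support.
        rules = []
        for l, sup_l in large.items():
            for r in range(1, len(l)):                 # all non-empty proper subsets
                for a in map(frozenset, combinations(l, r)):
                    conf = sup_l / large[a]
                    if conf >= minconf:
                        rules.append((set(a), set(l - a), conf, sup_l))
        return rules

    # Hypothetical supports matching the slide-6 example (items as strings):
    large = {frozenset(s): n for s, n in
             [("12", 1.0), ("1", 1.0), ("2", 1.0),
              ("34", 1 / 3), ("3", 1 / 3), ("4", 1 / 3)]}
    for a, c, conf, sup in all_rules(large, 0.9):
        print(a, "=>", c, f"(conf {conf:.0%}, sup {sup:.0%})")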

  14. Simple Algorithm • We can improve the preceding procedure by generating the subsets of a large itemset in a recursive depth-first fashion • For example, given an itemset ABCD, we first consider the subset ABC, then AB, etc. • If a subset a of a large itemset l does not generate a rule, then the subsets of a need not be considered for generating rules using l • For example, if ABC ⇒ D does not have enough confidence, we need not check whether AB ⇒ CD holds

  15. Simple Algorithm

    forall large itemsets lk, k ≥ 2 do
        call genrules(lk, lk);

    procedure genrules(lk: large k-itemset, am: large m-itemset)
        A = {(m-1)-itemsets am-1 | am-1 ⊂ am};
        forall am-1 ∈ A do begin
            conf = support(lk) / support(am-1);
            if (conf ≥ minconf) then begin
                output the rule am-1 ⇒ (lk - am-1),
                    with confidence = conf and support = support(lk);
                if (m - 1 > 1) then
                    call genrules(lk, am-1);  // to generate rules with subsets
                                              // of am-1 as the antecedents
            end
        end
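A Python rendering of genrules, again assuming a support dictionary over frozensets; like the pseudocode above, it recurses only under antecedents whose rules held, and (also like the pseudocode) it may test an antecedent more than once when it is reachable through several chains of subsets. The support counts in the demo are hypothetical.

    from itertools import combinations

    def genrules(l_k, a_m, support, minconf, out):
        # Try every (m-1)-subset of the current antecedent a_m.
        for a in map(frozenset, combinations(a_m, len(a_m) - 1)):
            conf = support[l_k] / support[a]
            if conf >= minconf:
                out.append((set(a), set(l_k - a), conf, support[l_k]))
                if len(a) > 1:                    # recurse on subsets of a
                    genrules(l_k, a, support, minconf, out)

    # Hypothetical absolute support counts for ABCD and all of its subsets:
    support = {frozenset(s): n for s, n in
               [("ABCD", 3), ("ABC", 3), ("ABD", 4), ("ACD", 3), ("BCD", 3),
                ("AB", 5), ("AC", 4), ("AD", 4), ("BC", 4), ("BD", 5), ("CD", 4),
                ("A", 6), ("B", 7), ("C", 5), ("D", 6)]}
    out = []
    genrules(frozenset("ABCD"), frozenset("ABCD"), support, 0.75, out)
    print({(frozenset(a), frozenset(c)) for a, c, *_ in out})  # deduplicated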

  16. Faster Algorithm • Earlier, we showed that if a rule does not hold, then no rule whose consequent is a superset of that rule's consequent holds either • For example, if ABC ⇒ D does not have enough confidence, AB ⇒ CD will not hold either • This can also be applied in the reverse direction: if a rule holds, then every rule whose consequent is a subset of that rule's consequent also holds • For example, if AB ⇒ CD has enough confidence, ABC ⇒ D will also hold

  17. Faster Algorithm

    forall large itemsets lk, k ≥ 2 do begin
        H1 = { consequents of rules derived from lk with one item in the consequent };
        call ap-genrules(lk, H1);
    end

    procedure ap-genrules(lk: large k-itemset, Hm: set of m-item consequents)
        if (k > m + 1) then begin
            Hm+1 = apriori-gen(Hm);
            forall hm+1 ∈ Hm+1 do begin
                conf = support(lk) / support(lk - hm+1);
                if (conf ≥ minconf) then
                    output the rule (lk - hm+1) ⇒ hm+1
                        with confidence = conf and support = support(lk);
                else
                    delete hm+1 from Hm+1;
            end
            call ap-genrules(lk, Hm+1);
        end
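A self-contained Python sketch of ap-genrules, with the apriori-gen join and prune inlined on the consequent sets; the rules_from driver that seeds H1 is an assumption about how the loop above would be wired up, and the demo support counts are hypothetical.

    from itertools import combinations

    def ap_genrules(l_k, H_m, support, minconf, out):
        # Grow (m+1)-item candidate consequents, apriori-gen-style, from the
        # m-item consequents whose rules held at the previous level.
        m = len(next(iter(H_m)))
        if len(l_k) <= m + 1:
            return
        prev = sorted(tuple(sorted(h)) for h in H_m)
        prevset = set(prev)
        cand = {p[:m - 1] + (p[m - 1], q[m - 1]) for p in prev for q in prev
                if p[:m - 1] == q[:m - 1] and p[m - 1] < q[m - 1]}
        cand = {c for c in cand if all(s in prevset for s in combinations(c, m))}
        H_next = set()
        for h in map(frozenset, cand):
            conf = support[l_k] / support[l_k - h]
            if conf >= minconf:
                out.append((set(l_k - h), set(h), conf, support[l_k]))
                H_next.add(h)        # only surviving consequents are extended
        if H_next:
            ap_genrules(l_k, H_next, support, minconf, out)

    def rules_from(l_k, support, minconf, out):
        # Seed H1 with the one-item consequents whose rules hold, then recurse.
        h1 = set()
        for item in l_k:
            h = frozenset([item])
            conf = support[l_k] / support[l_k - h]
            if conf >= minconf:
                out.append((set(l_k - h), set(h), conf, support[l_k]))
                h1.add(h)
        if h1:
            ap_genrules(l_k, h1, support, minconf, out)

    support = {frozenset(s): n for s, n in
               [("ABC", 3), ("AB", 4), ("AC", 3), ("BC", 3),
                ("A", 5), ("B", 6), ("C", 4)]}
    out = []
    rules_from(frozenset("ABC"), support, 0.75, out)
    # AB => C, AC => B, BC => A, and then C => AB from the surviving consequents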

  18. Faster Algorithm • For example, consider a large itemset ABCDE. Assume that ACDE ⇒ B and ABCE ⇒ D are the only one-item consequent rules with minimum confidence. • In the simple algorithm, the recursive call genrules(ABCDE, ACDE) will test whether the two-item consequent rules ACD ⇒ BE, ADE ⇒ BC, CDE ⇒ AB, and ACE ⇒ BD hold. • The first of these rules cannot hold, because E ⊂ BE and ABCD ⇒ E does not have minimum confidence. The second and third rules cannot hold for similar reasons. • The only two-item consequent rule that can possibly hold is ACE ⇒ BD, where B and D are the consequents of the valid one-item consequent rules. This is the only rule the faster algorithm will test.

  19. Mining Frequent Itemsets • If we use candidate generation • Need to generate a huge number of candidate sets. • Need to repeatedly scan the database and check a large set of candidates by pattern matching. • Can we avoid that? • FP-Trees (Frequent Pattern Trees)

  20. FP-Tree • General Idea • Divide and Conquer • Outline of the procedure • For each item create its conditional pattern base • Then expand it and create its conditional FP-Tree • Repeat the process recursively on each constructed FP-Tree • Until FP-Tree is either empty or contains only one path

  21. Example

    TID   Items bought
    100   {f, a, c, d, g, i, m, p}
    200   {a, b, c, f, l, m, o}
    300   {b, f, h, j, o}
    400   {b, c, k, s, p}
    500   {a, f, c, e, l, p, m, n}

  • For minimum support = 50% (a count of at least 3 of the 5 transactions) • Step 1: Scan the DB to find the frequent 1-itemsets • Step 2: Order them in descending frequency • Step 3: Scan the DB again and construct the FP-Tree

  Frequent items and frequencies: f:4, c:4, a:3, b:3, m:3, p:3
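A Python sketch of the three steps on this slide, using the slide's own transaction database; the Node class, build_fptree name, and header-table layout are illustrative choices, and frequency ties are broken alphabetically here (the slide happens to list f before c, but any fixed total order works).

    from collections import defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.count, self.parent = item, 1, parent
            self.children = {}                    # item -> child Node

    def build_fptree(transactions, min_count):
        # Step 1: first scan finds the frequent items.
        freq = defaultdict(int)
        for t in transactions:
            for item in t:
                freq[item] += 1
        # Step 2: order them by descending frequency (ties broken alphabetically).
        order = sorted((i for i in freq if freq[i] >= min_count),
                       key=lambda i: (-freq[i], i))
        rank = {item: r for r, item in enumerate(order)}
        # Step 3: second scan inserts each filtered, reordered transaction;
        # the header table keeps the node-links for each frequent item.
        root, header = Node(None, None), defaultdict(list)
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                child = node.children.get(item)
                if child is None:
                    child = Node(item, node)
                    node.children[item] = child
                    header[item].append(child)
                else:
                    child.count += 1
                node = child
        return root, header, order

    db = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
          set("bcksp"), set("afcelpmn")]
    root, header, order = build_fptree(db, min_count=3)  # 50% of 5 transactions
    print(order)  # ['c', 'f', 'a', 'b', 'm', 'p'] with counts 4, 4, 3, 3, 3, 3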

  22. Example – Conditional Pattern Base

  [FP-tree figure: root → f:4 and root → c:1; f:4 → c:3 and f:4 → b:1; c:3 → a:3; a:3 → m:2 and a:3 → b:1; m:2 → p:2; b:1 (under a:3) → m:1; c:1 → b:1 → p:1]

  Header Table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

  • Start with the frequent-item header table • Traverse the tree by following the node-links from each frequent item • Accumulate the prefix paths to form that item's conditional pattern base

  Conditional pattern bases:
    c: f:3
    a: fc:3
    b: fca:1, f:1, c:1
    m: fca:2, fcab:1
    p: fcam:2, cb:1

  23. Example – Conditional FP-Tree

  [Figure: the m-conditional FP-tree, a single path root → f:3 → c:3 → a:3]

  • For every pattern base: • Accumulate the frequency for each item • Construct an FP-Tree for the frequent items of the pattern base • m-conditional pattern base: fca:2, fcab:1 • All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

  24. Mining The FP-Tree • Construct the conditional pattern base for each node in the initial FP-Tree • Construct a conditional FP-Tree from each conditional pattern base • Recursively visit the conditional FP-Trees and expand the frequent patterns • If a conditional FP-Tree contains only one path, enumerate all the combinations of its items
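A compact, tree-free Python sketch of this recursion: to stay self-contained it keeps each conditional pattern base as a list of (prefix, count) pairs rather than as a compressed FP-tree, and it omits the single-path shortcut and node-links; the recursive structure (build the conditional base of each frequent item, then recurse) is the one described above, and all names are illustrative.

    from collections import defaultdict

    def pattern_growth(bases, suffix, min_count, results):
        # `bases` is a conditional pattern base: (prefix, count) pairs whose
        # prefixes are tuples in frequency-descending item order.
        counts = defaultdict(int)
        for prefix, n in bases:
            for item in prefix:
                counts[item] += n
        for item, n in counts.items():
            if n < min_count:
                continue
            results[tuple(sorted((item,) + suffix))] = n
            # Conditional pattern base of `item`: the part of each prefix
            # that precedes it; then recurse (divide and conquer).
            sub = [(prefix[:prefix.index(item)], c)
                   for prefix, c in bases if item in prefix]
            pattern_growth(sub, (item,) + suffix, min_count, results)

    def mine(transactions, min_count):
        freq = defaultdict(int)
        for t in transactions:
            for i in t:
                freq[i] += 1
        rank = {i: r for r, i in enumerate(
            sorted((i for i in freq if freq[i] >= min_count),
                   key=lambda i: (-freq[i], i)))}
        bases = [(tuple(sorted((i for i in t if i in rank), key=rank.get)), 1)
                 for t in transactions]
        results = {}
        pattern_growth(bases, (), min_count, results)
        return results

    db = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
          set("bcksp"), set("afcelpmn")]
    res = mine(db, 3)
    print(sorted(k for k in res if "m" in k))
    # the 8 patterns containing m: m, am, cm, fm, acm, afm, cfm, acfm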

  25. Example – Mining The FP-Tree

    Item   Conditional pattern base    Conditional FP-tree
    p      {(fcam:2), (cb:1)}          {(c:3)}|p
    m      {(fca:2), (fcab:1)}         {(f:3, c:3, a:3)}|m
    b      {(fca:1), (f:1), (c:1)}     NULL
    a      {(fc:3)}                    {(f:3, c:3)}|a
    c      {(f:3)}                     {(f:3)}|c
    f      NULL                        NULL

  26. Example – Mining The FP-Tree • Recursively visit conditional FP-Trees

  [Figures: m-conditional FP-tree: root → f:3 → c:3 → a:3; am-conditional FP-tree: root → f:3 → c:3; cm-conditional FP-tree: root → f:3; cam-conditional FP-tree: root → f:3]

  27. Example – Mining The FP-Tree

  [Figure: the m-conditional FP-tree, a single path root → f:3 → c:3 → a:3]

  • If we have a single path, the frequent itemsets of the tree can be generated by enumerating all combinations of the nodes of the tree. Enumeration of all patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
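A small Python sketch of this enumeration for the single path f:3 → c:3 → a:3; the count of each combination is the minimum count along its chosen nodes (all 3 here).

    from itertools import combinations

    # The m-conditional FP-tree is the single path f:3 -> c:3 -> a:3.
    path = [("f", 3), ("c", 3), ("a", 3)]

    patterns = {("m",): 3}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = tuple(i for i, _ in combo) + ("m",)
            # the count of a combination is the minimum count along its nodes
            patterns[items] = min(n for _, n in combo)
    print(patterns)  # m, fm, cm, am, fcm, fam, cam, fcam -- all with count 3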

  28. Advantages of FP-Tree • Completeness: • Never breaks a long pattern of any transaction • Preserves complete information for frequent pattern mining • Compactness: • Reduces irrelevant information: infrequent items are gone • Frequency-descending ordering: more frequent items are more likely to be shared • Never larger than the original database (not counting node-links and item counts)

  29. Comparison of FP-Tree Method and Apriori • [Run-time comparison chart not reproduced in this transcript] • Data set: T25I20D10K

  30. References • R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 International Conference on Management of Data (SIGMOD 93), pages 207-216, May 1993 • R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB 94), September 1994 • J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proceedings of the 2000 International Conference on Management of Data (SIGMOD 2000), pages 1-12, May 2000
