1 / 19

Association Rules

Association Rules. What is Association Rule?. An association rule is a model that identifies specific types of data associations. They are frequent patterns : pattern(set of items, sequence , etc.) that occur frequently in a database ;

eaton-henry
Download Presentation

Association Rules

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Association Rules

  2. What is Association Rule? • An association rule is a model that identifies specific types of data associations. • They are frequent patterns : pattern(set of items, sequence , etc.) that occur frequently in a database ; • Frequent pattern mining: finding regularities in datafor e.g What products were often purchased together? • What are the subsequent purchase after buying a car? • Can we automatically profile customers?

  3. Need of Association of Rules • Foundation for many data mining tasks • - Association rules, correlation , causality , sequential patterns , spatial and multimedia patterns , associative classification , cluster analysis, iceberg cube, … • Broad applications • - Basket data analysis, cross marketing , catalog design , Sale campaign analysis , web log (Click stream) analysis, ...

  4. Basics • Itemset: a set of items e.g , acm={a,c,m} • Support of itemsets ; Sup(acm)=3 • Given min_sup=3,acm is a frequent pattern • Frequent pattern mining: find all frequent patterns in a database Transaction Database

  5. Applications • Correlation , causality analysis and mining interesting rules • Max patterns and frequent closed itemsets • Constraint-based mining • Sequential Patterns • Periodic Patterns • Computing icebergs cubes

  6. Frequent Pattern Mining Methods • Apriori and its variations / improvements • Mining frequent – patterns without candidate generation • Mining max-patterns and closed itemsets • Mining multi-dimensional , multi-level frequent patterns with flexible support constraints • Interestingness: correlation and causality

  7. Apriori: Candidate Generation-and-test • Any subset of a frequent itemset must be also frequent – an anti-monotone property • - A transaction containing {ATM card,PAN Card} also contain {debit,credit cards} • - {ATM card , PAN card} is frequent -> • { debit,credit} must also be frequent • No superset of any infrequent itemset should be generated or tested • - Many item combinations can be pruned

  8. Apriori - based Mining • Generate length(k+1) candidate itemsets from length k frequent itemsets , and • Test the candidates against DB

  9. Apriori Algorithm • A level-wise , candidate-generation-and test approach (Gavaskar and Srikant 1994) • Database D 1-candidates Freq 1-itemsets 2-candidates Counting Freq 2-itemsets 3-candidates Scan D Scan D Freq 3-itemsets

  10. Steps: • Ck: Candidate itemset of size k • Lk: frequent itemset of size k • L1 = {frequent items}; • For(k=1;Lk !=  ; k++) do • - Ck+1 = candidates generated from Lk; • for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t • Lk+1 = candidates in Ck+1 with min_support • Return Uk Lk;

  11. Details of Apriori • How to generate candidates? • step 1 : self joining Lk • step 2 : pruning • - How to count supports of candidates?

  12. How to Generate Candidates? • Suppose the items in Lk are listed in an order • Step 1: self-join Lk-1 • INSERT INTO Ck • SELECT p.item1,p.item2,…,p.itemk-1,q.itemk-1 • FROM Lk-1 p , Lk-1 q • WHERE p.item1 = q.item1,..,p.item k-2 = q.itemk-2,p.itemk-1 < q.itemk-1 • Step 2: pruning • -For each itemset c in Ck do • - For each (k-l) subset s of c do if(s is not in Lk-1) then delete c from Ck

  13. Example • L3 = {abc , abd , acd , ace, bcd} • Self-joining: L3*L3 • - abcd from abc and abd • - acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4 = {abcd}

  14. How to Count Supports of Candidates? • Why counting supports of candidates a problem? • - The total number of candidates can be very huge • - One transaction may contain many candidates • -Method: • - Candidate itemsets are stored in a hash – tree • Leaf node of hash – tree contains a list of itemsets and counts • Interior node contains a hash table • Subset function: finds all the candidates contained in a transaction

  15. Example: Subset function Transaction 1 2 3 5 6 1,4,7 3,6,9 2,5,8 2 3 4 5 6 7 1 + 2 3 5 6 1 3 + 5 6 3 5 6 3 5 7 6 8 9 3 6 7 3 6 8 1 4 5 1 3 6 1 2 + 3 5 6 1 2 4 4 5 7 1 2 5 4 5 8 1 5 9

  16. Challenges in Frequent Pattern Mining • Challenges • Multiple scans of transaction database • Huge number of candidates • Tedious workload of support counting for candidates • Improving Apriori: general ideas • - Reduce number of transaction database scans • - Shrinks number of candidates • - Facilitate support counting of canddiates

  17. Reduce Number of Scans • Reduce the Number of Candidates • A hashing bucket count < min_sup->every candidate in the buck is infrequent • Candidates : a , b , c ,d,e • Hash entries: {ab,ad,ae} {bd,be,de} … • Large 1-itemset : a,b,d,e • - The sum of counts of {ab,ad,ae} < min_sup -> ab should not be a candidate 2-itemset

  18. Scan Database Only Once • Partition the database into n partitions • Itemset X is frequent -> X frequent in atleast one partition • - Scan 1: partition database and find local frequent patterns • - Scan 2: consolidate global frequent patterns

  19. Sampling for Frequent • Select a sample of original database , mine frequent patterns within sample using Apriori • Scan database once to verify frequent itemsets found in sample , only borders of closure of frequent patterns are chechked • - Example : check abcd instead of ab,ac,..,etc. • -Scan database again to find missed frequent patterns

More Related