
Business Systems Intelligence: 4. Mining Association Rules






  1. Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee) Business Systems Intelligence: 4. Mining Association Rules

  2. Acknowledgments • These notes are based (heavily) on those provided by the authors to accompany “Data Mining: Concepts & Techniques” by Jiawei Han and Micheline Kamber • Some slides are also based on trainer’s kits provided by SAS • More information about the book is available at: www-sal.cs.uiuc.edu/~hanj/bk2/ • And information on SAS is available at: www.sas.com

  3. Mining Association Rules • Today we will look at: • Association rule mining • Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases • Sequential pattern mining • Applications/extensions of frequent pattern mining • Summary

  4. What Is Association Mining? • Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database • Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories

  5. Motivations For Association Mining • Motivation: Finding regularities in data • What products were often purchased together? • Beer and nappies! • What are the subsequent purchases after buying a PC? • What kinds of DNA are sensitive to this new drug? • Can we automatically classify web documents?

  6. Motivations For Association Mining (cont…) • Foundation for many essential data mining tasks • Association, correlation, causality • Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association • Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)

  7. Motivations For Association Mining (cont…) • Broad applications • Basket data analysis, cross-marketing, catalog design, sale campaign analysis • Web log (click stream) analysis, DNA sequence analysis, etc.

  8. Market Basket Analysis • Market basket analysis is a typical example of frequent itemset mining • Customers’ buying habits are divined by finding associations between different items that customers place in their “shopping baskets” • This information can be used to develop marketing strategies

  9. Market Basket Analysis (cont…)

  10. Association Rule Basic Concepts • Let I be a set of items {I1, I2, I3, …, Im} • Let D be a database of transactions, where each transaction T is a set of items such that T ⊆ I • So, if A is a set of items, a transaction T is said to contain A if and only if A ⊆ T • An association rule is an implication A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅

  11. Association Rule Support & Confidence • We say that an association rule A ⇒ B holds in the transaction set D with support, s, and confidence, c • The support of the association rule is given as the percentage of transactions in D that contain both A and B (i.e. A ∪ B) • So, the support can be considered the probability P(A ∪ B)

  12. Association Rule Support & Confidence (cont…) • The confidence of the association rule is given as the percentage of transactions in D containing A that also contain B • So, the confidence can be considered the conditional probability P(B|A) • Association rules that satisfy minimum support and confidence values are said to be strong

  13. Itemsets & Frequent Itemsets • An itemset is a set of items • A k-itemset is an itemset that contains k items • The occurrence frequency of an itemset is the number of transactions that contain the itemset • This is also known more simply as the frequency, support count or count • An itemset is said to be frequent if its support count satisfies a minimum support count threshold • The set of frequent k-itemsets is denoted Lk

  14. Support & Confidence Again • Support and confidence values can be calculated as follows: • support(A ⇒ B) = P(A ∪ B) = support_count(A ∪ B) / |D| • confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)
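To make the two formulas concrete, here is a minimal Python sketch over a small, made-up transaction database (the items and counts are illustrative, not the slides' example):

```python
# Minimal sketch: computing support and confidence for a rule A => B
# over a tiny, made-up transaction database.

def support_count(itemset, transactions):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

transactions = [
    {"beer", "nappy", "nuts"},
    {"beer", "nappy"},
    {"beer", "cola"},
    {"nappy", "cola"},
]

A, B = {"beer"}, {"nappy"}
n = len(transactions)

support = support_count(A | B, transactions) / n  # P(A u B)
confidence = support_count(A | B, transactions) / support_count(A, transactions)  # P(B|A)

print(f"support = {support:.2f}")      # 2/4 = 0.50
print(f"confidence = {confidence:.2f}")  # 2/3 = 0.67
```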

  15. Mining Association Rules: An Example

  16. Mining Association Rules: An Example (cont…)

  17. Association Rule Mining • So, in general association rule mining can be reduced to the following two steps: • Find all frequent itemsets • Each itemset will occur at least as frequently as a minimum support count • Generate strong association rules from the frequent itemsets • These rules will satisfy minimum support and confidence measures

  18. Combinatorial Explosion! • A major challenge in mining frequent itemsets is that the number of frequent itemsets generated can be massive • For example, a long frequent itemset will contain a combinatorial number of shorter frequent sub-itemsets • A frequent itemset of length 100 will contain the following number of frequent sub-itemsets: • (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30

  19. The Apriori Algorithm • Any subset of a frequent itemset must be frequent • If {beer, nappy, nuts} is frequent, so is {beer, nappy} • Every transaction having {beer, nappy, nuts} also contains {beer, nappy} • Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!

  20. The Apriori Algorithm (cont…) • The Apriori algorithm is known as a candidate generation-and-test approach • Method: • Generate length (k+1) candidate itemsets from length k frequent itemsets • Test the candidates against the DB • Performance studies show the algorithm’s efficiency and scalability
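As a rough illustration of the generate-and-test method, the following Python sketch implements the level-wise loop under simplifying assumptions (in-memory transactions as Python sets; the join and prune steps are detailed on slide 23):

```python
# Sketch of the Apriori generate-and-test loop (simplified, in-memory).
from itertools import combinations

def apriori(transactions, min_count):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    Lk = {s for s in items
          if sum(s <= t for t in transactions) >= min_count}
    frequent = set(Lk)
    k = 1
    while Lk:
        # generate length (k+1) candidates by joining frequent k-itemsets...
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # ...and prune any candidate whose k-subsets are not all frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        # test the surviving candidates against the DB
        Lk = {c for c in Ck
              if sum(c <= t for t in transactions) >= min_count}
        frequent |= Lk
        k += 1
    return frequent

print(apriori([{"beer", "nappy", "nuts"}, {"beer", "nappy"},
               {"beer", "cola"}, {"nappy", "cola"}], min_count=2))
```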

  21. The Apriori Algorithm: An Example • [Diagram: database TDB → 1st scan → C1 → L1 → C2 → 2nd scan → L2 → C3 → 3rd scan → L3]

  22. Important Details Of The Apriori Algorithm • There are two crucial questions in implementing the Apriori algorithm: • How to generate candidates? • How to count supports of candidates?

  23. Generating Candidates • There are 2 steps to generating candidates: • Step 1: Self-joining Lk • Step 2: Pruning • Example of Candidate-generation • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4={abcd}
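A minimal Python sketch of the two-step generation, reproducing the slide's L3 → C4 example; note it uses a union-based join rather than the usual shared-prefix join, which generates a few extra candidates that the pruning step then removes:

```python
# Sketch of two-step candidate generation: self-join, then Apriori pruning.
from itertools import combinations

def generate_candidates(Lk, k):
    """Join frequent k-itemsets into (k+1)-itemsets, prune by Apriori."""
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    # prune: every k-subset of a candidate must itself be frequent
    return {c for c in candidates
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
C4 = generate_candidates(L3, 3)
print([''.join(sorted(c)) for c in C4])  # ['abcd']  (acde pruned: ade not in L3)
```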

  24. How to Count Supports Of Candidates? • Why is counting supports of candidates a problem? • The total number of candidates can be huge • One transaction may contain many candidates • Method: • Candidate itemsets are stored in a hash-tree • Leaf nodes of the hash-tree contain lists of itemsets and counts • Interior nodes contain hash tables • Subset function: finds all the candidates contained in a transaction
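The hash-tree itself is more elaborate; as a simplified stand-in, the following sketch enumerates each transaction's k-subsets (the "subset function") and looks them up in a plain dictionary of candidates:

```python
# Dictionary-based stand-in for hash-tree support counting.
from itertools import combinations

def count_supports(candidates, transactions, k):
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        # the "subset function": all k-subsets of the transaction
        for subset in combinations(sorted(t), k):
            s = frozenset(subset)
            if s in counts:
                counts[s] += 1
    return counts

C2 = [{"beer", "nappy"}, {"beer", "cola"}]
print(count_supports(C2, [{"beer", "nappy", "nuts"}, {"beer", "nappy"}], 2))
# {frozenset({'beer', 'nappy'}): 2, frozenset({'beer', 'cola'}): 0}
```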

  25. Generating Association Rules • Once all frequent itemsets have been found association rules can be generated • Strong association rules from a frequent itemset are generated by calculating the confidence in each possible rule arising from that itemset and testing it against a minimum confidence threshold
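A sketch of this rule-generation step; it assumes the mining phase has already recorded support counts for every frequent itemset and its subsets (which Apriori guarantees), and the function name is illustrative:

```python
# Sketch: generate strong rules A => (S - A) from a frequent itemset S.
from itertools import combinations

def rules_from_itemset(S, support_count, min_conf):
    """support_count maps frozensets to their counts in the DB."""
    S = frozenset(S)
    rules = []
    for r in range(1, len(S)):
        for antecedent in combinations(S, r):
            A = frozenset(antecedent)
            conf = support_count[S] / support_count[A]
            if conf >= min_conf:
                rules.append((set(A), set(S - A), conf))
    return rules

# Toy counts (made up): {beer, nappy} appears 2 times, beer 3, nappy 3
counts = {frozenset({"beer"}): 3, frozenset({"nappy"}): 3,
          frozenset({"beer", "nappy"}): 2}
print(rules_from_itemset({"beer", "nappy"}, counts, min_conf=0.6))
# both {beer} => {nappy} and {nappy} => {beer} pass with confidence 0.67
```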

  26. Example

  27. Example

  28. Challenges Of Frequent Pattern Mining • Challenges • Multiple scans of transaction database • Huge number of candidates • Tedious workload of support counting for candidates • Improving Apriori: general ideas • Reduce passes of transaction database scans • Shrink number of candidates • Facilitate support counting of candidates

  29. Bottleneck Of Frequent-Pattern Mining • Multiple database scans are costly • Mining long patterns needs many passes of scanning and generates lots of candidates • To find frequent itemset i1i2…i100 • # of scans: 100 • # of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 • Bottleneck: candidate-generation-and-test

  30. Mining Frequent Patterns Without Candidate Generation • Techniques for mining frequent itemsets which avoid candidate generation include: • FP-growth • Grow long patterns from short ones using local frequent items • ECLAT (Equivalence CLASS Transformation) algorithm • Uses a data representation in which transactions are associated with items, rather than the other way around (vertical data format) • These methods can be much faster than the Apriori algorithm
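To illustrate the vertical data format that ECLAT relies on, here is a small sketch (the data is made up): each item maps to the set of transaction IDs containing it (a "tidset"), and the support of an itemset is the size of the intersection of its items' tidsets:

```python
# Sketch of ECLAT's vertical (tidset) data representation.
from collections import defaultdict

def vertical_format(transactions):
    """Map each item to the set of transaction IDs that contain it."""
    tidsets = defaultdict(set)
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets[item].add(tid)
    return tidsets

transactions = [{"beer", "nappy"}, {"beer", "cola"}, {"beer", "nappy", "nuts"}]
tidsets = vertical_format(transactions)
# support count of {beer, nappy} = size of the tidset intersection
print(len(tidsets["beer"] & tidsets["nappy"]))  # 2
```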

  31. Sequence Databases and Sequential Pattern Analysis • Frequent patterns vs. (frequent) sequential patterns • Applications of sequential pattern mining • Customer shopping sequences: • First buy computer, then CD-ROM, and then digital camera, within 3 months. • Medical treatment, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc. • Telephone calling patterns, Weblog click streams • DNA sequences and gene structures

  32. What Is Sequential Pattern Mining? • Given a set of sequences (a sequence database), find the complete set of frequent subsequences • A sequence, e.g. <(ef)(ab)(df)cb>, is an ordered list of elements, and an element may contain a set of items • Items within an element are unordered and we list them alphabetically • <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)> • Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
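To make the containment definition concrete, here is a short Python sketch (not from the slides) testing whether one sequence, represented as a list of itemsets, is a subsequence of another, checked against the slide's example:

```python
# Sketch: is `sub` a subsequence of `seq`? Each element of `sub` must be
# contained in some element of `seq`, respecting order (greedy match).
def is_subsequence(sub, seq):
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

sub = [{'a'}, {'b', 'c'}, {'d'}, {'c'}]                        # <a(bc)dc>
seq = [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}]  # <a(abc)(ac)d(cf)>
print(is_subsequence(sub, seq))  # True
```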

  33. Challenges On Sequential Pattern Mining • A huge number of possible sequential patterns are hidden in databases • A mining algorithm should • Find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold • Be highly efficient, scalable, involving only a small number of database scans • Be able to incorporate various kinds of user-specific constraints

  34. A Basic Property Of Sequential Patterns: Apriori • A basic property: Apriori • If a sequence S is not frequent, then none of the super-sequences of S is frequent • E.g., given support threshold min_sup = 2, <hb> is infrequent, so are <hab> and <(ah)b> • Sequence database: • Seq. ID 10: <(bd)cb(ac)> • Seq. ID 20: <(bf)(ce)b(fg)> • Seq. ID 30: <(ah)(bf)abf> • Seq. ID 40: <(be)(ce)d> • Seq. ID 50: <a(bd)bcb(ade)>

  35. GSP—A Generalized Sequential Pattern Mining Algorithm • GSP (Generalized Sequential Pattern) mining algorithm proposed in 1996 • Outline of the method • Initially, every item in DB is a candidate of length 1 • For each level (i.e., sequences of length k): • Scan database to collect support count for each candidate sequence • Generate candidate length (k+1) sequences from length k frequent sequences using Apriori • Repeat until no frequent sequence or no candidate can be found • Major strength: Candidate pruning by Apriori

  36. Finding Length 1 Sequential Patterns • Examine GSP using an example (min_sup = 2) • Sequence database: • Seq. ID 10: <(bd)cb(ac)> • Seq. ID 20: <(bf)(ce)b(fg)> • Seq. ID 30: <(ah)(bf)abf> • Seq. ID 40: <(be)(ce)d> • Seq. ID 50: <a(bd)bcb(ade)> • Initial candidates: all singleton sequences • <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h> • Scan database once, count support for candidates

  37. Generating Length 2 Candidates • Without the Apriori property, there would be 8*8 + 8*7/2 = 92 length-2 candidates • With it, only 51 length-2 candidates are generated • Apriori prunes 44.57% of candidates

  38. Finding Length 2 Sequential Patterns • Scan database one more time, collect support count for each length 2 candidate • There are 19 length 2 candidates which pass the minimum support threshold • They are length 2 sequential patterns

  39. Generating Length 3 Candidates And Finding Length 3 Patterns • Generate length 3 candidates • Self-join length 2 sequential patterns • Based on the Apriori property • <ab>, <aa> and <ba> are all length 2 sequential patterns ⇒ <aba> is a length 3 candidate • <(bd)>, <bb> and <db> are all length 2 sequential patterns ⇒ <(bd)b> is a length 3 candidate • 46 candidates are generated • Find length 3 sequential patterns • Scan database once more, collect support counts for candidates • 19 out of 46 candidates pass the support threshold

  40. The GSP Mining Process • [Diagram summarising the level-wise scans, min_sup = 2] • Sequence database: Seq. ID 10: <(bd)cb(ac)>, 20: <(bf)(ce)b(fg)>, 30: <(ah)(bf)abf>, 40: <(be)(ce)d>, 50: <a(bd)bcb(ade)> • 1st scan: 8 candidates (<a> <b> <c> <d> <e> <f> <g> <h>), 6 length-1 sequential patterns • 2nd scan: 51 candidates (<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>), 19 length-2 sequential patterns; 10 candidates not in the DB at all • 3rd scan: 46 candidates, 19 length-3 sequential patterns; 20 candidates not in the DB at all (<abb> <aab> <aba> <baa> <bab> …) • 4th scan: 8 candidates, 6 length-4 sequential patterns; candidates not in the DB at all include <abba> <(bd)bc> … • 5th scan: 1 candidate, 1 length-5 sequential pattern: <(bd)cba> • At each level, candidates that cannot pass the support threshold are discarded

  41. The GSP Algorithm • Take sequences of the form <x> as length 1 candidates • Scan database once, find F1, the set of length 1 sequential patterns • Let k = 1; while Fk is not empty do • Form Ck+1, the set of length (k+1) candidates, from Fk • If Ck+1 is not empty, scan database once, find Fk+1, the set of length (k+1) sequential patterns • Let k = k + 1
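A rough Python rendering of this loop, simplified to sequences of single items (full GSP also joins itemset elements and applies contiguous-subsequence pruning); the data below flattens the slides' sequence database by dropping itemset brackets, an illustrative simplification:

```python
# Sketch of the GSP level-wise loop, simplified to single-item elements.
def gsp(sequences, min_sup):
    def contains(cand, s):  # is cand a subsequence of s?
        i = 0
        for item in s:
            if i < len(cand) and cand[i] == item:
                i += 1
        return i == len(cand)

    def support(cand):
        return sum(contains(cand, s) for s in sequences)

    items = sorted({i for s in sequences for i in s})
    Fk = [(i,) for i in items if support((i,)) >= min_sup]  # F1
    frequent = list(Fk)
    while Fk:
        # join step: extend a by b's last item where a's tail matches b's head
        Ck = {a + b[-1:] for a in Fk for b in Fk if a[1:] == b[:-1]}
        Fk = [c for c in Ck if support(c) >= min_sup]  # one DB scan per level
        frequent += Fk
    return frequent

# slides' DB flattened to item strings (itemset structure dropped)
print(gsp(["bdcbac", "bfcebfg", "ahbfabf", "beced", "abdbcbade"], min_sup=2))
```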

  42. Bottlenecks of GSP • A huge set of candidates could be generated • 1,000 frequent length 1 sequences generate 1,000 × 1,000 + (1,000 × 999) / 2 = 1,499,500 length 2 candidates! • Multiple scans of database in mining • Real challenge: mining long sequential patterns • An exponential number of short candidates • A length-100 sequential pattern needs 10^30 candidate sequences!

  43. Improvements On GSP • FreeSpan: • Projection-based: no candidate sequence needs to be generated • But projection can be performed at any point in the sequence, and the projected sequences will not shrink much • PrefixSpan: • Also projection-based, but only prefix-based projection: fewer projections and quickly shrinking sequences

  44. Frequent-Pattern Mining: Achievements • Frequent pattern mining—an important task in data mining • Frequent pattern mining methodology • Candidate generation & test vs. projection-based (frequent-pattern growth) • Various optimization methods: database partition, scan reduction, hash tree, sampling, border computation, clustering, etc. • Related frequent-pattern mining algorithm: scope extension • Mining closed frequent itemsets and max-patterns (e.g., MaxMiner, CLOSET, CHARM, etc.) • Mining multi-level, multi-dimensional frequent patterns with flexible support constraints • Constraint pushing for mining optimization • From frequent patterns to correlation and causality

  45. Frequent-Pattern Mining: Research Problems • Multi-dimensional gradient analysis: patterns regarding changes and differences • Not just counts—other measures, e.g., avg(profit) • Mining top-k frequent patterns without support constraint • Mining fault-tolerant associations • “3 out of 4 courses excellent” leads to A in data mining • Fascicles and database compression by frequent pattern mining • Partial periodic patterns • DNA sequence analysis and pattern classification

  46. Questions?
