Business Systems Intelligence: 4. Mining Association Rules


Acknowledgments

- These notes are based (heavily) on those provided by the authors to accompany “Data Mining: Concepts & Techniques” by Jiawei Han and Micheline Kamber
- Some slides are also based on trainer’s kits provided by SAS

More information about the book is available at: www-sal.cs.uiuc.edu/~hanj/bk2/

And information on SAS is available at: www.sas.com

Mining Association Rules

- Today we will look at:
- Association rule mining
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
- Sequential pattern mining
- Applications/extensions of frequent pattern mining
- Summary

Frequent Pattern: A pattern (set of items, sequence, etc.) that occurs frequently in a database

What Is Association Mining?

- Association rule mining:
- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories

Motivations For Association Mining

- Motivation: Finding regularities in data
- What products were often purchased together?
- Beer and nappies!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?

Motivations For Association Mining (cont…)

- Foundation for many essential data mining tasks
- Association, correlation, causality
- Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
- Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)

Motivations For Association Mining (cont…)

- Broad applications
- Basket data analysis, cross-marketing, catalog design, sale campaign analysis
- Web log (click stream) analysis, DNA sequence analysis, etc.

Market Basket Analysis

- Market basket analysis is a typical example of frequent itemset mining
- Customers’ buying habits are divined by finding associations between the different items that customers place in their “shopping baskets”
- This information can be used to develop marketing strategies

Association Rule Basic Concepts

- Let I be a set of items {I1, I2, I3, …, Im}
- Let D be a database of transactions, where each transaction T is a set of items such that T ⊆ I
- So, if A is a set of items, a transaction T is said to contain A if and only if A ⊆ T
- An association rule is an implication A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅

Association Rule Support & Confidence

- We say that an association rule A ⇒ B holds in the transaction set D with support, s, and confidence, c
- The support of the association rule is given as the percentage of transactions in D that contain both A and B (i.e. A ∪ B)
- So, the support can be considered the probability P(A ∪ B)

Association Rule Support & Confidence (cont…)

- The confidence of the association rule is given as the percentage of transactions in D containing A that also contain B
- So, the confidence can be considered the conditional probability P(B|A)
- Association rules that satisfy minimum support and confidence values are said to be strong

Itemsets & Frequent Itemsets

- An itemset is a set of items
- A k-itemset is an itemset that contains k items
- The occurrence frequency of an itemset is the number of transactions that contain the itemset
- This is also known more simply as the frequency, support count or count
- An itemset is said to be frequent if the support count satisfies a minimum support count threshold
- The set of frequent k-itemsets is denoted Lk

Support & Confidence Again

- Support and confidence values can be calculated as follows:
- support(A ⇒ B) = P(A ∪ B) = support_count(A ∪ B) / |D|
- confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)
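The two formulas can be checked with a minimal sketch, assuming a tiny hypothetical transaction database (the item names follow the beer/nappies example elsewhere in these notes):

```python
# Minimal sketch of the support/confidence calculations above,
# over a small illustrative transaction database.
transactions = [
    {"beer", "nappy", "nuts"},
    {"beer", "nappy"},
    {"nappy", "nuts"},
    {"beer", "nuts"},
]

def support(itemset, db):
    """Fraction of transactions in db containing every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(a, b, db):
    """Conditional probability P(B|A) = support(A ∪ B) / support(A)."""
    return support(a | b, db) / support(a, db)

print(support({"beer", "nappy"}, transactions))       # 0.5
print(confidence({"beer"}, {"nappy"}, transactions))  # ≈ 0.667
```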

Association Rule Mining

- So, in general association rule mining can be reduced to the following two steps:
- Find all frequent itemsets
- Each itemset will occur at least as frequently as a minimum support count
- Generate strong association rules from the frequent itemsets
- These rules will satisfy minimum support and confidence measures

Combinatorial Explosion!

- A major challenge in mining frequent itemsets is that the number of frequent itemsets generated can be massive
- For example, a long frequent itemset will contain a combinatorial number of shorter frequent sub-itemsets
- A frequent itemset of length 100 contains 2^100 − 1 ≈ 1.27 × 10^30 frequent sub-itemsets
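The count follows from summing the binomial coefficients C(100, k) over all non-empty subset sizes, which a few lines of Python can confirm:

```python
from math import comb

# Sum of C(100, k) for k = 1..100 equals 2^100 - 1.
n = 100
total = sum(comb(n, k) for k in range(1, n + 1))
assert total == 2**n - 1
print(f"{total:.3g}")  # ≈ 1.27e+30
```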

The Apriori Algorithm

- Any subset of a frequent itemset must be frequent
- If {beer, nappy, nuts} is frequent, so is {beer, nappy}
- Every transaction having {beer, nappy, nuts} also contains {beer, nappy}
- Apriori pruning principle: if an itemset is infrequent, none of its supersets should be generated or tested!

The Apriori Algorithm (cont…)

- The Apriori algorithm is known as a candidate generation-and-test approach
- Method:
- Generate length (k+1) candidate itemsets from length k frequent itemsets
- Test the candidates against the DB
- Performance studies show the algorithm’s efficiency and scalability
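As a sketch of the generate-and-test loop (a minimal, unoptimised Python version; the variable names are my own, and real implementations count supports with hash-trees rather than a linear scan):

```python
from itertools import combinations

def apriori(db, min_count):
    """Level-wise Apriori: build length-(k+1) candidates from length-k
    frequent itemsets, prune, then test candidates against the database."""
    items = sorted({i for t in db for i in t})
    level = [frozenset([i]) for i in items
             if sum(i in t for t in db) >= min_count]
    frequent, k = list(level), 1
    while level:
        # Self-join: unions of two length-k frequent itemsets of size k+1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        # Prune: every length-k subset of a candidate must itself be frequent
        prev = set(level)
        candidates = {c for c in joined
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # Test: count each candidate's support with one pass over the database
        level = [c for c in candidates if sum(c <= t for t in db) >= min_count]
        frequent += level
        k += 1
    return frequent

db = [{"beer", "nappy", "nuts"}, {"beer", "nappy"},
      {"nappy", "nuts"}, {"beer", "nuts"}]
print(apriori(db, 2))  # 3 frequent singletons and 3 frequent pairs
```

With min_count = 2, every single item and every pair is frequent, but the 3-itemset {beer, nappy, nuts} occurs only once and fails the test step.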

Important Details Of The Apriori Algorithm

- There are two crucial questions in implementing the Apriori algorithm:
- How to generate candidates?
- How to count supports of candidates?

Generating Candidates

- There are 2 steps to generating candidates:
- Step 1: Self-joining Lk
- Step 2: Pruning
- Example of Candidate-generation
- L3={abc, abd, acd, ace, bcd}
- Self-joining: L3*L3
- abcd from abc and abd
- acde from acd and ace
- Pruning:
- acde is removed because ade is not in L3
- C4={abcd}
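The join/prune example can be reproduced in a few lines. This is a sketch: it uses a union-based join rather than the classic sorted-prefix join, so it briefly generates extra unions (e.g. abce), but the pruning step removes them and yields the same C4:

```python
from itertools import combinations

def gen_candidates(lk, k):
    """Self-join Lk (step 1), then prune any candidate that has an
    infrequent length-k subset (step 2)."""
    lk = {frozenset(s) for s in lk}
    joined = {a | b for a, b in combinations(lk, 2) if len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in lk for s in combinations(c, k))}

L3 = ["abc", "abd", "acd", "ace", "bcd"]
print(gen_candidates(L3, 3))  # C4 = {abcd}; acde is pruned (ade not in L3)
```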

How to Count Supports Of Candidates?

- Why is counting supports of candidates a problem?
- The total number of candidates can be huge
- One transaction may contain many candidates
- Method:
- Candidate itemsets are stored in a hash-tree
- Leaf node of hash-tree contains a list of itemsets and counts
- Interior node contains a hash table
- Subset function: finds all the candidates contained in a transaction
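A dictionary-based sketch of the subset function (real implementations use the hash-tree described above; a plain set lookup shows the same idea of matching a transaction's subsets against the candidates):

```python
from itertools import combinations
from collections import Counter

def count_supports(db, candidates, k):
    """For each transaction, enumerate its length-k subsets and bump the
    count of every subset that is a candidate itemset."""
    cand = {frozenset(c) for c in candidates}
    counts = Counter()
    for t in db:
        for s in combinations(sorted(t), k):
            fs = frozenset(s)
            if fs in cand:
                counts[fs] += 1
    return counts

db = [{"beer", "nappy", "nuts"}, {"beer", "nappy"}, {"nappy", "nuts"}]
print(count_supports(db, [{"beer", "nappy"}, {"nappy", "nuts"}], 2))
```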

Generating Association Rules

- Once all frequent itemsets have been found, association rules can be generated
- Strong association rules are generated from a frequent itemset by calculating the confidence of each possible rule arising from that itemset and testing it against a minimum confidence threshold
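A minimal sketch of this rule-generation step (the function name is my own; in practice the support counts would be reused from the Apriori pass rather than recomputed):

```python
from itertools import combinations

def rules_from(itemset, db, min_conf):
    """Emit A => (F - A) for every non-empty proper subset A of a frequent
    itemset F whose confidence meets min_conf."""
    f = frozenset(itemset)
    count = lambda s: sum(s <= t for t in db)
    rules = []
    for r in range(1, len(f)):
        for a in map(frozenset, combinations(f, r)):
            conf = count(f) / count(a)
            if conf >= min_conf:
                rules.append((set(a), set(f - a), conf))
    return rules

db = [{"beer", "nappy", "nuts"}, {"beer", "nappy"},
      {"nappy", "nuts"}, {"beer", "nuts"}]
for lhs, rhs, conf in rules_from({"beer", "nappy"}, db, 0.6):
    print(lhs, "=>", rhs, round(conf, 3))
```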

Challenges Of Frequent Pattern Mining

- Challenges
- Multiple scans of transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates
- Improving Apriori: general ideas
- Reduce passes of transaction database scans
- Shrink number of candidates
- Facilitate support counting of candidates

Bottleneck Of Frequent-Pattern Mining

- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
- To find the frequent itemset i1i2…i100
- # of scans: 100
- # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30
- Bottleneck: candidate-generation-and-test

Mining Frequent Patterns Without Candidate Generation

- Techniques for mining frequent itemsets which avoid candidate generation include:
- FP-growth
- Grow long patterns from short ones using local frequent items
- ECLAT (Equivalence CLASS Transformation) algorithm
- Uses a data representation in which transactions are associated with items, rather than the other way around (vertical data format)
- These methods can be much faster than the Apriori algorithm

Sequence Databases and Sequential Pattern Analysis

- Frequent patterns vs. (frequent) sequential patterns
- Applications of sequential pattern mining
- Customer shopping sequences:
- First buy computer, then CD-ROM, and then digital camera, within 3 months.
- Medical treatment, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
- Telephone calling patterns, Weblog click streams
- DNA sequences and gene structures

What Is Sequential Pattern Mining?

- Given a set of sequences, find the complete set of frequent subsequences

- An example sequence: <(ef)(ab)(df)cb>
- An element may contain a set of items; items within an element are unordered, and we list them alphabetically
- <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
- Given support threshold min_sup = 2 over a sequence database, <(ab)c> is a sequential pattern
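The containment relation can be sketched as follows (a greedy check, with each sequence element modelled as a Python set of items):

```python
def is_subsequence(sub, seq):
    """True if sub is contained in seq: each element of sub must be a
    subset of a distinct element of seq, in order (greedy matching)."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

# The example above: <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]
seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]
print(is_subsequence(sub, seq))  # True
```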

Challenges On Sequential Pattern Mining

- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should
- Find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
- Be highly efficient, scalable, involving only a small number of database scans
- Be able to incorporate various kinds of user-specific constraints

A sequence database (support threshold min_sup = 2):

| SID | Sequence        |
|-----|-----------------|
| 10  | <(bd)cb(ac)>    |
| 20  | <(bf)(ce)b(fg)> |
| 30  | <(ah)(bf)abf>   |
| 40  | <(be)(ce)d>     |
| 50  | <a(bd)bcb(ade)> |

- A basic property: Apriori
- If a sequence S is not frequent
- Then none of the super-sequences of S is frequent
- E.g., <hb> is infrequent, so are <hab> and <(ah)b>

GSP—A Generalized Sequential Pattern Mining Algorithm

- GSP (Generalized Sequential Pattern) mining algorithm proposed in 1996
- Outline of the method
- Initially, every item in DB is a candidate of length 1
- For each level (i.e., sequences of length k):
- Scan database to collect support count for each candidate sequence
- Generate candidate length (k+1) sequences from length k frequent sequences using Apriori
- Repeat until no frequent sequence or no candidate can be found
- Major strength: Candidate pruning by Apriori

- Using the same sequence database as above, with min_sup = 2

- Examine GSP using an example
- Initial candidates: all singleton sequences
- <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
- Scan database once, count support for candidates

Generating Length 2 Candidates

51 length-2

Candidates

Without Apriori property,

8*8+8*7/2=92 candidates

Apriori prunes

44.57% candidates

Finding Length 2 Sequential Patterns

- Scan database one more time, collect support count for each length 2 candidate
- There are 19 length 2 candidates which pass the minimum support threshold
- They are length 2 sequential patterns

Generating Length 3 Candidates And Finding Length 3 Patterns

- Generate length 3 candidates
- Self-join length 2 sequential patterns
- Based on the Apriori property
- <ab>, <aa> and <ba> are all length 2 sequential patterns ⇒ <aba> is a length 3 candidate
- <(bd)>, <bb> and <db> are all length 2 sequential patterns ⇒ <(bd)b> is a length 3 candidate
- 46 candidates are generated
- Find length 3 sequential patterns
- Scan database once more, collect support counts for candidates
- 19 out of 46 candidates pass support threshold

The GSP Mining Process (min_sup = 2)

Level by level, on the sequence database above:

- 1st scan: 8 candidates (<a> <b> <c> <d> <e> <f> <g> <h>), 6 length 1 sequential patterns
- 2nd scan: 51 candidates (<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>), 19 length 2 sequential patterns; 10 candidates not in the DB at all
- 3rd scan: 46 candidates, 19 length 3 sequential patterns; 20 candidates (e.g. <abb> <aab> <aba> <baa> <bab> …) not in the DB at all
- 4th scan: 8 candidates, 6 length 4 sequential patterns; some candidates (e.g. <abba> <(bd)bc> …) not in the DB at all, others cannot pass the support threshold
- 5th scan: 1 candidate, 1 length 5 sequential pattern: <(bd)cba>

The GSP Algorithm

- Take sequences of the form <x> as length 1 candidates
- Scan database once, find F1, the set of length 1 sequential patterns
- Let k=1; while Fk is not empty do
- Form Ck+1, the set of length (k+1) candidates from Fk
- If Ck+1 is not empty, scan database once, find Fk+1, the set of length (k+1) sequential patterns
- Let k=k+1
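The loop above can be sketched for the special case of sequences of single items (a simplification of full GSP that ignores itemset elements; function and variable names are my own):

```python
from itertools import product

def gsp_simple(db, min_sup):
    """Level-wise GSP over sequences of single items: generate length-(k+1)
    candidates from length-k patterns, then count supports with a DB scan."""
    def contains(seq, cand):
        i = 0
        for x in seq:
            if i < len(cand) and x == cand[i]:
                i += 1
        return i == len(cand)

    def sup(cand):
        return sum(contains(s, cand) for s in db)

    items = sorted({x for s in db for x in s})
    level = [(i,) for i in items if sup((i,)) >= min_sup]
    patterns = list(level)
    while level:
        # Join step: p and q join when dropping p's head equals dropping q's tail
        cands = {p + (q[-1],) for p, q in product(level, level) if p[1:] == q[:-1]}
        level = sorted(c for c in cands if sup(c) >= min_sup)
        patterns += level
    return patterns

db = [("a", "b"), ("a", "a", "b"), ("b", "a"), ("b",)]
print(gsp_simple(db, 2))  # [('a',), ('b',), ('a', 'b')]
```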

Bottlenecks of GSP

- A huge set of candidates could be generated
- 1,000 frequent length 1 sequences generate 1,000×1,000 + 1,000×999/2 = 1,499,500 length 2 candidates!
- Multiple scans of database in mining
- Real challenge: mining long sequential patterns
- An exponential number of short candidates
- A length-100 sequential pattern needs ~10^30 candidate sequences!

Improvements On GSP

- FreeSpan:
- Projection-based: No candidate sequence needs to be generated
- But, projection can be performed at any point in the sequence, and the projected sequences will not shrink much
- PrefixSpan
- Projection-based
- But only prefix-based projection: less projections and quickly shrinking sequences

Frequent-Pattern Mining: Achievements

- Frequent pattern mining—an important task in data mining
- Frequent pattern mining methodology
- Candidate generation & test vs. projection-based (frequent-pattern growth)
- Various optimization methods: database partition, scan reduction, hash tree, sampling, border computation, clustering, etc.
- Related frequent-pattern mining algorithm: scope extension
- Mining closed frequent itemsets and max-patterns (e.g., MaxMiner, CLOSET, CHARM, etc.)
- Mining multi-level, multi-dimensional frequent patterns with flexible support constraints
- Constraint pushing for mining optimization
- From frequent patterns to correlation and causality

Frequent-Pattern Mining: Research Problems

- Multi-dimensional gradient analysis: patterns regarding changes and differences
- Not just counts—other measures, e.g., avg(profit)
- Mining top-k frequent patterns without support constraint
- Mining fault-tolerant associations
- “3 out of 4 courses excellent” leads to A in data mining
- Fascicles and database compression by frequent pattern mining
- Partial periodic patterns
- DNA sequence analysis and pattern classification
