
Chapter 4: Mining Frequent Patterns, Associations and Correlations

- 4.1 Basic Concepts
- 4.2 Frequent Itemset Mining Methods
- 4.3 Which Patterns Are Interesting?
- Pattern Evaluation Methods
- 4.4 Summary

Frequent Pattern Analysis

- Frequent Pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- Goal: finding inherent regularities in data
- What products were often purchased together? Beer and diapers?!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify Web documents?
- Applications:
- Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

Why is Frequent Pattern Mining Important?

- An important property of datasets
- Foundation for many essential data mining tasks
- Association, correlation, and causality analysis
- Sequential, structural (e.g., sub-graph) patterns
- Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
- Classification: discriminative, frequent pattern analysis
- Clustering analysis: frequent pattern-based clustering
- Data warehousing: iceberg cube and cube-gradient
- Semantic data compression
- Broad applications

Frequent Patterns

- Itemset: a set of one or more items
- k-itemset: an itemset X = {x1, …, xk} containing k items
- (Absolute) support, or support count, of X: the number of transactions that contain X
- (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
- An itemset X is frequent if X's support is no less than a minsup threshold

[Figure: Venn diagram of the customers who buy beer, the customers who buy diapers, and the customers who buy both]

Association Rules

- Find all the rules X ⇒ Y that meet minimum support and confidence thresholds
- support, s: probability that a transaction contains X ∪ Y
- confidence, c: conditional probability that a transaction containing X also contains Y

Let minsup = 50%, minconf = 50%

Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3

- Association rules (among many more; the two below are recomputed in the sketch that follows):
- Beer ⇒ Diaper (60%, 100%)
- Diaper ⇒ Beer (60%, 75%)
- Rules that satisfy both minsup and minconf are called strong rules
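To make these numbers concrete, here is a minimal Python sketch that recomputes both rules. The five transactions are an assumption: the slide does not show the database, so any basket set consistent with the counts above (Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3) serves.

```python
# Hypothetical five-transaction database, consistent with the slide's counts.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support_count(itemset):
    """Absolute support: the number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

def rule_metrics(lhs, rhs):
    """(support, confidence) of the rule lhs => rhs."""
    both = support_count(lhs | rhs)
    return both / len(transactions), both / support_count(lhs)

print(rule_metrics({"Beer"}, {"Diaper"}))   # (0.6, 1.0)  i.e. (60%, 100%)
print(rule_metrics({"Diaper"}, {"Beer"}))   # (0.6, 0.75) i.e. (60%, 75%)
```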


Closed Patterns and Max-Patterns

- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
- Closed patterns are a lossless compression of frequent patterns
- Reducing the number of patterns and rules

Closed Patterns and Max-Patterns

Example

- DB = {<a1, …, a100>, <a1, …, a50>}
- min_sup = 1
- What is the set of closed itemsets?
- <a1, …, a100>: 1
- <a1, …, a50>: 2
- What is the set of max-patterns?
- <a1, …, a100>: 1
- What is the set of all frequent patterns?
- All 2^100 - 1 nonempty sub-itemsets of <a1, …, a100> are frequent: far too many to enumerate! (A toy analog is checked in the sketch below.)
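A brute-force sketch of the two definitions on a toy analog of this example (4 items instead of 100, 2 instead of 50; all names are mine). It enumerates every itemset's support, which is feasible only at this scale:

```python
from itertools import combinations

# Toy analog of the slide's example: 4 items instead of 100, 2 instead of 50.
db = [frozenset("abcd"), frozenset("ab")]
min_sup = 1

items = set().union(*db)
# Support of every nonempty itemset (feasible only for tiny item universes).
support = {
    frozenset(c): sum(frozenset(c) <= t for t in db)
    for k in range(1, len(items) + 1)
    for c in combinations(sorted(items), k)
}
frequent = {x: s for x, s in support.items() if s >= min_sup}

# Closed: frequent, and no proper superset has the same support.
closed = [x for x, s in frequent.items()
          if not any(x < y and frequent[y] == s for y in frequent)]
# Max-pattern: frequent, and no proper superset is frequent at all.
maximal = [x for x in frequent
           if not any(x < y for y in frequent)]

print(sorted(map(sorted, closed)))   # [['a', 'b'], ['a', 'b', 'c', 'd']]
print(sorted(map(sorted, maximal)))  # [['a', 'b', 'c', 'd']]
```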

Computational Complexity

- How many itemsets may potentially be generated in the worst case?
- The number of frequent itemsets to be generated is sensitive to the minsup threshold
- When minsup is low, there exist potentially an exponential number of frequent itemsets
- The worst case: M^N, where M is the number of distinct items and N is the maximal transaction length

Chapter 4: Mining Frequent Patterns, Associations and Correlations

- 4.1 Basic Concepts
- 4.2 Frequent Itemset Mining Methods

4.2.1 Apriori: A Candidate Generation-and-Test Approach

4.2.2 Improving the Efficiency of Apriori

4.2.3 FPGrowth: A Frequent Pattern-Growth Approach

4.2.4 ECLAT: Frequent Pattern Mining with Vertical Data Format

- 4.3 Which Patterns Are Interesting?
- Pattern Evaluation Methods
- 4.4 Summary

4.2.1 Apriori: Concepts and Principle

- The downward closure property of frequent patterns
- Any subset of a frequent itemset must be frequent
- If {beer, diaper, nuts} is frequent, so is {beer, diaper}
- i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested

4.2.1 Apriori: Method

- Initially, scan the DB once to find the frequent 1-itemsets
- Generate length (k+1) candidate itemsets from length k frequent itemsets
- Test the candidates against DB
- Terminate when no frequent or candidate set can be generated

Apriori Algorithm

Ck: candidate itemsets of size k; Lk: frequent itemsets of size k

    L1 = {frequent 1-itemsets};
    for (k = 1; Lk ≠ ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with support ≥ min_support;
    end
    return ∪k Lk;
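Below is a compact, runnable Python version of this loop. The function names (apriori, apriori_gen) and the simplified join are mine; the pruning step is exactly the Apriori principle from the previous slide:

```python
from itertools import combinations

def apriori_gen(Lk, k):
    """Join Lk with itself, then prune candidates that have an infrequent
    k-subset (the Apriori pruning principle). The union-based join may
    over-generate relative to the classical sorted-prefix self-join, but
    pruning removes the extras."""
    candidates = set()
    for a in Lk:
        for b in Lk:
            union = a | b
            if len(union) == k + 1:
                if all(frozenset(s) in Lk for s in combinations(union, k)):
                    candidates.add(union)
    return candidates

def apriori(transactions, min_sup):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # One scan to count items and form L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {x for x, c in counts.items() if c >= min_sup}
    result = {x: counts[x] for x in Lk}
    k = 1
    while Lk:                       # terminate when no frequent set remains
        Ck = apriori_gen(Lk, k)
        counts = {c: sum(c <= t for t in transactions) for c in Ck}
        Lk = {x for x, c in counts.items() if c >= min_sup}
        result.update({x: counts[x] for x in Lk})
        k += 1
    return result
```

Using the five hypothetical baskets from the earlier sketch, apriori(transactions, min_sup=3) yields exactly the frequent patterns listed there: Beer:3, Nuts:3, Diaper:4, Eggs:3, and {Beer, Diaper}:3.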

Candidate Generation

- How to generate candidates?
- Step 1: self-joining Lk
- Step 2: pruning
- Example of candidate generation:
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
- Pruning: acde is removed because ade is not in L3
- C4 = {abcd} (checked in the sketch below)
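The same example, checked with a self-contained sketch (single characters stand for items, so "abc" is the itemset {a, b, c}; variable names are mine):

```python
from itertools import combinations

L3 = [sorted(s) for s in ["abc", "abd", "acd", "ace", "bcd"]]
L3_set = {tuple(x) for x in L3}

# Self-join: merge two itemsets that share their first k-1 items.
joined = []
for a in L3:
    for b in L3:
        if a[:-1] == b[:-1] and a[-1] < b[-1]:
            joined.append(a + [b[-1]])

# Prune: drop candidates containing an infrequent 3-subset.
C4 = [c for c in joined
      if all(s in L3_set for s in combinations(c, 3))]

print(["".join(c) for c in joined])  # ['abcd', 'acde']
print(["".join(c) for c in C4])      # ['abcd'] -- acde dropped: ade not in L3
```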

Generating Association Rules

- Once the frequent itemsets have been found, it is straightforward to generate strong association rules that satisfy:
- minimum support
- minimum confidence
- Relation between support and confidence:

confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)

- support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B
- support_count(A) is the number of transactions containing the itemset A

Generating Association Rules

- For each frequent itemset L, generate all nonempty proper subsets of L
- For every nonempty proper subset S of L, output the rule:

S ⇒ (L - S)

if support_count(L) / support_count(S) ≥ min_conf (a sketch of this step follows below)
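A short sketch of this rule-generation step (names are mine). It assumes a dict mapping each frequent itemset to its support count, which is exactly what the apriori sketch above returns:

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """Yield (S, L-S, confidence) for every strong rule S => L-S.

    freq maps each frequent itemset (frozenset) to its support count.
    """
    for L, sup_L in freq.items():
        if len(L) < 2:
            continue
        for r in range(1, len(L)):
            for S in map(frozenset, combinations(L, r)):
                # freq[S] always exists: every nonempty subset of a frequent
                # itemset is itself frequent (downward closure).
                conf = sup_L / freq[S]
                if conf >= min_conf:
                    yield S, L - S, conf
```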

Example

- Suppose the frequent itemset L = {I1, I2, I5}
- Its nonempty proper subsets are: {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}
- The resulting association rules:

I1 ∧ I2 ⇒ I5, confidence = 2/4 = 50%

I1 ∧ I5 ⇒ I2, confidence = 2/2 = 100%

I2 ∧ I5 ⇒ I1, confidence = 2/2 = 100%

I1 ⇒ I2 ∧ I5, confidence = 2/6 = 33%

I2 ⇒ I1 ∧ I5, confidence = 2/7 = 29%

I5 ⇒ I1 ∧ I2, confidence = 2/2 = 100%

If the minimum confidence is 70%, only the three rules with 100% confidence are output as strong rules.

Question: what is the difference between association rules and decision tree rules?

[Table: transactional database]

4.2.2 Improving the Efficiency of Apriori

- Major computational challenges
- Huge number of candidates
- Multiple scans of transaction database
- Tedious workload of support counting for candidates
- Improving Apriori: general ideas
- Shrink number of candidates
- Reduce passes of transaction database scans
- Facilitate support counting of candidates

(A) DHP: Hash-based Technique

- Database of four transactions: 10: {A, C, D}; 20: {B, C, E}; 30: {A, B, C, E}; 40: {B, E}
- 1st scan: count the 1-itemsets to get C1 ({A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3) and, at the same time, hash every 2-itemset of each transaction into a bucket, incrementing that bucket's counter
- 2-itemsets hashed per transaction: 10: {A,C}, {A,D}, {C,D}; 20: {B,C}, {B,E}, {C,E}; 30: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}; 40: {B,E}
- Bucket counters (as on the slide): {A,B}: 1, {A,C}: 3, {A,E}: 3, {B,C}: 2, {B,E}: 3, {C,E}: 3; a counter can exceed a pair's true count because several pairs may hash to the same bucket
- With min-support = 2, each bucket's counter is reduced to one bit of a binary vector (here 1 0 1 0 1 0 1); a 2-itemset whose bucket bit is 0 cannot be frequent, so it is never added to C2
- Counting the surviving candidates in the next scan yields the frequent 2-itemsets {A,C}, {B,C}, {B,E}, {C,E}

J. Park, M.-S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95
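A minimal sketch of the bucket-filtering idea. The hash function and the bucket count (7) are assumptions; DHP leaves both implementation-defined:

```python
from itertools import combinations

transactions = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
NUM_BUCKETS = 7
min_sup = 2

def bucket(pair):
    # Assumed hash; the paper leaves the choice of hash function open.
    return hash(frozenset(pair)) % NUM_BUCKETS

# Pass 1: while counting 1-itemsets, also hash every 2-itemset per transaction.
counters = [0] * NUM_BUCKETS
for t in transactions:
    for pair in combinations(sorted(t), 2):
        counters[bucket(pair)] += 1

# One bit per bucket: 1 if its counter reaches min-support, else 0.
bitmap = [int(c >= min_sup) for c in counters]

def may_be_frequent(pair):
    """A pair can enter C2 only if its bucket passed the threshold."""
    return bitmap[bucket(pair)] == 1
```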

(B) Partition: Scan Database Only Twice

- Subdivide the transactions of D into k non-overlapping partitions D1, …, Dk
- If σ is the (relative) minimum support of D, then the minimum support count within partition Di is σi = σ × |Di|
- Any itemset that is potentially frequent in D must be frequent in at least one of the partitions Di
- Each partition fits into main memory, so it is read only once
- Steps:
- Scan 1: partition the database and find the local frequent patterns
- Scan 2: consolidate the global frequent patterns

D = D1 ∪ D2 ∪ … ∪ Dk, with local thresholds σ1 = σ × |D1|, σ2 = σ × |D2|, …, σk = σ × |Dk|

- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95
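A sketch of the two-scan scheme, reusing any in-memory miner such as the apriori sketch above (function and parameter names are mine):

```python
import math

def partition_mine(db, sigma, k, apriori):
    """Two-scan Partition algorithm; sigma is the relative min support."""
    n = len(db)
    size = math.ceil(n / k)
    parts = [db[i:i + size] for i in range(0, n, size)]

    # Scan 1: local frequent itemsets per partition (local count sigma*|Di|).
    # Every globally frequent itemset must be locally frequent in at least
    # one partition, so the union of local results is a superset of the answer.
    candidates = set()
    for Di in parts:
        candidates |= set(apriori(Di, max(1, math.ceil(sigma * len(Di)))))

    # Scan 2: one pass over the whole database to count global supports.
    counts = {c: sum(c <= frozenset(t) for t in db) for c in candidates}
    return {c: s for c, s in counts.items() if s >= sigma * n}
```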

(C) Sampling for Frequent Patterns

- Select a sample S of the original database D
- Mine frequent patterns within S using Apriori
- Use a support threshold lower than the minimum support to find the local frequent itemsets (this reduces the chance of missing globally frequent ones)
- Scan D once to verify the frequent itemsets found in the sample
- Only the borders of the closure of the frequent patterns are checked
- Example: check abcd instead of ab, ac, …, etc.
- Scan the database again to find missed frequent patterns

H. Toivonen. Sampling large databases for association rules. In VLDB’96
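A rough sketch of the sampling scheme, again assuming the apriori helper; sample_frac and lower_by are hypothetical tuning knobs, and a full implementation (Toivonen's) checks the border and rescans only when patterns may have been missed:

```python
import random

def sample_mine(db, min_sup_rel, apriori, sample_frac=0.1, lower_by=0.75):
    """Mine a sample with a lowered threshold, then verify on the full DB."""
    sample = random.sample(db, max(1, int(len(db) * sample_frac)))
    lowered = max(1, int(min_sup_rel * lower_by * len(sample)))
    local = apriori(sample, lowered)      # candidates found in the sample

    # One full scan to verify the candidates' true supports.
    counts = {c: sum(c <= frozenset(t) for t in db) for c in local}
    return {c: s for c, s in counts.items() if s >= min_sup_rel * len(db)}
    # A second scan may still be needed if the sample missed patterns.
```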

(D) DIC (Dynamic Itemset Counting): Reduce Number of Scans

- Once both A and D are determined frequent, the counting of AD begins
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

[Figure: itemset lattice {} → {A, B, C, D} → {AB, AC, AD, BC, BD, CD} → {ABC, ABD, ACD, BCD} → {ABCD}, annotated with the points during the transaction scan at which Apriori and DIC each begin counting 1-, 2-, and 3-itemsets]

- S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97

4.2.3 FP-Growth: Frequent Pattern-Growth

- Adopts a divide-and-conquer strategy
- Compresses the database, restricted to its frequent items, into a frequent-pattern tree (FP-tree)
- Retains the itemset association information
- Divides the compressed database into a set of conditional databases, each associated with one frequent item
- Mines each such database separately

Example: FP-Growth

- The first scan of data is the same as Apriori
- Derive the set of frequent 1-itemsets
- Let min-sup=2
- Generate the list of frequent items, ordered by descending support count

[Table: transactional database]

Construct the FP-Tree


- Create a branch for each transaction

- Items in each transaction are processed in order

1- Order the items T100: {I2,I1,I5}

2- Construct the first branch:

<I2:1>, <I1:1>,<I5:1>

    null
    └─ I2:1
       └─ I1:1
          └─ I5:1

Construct the FP-Tree


- Create a branch for each transaction

- Items in each transaction are processed in order

1- Order the items T200: {I2,I4}

2- Construct the second branch:

<I2:2>, <I4:1>

    null
    └─ I2:2
       ├─ I1:1
       │  └─ I5:1
       └─ I4:1

Construct the FP-Tree


- Create a branch for each transaction

- Items in each transaction are processed in order

1- Order the items T300: {I2,I3}

2- Construct the third branch:

<I2:3>, <I3:1>

    null
    └─ I2:3
       ├─ I1:1
       │  └─ I5:1
       ├─ I4:1
       └─ I3:1

Construct the FP-Tree


- Create a branch for each transaction

- Items in each transaction are processed in order

1- Order the items T400: {I2,I1,I4}

2- Construct the fourth branch:

<I2:4>, <I1:2>, <I4:1>

    null
    └─ I2:4
       ├─ I1:2
       │  ├─ I5:1
       │  └─ I4:1
       ├─ I4:1
       └─ I3:1

Construct the FP-Tree


- Create a branch for each transaction

- Items in each transaction are processed in order

1- Order the items T500: {I1,I3}

2- Construct the fifth branch:

<I1:1>, <I3:1>

    null
    ├─ I2:4
    │  ├─ I1:2
    │  │  ├─ I5:1
    │  │  └─ I4:1
    │  ├─ I4:1
    │  └─ I3:1
    └─ I1:1
       └─ I3:1

Construct the FP-Tree


After all nine transactions have been inserted, the complete FP-tree is:

    null
    ├─ I2:7
    │  ├─ I1:4
    │  │  ├─ I5:1
    │  │  ├─ I3:2
    │  │  │  └─ I5:1
    │  │  └─ I4:1
    │  ├─ I3:2
    │  └─ I4:1
    └─ I1:2
       └─ I3:2

When a branch of a transaction is added, the count for each node along its common prefix is incremented by 1.
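A runnable sketch of the construction. The Node class and function names are mine; the slides show T100-T500, and the remaining four transactions are inferred from the final node counts, so treat them as assumptions:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_sup):
    """Order each transaction by descending item support, insert into tree."""
    sup = Counter(i for t in transactions for i in t)
    freq = {i for i, c in sup.items() if c >= min_sup}
    root = Node(None, None)
    for t in transactions:
        # (-support, item) reproduces the slides' order I2, I1, I3, I4, I5.
        items = sorted((i for i in t if i in freq), key=lambda i: (-sup[i], i))
        node = root
        for i in items:
            child = node.children.setdefault(i, Node(i, node))
            child.count += 1           # increment along the common prefix
            node = child
    return root, sup

# The slides show T100-T500; the remaining four are inferred assumptions.
db = [{"I2","I1","I5"}, {"I2","I4"}, {"I2","I3"}, {"I2","I1","I4"},
      {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
      {"I1","I2","I3"}]
root, _ = build_fp_tree(db, min_sup=2)
# Sanity checks against the tree above:
# root.children["I2"].count == 7 and root.children["I1"].count == 2
```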

Construct the FP-Tree

[FP-tree as above]

The problem of mining frequent patterns in databases is transformed to that of mining the FP-tree

Mining the FP-Tree

[FP-tree as above]

- Occurrences of I5: <I2, I1, I5> and <I2, I1, I3, I5>
- Two prefix paths: <I2, I1: 1> and <I2, I1, I3: 1> (I5's conditional pattern base; see the sketch below)
- The conditional FP-tree contains only <I2: 2, I1: 2>; I3 is excluded because its support count of 1 is less than the minimum support count
- Frequent patterns generated: {I2,I5: 2}, {I1,I5: 2}, {I2,I1,I5: 2}
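A sketch of this step for I5, with the two prefix paths hard-coded. Because the resulting conditional FP-tree is a single path, every combination of its items (suffixed with I5) is frequent:

```python
from collections import Counter
from itertools import combinations

min_sup = 2
# I5's conditional pattern base: (prefix path, count) pairs from the tree.
cond_base = [(("I2", "I1"), 1), (("I2", "I1", "I3"), 1)]

# Sum item counts across the base; keep items meeting min_sup (I3 drops out).
item_counts = Counter()
for path, cnt in cond_base:
    for item in path:
        item_counts[item] += cnt
kept = {i for i, c in item_counts.items() if c >= min_sup}  # {'I2', 'I1'}

# Single-path conditional tree: every combination of kept items, suffixed
# with I5, is a frequent pattern.
patterns = {}
for r in range(1, len(kept) + 1):
    for combo in combinations(sorted(kept), r):
        sup = sum(cnt for path, cnt in cond_base if set(combo) <= set(path))
        patterns[combo + ("I5",)] = sup
print(patterns)  # {('I1','I5'): 2, ('I2','I5'): 2, ('I1','I2','I5'): 2}
```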

Construct the FP-Tree

- Each frequent item has an entry in a header table, with node-links pointing to that item's occurrences in the tree
- Header table (Item ID: support count): I2: 7, I1: 6, I3: 6, I4: 2, I5: 2

[FP-tree as above]


FP-growth properties

- FP-growth transforms the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix
- It uses the least frequent items as suffixes, offering good selectivity
- It reduces the search cost
- If the tree does not fit into main memory, partition the database
- Efficient and scalable for mining both long and short frequent patterns
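Finally, a self-contained simplified FP-growth sketch that operates directly on conditional pattern bases rather than walking an FP-tree via header-table node-links (the function name and the prefix-path representation are mine):

```python
from collections import Counter

def fp_growth(pattern_base, min_sup, suffix=()):
    """Mine frequent patterns from a conditional pattern base.

    pattern_base is a list of (path, count) pairs; the initial call uses the
    raw transactions with count 1, items listed in descending global support
    order (as in FP-tree construction) so prefixes are well defined.
    """
    counts = Counter()
    for path, cnt in pattern_base:
        for item in set(path):
            counts[item] += cnt
    results = {}
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        new_pattern = (item,) + suffix
        results[new_pattern] = sup
        # Conditional pattern base for `item`: the prefix of each path that
        # contains it; recursion concatenates item onto the growing suffix.
        cond = [(path[:path.index(item)], cnt)
                for path, cnt in pattern_base if item in path]
        results.update(fp_growth(cond, min_sup, new_pattern))
    return results

# The (partly inferred) nine-transaction example, pre-sorted per transaction.
db = [("I2","I1","I5"), ("I2","I4"), ("I2","I3"), ("I2","I1","I4"),
      ("I1","I3"), ("I2","I3"), ("I1","I3"), ("I2","I1","I3","I5"),
      ("I2","I1","I3")]
print(fp_growth([(t, 1) for t in db], min_sup=2))
# 13 frequent patterns in total, including ('I2', 'I1', 'I5'): 2
```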
