
Chapter 4: Mining Frequent Patterns, Associations and Correlations

  • 4.1 Basic Concepts
  • 4.2 Frequent Itemset Mining Methods
  • 4.3 Which Patterns Are Interesting?
    • Pattern Evaluation Methods
  • 4.4 Summary
Frequent Pattern Analysis
  • Frequent Pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
  • Goal: finding inherent regularities in data
    • What products were often purchased together?— Beer and diapers?!
    • What are the subsequent purchases after buying a PC?
    • What kinds of DNA are sensitive to this new drug?
    • Can we automatically classify Web documents?
  • Applications:
    • Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why is Frequent Pattern Mining Important?
  • An important property of datasets
  • Foundation for many essential data mining tasks
    • Association, correlation, and causality analysis
    • Sequential, structural (e.g., sub-graph) patterns
    • Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
    • Classification: discriminative, frequent pattern analysis
    • Clustering analysis: frequent pattern-based clustering
    • Data warehousing: iceberg cube and cube-gradient
    • Semantic data compression
    • Broad applications
Frequent Patterns
  • Itemset: a set of one or more items
  • k-itemset: an itemset X = {x1, …, xk} containing k items
  • (Absolute) support, or support count, of X: the number of transactions (frequency of occurrence) that contain the itemset X
  • (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
  • An itemset X is frequent if X’s support is no less than a minsup threshold (see the sketch below)
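To make the definitions above concrete, here is a minimal Python sketch. The five baskets, the item names, and the minsup value are illustrative assumptions, not data from the slides.

```python
# Illustrative transaction database (assumed; not from the slides)
transactions = [
    {"beer", "nuts", "diaper"},
    {"beer", "coffee", "diaper"},
    {"beer", "diaper", "eggs"},
    {"nuts", "eggs", "milk"},
    {"nuts", "coffee", "diaper", "eggs", "milk"},
]

def absolute_support(itemset, transactions):
    """Support count: number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def relative_support(itemset, transactions):
    """Fraction of transactions containing the itemset."""
    return absolute_support(itemset, transactions) / len(transactions)

min_sup = 0.5                                         # assumed relative minsup threshold
X = {"beer", "diaper"}
print(absolute_support(X, transactions))              # 3
print(relative_support(X, transactions))              # 0.6
print(relative_support(X, transactions) >= min_sup)   # True: X is frequent
```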

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
Association Rules
  • Find all the rules X ⇒ Y that satisfy the minimum support and minimum confidence thresholds
    • support, s: the probability that a transaction contains X ∪ Y
    • confidence, c: the conditional probability that a transaction containing X also contains Y

Let minsup = 50%, minconf = 50%

Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3

  • Association rules: (many more!)
    • Beer ⇒ Diaper (support 60%, confidence 100%)
    • Diaper ⇒ Beer (support 60%, confidence 75%)
  • Rules that satisfy both minsup and minconf are called strong rules
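The transaction table behind these numbers is not reproduced in the transcript; the five baskets below are an assumption chosen so that the stated counts (Beer: 3, Nuts: 3, Diaper: 4, Eggs: 3, {Beer, Diaper}: 3 over 5 transactions) and the two rules above come out as on the slide.

```python
# Assumed 5-transaction database reproducing the counts above
db = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def rule_metrics(X, Y, db):
    """Support and confidence of the rule X => Y."""
    both = sum(1 for t in db if X | Y <= t)    # transactions containing X union Y
    only_x = sum(1 for t in db if X <= t)      # transactions containing X
    return both / len(db), both / only_x

for X, Y in [({"Beer"}, {"Diaper"}), ({"Diaper"}, {"Beer"})]:
    s, c = rule_metrics(X, Y, db)
    print(X, "=>", Y, f"support {s:.0%}, confidence {c:.0%}")
# Beer => Diaper: support 60%, confidence 100%
# Diaper => Beer: support 60%, confidence 75%
```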


Closed Patterns and Max-Patterns
  • A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 - 1 ≈ 1.27×10^30 sub-patterns!
  • Solution: Mine closed patterns and max-patterns instead
  • An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
  • An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
  • Closed pattern is a lossless compression of freq. patterns
    • Reducing the number of patterns and rules
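A small sketch of the two definitions over an assumed toy set of frequent itemsets and their support counts (the data is made up for illustration).

```python
# Assumed toy input: every frequent itemset mapped to its support count
support = {
    frozenset("a"): 3, frozenset("b"): 3, frozenset("ab"): 3,
    frozenset("c"): 2, frozenset("ac"): 2,
}

def is_closed(X, support):
    """Closed: frequent, and no proper superset has the same support."""
    return not any(X < Y and support[Y] == support[X] for Y in support)

def is_max(X, support):
    """Maximal: frequent, and no proper superset is frequent at all."""
    return not any(X < Y for Y in support)

for X in sorted(support, key=len):
    tags = [t for t, ok in [("closed", is_closed(X, support)),
                            ("max", is_max(X, support))] if ok]
    print(set(X), support[X], tags)
# {a} and {b} are absorbed by {a,b}; the closed (and maximal) sets are {a,b} and {a,c}
```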
Closed Patterns and Max-Patterns

Example

  • DB = {<a1, …, a100>, <a1, …, a50>}
  • Min_sup = 1
  • What is the set of closed itemsets?
    • <a1, …, a100>: 1
    • <a1, …, a50>: 2
  • What is the set of max-patterns?
    • <a1, …, a100>: 1
  • What is the set of all frequent patterns?
    • All nonempty subsets of <a1, …, a100>: 2^100 - 1 ≈ 1.27×10^30 of them!
Computational Complexity
  • How many itemsets may potentially be generated in the worst case?
    • The number of frequent itemsets to be generated is sensitive to the minsup threshold
    • When minsup is low, there can be an exponential number of frequent itemsets
    • The worst case: O(M^N), where M is the number of distinct items and N is the maximum transaction length

Chapter 4: Mining Frequent Patterns, Associations and Correlations

  • 4.1 Basic Concepts
  • 4.2 Frequent Itemset Mining Methods

4.2.1 Apriori: A Candidate Generation-and-Test Approach

4.2.2 Improving the Efficiency of Apriori

4.2.3 FPGrowth: A Frequent Pattern-Growth Approach

4.2.4 ECLAT: Frequent Pattern Mining with Vertical Data Format

  • 4.3 Which Patterns Are Interesting?
    • Pattern Evaluation Methods
  • 4.4 Summary
4.2.1 Apriori: Concepts and Principle
  • The downward closure property of frequent patterns
    • Any subset of a frequent itemset must be frequent
    • If {beer, diaper, nuts} is frequent, so is {beer, diaper}
    • i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
  • Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested
4.2.1 Apriori: Method
  • Initially, scan DB once to get frequent 1-itemset
  • Generate length (k+1) candidate itemsets from length k frequent itemsets
  • Test the candidates against DB
  • Terminate when no frequent or candidate set can be generated
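A minimal Python sketch of this level-wise loop: generate candidates from Lk, test them against the database, and stop when nothing is left. The toy database, the min_sup value, and the simple union-based join are assumptions for illustration; a production implementation would use the prefix-based join and hash trees.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise generate-and-test, as outlined above (an unoptimized sketch)."""
    items = {i for t in transactions for i in t}
    # Scan 1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_sup}
    frequent, k = set(Lk), 1
    while Lk:
        # Generate (k+1)-candidates by joining frequent k-itemsets ...
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # ... and prune those with an infrequent k-subset (downward closure)
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Scan the database to count the surviving candidates
        Lk = {c for c in Ck1
              if sum(c <= t for t in transactions) >= min_sup}
        frequent |= Lk
        k += 1
    return frequent

db = [{"beer", "nuts", "diaper"}, {"beer", "diaper"},
      {"nuts", "eggs"}, {"beer", "diaper", "eggs"}]
print(apriori(db, min_sup=2))   # all singletons plus {'beer', 'diaper'}
```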
Apriori: Example

min_sup = 2

[Figure: Apriori run on the example database. 1st scan: count C1 and keep L1; self-join L1 into C2; 2nd scan: count C2 and keep L2; generate C3 from L2; 3rd scan: count C3 and keep L3.]

Apriori Algorithm

Ck: candidate itemsets of size k

Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

Candidate Generation
  • How to generate candidates?
    • Step 1: self-joining Lk
    • Step 2: pruning
  • Example of Candidate Generation
    • L3={abc, abd, acd, ace, bcd}
    • Self-joining: L3*L3: abcd from abc and abd, acde from acd and ace
    • Pruning: acde is removed because ade is not in L3
    • C4 = {abcd}
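The two steps can be sketched directly on the L3 of this example. The union-based join below is a simplification of the prefix-based self-join (it may generate a few extra unions, e.g. abce, but the prune step removes them), so the surviving candidate matches the C4 above.

```python
from itertools import combinations

def gen_candidates(Lk, k):
    """Step 1: join frequent k-itemsets; Step 2: prune candidates that
    contain an infrequent k-subset."""
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L3 = {frozenset(x) for x in ("abc", "abd", "acd", "ace", "bcd")}
print(gen_candidates(L3, 3))
# {frozenset({'a', 'b', 'c', 'd'})} -- acde is pruned because ade is not in L3
```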
Generating Association Rules
  • Once the frequent itemsets have been found, it is straightforward to generate strong association rules that satisfy:
    • minimum Support
    • minimum confidence
  • Relation between support and confidence:

confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)

    • support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B
    • support_count(A) is the number of transactions containing the itemset A.
Generating Association Rules
  • For each frequent itemset L, generate all nonempty proper subsets of L
  • For every nonempty subset S of L, output the rule:

S ⇒ (L - S)

if support_count(L) / support_count(S) ≥ min_conf (see the sketch below)
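A sketch of this step in Python. The support counts passed in are the ones used in the worked example on the next slide; the function name and the data layout are assumptions.

```python
from itertools import combinations

def rules_from_itemset(L, support_count, min_conf):
    """Emit S => (L - S) for every nonempty proper subset S of the frequent
    itemset L whose confidence meets min_conf."""
    rules = []
    for r in range(1, len(L)):
        for S in map(frozenset, combinations(L, r)):
            conf = support_count[L] / support_count[S]
            if conf >= min_conf:
                rules.append((set(S), set(L - S), conf))
    return rules

# Support counts matching the worked example that follows
sc = {frozenset(k): v for k, v in [
    (("I1", "I2", "I5"), 2), (("I1", "I2"), 4), (("I1", "I5"), 2),
    (("I2", "I5"), 2), (("I1",), 6), (("I2",), 7), (("I5",), 2)]}
for lhs, rhs, conf in rules_from_itemset(frozenset({"I1", "I2", "I5"}), sc, 0.7):
    print(lhs, "=>", rhs, f"{conf:.0%}")
# Only the three rules with 100% confidence survive min_conf = 70%
```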

Example
  • Suppose the frequent itemset L = {I1, I2, I5}
  • The nonempty proper subsets of L are: {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}
  • The candidate association rules and their confidences are:

I1 ∧ I2 ⇒ I5, confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2, confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5, confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5, confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2, confidence = 2/2 = 100%

If the minimum confidence is 70%, only the three rules with 100% confidence are output as strong rules.

Question: what is the difference between association rules and decision tree rules?

[Table: the transactional database used for these counts is not reproduced in the transcript]

4.2.2 Improving the Efficiency of Apriori
  • Major computational challenges
    • Huge number of candidates
    • Multiple scans of transaction database
    • Tedious workload of support counting for candidates
  • Improving Apriori: general ideas
    • Shrink number of candidates
    • Reduce passes of transaction database scans
    • Facilitate support counting of candidates
(A) DHP: Hash-based Technique

1st scan of the database produces the candidate 1-itemsets C1 with their support counts:

Itemset : sup
{A} : 2
{B} : 3
{C} : 3
{D} : 1
{E} : 3

Making a hash table: while counting C1, every 2-itemset of each transaction is hashed into a bucket and the bucket counter is incremented:

T10: {A,C}, {A,D}, {C,D}
T20: {B,C}, {B,E}, {C,E}
T30: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
T40: {B,E}

Bucket counters: {A,B}: 1, {A,C}: 3, {A,E}: 3, {B,C}: 2, {B,E}: 3, {C,E}: 3

With min-support = 2, the buckets give the binary vector 1 0 1 0 1 0 1; a 2-itemset is kept as a candidate only if it hashes to a frequent bucket. The surviving 2-itemsets are: {A,C}, {B,C}, {B,E}, {C,E}

J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD’95
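A sketch of the bucket-counting idea on the four transactions above. The hash function (Python's built-in, which is randomized across runs), the number of buckets, and therefore the exact bucket contents are assumptions, so the bit vector will not match the one on the slide, but the mechanism is the same: a 2-itemset can only become a candidate if its bucket count reaches min-support.

```python
from itertools import combinations

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
NUM_BUCKETS = 7          # assumed; DHP chooses this to fit available memory
min_sup = 2

# While counting 1-itemsets, hash every 2-itemset of every transaction
buckets = [0] * NUM_BUCKETS
for t in db:
    for pair in combinations(sorted(t), 2):
        buckets[hash(pair) % NUM_BUCKETS] += 1

bit_vector = [int(c >= min_sup) for c in buckets]
candidate_pairs = {pair for t in db for pair in combinations(sorted(t), 2)
                   if bit_vector[hash(pair) % NUM_BUCKETS]}
print(bit_vector)
print(candidate_pairs)   # only pairs whose bucket count reaches min_sup survive
```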

(B) Partition: Scan Database Only Twice
  • Subdivide the transactions of D into k non-overlapping partitions
  • If σ is the minimum (relative) support threshold of D, the minimum support count for partition Di is σi = σ × |Di|
  • Any itemset that is potentially frequent in D must be frequent in at least one of the partitions Di
  • Each partition can fit into main memory, so it is read only once
  • Steps:
    • Scan 1: partition the database and find the local frequent patterns
    • Scan 2: consolidate the global frequent patterns

D1 + D2 + … + Dk = D, with local thresholds σ1 = σ|D1|, σ2 = σ|D2|, …, σk = σ|Dk|

  • A. Savasere, E. Omiecinski and S. Navathe, VLDB’95
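A sketch of how the per-partition threshold above might be computed; the rounding with ceil and the toy partitions are assumptions.

```python
import math

def local_min_sup(sigma, partition):
    """Minimum support count inside one partition Di: sigma * |Di|."""
    return math.ceil(sigma * len(partition))

sigma = 0.5                                   # assumed relative threshold for D
D1 = [{"beer", "diaper"}, {"beer", "nuts"}, {"diaper"}]
D2 = [{"beer", "diaper"}, {"nuts"}]
for i, Di in enumerate((D1, D2), start=1):
    print(f"D{i}: |Di| = {len(Di)}, local min_sup count = {local_min_sup(sigma, Di)}")
```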
(C) Sampling for Frequent Patterns
  • Select a sample of the original database
  • Mine frequent patterns within the sample using Apriori
  • Use a lower support threshold than the minimum support to find local frequent itemsets
  • Scan the database once to verify the frequent itemsets found in the sample
    • Only the borders of the frequent patterns need to be checked
    • Example: check abcd instead of ab, ac,…, etc.
  • Scan the database again to find missed frequent patterns

H. Toivonen. Sampling large databases for association rules. In VLDB’96
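A hedged sketch of the sampling idea; `mine` stands for any frequent-itemset miner taking (transactions, relative threshold), and the sample size and slack factor are assumptions.

```python
import random

def sample_and_verify(transactions, mine, rel_min_sup, sample_size, slack=0.8):
    """Mine a random sample with a lowered threshold, then verify the
    resulting candidates against the whole database in a single scan."""
    sample = random.sample(transactions, sample_size)
    candidates = mine(sample, rel_min_sup * slack)   # lowered threshold on the sample
    n = len(transactions)
    return {X for X in candidates
            if sum(X <= t for t in transactions) / n >= rel_min_sup}
```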

(D) DIC (Dynamic Itemset Counting): Reduce the Number of Scans
  • Once both A and D are determined frequent, the counting of AD begins
  • Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

[Figure: the itemset lattice over {A, B, C, D}, from {} through the 1-itemsets A, B, C, D, the 2-itemsets AB, AC, AD, BC, BD, CD, and the 3-itemsets ABC, ABD, ACD, BCD, up to ABCD, comparing when Apriori and DIC start counting 1-, 2- and 3-itemsets as the transactions are scanned]
  • S. Brin R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97
4.2.3 FP-Growth: Frequent Pattern-Growth
  • Adopts a divide-and-conquer strategy
  • Compress the database, restricted to its frequent items, into a frequent-pattern tree (FP-tree)
    • Retains the itemset association information
  • Divide the compressed database into a set of conditional databases, each associated with one frequent item
  • Mine each such database separately (a small sketch follows)
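A compact sketch of the FP-tree structure and the two database scans described in the construction slides that follow. Only the five transactions spelled out there are used, so, unlike the full 9-transaction example on the slides, I5 falls below min_sup = 2 here and is dropped; the class and function names are assumptions.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, its count, and child links."""
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def build_fptree(transactions, min_sup):
    # Scan 1: support count of every single item
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    order = {i: c for i, c in counts.items() if c >= min_sup}

    # Scan 2: insert each transaction with its frequent items sorted by
    # descending support; shared prefixes only get their counts incremented
    root = Node()
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=lambda i: -order[i]):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
      {"I2", "I1", "I4"}, {"I1", "I3"}]
tree = build_fptree(db, min_sup=2)
print({item: child.count for item, child in tree.children.items()})  # {'I2': 4, 'I1': 1}
```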
Example: FP-Growth
  • The first scan of data is the same as Apriori
  • Derive the set of frequent 1-itemsets
  • Let min-sup=2
  • Sort the frequent items in descending order of support count

Transactional Database

Construct the FP-Tree

Transactional Database

- Create a branch for each transaction

- Items in each transaction are processed in order

1- Order the items T100: {I2,I1,I5}

2- Construct the first branch:

<I2:1>, <I1:1>,<I5:1>

Tree after T100:  null → I2:1 → I1:1 → I5:1

Construct the FP-Tree

Transactional Database

- Create a branch for each transaction

- Items in each transaction are processed in order

1- Order the items T200: {I2,I4}

2- Construct the second branch:

<I2:1>, <I4:1>

Tree after T200:  null → I2:2 → { I1:1 → I5:1 ; I4:1 }

Construct the FP-Tree

Transactional Database

- Create a branch for each transaction

- Items in each transaction are processed in order

1- Order the items T300: {I2,I3}

2- Construct the third branch:

<I2:2>, <I3:1>

Tree after T300:  null → I2:3 → { I1:1 → I5:1 ; I4:1 ; I3:1 }

Construct the FP-Tree

Transactional Database

- Create a branch for each transaction

- Items in each transaction are processed in order

1- Order the items T400: {I2,I1,I4}

2- Construct the fourth branch:

<I2:3>, <I1:1>,<I4:1>

Tree after T400:  null → I2:4 → { I1:2 → { I5:1 ; I4:1 } ; I4:1 ; I3:1 }

Construct the FP-Tree

Transactional Database

- Create a branch for each transaction

- Items in each transaction are processed in order

1- Order the items T500: {I1,I3}

2- Construct the fifth branch:

<I1:1>, <I3:1>

Tree after T500:  null → { I2:4 → { I1:2 → { I5:1 ; I4:1 } ; I4:1 ; I3:1 } ; I1:1 → I3:1 }

Construct the FP-Tree

Transactional Database

FP-tree after all the transactions of the database have been inserted (the counts reflect the full transactional database):

null → { I2:7 → { I1:4 → { I5:1 ; I4:1 ; I3:2 → I5:1 } ; I4:1 ; I3:2 } ; I1:2 → I3:2 }

When the branch for a transaction is added, the count of each node along its common prefix is incremented by 1.

Construct the FP-Tree

(The completed FP-tree shown above)

The problem of mining frequent patterns in databases is transformed to that of mining the FP-tree

Construct the FP-Tree

(The completed FP-tree shown above)

- Occurrences of I5: <I2, I1, I5> and <I2, I1, I3, I5>

- Two prefix paths: <I2, I1: 1> and <I2, I1, I3: 1>

- The conditional FP-tree contains only <I2: 2, I1: 2>; I3 is not included because its support count of 1 is less than the minimum support count.

- Frequent patterns: {I2,I5: 2}, {I1,I5: 2}, {I2,I1,I5: 2}
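The last three bullets can be reproduced with a short sketch. The two prefix paths are copied from the slide; because the conditional FP-tree <I2: 2, I1: 2> is a single path, all combinations of its items can simply be enumerated (variable names are assumptions).

```python
from collections import Counter
from itertools import combinations

# I5's conditional pattern base, read off the two prefix paths above
cond_base = [(("I2", "I1"), 1), (("I2", "I1", "I3"), 1)]
min_sup = 2

# Count items within the conditional pattern base; I3 (count 1) is dropped
counts = Counter()
for path, cnt in cond_base:
    for item in path:
        counts[item] += cnt
keep = sorted(i for i, c in counts.items() if c >= min_sup)     # ['I1', 'I2']

# Single-path conditional FP-tree: every combination, suffixed by I5, is frequent
patterns = {}
for r in range(1, len(keep) + 1):
    for combo in combinations(keep, r):
        sup = sum(cnt for path, cnt in cond_base if set(combo) <= set(path))
        patterns[combo + ("I5",)] = sup
print(patterns)   # {('I1','I5'): 2, ('I2','I5'): 2, ('I1','I2','I5'): 2}
```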

Construct the FP-Tree

(The completed FP-tree shown above, together with its header table)

Item ID : Support count
I2 : 7
I1 : 6
I3 : 6
I4 : 2
I5 : 2

FP-Growth Properties
  • FP-growth transforms the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix
  • It uses the least frequent items as the suffix, offering good selectivity
  • It reduces the search cost
  • If the tree does not fit into main memory, partition the database
  • Efficient and scalable for mining both long and short frequent patterns