
The FP-Growth/Apriori Debate


Jeffrey R. Ellis
CSE 300 – 01
April 11, 2002


Presentation Overview

- FP-Growth Algorithm
- Refresher & Example
- Motivation

- FP-Growth Complexity
- Vs. Apriori Complexity
- Saving calculation or hiding work?

- Real World Application
- Datasets are not created equal
- Results of real-world implementation

FP-Growth Algorithm

- Association Rule Mining consists of two steps:
- 1. Generate Frequent Itemsets
- Apriori generates candidate sets
- FP-Growth uses specialized data structures (no candidate sets)

- 2. Find Association Rules
- Outside the scope of both FP-Growth & Apriori

- Both algorithms perform step 1; therefore, FP-Growth is a direct competitor to Apriori

FP-Growth Example

| TID | Items bought | (Ordered) frequent items |
|-----|--------------|--------------------------|
| 100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p} |
| 200 | {a, b, c, f, l, m, o} | {f, c, a, b, m} |
| 300 | {b, f, h, j, o} | {f, b} |
| 400 | {b, c, k, s, p} | {c, b, p} |
| 500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p} |

Item frequency header: f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree:

{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1

Conditional pattern bases

| item | conditional pattern base |
|------|--------------------------|
| c | f:3 |
| a | fc:3 |
| b | fca:1, f:1, c:1 |
| m | fca:2, fcab:1 |
| p | fcam:2, cb:1 |
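The conditional-pattern-base table above can be reproduced directly from the frequency-ordered transactions: for each item, collect the prefix that precedes it on every path it occurs on. A minimal sketch (the function name `cond_pattern_base` and the list-of-tuples representation are ours, not from the slides):

```python
def cond_pattern_base(ordered_db, item):
    """Collect the prefix paths preceding `item` in each
    frequency-ordered transaction, with their occurrence counts."""
    base = {}
    for t in ordered_db:
        if item in t:
            prefix = t[:t.index(item)]   # items above `item` on its tree path
            if prefix:
                base[prefix] = base.get(prefix, 0) + 1
    return base

# The five example transactions, reduced to frequent items in f, c, a, b, m, p order:
ordered_db = [
    ('f', 'c', 'a', 'm', 'p'),
    ('f', 'c', 'a', 'b', 'm'),
    ('f', 'b'),
    ('c', 'b', 'p'),
    ('f', 'c', 'a', 'm', 'p'),
]

print(cond_pattern_base(ordered_db, 'm'))  # fca:2, fcab:1 – matches the table row for m
print(cond_pattern_base(ordered_db, 'p'))  # fcam:2, cb:1 – matches the table row for p
```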

Conditional pattern bases & conditional FP-trees

| item | conditional pattern base | conditional FP-tree |
|------|--------------------------|---------------------|
| p | {(fcam:2), (cb:1)} | {(c:3)}\|p |
| m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}\|m |
| b | {(fca:1), (f:1), (c:1)} | Empty |
| a | {(fc:3)} | {(f:3, c:3)}\|a |
| c | {(f:3)} | {(f:3)}\|c |
| f | Empty | Empty |

FP-Growth Example – Tree Construction

Input: DB, min_support

Output: FP-Tree

Scan DB & count all frequent items.

Create null root & set as current node.

For each transaction T:

- Sort T's frequent items in descending frequency order.
- For each sorted item I:
- Insert I into the tree as a child of the current node, incrementing the count if that child already exists.
- Connect each new tree node to its item's header list.

- Two passes through the DB.
- Tree-creation work is proportional to the total number of items in the DB; the complexity of CreateTree is O(|DB|).
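The construction steps above can be sketched as follows. This is our illustration, not the authors' code: the names `Node` and `build_fptree` are ours, and ties in frequency are broken alphabetically here (so c sorts before f, whereas the slides' tree puts f first; either global order is valid as long as it is applied consistently).

```python
from collections import defaultdict

class Node:
    """FP-tree node: item label, support count, parent link, children by item."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fptree(db, min_support):
    # Pass 1: count supports and keep only frequent items.
    support = defaultdict(int)
    for t in db:
        for item in set(t):
            support[item] += 1
    support = {i: c for i, c in support.items() if c >= min_support}

    # Pass 2: insert each transaction with items in descending-support order.
    root, header = Node(None, None), defaultdict(list)
    for t in db:
        items = sorted((i for i in set(t) if i in support),
                       key=lambda i: (-support[i], i))  # ties broken alphabetically
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])  # node-link (header) list
            node = node.children[item]
            node.count += 1
    return root, header, support

# The five transactions from the example slide, min_support = 3.
db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
root, header, support = build_fptree(db, min_support=3)
print(support)  # f:4, c:4, a:3, b:3, m:3, p:3 – matches the item frequency header
```

Note the two passes over the DB, mirroring the pseudocode: one to count supports, one to insert sorted transactions.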

FP-Growth Algorithm

(This slide repeats the example FP-tree, item-frequency header, and transaction table shown above.)

Input: FP-Tree, f_is (frequent itemset)

Output: all freq_patterns

if FP-Tree contains a single path P
then for each combination of nodes in P
    generate pattern f_is ∪ combination
else for each item i in header
    generate pattern f_is ∪ {i}
    construct the pattern's conditional FP-Tree (cFP-Tree)
    if cFP-Tree ≠ empty
    then call FP-Growth(cFP-Tree, pattern)

- The recursive algorithm creates FP-Tree structures and calls FP-Growth on them.
- Claims no candidate generation.
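The recursion above can be sketched without an explicit tree by operating on conditional pattern bases directly, represented as (frequency-ordered item tuple, count) pairs. The function `fpgrowth` and this list-based stand-in for the cFP-Tree are our simplification, not the authors' implementation:

```python
from collections import defaultdict

def fpgrowth(pattern_base, min_support, suffix=frozenset()):
    """Mine all frequent itemsets from a conditional pattern base, given as
    (frequency-ordered item tuple, count) pairs. Returns {itemset: support}."""
    # Count each item's support within this conditional base.
    counts = defaultdict(int)
    for items, n in pattern_base:
        for i in items:
            counts[i] += n
    results = {}
    for i, c in counts.items():
        if c < min_support:
            continue
        pattern = suffix | {i}          # generate pattern f_is ∪ {i}
        results[pattern] = c
        # i's conditional pattern base: the prefix preceding i on each path.
        cond = [(items[:items.index(i)], n)
                for items, n in pattern_base if i in items and items.index(i) > 0]
        results.update(fpgrowth(cond, min_support, pattern))  # recursive call
    return results

# The example transactions, reduced to frequent items in f, c, a, b, m, p order.
ordered_db = [('f', 'c', 'a', 'm', 'p'), ('f', 'c', 'a', 'b', 'm'),
              ('f', 'b'), ('c', 'b', 'p'), ('f', 'c', 'a', 'm', 'p')]
patterns = fpgrowth([(t, 1) for t in ordered_db], min_support=3)
```

On this dataset the sketch yields 18 frequent itemsets, including {f, c, a, m} with support 3; note that while no candidate sets are generated, a new conditional base (cFP-Tree stand-in) is built at every recursive step.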


Two-Part Algorithm – Part 1

if FP-Tree contains a single path P
then for each combination of nodes in P
    generate pattern f_is ∪ combination

- e.g., { A B C D }
- p = |pattern| = 4
- Patterns generated: AD, BD, CD, ABD, ACD, BCD, ABCD (each includes the final item D)
- Count: Σ (n = 1 to p−1) C(p−1, n) = 2^(p−1) − 1 = 7
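The count above can be checked by brute force. The enumeration below is ours: it fixes the final path item D and pairs it with every non-empty subset of the earlier nodes, assuming the whole single path meets min_support:

```python
from itertools import combinations
from math import comb

path = ['A', 'B', 'C', 'D']   # a single-path FP-tree, D the final item
p = len(path)

# Every pattern pairs D with a non-empty subset of the remaining nodes.
patterns = [c + ('D',) for n in range(1, p) for c in combinations(path[:-1], n)]

# Matches the slide's sum: C(3,1) + C(3,2) + C(3,3) = 2^3 - 1 = 7.
assert len(patterns) == sum(comb(p - 1, n) for n in range(1, p)) == 2**(p - 1) - 1
print(patterns)  # AD, BD, CD, ABD, ACD, BCD, ABCD
```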


Two-Part Algorithm – Part 2

else for each item i in header
    generate pattern f_is ∪ {i}
    construct the pattern's conditional cFP-Tree
    if cFP-Tree ≠ null
    then call FP-Growth(cFP-Tree, pattern)

- e.g., { A B C D E } with f_is = D:
- i = A: p_base = (ABCD), (ACD), (AD)
- i = B: p_base = (ABCD)
- i = C: p_base = (ABCD), (ACD)
- i = E: p_base = null

- Pattern bases are generated, for each header item, by following that item's node-links from the header to every node for it in the tree, then walking each path up to the tree root.

FP-Growth Complexity

- Each path in the tree is at least partially traversed once for each header item occurring on it – up to (depth of the tree path) × (number of items in the header) node visits per path.
- The complexity of searching through all paths is therefore bounded by O(header_count² × depth of tree).
- Creation of a new cFP-Tree also occurs at each recursive step.

Algorithm Results (in English)

- Candidate-generation sets are exchanged for conditional FP-Trees.
- You MUST account for all paths that contain an itemset with the test item.
- You CANNOT determine, before a conditional FP-Tree is created, whether new frequent itemsets will occur.
- Trivial examples hide these costs, leading to a belief that FP-Growth operates more efficiently.

FP-Growth Mining Example

Transactions: A (49), B (44), F (39), G (34), H (30), I (20), plus ABGM, BGIM, FGIM, GHIM

Header: A, B, F, G, H, I, M

Resulting FP-tree:

{}
├── A:50 ── B:1 ── G:1 ── M:1
├── B:45 ── G:1 ── I:1 ── M:1
├── F:40 ── G:1 ── I:1 ── M:1
├── G:35 ── H:1 ── I:1 ── M:1
├── H:30
└── I:20
FP-Growth Mining Example (freq_itemset = { M })

- i = A: pattern = { A M }, support = 1, cFP-Tree = null

FP-Growth Mining Example (freq_itemset = { M })

- i = B: pattern = { B M }, support = 2 → construct cFP-Tree (contains B:2, M:2)

FP-Growth Mining Example (freq_itemset = { B M })

- cFP-Tree header: G, B, M
- Patterns mined: BM, GM, BGM
- All patterns so far: BM, GM, BGM

FP-Growth Mining Example (freq_itemset = { M })

- i = F: pattern = { F M }, support = 1, cFP-Tree = null

FP-Growth Mining Example (freq_itemset = { M })

- i = G: pattern = { G M }, support = 4 → construct cFP-Tree

FP-Growth Mining Example (freq_itemset = { G M })

- cFP-Tree header: I, B, G, M
- i = I: pattern = { I G M }, support = 3 → construct cFP-Tree (contains G:3, M:3)

FP-Growth Mining Example (freq_itemset = { I G M })

- cFP-Tree header: I, G, M
- Patterns mined: IM, GM, IGM
- All patterns so far: BM, GM, BGM, IM, GM, IGM

FP-Growth Mining Example (freq_itemset = { G M })

- i = B: pattern = { B G M }, support = 2 → construct cFP-Tree (contains G:2, M:2)

FP-Growth Mining Example (freq_itemset = { B G M })

- cFP-Tree header: B, G, M
- Patterns mined: BM, GM, BGM
- All patterns so far: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM

FP-Growth Mining Example (freq_itemset = { M })

- i = H: pattern = { H M }, support = 1, cFP-Tree = null

FP-Growth Mining Example

- Complete pass for { M }
- Move on to { I }, { H }, etc.
- Final frequent sets:
- L2 : { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }
- L3 : { BGM, GIM, BGM, GIM }
- L4 : None

FP-Growth Redundancy v. Apriori Candidate Sets

- FP-Growth (support=2) generates:
- L2 : 10 sets (5 distinct)
- L3 : 4 sets (2 distinct)
- Total : 14 sets

- Apriori (support=2) generates:
- C2 : 21 sets
- C3 : 2 sets
- Total : 23 sets

- What about support=1?
- Apriori : 23 sets
- FP-Growth : 28 sets in {M} with i = A, B, F alone!
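The Apriori figures above follow from simple counting: all seven 1-itemsets (A, B, F, G, H, I, M) are frequent at support 2, so C2 is every pair of them. The set counts are the slide's; the arithmetic check is ours:

```python
from math import comb

frequent_items = ['A', 'B', 'F', 'G', 'H', 'I', 'M']  # all frequent at support 2
c2 = comb(len(frequent_items), 2)  # Apriori's C2: every pair of frequent items
print(c2)  # 21, matching the slide's "C2 : 21 sets"
```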

FP-Growth vs. Apriori

- Apriori visits each transaction whenever it generates a new candidate set; FP-Growth does not
- Data structures can be used to reduce the transaction list

- FP-Growth traces out sets of concurrent items; Apriori generates candidate sets
- FP-Growth uses more complicated data structures & mining techniques

Algorithm Analysis Results

- FP-Growth IS NOT inherently faster than Apriori
- Intuitively, it appears to condense data
- The mining scheme requires new work to replace candidate set generation
- Recursion obscures the additional effort

- FP-Growth may run faster than Apriori in some circumstances
- Complexity analysis alone cannot guarantee which algorithm is more efficient

Improvements to FP-Growth

- MLFPT is the only improvement reported to date
- Multiple Local Frequent Pattern Tree
- A new algorithm based on FP-Growth
- Distributes FP-Trees among processors

- No reported complexity analysis or accuracy comparison against FP-Growth

Real World Applications

- Zheng, Kohavi, Mason – “Real World Performance of Association Rule Algorithms”
- Collected implementations of Apriori, FP-Growth, CLOSET, CHARM, MagnumOpus
- Tested implementations against 1 artificial and 3 real data sets
- Time-based comparisons generated

Apriori & FP-Growth

- Apriori
- Implementation from creator Christian Borgelt (GNU Public License)
- C implementation
- Entire dataset loaded into memory

- FP-Growth
- Implementation from creators Han & Pei
- Version – February 5, 2001

Other Algorithms

- CHARM
- Based on the concept of Closed Itemsets
- e.g., the closed itemset { A, B, C } subsumes subsets such as AB, AC, BC when they have the same support

- CLOSET
- Han & Pei's implementation of the Closed Itemset approach

- MagnumOpus
- Generates rules directly through a search-and-prune technique

Datasets

- IBM-Artificial
- Generated at IBM Almaden (T10I4D100K)
- Often used in association rule mining studies

- BMS-POS
- Years of point-of-sale data from a retailer

- BMS-WebView-1 & BMS-WebView-2
- Months of clickstream traffic from e-commerce web sites

Experimental Considerations

- Hardware Specifications
- Dual 550MHz Pentium III Xeon processors
- 1GB Memory

- Support { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }
- Confidence = 0%
- No other applications running (second processor handles system processes)

(Charts: runtime results on BMS-POS and BMS-WebView-2.)

Study Results – Real Data

- At support ≥ 0.20%, Apriori performs as fast as or better than FP-Growth
- At support < 0.20%, Apriori completes whenever FP-Growth completes
- Exception – BMS-WebView-2 @ 0.01%

- Even when 2 million rules are generated, Apriori finishes in 10 minutes or less
- Proposed – the bottleneck is NOT the rule-generation algorithm, but rule analysis

Study Results – Artificial Data

- At support < 0.40%, FP-Growth performs MUCH faster than Apriori
- At support ≥ 0.40%, FP-Growth and Apriori are comparable

Real-World Study Conclusions

- FP-Growth (and other non-Apriori algorithms) perform better on artificial data
- On all data sets, Apriori performs sufficiently well, in reasonable time, for reasonable result sets
- FP-Growth may be suitable when low support, a large result count, and fast generation are needed
- Future research may best be directed toward analyzing association rules

Research Conclusions

- FP-Growth does not have better complexity than Apriori
- Intuition says it should run faster; the complexity analysis says otherwise

- FP-Growth does not always have a better running time than Apriori
- Support threshold and dataset characteristics appear more influential

- FP-Trees are very complex structures (Apriori's are simple)
- Location of data (memory vs. disk) is a non-factor in comparing the algorithms

To Use or Not To Use?

- Question: Should FP-Growth be used in favor of Apriori?
- Difficulty of coding
- High performance in extreme cases
- Personal preference

- More relevant questions:
- What kind of data is it?
- What kind of results do I want?
- How will I analyze the resulting rules?

References

- Han, Jiawei, Pei, Jian, and Yin, Yiwen, “Mining Frequent Patterns without Candidate Generation”. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.
- Orlando, Salvatore, “High Performance Mining of Short and Long Patterns”. 2001.
- Pei, Jian, Han, Jiawei, and Mao, Runying, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”. In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.
- Webb, Geoffrey I., “Efficient search for association rules”. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99-107, 2000.
- Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul, “Fast Parallel Association Rule Mining Without Candidacy Generation”. In Proc. of the IEEE 2001 International Conference on Data Mining (ICDM'2001), San Jose, CA, USA, November 29-December 2, 2001
- Zaki, Mohammed J., “Generating Non-Redundant Association Rules”. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.
- Zheng, Zijian, Kohavi, Ron, and Mason, Llew, “Real World Performance of Association Rule Algorithms”. In proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.
