
### The FP-Growth/Apriori Debate

Jeffrey R. Ellis

CSE 300 – 01

April 11, 2002

Presentation Overview
• FP-Growth Algorithm
• Refresher & Example
• Motivation
• FP-Growth Complexity
• Vs. Apriori Complexity
• Saving calculation or hiding work?
• Real World Application
• Datasets are not created equal
• Results of real-world implementation
FP-Growth Algorithm
• Association Rule Mining
• Generate Frequent Itemsets
• Apriori generates candidate sets
• FP-Growth uses specialized data structures (no candidate sets)
• Find Association Rules
• Outside the scope of both FP-Growth & Apriori
• FP-Growth is therefore a direct competitor to Apriori on frequent-itemset generation

[Figure: FP-tree for the example. Header table: f:4, c:4, a:3, b:3, m:3, p:3. Root {} with branch f:4 → c:3 → a:3 → m:2 → p:2, plus b:1 → m:1 under a:3, b:1 under f:4, and c:1 → b:1 → p:1 under the root.]

FP-Growth Example

| TID | Items bought | (Ordered) frequent items |
|-----|--------------|--------------------------|
| 100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p} |
| 200 | {a, b, c, f, l, m, o} | {f, c, a, b, m} |
| 300 | {b, f, h, j, o} | {f, b} |
| 400 | {b, c, k, s, p} | {c, b, p} |
| 500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p} |
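The first DB scan and the per-transaction reordering behind the table can be sketched in a few lines of Python. The transaction list is taken from the table; the tie-breaking order among equal counts (f before c, a before b before m before p) is fixed to match the slide's header table:

```python
from collections import Counter

# Transactions from the table above; min_support = 3.
transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_support = 3

# First DB scan: global item counts, keeping only frequent items.
counts = Counter(item for t in transactions for item in t)
frequent = {item: n for item, n in counts.items() if n >= min_support}

# Frequency-descending rank, ties fixed as in the slide's header table.
rank = {item: r for r, item in enumerate("fcabmp")}

# Reorder each transaction's frequent items by that rank.
ordered = [sorted((i for i in t if i in frequent), key=rank.__getitem__)
           for t in transactions]
print(ordered[0])  # ['f', 'c', 'a', 'm', 'p']
```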

Conditional pattern bases

| Item | Cond. pattern base |
|------|--------------------|
| c | f:3 |
| a | fc:3 |
| b | fca:1, f:1, c:1 |
| m | fca:2, fcab:1 |
| p | fcam:2, cb:1 |

| Item | Conditional pattern-base | Conditional FP-tree |
|------|--------------------------|---------------------|
| p | {(fcam:2), (cb:1)} | {(c:3)}\|p |
| m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}\|m |
| b | {(fca:1), (f:1), (c:1)} | Empty |
| a | {(fc:3)} | {(f:3, c:3)}\|a |
| c | {(f:3)} | {(f:3)}\|c |
| f | Empty | Empty |
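The conditional pattern bases in the table can be reproduced by decomposing the FP-tree into its root-to-leaf paths (with leaf counts) and taking, for each item, the prefix of every path that contains it. This is a stand-in for following the header-list links in a real FP-tree; `pattern_base` and the path list are illustrative names, not from the paper:

```python
# Root-to-leaf paths of the example FP-tree, each with its leaf's count.
paths = [
    (["f", "c", "a", "m", "p"], 2),
    (["f", "c", "a", "b", "m"], 1),
    (["f", "b"], 1),
    (["c", "b", "p"], 1),
]

def pattern_base(item, paths):
    """Prefixes (with counts) of every path containing `item`."""
    base = []
    for path, count in paths:
        if item in path:
            prefix = path[:path.index(item)]
            if prefix:  # an empty prefix contributes nothing
                base.append((prefix, count))
    return base

print(pattern_base("m", paths))  # [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)]
print(pattern_base("b", paths))  # [(['f', 'c', 'a'], 1), (['f'], 1), (['c'], 1)]
```

The results match the table: m's base is fca:2, fcab:1; b's base is fca:1, f:1, c:1; f, being a root child on every path, has an empty base.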

FP-Growth Example
FP-CreateTree

Input: DB, min_support

Output: FP-Tree

    Scan DB & count all frequent items.
    Create null root & set as current node.
    For each transaction T:
        Sort T's frequent items in frequency-descending order.
        For each sorted item I:
            Insert I into the tree as a child of the current node,
            incrementing the count if that child already exists.
            Connect each new tree node to its item's header list.

Two passes through DB.

Tree-creation work is proportional to the total number of items in DB.

Complexity of CreateTree is O(|DB|)
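A compact Python sketch of FP-CreateTree follows. The class and function names are my own, and header-list ties among equal counts are broken by `Counter.most_common` rather than by any rule from the paper:

```python
from collections import Counter

class Node:
    """One FP-tree node: item label, parent link, count, children, header link."""
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}   # item -> child Node
        self.link = None     # next node carrying the same item (header list)

def create_tree(db, min_support):
    # Pass 1: count items; keep frequent ones in frequency-descending order.
    counts = Counter(i for t in db for i in t)
    order = [i for i, n in counts.most_common() if n >= min_support]
    rank = {i: r for r, i in enumerate(order)}

    root = Node(None, None)
    header = {i: None for i in order}
    # Pass 2: insert each transaction's sorted frequent items.
    for t in db:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.__getitem__):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                # Thread the new node onto its item's header list.
                child.link, header[item] = header[item], child
            child.count += 1
            node = child
    return root, header
```

Both DB scans are explicit, and the insertion loop touches each transaction item once, which is the basis of the O(|DB|) claim.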

FP-Growth Algorithm


FP-Growth Example


FP-Growth

Input: FP-Tree, f_is (frequent itemset)

Output: All freq_patterns

    if FP-Tree contains a single path P
    then for each combination of nodes in P
        generate pattern f_is ∪ combination
    else for each item i in header
        generate pattern f_is ∪ i
        construct the pattern's conditional cFP-Tree
        if cFP-Tree ≠ null
        then call FP-Growth(cFP-Tree, pattern)

Recursive algorithm creates FP-Tree structures and calls FP-Growth on them.

Claims no candidate generation.
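The recursion can be sketched without explicit tree nodes by carrying each conditional pattern base as a list of (items, count) pairs. This is a simplified stand-in for Han & Pei's algorithm (the function name and representation are mine, and the FP-tree is replaced by the flat pattern base), but it mines the same frequent itemsets:

```python
from collections import Counter

def fp_growth(db, min_support, suffix=()):
    """Yield (pattern, support) for every frequent itemset in db.

    db is a list of (items, count) pairs: the current conditional
    pattern base, with the full database as the initial base.
    """
    counts = Counter()
    for items, n in db:
        for i in set(items):
            counts[i] += n
    # Fixed frequency-ascending processing order; restricting each
    # recursion to higher-ranked items prevents duplicate patterns.
    order = [i for i in sorted(counts, key=counts.__getitem__)
             if counts[i] >= min_support]
    rank = {i: r for r, i in enumerate(order)}
    for item in order:
        pattern = suffix + (item,)
        yield pattern, counts[item]
        # Conditional pattern base for `pattern`.
        cond = []
        for items, n in db:
            if item in items:
                prefix = [i for i in items if i in rank and rank[i] > rank[item]]
                if prefix:
                    cond.append((prefix, n))
        yield from fp_growth(cond, min_support, pattern)
```

On the earlier five-transaction example with min_support = 3 this yields, among others, {f} with support 4, {c, p} with support 3, and {f, c, a, m} with support 3.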

FP-Growth Algorithm

[Figure: a single-path FP-tree A → B → C → D]

Two-Part Algorithm

    if FP-Tree contains a single path P
    then for each combination of nodes in P
        generate pattern f_is ∪ combination

• e.g., P = { A B C D }, p = |P| = 4
• Combinations with D: AD, BD, CD, ABD, ACD, BCD, ABCD
• Count: Σ (n = 1 to p-1) C(p-1, n) = 2^(p-1) - 1 = 7
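The combination count for the single-path case checks out in a few lines (assuming, as on the slide, only combinations that include the last node D):

```python
from itertools import combinations
from math import comb

path = ["A", "B", "C", "D"]  # single-path FP-tree from the slide
p = len(path)

# Every combination that includes D: choose n of the other p-1 nodes
# (n = 1..p-1) and add D itself.
combos = [set(c) | {"D"} for n in range(1, p)
          for c in combinations(path[:-1], n)]

assert len(combos) == sum(comb(p - 1, n) for n in range(1, p)) == 2 ** (p - 1) - 1
print(len(combos))  # 7: AD, BD, CD, ABD, ACD, BCD, ABCD
```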

[Figure: a multi-path FP-tree over items A through E]

Two-Part Algorithm

    else for each item i in header
        generate pattern f_is ∪ i
        construct the pattern's conditional cFP-Tree
        if cFP-Tree ≠ null
        then call FP-Growth(cFP-Tree, pattern)

• e.g., { A B C D E } for f_is = D
• i = A, p_base = (ABCD), (ACD), (AD)
• i = B, p_base = (ABCD)
• i = C, p_base = (ABCD), (ACD)
• i = E, p_base = null
• Pattern bases are generated, for each header item, by following the header link to every tree node carrying that item and walking from the node up to the tree root.
FP-Growth Complexity
• Each path in the tree is therefore at least partially traversed once for each header item that appears on it, i.e., work proportional to (depth of the tree path) × (number of header items) per path.
• The complexity of searching through all paths is then bounded by O(header_count² × tree depth).
• Creation of a new cFP-Tree also occurs at each recursive step.


Sample Data FP-Tree

[Figure: sample-data FP-tree rooted at null, with branches over items A, B, C, D, E, J, K, L, M]

Algorithm Results (in English)
• Candidate-generation sets are exchanged for conditional FP-Trees.
• You MUST take into account all paths that contain an itemset with a test item.
• You CANNOT determine, before a conditional FP-Tree is created, whether new frequent itemsets will occur.
• Trivial examples hide these costs, leading to a belief that FP-Growth operates more efficiently.

FP-Growth Mining Example

[Figure: example database and FP-tree. Transactions: A (49), B (44), F (39), G (34), H (30), I (20), plus ABGM, BGIM, FGIM, GHIM. Header table: A:50, B:45, F:40, G:35, H:30, I:20; each of the four M-transactions forms its own branch ending in M:1.]

FP-Growth Mining Example

freq_itemset = { M }
i = A: pattern = { A M }, support = 1 → pattern base yields cFP-Tree = null

FP-Growth Mining Example

freq_itemset = { M }
i = B: pattern = { B M }, support = 2 → cFP-Tree is the single path G:2 → B:2 → M:2

Recursing with freq_itemset = { B M } on that single path:

Patterns mined: BM, GM, BGM

All patterns: BM, GM, BGM

FP-Growth Mining Example

freq_itemset = { M }
i = F: pattern = { F M }, support = 1 → pattern base yields cFP-Tree = null

FP-Growth Mining Example

freq_itemset = { M }
i = G: pattern = { G M }, support = 4 → construct the conditional cFP-Tree for { G M }

FP-Growth Mining Example

freq_itemset = { G M }: conditional FP-tree with header I, B, G (nodes I:3; B:1, B:1; G:2, G:1, G:1; M:2, M:1, M:1)
i = I: pattern = { I G M }, support = 3 → cFP-Tree is the single path I:3 → G:3 → M:3

Recursing with freq_itemset = { I G M }:

Patterns mined: IM, GM, IGM

All patterns: BM, GM, BGM, IM, GM, IGM

FP-Growth Mining Example

freq_itemset = { G M }
i = B: pattern = { B G M }, support = 2 → cFP-Tree is the single path B:2 → G:2 → M:2

Recursing with freq_itemset = { B G M }:

Patterns mined: BM, GM, BGM

All patterns: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM

FP-Growth Mining Example

freq_itemset = { M }
i = H: pattern = { H M }, support = 1 → pattern base yields cFP-Tree = null

FP-Growth Mining Example
• Complete pass for { M }
• Move on to { I }, { H }, etc.
• Final Frequent Sets:
• L2 : { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }
• L3 : { BGM, GIM, BGM, GIM }
• L4 : None
FP-Growth Redundancy v. Apriori Candidate Sets
• FP-Growth (support=2) generates:
• L2 : 10 sets (5 distinct)
• L3 : 4 sets (2 distinct)
• Total : 14 sets
• Apriori (support=2) generates:
• C2 : 21 sets
• C3 : 2 sets
• Total : 23 sets
• Apriori : 23 sets
• FP-Growth : 28 sets in {M} with i = A, B, F alone!
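Apriori's C2 count above can be verified directly: with min_support = 2, all seven items A, B, F, G, H, I, M in the example are frequent 1-itemsets, so C2 contains every pair of them:

```python
from math import comb

# Frequent 1-itemsets of the mining example at support = 2.
frequent_items = ["A", "B", "F", "G", "H", "I", "M"]

# Apriori's C2 is every pair of frequent items.
c2_size = comb(len(frequent_items), 2)
print(c2_size)  # 21 candidate 2-itemsets, matching the slide's count
```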
FP-Growth vs. Apriori
• Apriori visits each transaction when generating new candidate sets; FP-Growth does not
• Can use data structures to reduce transaction list
• FP-Growth traces the set of concurrent items; Apriori generates candidate sets
• FP-Growth uses more complicated data structures & mining techniques
Algorithm Analysis Results
• FP-Growth IS NOT inherently faster than Apriori
• Intuitively, it appears to condense data
• Mining scheme requires some new work to replace candidate set generation
• Recursion obscures the additional effort
• FP-Growth may run faster than Apriori in some circumstances
• Complexity analysis alone cannot guarantee which algorithm will be more efficient
Improvements to FP-Growth
• None currently reported, beyond parallelization:
• MLFPT
• Multiple Local Frequent Pattern Tree
• A new algorithm based on FP-Growth
• Distributes FP-Trees among processors
• No reported complexity analysis or accuracy comparison against FP-Growth
Real World Applications
• Zheng, Kohavi, Mason – “Real World Performance of Association Rule Algorithms”
• Collected implementations of Apriori, FP-Growth, CLOSET, CHARM, MagnumOpus
• Tested implementations against 1 artificial and 3 real data sets
• Time-based comparisons generated
Apriori & FP-Growth
• Apriori
• Implementation from creator Christian Borgelt (GNU Public License)
• C implementation
• Entire dataset loaded into memory
• FP-Growth
• Implementation from creators Han & Pei
• Version – February 5, 2001
Other Algorithms
• CHARM
• Based on the concept of the closed itemset (an itemset none of whose supersets has the same support)
• e.g., a closed { A, B, C } can stand in for subsets such as AB, AC, BC that occur with the same support
• CLOSET
• Han, Pei implementation of Closed Itemset
• MagnumOpus
• Generates rules directly through search-and-prune technique
Datasets
• IBM-Artificial
• Generated at IBM Almaden (T10I4D100K)
• Often used in association rule mining studies
• BMS-POS
• Years of point-of-sale data from retailer
• BMS-WebView-1 & BMS-WebView-2
• Months of clickstream traffic from e-commerce web sites
Experimental Considerations
• Hardware Specifications
• Dual 550MHz Pentium III Xeon processors
• 1GB Memory
• Support { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }
• Confidence = 0%
• No other applications running (second processor handles system processes)

[Charts: runtime vs. minimum support for BMS-WebView-1 and BMS-WebView-2]
Study Results – Real Data
• At support ≥ 0.20%, Apriori performs as fast as or better than FP-Growth
• At support < 0.20%, Apriori completes whenever FP-Growth completes
• Exception – BMS-WebView-2 @ 0.01%
• When 2 million rules are generated, Apriori finishes in 10 minutes or less
• Proposed – Bottleneck is NOT the rule algorithm, but rule analysis
Study Results – Artificial Data
• At support < 0.40%, FP-Growth performs MUCH faster than Apriori
• At support ≥ 0.40%, FP-Growth and Apriori are comparable
Real-World Study Conclusions
• FP-Growth (and other non-Apriori algorithms) perform better on artificial data
• On all data sets, Apriori performs sufficiently well in reasonable time periods for reasonable result sets
• FP-Growth may be suitable when low support, large result count, fast generation are needed
• Future research may best be directed toward analyzing association rules
Research Conclusions
• FP-Growth does not have a better complexity than Apriori
• Common sense indicates it will run faster
• FP-Growth does not always have a better running time than Apriori
• Support level and dataset characteristics appear more influential
• FP-Trees are very complex structures (Apriori is simple)
• Location of data (memory vs. disk) is non-factor in comparison of algorithms
To Use or Not To Use?
• Question: Should FP-Growth be used in favor of Apriori?
• Difficulty to code
• High performance at extreme cases
• Personal preference
• More relevant questions
• What kind of data is it?
• What kind of results do I want?
• How will I analyze the resulting rules?
References
• Han, Jiawei, Pei, Jian, and Yin, Yiwen, “Mining Frequent Patterns without Candidate Generation”. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.
• Orlando, Salvatore, “High Performance Mining of Short and Long Patterns”. 2001.
• Pei, Jian, Han, Jiawei, and Mao, Runying, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”. In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.
• Webb, Geoffrey L., “Efficient search for association rules”. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99-107, 2000.
• Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul, “Fast Parallel Association Rule Mining Without Candidacy Generation”. In Proc. of the IEEE 2001 International Conference on Data Mining (ICDM'2001), San Jose, CA, USA, November 29-December 2, 2001.
• Zaki, Mohammed J., “Generating Non-Redundant Association Rules”. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.
• Zheng, Zijian, Kohavi, Ron, and Mason, Llew, “Real World Performance of Association Rule Algorithms”. In proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.