
# The FP-Growth/Apriori Debate (Presentation Transcript)


### The FP-Growth/Apriori Debate

Jeffrey R. Ellis

CSE 300 – 01

April 11, 2002

Presentation Overview

• FP-Growth Algorithm

• Refresher & Example

• Motivation

• FP-Growth Complexity

• Vs. Apriori Complexity

• Saving calculation or hiding work?

• Real World Application

• Datasets are not created equal

• Results of real-world implementation

• Association Rule Mining

• Generate Frequent Itemsets

• Apriori generates candidate sets

• FP-Growth uses specialized data structures (no candidate sets)

• Find Association Rules

• Outside the scope of both FP-Growth & Apriori

• Therefore, FP-Growth is a competitor to Apriori

FP-Growth Example

| TID | Items bought | (Ordered) frequent items |
| --- | --- | --- |
| 100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p} |
| 200 | {a, b, c, f, l, m, o} | {f, c, a, b, m} |
| 300 | {b, f, h, j, o} | {f, b} |
| 400 | {b, c, k, s, p} | {c, b, p} |
| 500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p} |

[Figure: FP-tree built from these transactions, with header table f:4, c:4, a:3, b:3, m:3, p:3. From the root {}: f:4 → c:3 → a:3 → m:2 → p:2, with side branches a:3 → b:1 → m:1 and f:4 → b:1, plus a second branch {} → c:1 → b:1 → p:1.]

FP-Growth Example

Conditional pattern bases:

• c: f:3
• a: fc:3
• b: fca:1, f:1, c:1
• m: fca:2, fcab:1
• p: fcam:2, cb:1

Conditional pattern bases and conditional FP-trees:

• p: {(fcam:2), (cb:1)} → {(c:3)}|p
• m: {(fca:2), (fcab:1)} → {(f:3, c:3, a:3)}|m
• b: {(fca:1), (f:1), (c:1)} → Empty
• a: {(fc:3)} → {(f:3, c:3)}|a
• c: {(f:3)} → {(f:3)}|c
• f: Empty → Empty
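To see how the conditional FP-tree column follows from the pattern-base column, here is a tiny self-contained Python check for item p with min_support = 3 (a sketch; the variable names are illustrative, not from the paper):

```python
from collections import Counter

# p's conditional pattern base, read off the table above: (prefix path, count)
pattern_base_p = [(("f", "c", "a", "m"), 2), (("c", "b"), 1)]

min_support = 3
counts = Counter()
for prefix, count in pattern_base_p:
    for item in prefix:
        counts[item] += count

# Only c reaches the minimum support, so p's conditional FP-tree is the single node c:3.
print({item: c for item, c in counts.items() if c >= min_support})   # {'c': 3}
```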

FP-Growth Algorithm: FP-Tree Construction

Input: DB, min_support
Output: FP-Tree

Scan DB & count all frequent items.
Create null root & set as current node.
For each transaction T
    Sort T's frequent items by descending frequency.
    For each sorted item I
        Insert I into the tree as a child of the current node (or step into an existing child and increment its count).
        Connect each new tree node to the header list.

• Two passes through DB.
• Tree creation work is proportional to the number of items in DB.
• Complexity of CreateTree is O(|DB|).
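To make the two passes concrete, here is a minimal Python sketch of the construction step, run against the five example transactions. The names (`FPNode`, `build_fptree`) and the alphabetical tie-break are illustrative choices, not taken from the original paper, so the resulting tree layout can differ slightly from the slides (which order f before c).

```python
from collections import defaultdict

class FPNode:
    """One FP-tree node: an item label, a count, and parent/child links."""
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}                 # item -> FPNode

def build_fptree(transactions, min_support):
    # Pass 1: count item supports and keep only the frequent items.
    support = defaultdict(int)
    for t in transactions:
        for item in set(t):
            support[item] += 1
    frequent = {i: s for i, s in support.items() if s >= min_support}

    # Pass 2: insert each transaction with its frequent items sorted by
    # descending support (ties broken alphabetically).
    root = FPNode()                        # null root
    header = defaultdict(list)             # item -> list of nodes holding that item
    for t in transactions:
        items = sorted({i for i in t if i in frequent},
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header, frequent

# The five transactions from the example slide, min_support = 3.
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header, frequent = build_fptree(db, min_support=3)
print(sorted(frequent.items()))   # [('a', 3), ('b', 3), ('c', 4), ('f', 4), ('m', 3), ('p', 3)]
```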

FP-Growth Example (the transaction table and FP-tree above, repeated for reference)

FP-Growth Algorithm: Mining the FP-Tree

Input: FP-Tree, f_is (frequent itemset)
Output: All freq_patterns

if FP-Tree contains a single path P
    then for each combination of nodes in P
        generate pattern f_is ∪ combination
else for each item i in header
    generate pattern f_is ∪ i
    construct pattern's conditional cFP-Tree
    if (cFP-Tree ≠ null)
        then call FP-Growth(cFP-Tree, pattern)

• Recursive algorithm creates FP-Tree structures and calls FP-Growth.
• Claims no candidate generation.
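A hedged sketch of this recursion, continuing the `FPNode`/`build_fptree` sketch above (it reuses those definitions and the `header`/`frequent` variables built there, so it is not standalone). For brevity it skips the single-path shortcut and simply builds a conditional tree for every frequent item; the helper names are again illustrative.

```python
def prefix_path(node):
    """Items on the path from node's parent up to, but excluding, the root."""
    path, node = [], node.parent
    while node is not None and node.item is not None:
        path.append(node.item)
        node = node.parent
    return path

def fp_growth(header, frequent, suffix, min_support, results):
    for item, support in frequent.items():
        pattern = suffix | {item}
        results[frozenset(pattern)] = support
        # Conditional pattern base: one weighted prefix path per node holding `item`.
        cond_db = []
        for node in header[item]:
            cond_db.extend([prefix_path(node)] * node.count)
        # Build the conditional FP-tree; recurse only if it still has frequent items.
        _, c_header, c_frequent = build_fptree(cond_db, min_support)
        if c_frequent:
            fp_growth(c_header, c_frequent, pattern, min_support, results)

# Mine the example tree built above (min_support = 3).
results = {}
fp_growth(header, frequent, frozenset(), 3, results)
print(sorted("".join(sorted(p)) for p in results))
# ['a', 'ac', 'acf', 'acfm', 'acm', 'af', 'afm', 'am', 'b', 'c', 'cf', 'cfm', 'cm', 'cp', 'f', 'fm', 'm', 'p']
```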

Two-Part Algorithm

[Figure: a single-path FP-tree]

if FP-Tree contains a single path P
    then for each combination of nodes in P
        generate pattern f_is ∪ combination

• e.g., { A B C D }

• p = |pattern| = 4

• Patterns generated: AD, BD, CD, ABD, ACD, BCD, ABCD

• Number of combinations: Σ (n = 1 to p-1) C(p-1, n) = 2^(p-1) - 1 = 7
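A quick check of that count, assuming the single path holds A, B, C above the suffix item D as in the example (a throwaway snippet, not part of the algorithm itself):

```python
from itertools import combinations

path, suffix = ["A", "B", "C"], "D"        # nodes on the single path, plus the suffix item
patterns = [set(combo) | {suffix}
            for n in range(1, len(path) + 1)
            for combo in combinations(path, n)]
print(len(patterns))                        # 7 == 2**(4 - 1) - 1, i.e. 2^(p-1) - 1 with p = 4
```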

Two-Part Algorithm

[Figure: a multi-path FP-tree]

else for each item i in header
    generate pattern f_is ∪ i
    construct pattern's conditional cFP-Tree
    if (cFP-Tree ≠ null)
        then call FP-Growth(cFP-Tree, pattern)

• e.g., { A B C D E } for f_is = D

• i = A, p_base = (ABCD), (ACD), (AD)

• i = B, p_base = (ABCD)

• i = C, p_base = (ABCD), (ACD)

• i = E, p_base = null

• Pattern bases are generated, for each header item, by following its node-links from the header to every tree node holding that item and walking from each such node up to the tree root.

• Therefore, each path in the tree is at least partially traversed a number of times equal to the number of items on that path (the depth of the tree path) multiplied by the number of items in the header.

• The complexity of searching through all paths is then bounded by O(header_count² · tree depth).

• A new cFP-Tree is also created for each such pattern.

Sample Data FP-Tree

[Figure: FP-tree for a larger sample dataset, rooted at null, with branches over the items A, B, C, D, E, J, K, L, M]

• Candidate generation sets are exchanged for FP-Trees.

• You MUST take into account every path that contains an item set including the item under test.

• You CANNOT determine, before a conditional FP-Tree is created, whether new frequent item sets will occur.

• Trivial examples obscure these points, leading to a belief that FP-Growth operates more efficiently.

FP-Growth Mining Example

Transactions: A (49), B (44), F (39), G (34), H (30), I (20), plus ABGM, BGIM, FGIM, GHIM

[Figure: FP-tree for this data, with header table A:50, B:45, F:40, G:35, H:30, I:20, M; each of the four M-transactions contributes a path of count 1]

freq_itemset = { M }, i = A

• pattern = { A M }, support = 1
• pattern base → cFP-Tree = null

[Figure: the sample-data FP-tree, repeated]

freq_itemset = { M }, i = B

• pattern = { B M }, support = 2
• pattern base → cFP-Tree …

[Figure: the sample-data FP-tree (repeated) and nodes B:2, M:2]

FP-Growth Mining Example: freq_itemset = { B M }

[Figure: conditional FP-tree, a single path containing G, B, M]

• Patterns mined: BM, GM, BGM
• All patterns so far: BM, GM, BGM

freq_itemset = { M }, i = F

• pattern = { F M }, support = 1
• pattern base → cFP-Tree = null

[Figure: the sample-data FP-tree, repeated]

freq_itemset = { M }, i = G

• pattern = { G M }, support = 4
• pattern base → cFP-Tree …

[Figure: the sample-data FP-tree, repeated]

freq_itemset = { G M }, i = I

• pattern = { I G M }, support = 3
• pattern base → cFP-Tree …

[Figure: conditional FP-tree over items I, B, G, M, plus nodes G:3, M:3]

FP-Growth Mining Example: freq_itemset = { I G M }

[Figure: conditional FP-tree, a single path containing I, G, M]

• Patterns mined: IM, GM, IGM
• All patterns so far: BM, GM, BGM, IM, GM, IGM

freq_itemset = { G M }, i = B

• pattern = { B G M }, support = 2
• pattern base → cFP-Tree …

[Figure: conditional FP-tree over items I, B, G, M (repeated), plus nodes G:2, M:2]

FP-Growth Mining Example: freq_itemset = { B G M }

[Figure: conditional FP-tree, a single path containing B, G, M]

• Patterns mined: BM, GM, BGM
• All patterns so far: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM

freq_itemset = { M }, i = H

• pattern = { H M }, support = 1
• pattern base → cFP-Tree = null

[Figure: the sample-data FP-tree, repeated]

• Complete pass for { M }

• Move on to { I }, { H }, etc.

• Final Frequent Sets:

• L2 : { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }

• L3 : { BGM, GIM, BGM, GIM }

• L4 : None

• FP-Growth (support=2) generates (compare with the sketch after this list):

• L2 : 10 sets (5 distinct)

• L3 : 4 sets (2 distinct)

• Total : 14 sets

• Apriori (support=2) generates:

• C2 : 21 sets

• C3 : 2 sets

• Total : 23 sets

• What about support=1?

• Apriori : 23 sets

• FP-Growth : 28 sets in {M} with i = A, B, F alone!
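As a sanity check on these counts, the two sketches above can be run on just the four M-containing transactions with min_support = 2 (assuming, as the result lists above suggest, that the remaining high-count transactions add no further frequent pairs). It reproduces the 5 distinct 2-itemsets and 2 distinct 3-itemsets, without the duplicates produced by the slide-by-slide walkthrough:

```python
# Reuses build_fptree and fp_growth from the earlier sketches.
db = [list("ABGM"), list("BGIM"), list("FGIM"), list("GHIM")]
root, header, frequent = build_fptree(db, min_support=2)

results = {}
fp_growth(header, frequent, frozenset(), 2, results)
for k in (2, 3, 4):
    print(f"L{k}:", sorted("".join(sorted(p)) for p in results if len(p) == k))
# L2: ['BG', 'BM', 'GI', 'GM', 'IM']
# L3: ['BGM', 'GIM']
# L4: []
```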

• Apriori visits each transaction when generating new candidate sets; FP-Growth does not

• Can use data structures to reduce transaction list

• FP-Growth traces the set of concurrent items; Apriori generates candidate sets

• FP-Growth uses more complicated data structures & mining techniques

• FP-Growth IS NOT inherently faster than Apriori

• Intuitively, it appears to condense data

• Mining scheme requires some new work to replace candidate set generation

• Recursion obscures the additional effort

• FP-Growth may run faster than Apriori in some circumstances

• Complexity analysis alone gives no guarantee of which algorithm will be more efficient

• None currently reported

• MLFPT

• Multiple Local Frequent Pattern Tree

• New algorithm that is based on FP-Growth

• Distributes FP-Trees among processors

• No reports of complexity analysis or accuracy of FP-Growth

• Zheng, Kohavi, Mason – “Real World Performance of Association Rule Algorithms”

• Collected implementations of Apriori, FP-Growth, CLOSET, CHARM, MagnumOpus

• Tested implementations against 1 artificial and 3 real data sets

• Time-based comparisons generated

• Apriori

• Implementation from creator Christian Borgelt (GNU Public License)

• C implementation

• Entire dataset loaded into memory

• FP-Growth

• Implementation from creators Han & Pei

• Version – February 5, 2001

• CHARM

• Based on the concept of the Closed Itemset (a frequent itemset with no superset of equal support)

• e.g., { A, B, C } – ABC, AB, AC, BC, etc.

• CLOSET

• Han, Pei implementation of Closed Itemset

• MagnumOpus

• Generates rules directly through search-and-prune technique

• IBM-Artificial

• Generated at IBM Almaden (T10I4D100K)

• Often used in association rule mining studies

• BMS-POS

• Years of point-of-sale data from retailer

• BMS-WebView-1 & BMS-WebView-2

• Months of clickstream traffic from e-commerce web sites

• Hardware Specifications

• Dual 550MHz Pentium III Xeon processors

• 1GB Memory

• Support { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }

• Confidence = 0%

• No other applications running (second processor handles system processes)

[Charts: running-time results on BMS-POS and BMS-WebView-2]

• At support ≥ 0.20%, Apriori performs as fast as or better than FP-Growth

• At support < 0.20%, Apriori completes whenever FP-Growth completes

• Exception – BMS-WebView-2 @ 0.01%

• When 2 million rules are generated, Apriori finishes in 10 minutes or less

• Proposed – Bottleneck is NOT the rule algorithm, but rule analysis

• At support < 0.40%, FP-Growth performs MUCH faster than Apriori

• At support ≥ 0.40%, FP-Growth and Apriori are comparable

• FP-Growth (and other non-Apriori algorithms) perform better on artificial data

• On all data sets, Apriori performs sufficiently well in reasonable time periods for reasonable result sets

• FP-Growth may be suitable when low support, large result count, fast generation are needed

• Future research may best be directed toward analyzing association rules

• FP-Growth does not have a better complexity than Apriori

• Common sense indicates it will run faster

• FP-Growth does not always have a better running time than Apriori

• Support, dataset appears more influential

• FP-Trees are very complex structures (Apriori is simple)

• Location of data (memory vs. disk) is a non-factor in comparing the algorithms

• Question: Should FP-Growth be used in favor of Apriori?

• Difficulty to code

• High performance at extreme cases

• Personal preference

• More relevant questions

• What kind of data is it?

• What kind of results do I want?

• How will I analyze the resulting rules?

References

• Han, Jiawei, Pei, Jian, and Yin, Yiwen, “Mining Frequent Patterns without Candidate Generation”. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.

• Orlando, Salvatore, “High Performance Mining of Short and Long Patterns”. 2001.

• Pei, Jian, Han, Jiawei, and Mao, Runying, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”. In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.

• Webb, Geoffrey I., “Efficient search for association rules”. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99-107, 2000.

• Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul, “Fast Parallel Association Rule Mining Without Candidacy Generation”. In Proc. of the IEEE 2001 International Conference on Data Mining (ICDM'2001), San Jose, CA, USA, November 29-December 2, 2001

• Zaki, Mohammed J., “Generating Non-Redundant Association Rules”. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.

• Zheng, Zijian, Kohavi, Ron, and Mason, Llew, “Real World Performance of Association Rule Algorithms”. In proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.