
The FP-Growth/Apriori Debate

Jeffrey R. Ellis

CSE 300 – 01

April 11, 2002


Presentation Overview

  • FP-Growth Algorithm

    • Refresher & Example

    • Motivation

  • FP-Growth Complexity

    • Vs. Apriori Complexity

    • Saving calculation or hiding work?

  • Real World Application

    • Datasets are not created equal

    • Results of real-world implementation


FP-Growth Algorithm

  • Association Rule Mining

    • Generate Frequent Itemsets

      • Apriori generates candidate sets

      • FP-Growth uses specialized data structures (no candidate sets)

    • Find Association Rules

      • Outside the scope of both FP-Growth & Apriori

  • Therefore, FP-Growth is a competitor to Apriori


FP-Growth Example

TID | Items bought             | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o}          | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

Header Table
Item | Frequency
f    | 4
c    | 4
a    | 3
b    | 3
m    | 3
p    | 3

[FP-tree figure: root {} with two branches — f:4 → c:3 → a:3 → (m:2 → p:2, and b:1 → m:1), plus f:4 → b:1; and c:1 → b:1 → p:1. Each header-table entry links to its nodes in the tree.]

Conditional pattern bases
Item | Conditional pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1


FP-Growth Example

Item | Conditional pattern-base  | Conditional FP-tree
p    | {(fcam:2), (cb:1)}        | {(c:3)}|p
m    | {(fca:2), (fcab:1)}       | {(f:3, c:3, a:3)}|m
b    | {(fca:1), (f:1), (c:1)}   | Empty
a    | {(fc:3)}                  | {(f:3, c:3)}|a
c    | {(f:3)}                   | {(f:3)}|c
f    | Empty                     | Empty


FP-Growth Algorithm

FP-CreateTree
Input: DB, min_support
Output: FP-Tree

  Scan DB & count all frequent items.
  Create null root & set as current node.
  For each Transaction T:
      Sort T's frequent items by descending frequency.
      For each sorted Item I:
          Insert I into the tree as a child of the current node
          (or increment the count of the matching existing child).
          Connect each new tree node to its header list.

  • Two passes through DB
  • Tree creation work is proportional to the number of items in DB.
  • Complexity of CreateTree is O(|DB|)

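The FP-CreateTree pseudocode above can be sketched in Python. This is a minimal illustration, not the authors' implementation: nodes are plain dicts, and ties in the frequency ordering are broken alphabetically, which may order items differently than the slides (e.g., c before f).

```python
from collections import defaultdict

def build_fptree(transactions, min_support):
    """Two passes over the DB, as in FP-CreateTree: count items, then insert
    each transaction's frequent items sorted by descending frequency."""
    # Pass 1: count item frequencies and keep only frequent items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    # Node = {item, count, children, parent}; header links every node per item.
    root = {"item": None, "count": 0, "children": {}, "parent": None}
    header = defaultdict(list)

    # Pass 2: insert each transaction along a shared prefix path.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))  # alphabetical tiebreak
        node = root
        for item in items:
            child = node["children"].get(item)
            if child is None:
                child = {"item": item, "count": 0, "children": {}, "parent": node}
                node["children"][item] = child
                header[item].append(child)  # connect new node to header list
            child["count"] += 1
            node = child
    return root, header, frequent
```

Both passes are linear in the total number of items in the DB, matching the O(|DB|) claim above.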



FP-Growth Algorithm

FP-Growth
Input: FP-Tree, f_is (frequent itemset)
Output: All freq_patterns

  if FP-Tree contains a single Path P
  then for each combination of nodes in P
          generate pattern f_is ∪ combination
  else for each item i in header
          generate pattern f_is ∪ i
          construct pattern's conditional cFP-Tree
          if (cFP-Tree ≠ null)
          then call FP-Growth (cFP-Tree, pattern)

  • Recursive algorithm creates FP-Tree structures and calls FP-Growth
  • Claims no candidate generation

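A compact, self-contained sketch of this recursion in Python (illustrative only, not Han & Pei's code). The single-path shortcut is omitted for brevity: the general else-branch alone is still correct, just slower on long single paths.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build(transactions, min_support):
    """Build an FP-tree; return the header table (item -> list of its nodes)."""
    counts = defaultdict(int)
    for t in transactions:
        for i in t:
            counts[i] += 1
    freq = {i: c for i, c in counts.items() if c >= min_support}
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:
        node = root
        for i in sorted((x for x in t if x in freq), key=lambda x: (-freq[x], x)):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return header

def fp_growth(header, f_is, min_support, out):
    """The else-branch of the pseudocode: for each header item, emit
    f_is ∪ {i}, build the conditional cFP-tree, and recurse on it."""
    for item, nodes in header.items():
        support = sum(n.count for n in nodes)
        pattern = f_is | {item}
        out.append((frozenset(pattern), support))
        # Conditional pattern base: the prefix path above every node holding
        # `item`, repeated once per count.
        cond_db = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([path] * n.count)
        cond_header = build(cond_db, min_support)
        if cond_header:                      # cFP-Tree ≠ null
            fp_growth(cond_header, pattern, min_support, out)
```

Running it on the five-transaction example from the earlier slides with min_support = 3 reproduces patterns such as {f, c, a, m} with support 3 and {c, p} with support 3.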

Two-Part Algorithm

[Figure: a single-path FP-tree A → B → C → D]

  if FP-Tree contains a single Path P
  then for each combination of nodes in P
          generate pattern f_is ∪ combination

  • e.g., { A B C D }
    • p = |pattern| = 4
    • AD, BD, CD, ABD, ACD, BCD, ABCD
  • Number of patterns: Σ (n = 1 to p−1) C(p−1, n) = 2^(p−1) − 1

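The single-path count above can be checked in a few lines of Python (illustrative only): enumerate every non-empty combination of the path items {A, B, C} joined with the suffix item D.

```python
from itertools import combinations

# Single path A -> B -> C with suffix item D (p = 4 nodes in total).
path, suffix = ["A", "B", "C"], "D"
patterns = [set(c) | {suffix}
            for n in range(1, len(path) + 1)
            for c in combinations(path, n)]
# sum_{n=1}^{p-1} C(p-1, n) = 2^(p-1) - 1 = 7 patterns
assert len(patterns) == 2 ** len(path) - 1
```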

Two-Part Algorithm

[Figure: a multi-branch FP-tree over { A B C D E }, with C, D, and E each appearing on several paths]

  else for each item i in header
          generate pattern f_is ∪ i
          construct pattern's conditional cFP-Tree
          if (cFP-Tree ≠ null)
          then call FP-Growth (cFP-Tree, pattern)

  • e.g., { A B C D E } for f_is = D
    • i = A, p_base = (ABCD), (ACD), (AD)
    • i = B, p_base = (ABCD)
    • i = C, p_base = (ABCD), (ACD)
    • i = E, p_base = null
  • Pattern bases are generated, for each header item, by following the links from the header to every tree node containing that item and walking each node's path up to the tree root.


FP-Growth Complexity

  • Each path in the tree is therefore at least partially traversed (the number of items on that path, i.e., the depth of the path) × (the number of items in the header) times.

  • The cost of searching through all paths is then bounded by O(header_count² × depth of tree).

  • Creation of a new cFP-Tree occurs as well.


Sample Data FP-Tree

[Figure: an FP-tree rooted at null with top-level branches A, B, C, D, E, J; the lower levels repeat K, L, and especially M across many paths, so mining M requires visiting many distinct branches.]

Algorithm Results (in English)

  • Candidate generation sets are exchanged for FP-Trees.

    • You MUST take into account all paths that contain an item set with a test item.

    • You CANNOT determine, before a conditional FP-Tree is created, whether new frequent item sets will occur.

    • Trivial examples hide these costs, leading to a belief that FP-Growth operates more efficiently.


FP-Growth Mining Example

Transactions: A (49), B (44), F (39), G (34), H (30), I (20), plus ABGM, BGIM, FGIM, GHIM

Header: A, B, F, G, H, I, M

[Figure: the resulting FP-tree rooted at null — main nodes A:50, B:45, F:40, G:35, H:30, I:20, with the four M-transactions forming side branches of B:1, G:1, H:1, I:1 nodes, each ending in an M:1 node.]


FP-Growth Mining Example

freq_itemset = { M }

i = A: pattern = { A M }, support = 1; its pattern base yields cFP-Tree = null, so no recursion.

[Same FP-tree and header as before.]


FP-Growth Mining Example

freq_itemset = { M }

i = B: pattern = { B M }, support = 2; its pattern base yields a conditional cFP-Tree (next slide).

[Same FP-tree and header as before.]


FP-Growth Mining Example

freq_itemset = { B M }

Conditional FP-tree: G:2 → B:2 → M:2 (Header: G, B, M)

Patterns mined: BM, GM, BGM

All patterns: BM, GM, BGM


FP-Growth Mining Example

freq_itemset = { M }

i = F: pattern = { F M }, support = 1; its pattern base yields cFP-Tree = null, so no recursion.

[Same FP-tree and header as before.]


FP-Growth Mining Example

freq_itemset = { M }

i = G: pattern = { G M }, support = 4; its pattern base yields a conditional cFP-Tree (next slide).

[Same FP-tree and header as before.]


FP-Growth Mining Example

freq_itemset = { G M }

Conditional FP-tree with Header: I, B, G, M (nodes I:3; B:1, B:1; G:2, G:1, G:1; M:2, M:1, M:1)

i = I: pattern = { I G M }, support = 3; its pattern base yields a conditional cFP-Tree (next slide).


FP-Growth Mining Example

freq_itemset = { I G M }

Conditional FP-tree: I:3 → G:3 → M:3 (Header: I, G, M)

Patterns mined: IM, GM, IGM

All patterns: BM, GM, BGM, IM, GM, IGM


FP-Growth Mining Example

freq_itemset = { G M }

(Same conditional FP-tree as before; Header: I, B, G, M)

i = B: pattern = { B G M }, support = 2; its pattern base yields a conditional cFP-Tree (next slide).


FP-Growth Mining Example

freq_itemset = { B G M }

Conditional FP-tree: B:2 → G:2 → M:2 (Header: B, G, M)

Patterns mined: BM, GM, BGM

All patterns: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM


FP-Growth Mining Example

freq_itemset = { M }

i = H: pattern = { H M }, support = 1; its pattern base yields cFP-Tree = null, so no recursion.

[Same FP-tree and header as before.]


FP-Growth Mining Example

  • Complete pass for { M }

  • Move on to { I }, { H }, etc.

  • Final Frequent Sets:

    • L2 : { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }

    • L3 : { BGM, GIM, BGM, GIM }

    • L4 : None


FP-Growth Redundancy v. Apriori Candidate Sets

  • FP-Growth (support=2) generates:

    • L2 : 10 sets (5 distinct)

    • L3 : 4 sets (2 distinct)

    • Total : 14 sets

  • Apriori (support=2) generates:

    • C2 : 21 sets

    • C3 : 2 sets

    • Total : 23 sets

  • What about support=1?

    • Apriori : 23 sets

    • FP-Growth : 28 sets in {M} with i = A, B, F alone!


FP-Growth vs. Apriori

  • Apriori visits each transaction when generating new candidate sets; FP-Growth does not

    • Can use data structures to reduce transaction list

  • FP-Growth traces the set of concurrent items; Apriori generates candidate sets

  • FP-Growth uses more complicated data structures & mining techniques

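For contrast with the bullets above, here is a minimal sketch of Apriori's level-wise candidate generation and counting, the work FP-Growth replaces with tree traversal. This is illustrative Python, not Borgelt's C implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise mining: join frequent (k-1)-itemsets into candidate
    k-itemsets, prune by downward closure, then scan the whole DB to count."""
    transactions = [set(t) for t in transactions]
    freq, level = {}, []
    # Level 1: frequent single items.
    for i in sorted({i for t in transactions for i in t}):
        s = sum(1 for t in transactions if i in t)
        if s >= min_support:
            freq[frozenset([i])] = s
            level.append(frozenset([i]))
    k = 2
    while level:
        # Join step: unions of frequent (k-1)-sets that differ in one item.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # Count step: one full pass over the DB per level.
        level = []
        for c in candidates:
            s = sum(1 for t in transactions if c <= t)
            if s >= min_support:
                freq[c] = s
                level.append(c)
        k += 1
    return freq
```

On the five-transaction example from the FP-Growth slides, with min_support = 3, this yields the same frequent itemsets (e.g., {f, c, a, m} with support 3), just via repeated DB scans instead of tree mining.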

Algorithm Analysis Results

  • FP-Growth IS NOT inherently faster than Apriori

    • Intuitively, it appears to condense the data

    • Its mining scheme introduces new work to replace candidate set generation

    • Recursion obscures the additional effort

  • FP-Growth may run faster than Apriori in some circumstances

  • Complexity analysis alone does not tell you which algorithm will be more efficient


Improvements to FP-Growth

  • None currently reported

  • MLFPT

    • Multiple Local Frequent Pattern Tree

    • New algorithm based on FP-Growth

    • Distributes FP-Trees among processors

  • No published complexity analysis or accuracy evaluation of FP-Growth


Real World Applications

  • Zheng, Kohavi, Mason – “Real World Performance of Association Rule Algorithms”

    • Collected implementations of Apriori, FP-Growth, CLOSET, CHARM, MagnumOpus

    • Tested implementations against 1 artificial and 3 real data sets

    • Time-based comparisons generated


Apriori & FP-Growth

  • Apriori

  • Implementation from creator Christian Borgelt (GNU General Public License)

    • C implementation

    • Entire dataset loaded into memory

  • FP-Growth

    • Implementation from creators Han & Pei

    • Version – February 5, 2001


Other Algorithms

  • CHARM

    • Based on the concept of the Closed Itemset (an itemset with no superset of equal support)

    • e.g., { A, B, C } stands in for equal-support subsets such as AB, AC, BC, etc.

  • CLOSET

    • Han, Pei implementation of Closed Itemset

  • MagnumOpus

    • Generates rules directly through search-and-prune technique


Datasets

  • IBM-Artificial

    • Generated at IBM Almaden (T10I4D100K)

    • Often used in association rule mining studies

  • BMS-POS

    • Years of point-of-sale data from a retailer

  • BMS-WebView-1 & BMS-WebView-2

    • Months of clickstream traffic from two e-commerce web sites



Experimental Considerations

  • Hardware Specifications

    • Dual 550MHz Pentium III Xeon processors

    • 1GB Memory

  • Support thresholds: { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }

  • Confidence = 0%

  • No other applications running (second processor handles system processes)



[Charts: algorithm running time vs. support threshold for BMS-WebView-1 and BMS-WebView-2]


Study Results – Real Data

  • At support ≥ 0.20%, Apriori performs as fast as or better than FP-Growth

  • At support < 0.20%, Apriori completes whenever FP-Growth completes

    • Exception – BMS-WebView-2 @ 0.01%

  • When 2 million rules are generated, Apriori finishes in 10 minutes or less

    • Proposed – Bottleneck is NOT the rule algorithm, but rule analysis



Study Results – Artificial Data

  • At support < 0.40%, FP-Growth performs MUCH faster than Apriori

  • At support ≥ 0.40%, FP-Growth and Apriori are comparable


Real-World Study Conclusions

  • FP-Growth (and the other non-Apriori algorithms) perform better on artificial data

  • On all data sets, Apriori performs sufficiently well, producing reasonable result sets in reasonable time

  • FP-Growth may be suitable when low support, a large result count, and fast generation are needed

  • Future research may best be directed toward analyzing association rules


Research Conclusions

  • FP-Growth does not have better complexity than Apriori

    • Common sense suggests it will run faster, but the analysis does not bear this out

  • FP-Growth does not always have a better running time than Apriori

    • Support threshold and dataset appear more influential

  • FP-Trees are very complex structures (Apriori's are simple)

  • Location of data (memory vs. disk) is a non-factor in comparing the algorithms


To Use or Not To Use?

  • Question: Should FP-Growth be used in favor of Apriori?

    • Difficulty of coding

    • High performance in extreme cases

    • Personal preference

  • More relevant questions

    • What kind of data is it?

    • What kind of results do I want?

    • How will I analyze the resulting rules?


References

  • Han, Jiawei, Pei, Jian, and Yin, Yiwen, “Mining Frequent Patterns without Candidate Generation”. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.

  • Orlando, Salvatore, “High Performance Mining of Short and Long Patterns”. 2001.

  • Pei, Jian, Han, Jiawei, and Mao, Runying, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”. In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.

  • Webb, Geoffrey I., “Efficient search for association rules”. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99–107, 2000.

  • Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul, “Fast Parallel Association Rule Mining Without Candidacy Generation”. In Proc. of the IEEE 2001 International Conference on Data Mining (ICDM'2001), San Jose, CA, USA, November 29-December 2, 2001

  • Zaki, Mohammed J., “Generating Non-Redundant Association Rules”. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.

  • Zheng, Zijian, Kohavi, Ron, and Mason, Llew, “Real World Performance of Association Rule Algorithms”. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.