# Chapter 5 Mining Association Rules with FP Tree


### Chapter 5 Mining Association Rules with FP Tree

Dr. Bernard Chen Ph.D.

University of Central Arkansas

Fall 2010

Mining Frequent Itemsets without Candidate Generation
• In many cases, the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gains.
• However, it suffers from two nontrivial costs:
• It may generate a huge number of candidates (for example, 10^4 frequent 1-itemsets yield roughly C(10^4, 2) ≈ 5*10^7 candidate 2-itemsets; see the quick check below)
• It may need to scan the database many times
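A quick sanity check on that candidate count (a minimal Python sketch; the 10^4 figure comes from the bullet above):

```python
from math import comb

# Every pair of frequent 1-itemsets becomes a candidate 2-itemset,
# so 10^4 frequent items yield C(10^4, 2) candidate pairs.
print(comb(10**4, 2))  # 49995000, i.e. roughly 5 * 10^7
```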
Bottleneck of Frequent-pattern Mining
• Multiple database scans are costly
• Mining long patterns needs many passes of scanning and generates lots of candidates
• To find the frequent itemset i1i2…i100
• # of scans: 100
• # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27*10^30 !
• Bottleneck: candidate-generation-and-test
• Can we avoid candidate generation?
Mining Frequent Patterns Without Candidate Generation
• Grow long patterns from short ones using local frequent items
• “abc” is a frequent pattern
• Get all transactions having “abc”: DB|abc
• “d” is a local frequent item in DB|abc → “abcd” is a frequent pattern (sketched in Python below)
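A toy illustration of this pattern-growth idea (not the full FP-growth algorithm; the database and function names here are invented for illustration):

```python
from collections import Counter

def project(db, pattern):
    """DB|pattern: the transactions that contain every item of `pattern`."""
    return [t for t in db if set(pattern) <= set(t)]

def local_frequent_items(db, pattern, min_count):
    """Items outside `pattern` that are frequent within DB|pattern."""
    counts = Counter(i for t in project(db, pattern)
                     for i in set(t) - set(pattern))
    return [i for i, c in counts.items() if c >= min_count]

# Toy database: "abc" is frequent, and "d" is locally frequent in DB|abc,
# so "abcd" is frequent as well.
db = [list("abcd"), list("abcde"), list("abce"), list("abcd")]
print(local_frequent_items(db, "abc", min_count=3))  # ['d']
```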
Process of FP growth
• Scan DB once, find frequent 1-itemset (single item pattern)
• Sort frequent items in frequency descending order
• Scan DB again, construct FP-tree
Association Rules
• Let’s have an example:

| TID  | Items      |
|------|------------|
| T100 | 1, 2, 5    |
| T200 | 2, 4       |
| T300 | 2, 3       |
| T400 | 1, 2, 4    |
| T500 | 1, 3       |
| T600 | 2, 3       |
| T700 | 1, 3       |
| T800 | 1, 2, 3, 5 |
| T900 | 1, 2, 3    |
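A minimal two-scan FP-tree construction for these nine transactions might look like this (a sketch assuming a count threshold of 2; the header table and node-links are omitted for brevity):

```python
from collections import Counter

transactions = [
    [1, 2, 5], [2, 4], [2, 3], [1, 2, 4], [1, 3],
    [2, 3], [1, 3], [1, 2, 3, 5], [1, 2, 3],
]
min_count = 2  # assumed threshold for this illustration

# Scan 1: count every item and rank the frequent ones by descending count.
counts = Counter(i for t in transactions for i in t)
rank = {i: r for r, (i, c) in enumerate(counts.most_common()) if c >= min_count}

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

root = Node(None, None)

# Scan 2: insert each transaction's frequent items in frequency order;
# transactions sharing a prefix share the corresponding tree path.
for t in transactions:
    node = root
    for i in sorted((i for i in t if i in rank), key=rank.get):
        node = node.children.setdefault(i, Node(i, node))
        node.count += 1

print({i: n.count for i, n in root.children.items()})  # {2: 7, 1: 2}
```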
Benefits of the FP-tree Structure
• Completeness
• Preserve complete information for frequent pattern mining
• Never break a long pattern of any transaction
• Compactness
• Reduce irrelevant info: infrequent items are gone
• Items in frequency-descending order: the more frequently an item occurs, the more likely its node is shared
• Never larger than the original database (not counting node-links and count fields)
• For Connect-4 DB, compression ratio could be over 100
Exercise
• A data set has five transactions; let min_sup = 60% and min_conf = 80%
• Find all frequent itemsets using an FP-tree
Association Rules with Apriori

With min_sup = 60% (a count of 3 out of 5 transactions), the level-wise passes are:

| Pass | Candidate counts | Frequent itemsets |
|------|------------------|-------------------|
| 1 | K:5, E:4, M:3, O:3, Y:3 | K, E, M, O, Y |
| 2 | KE:4, KM:3, KO:3, KY:3, EM:2, EO:3, EY:2, MO:1, MY:2, OY:2 | KE, KM, KO, KY, EO |
| 3 | KEO:3 | KEO |
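The slide's five-transaction table did not survive extraction, but the counts above match the classic exercise data set, assumed below. A brute-force level-wise count (simpler than real Apriori candidate generation and pruning) reproduces the table:

```python
from collections import Counter
from itertools import combinations

# Assumed transactions (consistent with the counts K:5, E:4, M:3, O:3, Y:3):
transactions = [set("MONKEY"), set("DONKEY"), set("MAKE"),
                set("MUCKY"), set("COOKIE")]
min_count = 3  # min-support 60% of 5 transactions

pool = {i for t in transactions for i in t}
for k in (1, 2, 3):
    # Count every k-subset of surviving items inside each transaction.
    counts = Counter(c for t in transactions
                     for c in combinations(sorted(pool & t), k))
    frequent = {"".join(c): n for c, n in counts.items() if n >= min_count}
    print(k, frequent)  # k=2 -> {'EK': 4, 'EO': 3, 'KM': 3, 'KO': 3, 'KY': 3}
    pool = {i for c in frequent for i in c}
```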

Association Rules with FP Tree

Processing the header-table items bottom-up (least frequent first):

| Item | Conditional pattern base | Conditional FP-tree | Frequent patterns generated |
|------|--------------------------|---------------------|-----------------------------|
| Y | {KEMO: 1, KEO: 1, KM: 1} | ⟨K: 3⟩ | KY |
| O | {KEM: 1, KE: 2} | ⟨K: 3, E: 3⟩ | KO, EO, KEO |
| M | {KE: 2, K: 1} | ⟨K: 3⟩ | KM |
| E | {K: 4} | ⟨K: 4⟩ | KE |
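A sketch showing how these conditional pattern bases arise. The frequency-sorted transactions below are an assumption, reconstructed from the counts above (each transaction reduced to its frequent items and sorted in K, E, M, O, Y order):

```python
from collections import Counter

sorted_tx = ["KEMOY", "KEOY", "KEM", "KMY", "KEO"]

# The conditional pattern base of an item is the multiset of prefix paths
# that precede it, collected bottom-up (least frequent item first).
for item in "YOME":
    base = Counter(t[:t.index(item)] for t in sorted_tx if item in t)
    print(item, dict(base))
# Y {'KEMO': 1, 'KEO': 1, 'KM': 1}
# O {'KEM': 1, 'KE': 2}
# M {'KE': 2, 'K': 1}
# E {'K': 4}
```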

(Performance comparison chart on the synthetic data set T25I20D10K.)

Why Is FP-Growth the Winner?
• Divide-and-conquer:
• decompose both the mining task and the DB according to the frequent patterns obtained so far
• leads to focused search of smaller databases
• Other factors:
• no candidate generation, no candidate test
• compressed database: the FP-tree structure
• no repeated scan of the entire database
• basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching

### Strong Association Rules are not necessarily interesting

Dr. Bernard Chen Ph.D.

University of Central Arkansas

Fall 2010

Example 5.8 Misleading “Strong” Association Rule
• Of the 10,000 transactions analyzed, the data show that
• 6,000 of the customer transactions included computer games,
• 7,500 included videos,
• and 4,000 included both computer games and videos
Misleading “Strong” Association Rule
• For this example:
• support({Game, Video}) = 4,000 / 10,000 = 40%
• confidence(Game ⇒ Video) = 4,000 / 6,000 ≈ 66%
• Suppose both pass our minimum support and confidence thresholds (30% and 60%, respectively)
Misleading “Strong” Association Rule
• However, the truth is: computer games and videos are negatively associated,
• which means the purchase of one of these items actually decreases the likelihood of purchasing the other
• (How do we reach this conclusion?)
Misleading “Strong” Association Rule
• 60% of customers buy the game and 75% buy the video
• If the two purchases were independent, we would expect 60% * 75% = 45% of customers to buy both
• That equals 4,500 transactions, more than the 4,000 actually observed, so the two items occur together less often than independence predicts
From Association Analysis to Correlation Analysis
• Lift is a simple correlation measure, given as follows
• The occurrence of itemset A is independent of the occurrence of itemset B if

P(A∪B) = P(A)P(B)

(here A∪B denotes the event that a transaction contains both A and B)

• Otherwise, itemsets A and B are dependent and correlated as events
• lift(A, B) = P(A∪B) / (P(A)P(B))
• If the value is less than 1, the occurrence of A is negatively correlated with the occurrence of B
• If the value is greater than 1, then A and B are positively correlated
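Applying lift to the game/video example answers the earlier question, using only the numbers from Example 5.8:

```python
p_game, p_video, p_both = 0.60, 0.75, 0.40

# lift < 1: the items co-occur less often than independence predicts.
lift = p_both / (p_game * p_video)
print(round(lift, 2))  # 0.89 < 1: games and videos are negatively correlated
```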
Mining Multiple-Level Association Rules
• Items often form hierarchies (e.g., Milk at Level 1, with 2% Milk and Skim Milk beneath it at Level 2)
• Support thresholds may be set per level of the hierarchy:
• Uniform support: the same min_sup at every level. With min_sup = 5% at Levels 1 and 2, Milk [support = 10%] and 2% Milk [support = 6%] are frequent, but Skim Milk [support = 4%] is not.
• Reduced support: lower levels get a lower threshold. With min_sup = 5% at Level 1 and 3% at Level 2, Skim Milk [support = 4%] also qualifies.

Mining Multiple-Level Association Rules
• Flexible support settings
• Items at the lower level are expected to have lower support
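A minimal sketch of per-level thresholds, using the milk figures from the slides above (the dictionary layout is just one way to organize it):

```python
# (item, hierarchy level) -> support, taken from the slide.
support = {("Milk", 1): 0.10, ("2% Milk", 2): 0.06, ("Skim Milk", 2): 0.04}

# Reduced support: lower levels get a lower threshold.
min_sup = {1: 0.05, 2: 0.03}
print([item for (item, lvl), s in support.items() if s >= min_sup[lvl]])
# ['Milk', '2% Milk', 'Skim Milk']; with uniform min_sup = 5%, Skim Milk would fail
```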
Multi-level Association: Redundancy Filtering
• Some rules may be redundant due to “ancestor” relationships between items.
• Example
• milk ⇒ wheat bread [support = 8%, confidence = 70%]
• 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• The second rule is redundant: if, say, about a quarter of milk purchases are 2% milk, its expected support is 8% * 1/4 = 2%, exactly what we observe, and its confidence is close to the ancestor's 70% (see the sketch below).
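One way to make that redundancy test concrete (the 25% share of 2% milk among milk purchases and the tolerance values are assumptions for illustration):

```python
# Ancestor rule: milk => wheat bread [support 8%, confidence 70%].
anc_sup, anc_conf = 0.08, 0.70
share = 0.25                   # assumed: ~1/4 of milk purchases are 2% milk

# Specialized rule: 2% milk => wheat bread [support 2%, confidence 72%].
sup, conf = 0.02, 0.72

expected_sup = anc_sup * share  # 0.08 * 0.25 = 0.02
redundant = abs(sup - expected_sup) < 0.005 and abs(conf - anc_conf) < 0.05
print(redundant)  # True: the rule says nothing beyond its ancestor
```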