- By
**loyal** - Follow User

- 177 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Mining Association Rules' - loyal

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Outline

- Basic concepts and road map
- Scalable frequent pattern mining methods
- Association rules generation
- Research Problems

Frequent Pattern Analysis

- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- Frequent pattern mining
- Finding inherent regularities in data
- Foundations of many data mining tasks (association, correlation, classification etc.)
- Applications:

Basket data analysis, cross marketing, catalog design, web log analysis.

What is association rule mining ?

- Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
- E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Itemset X = {x1, …, xk}
- Find all the rules X Ywith minimum support and confidence

What is association rule mining ?

- Support = The rule X => Y has supports in the transaction set D if s% of transaction in D contain X union Y
- Confidence = The rule X => Y has confidence c if c% of transactions in D that contain X also contain Y
- Association rule mining
- Find all sets of items that meet the minimum support

- Frequent pattern mining

- Generate association rules from these sets

Generate Frequent Itemsets

- 3 major approaches
- Apriori (Agrawal & [email protected]’94)
- Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
- Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)

Apriori

- Initially, scan DB once to get frequent 1-itemset
- Generate length (k+1) candidate itemsets from length k frequent itemsets
- Testthe candidates against DB
- Terminate when no frequent or candidate set can be generated

Apriori candidate generation

- How to generate candidates?
- Step 1: self-joining Lk
- Step 2: pruning
- Example of Candidate-generation
- L3={abc, abd, acd, ace, bcd}
- Self-joining: L3*L3
- abcd from abc and abd
- acde from acd and ace
- Pruning:
- acde is removed because ade is not in L3
- C4={abcd}

Drawbacks

- Multiple scans of transaction database
- Multiple database scans are costly
- Huge number of candidates
- To find frequent itemset i1i2…i100
- # of scans: 100
- # of Candidates: (1001) + (1002) + … + (110000) = 2100-1 = 1.27*1030 !

FPGrowth

- Uses the Apriori Pruning Principle
- Scan DB only twice!
- Once to find frequent 1-itemset (single item pattern)
- Once to construct FP-tree, the data structure of FPGrowth

FPGrowth

TID Items bought

100 {f, a, c, d, g, i, m, p}

200 {a, b, c, f, l, m, o}

300 {b, f, h, j, o, w}

400 {b, c, k, s, p}

500{a, f, c, e, l, p, m, n}

Header Table

Item frequency

f 4

c 4

a 3

b 3

m 3

p 3

TID (ordered) frequent items

100 {f, c, a, m, p}

200 {f, c, a, b, m}

300 {f, b}

400 {c, b, p}

500 {f, c, a, m, p}

{}

f:1

c:1

a:1

m:1

p:1

FPGrowth

TID Items bought

100 {f, a, c, d, g, i, m, p}

200 {a, b, c, f, l, m, o}

300 {b, f, h, j, o, w}

400 {b, c, k, s, p}

500{a, f, c, e, l, p, m, n}

Header Table

Item frequency

f 4

c 4

a 3

b 3

m 3

p 3

TID (ordered) frequent items

100 {f, c, a, m, p}

200 {f, c, a, b, m}

300 {f, b}

400 {c, b, p}

500 {f, c, a, m, p}

{}

f:4

c:1

c:3

b:1

b:1

a:3

p:1

m:2

b:1

p:2

m:1

Header Table

Item frequency head

f 4

c 4

a 3

b 3

m 3

p 3

f:4

c:1

c:3

b:1

b:1

a:3

p:1

m:2

b:1

p:2

m:1

FPGrowthConditional pattern bases

Item cond. pattern base freq. itemset

p fcam:2, cb:1 fp, cp, ap, mp, pfc, pfa, pfm, pfm, pca, pcm, pam, pfcam

m fca:2, fcab:1 fm, cm, am, fcm, fam, cam, fcam

b fca:1, f:1, c:1 …

a fc:3 …

c f:3 …

FPGrowth vs Apriori

- no candidate generation, no candidate test
- compressed database: FP-tree structure
- no repeated scan of entire database
- basic ops—counting local freq items and building sub FP-tree, no pattern search and matching

FPGrowth vs Apriori

Data set T25I20D10K

FPGrowth vs Apriori

- Dense dataset (http://www.cs.yorku.ca/course_archive/2005-06/F/6412/lecnotes/assorule3-2.pdf)

Scaling

- DB projection
- FP-tree cannot fit in memory?
- Partition a database into a set of projected DBs
- Construct and mine FP-tree for each projected DB

fcamp

fcabm

fb

cbp

fcamp

p-proj DB

fcam

cb

fcam

m-proj DB

fcab

fca

fca

b-proj DB

f

cb

…

a-proj DB

fc

…

c-proj DB

f

…

f-proj DB

…

am-proj DB

fc

fc

fc

cm-proj DB

f

f

f

…

DB ProjectionCharm: Closed Itemset Mining

- A frequent itemset with size s has 2s-2 frequent subsets
- S is large in many real world problems
- biosequences, census data, etc
- Only generate itemsets that cannot be subsumed others with the same support
- If A B, and sup(A) = sup (B), A will not be in the result
- But sup(A) can be inferred from B and others

Closed Itemset

- An itemset X is closed if X is frequent and X does not have a superset Y such that supt(Y) = support(X)
- Lossless compression
- Divide frequent itemsets into equivalence classes

Charm: Search in IT-Tree

- Each node has a Itemset and its Tidset

Charm properties to prune

- Itemset Xi, Xj, and theri Tidset t(Xi), t(Xj)
- If t(Xi) = t(Xj), sup(Xi) = sup(Xj) = sup(Xi Xj)
- Xi, Xj always occurs together
- If t(Xi) t(Xj), sup(Xj) != sup(Xi) = sup(Xi Xj)
- If Xi occurs, Xj also occurs
- If t(Xj) t(Xi), sup(Xi) != sup(Xj) = sup(Xi Xj)
- If Xj occurs, Xi also occurs
- If t(Xj)!= t(Xi), sup(Xi) != sup(Xj) != sup(Xi Xj)

A x 1345

D x 2456

C x 123456

W x 12345

TA x 135

TW x 135

DT x 56

DW x 245

DA x 45

Minimum support = 3

WC x 12345

AW X 1345

AWC x 1345

TC x 1356

DC x 2456

TAC x 135

TWC x 135

DWC X 245

TAWC x 135

Itemsets

Charm: diffset for fast couting- Maintain a disk-based tidset for each item
- Vertical
- Easy to compute support: cardinality
- Intersection is expensive when tidset is large
- Diffset
- Track the differences in tids of a child node from its parent
- Save memory when tidsets are large and differences are little

X

d(XY)

Y

Association Rules

AB …A implies B

The easiest way to mine for association rules, is to first mine for frequent itemsets.

Mining Association Rules

- Quantity Problem
- Many frequent itemsets, many rules
- Redundant rules
- Quality Problem
- Not all rules are “interesting”

Association Rules

- Frequent itemset : AB
- Derived rules : AB and BA

Support (AB) = P(A,B)=|AB|/N

Confidence (AB) = P(B|A)=P(B,A)/P(A)=|AB|/|A|

|AB|: count of (AB)

|A|: count of (A)

|B|: count of (B)

N: total number of records

Other measures of Interestingness

- play basketball eat cereal [40%, 66.7%] is misleading
- The overall % of students eating cereal is 75% > 66.7%.
- play basketball not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
- Measure of dependent/correlated events: lift

Various Kinds of Association Rules

- multi-level association

Data categorized in a hierarchy

- multi-dimensional association

age(X,”19-25”) occupation(X,“student”) buys(X, “coke”)

- quantitative association

Rules with numerical attributes

- “interesting” correlation patterns

Eg. Some items (e.g. diamonds) may occur rarely but are valuable

Constraint-based Mining

- Finding all the patterns in a database autonomously? — unrealistic!
- The patterns could be too many but not focused!
- Data mining should be an interactive process
- User directs what to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining
- User flexibility: provides constraints on what to be mined
- System optimization: explores such constraints for efficient mining—constraint-based mining

Constraints in Data Mining

- Data constraint— using SQL-like queries
- find product pairs sold together in stores in Chicago in Dec.’02
- Dimension/level constraint
- in relevance to region, price, brand, customer category
- Rule (or pattern) constraint
- small sales (price < $10)
- Interestingness constraint
- strong rules: min_support 3%, min_confidence 60%

Anti-Monotonicity in Constraint Pushing

- Anti-monotonicity
- When an intemset S violates the constraint, so does any of its superset
- sum(S.Price) v is anti-monotone
- sum(S.Price) v is not anti-monotone
- Example. C: range(S.profit) 15 is anti-monotone
- Itemset ab violates C
- So does every superset of ab

Monotonicity for Constraint Pushing

- Monotonicity
- When an intemset S satisfies the constraint, so does any of its superset
- sum(S.Price) v is monotone
- min(S.Price) v is monotone
- Example. C: range(S.profit) 15
- Itemset ab satisfies C
- So does every superset of ab

L1

C1

Scan D

C2

C2

L2

Scan D

L3

C3

Scan D

Constraint:

Sum{S.price} < 5

The Constrained Apriori Algorithm: Push an Anti-monotone Constraint DeepConverting “Tough” Constraints

- Convert tough constraints into anti-monotone or monotone by properly ordering items
- Examine C: avg(S.profit) 25
- Order items in value-descending order
- <a, f, g, d, b, h, c, e>
- If an itemset afb violates C
- So does afbh, afb*
- It becomes anti-monotone!

Download Presentation

Connecting to Server..