Mining Association Rules


Charis Ermopoulos

Qian Yang

Yong Yang

Hengzhi Zhong

- Basic concepts and road map
- Scalable frequent pattern mining methods
- Association rules generation
- Research Problems

- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- Frequent pattern mining
- Finding inherent regularities in data
- Foundations of many data mining tasks (association, correlation, classification etc.)
- Applications:
Basket data analysis, cross marketing, catalog design, web log analysis.

- Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
- E.g., 98% of people who purchase tires and auto accessories also get automotive services done

- Itemset X = {x1, …, xk}
- Find all the rules X => Y with minimum support and confidence

- Support = The rule X => Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y
- Confidence = The rule X => Y has confidence c if c% of transactions in D that contain X also contain Y
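
The two definitions above can be sketched in a few lines of Python. The transactions below are hypothetical and only loosely modelled on the tires example; the function names are illustrative, not taken from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Hypothetical transaction set D, for illustration only.
D = [
    {"tires", "accessories", "service"},
    {"tires", "accessories", "service"},
    {"tires", "oil"},
    {"accessories", "service"},
]

s = support(D, {"tires", "accessories", "service"})       # support of the rule
c = confidence(D, {"tires", "accessories"}, {"service"})  # confidence of the rule
```

Here the support of a rule X => Y is computed on X ∪ Y, exactly as in the definition, and confidence is the ratio of two supports.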

- Find all sets of items that meet the minimum support
- Frequent pattern mining

- Generate association rules from these sets

- 3 major approaches
- Apriori (Agrawal & Srikant @VLDB'94)
- Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
- Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)

- Initially, scan DB once to get frequent 1-itemset
- Generate length (k+1) candidate itemsets from length k frequent itemsets
- Test the candidates against DB
- Terminate when no frequent or candidate set can be generated
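
A minimal sketch of the level-wise loop described above, assuming an absolute minimum count and a small hypothetical four-transaction database (not taken from the slides). For brevity the join step takes unions of frequent k-itemsets rather than the prefix-based self-join; the pruning step is the usual subset check.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {frozenset([i]) for t in transactions for i in t}
    level = count(items)  # frequent 1-itemsets
    frequent = dict(level)
    k = 1
    while level:
        # Join: unions of two frequent k-itemsets that have size k + 1.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        level = count(candidates)  # test the candidates against the database
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical database, for illustration only.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(D, min_count=2)
```

The loop stops as soon as a level produces no frequent itemsets, which mirrors the termination condition in the last bullet.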

[Figure: Apriori example with Supmin = 2. 1st scan: C1 -> L1; 2nd scan: C2 -> L2; 3rd scan: C3 -> L3]

- How to generate candidates?
- Step 1: self-joining Lk
- Step 2: pruning

- Example of Candidate-generation
- L3={abc, abd, acd, ace, bcd}
- Self-joining: L3*L3
- abcd from abc and abd
- acde from acd and ace

- Pruning:
- acde is removed because ade is not in L3

- C4={abcd}
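
The self-join and pruning steps can be sketched as follows, assuming itemsets are written as sorted strings as in the example. The function name `apriori_gen` is an illustrative choice.

```python
from itertools import combinations


def apriori_gen(frequent_k):
    """Generate (k+1)-candidates from frequent k-itemsets given as sorted strings."""
    frequent_k = sorted(frequent_k)
    known = set(frequent_k)
    k = len(frequent_k[0])
    candidates = []
    for i, p in enumerate(frequent_k):
        for q in frequent_k[i + 1:]:
            # Self-join: p and q must share their first k-1 items.
            if p[:-1] == q[:-1]:
                c = p + q[-1]
                # Prune: drop c if any of its k-subsets is not frequent.
                if all("".join(s) in known for s in combinations(c, k)):
                    candidates.append(c)
    return candidates


L3 = ["abc", "abd", "acd", "ace", "bcd"]
C4 = apriori_gen(L3)
```

With this input the join produces abcd and acde, and the prune step discards acde because ade is not in L3.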

- Multiple scans of transaction database
- Multiple database scans are costly

- Huge number of candidates
- To find frequent itemset $i_1 i_2 \dots i_{100}$
- # of scans: 100
- # of candidates: $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ !
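
The candidate count for a frequent 100-itemset can be checked directly. This is just the arithmetic, using `math.comb` from the Python standard library.

```python
from math import comb

# Every non-empty subset of a frequent 100-itemset is itself a candidate.
total = sum(comb(100, k) for k in range(1, 101))
formatted = f"{total:.2e}"
```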

- Uses the Apriori Pruning Principle
- Scan DB only twice!
- Once to find frequent 1-itemset (single item pattern)
- Once to construct FP-tree, the data structure of FPGrowth

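| TID | Items bought | (Ordered) frequent items |
|-----|--------------|--------------------------|
| 100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p} |
| 200 | {a, b, c, f, l, m, o} | {f, c, a, b, m} |
| 300 | {b, f, h, j, o, w} | {f, b} |
| 400 | {b, c, k, s, p} | {c, b, p} |
| 500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p} |

Header Table

| Item | Frequency |
|------|-----------|
| f | 4 |
| c | 4 |
| a | 3 |
| b | 3 |
| m | 3 |
| p | 3 |

FP-tree after inserting transaction 100:

{} -> f:1 -> c:1 -> a:1 -> m:1 -> p:1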

FP-tree after inserting all five ordered transactions (100: {f, c, a, m, p}; 200: {f, c, a, b, m}; 300: {f, b}; 400: {c, b, p}; 500: {f, c, a, m, p}):

- {} (root)
  - f:4
    - c:3
      - a:3
        - m:2
          - p:2
        - b:1
          - m:1
    - b:1
  - c:1
    - b:1
      - p:1
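
A minimal sketch of the two-scan construction on the five transactions of this example. The class and function names are illustrative. Ties in frequency are broken by following the header-table order f, c, a, b, m, p, which is taken from the slide rather than computed, and the minimum count of 3 is inferred from the header table.

```python
from collections import Counter


class Node:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, order):
    """Insert each transaction, sorted by `order`, into a prefix tree with counts."""
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None)
    header = {item: [] for item in order}  # item -> its nodes (the node-links)
    for t in transactions:
        items = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for i in items:
            child = node.children.get(i)
            if child is None:
                child = Node(i, node)
                node.children[i] = child
                header[i].append(child)
            child.count += 1
            node = child
    return root, header


D = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]

# Scan 1: find frequent items (minimum count 3 is an assumption).
counts = Counter(i for t in D for i in t)
frequent = {i for i, n in counts.items() if n >= 3}
order = [i for i in "fcabmp" if i in frequent]

# Scan 2: build the tree.
root, header = build_fp_tree(D, order)
```

Shared prefixes are stored once and only their counts grow, which is where the compression comes from.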

[Figure: header table (item, frequency, head of node-links) shown beside the FP-tree; each head pointer links to the tree nodes that carry that item]

Conditional pattern bases

| Item | Cond. pattern base | Freq. itemsets |
|------|--------------------|----------------|
| p | fcam:2, cb:1 | fp, cp, ap, mp, pfc, pfa, pfm, pca, pcm, pam, pfcam |
| m | fca:2, fcab:1 | fm, cm, am, fcm, fam, cam, fcam |
| b | fca:1, f:1, c:1 | … |
| a | fc:3 | … |
| c | f:3 | … |
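
The conditional pattern bases can be reproduced from the ordered transactions. The sketch below counts prefix paths directly instead of walking node-links in an FP-tree; the two give the same result here because the tree merges exactly these shared prefixes. The function name is illustrative.

```python
from collections import Counter

ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]


def conditional_pattern_base(ordered_transactions, item):
    """Prefix paths that precede `item`, mapped to how often each occurs."""
    base = Counter()
    for t in ordered_transactions:
        if item in t:
            prefix = t[:t.index(item)]
            if prefix:  # an empty prefix contributes nothing
                base[prefix] += 1
    return dict(base)


bases = {item: conditional_pattern_base(ordered, item) for item in "pmbac"}
```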

- no candidate generation, no candidate test
- compressed database: FP-tree structure
- no repeated scan of entire database
- basic ops—counting local freq items and building sub FP-tree, no pattern search and matching

Data set T25I20D10K

- Dense dataset (http://www.cs.yorku.ca/course_archive/2005-06/F/6412/lecnotes/assorule3-2.pdf)

- DB projection
- FP-tree cannot fit in memory?

- Partition a database into a set of projected DBs
- Construct and mine FP-tree for each projected DB

- Tran. DB: fcamp, fcabm, fb, cbp, fcamp
- p-proj DB: fcam, cb, fcam
- m-proj DB: fcab, fca, fca
- b-proj DB: f, cb, …
- a-proj DB: fc, …
- c-proj DB: f, …
- f-proj DB: …
- am-proj DB: fc, fc, fc
- cm-proj DB: f, f, f, …
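
Projection can be applied recursively, so each projected database is small enough to mine on its own. A minimal sketch, assuming transactions are already ordered strings as in the figure; `project` is an illustrative helper name.

```python
def project(db, item):
    """Keep, for each transaction containing `item`, the part that precedes it."""
    return [t[:t.index(item)] for t in db if item in t and t.index(item) > 0]


db = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

p_db = project(db, "p")     # p-projected database
m_db = project(db, "m")     # m-projected database
am_db = project(m_db, "a")  # project the m-projected database again on a
cm_db = project(m_db, "c")  # and on c
```

The order of rows inside a projected database follows the transaction order, so it may differ from the order drawn in the figure.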

- A frequent itemset of size s has $2^s - 2$ frequent proper, non-empty subsets
- s is large in many real-world problems
- biosequences, census data, etc.

- Only generate itemsets that cannot be subsumed by others with the same support
- If A ⊂ B and sup(A) = sup(B), A will not be in the result
- But sup(A) can be inferred from B and others

- An itemset X is closed if X is frequent and X has no proper superset Y such that support(Y) = support(X)
- Lossless compression
- Divide frequent itemsets into equivalence classes

- Each node has an itemset and its tidset

- Itemsets Xi, Xj and their tidsets t(Xi), t(Xj)
- If t(Xi) = t(Xj), sup(Xi) = sup(Xj) = sup(Xi ∪ Xj)
- Xi and Xj always occur together

- If t(Xi) ⊂ t(Xj), sup(Xj) != sup(Xi) = sup(Xi ∪ Xj)
- If Xi occurs, Xj also occurs

- If t(Xj) ⊂ t(Xi), sup(Xi) != sup(Xj) = sup(Xi ∪ Xj)
- If Xj occurs, Xi also occurs

- If neither tidset contains the other, sup(Xi ∪ Xj) is smaller than both sup(Xi) and sup(Xj)
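
The tidset cases can be illustrated with Python sets, where `<` tests for a proper subset. The tidsets below are those of A, W, T, TA and TW from the example in these slides; the function name and the returned labels are illustrative.

```python
def relation(t_i, t_j):
    """Classify how two tidsets relate."""
    if t_i == t_j:
        return "equal"
    if t_i < t_j:
        return "i_in_j"
    if t_i > t_j:
        return "j_in_i"
    return "incomparable"


t = {
    "A": {1, 3, 4, 5},
    "W": {1, 2, 3, 4, 5},
    "T": {1, 3, 5, 6},
    "TA": {1, 3, 5},
    "TW": {1, 3, 5},
}

r_equal = relation(t["TA"], t["TW"])  # identical tidsets
r_subset = relation(t["A"], t["W"])   # whenever A occurs, W occurs
r_other = relation(t["T"], t["A"])    # neither contains the other

sup_AW = len(t["A"] & t["W"])  # support of the union is the tidset intersection
sup_TA = len(t["T"] & t["A"])
```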

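Itemset × tidset pairs (minimum support = 3):

| Itemsets | Tids |
|----------|------|
| T | 1356 |
| A | 1345 |
| D | 2456 |
| C | 123456 |
| W | 12345 |
| TA | 135 |
| TW | 135 |
| DT | 56 |
| DW | 245 |
| DA | 45 |
| WC | 12345 |
| AW | 1345 |
| AWC | 1345 |
| TC | 1356 |
| DC | 2456 |
| TAC | 135 |
| TWC | 135 |
| DWC | 245 |
| TAWC | 135 |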
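
Using the single-item tidsets from this example, closed frequent itemsets can be found by brute force. This is not the Charm search itself, only a small check of the definition: an itemset is kept when no proper superset has the same tidset. Itemsets are written with their items in alphabetical order.

```python
from itertools import combinations

tidsets = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
MIN_SUPPORT = 3


def tidset(itemset):
    """Tids of the transactions that contain every item of `itemset`."""
    return set.intersection(*(tidsets[i] for i in itemset))


# Enumerate every itemset and keep the frequent ones.
frequent = {}
for k in range(1, len(tidsets) + 1):
    for combo in combinations(sorted(tidsets), k):
        tids = tidset(combo)
        if len(tids) >= MIN_SUPPORT:
            frequent["".join(combo)] = tids

# Closed: no proper superset shares the same tidset (hence the same support).
closed = {
    x: len(t)
    for x, t in frequent.items()
    if not any(set(x) < set(y) and t == frequent[y] for y in frequent)
}
```

Support is simply the cardinality of the tidset, which is what makes the vertical format convenient.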

- Maintain a disk-based tidset for each item
- Vertical
- Easy to compute support: cardinality
- Intersection is expensive when tidset is large

- Diffset
- Track the differences in tids of a child node from its parent
- Saves memory when tidsets are large and the differences are small

[Figure: tidsets of X and Y, with the diffset d(XY)]
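
A small sketch of the diffset idea, assuming d(XY) is the set of tids that contain the parent X but not the child XY, that is t(X) − t(Y). The tidsets are those of W, A and T from the earlier example; the function names are illustrative.

```python
t = {
    "W": {1, 2, 3, 4, 5},
    "A": {1, 3, 4, 5},
    "T": {1, 3, 5, 6},
}


def diffset(t_x, t_y):
    """d(XY): tids in which X occurs but Y does not."""
    return t_x - t_y


def support_from_diffset(sup_x, d_xy):
    """sup(XY) = sup(X) - |d(XY)|, so the full tidset of XY is never stored."""
    return sup_x - len(d_xy)


d_WA = diffset(t["W"], t["A"])
sup_WA = support_from_diffset(len(t["W"]), d_WA)

d_WT = diffset(t["W"], t["T"])
sup_WT = support_from_diffset(len(t["W"]), d_WT)
```

When a child differs from its parent in only a few tids, the diffset is much smaller than the tidset it replaces.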

A => B … A implies B

The easiest way to mine association rules is to first mine frequent itemsets.

- Quantity Problem
- Many frequent itemsets, many rules
- Redundant rules

- Quality Problem
- Not all rules are “interesting”

- Frequent itemset: AB
- Derived rules: A => B and B => A
Support(A => B) = P(A,B) = |AB| / N

Confidence(A => B) = P(B|A) = P(B,A) / P(A) = |AB| / |A|

|AB|: count of (AB)

|A|: count of (A)

|B|: count of (B)

N: total number of records
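
A sketch of deriving rules from one frequent itemset and its counts. The counts for A, W and AW, and N = 6, are read off the tidset example earlier in these slides; the function name and the minimum confidence of 0.9 are illustrative assumptions.

```python
from itertools import combinations


def rules_from_itemset(itemset, count, min_conf):
    """Return (antecedent, consequent, confidence) for each split that passes min_conf.

    `count` maps a frozenset of items to the number of records containing it.
    """
    items = frozenset(itemset)
    out = []
    for r in range(1, len(items)):
        for lhs in combinations(sorted(items), r):
            lhs = frozenset(lhs)
            conf = count[items] / count[lhs]  # |AB| / |A|
            if conf >= min_conf:
                out.append(("".join(sorted(lhs)), "".join(sorted(items - lhs)), conf))
    return out


N = 6
count = {frozenset("A"): 4, frozenset("W"): 5, frozenset("AW"): 4}

support_AW = count[frozenset("AW")] / N  # |AB| / N
rules = rules_from_itemset("AW", count, min_conf=0.9)
```

Both derived rules share the same support; only their confidence differs, because the denominator changes with the antecedent.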

- play basketball => eat cereal [40%, 66.7%] is misleading
- The overall % of students eating cereal is 75% > 66.7%.

- play basketball => not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
- Measure of dependent/correlated events: lift
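
Lift compares the joint probability with what independence would predict: lift(A, B) = P(A,B) / (P(A) P(B)). A short sketch with the numbers from the example; P(basketball) = 0.60 is not stated on the slide but follows from support 40% and confidence 66.7%.

```python
def lift(p_ab, p_a, p_b):
    """Values below 1 indicate negative correlation, above 1 positive correlation."""
    return p_ab / (p_a * p_b)


p_basketball = 0.60  # implied by 0.40 / 0.667
p_cereal = 0.75

lift_cereal = lift(0.40, p_basketball, p_cereal)
lift_not_cereal = lift(0.20, p_basketball, 1 - p_cereal)
```

The first rule has lift below 1 despite its high confidence, which is why it is misleading; the second has lift above 1.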

- multi-level association
Data categorized in a hierarchy

- multi-dimensional association
age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")

- quantitative association
Rules with numerical attributes

- “interesting” correlation patterns
E.g., some items (such as diamonds) may occur rarely but are valuable

- Finding all the patterns in a database autonomously? — unrealistic!
- The patterns could be too many but not focused!

- Data mining should be an interactive process
- User directs what to be mined using a data mining query language (or a graphical user interface)

- Constraint-based mining
- User flexibility: provides constraints on what to be mined
- System optimization: explores such constraints for efficient mining—constraint-based mining

- Data constraint— using SQL-like queries
- find product pairs sold together in stores in Chicago in Dec.’02

- Dimension/level constraint
- in relevance to region, price, brand, customer category

- Rule (or pattern) constraint
- small sales (price < $10)

- Interestingness constraint
- strong rules: min_support 3%, min_confidence 60%

- Anti-monotonicity
- When an itemset S violates the constraint, so does every superset of S
- sum(S.Price) ≤ v is anti-monotone
- sum(S.Price) ≥ v is not anti-monotone

- Example. C: range(S.profit) ≤ 15 is anti-monotone
- Itemset ab violates C
- So does every superset of ab

- Monotonicity
- When an itemset S satisfies the constraint, so does every superset of S
- sum(S.Price) ≥ v is monotone
- min(S.Price) ≤ v is monotone

- Example. C: range(S.profit) ≥ 15
- Itemset ab satisfies C
- So does every superset of ab
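
The two properties can be checked by brute force on a small item universe. The prices and thresholds below are hypothetical, and prices are assumed non-negative. Passing the check on one universe illustrates the property; it does not prove it in general.

```python
from itertools import combinations

price = {"a": 2, "b": 3, "c": 6, "d": 1}  # hypothetical, non-negative prices


def powerset(items):
    items = sorted(items)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def is_anti_monotone(pred, items):
    """Every proper superset of a violating itemset also violates."""
    sets = powerset(items)
    return all(not pred(t) for s in sets if not pred(s) for t in sets if s < t)


def is_monotone(pred, items):
    """Every proper superset of a satisfying itemset also satisfies."""
    sets = powerset(items)
    return all(pred(t) for s in sets if pred(s) for t in sets if s < t)


def total(s):
    return sum(price[i] for i in s)


def spread(s):
    values = [price[i] for i in s]
    return max(values) - min(values)


sum_le = is_anti_monotone(lambda s: total(s) <= 5, price)
sum_ge_anti = is_anti_monotone(lambda s: total(s) >= 5, price)
sum_ge_mono = is_monotone(lambda s: total(s) >= 5, price)
min_le_mono = is_monotone(lambda s: min(price[i] for i in s) <= 2, price)
range_le = is_anti_monotone(lambda s: spread(s) <= 3, price)
```

An anti-monotone constraint can be used like the Apriori property: once an itemset violates it, none of its supersets needs to be generated.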

[Figure: Apriori on database D (three scans of D producing C1/L1, C2/L2 and C3/L3) under the constraint sum(S.price) < 5]

- Convert tough constraints into anti-monotone or monotone by properly ordering items
- Examine C: avg(S.profit) ≥ 25
- Order items in value-descending order
- <a, f, g, d, b, h, c, e>

- If an itemset afb violates C
- So does afbh, and so does every other itemset with prefix afb (afb*)
- It becomes anti-monotone!
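
A sketch of why the ordering helps. The profit values below are hypothetical, chosen only so that sorting them in descending order gives the sequence a, f, g, d, b, h, c, e. Appending items that come later in this order can only lower the average, so once a prefix violates avg(S.profit) ≥ 25, every extension does too.

```python
from itertools import combinations

# Hypothetical profits, consistent with the item order shown in the slide.
profit = {"a": 40, "f": 30, "g": 20, "d": 10, "b": 0, "h": -10, "c": -20, "e": -30}
order = sorted(profit, key=profit.get, reverse=True)


def avg_profit(itemset):
    return sum(profit[i] for i in itemset) / len(itemset)


def extensions(prefix):
    """Itemsets formed by appending items that come after the prefix's last item."""
    tail = order[order.index(prefix[-1]) + 1:]
    for r in range(1, len(tail) + 1):
        for extra in combinations(tail, r):
            yield prefix + "".join(extra)


prefix_violates = avg_profit("afb") < 25
all_extensions_violate = all(avg_profit(e) < 25 for e in extensions("afb"))
```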

- Most slides are from Jiawei Han