Loading in 5 sec....

Mining Association RulesPowerPoint Presentation

Mining Association Rules

- 164 Views
- Updated On :
- Presentation posted in: General

Mining Association Rules. Charis Ermopoulos Qian Yang Yong Yang Hengzhi Zhong. Outline. Basic concepts and road map Scalable frequent pattern mining methods Association rules generation Research Problems. Frequent Pattern Analysis.

Mining Association Rules

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Mining Association Rules

Charis Ermopoulos

Qian Yang

Yong Yang

Hengzhi Zhong

- Basic concepts and road map
- Scalable frequent pattern mining methods
- Association rules generation
- Research Problems

- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- Frequent pattern mining
- Finding inherent regularities in data
- Foundations of many data mining tasks (association, correlation, classification etc.)
- Applications:
Basket data analysis, cross marketing, catalog design, web log analysis.

- Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
- E.g., 98% of people who purchase tires and auto accessories also get automotive services done

- Itemset X = {x1, …, xk}
- Find all the rules X Ywith minimum support and confidence

- Support = The rule X => Y has supports in the transaction set D if s% of transaction in D contain X union Y
- Confidence = The rule X => Y has confidence c if c% of transactions in D that contain X also contain Y

- Find all sets of items that meet the minimum support
- Frequent pattern mining

- Generate association rules from these sets

- 3 major approaches
- Apriori (Agrawal & Srikant@VLDB’94)
- Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
- Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)

- Initially, scan DB once to get frequent 1-itemset
- Generate length (k+1) candidate itemsets from length k frequent itemsets
- Testthe candidates against DB
- Terminate when no frequent or candidate set can be generated

Supmin = 2

L1

C1

1st scan

C2

C2

L2

2nd scan

L3

C3

3rd scan

- How to generate candidates?
- Step 1: self-joining Lk
- Step 2: pruning

- Example of Candidate-generation
- L3={abc, abd, acd, ace, bcd}
- Self-joining: L3*L3
- abcd from abc and abd
- acde from acd and ace

- Pruning:
- acde is removed because ade is not in L3

- C4={abcd}

- Multiple scans of transaction database
- Multiple database scans are costly

- Huge number of candidates
- To find frequent itemset i1i2…i100
- # of scans: 100
- # of Candidates: (1001) + (1002) + … + (110000) = 2100-1 = 1.27*1030 !

- To find frequent itemset i1i2…i100

- Uses the Apriori Pruning Principle
- Scan DB only twice!
- Once to find frequent 1-itemset (single item pattern)
- Once to construct FP-tree, the data structure of FPGrowth

TIDItems bought

100{f, a, c, d, g, i, m, p}

200{a, b, c, f, l, m, o}

300{b, f, h, j, o, w}

400{b, c, k, s, p}

500{a, f, c, e, l, p, m, n}

Header Table

Item frequency

f4

c4

a3

b3

m3

p3

TID (ordered) frequent items

100 {f, c, a, m, p}

200 {f, c, a, b, m}

300 {f, b}

400 {c, b, p}

500 {f, c, a, m, p}

{}

f:1

c:1

a:1

m:1

p:1

TIDItems bought

100{f, a, c, d, g, i, m, p}

200{a, b, c, f, l, m, o}

300{b, f, h, j, o, w}

400{b, c, k, s, p}

500{a, f, c, e, l, p, m, n}

Header Table

Item frequency

f4

c4

a3

b3

m3

p3

TID (ordered) frequent items

100 {f, c, a, m, p}

200 {f, c, a, b, m}

300 {f, b}

400 {c, b, p}

500 {f, c, a, m, p}

{}

f:4

c:1

c:3

b:1

b:1

a:3

p:1

m:2

b:1

p:2

m:1

{}

Header Table

Item frequency head

f4

c4

a3

b3

m3

p3

f:4

c:1

c:3

b:1

b:1

a:3

p:1

m:2

b:1

p:2

m:1

TID (ordered) frequent items

100 {f, c, a, m, p}

200 {f, c, a, b, m}

300 {f, b}

400 {c, b, p}

500 {f, c, a, m, p}

{}

Header Table

Item frequency head

f4

c4

a3

b3

m3

p3

f:4

c:1

c:3

b:1

b:1

a:3

p:1

m:2

b:1

p:2

m:1

Conditional pattern bases

Item cond. pattern base freq. itemset

pfcam:2, cb:1fp, cp, ap, mp, pfc, pfa, pfm, pfm, pca, pcm, pam, pfcam

mfca:2, fcab:1fm, cm, am, fcm, fam, cam, fcam

bfca:1, f:1, c:1…

afc:3…

cf:3…

- no candidate generation, no candidate test
- compressed database: FP-tree structure
- no repeated scan of entire database
- basic ops—counting local freq items and building sub FP-tree, no pattern search and matching

Data set T25I20D10K

- Dense dataset (http://www.cs.yorku.ca/course_archive/2005-06/F/6412/lecnotes/assorule3-2.pdf)

- DB projection
- FP-tree cannot fit in memory?

- Partition a database into a set of projected DBs
- Construct and mine FP-tree for each projected DB

Tran. DB

fcamp

fcabm

fb

cbp

fcamp

p-proj DB

fcam

cb

fcam

m-proj DB

fcab

fca

fca

b-proj DB

f

cb

…

a-proj DB

fc

…

c-proj DB

f

…

f-proj DB

…

am-proj DB

fc

fc

fc

cm-proj DB

f

f

f

…

- A frequent itemset with size s has 2s-2 frequent subsets
- S is large in many real world problems
- biosequences, census data, etc

- Only generate itemsets that cannot be subsumed others with the same support
- If A B, and sup(A) = sup (B), A will not be in the result
- But sup(A) can be inferred from B and others

- An itemset X is closed if X is frequent and X does not have a superset Y such that supt(Y) = support(X)
- Lossless compression
- Divide frequent itemsets into equivalence classes

- Each node has a Itemset and its Tidset

- Itemset Xi, Xj, and theri Tidset t(Xi), t(Xj)
- If t(Xi) = t(Xj), sup(Xi) = sup(Xj) = sup(Xi Xj)
- Xi, Xj always occurs together

- If t(Xi) t(Xj), sup(Xj) != sup(Xi) = sup(Xi Xj)
- If Xi occurs, Xj also occurs

- If t(Xj) t(Xi), sup(Xi) != sup(Xj) = sup(Xi Xj)
- If Xj occurs, Xi also occurs

- If t(Xj)!= t(Xi), sup(Xi) != sup(Xj) != sup(Xi Xj)

- If t(Xi) = t(Xj), sup(Xi) = sup(Xj) = sup(Xi Xj)

T x 1356

A x 1345

D x 2456

C x 123456

W x 12345

TA x 135

TW x 135

DT x 56

DW x 245

DA x 45

Minimum support = 3

WC x 12345

AW X 1345

AWC x 1345

TC x 1356

DC x 2456

TAC x 135

TWC x 135

DWC X 245

TAWC x 135

Tids

Itemsets

- Maintain a disk-based tidset for each item
- Vertical
- Easy to compute support: cardinality
- Intersection is expensive when tidset is large

- Diffset
- Track the differences in tids of a child node from its parent
- Save memory when tidsets are large and differences are little

X

d(XY)

Y

AB …A implies B

The easiest way to mine for association rules, is to first mine for frequent itemsets.

- Quantity Problem
- Many frequent itemsets, many rules
- Redundant rules

- Quality Problem
- Not all rules are “interesting”

- Frequent itemset : AB
- Derived rules : AB and BA
Support (AB) = P(A,B)=|AB|/N

Confidence (AB) = P(B|A)=P(B,A)/P(A)=|AB|/|A|

|AB|: count of (AB)

|A|: count of (A)

|B|: count of (B)

N: total number of records

- play basketball eat cereal [40%, 66.7%] is misleading
- The overall % of students eating cereal is 75% > 66.7%.

- play basketball not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
- Measure of dependent/correlated events: lift

- multi-level association
Data categorized in a hierarchy

- multi-dimensional association
age(X,”19-25”) occupation(X,“student”) buys(X, “coke”)

- quantitative association
Rules with numerical attributes

- “interesting” correlation patterns
Eg. Some items (e.g. diamonds) may occur rarely but are valuable

- Finding all the patterns in a database autonomously? — unrealistic!
- The patterns could be too many but not focused!

- Data mining should be an interactive process
- User directs what to be mined using a data mining query language (or a graphical user interface)

- Constraint-based mining
- User flexibility: provides constraints on what to be mined
- System optimization: explores such constraints for efficient mining—constraint-based mining

- Data constraint— using SQL-like queries
- find product pairs sold together in stores in Chicago in Dec.’02

- Dimension/level constraint
- in relevance to region, price, brand, customer category

- Rule (or pattern) constraint
- small sales (price < $10)

- Interestingness constraint
- strong rules: min_support 3%, min_confidence 60%

- Anti-monotonicity
- When an intemset S violates the constraint, so does any of its superset
- sum(S.Price) v is anti-monotone
- sum(S.Price) v is not anti-monotone

- Example. C: range(S.profit) 15 is anti-monotone
- Itemset ab violates C
- So does every superset of ab

- Monotonicity
- When an intemset S satisfies the constraint, so does any of its superset
- sum(S.Price) v is monotone
- min(S.Price) v is monotone

- Example. C: range(S.profit) 15
- Itemset ab satisfies C
- So does every superset of ab

Database D

L1

C1

Scan D

C2

C2

L2

Scan D

L3

C3

Scan D

Constraint:

Sum{S.price} < 5

- Convert tough constraints into anti-monotone or monotone by properly ordering items
- Examine C: avg(S.profit) 25
- Order items in value-descending order
- <a, f, g, d, b, h, c, e>

- If an itemset afb violates C
- So does afbh, afb*
- It becomes anti-monotone!

- Order items in value-descending order

- Most slides are from Jiawei Han