

Mining Association Rules

Charis Ermopoulos

Qian Yang

Yong Yang

Hengzhi Zhong



Outline

  • Basic concepts and road map

  • Scalable frequent pattern mining methods

  • Association rules generation

  • Research Problems



Frequent Pattern Analysis

  • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

  • Frequent pattern mining

    • Finding inherent regularities in data

    • Foundations of many data mining tasks (association, correlation, classification, etc.)

    • Applications:

      basket data analysis, cross-marketing, catalog design, web log analysis



What is association rule mining?

  • Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)

  • Find: all rules that correlate the presence of one set of items with that of another set of items

    • E.g., 98% of people who purchase tires and auto accessories also get automotive services done

  • Itemset X = {x1, …, xk}

  • Find all the rules X  Ywith minimum support and confidence



What is association rule mining?

  • Support: the rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y

  • Confidence: the rule X ⇒ Y has confidence c if c% of the transactions in D that contain X also contain Y (both measures are sketched in code below)

  • Association rule mining

    • Find all sets of items that meet the minimum support

      - Frequent pattern mining

    • Generate association rules from these sets
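
To make the two definitions concrete, here is a minimal Python sketch; the function names and the toy transaction list are illustrative, not from the slides.

    # Support: fraction of transactions containing every item of the itemset.
    def support(transactions, itemset):
        itemset = set(itemset)
        hits = sum(1 for t in transactions if itemset <= set(t))
        return hits / len(transactions)

    # Confidence of X => Y: among transactions containing X, the fraction
    # that also contain Y, i.e. support(X union Y) / support(X).
    def confidence(transactions, X, Y):
        return support(transactions, set(X) | set(Y)) / support(transactions, X)

    D = [{'tires', 'accessories', 'service'},
         {'tires', 'accessories'},
         {'tires', 'service'}]
    print(support(D, {'tires', 'accessories'}))                   # 2/3
    print(confidence(D, {'tires', 'accessories'}, {'service'}))   # 0.5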



    Generate Frequent Itemsets

    • 3 major approaches

      • Apriori (Agrawal & Srikant @VLDB'94)

      • Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)

      • Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)



    Apriori

    • Initially, scan the DB once to get the frequent 1-itemsets

    • Generate length-(k+1) candidate itemsets from the length-k frequent itemsets

    • Test the candidates against the DB

    • Terminate when no frequent or candidate set can be generated
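
The loop above can be written compactly as the following sketch; this is not the paper's optimized implementation, and it assumes the apriori_gen helper sketched after the candidate-generation slide below.

    # Level-wise Apriori: count candidates, keep the frequent ones, repeat.
    def apriori(transactions, min_sup):
        transactions = [frozenset(t) for t in transactions]
        items = {frozenset([i]) for t in transactions for i in t}
        def freq(cands):   # one DB scan: keep candidates with enough support
            return {c for c in cands
                    if sum(1 for t in transactions if c <= t) >= min_sup}
        Lk, frequent = freq(items), set()
        while Lk:
            frequent |= Lk
            Lk = freq(apriori_gen(Lk))   # generate candidates, then test
        return frequent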


    [Figure: Apriori on an example database with Supmin = 2. The 1st scan counts candidate 1-itemsets C1 and keeps the frequent ones as L1; self-joining L1 gives C2, which the 2nd scan filters to L2; C3 generated from L2 is filtered by the 3rd scan to L3.]


    Apriori Candidate Generation

    • How to generate candidates?

      • Step 1: self-joining Lk

      • Step 2: pruning

    • Example of candidate generation

      • L3={abc, abd, acd, ace, bcd}

      • Self-joining: L3*L3

        • abcd from abc and abd

        • acde from acd and ace

      • Pruning:

        • acde is removed because ade is not in L3

      • C4={abcd}
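
A direct sketch of the two steps (join any two k-itemsets whose union has k+1 items, then prune). The textbook version joins only itemsets sharing a lexicographic (k−1)-prefix; the pruning step makes this simpler variant produce the same C4 here.

    from itertools import combinations

    def apriori_gen(Lk):
        k = len(next(iter(Lk)))
        # Step 1: self-join Lk -- unions of size k+1.
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Step 2: prune -- every k-subset of a candidate must be frequent.
        return {c for c in cands
                if all(frozenset(s) in Lk for s in combinations(c, k))}

    L3 = {frozenset(s) for s in ('abc', 'abd', 'acd', 'ace', 'bcd')}
    print(apriori_gen(L3))   # {frozenset({'a','b','c','d'})}; acde is pruned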



    Drawbacks

    • Multiple scans of transaction database

      • Multiple database scans are costly

    • Huge number of candidates

      • To find the frequent itemset {i1, i2, …, i100}:

        • # of scans: 100

        • # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 !



    FPGrowth

    • Uses the Apriori Pruning Principle

    • Scan DB only twice!

      • Once to find the frequent 1-itemsets (single-item patterns)

      • Once to construct FP-tree, the data structure of FPGrowth
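
The two scans translate into the following sketch of FP-tree construction; header-table node-links and the mining phase are omitted, and the names are mine rather than the paper's.

    from collections import Counter

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 0, {}

    def build_fptree(transactions, min_sup):
        # Scan 1: find the frequent single items.
        freq = Counter(i for t in transactions for i in t)
        freq = {i: c for i, c in freq.items() if c >= min_sup}
        # Scan 2: insert each transaction, frequent items only, sorted by
        # descending frequency (ties broken alphabetically).
        root = Node(None, None)
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in freq),
                               key=lambda i: (-freq[i], i)):
                child = node.children.setdefault(item, Node(item, node))
                child.count += 1
                node = child
        return root, freq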


    FPGrowth

    Transaction database:

    TID | Items bought
    100 | {f, a, c, d, g, i, m, p}
    200 | {a, b, c, f, l, m, o}
    300 | {b, f, h, j, o, w}
    400 | {b, c, k, s, p}
    500 | {a, f, c, e, l, p, m, n}

    Header table (items with support ≥ 3, in frequency order): f:4, c:4, a:3, b:3, m:3, p:3

    TID | (ordered) frequent items
    100 | {f, c, a, m, p}
    200 | {f, c, a, b, m}
    300 | {f, b}
    400 | {c, b, p}
    500 | {f, c, a, m, p}

    [Figure: FP-tree after inserting TID 100: {} → f:1 → c:1 → a:1 → m:1 → p:1]

    FPGrowth

    (Same transaction database, header table, and ordered frequent items as above.)

    [Figure: the complete FP-tree after all five transactions. The root {} has children f:4 and c:1; under f:4 are c:3 (→ a:3, which branches into m:2 → p:2 and b:1 → m:1) and b:1; under the root's c:1 is b:1 → p:1.]

    FPGrowth

    (Header table, FP-tree, and ordered frequent items as above; here the header table also carries a "head" pointer linking each item into the tree's node-links.)

    FPGrowth

    Conditional pattern bases (read off the FP-tree above by following each item's node-links upward):

    Item | Conditional pattern base | Frequent itemsets
    p    | fcam:2, cb:1             | cp
    m    | fca:2, fcab:1            | fm, cm, am, fcm, fam, cam, fcam
    b    | fca:1, f:1, c:1          | …
    a    | fc:3                     | …
    c    | f:3                      | …


    FPGrowth vs Apriori

    • no candidate generation, no candidate test

    • compressed database: FP-tree structure

    • no repeated scan of entire database

    • basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching



    FPGrowth vs Apriori

    [Figure: runtime comparison of FP-growth and Apriori on the synthetic data set T25I20D10K.]



    FPGrowth vs Apriori

    [Figure: runtime comparison on a dense dataset (source: http://www.cs.yorku.ca/course_archive/2005-06/F/6412/lecnotes/assorule3-2.pdf).]



    Scaling

    • DB projection: what if the FP-tree cannot fit in memory?

    • Partition a database into a set of projected DBs

    • Construct and mine FP-tree for each projected DB


    DB Projection

    Tran. DB: fcamp, fcabm, fb, cbp, fcamp

    First-level projections:

    p-proj DB: fcam, cb, fcam
    m-proj DB: fcab, fca, fca
    b-proj DB: f, cb
    a-proj DB: fc
    c-proj DB: f
    f-proj DB: (empty)

    Deeper projections (from the m-proj DB):

    am-proj DB: fc, fc, fc
    cm-proj DB: f, f, f


    Charm: Closed Itemset Mining

    • A frequent itemset of size s has 2^s − 2 (nonempty, proper) frequent subsets

    • s is large in many real-world problems

      • biosequences, census data, etc.

    • Only generate itemsets that cannot be subsumed by others with the same support

      • If A ⊂ B and sup(A) = sup(B), A will not be in the result

      • But sup(A) can be inferred from B and others



    Closed Itemset

    • An itemset X is closed if X is frequent and there exists no superset Y ⊃ X with support(Y) = support(X)

    • Lossless compression

    • Divide frequent itemsets into equivalence classes
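
Given the support of every frequent itemset, the closed ones can be filtered with a one-liner; this sketch (names mine) reuses the A/AW/AWC numbers from the IT-tree example a few slides below.

    # Keep itemsets with no proper superset of equal support.
    def closed_itemsets(supports):
        return {X: s for X, s in supports.items()
                if not any(X < Y and supports[Y] == s for Y in supports)}

    sup = {frozenset('A'): 4, frozenset('AW'): 4, frozenset('AWC'): 4}
    print(closed_itemsets(sup))   # only frozenset({'A', 'W', 'C'}) survives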



    Charm: Search in IT-Tree

    • Each node holds an itemset together with its tidset (the ids of the transactions containing it)



    Charm: Properties Used to Prune

    • Itemsets Xi, Xj and their tidsets t(Xi), t(Xj):

      • If t(Xi) = t(Xj), then sup(Xi) = sup(Xj) = sup(Xi ∪ Xj)

        • Xi and Xj always occur together

      • If t(Xi) ⊂ t(Xj), then sup(Xj) ≠ sup(Xi) = sup(Xi ∪ Xj)

        • Whenever Xi occurs, Xj also occurs

      • If t(Xj) ⊂ t(Xi), then sup(Xi) ≠ sup(Xj) = sup(Xi ∪ Xj)

        • Whenever Xj occurs, Xi also occurs

      • If t(Xi) ≠ t(Xj) and neither contains the other, then sup(Xi ∪ Xj) ≠ sup(Xi) and sup(Xi ∪ Xj) ≠ sup(Xj); neither itemset can be eliminated
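
A sketch of how these four cases could drive Charm's merge/extend/branch decisions; an outline under my own naming, not Zaki & Hsiao's actual code.

    # Itemsets as frozensets, tidsets as Python sets.
    def combine(Xi, tXi, Xj, tXj):
        if tXi == tXj:        # case 1: identical tidsets -> merge both
            return ('merge both into', Xi | Xj)
        if tXi < tXj:         # case 2: Xi implies Xj -> extend Xi
            return ('replace Xi with', Xi | Xj)
        if tXj < tXi:         # case 3: Xj implies Xi -> extend Xj
            return ('replace Xj with', Xi | Xj)
        # case 4: incomparable -> keep both, branch with the intersection
        return ('new child', Xi | Xj, tXi & tXj)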


    Charm: IT-Tree Example

    Minimum support = 3. Itemset × tidset pairs explored in the IT-tree:

    Level 1: T × 1356, A × 1345, D × 2456, C × 123456, W × 12345
    Level 2: TA × 135, TW × 135, TC × 1356, DT × 56, DW × 245, DA × 45, DC × 2456, WC × 12345, AW × 1345
    Level 3: TAC × 135, TWC × 135, AWC × 1345, DWC × 245
    Level 4: TAWC × 135

    (DT × 56 and DA × 45 fall below the support threshold and are pruned.)

    Charm: Diffset for Fast Counting

    [Figure: the vertical data format, with itemsets as columns and tids as rows.]

    • Maintain a disk-based tidset for each item

    • Vertical

      • Easy to compute support: cardinality

      • Intersection is expensive when tidset is large

    • Diffset

      • Track the differences in tids of a child node from its parent

      • Save memory when tidsets are large and the differences are small

    [Figure: a child node Y of X stores the diffset d(XY) = t(X) − t(Y) instead of its full tidset.]
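
A tiny sketch using tidsets copied from the IT-tree example above:

    tA = {1, 3, 4, 5}            # t(A)
    tW = {1, 2, 3, 4, 5}         # t(W)
    dAW = tA - tW                # diffset d(AW) = t(A) - t(W), empty here
    print(len(tA) - len(dAW))    # sup(AW) = sup(A) - |d(AW)| = 4, i.e. AW x 1345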



    Association Rules

    AB …A implies B

    The easiest way to mine for association rules, is to first mine for frequent itemsets.



    Mining Association Rules

    • Quantity Problem

      • Many frequent itemsets, many rules

      • Redundant rules

    • Quality Problem

      • Not all rules are “interesting”



    Association Rules

    • Frequent itemset: AB

    • Derived rules: A ⇒ B and B ⇒ A

      Support(A ⇒ B) = P(A,B) = |AB| / N

      Confidence(A ⇒ B) = P(B|A) = P(A,B) / P(A) = |AB| / |A|

      |AB|: count of (AB)

      |A|: count of (A)

      |B|: count of (B)

      N: total number of records
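
Turning frequent itemsets into rules is then a matter of enumerating splits and checking confidence; a sketch with made-up supports, not any particular paper's algorithm.

    from itertools import combinations

    # For each frequent itemset F and each nonempty proper subset A,
    # emit A => F-A when support(F)/support(A) clears min_conf.
    def gen_rules(supports, min_conf):
        rules = []
        for F, supF in supports.items():
            for r in range(1, len(F)):
                for A in map(frozenset, combinations(F, r)):
                    conf = supF / supports[A]
                    if conf >= min_conf:
                        rules.append((set(A), set(F - A), conf))
        return rules

    sup = {frozenset('A'): 3, frozenset('B'): 4, frozenset('AB'): 3}
    print(gen_rules(sup, 0.7))   # A => B (conf 1.0) and B => A (conf 0.75)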



    Other Measures of Interestingness

    • play basketball eat cereal [40%, 66.7%] is misleading

      • The overall % of students eating cereal is 75% > 66.7%.

    • play basketball not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence

    • Measure of dependent/correlated events: lift
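
Lift works on exactly these numbers: lift(B ⇒ C) = P(B,C) / (P(B) · P(C)) = confidence / P(C) = 66.7% / 75% ≈ 0.89. A lift below 1 means the two events are negatively correlated, which is what the raw confidence of 66.7% hides.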



    Various Kinds of Association Rules

    • multi-level association

      Data categorized in a hierarchy

    • multi-dimensional association

      age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")

    • quantitative association

      Rules with numerical attributes

    • “interesting” correlation patterns

      E.g., some items (such as diamonds) may occur rarely but are valuable



    Constraint-based Mining

    • Finding all the patterns in a database autonomously is unrealistic!

      • The patterns could be too many but not focused!

    • Data mining should be an interactive process

      • User directs what to be mined using a data mining query language (or a graphical user interface)

    • Constraint-based mining

      • User flexibility: provides constraints on what to be mined

      • System optimization: the system exploits such constraints for efficient mining



    Constraints in Data Mining

    • Data constraint: expressed using SQL-like queries

      • find product pairs sold together in stores in Chicago in Dec.’02

    • Dimension/level constraint

      • in relevance to region, price, brand, customer category

    • Rule (or pattern) constraint

      • small sales (price < $10)

    • Interestingness constraint

      • strong rules: min_support ≥ 3%, min_confidence ≥ 60%



    Anti-Monotonicity in Constraint Pushing

    • Anti-monotonicity

      • When an itemset S violates the constraint, so does any of its supersets

      • sum(S.Price) ≤ v is anti-monotone

      • sum(S.Price) ≥ v is not anti-monotone

    • Example. C: range(S.profit) ≤ 15 is anti-monotone

      • Itemset ab violates C

      • So does every superset of ab
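
The pruning this enables is simple to state in code; the profit values below are illustrative, chosen so that ab violates the constraint as on the slide.

    # Anti-monotone check for C: range(S.profit) <= 15.
    def violates(S, profit, v=15):
        vals = [profit[i] for i in S]
        return max(vals) - min(vals) > v

    profit = {'a': 40, 'b': 0, 'c': -20}
    print(violates('ab', profit))    # True -> prune ab and every superset;
                                     # abc is never even counted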



    Monotonicity for Constraint Pushing

    • Monotonicity

      • When an itemset S satisfies the constraint, so does any of its supersets

      • sum(S.Price) ≥ v is monotone

      • min(S.Price) ≤ v is monotone

    • Example. C: range(S.profit) ≥ 15 is monotone

      • Itemset ab satisfies C

      • So does every superset of ab


    The Constrained Apriori Algorithm: Push an Anti-Monotone Constraint Deep

    [Figure: the usual Apriori flow over Database D (scan D to count C1 and keep L1, join to get C2, scan D for L2, join to get C3, scan D for L3), with the anti-monotone constraint sum(S.price) < 5 checked during candidate generation so that violating itemsets and all their supersets are dropped without counting.]



    Converting “Tough” Constraints

    • Convert tough constraints into anti-monotone or monotone by properly ordering items

    • Examine C: avg(S.profit) ≥ 25

      • Order items in value-descending order

        • <a, f, g, d, b, h, c, e>

      • If an itemset afb violates C

        • So do afbh and, in general, afb* (any extension of afb with items later in the order)

        • It becomes anti-monotone!
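
A sketch of the conversion, with profit values assumed so that they sort into the order <a, f, g, d, b, h, c, e> given above:

    profit = {'a': 40, 'f': 30, 'g': 20, 'd': 10,
              'b': 0, 'h': -10, 'c': -20, 'e': -30}
    order = sorted(profit, key=profit.get, reverse=True)   # a,f,g,d,b,h,c,e

    # Extending an itemset with items further down the order can only lower
    # its average, so C: avg(S.profit) >= 25 behaves anti-monotonically.
    def avg_ok(prefix, v=25):
        return sum(profit[i] for i in prefix) / len(prefix) >= v

    print(avg_ok('af'))    # True:  (40+30)/2 = 35
    print(avg_ok('afb'))   # False: (40+30+0)/3 = 23.3 -> prune all afb*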



    Constraint-Based Mining—A General Picture



    • Most slides are from Jiawei Han

