
Association Rules

presented by

Zbigniew W. Ras *), #)

*) University of North Carolina – Charlotte

#) ICS, Polish Academy of Sciences



Market Basket Analysis (MBA)

  • Analyzes customer buying habits by finding associations and correlations between the different items that customers place in their “shopping baskets”

Customer 1: milk, eggs, sugar, bread

Customer 2: milk, eggs, cereal, bread

Customer 3: eggs, sugar



Market Basket Analysis

  • Given: a database of customer transactions, where each transaction is a set of items

    Find: groups of items which are frequently purchased together



Goal of MBA

  • Extract information on purchasing behavior

  • Actionable information: can suggest

    • new store layouts

    • new product assortments

    • which products to put on promotion

MBA is applicable whenever a customer purchases multiple items in proximity



Association Rules

  • Express how products/services relate to each other and tend to group together

  • “If a customer purchases three-way calling, then they will also purchase call-waiting”

  • Simple to understand

  • Actionable information: bundle three-way calling and call-waiting in a single package



Basic Concepts

  • Transactions can be stored in two formats:

        Relational format        Compact format
        <Tid, item>              <Tid, itemset>
        <1, item1>               <1, {item1, item2}>
        <1, item2>               <2, {item3}>
        <2, item3>

  • Item: a single element; Itemset: a set of items

  • Support of an itemset I, denoted sup(I): the number of transactions that contain I

  • Threshold for minimum support: λ

  • Itemset I is frequent if sup(I) ≥ λ

  • A frequent itemset represents a set of items which are positively correlated
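
A minimal Python sketch of these definitions; the three baskets from the MBA slide serve as the transaction database, and λ is spelled lam:

```python
def sup(itemset, transactions):
    """Support of an itemset: number of transactions containing all its items."""
    return sum(1 for t in transactions if set(itemset) <= set(t))

transactions = [
    {"milk", "eggs", "sugar", "bread"},   # Customer 1
    {"milk", "eggs", "cereal", "bread"},  # Customer 2
    {"eggs", "sugar"},                    # Customer 3
]

lam = 2  # minimum-support threshold
print(sup({"eggs"}, transactions) >= lam)            # True: sup = 3
print(sup({"milk", "bread"}, transactions) >= lam)   # True: sup = 2
print(sup({"sugar", "bread"}, transactions) >= lam)  # False: sup = 1
```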


Frequent Itemsets

sup({dairy}) = 3

sup({fruit}) = 3

sup({dairy, fruit}) = 2

If λ = 3, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.


Association Rules: AR(s, c)

  • {A, B} – a partition of a set of items

  • r = [A → B]

    Support of r: sup(r) = sup(A ∪ B)

    Confidence of r: conf(r) = sup(A ∪ B) / sup(A)

  • Thresholds:

    • minimum support – s

    • minimum confidence – c

  • r ∈ AR(s, c) if sup(r) ≥ s and conf(r) ≥ c


Association Rules – Example

Min. support – 2 [50%]

Min. confidence – 50%

  • For the rule A → C:

    • sup(A → C) = sup({A, C}) = 2

    • conf(A → C) = sup({A, C}) / sup({A}) = 2/3

  • The Apriori principle:

    • Any subset of a frequent itemset must be frequent


The Apriori Algorithm [Agrawal]

  • Fk : set of frequent itemsets of size k

  • Ck : set of candidate itemsets of size k

    F1 := {frequent items}; k := 1;
    while card(Fk) ≥ 1 do begin
        Ck+1 := new candidates generated from Fk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Fk+1 := candidates in Ck+1 with minimum support;
        k := k + 1
    end
    Answer := ∪ { Fk : k ≥ 1 and card(Fk) ≥ 1 }
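
The pseudocode translates almost line for line into Python. The sketch below is a plain re-implementation for illustration, not Agrawal's optimized version: candidates are generated by joining frequent (k−1)-itemsets and pruned with the Apriori principle.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frequent itemset: support} for all sizes k."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # F1: frequent 1-itemsets
    Fk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_sup}
    frequent = {}
    while Fk:
        for s in Fk:
            frequent[s] = sum(s <= t for t in transactions)
        k = len(next(iter(Fk))) + 1
        # join step: unions of frequent (k-1)-itemsets that have size k
        Ck = {a | b for a in Fk for b in Fk if len(a | b) == k}
        # prune step (Apriori principle): every (k-1)-subset must be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        # count the surviving candidates and keep those with minimum support
        Fk = {c for c in Ck if sum(c <= t for t in transactions) >= min_sup}
    return frequent
```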


Apriori – Example

[Figure: the itemset lattice over {a, b, c, d} – all 1-, 2-, 3-itemsets and the 4-itemset {a, b, c, d}.]

{a, d} is not frequent, so the 3-itemsets {a, b, d} and {a, c, d} and the 4-itemset {a, b, c, d} are not generated.


Algorithm Apriori: Illustration

  • The task of mining association rules is mainly to discover strong association rules (high confidence and strong support) in large databases.

  • Mining association rules is composed of two steps:

    1. Discover the large itemsets, i.e., the itemsets whose transaction support is above a predetermined minimum support s.

    2. Use the large itemsets to generate the association rules.

Example (MinSup = 2):

TID    Items
1000   A, B, C
2000   A, C
3000   A, D
4000   B, E, F

Large itemsets and their supports:

Itemset   Support
{A}       3
{B}       2
{C}       2
{A, C}    2
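
Step 2 is straightforward once the large itemsets and their supports are known: for every large itemset I and every nonempty proper subset A, emit A → (I \ A) when its confidence clears the threshold. A minimal sketch building on the apriori function above:

```python
from itertools import combinations

def gen_rules(frequent, min_conf):
    """Generate rules A -> B from a {frequent itemset: support} dict."""
    rules = []
    for itemset, sup_i in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                # every subset of a frequent itemset is itself frequent,
                # so its support is already in the dict
                conf = sup_i / frequent[lhs]  # conf = sup(A u B) / sup(A)
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, sup_i, conf))
    return rules
```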


Algorithm Apriori: Illustration

Database D (minimum support s = 2):

TID   Items
100   A, C, D
200   B, C, E
300   A, B, C, E
400   B, E

Scan D → C1:

Itemset   Count
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

F1 (count ≥ 2):

Itemset   Count
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from F1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

Scan D → C2 counts:

Itemset   Count
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

F2:

Itemset   Count
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (generated from F2): {B, C, E}

Scan D → C3 count: {B, C, E} – 2

F3:

Itemset     Count
{B, C, E}   2
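
As a sanity check (not part of the original slides), running the two sketches above on database D reproduces this trace:

```python
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(D, min_sup=2)
print(freq[frozenset({"B", "C", "E"})])  # 2, matching F3
print(frozenset({"D"}) in freq)          # False: sup({D}) = 1 < 2
for lhs, rhs, sup, conf in gen_rules(freq, min_conf=0.5):
    print(set(lhs), "->", set(rhs), f"[sup={sup}, conf={conf:.2f}]")
```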


Representative Association Rules

  • Definition 1.

    The cover C of a rule X → Y, denoted C(X → Y), is defined as follows:

    C(X → Y) = { [X ∪ Z] → V : Z, V are disjoint subsets of Y }.

  • Definition 2.

    The set RR(s, c) of representative association rules is defined as follows:

    RR(s, c) = { r ∈ AR(s, c) : ¬∃ r′ ∈ AR(s, c) [ r′ ≠ r ∧ r ∈ C(r′) ] }

    s – threshold for minimum support

    c – threshold for minimum confidence

  • Representative rules (informal description):

    [as short as possible] → [as long as possible]
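
A small sketch enumerating the cover of a rule per Definition 1; it assumes the consequent V must be nonempty for the pair to form a rule:

```python
from itertools import combinations

def powerset(s):
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1)
            for c in combinations(s, r)]

def cover(X, Y):
    """C(X -> Y): all rules [X u Z] -> V with Z, V disjoint subsets of Y."""
    X, Y = frozenset(X), frozenset(Y)
    return [(X | Z, V)
            for Z in powerset(Y)
            for V in powerset(Y - Z)  # V disjoint from Z
            if V]                     # assumed: nonempty consequent

# e.g. ({'H','B'}, {'C','D'}) is in cover({'H'}, {'B','C','D','E','I'})
```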


Representative Association Rules

Transactions:

{A,B,C,D,E}

{A,B,C,D,E,F}

{A,B,C,D,E,H,I}

{A,B,E}

{B,C,D,E,H,I}

Find RR(2, 80%).

Representative rules:

From (B, C, D, E, H, I):

{H} → {B, C, D, E, I}

{I} → {B, C, D, E, H}

From (A, B, C, D, E):

{A, C} → {B, D, E}

{A, D} → {B, C, E}

Last set: (BCDEHI, 2)


Frequent Pattern (FP) Growth Strategy

Minimum Support = 2

Transactions:
abcde, abc, acde, bcde, bc, bde, cde

Frequent items (by descending support):
c – 6, b – 5, d – 5, e – 5, a – 3

Transactions ordered by that item order:
cbdea, cba, cdea, cbde, cb, bde, cde

[Figure: the FP-tree built by inserting the ordered transactions along shared prefixes.]
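
The first phase of FP-growth (count supports, drop infrequent items, reorder every transaction by descending global support) is easy to sketch in Python; building the tree itself is omitted here:

```python
from collections import Counter

def order_transactions(transactions, min_sup):
    """Keep frequent items and sort each transaction by descending support."""
    counts = Counter(i for t in transactions for i in t)
    order = sorted((i for i, c in counts.items() if c >= min_sup),
                   key=lambda i: (-counts[i], i))
    rank = {i: k for k, i in enumerate(order)}
    return [sorted((i for i in t if i in rank), key=rank.get)
            for t in transactions]

print(order_transactions(
    ["abcde", "abc", "acde", "bcde", "bc", "bde", "cde"], min_sup=2))
# [['c','b','d','e','a'], ['c','b','a'], ['c','d','e','a'], ...]
```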


Frequent Pattern (FP) Growth Strategy

Mining the FP-tree for frequent itemsets:

  • Start from each item and construct a sub-database of transactions (prefix paths) with that item listed at the end.

  • Reorder the prefix paths in descending order of support and build a conditional FP-tree.

Example for a (support 3). Prefix paths:

(c b d e a, 1)
(c b a, 1)
(c d e a, 1)

Item counts within a's prefix paths, in support-descending order:
c – 3, b – 2, d – 2, e – 2


Frequent Pattern (FP) Growth Strategy

For a (support 3), the prefix paths are:

(c b d e a, 1)
(c b a, 1)
(c d e a, 1)

Frequent itemsets ending in a:

(c a, 3)
(c b a, 2)
(c d a, 2)
(c d e a, 2)
(c e a, 2)


Multidimensional AR

Associations between values of different attributes:

RULES:

[nationality = French] → [income = high] [50%, 100%]

[income = high] → [nationality = French] [50%, 75%]

[age = 50] → [nationality = Italian] [33%, 100%]


Single-dimensional AR vs. Multi-dimensional AR

Multi-dimensional             Single-dimensional
<1, Italian, 50, low>         <1, {nat/Ita, age/50, inc/low}>
<2, French, 45, high>         <2, {nat/Fre, age/45, inc/high}>

Schema: <ID, a?, b?, c?, d?>
<1, yes, yes, no, no>         <1, {a, b}>
<2, yes, no, yes, no>         <2, {a, c}>
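
The reduction is mechanical: encode each (attribute, value) pair as a single item. A sketch, with the made-up attribute names from the slide:

```python
def to_transaction(record, attributes):
    """Encode a relational tuple as a set of 'attribute/value' items."""
    return {f"{a}/{v}" for a, v in zip(attributes, record)}

attrs = ["nat", "age", "inc"]
print(to_transaction(("Ita", 50, "low"), attrs))
# {'nat/Ita', 'age/50', 'inc/low'} (set order may vary); any
# single-dimensional miner, e.g. the apriori sketch above, now applies.
```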


Quantitative Attributes

  • Quantitative attributes (e.g., age, income)

  • Categorical attributes (e.g., color of car)

Problem: quantitative attributes have too many distinct values.

Solution: transform quantitative attributes into categorical ones via discretization.


Discretization of Quantitative Attributes

  • Quantitative attributes are statically discretized using predefined concept hierarchies:

    • an elementary use of background knowledge

    • loose interaction between Apriori and the discretizer

  • Quantitative attributes are dynamically discretized:

    • into “bins” based on the distribution of the data, or

    • by considering the distance between data points

    • tighter interaction between Apriori and the discretizer
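
Both flavors can be illustrated with pandas; bin edges, labels, and data are invented for the example:

```python
import pandas as pd

ages = pd.Series([18, 23, 29, 34, 45, 52, 61, 70, 78])

# static: predefined concept hierarchy, i.e. fixed bin edges
static = pd.cut(ages, bins=[0, 29, 60, 120],
                labels=["young", "middle-aged", "old"])

# dynamic: equal-frequency bins derived from the data distribution
dynamic = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "static": static, "dynamic": dynamic}))
```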


Constraint-based AR

  • Preprocessing: use constraints to focus on a subset of transactions

    • Example: find association rules where the prices of all items are at most 200 Euro

  • Optimizations: use constraints to optimize Apriori algorithm

    • Anti-monotonicity: when a set violates the constraint, so does any of its supersets.

    • Apriori algorithm uses this property for pruning

  • Push constraints as deep as possible inside the frequent set computation


Apriori Property Revisited

  • Anti-monotonicity: If a set S violates the constraint, any superset of S violates the constraint.

  • Examples:

    • [Price(S) ≤ v] is anti-monotone

    • [Price(S) ≥ v] is not anti-monotone

    • [Price(S) = v] is partly anti-monotone

  • Application:

    • Push [Price(S) ≤ 1000] deeply into the iterative frequent-set computation.
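
A sketch of what "pushing" the constraint means inside the candidate loop, assuming Price(S) is the sum of (hypothetical) item prices:

```python
price = {"A": 300, "B": 900, "C": 80}  # hypothetical item prices

def within_budget(S, v=1000):
    """Anti-monotone constraint Price(S) <= v."""
    return sum(price[i] for i in S) <= v

# Inside Apriori: once Price(S) > v, every superset of S also violates
# the constraint, so S can be dropped *before* counting its support.
C2 = [frozenset("AC"), frozenset("AB"), frozenset("BC")]
C2 = [c for c in C2 if within_budget(c)]
print(C2)  # {A,B} (price 1200) is pruned; {A,C} and {B,C} survive
```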


Mining Association Rules with Constraints

  • Post-processing

    • A naive solution: apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one.

  • Optimization

    • Han's approach: comprehensively analyze the properties of the constraints and push them as deeply as possible inside the frequent-set computation.


Multilevel AR

  • It is difficult to find interesting patterns at too primitive a level:

    • high support = too few rules

    • low support = too many rules, most of them uninteresting

  • Approach: reason at suitable level of abstraction

  • A common form of background knowledge is that an attribute may be generalized or specialized according to a hierarchy of concepts

  • Dimensions and levels can be efficiently encoded in transactions

  • Multilevel Association Rules: rules which combine associations with hierarchy of concepts



Multilevel AR

[Figure: concept hierarchy with supports – Fresh (20%) generalizes Dairy (6%), Fruit (1%), and Vegetable (7%).]

  • Fresh → Bakery [20%, 60%]

  • Dairy → Bread [6%, 50%]

  • Fruit → Bread [1%, 50%] is not valid


Support and Confidence of Multilevel Association Rules

  • Generalizing/specializing values of attributes affects support and confidence

  • from specialized to general: support of rules increases (new rules may become valid)

  • from general to specialized: support of rules decreases (rules may become not valid; their support falls under the threshold)


Mining Multilevel AR

Hierarchical attributes: age, salary

[Figure: concept hierarchies – age: 18…29 = young, 30…60 = middle-aged, 61…80 = old; salary: 10k…40k = low, 50k…70k = medium, 80k…100k = high.]

Association rule: (age, young) → (salary, 40k)

Candidate association rules:

(age, 18) → (salary, 40k)

(age, young) → (salary, low)

(age, 18) → (salary, low)


Mining Multilevel AR

  • Calculate frequent itemsets at each concept level, until no more frequent itemsets can be found

  • For each level, use Apriori

  • A top-down, progressive-deepening approach:

    • First find high-level strong rules: fresh → bakery [20%, 60%].

    • Then find their lower-level “weaker” rules: fruit → bread [6%, 50%].

  • Variations of mining multiple-level association rules:

    • Level-crossed association rules: fruit → wheat bread


Multi-level Association: Uniform Support vs. Reduced Support

  • Uniform Support: the same minimum support for all levels

    • One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.

    • If the support threshold is

      • too high → miss low-level associations

      • too low → generate too many high-level associations

  • Reduced Support: reduced minimum support at lower levels - different strategies possible.


Beyond Support and Confidence

  • Example 1 (Aggarwal & Yu):

  • {tea} => {coffee} has high support (20%) and confidence (80%)

  • However, the a priori probability that a customer buys coffee is 90%

    • A customer who is known to buy tea is less likely to buy coffee (by 10%)

    • There is a negative correlation between buying tea and buying coffee

    • {~tea} => {coffee} has higher confidence (93%)


Correlation and Interest

  • Two events A and B are independent if P(A ∧ B) = P(A) · P(B); otherwise they are correlated.

  • Interest = P(A ∧ B) / (P(A) · P(B))

  • Interest expresses a measure of correlation. If it is:

    • equal to 1, A and B are independent events

    • less than 1, A and B are negatively correlated

    • greater than 1, A and B are positively correlated

  • In our example, P(drink tea) = 0.20 / 0.80 = 0.25, so

    I(drink tea ∧ drink coffee) = 0.20 / (0.25 · 0.90) ≈ 0.89, i.e., they are negatively correlated.
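
The arithmetic behind these numbers, as a quick check; P(tea) follows from the rule's support (20%) and confidence (80%):

```python
p_tea_and_coffee = 0.20          # support of {tea} => {coffee}
p_tea = p_tea_and_coffee / 0.80  # confidence 80%  =>  P(tea) = 0.25
p_coffee = 0.90

interest = p_tea_and_coffee / (p_tea * p_coffee)
print(round(interest, 2))  # 0.89 < 1  ->  negative correlation

# confidence of {~tea} => {coffee}
print(round((p_coffee - p_tea_and_coffee) / (1 - p_tea), 2))  # 0.93
```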


Domain-dependent Measures

  • Together with support, confidence, interest, …, also use (in post-processing) domain-dependent measures, e.g., constraints on the rules

  • Example: keep only rules that are significant with respect to their economic value:

    sup(LHS) + sup(RHS) > 100


A Brief History of AR Mining Research

  • Apriori (Agrawal et al., SIGMOD '93)

  • Optimizations of Apriori:

    • Fast algorithm (Agrawal et al.)

    • Representative rules (Kryszkiewicz, Agrawal)

    • Direct itemset counting (Brin et al.)

  • Problem extensions:

    • Generalized AR (Srikant et al.; Han et al.)

    • Quantitative AR (Srikant et al.)

    • N-dimensional AR (Lu et al.)

    • Temporal AR (Ozden et al.)

  • Parallel mining (Agrawal et al.)

  • Distributed mining (Cheung et al.)


Questions?

Thank You

