

Fast Algorithms for Mining Association Rules

Rakesh Agrawal

Ramakrishnan Srikant

Data Mining Seminar 2003

Outline
  • Introduction
  • Formal statement
  • Apriori Algorithm
  • AprioriTid Algorithm
  • Comparison
  • AprioriHybrid Algorithm
  • Conclusions
Introduction
  • Bar-code technology
  • Mining association rules over basket data (1993)
  • Tires ∧ accessories ⇒ automotive service
  • Cross-marketing, attached mailings
  • Very large databases
Notation
  • Items – I = {i1,i2,…,im}
  • Transaction – set of items
    • Items are sorted lexicographically
  • TID – unique identifier for each transaction
Notation
  • Association Rule – X ⇒ Y
Confidence and Support
  • The association rule X ⇒ Y has confidence c if

c% of the transactions in D that contain X also contain Y.

  • The association rule X ⇒ Y has support s if

s% of the transactions in D contain X ∪ Y.
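The two measures can be computed directly from the transactions; a minimal Python sketch (the basket data and names here are hypothetical, not from the paper):

```python
# Toy basket data (hypothetical): each transaction is a set of items.
D = [
    {"tires", "accessories", "service"},
    {"tires", "accessories"},
    {"tires", "service"},
    {"bread", "milk"},
]

def support(itemset, D):
    """Fraction of transactions in D that contain every item of `itemset`."""
    return sum(itemset <= t for t in D) / len(D)

def confidence(X, Y, D):
    """Fraction of transactions containing X that also contain Y."""
    return support(X | Y, D) / support(X, D)

# The rule {tires, accessories} => {service} has support 25% (1 of 4
# transactions contains all three items) and confidence 50% (1 of the 2
# transactions containing {tires, accessories} also contains service).
```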

Notice
  • X ⇒ A doesn't imply X ∪ Y ⇒ A
    • May not have minimum support
  • X ⇒ A and A ⇒ Z don't imply X ⇒ Z
    • May not have minimum confidence
Define the Problem

Given a set of transactions D, generate all association rules that have support and confidence greater than the user-specified minimum support and minimum confidence.

Previous Algorithms
  • AIS
  • SETM
  • Knowledge Discovery
  • Induction of Classification Rules
  • Discovery of causal rules
  • Fitting of functions to data
  • KID3 – machine learning
Discovering all Association Rules
  • Find all large itemsets
    • itemsets with support above the minimum support.
  • Use the large itemsets to generate the rules.
General idea
  • Say ABCD and AB are large itemsets
  • Compute

conf = support(ABCD) / support(AB)

  • If conf >= minconf

AB ⇒ CD holds.

Discovering Large Itemsets
  • Multiple passes over the data.
  • First pass – count the support of individual items.
  • Subsequent passes:
    • Generate candidates using the previous pass's large itemsets.
    • Go over the data and check the actual support of the candidates.
  • Stop when no new large itemsets are found.
The Trick

Any subset of a large itemset is large.

Therefore, to find the large k-itemsets:

  • Create candidates by combining large (k-1)-itemsets.
  • Delete those that contain any subset that is not large.
Algorithm Apriori
  • Count item occurrences.
  • Generate new candidate k-itemsets.
  • Find the support of all the candidates.
  • Take only those with support over minsup.
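The pass structure above can be sketched in plain Python. This is a simplified illustration, not the paper's implementation (no hash-tree, and `minsup` is taken as an absolute support count); all names are assumptions:

```python
from itertools import combinations

def apriori(D, minsup):
    """Simplified Apriori sketch: all itemsets with support count >= minsup."""
    # Pass 1: count item occurrences.
    counts = {}
    for t in D:
        for i in t:
            c = frozenset([i])
            counts[c] = counts.get(c, 0) + 1
    L = {c for c, n in counts.items() if n >= minsup}
    large, k = set(L), 2
    while L:
        # Generate new candidate k-itemsets from the large (k-1)-itemsets,
        # keeping only candidates whose (k-1)-subsets are all large.
        C = {p | q for p in L for q in L if len(p | q) == k}
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Find the support of all the candidates ...
        counts = {c: sum(c <= t for t in D) for c in C}
        # ... and take only those with support over minsup.
        L = {c for c, n in counts.items() if n >= minsup}
        large |= L
        k += 1
    return large

D = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
large = apriori(D, 3)
# Every singleton and pair is large; {1, 2, 3} (support 2) is not.
```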

Candidate Generation
  • Join step
  • Prune step

p and q are two large (k-1)-itemsets identical in their first k-2 items.

Join by adding the last item of q to p.

Check all the (k-1)-subsets; remove any candidate with a "small" subset.

Example

L3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }

After joining:

{ {1 2 3 4}, {1 3 4 5} }

After pruning:

{ {1 2 3 4} }

{1 3 4 5} is pruned because {1 4 5} and {3 4 5} are not in L3.
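The join-and-prune example can be checked mechanically; a small sketch of the candidate-generation procedure (function name and representation are assumptions):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join large (k-1)-itemsets that agree on their first k-2 items,
    then prune candidates having a (k-1)-subset outside L_prev."""
    prev = {tuple(sorted(s)) for s in L_prev}
    joined = {p + (q[-1],)                       # join step
              for p in prev for q in prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined                    # prune step
            if all(s in prev for s in combinations(c, k - 1))}

L3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
C4 = apriori_gen(L3, 4)
# C4 == {(1, 2, 3, 4)}: (1, 3, 4, 5) was joined but then pruned,
# because {1, 4, 5} and {3, 4, 5} are not in L3.
```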

Correctness

Show that:

  • Any subset of a large itemset must also be large.
  • The join is equivalent to extending Lk-1 with all items and removing those candidates whose (k-1)-subsets are not in Lk-1.
  • The join prevents duplications.

Subset Function
  • The candidate itemsets Ck are stored in a hash-tree.
  • It finds in O(k) time whether a candidate itemset of size k is contained in transaction t.
  • Total time: O(max(k, size(t))).
Problem?
  • Every pass goes over the whole data.
Algorithm AprioriTid
  • Reads the database only once.
  • Builds a storage set C^k
    • Members have the form <TID, {Xk}>
      • Xk are potentially large k-itemsets in the transaction with identifier TID.
      • For k = 1, C^1 is the database, with each item replaced by the 1-itemset containing it.
  • Uses C^k in pass k+1.

Advantage
  • C^k can be smaller than the database.
    • If a transaction contains no candidate k-itemsets, it is excluded from C^k.
  • For large k, each entry may be smaller than the corresponding transaction.
    • The transaction might contain only a few candidates.
Disadvantage
  • For small k, each entry may be larger than the corresponding transaction.
    • An entry includes all k-itemsets contained in the transaction.
Algorithm AprioriTid
  • Count item occurrences; the storage set C^1 is initialized with the database.
  • Generate new candidate k-itemsets.
  • Build a new storage set: determine the candidate itemsets contained in the transaction with identifier TID.
  • Find the support of all the candidates; remove empty entries.
  • Take only those with support over minsup.
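The steps above can be sketched as follows. This is a simplified illustration of the idea, not the paper's implementation: candidate membership in a transaction is tested via the (k-1)-itemsets stored for that transaction in the previous pass, and all names are assumptions:

```python
from itertools import combinations

def apriori_tid(D, minsup):
    """Sketch of AprioriTid: after pass 1 the algorithm never re-reads D;
    it works on the storage set C^k of <TID, set-of-itemsets> entries."""
    # Pass 1: C^1 is the database, each item viewed as a 1-itemset.
    Cbar = {tid: {frozenset([i]) for i in t} for tid, t in enumerate(D)}
    counts = {}
    for items in Cbar.values():
        for c in items:
            counts[c] = counts.get(c, 0) + 1
    L = {c for c, n in counts.items() if n >= minsup}
    large, k = set(L), 2
    while L:
        # Candidate generation (join + prune), exactly as in Apriori.
        C = {p | q for p in L for q in L if len(p | q) == k}
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        new_Cbar, counts = {}, {c: 0 for c in C}
        for tid, items in Cbar.items():
            # c is in the transaction iff all its (k-1)-subsets appear in
            # the entry built during the previous pass.
            Ct = {c for c in C
                  if all(frozenset(s) in items
                         for s in combinations(c, k - 1))}
            for c in Ct:
                counts[c] += 1
            if Ct:                      # remove empty entries from C^k
                new_Cbar[tid] = Ct
        Cbar = new_Cbar
        L = {c for c, n in counts.items() if n >= minsup}
        large |= L
        k += 1
    return large
```

On the same data, this returns the same large itemsets as the Apriori sketch, while shrinking the storage set from pass to pass.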

[Figure: pass-by-pass flow — Database = C^1 → L1 → C2 → C^2 → L2 → C3 → C^3 → L3]

Correctness
  • Show that the set Ct generated in the kth pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID.
Correctness

C^k is correct: for each entry t of C^k, t.set-of-itemsets does not include any k-itemset not contained in the transaction with identifier t.TID.

C^k is complete: for each entry t of C^k, t.set-of-itemsets includes all large k-itemsets contained in the transaction with identifier t.TID.

Lemma 1

For k > 1, if C^k-1 is correct and complete and Lk-1 is correct, then the set Ct generated at the kth pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID.

Proof

Suppose a candidate itemset c = c[1]c[2]…c[k] is in the transaction with identifier t.TID. Then c1 = (c - c[k]) and c2 = (c - c[k-1]) were in that transaction. Since Ck was built using apriori-gen(Lk-1), all (k-1)-subsets of c are large; in particular c1 and c2 are large. Because C^k-1 is complete, c1 and c2 were members of t.set-of-itemsets, and therefore c will be a member of Ct.

Proof (cont.)

Conversely, suppose c1 (or c2) is not in the transaction with identifier t.TID. Because C^k-1 is correct, c1 (c2) is not in t.set-of-itemsets, so c is not contained in the transaction and will not be a member of Ct.

Correctness

Lemma 2

For k > 1, if Lk-1 is correct and the set Ct generated in the kth pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID, then the set C^k is correct and complete.

Proof

Apriori-gen guarantees that Ct includes all large k-itemsets contained in the transaction with identifier t.TID, and these are added to C^k, so C^k is complete.

Ct includes only itemsets contained in that transaction, and only itemsets in Ct are added to C^k, so C^k is correct.

Correctness

Theorem 1

For k > 1, the set Ct generated in the kth pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID.

Show: C^k is correct and complete and Lk is correct for all k >= 1.

Proof (by induction on k)
  • k = 1: C^1 is the database, so it is correct and complete, and L1 is correct.
  • Assume the claim holds for k = n. By Lemma 1, the Ct generated in pass n+1 consists of exactly those itemsets in Cn+1 contained in the transaction with identifier t.TID. Since Ct is correct, Ln+1 is correct, and by Lemma 2, C^n+1 is correct and complete.
  • Hence C^k is correct and complete for all k >= 1, and the theorem holds.

General idea (reminder)
  • Say ABCD and AB are large itemsets
  • Compute

conf = support(ABCD) / support(AB)

  • If conf >= minconf
    • AB ⇒ CD holds.
Discovering Rules
  • For every large itemset l
    • Find all non-empty subsets of l.
    • For every subset a
      • Produce the rule a ⇒ (l - a)
      • Accept if support(l) / support(a) >= minconf
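The loop above, as a minimal sketch (the `support` map from frozensets to counts is assumed to have been precomputed by the large-itemset phase; names are illustrative):

```python
from itertools import combinations

def gen_rules(l, support, minconf):
    """Emit every rule a => (l - a) with support(l)/support(a) >= minconf."""
    l = frozenset(l)
    rules = []
    for r in range(1, len(l)):                  # all non-empty proper subsets
        for a in map(frozenset, combinations(sorted(l), r)):
            if support[l] / support[a] >= minconf:
                rules.append((a, l - a))
    return rules

# Hypothetical support counts: A appears 4 times, B 3 times, AB together 3.
support = {frozenset("A"): 4, frozenset("B"): 3, frozenset("AB"): 3}
rules = gen_rules("AB", support, 0.9)
# Only B => A holds (confidence 3/3 = 1.0); A => B has confidence 3/4 = 0.75.
```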
Checking the Subsets
  • For efficiency, generate subsets using a recursive DFS. If a subset a doesn't produce a rule, we don't need to check subsets of a.

Example

Given the itemset ABCD: if ABC ⇒ D doesn't have enough confidence, then surely AB ⇒ CD won't hold.

Why?

For any subset a^ of a: support(a^) >= support(a), so

confidence(a^ ⇒ (l - a^)) = support(l) / support(a^) <= support(l) / support(a) = confidence(a ⇒ (l - a)).

Simple Algorithm
  • Check all the large itemsets.
  • For each, check all the subsets via DFS.
  • Check the confidence of each new rule; if it holds, output the rule and continue the DFS over the subsets.
  • If the rule lacks confidence, the DFS branch is cut here.

Faster Algorithm

Idea: if (l - c) ⇒ c holds, then all the rules (l - c^) ⇒ c^ must hold, where c^ is a non-empty subset of c.

Example: if AB ⇒ CD holds, then so do ABC ⇒ D and ABD ⇒ C.

Faster Algorithm
  • From a large itemset l,
    • Generate all rules with one item in their consequent.
  • Use those consequents and apriori-gen to generate all possible 2-item consequents, and so on.
  • The candidate set of the faster algorithm is a subset of the candidate set of the simple algorithm.
Faster Algorithm
  • Find all 1-item consequents (using one pass of the simple algorithm).
  • Generate new (m+1)-item consequents from the m-item consequents that held.
  • Check the confidence of each new rule.
  • Continue with bigger consequents; if a consequent doesn't hold, don't look for bigger ones.
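A sketch of this consequent-growing scheme (hypothetical names and support counts; `support` maps frozensets to counts, and consequents are joined apriori-gen style):

```python
def ap_genrules(l, support, minconf):
    """Grow consequents only from those that already produced a confident rule."""
    l = frozenset(l)
    # Start with the 1-item consequents whose rules hold.
    H = [frozenset([i]) for i in l
         if support[l] / support[l - frozenset([i])] >= minconf]
    rules = [(l - h, h) for h in H]
    m = 1
    while H and m + 1 < len(l):
        # Join the m-item consequents that held into (m+1)-item candidates.
        Hnext = {h1 | h2 for h1 in H for h2 in H if len(h1 | h2) == m + 1}
        H = [h for h in Hnext if support[l] / support[l - h] >= minconf]
        rules += [(l - h, h) for h in H]
        m += 1
    return rules

# Hypothetical counts chosen so that exactly ACDE => B, ABCE => D and
# ACE => BD hold at minconf = 0.9 for the large itemset ABCDE.
support = {frozenset("ABCDE"): 3, frozenset("BCDE"): 5, frozenset("ACDE"): 3,
           frozenset("ABDE"): 5, frozenset("ABCE"): 3, frozenset("ABCD"): 5,
           frozenset("ACE"): 3}
rules = ap_genrules("ABCDE", support, 0.9)
```

Only the 2-item consequent BD is ever tried, because B and D were the only 1-item consequents that held.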

Advantage

Example

Large itemset: ABCDE

One-item consequents that hold: ACDE ⇒ B and ABCE ⇒ D.

The simple algorithm will check rules such as ABC ⇒ DE, ABE ⇒ CD, BCE ⇒ AD and ACE ⇒ BD.

The faster algorithm will check only ACE ⇒ BD, which is also the only rule that holds.

Example

[Figure: the rules explored from the large itemset ABCDE]

Simple algorithm: starting from ACDE ⇒ B and ABCE ⇒ D (the rules with minimum confidence), it also explores CDE ⇒ AB, BCE ⇒ AD, ABE ⇒ CD, ADE ⇒ BC, ACD ⇒ BE, ACE ⇒ BD, …

Fast algorithm: explores only ACDE ⇒ B, ABCE ⇒ D and ACE ⇒ BD.

Results
  • Compare the performance of Apriori and AprioriTid to each other and to the previously known algorithms:
    • AIS
    • SETM
  • The algorithms differ in the method of generating all large itemsets.
    • AIS and SETM both generate candidates "on the fly", while going over the data.
    • SETM was designed for use over SQL.

Method
  • Check the algorithms on the same databases
    • Synthetic data
    • Real data
Synthetic Data
  • Choose the parameters to be compared.
    • Transaction sizes, and large itemsets sizes are each clustered around a mean.
    • Parameters for data generation
      • D – Number of transactions
      • T – Average size of the transaction
      • I – Average size of the maximal potentially large itemsets
      • L – Number of maximal potentially large itemsets
      • N – Number of Items.
Synthetic Data
  • Experiment values:
    • N = 1000
    • L = 2000
  • Datasets:
    • T5.I2.D100k
    • T10.I2.D100k
    • T10.I4.D100k
    • T20.I2.D100k
    • T20.I4.D100k
    • T20.I6.D100k

(T5.I2.D100k means T = 5, I = 2, D = 100,000.)

  • SETM values are too big to fit in the graphs.
  • Apriori always beats AIS.
  • Apriori is better than AprioriTid on large problems.
Explaining the Results
  • AprioriTid uses C^k instead of the database. If C^k fits in memory, AprioriTid is faster than Apriori.
  • When C^k is too big to fit in memory, the computation time is much longer; thus Apriori is faster than AprioriTid.
Reality Check
  • Retail sales
    • 63 departments
    • 46873 transactions (avg. size 2.47)
  • Small database, C^k fits in memory.
Reality Check
  • Mail customer
    • 15836 items
    • 213972 transactions (avg. size 31)
  • Mail order
    • 15836 items
    • 2.9 million transactions (avg. size 2.62)

So who is better?
  • Look at the passes.
  • At the final passes, C^k is small enough to fit in memory.

Algorithm AprioriHybrid
  • Use Apriori in initial passes
  • Estimate the size of C^k
  • Switch to AprioriTid when C^k is expected to fit in memory
  • The switch takes time, but it is still better in most cases.
Conclusions
  • The Apriori algorithms are better than the previous algorithms.
    • For small problems, by factors.
    • For large problems, by orders of magnitude.
  • The algorithms are best combined.
  • The algorithm shows good results in scale-up experiments.
Summary
  • Association rules are an important tool in analyzing databases.
  • We've seen an algorithm that finds all association rules in a database.
  • The algorithm has better time results than previous algorithms.
  • The algorithm maintains its performance for large databases.