Data Mining Association Rules

Jen-Ting Tony Hsiao

Alexandros Ntoulas

CS 240B

May 21, 2002

Outline
  • Overview of data mining association rules
  • Algorithm Apriori
  • FP-Trees
Data Mining Association Rules
  • Data mining is motivated by the decision support problem faced by most large retail organizations
  • Bar-code technology made it possible for retail organizations to collect and store massive amounts of sales data, referred to as the basket data
  • Successful organizations view such databases as important pieces of the marketing infrastructure
  • Organizations are interested in instituting information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies
Data Mining Association Rules
  • For example, 98% of customers that purchase tires and auto accessories also get automotive services done
  • Finding all such rules is valuable for cross-marketing and attached mailing applications
  • Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns
  • Databases involved in these applications are very large, so it is imperative to have fast algorithms for this task
Formal Definition of Problem
  • Formal statement of the problem [AIS93]:
    • Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I.
    • A transaction T contains X, a set of some items in I, if X ⊆ T.
    • An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = Ø.
    • X ⇒ Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y.
    • X ⇒ Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y.
Example of Confidence and Support
  • Given the following table of transactions:
    • 1 ⇒ 2 has 100% confidence, with 100% support
    • 3 ⇒ 4 has 100% confidence, with 33% support
    • 2 ⇒ 3 has 33% confidence, with 33% support
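The slide's transaction table did not survive the transcript, so the sketch below uses a hypothetical three-transaction database that is consistent with the three rules' stated numbers, and computes support and confidence exactly as defined on the previous slide:

```python
# Hypothetical 3-transaction database (assumed; the slide's own table is
# missing from the transcript), chosen to match the stated percentages.
transactions = [{1, 2}, {1, 2}, {1, 2, 3, 4}]

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """Fraction of transactions containing x that also contain y."""
    return support(x | y, db) / support(x, db)

print(confidence({1}, {2}, transactions))  # 1.0   (100% confidence)
print(support({3, 4}, transactions))       # 0.333... (33% support)
print(confidence({2}, {3}, transactions))  # 0.333... (33% confidence)
```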
Problem Decomposition
  • Find all sets of items (itemsets) that have transaction support above minimum support. The support for an itemset is the number of transactions that contain the itemset. Itemsets with minimum support are called large itemsets, and all others small itemsets.
  • Use the large itemsets to generate the desired rules. If ABCD and AB are large itemsets, then we can determine whether the rule AB ⇒ CD holds by computing the ratio conf = support(ABCD)/support(AB). If conf ≥ minconf, then the rule holds. (The rule will surely have minimum support because ABCD is large.)
Discovering Large Itemsets
  • Algorithms for discovering large itemsets make multiple passes over the data
  • In the first pass, we count the support of individual items and determine which of them are large (with minimum support)
  • In each subsequent pass, we start with a seed set of itemsets found to be large in the previous pass, then use this seed set for generating new potentially large itemsets, called candidate itemsets, and count the actual support for these candidate itemsets during the pass over the data
  • At the end of the pass, we determine which of the candidate itemsets are actually large, and they become the seed for the next pass.
  • This process continues until no new large itemsets are found
Notations
  • The number of items in an itemset is its size; we call an itemset of size k a k-itemset
  • Items within an itemset are kept in lexicographic order
Algorithm Apriori
  • L1 = {large 1-itemsets};
  • for ( k = 2; Lk–1 ≠ Ø; k++ ) do begin
  •   Ck = apriori-gen(Lk–1); // New candidates
  •   forall transactions t ∈ D do begin
  •     Ct = subset(Ck, t); // Candidates contained in t
  •     forall candidates c ∈ Ct do
  •       c.count++;
  •   end
  •   Lk = {c ∈ Ck | c.count ≥ minsup}
  • end
  • Answer = ∪k Lk;
Apriori Candidate Generation
  • The apriori-gen function takes as argument Lk–1, the set of all large (k – 1)-itemsets. It returns a superset of the set of all large k-itemsets.
  • First, in the join step, we join Lk–1 with Lk–1:

insert into Ck
select p.item1, p.item2, …, p.itemk–1, q.itemk–1
from Lk–1 p, Lk–1 q
where p.item1 = q.item1, …, p.itemk–2 = q.itemk–2, p.itemk–1 < q.itemk–1

  • Next, in the prune step, we delete all itemsets c ∈ Ck such that some (k – 1)-subset of c is not in Lk–1:

forall itemsets c ∈ Ck do
    forall (k – 1)-subsets s of c do
        if (s ∉ Lk–1) then
            delete c from Ck;

Apriori-gen Example
  • Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
  • In the join step:
    • {1 2 3} joins with {1 2 4} to produce {1 2 3 4}
    • {1 3 4} joins with {1 3 5} to produce {1 3 4 5}
    • After the join step, C4 will be {{1 2 3 4}, {1 3 4 5}}
  • In the prune step:
    • {1 2 3 4} is tested for the existence of its 3-item subsets {1 2 3}, {1 2 4}, {1 3 4}, and {2 3 4} in L3; all of them are present
    • {1 3 4 5} is tested for {1 3 4}, {1 3 5}, {1 4 5}, and {3 4 5}; since {1 4 5} is not in L3, this set is deleted
  • We are then left with only {1 2 3 4} in C4
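The join and prune steps can be sketched in Python and checked against this very example. One simplification in this sketch: the join unites any two large (k – 1)-itemsets whose union has k items, which, after the prune, produces the same candidate set as the lexicographic prefix join described above:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join + prune, where L_prev is a set of frozensets of size k-1."""
    # Join: unite pairs sharing k-2 items (union has exactly k items).
    candidates = {a | b for a, b in combinations(L_prev, 2) if len(a | b) == k}
    # Prune: drop c if any of its (k-1)-subsets is not in L_prev.
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

L3 = {frozenset(s) for s in [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]}
print(apriori_gen(L3, 4))   # only {1, 2, 3, 4} survives the prune
```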
Discovering Rules
  • For every large itemset l, we find all non-empty subsets of l
  • For every such subset a, we output a rule of the form a ⇒ (l – a) if the ratio of support(l) to support(a) is at least minconf
  • We consider all subsets of l to generate rules with multiple consequents
Simple Algorithm
  • We can improve the preceding procedure by generating the subsets of a large itemset in a recursive depth-first fashion
  • For example, given an itemset ABCD, we first consider the subset ABC, then AB, etc.
  • Then, if a subset a of a large itemset l does not generate a rule, the subsets of a need not be considered for generating rules using l
  • For example, if ABC ⇒ D does not have enough confidence, we need not check whether AB ⇒ CD holds
Simple Algorithm
  • forall large itemsets lk, k ≥ 2 do
  •   call genrules(lk, lk)
  • Procedure genrules(lk: large k-itemset, am: large m-itemset)
  •   A = {(m – 1)-itemsets am–1 | am–1 ⊂ am};
  •   forall am–1 ∈ A do begin
  •     conf = support(lk)/support(am–1);
  •     if (conf ≥ minconf) then begin
  •       output the rule am–1 ⇒ (lk – am–1),
  •         with confidence = conf and support = support(lk);
  •       if (m – 1 > 1) then
  •         call genrules(lk, am–1); // to generate rules with subsets of am–1 as the antecedents
  •     end
  •   end
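The recursion above can be sketched in Python on a small assumed database. Two illustrative choices in this sketch: support is taken as an absolute count computed from the toy data, and a `seen` set avoids re-visiting an antecedent reachable along several recursion paths (it does not change which rules are produced):

```python
from itertools import combinations

# Toy database (assumed, for illustration only).
db = [set("ABCD"), set("ABCD"), set("ABC"), set("AB"), set("A")]

def support(s):
    return sum(s <= t for t in db)

def genrules(lk, am, minconf, rules, seen=None):
    """Depth-first rule generation; recursion continues only below
    antecedents whose rule reached minconf, as in the pseudocode."""
    seen = set() if seen is None else seen
    for a in map(frozenset, combinations(am, len(am) - 1)):
        if a in seen:
            continue
        seen.add(a)
        conf = support(lk) / support(a)
        if conf >= minconf:
            rules.append((a, lk - a, conf))   # rule a => (lk - a)
            if len(a) > 1:
                genrules(lk, a, minconf, rules, seen)
    return rules

rules = genrules(frozenset("ABCD"), frozenset("ABCD"), 0.6, [])
```

On this data, AB ⇒ CD fails (confidence 2/4 = 0.5), so the subsets of AB are never expanded through it.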
Faster Algorithm
  • Earlier, we showed that if a rule does not hold, then no rule with a smaller antecedent (over the same large itemset) holds either
  • For example, if ABC ⇒ D does not have enough confidence, AB ⇒ CD will not hold either
  • This also works in the reverse direction: if a rule holds, then every rule with a larger antecedent (over the same itemset) also holds
  • For example, if AB ⇒ CD has enough confidence, ABC ⇒ D will also hold
Faster Algorithm
  • forall large itemsets lk, k ≥ 2 do begin
  •   H1 = { consequents of rules derived from lk with one item in the consequent };
  •   call ap-genrules(lk, H1)
  • end
  • Procedure ap-genrules(lk: large k-itemset, Hm: set of m-item consequents)
  •   if (k > m + 1) then begin
  •     Hm+1 = apriori-gen(Hm);
  •     forall hm+1 ∈ Hm+1 do begin
  •       conf = support(lk)/support(lk – hm+1);
  •       if (conf ≥ minconf) then
  •         output the rule (lk – hm+1) ⇒ hm+1 with confidence = conf and support = support(lk);
  •       else
  •         delete hm+1 from Hm+1
  •     end
  •     call ap-genrules(lk, Hm+1);
  •   end
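A Python sketch of ap-genrules, reusing the same assumed toy database: consequents are grown level-wise with apriori-gen, and only consequents whose rule reached minconf are carried into the next level. Names (`consequent_gen`, `keep`) are illustrative:

```python
from itertools import combinations

# Same toy database as before (assumed, for illustration only).
db = [set("ABCD"), set("ABCD"), set("ABC"), set("AB"), set("A")]

def support(s):
    return sum(s <= t for t in db)

def consequent_gen(H, m):
    """apriori-gen applied to the set of m-item consequents."""
    cand = {a | b for a, b in combinations(H, 2) if len(a | b) == m + 1}
    return {c for c in cand
            if all(frozenset(s) in H for s in combinations(c, m))}

def ap_genrules(lk, Hm, minconf, rules):
    """Grow consequents level-wise; keep only those reaching minconf."""
    if not Hm:
        return rules
    m = len(next(iter(Hm)))
    if len(lk) <= m + 1:
        return rules
    keep = set()
    for h in consequent_gen(Hm, m):
        conf = support(lk) / support(lk - h)
        if conf >= minconf:
            rules.append((lk - h, h, conf))   # rule (lk - h) => h
            keep.add(h)
    return ap_genrules(lk, keep, minconf, rules)

lk, minconf = frozenset("ABCD"), 0.6
# Seed: valid one-item consequents and their rules.
H1 = {frozenset({i}) for i in lk
      if support(lk) / support(lk - frozenset({i})) >= minconf}
rules = [(lk - h, h, support(lk) / support(lk - h)) for h in H1]
ap_genrules(lk, H1, minconf, rules)
```

On this data it finds the same eleven rules as the simple algorithm, while testing fewer candidate antecedents.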
Faster Algorithm
  • For example, consider a large itemset ABCDE. Assume that ACDE ⇒ B and ABCE ⇒ D are the only one-item-consequent rules with minimum confidence.
  • In the simple algorithm, the recursive call genrules(ABCDE, ACDE) will test whether the two-item-consequent rules ACD ⇒ BE, ADE ⇒ BC, CDE ⇒ AB, and ACE ⇒ BD hold.
  • The first of these rules cannot hold, because E ∈ BE and ABCD ⇒ E does not have minimum confidence. The second and third rules cannot hold for similar reasons.
  • The only two-item-consequent rule that can possibly hold is ACE ⇒ BD, where B and D are the consequents of the valid one-item-consequent rules. This is the only rule the faster algorithm will test.
Mining Frequent Itemsets
  • If we use candidate generation
    • Need to generate a huge number of candidate sets.
    • Need to repeatedly scan the database and check a large set of candidates by pattern matching.
  • Can we avoid that?
    • FP-Trees (Frequent Pattern Trees)
FP-Tree
  • General Idea
    • Divide and Conquer
  • Outline of the procedure
    • For each item create its conditional pattern base
    • Then expand it and create its conditional FP-Tree
    • Repeat the process recursively on each constructed FP-Tree
    • Stop when the FP-Tree is either empty or contains only a single path
Example

TID  Items bought
100  {f, a, c, d, g, i, m, p}
200  {a, b, c, f, l, m, o}
300  {b, f, h, j, o}
400  {b, c, k, s, p}
500  {a, f, c, e, l, p, m, n}

  • For minimum support = 50% (i.e., 3 of the 5 transactions)
  • Step 1: Scan the DB to find the frequent 1-itemsets
  • Step 2: Order them by descending frequency
  • Step 3: Scan the DB again and construct the FP-Tree

Item  frequency
f     4
c     4
a     3
b     3
m     3
p     3
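The three steps can be sketched as follows. `Node` and `header` are illustrative names, and ties in the frequency order (f/c each appear 4 times, a/b/m/p each 3) are broken to match the slide's order f, c, a, b, m, p:

```python
from collections import Counter
from itertools import chain

class Node:
    """One FP-Tree node: an item, its count, a parent, and children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 1, parent, {}

# Transactions from the slide; minimum support 50% of 5 transactions = 3.
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
minsup = 3

# Step 1: frequent 1-itemsets (yields f:4, c:4, a:3, b:3, m:3, p:3).
freq = {i: c for i, c in Counter(chain(*db)).items() if c >= minsup}
# Step 2: frequency-descending order, ties broken as on the slide.
rank = {i: r for r, i in enumerate("fcabmp")}
# Step 3: insert each filtered, reordered transaction into a prefix tree;
# header keeps the node-links used later to traverse per-item paths.
root, header = Node(None, None), {i: [] for i in rank}
for t in db:
    node = root
    for i in sorted((x for x in t if x in rank), key=rank.get):
        child = node.children.get(i)
        if child:
            child.count += 1            # shared prefix: bump the count
        else:
            child = node.children[i] = Node(i, node)
            header[i].append(child)     # node-link for the header table
        node = child
```

After the loop, the root's f-child has count 4 and its c-child count 3, matching the tree on the next slide.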

Example – Conditional Pattern Base

FP-Tree built from the transactions above (indentation shows parent/child, node = item:count):

root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
  • Start from the frequent-item header table
  • Traverse the tree by following the node-links of each frequent item
  • Accumulate the item's prefix paths to form its conditional pattern base

Conditional pattern bases:

Item  Cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

Header Table:

Item  frequency
f     4
c     4
a     3
b     3
m     3
p     3
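An equivalent way to read the table above: the conditional pattern base of an item collects, for each frequency-ordered transaction containing it, the prefix that precedes it. Because each transaction contributes exactly one tree path, counting per transaction (as this sketch does) gives the same bases as summing node counts along the node-links:

```python
from collections import Counter
from itertools import chain

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
freq = {i: c for i, c in Counter(chain(*db)).items() if c >= 3}
rank = {i: r for r, i in enumerate("fcabmp")}   # item order from the slide

def cond_pattern_base(item):
    """Prefix paths that end just above `item`, with their counts."""
    base = Counter()
    for t in db:
        path = sorted((x for x in t if x in rank), key=rank.get)
        if item in path:
            prefix = tuple(path[:path.index(item)])
            if prefix:
                base[prefix] += 1
    return base

print(cond_pattern_base("m"))   # fca:2, fcab:1 — matching the table
print(cond_pattern_base("b"))   # fca:1, f:1, c:1
```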

Example – Conditional FP-Tree

m-conditional FP-tree (a single path):

root
  f:3
    c:3
      a:3
  • For every pattern base
    • Accumulate the frequency for each item
    • Construct an FP-Tree for the frequent items of the pattern base
  • m-conditional pattern base:
    • fca:2, fcab:1

All frequent patterns concerning m:
m, fm, cm, am, fcm, fam, cam, fcam

Mining The FP-Tree
  • Construct conditional pattern bases for each node in the initial FP-Tree
  • Construct conditional FP-Tree from each conditional pattern base
  • Recursively visit conditional FP-Trees and expand frequent patterns
  • If the conditional FP-Tree contains only a single path, enumerate all combinations of its items
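The recursive procedure above can be sketched without an explicit tree by recursing directly on conditional pattern bases, represented as weighted prefix paths; the input below is the frequency-ordered, filtered transactions of the running example (the real FP-Tree gains compactness by sharing these prefixes, but mines the same bases):

```python
from collections import Counter

def fp_growth(patterns, minsup, suffix=()):
    """Recursive mining sketch over (prefix-path, count) pairs.
    Yields each frequent pattern with its support."""
    # Accumulate weighted item frequencies in this conditional base.
    freq = Counter()
    for path, n in patterns:
        for i in path:
            freq[i] += n
    for item in [i for i, c in freq.items() if c >= minsup]:
        new_suffix = (item,) + suffix
        yield new_suffix, freq[item]
        # Conditional pattern base of `item`: prefixes ending just above it.
        cond = [(path[:path.index(item)], n)
                for path, n in patterns if item in path]
        yield from fp_growth(cond, minsup, new_suffix)

# Frequency-ordered transactions restricted to frequent items (from the example).
db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
found = dict(fp_growth([(t, 1) for t in db], 3))
```

For the item m this yields exactly the eight patterns listed on the earlier slide: m, fm, cm, am, fcm, fam, cam, fcam, each with support 3.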
Example – Mining The FP-Tree

Item  Conditional pattern base        Conditional FP-tree
p     {(fcam:2), (cb:1)}              {(c:3)}|p
m     {(fca:2), (fcab:1)}             {(f:3, c:3, a:3)}|m
b     {(fca:1), (f:1), (c:1)}         NULL
a     {(fc:3)}                        {(f:3, c:3)}|a
c     {(f:3)}                         {(f:3)}|c
f     NULL                            NULL
Example – Mining The FP-Tree

  • Recursively visit the conditional FP-Trees:

m-conditional FP-tree:   root → f:3 → c:3 → a:3
am-conditional FP-tree:  root → f:3 → c:3
cm-conditional FP-tree:  root → f:3
cam-conditional FP-tree: root → f:3
Example – Mining The FP-Tree

m-conditional FP-tree (a single path):

root
  f:3
    c:3
      a:3
  • If we have a single path, the frequent itemsets of a tree can be generated by enumerating all combinations of nodes of the tree.

Enumeration of all patterns concerning m (from the single-path m-conditional FP-tree):
m, fm, cm, am, fcm, fam, cam, fcam
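The single-path enumeration is just the power set of the path's items, each combined with the suffix m; a minimal sketch:

```python
from itertools import combinations

# Single path f:3 -> c:3 -> a:3 of the m-conditional FP-tree.
path = ["f", "c", "a"]
patterns = [tuple(c) + ("m",)
            for r in range(len(path) + 1)
            for c in combinations(path, r)]
print(patterns)   # 8 patterns: m, fm, cm, am, fcm, fam, cam, fcam
```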

Advantages of FP-Tree
  • Completeness:
    • Never breaks a long pattern of any transaction
    • Preserves complete information for frequent pattern mining
  • Compactness
    • Reduces irrelevant information: infrequent items are removed
    • Frequency-descending ordering: more frequent items are more likely to be shared
    • Never larger than the original database (not counting node-links and counts)
References
  • [AIS93] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, May 1993.
  • R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), 1994.
  • J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 1–12, 2000.