Frequent Itemsets: Association Rules and Market Basket Analysis

CS240B--UCLA

Notes by Carlo Zaniolo

Most slides borrowed from Jiawei Han, UIUC

May 2007

Association Rules & Correlations
  • Basic concepts
  • Efficient and scalable frequent itemset mining methods:
    • Apriori, and improvements
    • FP-growth
  • Rule derivation, visualization and validation
  • Multi-level Associations
  • Temporal associations and frequent sequences
  • Other association mining methods
  • Summary
Market Basket Analysis: the context
  • Analyze customer buying habits by finding associations and correlations between the items that customers place in their “shopping basket”

Customer 1: Milk, eggs, sugar, bread

Customer 2: Milk, eggs, cereal, bread

Customer 3: Eggs, sugar

Market Basket Analysis: the context
  • Given: a database of customer transactions, where each transaction is a set of items
    • Find groups of items which are frequently purchased together
Goal of MBA
  • Extract information on purchasing behavior
  • Actionable information: can suggest
    • new store layouts
    • new product assortments
    • which products to put on promotion
  • MBA applicable whenever a customer purchases multiple things in proximity
    • credit cards
    • services of telecommunication companies
    • banking services
    • medical treatments
MBA: applicable to many other contexts
  • Telecommunication:
    • Each customer is a transaction containing the set of customer’s phone calls
  • Atmospheric phenomena:
    • Each time interval (e.g., a day) is a transaction containing the set of observed events (rain, wind, etc.)
  • Etc.
Association Rules
  • Express how products/services relate to each other and tend to group together
  • “if a customer purchases three-way calling, then they will also purchase call-waiting”
  • simple to understand
  • actionable information: bundle three-way calling and call-waiting in a single package
Frequent Itemsets
  • Transaction:
    • Relational format: <Tid, item> pairs, e.g. <1, item1>, <1, item2>, <2, item3>
    • Compact format: <Tid, itemset> pairs, e.g. <1, {item1, item2}>, <2, {item3}>
  • Item: single element; Itemset: set of items
  • Support of an itemset I: # of transactions containing I
  • Minimum support (minsup): threshold for support
  • Frequent itemset: an itemset with support ≥ minsup
  • Frequent itemsets represent sets of items that are positively correlated (see the sketch below)
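To make these definitions concrete, here is a minimal Python sketch; the baskets reuse the earlier customer example, and the helper name support is illustrative, not from the slides:

# Each transaction is a set of items; support(I) = number of transactions containing itemset I.
transactions = [
    {"milk", "eggs", "sugar", "bread"},   # Customer 1
    {"milk", "eggs", "cereal", "bread"},  # Customer 2
    {"eggs", "sugar"},                    # Customer 3
]

def support(itemset, transactions):
    """Count how many transactions contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

minsup = 2
print(support({"eggs"}, transactions))           # 3  -> frequent
print(support({"milk", "bread"}, transactions))  # 2  -> frequent
print(support({"sugar", "bread"}, transactions)) # 1  -> not frequent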
Frequent Itemsets Example

Support({dairy}) = 3 (75%)

Support({fruit}) = 3 (75%)

Support({dairy, fruit}) = 2 (50%)

If = 60%, then

{dairy}and{fruit}are frequent while {dairy, fruit} is not.

Itemset Support & Rule Confidence
  • Let A and B be disjoint itemsets and let:

s = support(A ∪ B) and

c = support(A ∪ B) / support(A)

Then the rule A ⇒ B holds with support s and confidence c; write A ⇒ B [s, c]

Objective of the mining task: find all rules with

  • minimum support (minsup)
  • minimum confidence (minconf)
  • Thus A ⇒ B [s, c] holds if s ≥ minsup and c ≥ minconf
Association Rules: Meaning

A B [ s, c ]

Support: denotes the frequency of the rule within transactions. A high value means that the rule involve a great part of database.

support(A  B [ s, c ]) = p(A  B)

Confidence: denotes the percentage of transactions containingA which contain also B. It is an estimation of conditioned probability .

confidence(A  B [ s, c ]) = p(B|A) = p(A & B)/p(A).
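A small sketch of both measures for a single rule A ⇒ B (pure Python; the function name is illustrative):

def rule_measures(A, B, transactions):
    """Return (support, confidence) of A => B for disjoint itemsets A and B (sets of items)."""
    n = len(transactions)
    count_AB = sum(1 for t in transactions if (A | B) <= t)   # transactions containing both A and B
    count_A = sum(1 for t in transactions if A <= t)
    s = count_AB / n                                  # support: frequency of the rule in the database
    c = count_AB / count_A if count_A else 0.0        # confidence: estimate of p(B | A)
    return s, c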

Association Rules - Example

For rule AC:

support = support({A, C}) = 50%

confidence = support({A, C})/support({A}) = 66.6%

The Apriori principle:

Any subset of a frequent itemset must be frequent

Min. support 50%

Min. confidence 50%

Closed Patterns and Max-Patterns
  • A long pattern contains very many subpatterns---combinatorial explosion
    • Closed patterns and max-patterns
  • An itemset is closed if none of its supersets has the same support
    • A closed pattern is a lossless compression of frequent patterns, reducing the # of patterns and rules
  • An itemset is maximal frequent if none of its supersets is frequent
    • But support of their subsets is not known – additional DB scans are needed
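The two definitions can be checked directly from a table of frequent-itemset supports, as in this sketch (the input dictionary is illustrative):

def closed_and_maximal(freq_supports):
    """freq_supports: {frozenset itemset: support} for every frequent itemset.
    Returns (closed, maximal) sets of itemsets."""
    closed, maximal = set(), set()
    for itemset, supp in freq_supports.items():
        freq_supersets = [s for s in freq_supports if itemset < s]
        # a superset with equal support would itself be frequent, so checking frequent supersets suffices
        if all(freq_supports[s] != supp for s in freq_supersets):
            closed.add(itemset)       # closed: no superset has the same support
        if not freq_supersets:
            maximal.add(itemset)      # maximal: no superset is frequent at all
    return closed, maximal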
Frequent Itemsets

Minimum support = 2

[Figure: itemset lattice over items A, B, C, D, E, annotated with the IDs of the supporting transactions; with this threshold, 13 itemsets are frequent (# Frequent = 13).]

Maximal Frequent Itemset: none of its supersets is frequent

Minimum support = 2

[Figure: the same itemset lattice; of the 13 frequent itemsets, 4 are maximal (# Frequent = 13, # Maximal = 4).]

Closed Frequent Itemset: none of its supersets has the same support

Minimum support = 2

[Figure: the same itemset lattice, highlighting which frequent itemsets are closed but not maximal and which are both closed and maximal; # Frequent = 13, # Closed = 9, # Maximal = 4.]

Maximal vs Closed Itemsets

  • As we move from an itemset A to one of its supersets, the support can:
    1. Remain the same: A is not closed
    2. Drop but still remain above the threshold: A is closed but not maximal
    3. Drop below the threshold: A is maximal (and closed)
Scalable Methods for Mining Frequent Patterns
  • The downward closure property of frequent patterns
    • Every subset of a frequent itemset must be frequent [antimonotonic property]
    • If {beer, diaper, nuts} is frequent, so is {beer, diaper}
    • i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
  • Scalable mining methods: Three major approaches
    • Apriori (Agrawal & Srikant@VLDB’94)
    • Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
    • Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
Apriori: A Candidate Generation-and-Test Approach
  • Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
  • Method:
    • Initially, scan DB once to get frequent 1-itemset
    • Generate length (k+1) candidate itemsets from length k frequent itemsets
    • Test the candidates against DB
    • Terminate when no frequent or candidate set can be generated
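A compact sketch of this level-wise method in Python; the candidate-generation step here is simplified, and the join/prune details follow on later slides:

from collections import defaultdict
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset itemset: support count} for all frequent itemsets."""
    transactions = [set(t) for t in transactions]
    counts = defaultdict(int)
    for t in transactions:                                   # scan 1: frequent 1-itemsets
        for item in t:
            counts[frozenset([item])] += 1
    L = {i: c for i, c in counts.items() if c >= minsup}
    frequent, k = dict(L), 2
    while L:
        prev = set(L)
        # generate length-k candidates from length-(k-1) frequent itemsets ...
        cand = {a | b for a in prev for b in prev if len(a | b) == k}
        # ... and prune those with an infrequent (k-1)-subset
        cand = {c for c in cand
                if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        counts = defaultdict(int)
        for t in transactions:                               # one database scan per level
            for c in cand:
                if c <= t:
                    counts[c] += 1
        L = {i: n for i, n in counts.items() if n >= minsup}
        frequent.update(L)
        k += 1
    return frequent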
Association Rules & Correlations
  • Basic concepts
  • Efficient and scalable frequent itemset mining methods:
    • Apriori, and improvements
The Apriori Algorithm—An Example

Supmin = 2

[Figure: worked example on a small transaction database TDB: the 1st scan produces candidate set C1 and frequent set L1, the 2nd scan C2 and L2, and the 3rd scan C3 and L3.]

Important Details of Apriori
  • How to generate candidates?
    • Step 1: self-joining Lk
    • Step 2: pruning
  • How to count supports of candidates?
  • Example of Candidate-generation
    • L3={abc, abd, acd, ace, bcd}
    • Self-joining: L3*L3
      • abcd from abc and abd
      • acde from acd and ace
    • Pruning:
      • acde is removed because ade is not in L3
    • C4={abcd}
How to Generate Candidates?
  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

  • Step 2: pruning

forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
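The same two steps in Python, with each itemset kept as a lexicographically sorted tuple so the self-join mirrors the pseudocode above (names are illustrative):

from itertools import combinations

def apriori_gen(L_prev, k):
    """L_prev: set of frequent (k-1)-itemsets as sorted tuples; returns candidate k-itemsets."""
    Ck = set()
    for p in L_prev:                       # Step 1: self-join on the first k-2 items
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck.add(p + (q[-1],))
    return {c for c in Ck                  # Step 2: prune candidates with an infrequent subset
            if all(s in L_prev for s in combinations(c, k - 1))}

# The example from the previous slide:
L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(apriori_gen(L3, 4))   # {('a', 'b', 'c', 'd')}; acde is pruned because ade is not in L3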

How to Count Supports of Candidates?
  • Why is counting supports of candidates a problem?
    • The total number of candidates can be very huge
    • One transaction may contain many candidates
  • Data Structures used:
    • Candidate itemsets can be stored in a hash-tree
    • or in a prefix-tree (trie)--example
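For contrast, the naive counting loop below tests every candidate against every transaction; the hash-tree or trie mentioned above exists precisely to avoid this all-pairs test (a plain sketch, not the hash-tree itself):

from collections import defaultdict

def count_supports(candidates, transactions):
    """candidates: iterable of frozensets; returns {candidate: support count}."""
    candidates = list(candidates)
    counts = defaultdict(int)
    for t in map(set, transactions):
        for c in candidates:    # a hash-tree/trie would visit only candidates that can occur in t
            if c <= t:
                counts[c] += 1
    return counts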
Effect of Support Distribution
  • Many real data sets have a skewed support distribution

[Figure: support distribution of a retail data set.]

Effect of Support Distribution
  • How to set the appropriate minsup threshold?
    • If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
    • If minsup is set too low, it is computationally expensive and the number of itemsets is very large
  • Using a single minimum support threshold may not be effective
Rule Generation
  • How to efficiently generate rules from frequent itemsets?
    • In general, confidence does not have an anti-monotone property:

c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)

    • But the confidence of rules generated from the same itemset does have an anti-monotone property
    • e.g., for L = {A,B,C,D}: c(ABC ⇒ D) ≥ c(AB ⇒ CD) ≥ c(A ⇒ BCD)
      • Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
Rule Generation
  • Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f ⇒ L–f satisfies the minimum confidence requirement
  • If |L| = k, then there are 2^k candidate association rules (including L ⇒ ∅ and ∅ ⇒ L)
    • Example: if L = {A,B,C,D} is the frequent itemset, then
    • the candidate rules are:

ABC ⇒ D, ABD ⇒ C, ACD ⇒ B, BCD ⇒ A, A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC, AB ⇒ CD, AC ⇒ BD, AD ⇒ BC, BC ⇒ AD, BD ⇒ AC, CD ⇒ AB

But anti-monotonicity will make things converge fast (see the rule-generation sketch below).
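A sketch of rule generation from a single frequent itemset L that uses this anti-monotonicity: once a consequent gives a low-confidence rule, no consequent containing it is tried (the supports dictionary is assumed to hold the counts of all subsets of L):

def gen_rules(L, supports, minconf):
    """L: frequent itemset (frozenset); supports: {frozenset: count} covering all subsets of L.
    Returns a list of (antecedent, consequent, confidence) triples."""
    rules = []
    consequents = [frozenset([i]) for i in L]       # start with 1-item consequents
    while consequents:
        survivors = []
        for B in consequents:
            A = L - B
            if not A:
                continue
            conf = supports[L] / supports[A]
            if conf >= minconf:
                rules.append((A, B, conf))
                survivors.append(B)
            # else: pruned -- every consequent containing B gives an even lower confidence
        # grow the surviving consequents by one item, level by level
        consequents = list({b1 | b2 for b1 in survivors for b2 in survivors
                            if len(b1 | b2) == len(b1) + 1})
    return rules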

Lattice of Rules

confidence(f ⇒ L–f) = support(L) / support(f)

[Figure: lattice of candidate rules for L = {A,B,C,D}; once a rule is found to have low confidence, the rules below it in the lattice are pruned.]
Rule Generation for Apriori Algorithm
  • A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
  • join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC
  • Prune rule D ⇒ ABC if its subset rule AD ⇒ BC does not have high confidence
  • Finally, check the validity of rule D ⇒ ABC (this check is not expensive, so the pruning step above might be skipped)
Rules: some useful, some trivial, others inexplicable
  • Useful: “On Thursdays, grocery store consumers often purchase diapers and beer together.”
  • Trivial: “Customers who purchase maintenance agreements are very likely to purchase large appliances.”
  • Inexplicable: “When a new hardware store opens, one of the most sold items is toilet rings.”

Conclusion: inferred rules must be validated by a domain expert before they can be used in the marketplace: post-mining of association rules.

Mining for Association Rules

The main steps in the process

  • Select a minimum support/confidence level
  • Find the frequent itemsets
  • Find the association rules
  • Validate (postmine) the rules so found.
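For orientation only, the same four steps end to end with off-the-shelf tools; this assumes the pandas and mlxtend libraries, which are not part of the original notes:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

baskets = [["milk", "eggs", "sugar", "bread"],
           ["milk", "eggs", "cereal", "bread"],
           ["eggs", "sugar"]]

te = TransactionEncoder()                                        # one-hot encode the baskets
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)
frequent = apriori(onehot, min_support=0.6, use_colnames=True)   # steps 1-2: thresholds + frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)  # step 3: rules
print(rules[["antecedents", "consequents", "support", "confidence"]])        # step 4: inspect and validate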
Mining for Association Rules: Checkpoint
  • Apriori opened up a big commercial market for DM
    • association rules came from the DB field, classifiers from AI, and clustering precedes both … and DM
  • Many open problem areas, including
    • Performance: faster algorithms needed for frequent itemsets
    • Improving the statistical/semantic significance of rules
    • Data stream mining for association rules: even faster algorithms needed, incremental computation, adaptability, etc. The post-mining process also becomes more challenging.
Performance: Efficient Implementation of Apriori in SQL
  • Hard to get good performance out of pure SQL (SQL-92) based approaches alone
  • Make use of object-relational extensions like UDFs, BLOBs, Table functions etc.
    • S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD’98
    • A much better solution: use UDAs—native or imported.

Haixun Wang and Carlo Zaniolo: ATLaS: A Native Extension of SQL for Data Mining. SIAM International Conference on Data Mining 2003, San Francisco, CA, May 1-3, 2003

Performance for Apriori
  • Challenges
    • Multiple scans of transaction database [not for data streams]
    • Huge number of candidates
    • Tedious workload of support counting for candidates
  • Many Improvements suggested: general ideas
    • Reduce passes of transaction database scans
    • Shrink number of candidates
    • Facilitate counting of candidates
Partition: Scan Database Only Twice
  • Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
    • Scan 1: partition database and find local frequent patterns
    • Scan 2: consolidate global frequent patterns
  • A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB’95
  • Does this scale up to larger partitions?
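A sketch of the two-scan scheme; local_miner stands for any frequent-itemset miner (such as the Apriori sketch earlier), and the relative threshold is reused on each partition:

def partition_mine(transactions, minsup_frac, n_parts, local_miner):
    """Scan 1: mine each partition locally; Scan 2: count the union of local results globally."""
    transactions = [set(t) for t in transactions]
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    candidates = set()
    for part in parts:   # any globally frequent itemset is locally frequent in at least one partition
        local_min = max(1, int(minsup_frac * len(part)))
        candidates |= set(local_miner(part, local_min))
    global_min = minsup_frac * len(transactions)
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= global_min}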
Sampling for Frequent Patterns
  • Select a sample S of the original database and mine frequent patterns within the sample using Apriori
  • To avoid losses, mine with a support threshold lower than the one required
  • Scan the rest of the database to find exact counts.
  • H. Toivonen. Sampling large databases for association rules. In VLDB’96
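A corresponding sketch: mine a random sample at a lowered threshold to reduce the chance of missing itemsets, then take one full scan for exact counts (the slack factor and function names are illustrative):

import random

def sample_mine(transactions, minsup_frac, sample_frac, miner, slack=0.8):
    transactions = [set(t) for t in transactions]
    sample = random.sample(transactions, max(1, int(sample_frac * len(transactions))))
    lowered = max(1, int(slack * minsup_frac * len(sample)))   # mine the sample below the target threshold
    candidates = set(miner(sample, lowered))
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}  # one full scan: exact counts
    return {c: n for c, n in counts.items() if n >= minsup_frac * len(transactions)}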
DIC: Reduce Number of Scans

Once both A and D are determined frequent, the counting of AD begins

Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

[Figure: itemset lattice (1-itemsets up to ABCD) alongside the stream of transactions, contrasting Apriori's strict level-wise passes with DIC, which starts counting 2-itemsets and 3-itemsets partway through a scan.]

S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97

Improving Performance (cont.)
  • Apriori: multiple database scans are costly
  • Mining long patterns needs many passes of scanning and generates lots of candidates
    • To find the frequent itemset i1i2…i100:
      • # of scans: 100
      • # of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 - 1 ≈ 1.27 × 10^30 !
  • Bottleneck: candidate generation and test
  • Can we avoid candidate generation?
Mining Frequent Patterns Without Candidate Generation
  • FP-Growth Algorithm
    • Build FP-tree: items are listed by decreasing frequency
    • For each suffix (recursively)
      • Build its conditionalized subtree
      • and compute its frequent items
  • An order of magnitude faster than Apriori
Frequent Patterns (FP) Algorithm

_________________________________________
These slides are based on those by: Yousry Taha, Taghrid Al-Shallali, Ghada AL Modaifer, Nesreen AL Boiez

  • The algorithm consists of two steps:
    • Step 1: build the FP-tree (Frequent Patterns tree).
    • Step 2: use the FP-Growth algorithm to find frequent itemsets from the FP-tree.

Frequent Pattern Tree Algorithm: Example

The first scan of the database is the same as in Apriori: it derives the set of 1-itemsets and their support counts.

The set of frequent items is sorted in order of descending support count.

An FP-tree is constructed.

The FP-tree is then conditionalized and mined for frequent itemsets.

FP-Tree for the example database

[Figure: the item header table and the FP-tree built from the example transactions; nodes are labelled item:count (e.g. Milk:3 near the root, with Bread:1, Juice:1 and Cookies:1 nodes on the branches).]

FP-Growth Algorithm For Finding Frequent Itemsets

Steps:

1. Start from each frequent length-1 pattern (as an initial suffix pattern).

2. Construct its conditional pattern base, which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern.

3. Then construct its conditional FP-tree and perform mining on that tree.

4. The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from a conditional FP-tree.

5. The union of all frequent patterns (generated by step 4) gives the required frequent itemsets.
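Putting steps 1–5 together, a compact Python sketch; class and function names are illustrative, the four baskets at the end are made-up data rather than the slides' example database, and for simplicity each recursion rebuilds a tree from the conditional pattern base instead of constructing the conditional FP-tree in place:

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_tree(transactions, minsup):
    """Step 1: count items, keep the frequent ones, and insert each transaction in
    frequency-descending order; the header table links every node of a given item."""
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= minsup}
    root, header = FPNode(None, None), defaultdict(list)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return header, freq

def fp_growth(transactions, minsup, suffix=frozenset()):
    """Steps 2-5: for each suffix item, collect its conditional pattern base and recurse."""
    header, freq = build_tree(transactions, minsup)
    patterns = {}
    for item in sorted(freq, key=freq.get):          # least frequent suffix first
        new_suffix = suffix | {item}
        patterns[new_suffix] = freq[item]            # support of the grown pattern
        cond_base = []                               # prefix paths co-occurring with the suffix
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            cond_base.extend([path] * node.count)
        if any(cond_base):
            patterns.update(fp_growth(cond_base, minsup, new_suffix))
    return patterns

baskets = [{"Milk", "Bread", "Cookies"}, {"Milk", "Juice"},
           {"Milk", "Bread", "Juice"}, {"Milk", "Cookies"}]
print(fp_growth(baskets, minsup=2))   # e.g. {'Milk'}: 4, {'Milk', 'Bread'}: 2, {'Milk', 'Juice'}: 2, ...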



FP-Growth: for each suffix find (1) its supporting paths, (2) its conditional FP-tree, and (3) the frequent patterns with such an ending (suffix)

… then expand the suffix and repeat these operations


Starting from least frequent suffix: Juice

[Figure: on the left, the full FP-tree; on the right, only the prefix paths that end in Juice, i.e. the conditional pattern base for the suffix Juice (Milk occurs on these paths with a total count of 2).]

Conditionalized tree for Suffix “Juice”

[Figure: the conditional FP-tree for suffix Juice is a single path NULL → Milk:2.]

Thus: (Juice, Milk:2) is a frequent pattern.

Now Patterns with Suffix “Cookies”

[Figure: the prefix paths ending in Cookies and the resulting conditional FP-tree, in which Bread appears with count 2 (Bread:2).]

Thus: (Cookies, Bread:2) is frequent.

Why is Frequent Pattern Growth fast?

  • Performance studies show that FP-growth is an order of magnitude faster than Apriori
  • Reasoning:
    • No candidate generation, no candidate test
    • Uses a compact data structure
    • Eliminates repeated database scans
    • The basic operations are counting and FP-tree building

Other Types of Association Rules

  • Association rules among hierarchies
  • Multidimensional associations
  • Negative associations

FP-growth vs. Apriori: Scalability With Number of Transactions

[Figure: run time of FP-growth vs. Apriori as the number of transactions grows; data set T25I20D100K, minimum support 1.5%.]

FP-Growth: pros and cons
  • The FP-tree is complete
    • It preserves complete information for frequent pattern mining
    • It never breaks a long pattern of any transaction
  • The FP-tree is compact
    • It reduces irrelevant information: infrequent items are gone
    • Items are in frequency-descending order: the more frequently an item occurs, the more likely it is to be shared
    • It is never larger than the original database (not counting node-links and the count fields)
  • The FP-tree is generated in one scan of the database (data stream mining?)
    • However, deriving the frequent patterns from the FP-tree is still computationally expensive; improved algorithms are needed for data streams.