Business Systems Intelligence: 4. Mining Association Rules

Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)
Acknowledgments
  • These notes are based (heavily) on those provided by the authors to accompany “Data Mining: Concepts & Techniques” by Jiawei Han and Micheline Kamber
  • Some slides are also based on trainer’s kits provided by SAS

More information about the book is available at: www-sal.cs.uiuc.edu/~hanj/bk2/

And information on SAS is available at: www.sas.com

Mining Association Rules
  • Today we will look at:
    • Association rule mining
    • Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
    • Sequential pattern mining
    • Applications/extensions of frequent pattern mining
    • Summary
What Is Association Mining?
  • Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database
  • Association rule mining:
    • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
Motivations For Association Mining
  • Motivation: Finding regularities in data
    • What products were often purchased together?
      • Beer and nappies!
    • What are the subsequent purchases after buying a PC?
    • What kinds of DNA are sensitive to this new drug?
    • Can we automatically classify web documents?
Motivations For Association Mining (cont…)
  • Foundation for many essential data mining tasks
    • Association, correlation, causality
    • Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
    • Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
Motivations For Association Mining (cont…)
  • Broad applications
    • Basket data analysis, cross-marketing, catalog design, sale campaign analysis
    • Web log (click stream) analysis, DNA sequence analysis, etc.
Market Basket Analysis
  • Market basket analysis is a typical example of frequent itemset mining
  • Customers’ buying habits are divined by finding associations between the different items that customers place in their “shopping baskets”
  • This information can be used to develop marketing strategies
Association Rule Basic Concepts
  • Let I be a set of items {I1, I2, I3,…, Im}
  • Let D be a database of transactions where each transaction T is a set of items such that T ⊆ I
  • So, if A is a set of items, a transaction T is said to contain A if and only if A ⊆ T
  • An association rule is an implication A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
Association Rule Support & Confidence
  • We say that an association rule A ⇒ B holds in the transaction set D with support, s, and confidence, c
  • The support of the association rule is given as the percentage of transactions in D that contain both A and B (i.e. A ∪ B)
  • So, the support can be considered the probability P(A ∪ B)
Association Rule Support & Confidence (cont…)
  • The confidence of the association rule is given as the percentage of transactions in D containing A that also contain B
  • So, the confidence can be considered the conditional probability P(B|A)
  • Association rules that satisfy minimum support and confidence values are said to be strong
Itemsets & Frequent Itemsets
  • An itemset is a set of items
  • A k-itemset is an itemset that contains k items
  • The occurrence frequency of an itemset is the number of transactions that contain the itemset
    • This is also known more simply as the frequency, support count or count
  • An itemset is said to be frequent if the support count satisfies a minimum support count threshold
  • The set of frequent k-itemsets is denoted Lk
Support & Confidence Again
  • Support and confidence values can be calculated as follows:

    support(A ⇒ B) = P(A ∪ B) = (# of transactions containing both A and B) / (total # of transactions in D)

    confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)
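
To make these definitions concrete, here is a minimal Python sketch (written for this transcript, not taken from the slides; the basket data is invented):

```python
def support_confidence(transactions, A, B):
    """Compute support and confidence of the rule A => B.
    transactions: list of sets of items; A, B: sets of items."""
    n = len(transactions)
    count_A = sum(1 for t in transactions if A <= t)           # transactions containing A
    count_AB = sum(1 for t in transactions if (A | B) <= t)    # transactions containing A and B
    support = count_AB / n
    confidence = count_AB / count_A if count_A else 0.0
    return support, confidence

# Invented example baskets:
baskets = [{"beer", "nappy"}, {"beer", "nuts"}, {"beer", "nappy", "nuts"}, {"milk"}]
print(support_confidence(baskets, {"beer"}, {"nappy"}))  # (0.5, 0.666...)
```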
Association Rule Mining
  • So, in general association rule mining can be reduced to the following two steps:
    • Find all frequent itemsets
      • Each itemset will occur at least as frequently as a minimum support count
    • Generate strong association rules from the frequent itemsets
      • These rules will satisfy minimum support and confidence measures
Combinatorial Explosion!
  • A major challenge in mining frequent itemsets is that the number of frequent itemsets generated can be massive
  • For example, a long frequent itemset will contain a combinatorial number of shorter frequent sub-itemsets
  • A frequent itemset of length 100 will contain the following number of frequent sub-itemsets:

    C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30
The Apriori Algorithm
  • Any subset of a frequent itemset must be frequent
    • If {beer, nappy, nuts} is frequent, so is {beer, nappy}
    • Every transaction having {beer, nappy, nuts} also contains {beer, nappy}
  • Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!
The Apriori Algorithm (cont…)
  • The Apriori algorithm is known as a candidate generation-and-test approach
  • Method:
    • Generate length (k+1) candidate itemsets from length k frequent itemsets
    • Test the candidates against the DB
  • Performance studies show the algorithm’s efficiency and scalability
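
To make the generate-and-test loop concrete, the following is a compact Python sketch of level-wise Apriori (a minimal illustration written for this transcript, not the authors' implementation):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise Apriori: returns {frozenset(itemset): support_count}."""
    counts = {}
    for t in transactions:                       # 1st scan: count single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_count}
    frequent, k = dict(Lk), 1
    while Lk:
        # Generate (k+1)-candidates by joining frequent k-itemsets, pruning
        # any candidate with an infrequent k-subset (Apriori property)
        candidates = set()
        for a, b in combinations(Lk, 2):
            c = a | b
            if len(c) == k + 1 and all(frozenset(s) in Lk for s in combinations(c, k)):
                candidates.add(c)
        counts = {c: 0 for c in candidates}
        for t in transactions:                   # test candidates against the DB
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(Lk)
        k += 1
    return frequent

baskets = [{"beer", "nappy", "nuts"}, {"beer", "nappy"}, {"beer", "nuts"}, {"nappy", "milk"}]
print(apriori(baskets, min_count=2))
```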
The Apriori Algorithm: An Example

[Figure: Apriori run on an example transaction database TDB. A 1st scan produces candidate set C1, filtered to the frequent set L1; self-joining L1 gives C2, which a 2nd scan filters to L2; joining L2 gives C3, which a 3rd scan filters to L3.]

Important Details Of The Apriori Algorithm
  • There are two crucial questions in implementing the Apriori algorithm:
    • How to generate candidates?
    • How to count supports of candidates?
Generating Candidates
  • There are 2 steps to generating candidates:
    • Step 1: Self-joining Lk
    • Step 2: Pruning
  • Example of Candidate-generation
    • L3={abc, abd, acd, ace, bcd}
    • Self-joining: L3*L3
      • abcd from abc and abd
      • acde from acd and ace
    • Pruning:
      • acde is removed because ade is not in L3
    • C4={abcd}
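
The join-and-prune step can be written down directly; this illustrative Python sketch (a prefix-join formulation assumed for this transcript) reproduces the L3 → C4 example above:

```python
from itertools import combinations

def gen_candidates(Lk, k):
    """Generate C(k+1) from L(k): join itemsets sharing their first k-1 items,
    then prune candidates that contain an infrequent k-subset."""
    Lk = sorted(tuple(sorted(s)) for s in Lk)
    freq = set(Lk)
    candidates = []
    for i, a in enumerate(Lk):
        for b in Lk[i + 1:]:
            if a[:-1] == b[:-1]:                                # self-join on a shared (k-1)-prefix
                c = a + (b[-1],)
                if all(s in freq for s in combinations(c, k)):  # Apriori pruning
                    candidates.append(c)
    return candidates

L3 = ["abc", "abd", "acd", "ace", "bcd"]
print(gen_candidates(L3, 3))  # [('a','b','c','d')] -- acde is pruned since ade is not in L3
```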
How to Count Supports Of Candidates?
  • Why is counting supports of candidates a problem?
    • The total number of candidates can be huge
    • One transaction may contain many candidates
  • Method:
    • Candidate itemsets are stored in a hash-tree
    • Leaf node of hash-tree contains a list of itemsets and counts
    • Interior node contains a hash table
    • Subset function: finds all the candidates contained in a transaction
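
The hash-tree itself is too long for a short example, but the role of the subset function is easy to show. This dictionary-based stand-in (an assumption for this transcript, not the slides' hash-tree) enumerates each transaction's k-subsets and increments the matching candidate counts:

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """For each transaction, enumerate its k-subsets and bump the count
    of any subset that is a candidate (the job the hash-tree speeds up)."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts

baskets = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
print(count_supports(baskets, {("a", "b"), ("b", "c")}, k=2))  # {('a','b'): 2, ('b','c'): 2}
```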
Generating Association Rules
  • Once all frequent itemsets have been found, association rules can be generated
  • Strong association rules are generated from a frequent itemset by calculating the confidence of each possible rule arising from that itemset and testing it against a minimum confidence threshold, as in the sketch below
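
A minimal illustrative sketch of this step (helper and variable names are invented for this transcript):

```python
from itertools import combinations

def gen_rules(support_counts, min_conf):
    """Emit (antecedent, consequent, confidence) for every rule A => B with
    conf(A => B) = support_count(A u B) / support_count(A) >= min_conf."""
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for A in combinations(sorted(itemset), r):
                A = frozenset(A)
                conf = count / support_counts[A]   # subsets are frequent too (Apriori)
                if conf >= min_conf:
                    rules.append((set(A), set(itemset - A), conf))
    return rules

# Support counts as produced by a frequent-itemset miner (invented numbers):
sc = {frozenset({"beer"}): 3, frozenset({"nappy"}): 3, frozenset({"beer", "nappy"}): 2}
print(gen_rules(sc, min_conf=0.6))  # e.g. ({'beer'}, {'nappy'}, 0.666...)
```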
Challenges Of Frequent Pattern Mining
  • Challenges
    • Multiple scans of transaction database
    • Huge number of candidates
    • Tedious workload of support counting for candidates
  • Improving Apriori: general ideas
    • Reduce passes of transaction database scans
    • Shrink number of candidates
    • Facilitate support counting of candidates
Bottleneck Of Frequent-Pattern Mining
  • Multiple database scans are costly
  • Mining long patterns needs many passes of scanning and generates lots of candidates
    • To find the frequent itemset i1i2…i100:
      • # of scans: 100
      • # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30
  • Bottleneck: candidate-generation-and-test
Mining Frequent Patterns Without Candidate Generation
  • Techniques for mining frequent itemsets which avoid candidate generation include:
    • FP-growth
      • Grow long patterns from short ones using local frequent items
    • ECLAT (Equivalence CLAss Transformation) algorithm
      • Uses a data representation in which transactions are associated with items, rather than the other way around (vertical data format)
  • These methods can be much faster than the Apriori algorithm
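
To illustrate the vertical data format, here is a small Python sketch (a toy written for this transcript, not the published ECLAT implementation): each item maps to the set of transaction ids containing it, and the support of an itemset is the size of the intersection of its tidsets, so no database rescans are needed:

```python
def to_vertical(transactions):
    """Vertical layout: item -> set of transaction ids (its 'tidset')."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

baskets = [{"beer", "nappy"}, {"beer", "nuts"}, {"beer", "nappy", "nuts"}]
v = to_vertical(baskets)
# support of {beer, nappy} = size of the tidset intersection
print(len(v["beer"] & v["nappy"]))  # 2
```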
Sequence Databases and Sequential Pattern Analysis
  • Frequent patterns vs. (frequent) sequential patterns
  • Applications of sequential pattern mining
    • Customer shopping sequences:
      • First buy computer, then CD-ROM, and then digital camera, within 3 months.
    • Medical treatment, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
    • Telephone calling patterns, Weblog click streams
    • DNA sequences and gene structures
What Is Sequential Pattern Mining?
  • Given a set of sequences, find the complete set of frequent subsequences
  • A sequence: <(ef)(ab)(df)cb>
    • An element may contain a set of items; items within an element are unordered and we list them alphabetically
  • <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
  • Given support threshold min_sup = 2, <(ab)c> is a sequential pattern in the example sequence database

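
Subsequence containment (each element of the shorter sequence must match, in order, an element of the longer one that is a superset of it) can be checked greedily; a minimal Python sketch written for this transcript:

```python
def is_subsequence(sub, seq):
    """True if `sub` is a subsequence of `seq`; both are lists of item-sets.
    Greedily match each element of `sub` to the earliest remaining
    element of `seq` that contains it."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

# The slide's example: <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]
seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]
print(is_subsequence(sub, seq))  # True
```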
Challenges On Sequential Pattern Mining
  • A huge number of possible sequential patterns are hidden in databases
  • A mining algorithm should
    • Find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
    • Be highly efficient, scalable, involving only a small number of database scans
    • Be able to incorporate various kinds of user-specific constraints
A Basic Property Of Sequential Patterns: Apriori

Seq. ID | Sequence          (min_sup = 2)
10      | <(bd)cb(ac)>
20      | <(bf)(ce)b(fg)>
30      | <(ah)(bf)abf>
40      | <(be)(ce)d>
50      | <a(bd)bcb(ade)>

  • A basic property: Apriori
    • If a sequence S is not frequent, then none of the super-sequences of S is frequent
    • E.g., <hb> is infrequent, so <hab> and <(ah)b> are too

GSP—A Generalized Sequential Pattern Mining Algorithm
  • GSP (Generalized Sequential Pattern) mining algorithm proposed in 1996
  • Outline of the method
    • Initially, every item in DB is a candidate of length 1
    • For each level (i.e., sequences of length k):
      • Scan database to collect support count for each candidate sequence
      • Generate candidate length (k+1) sequences from length k frequent sequences using Apriori
    • Repeat until no frequent sequence or no candidate can be found
  • Major strength: Candidate pruning by Apriori
Finding Length 1 Sequential Patterns

Seq. ID | Sequence          (min_sup = 2)
10      | <(bd)cb(ac)>
20      | <(bf)(ce)b(fg)>
30      | <(ah)(bf)abf>
40      | <(be)(ce)d>
50      | <a(bd)bcb(ade)>
  • Examine GSP using an example
  • Initial candidates: all singleton sequences
    • <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
  • Scan database once, count support for candidates
Generating Length 2 Candidates

  • Without the Apriori property, 8×8 + (8×7)/2 = 92 length-2 candidates would be generated
  • With Apriori pruning, only 51 length-2 candidates remain: 44.57% of candidates are pruned

Finding Length 2 Sequential Patterns
  • Scan database one more time, collect support count for each length 2 candidate
  • There are 19 length 2 candidates which pass the minimum support threshold
    • They are length 2 sequential patterns
Generating Length 3 Candidates And Finding Length 3 Patterns
  • Generate length 3 candidates
    • Self-join length 2 sequential patterns
      • Based on the Apriori property
      • <ab>, <aa> and <ba> are all length 2 sequential patterns ⇒ <aba> is a length-3 candidate
      • <(bd)>, <bb> and <db> are all length 2 sequential patterns ⇒ <(bd)b> is a length-3 candidate
    • 46 candidates are generated
  • Find length 3 sequential patterns
    • Scan database once more, collect support counts for candidates
    • 19 out of 46 candidates pass support threshold
The GSP Mining Process

Seq. ID | Sequence          (min_sup = 2)
10      | <(bd)cb(ac)>
20      | <(bf)(ce)b(fg)>
30      | <(ah)(bf)abf>
40      | <(be)(ce)d>
50      | <a(bd)bcb(ade)>

  • 1st scan: 8 candidates (<a> <b> <c> <d> <e> <f> <g> <h>), yielding 6 length-1 sequential patterns
  • 2nd scan: 51 candidates (<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>), yielding 19 length-2 sequential patterns; 10 candidates do not appear in the DB at all
  • 3rd scan: 46 candidates (<abb> <aab> <aba> <baa> <bab> …), yielding 19 length-3 sequential patterns; 20 candidates do not appear in the DB at all
  • 4th scan: 8 candidates (<abba> <(bd)bc> …), yielding 6 length-4 sequential patterns; the remaining candidates either fail the support threshold or do not appear in the DB
  • 5th scan: 1 candidate, yielding 1 length-5 sequential pattern: <(bd)cba>
The GSP Algorithm
  • Take every sequence of the form <x> as a length 1 candidate
  • Scan the database once to find F1, the set of length 1 sequential patterns
  • Let k = 1; while Fk is not empty do:
    • Form Ck+1, the set of length (k+1) candidates, from Fk
    • If Ck+1 is not empty, scan the database once to find Fk+1, the set of length (k+1) sequential patterns
    • Let k = k + 1
Bottlenecks of GSP
  • A huge set of candidates could be generated
    • 1,000 frequent length 1 sequences generate 1,000×1,000 + (1,000×999)/2 = 1,499,500 length 2 candidates!
  • Multiple scans of database in mining
  • Real challenge: mining long sequential patterns
    • An exponential number of short candidates
    • A length-100 sequential pattern needs ≈ 10^30 candidate sequences!
Improvements On GSP
    • FreeSpan:
    • Projection-based: No candidate sequence needs to be generated
    • But, projection can be performed at any point in the sequence, and the projected sequences will not shrink much
  • PrefixSpan
    • Projection-based
    • But performs only prefix-based projection: fewer projections and quickly shrinking projected sequences
Frequent-Pattern Mining: Achievements
  • Frequent pattern mining—an important task in data mining
  • Frequent pattern mining methodology
    • Candidate generation & test vs. projection-based (frequent-pattern growth)
    • Various optimization methods: database partition, scan reduction, hash tree, sampling, border computation, clustering, etc.
  • Related frequent-pattern mining algorithm: scope extension
    • Mining closed frequent itemsets and max-patterns (e.g., MaxMiner, CLOSET, CHARM, etc.)
    • Mining multi-level, multi-dimensional frequent patterns with flexible support constraints
    • Constraint pushing for mining optimization
    • From frequent patterns to correlation and causality
Frequent-Pattern Mining: Research Problems
  • Multi-dimensional gradient analysis: patterns regarding changes and differences
    • Not just counts—other measures, e.g., avg(profit)
  • Mining top-k frequent patterns without support constraint
  • Mining fault-tolerant associations
    • “3 out of 4 courses excellent” leads to A in data mining
  • Fascicles and database compression by frequent pattern mining
  • Partial periodic patterns
  • DNA sequence analysis and pattern classification