Technologies for Mining Frequent Patterns in Large Databases

- By **Mercy**

### Technologies for Mining Frequent Patterns in Large Databases

### Part VI: Mining Frequent Patterns Without Candidate Generation

### Part VII: CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets

### Part VIII: FreeSpan: Frequent Pattern-projected Sequential Pattern Mining

Jiawei Han

Intelligent Database Systems Research Lab.

Simon Fraser University, Canada

http://www.cs.sfu.ca/~han

Tutorial Outline

- What is frequent pattern mining?
- Frequent pattern mining algorithms
- Apriori and its variations
- A multi-dimensional view of frequent pattern mining
- Constraint-based frequent pattern mining
- Recent progress on efficient mining methods
- Mining frequent patterns without candidate generation
- CLOSET: Efficient mining of frequent closed itemsets
- FreeSpan: Towards efficient sequential pattern mining

Part I: What Is Frequent Pattern Mining?

- What is a frequent pattern?
- Why frequent pattern mining?
- Challenges in frequent pattern mining

What Is Frequent Pattern Mining?

- What is a frequent pattern?
- Pattern (set of items, sequence, etc.) that occurs together frequently in a database [AIS93]
- Frequent pattern: an important form of regularity
- What products were often purchased together? — beers and diapers!
- What are the consequences of a hurricane?
- What is the next target after buying a PC?

Application Examples

- Market basket analysis
- Rule * ⇒ Maintenance Agreement: what the store should do to boost Maintenance Agreement sales
- Rule Home Electronics ⇒ *: what other products the store should stock up on if it has a sale on Home Electronics
- Attached mailing in direct marketing
- Detecting “ping-pong”ing of patients
- transaction: patient
- item: doctor/clinic visited by a patient
- support of a rule: number of common patients

Frequent Pattern Mining: A Cornerstone in Data Mining

- Association analysis
- Basket data analysis, cross-marketing, catalog design, loss-leader analysis, text database analysis
- Correlation or causality analysis
- Clustering
- Classification
- Association-based classification analysis
- Sequential pattern analysis
- Web log sequence, DNA analysis, etc.
- Partial periodicity, cyclic/temporal associations

Association Rule Mining

- Given
- A database of customer transactions
- Each transaction is a list of items (purchased by a customer in a visit)
- Find all rules that correlate the presence of one set of items with that of another set of items
- Example: 98% of people who purchase tires and auto accessories also get automotive services done
- Any number of items in the consequent/antecedent of rule
- Possible to specify constraints on rules (e.g., find only rules involving Home Laundry Appliances).

Basic Concepts

- Rule form: “A ⇒ B [support s, confidence c]”.
- Support: usefulness of discovered rules
- Confidence: certainty of the detected association
- Rules that satisfy both min_sup and min_conf are called strong.
- Examples:
- buys(x, “diapers”) ⇒ buys(x, “beers”) [0.5%, 60%]
- age(x, “30-34”) ^ income(x, “42K-48K”) ⇒ buys(x, “high resolution TV”) [2%, 60%]
- major(x, “CS”) ^ takes(x, “DB”) ⇒ grade(x, “A”) [1%, 75%]

Rule Measures: Support and Confidence

- Find all the rules X ^ Y ⇒ Z with minimum confidence and support
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction having {X, Y} also contains Z

[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]

With minimum support 50% and minimum confidence 50%, we have

- A ⇒ C (50%, 66.6%)
- C ⇒ A (50%, 100%)
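
The two measures can be checked with a few lines of Python. This is a minimal sketch: the four-transaction database below is a hypothetical one, chosen only because it reproduces the figures quoted for the two rules; the function names are illustrative as well.

```python
# Hypothetical transaction database (an assumption; chosen to match the quoted figures).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Conditional probability that a transaction containing `lhs` also contains `rhs`."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"A", "C"}, transactions))        # support of A => C and of C => A
print(confidence({"A"}, {"C"}, transactions))   # confidence of A => C
print(confidence({"C"}, {"A"}, transactions))   # confidence of C => A
```

Both rules share the same support because support depends only on the union of the two sides; only the confidence is directional.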

Part II: Frequent pattern mining methods: Apriori and its variations

- The Apriori algorithm
- Improvements of Apriori
- Incremental, parallel, and distributed methods
- Different measures in association mining

An Influential Mining Methodology — The Apriori Algorithm

- The Apriori method:
- Proposed by Agrawal & Srikant 1994
- A similar level-wise algorithm by Mannila et al. 1994
- Major idea:
- A subset of a frequent itemset must be frequent
- E.g., if {beer, diaper, nuts} is frequent, {beer, diaper} must be. If any subset is infrequent, its superset cannot be frequent!
- A powerful, scalable candidate set pruning technique:
- It reduces candidate k-itemsets dramatically (for k > 2)

Mining Association Rules — Example

For rule A ⇒ C:

support = support({A, C}) = 50%

confidence = support({A, C})/support({A}) = 66.6%

The Apriori principle:

Any subset of a frequent itemset must be frequent.

Min. support 50%

Min. confidence 50%

Procedure of Mining Association Rules:

- Find the frequent itemsets: the sets of items that have minimum support (Apriori)
- A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
- Use the frequent itemsets to generate association rules.

The Apriori Algorithm

- Join Step

Ck is generated by joining Lk-1 with itself

- Prune Step

Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, hence every candidate in Ck that contains such a subset should be removed.

(Ck: candidate itemsets of size k)

(Lk: frequent itemsets of size k)

Apriori—Pseudocode

Ck: Candidate itemset of size k

Lk : frequent itemset of size k

L1 = {frequent items};

for (k = 1; Lk != ∅; k++) do begin

Ck+1 = candidates generated from Lk;

for each transaction t in database do

increment the count of all candidates in Ck+1 that are contained in t

Lk+1 = candidates in Ck+1 with min_support

end

return ∪k Lk;
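
The level-wise loop can be sketched in Python as follows. This is an illustrative sketch rather than the original implementation: the function name `apriori`, the use of an absolute count threshold `min_count`, and the simple pairwise union used to build candidates are all assumptions made to keep the example short.

```python
from itertools import combinations

def apriori(db, min_count):
    """Return {frozenset: count} for every itemset found in at least `min_count` transactions."""
    db = [frozenset(t) for t in db]
    # L1: count single items in one scan.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets that have size k+1 and whose
        # k-subsets are all frequent (the Apriori pruning rule).
        candidates = set()
        for a, b in combinations(level, 2):
            c = a | b
            if len(c) == k + 1 and all(frozenset(s) in level for s in combinations(c, k)):
                candidates.add(c)
        # One database scan counts every surviving candidate.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical four-transaction database used only for illustration.
example_db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(example_db, min_count=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The inner loop that tests every candidate against every transaction is the naive form of support counting; a hash tree is one way to speed up that step.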

How to Generate Candidates?

- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1

insert into Ck

select p.item1, p.item2, …, p.itemk-1, q.itemk-1

from Lk-1 p, Lk-1 q

where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning

forall itemsets c in Ck do

forall (k-1)-subsets s of c do

if (s is not in Lk-1) then delete c from Ck

How to Count Supports of Candidates?

- Why is counting supports of candidates a problem?
- The total number of candidates can be huge
- One transaction may contain many candidates
- Method:
- Candidate itemsets are stored in a hash-tree
- Leaf node of hash-tree contains a list of itemsets and counts
- Interior node contains a hash table
- Subset function: finds all the candidates contained in a transaction

Example of Generating Candidates

- L3={abc, abd, acd, ace, bcd}
- Self-joining: L3*L3
- abcd from abc and abd
- acde from acd and ace
- Pruning:
- acde is removed because ade is not in L3
- C4={abcd}
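
The join and prune steps can be reproduced on this example with a short Python sketch. The function name `apriori_gen` and the representation of itemsets as sorted tuples are assumptions made for illustration.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets given as sorted tuples."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for p, q in combinations(prev, 2):
        # Join step: same first k-2 items, and p's last item precedes q's last item.
        if p[:-1] == q[:-1] and p[-1] < q[-1]:
            c = p + (q[-1],)
            # Prune step: every (k-1)-subset of c must itself be frequent.
            if all(s in prev_set for s in combinations(c, len(c) - 1)):
                candidates.append(c)
    return candidates

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
C4 = apriori_gen(L3)
print(C4)  # [('a', 'b', 'c', 'd')]
```

The join produces abcd and acde; acde is then dropped because its subset ade (and also cde) is missing from L3.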

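Example: Counting Supports of Candidates

- Transaction: 1 2 3 5 6
- Hash function on items: 1, 4, 7 go to one branch; 2, 5, 8 to another; 3, 6, 9 to a third
- Candidate 3-itemsets stored in the leaves of the hash tree: 1 2 4, 1 2 5, 1 3 6, 1 4 5, 1 5 9, 2 3 4, 3 4 5, 3 5 6, 3 5 7, 3 6 7, 3 6 8, 4 5 7, 4 5 8, 5 6 7, 6 8 9
- The subset function splits the transaction recursively while descending the tree: 1 + 2 3 5 6, then 1 2 + 3 5 6 and 1 3 + 5 6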
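
The effect of the subset function on this transaction can be sketched in Python. Note the simplification: instead of a hash tree, the sketch stores the candidate 3-itemsets from the example in a flat dictionary and enumerates the transaction's 3-subsets, which finds the same matches but without the tree's pruning of the search. The name `count_transaction` is illustrative.

```python
from itertools import combinations

# Candidate 3-itemsets taken from the example's hash tree.
candidates = [
    (1, 2, 4), (1, 2, 5), (1, 3, 6), (1, 4, 5), (1, 5, 9),
    (2, 3, 4), (3, 4, 5), (3, 5, 6), (3, 5, 7), (3, 6, 7),
    (3, 6, 8), (4, 5, 7), (4, 5, 8), (5, 6, 7), (6, 8, 9),
]
counts = {c: 0 for c in candidates}  # flat hash table standing in for the hash tree

def count_transaction(transaction, counts, k=3):
    """Increment the count of every candidate k-itemset contained in the transaction."""
    for subset in combinations(sorted(transaction), k):
        if subset in counts:
            counts[subset] += 1

count_transaction([1, 2, 3, 5, 6], counts)
matched = sorted(c for c, n in counts.items() if n > 0)
print(matched)  # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]
```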

Generating Strong Association Rules

- Confidence(A ⇒ B) = Prob(B|A) = support(A ∪ B)/support(A)
- Example: L3 = {2, 3, 5}
- 2 ^ 3 ⇒ 5, confidence = 2/2 = 100%
- 2 ^ 5 ⇒ 3, confidence = 2/3 = 67%
- 3 ^ 5 ⇒ 2, confidence = 2/2 = 100%
- 2 ⇒ 3 ^ 5, confidence = 2/3 = 67%
- 3 ⇒ 2 ^ 5, confidence = 2/3 = 67%
- 5 ⇒ 3 ^ 2, confidence = 2/3 = 67%
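
Rule generation from one frequent itemset can be sketched as below. The transaction database is hypothetical (it does not appear in the slides); it was picked because it yields exactly the counts used in the example. The helper names `count` and `rules_from` are illustrative.

```python
from itertools import combinations

# Hypothetical database (an assumption) consistent with the counts in the example.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(set(itemset) <= t for t in db)

def rules_from(itemset, min_conf):
    """Yield (lhs, rhs, confidence) for every rule lhs => rhs whose two sides partition `itemset`."""
    items = sorted(itemset)
    total = count(items)
    for r in range(1, len(items)):
        for lhs in combinations(items, r):
            rhs = tuple(i for i in items if i not in lhs)
            conf = total / count(lhs)
            if conf >= min_conf:
                yield lhs, rhs, conf

rules = {(lhs, rhs): conf for lhs, rhs, conf in rules_from({2, 3, 5}, min_conf=0.5)}
for (lhs, rhs), conf in rules.items():
    print(lhs, "=>", rhs, f"{conf:.0%}")
```

Raising `min_conf` above 2/3 would leave only the two rules with 100% confidence.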

Efficient Implementation of Apriori in SQL

- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and Implications. In SIGMOD’98
- Implementations based on pure SQL-92
- Impossible to get good performance out of pure SQL based approaches alone
- Make use of object-relational extensions like UDFs, BLOBs, Table functions etc.
- Get orders of magnitude improvement

Improvements of Apriori

- General ideas
- Scan the transaction database in as few passes as possible
- Reduce number of candidates
- Facilitate support counting of candidates

DIC: Reduce Number of Scans

- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97
- Basic idea
- Count the itemsets at the boundary in a lattice
- Push the boundary dynamically
- Use a trie structure to keep track of counters, and reorder items to reduce counting costs

Once all (k-1)-subsets of a k-itemset are frequent, the counting of the k-itemset can begin

Any superset (upper node in the lattice) of an infrequent itemset should not be counted

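Example of DIC

[Figure: itemset lattice over items A, B, C, D, from {} through the 2-itemsets (AB, AC, BC, AD, BD, CD) and 3-itemsets (ABC, ABD, ACD, BCD) up to ABCD, with the counting boundary marked. Along the transaction scan, Apriori counts 1-itemsets, then 2-itemsets, and so on in separate passes, while DIC starts counting 2-itemsets and 3-itemsets partway through a pass.]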

DIC: Pros and Cons

- Number of scans
- Can be reduced in some cases
- But how about non-homogeneous data and high support situations?
- Item reordering
- “Item reordering did not work as well as we had hoped”
- Performance
- 30% gain at low support ends
- 30% loss at high support ends

DHP: Reduce the Number of Candidates

- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95
- Major features
- Efficient generation for candidate itemsets
- Effective reduction on transaction database size

DHP: Efficient Generation for Candidates

- In the k-th pass, count supports for the k-candidates and for the entries in the hash table
- A (k+1)-itemset in Lk*Lk qualifies as a (k+1)-candidate only if it passes the hash filtering, i.e., it is hashed into a hash entry whose value is no less than the support threshold
- Example
- Candidates: a, b, c, d, e
- Hash entries: {ab, ad, ae} {bd, be, de} …
- Frequent 1-itemset: a, b, d, e
- ab is not a candidate 2-itemset if the count of the hash bucket, {ab, ad, ae}, is below support threshold
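
The hash filtering step can be sketched for the first pass as follows. Everything specific here is an assumption made for illustration: the integer item ids, the hash function `(x * 10 + y) % 7`, the bucket count, and the four-transaction database.

```python
from itertools import combinations

def bucket(pair, n_buckets=7):
    # Illustrative hash function for a pair of integer item ids (an assumption).
    x, y = pair
    return (x * 10 + y) % n_buckets

def dhp_candidate_pairs(db, min_count, n_buckets=7):
    """Pass 1 of a DHP-style miner: count items and hash all 2-itemsets into buckets,
    then keep only pairs of frequent items whose bucket count reaches min_count."""
    item_counts = {}
    bucket_counts = [0] * n_buckets
    for t in db:
        items = sorted(t)
        for i in items:
            item_counts[i] = item_counts.get(i, 0) + 1
        for pair in combinations(items, 2):
            bucket_counts[bucket(pair, n_buckets)] += 1
    frequent_items = sorted(i for i, c in item_counts.items() if c >= min_count)
    return [p for p in combinations(frequent_items, 2)
            if bucket_counts[bucket(p, n_buckets)] >= min_count]

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]  # hypothetical database
pairs = dhp_candidate_pairs(db, min_count=2)
print(pairs)  # [(1, 3), (2, 3), (2, 5), (3, 5)]
```

Plain Apriori would form six candidate pairs from the four frequent items; the bucket counts rule out (1, 2) and (1, 5) before the second scan. A bucket count is only an upper bound on the support of each pair hashed into it, so the filter can let infrequent pairs through but never discards a frequent one.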

DHP: Effective Reduction on Database Size

- An item in transaction t can be trimmed if it does not appear in at least k of the candidate k-itemsets in t
- Examples
- Transaction acd can be discarded if only ac is frequent
- Transaction bce must be kept if bc, be, and ce are frequent
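
The trimming rule can be sketched in Python. The function name `trim_transaction` and the decision to return an empty set for a discarded transaction are illustrative assumptions; the second call uses the candidate 2-itemsets bc, be, and ce so that every item of bce appears in two candidates.

```python
def trim_transaction(transaction, candidates, k):
    """Keep only items that occur in at least k of the candidate k-itemsets contained
    in the transaction; return an empty set if fewer than k+1 items remain."""
    t = set(transaction)
    contained = [c for c in candidates if set(c) <= t]
    kept = {i for i in t if sum(i in c for c in contained) >= k}
    # A transaction with fewer than k+1 items cannot support any (k+1)-itemset.
    return kept if len(kept) >= k + 1 else set()

print(trim_transaction("acd", ["ac"], k=2))              # set(): the transaction is discarded
print(trim_transaction("bce", ["bc", "be", "ce"], k=2))  # all of b, c, e are kept
```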

Partition: Scan Database Only Twice

- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB’95
- Mine all frequent itemsets by scanning transaction database only twice

Scan One in Partition

- Divide database into n partitions.
- A global frequent itemset must be frequent in at least one partition.
- Process one partition in main memory at a time, for each partition
- generate local frequent itemsets using the Apriori algorithm
- also form tidlist for all itemsets to facilitate counting in the merge phase
- tidlist: contains the transaction Ids of all transactions that contain the itemset within a given partition

Scan Two in Partition

- Merge local frequent itemsets to generate a set of all potential large itemsets
- Count actual supports
- Support can be computed from the tidlists
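
The two scans can be sketched as below. Two simplifications are assumed and should not be read as the published algorithm: local mining is done by brute-force subset enumeration instead of Apriori, and the second scan recounts candidates against the database instead of intersecting tidlists. The function names and the sample database are hypothetical.

```python
from itertools import combinations

def local_frequent(partition, min_frac):
    """Brute-force local mining (a stand-in for Apriori): itemsets frequent within one partition."""
    counts = {}
    for t in partition:
        for k in range(1, len(t) + 1):
            for s in combinations(sorted(t), k):
                counts[s] = counts.get(s, 0) + 1
    return {s for s, c in counts.items() if c >= min_frac * len(partition)}

def partition_mine(db, n_parts, min_frac):
    """Two-scan mining: local frequent itemsets first, then global verification."""
    size = -(-len(db) // n_parts)  # ceiling division
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Scan 1: the union of local frequent itemsets is the global candidate set.
    candidates = set().union(*(local_frequent(p, min_frac) for p in parts))
    # Scan 2: count the actual global support of every candidate.
    result = {}
    for c in candidates:
        n = sum(set(c) <= t for t in db)
        if n >= min_frac * len(db):
            result[c] = n
    return result

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]  # hypothetical database
frequent = partition_mine(db, n_parts=2, min_frac=0.5)
print(sorted(frequent.items()))
```

The approach is complete because an itemset that is frequent in the whole database must reach the same relative threshold in at least one partition.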

Partition: Pros and Cons

- Achieve both CPU and I/O improvements over Apriori
- The number of distinct local frequent itemsets may be very large
- tidlists to be maintained can be huge

Sampling for Mining Frequent Itemsets

- H. Toivonen. Sampling large databases for association rules. In VLDB’96
- Select a sample of original database, mine frequent itemsets within sample using Apriori
- Scan database once to verify frequent itemsets found in sample; only borders of the closure of frequent itemsets are checked
- Example: check abcd instead of ab, ac, …, etc.
- Scan database again to find missed frequent itemsets

Challenges for the Sampling Method

- How to sample a large database?
- When the support threshold is very low, sampling may not produce good enough results

Incremental Association Mining

- A transaction database and a set of frequent itemsets already mined
- A set of update transactions for the transaction database, including insertions and deletions
- How to update the frequent itemsets for the updated transaction database?

[Figure: a transaction database with its frequent itemsets, plus a set of update transactions; what are the updated frequent itemsets?]

FUP: Incremental Update of Discovered Rules

- D. Cheung, J. Han, V. Ng, and C. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. In ICDE’96
- View a database: original DB ∪ incremental db.
- A k-itemset (for any k) is
- frequent in DB ∪ db if frequent in both DB and db.
- infrequent in DB ∪ db if infrequent in both DB and db.
- For those only frequent in DB, merge corresponding counts in db.
- For those only frequent in db, search DB to update their itemset counts.
- Similar methods can be adopted for data removal and update, or distributed/parallel mining.

Parallel and Distributed Association Mining

- D. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In PDIS 1996
- M. Tamura and M. Kitsuregawa. Dynamic Load Balancing for Parallel Association Rule Mining on Heterogenous PC Cluster Systems. In VLDB 1999
- E. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In SIGMOD’97
- M. Zaki, S. Parthasarathy, and M. Ogihara. Parallel algorithms for discovery of association rules. In Data Mining and Knowledge Discovery. Vol.1 No.4, 1997

Interestingness Measures

- Objective measures

Two popular measurements:

- support; and
- confidence
- Subjective measures (Silberschatz & Tuzhilin, KDD95)

A rule (pattern) is interesting if

- it is unexpected (surprising to the user); and/or
- actionable (the user can do something with it)

Criticism to Support and Confidence

- Example 1: (Aggarwal & Yu, PODS98)
- Among 5000 students
- 3000 play basketball
- 3750 eat cereal
- 2000 both play basketball and eat cereal
- play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
- play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

Criticism to Support and Confidence (Cont.)

- Example 2:
- X and Y: positively correlated,
- X and Z, negatively related
- support and confidence of

X=>Z dominates

Other Interestingness Measures: Interest

- Interest (lift): $\dfrac{P(A \wedge B)}{P(A)\,P(B)}$
- taking both P(A) and P(B) into consideration
- P(A ^ B) = P(B) * P(A), if A and B are independent events
- A and B are negatively correlated if the value is less than 1, and positively correlated if it is greater than 1.
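
Applied to the basketball and cereal counts from the earlier example, the measure can be computed with a short Python sketch. The formula used is the standard definition of lift, P(A and B) / (P(A) P(B)); the function name and argument names are illustrative.

```python
def lift(n_a, n_b, n_ab, n_total):
    """Interest (lift) of A => B: P(A and B) / (P(A) * P(B))."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

# 5000 students: 3000 play basketball, 3750 eat cereal, 2000 do both.
# Hence 1250 do not eat cereal, and 1000 play basketball without eating cereal.
print(round(lift(3000, 3750, 2000, 5000), 3))  # basketball => cereal: 0.889
print(round(lift(3000, 1250, 1000, 5000), 3))  # basketball => not cereal: 1.333
```

The first value is below 1 even though the rule has 66.7% confidence, which is the negative correlation that support and confidence alone fail to show.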

Other Interestingness Measures: Conviction

- Conviction: $\dfrac{P(A)\,P(\neg B)}{P(A \wedge \neg B)}$
- from implication: A ⇒ B ≡ ¬(A ^ ¬B)
- factors in both P(A) and P(B) and has value 1 when the relevant items are completely unrelated (confidence does not)
- rules which hold 100% of the time have the highest possible value (interest does not)

Collective Strength

- Collective strength is a number between 0 and ∞, with 1 as the break-even point:

$$C(I) = \frac{1 - v(I)}{1 - E[v(I)]} \cdot \frac{E[v(I)]}{v(I)}$$

where v(I) is the violation ratio of itemset I. An itemset is said to be in violation of a transaction if some of the items are present in the transaction and others are not. v(I) is equal to the fraction of transactions which contain a proper non-null subset of I

- Recasting collective strength as:

$$C(I) = \frac{\text{Good events}}{E[\text{Good events}]} \cdot \frac{E[\text{Bad events}]}{\text{Bad events}}$$

Collective Strength (2)

- Let I be a set of items {i1, i2, …, ik}. Let pr denote the frequency of the item ir in the database.
- the probability that the itemset I occurs in a transaction is $\prod_{r=1}^{k} p_r$
- the probability that none of the items in I occurs in the transaction is $\prod_{r=1}^{k} (1 - p_r)$
- the expected fraction of transactions that contain at least one item in I, and where at least one item is absent: $E[v(I)] = 1 - \prod_{r=1}^{k} p_r - \prod_{r=1}^{k} (1 - p_r)$
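
A small Python sketch makes the break-even point concrete. It assumes the usual definition C(I) = (1 - v(I)) / (1 - E[v(I)]) × E[v(I)] / v(I), with E[v(I)] computed from the item frequencies under independence; the function name and the sample frequencies are hypothetical.

```python
from math import prod

def collective_strength(item_freqs, violation_rate):
    """Collective strength C(I) from item frequencies p_r and the observed violation rate v(I)."""
    p_all = prod(item_freqs)                  # all items present, assuming independence
    p_none = prod(1 - p for p in item_freqs)  # no item present, assuming independence
    expected_v = 1 - p_all - p_none           # E[v(I)]
    return ((1 - violation_rate) / (1 - expected_v)) * (expected_v / violation_rate)

# Two items, each present in half of the transactions: E[v(I)] = 0.5.
print(collective_strength([0.5, 0.5], violation_rate=0.5))  # 1.0: break-even, as under independence
print(collective_strength([0.5, 0.5], violation_rate=0.2))  # 4.0: fewer violations than expected
```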

Collective Strength (3)

- Example:
- Collective Strength of I {X,Y}:

Summary

- Frequent pattern mining is an important data mining task
- Apriori is an important frequent pattern mining methodology
- A set of Apriori-like mining methods have been developed since 1994
- Interestingness measures are important for discovering interesting rules

Part III: A multi-dimensional view of frequent pattern mining

- Multi-level association
- Multi-dimensional association
- Distance-based association

Why Multiple-Level Association Rules

[Figure: concept hierarchy in which bread and milk are refined into kinds (white and wheat bread; 2% and skim milk) and then into brands (Foremost, Sunset)]

- Items often form a hierarchy
- It is difficult to find strong association rules at the primitive level.
- High-level rules often lead to prior knowledge and expectations.
- Different users have different needs
- It is desirable to mine multi-level association rules.

Mining Multi-level Associations

- A top-down, progressive deepening approach:
- First find high-level strong rules:
- milk ⇒ bread [20%, 60%]
- Then find their lower-level weaker rules:
- 2% milk ⇒ wheat bread [6%, 50%]

Multi-level Association Mining: Using Reduced Support

- Controlled Level-cross filtering by single item
- Introduce Level passing threshold
- Children whose parents pass this threshold are examined.
- Subfrequent items
- Nodes that meet the passing threshold but not their own level threshold.
- Children of these nodes to be examined.
- Two thresholds for each level!

Checking for Redundant Rules

- Some rules may be redundant due to “ancestor” relationships between items.
- Example
- milk ⇒ wheat bread, [support = 8%, confidence = 70%]
- 2% milk ⇒ wheat bread, [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule.
- A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor.
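
The redundancy test can be sketched as a comparison between the observed and the expected support of the descendant rule. The share of 2% milk among all milk (one quarter) and the 10% tolerance are assumptions introduced here for illustration; the slides give neither figure.

```python
def is_redundant(ancestor_support, descendant_support, share_of_parent, tolerance=0.1):
    """A descendant rule is redundant if its support is within `tolerance` (relative)
    of the value expected from its ancestor."""
    expected = ancestor_support * share_of_parent
    return abs(descendant_support - expected) <= tolerance * expected

# Assumption: 2% milk accounts for about one quarter of all milk sold,
# so the expected support of the descendant rule is 8% * 1/4 = 2%.
print(is_redundant(0.08, 0.02, share_of_parent=0.25))  # True: the rule adds no information
print(is_redundant(0.08, 0.05, share_of_parent=0.25))  # False: support is well above expectation
```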

Mining Multi-Dimensional Associations

- Single-dimensional rules (Transaction based)
- Involve a single distinct predicate with >1 occurrences.
- Example: buys(X, “milk”) ⇒ buys(X, “bread”)

- Multi-dimensional rules (Relational Database)
- Involve 2 or more dimensions or predicates
- Inter-dimension association rules (no repeated predicates)
- age(X, “19-25”) ^ occupation(X, “student”) ⇒ buys(X, “coke”)
- Hybrid-dimension association rules (repeated predicates)
- age(X, “19-25”) ^ buys(X, “popcorn”) ⇒ buys(X, “coke”)

Multi-Dimensional Associations

- Categorical Attributes
- have finite number of possible values
- no ordering among values
- Quantitative Attributes
- numeric
- implicit ordering among values

Multi-dimensional association rules often include both types of attributes!

Mining Multi-Dimensional Associations

- How to mine multi-dimensional associations:
- Search for frequent predicatesets.
- k-predicateset:
- A set of k conjunctive predicates.
- Example:{age, occupation, buys} is a 3-predicateset.
- Techniques can be categorized by how quantitative attributes such as age are treated.

Three Techniques for Mining MD Associations

1. Static discretization of quantitative attributes

- Quantitative attributes are statically discretized by using predefined concept hierarchies.

2. Quantitative association rules

- Quantitative attributes are dynamically discretized into “bins”based on the distribution of the data.

3. Distance-based association rules

- This is a dynamic discretization process that considers the distance between data points.

Static Discretization of Quantitative Attributes

[Figure: lattice of cuboids: (age), (income), (buys); (age, income), (age, buys), (income, buys); (age, income, buys)]

- Discretized prior to mining using predefined concept hierarchy.
- Numeric values are replaced by ranges.
- In a relational database, finding all frequent k-predicatesets requires k or k+1 table scans.
- A data cube is well suited for mining.
- The cells of an n-dimensional cuboid correspond to the predicatesets.

Quantitative Association Rules

- Numeric attributes are dynamically discretized
- Such that the confidence or compactness of the rules mined is maximized.
- We will discuss 2-D quantitative association rules: Aquan1 ^ Aquan2 ⇒ Acat

age(X, “30-34”) ^ income(X, “24K - 48K”) ⇒ buys(X, “high resolution TV”)

Techniques for Mining MD Associations

- ARCS (Association Rule Clustering System)
- Cluster “adjacent” association rules to form general rules using a 2-D grid.
- Example:

age(X, “30-34”) ^ income(X, “24K - 48K”) ⇒ buys(X, “high resolution TV”)

Techniques for Mining MD Associations

How does ARCS work?

1. Binning

2. Find frequent predicateset

3. Clustering

4. Optimize

Limitations of ARCS

- Only quantitative attributes on LHS of rules.
- Only 2 attributes on LHS. (2D limitation)
- An alternative to ARCS
- Non-grid-based
- equi-depth binning
- clustering based on a measure of partial completeness.
- “Mining Quantitative Association Rules in Large Relational Tables” by R. Srikant and R. Agrawal.

Mining Distance-based Association Rules

- Binning methods do not capture the semantics of interval data
- Distance-based partitioning gives a more meaningful discretization, considering:
- density/number of points in an interval
- “closeness” of points in an interval

Clusters and Distance Measurements

- S[X] is a set of N tuples t1, t2, …, tN, projected on the attribute set X
- The diameter of S[X]: $d(S[X]) = \dfrac{\sum_{i=1}^{N} \sum_{j=1}^{N} dist_X(t_i[X], t_j[X])}{N(N-1)}$
- distX: distance metric, e.g., Euclidean distance or Manhattan distance

Clusters and Distance Measurements (Cont.)

- The diameter, d, assesses the density of a cluster CX, where $d(C_X) \le d_0^X$ and $|C_X| \ge s_0$ (s0: minimum number of tuples in a cluster)
- Finding clusters and distance-based rules
- the density threshold, d0, replaces the notion of support
- a modified version of the BIRCH clustering algorithm is used
Part IV: Constraint-based frequent pattern mining

- The Apriori algorithm
- Improvements of Apriori
- Incremental, parallel, and distributed methods

Constraint-Based Mining

- Interactive, exploratory mining of gigabytes of data?
- Could it be real? --- Making good use of constraints!
- What kinds of constraints can be used in mining?
- Knowledge type constraint: classification, association, etc.
- Data constraint: SQL-like queries
- Find product pairs sold together in Vancouver in Dec.’98.
- Dimension/level constraints:
- in relevance to region, price, brand, customer category.
- Rule constraints:
- small sales (price < $10) triggers big sales (sum > $200).
- Interestingness constraints:
- strong rules: min_support 3%, min_confidence 60%.

Example of Constrained Association Query

- Constrained Association Query Expressed in DMQL

mine associations as

lives(C, _, “Vancouver”) and sales+(C, ?{I}, {S}) ⇒ sales+(C, ?{J}, {T})

from sales

where S.year = 1998 and T.year = 1998 and I.category = J.category

group by C, I.category

having sum(I.price) < 100 and min (J.price)>=500

with min_support = 0.01 and min_confidence = 0.5

- Possible Association Rule Found

lives(C, _, “Vancouver”) and sales(C, “Census_CD”, _) and sales(C, “MS/Office97”, _) ⇒ sales(C, “MS/SQLServer”, _) [0.015, 0.68]

Two Kinds of Rule Constraints

- Rule form constraints: meta-rule guided mining.
- P(x, y) ^ Q(x, w) ⇒ takes(x, “database systems”)
- Rule (content) constraint: constraint-based query optimization (SIGMOD’98).
- sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000

Metarule-guided Mining of Association rules

- Metarule example:
- P(x, y) ^ income(x, w) ⇒ buys(x, “educational software”)

P: predicate variable, instantiated to attributes

x: variable representing a customer

y, w: value variables assigned to P and the attribute income

- Instantiated to concrete values in the mining
- age(x, “35-45”) ^ income(x, “40-60K”) ⇒ buys(x, “educational software”)

Metarule-guided Mining (Cont.)

- Metarule: rule template of the form
- P1 ^ P2 ^ … ^ Pl ⇒ Q1 ^ Q2 ^ … ^ Qr

Pi (i = 1, …, l) and Qj (j=1,…,r) can be either instantiated predicates or predicate variables

- Typical case of Multidimensional association rules mining
- Data cubes are well-suited for this task
- Abridged n-D cube search approach

only p-D ( p=l+r ) and l-D cuboids need to be examined, rather than entire n-D cube

Rule (content) Constraints

- 1-variable vs. 2-variable constraints (SIGMOD’99):
- 1-variable: A constraint confining only one side (L/R) of the rule,
- e.g., sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000, as shown above.
- 2-variable: A constraint confining both sides (L and R).
- sum(LHS) < min(RHS) ^ max(RHS) < 5* sum(LHS)

Constrained Association Query

- Database: (1) trans (TID, Itemset), (2) itemInfo (Item, Type, Price)
- A constrained association query (CAQ) is in the form of {(S1, S2) | C},
- where C is a set of constraints on S1, S2, including the frequency constraint
- A classification of (single-variable) constraints:
- Class constraint: S ⊂ A, e.g., S ⊂ Item
- Domain constraint:
- S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., S.Price < 100
- v θ S, θ is ∈ or ∉, e.g., snacks ∉ S.Type
- V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}
- e.g., {snacks, sodas} ⊆ S.Type, S.Type ∩ {snacks, sodas} = ∅
- Aggregation constraint: agg(S) θ v, where agg is in {min, max, sum, count, avg}, and θ ∈ {=, ≠, <, ≤, >, ≥}.
- e.g., count(S1.Type) = 1, avg(S2.Price) < 100

Constrained Association Query Optimization Problem

- Given a CAQ = { (S1, S2) | C }, the algorithm should be :
- sound: It only finds frequent sets that satisfy the given constraints C
- complete: All frequent sets that satisfy the given constraints C are found

Constrained Association Query Optimization Problem (Cont.)

- A Naïve Solution :
- Apply Apriori for finding all frequent sets, and then to test them for constraint satisfaction one by one.

--inefficient and sometimes prohibitively expensive

- Exploratory Mining and Pruning Approach (Algorithm CAP):
- Analysis of the properties of constraints: try to push them as deeply as possible inside the frequent set computation.
- Eliminate irrelevant item sets earlier, minimize the number of item sets to be examined

Property of Constraints: Anti-Monotone

- Anti-monotonicity (Downward closed): If a set S violates the constraint, any superset of S violates the constraint.
- Examples:
- min(S.Price) ≥ 100 is anti-monotone
- min(S.Price) ≤ 100 is not anti-monotone
- min(S.Price) = 100 is partly anti-monotone
- Application:
- Push “sum(S.price) ≤ 1000” deeply into iterative frequent set computation.
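
Pushing an anti-monotone constraint into the level-wise search can be sketched as follows. The item names, prices, and the function `extend_with_constraint` are hypothetical; prices are assumed non-negative, which is what makes a sum upper bound anti-monotone. In a full miner this check would run alongside the usual support-based pruning.

```python
def extend_with_constraint(itemsets, items, price, max_sum):
    """Extend each itemset by one item, dropping any extension whose price sum exceeds max_sum.
    Because prices are non-negative, a violating set can never be repaired by adding items."""
    out = set()
    for s in itemsets:
        for i in items:
            if i not in s:
                c = s | {i}
                if sum(price[x] for x in c) <= max_sum:
                    out.add(frozenset(c))
    return out

price = {"tv": 800, "radio": 150, "cable": 30, "laptop": 1200}  # hypothetical prices
level1 = {frozenset([i]) for i in price if price[i] <= 1000}     # laptop is pruned at size 1
level2 = extend_with_constraint(level1, price, price, max_sum=1000)
print(sorted(sorted(s) for s in level2))
```

No set containing the laptop is ever generated or counted, which is the saving compared with filtering the frequent sets after mining.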

Characterization of Anti-Monotonicity Constraints

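| Constraint | Anti-monotone |
|---|---|
| S θ v, θ ∈ {=, ≤, ≥} | yes |
| v ∈ S | no |
| S ⊇ V | no |
| S ⊆ V | yes |
| S = V | partly |
| min(S) ≤ v | no |
| min(S) ≥ v | yes |
| min(S) = v | partly |
| max(S) ≤ v | yes |
| max(S) ≥ v | no |
| max(S) = v | partly |
| count(S) ≤ v | yes |
| count(S) ≥ v | no |
| count(S) = v | partly |
| sum(S) ≤ v | yes |
| sum(S) ≥ v | no |
| sum(S) = v | partly |
| avg(S) θ v, θ ∈ {=, ≤, ≥} | no |
| (frequent constraint) | (yes) |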

Property of Constraints: Succinctness

- Succinctness:
- For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
- Given that A1 is the collection of sets of size 1 satisfying C, any set S satisfying C is based on A1, i.e., it contains a subset belonging to A1
- Example:
- avg(I.Price) θ 100 is not succinct for any comparison θ
- min(J.Price) ≥ 500 is succinct

expressed in the form S1 ∪ S2:

- S1: a subset of the set of all items with price < $500
- S2: a subset of the set of all items with price > $500

Property of Constraints: Succinctness(Cont.)

- Once-and-for-all vs Generate-and-test Paradigm
- Apriori-like algorithm following Generate-and-test Paradigm
- Enumerate all and only those sets that are guaranteed to satisfy the constraint
- Optimization:
- If C is succinct, then C is pre-counting prunable. The satisfaction of the constraint alone is not affected by the iterative support counting.

Characterization of Constraints by Succinctness

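| Constraint | Succinct |
|---|---|
| S θ v, θ ∈ {=, ≤, ≥} | yes |
| v ∈ S | yes |
| S ⊇ V | yes |
| S ⊆ V | yes |
| S = V | yes |
| min(S) ≤ v | yes |
| min(S) ≥ v | yes |
| min(S) = v | yes |
| max(S) ≤ v | yes |
| max(S) ≥ v | yes |
| max(S) = v | yes |
| count(S) ≤ v | weakly |
| count(S) ≥ v | weakly |
| count(S) = v | weakly |
| sum(S) ≤ v | no |
| sum(S) ≥ v | no |
| sum(S) = v | no |
| avg(S) θ v, θ ∈ {=, ≤, ≥} | no |
| (frequent constraint) | (no) |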

Algorithm and Performance

- Algorithm CAP (SIGMOD’98):
- (1) If a constraint is both succinct and anti-monotone, then replace C1 in Apriori by the constraint-set (push in)
- (2) If a constraint is succinct but not anti-monotone, then generate a constrained test set
- (3) If a constraint is anti-monotone but not succinct, then the constraint is pushed in before counting is done
- (4) If none of the above, then induce any weaker constraint C’ and use (1)-(3)
- Performance comparison: ~80 times faster
- victim: Apriori+ (first computing frequent sets, then enforcing constraints)

Constraint-based mining with frequent pattern growth

- Constraint-based frequent pattern mining: view data mining as mining query optimization problem
- Classification of constraints: Our SIGMOD'98 paper: anti-monotone and succinct constraints
- Further classification: monotone, anti-monotone, succinct, convertible, inconvertible
- E.g., avg(S) < v, variance (S) > v are convertible constraints
- Constraint-based mining with frequent pattern growth
- J. Pei and J. Han "Can We Push More Constraints into Frequent Pattern Mining?", Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), August 2000.

Spatial Association

FIND SPATIAL ASSOCIATION RULE DESCRIBING "Golf Course"

FROM Washington_Golf_courses, Washington

WHERE CLOSE_TO(Washington_Golf_courses.Obj, Washington.Obj, "3 km")

AND Washington.CFCC <> "D81"

IN RELEVANCE TO Washington_Golf_courses.Obj, Washington.Obj, CFCC

SET SUPPORT THRESHOLD 0.5

Spatial Associations & Hierarchy of Spatial Relationships

- Spatial association: Association relationship containing spatial predicates, e.g., close_to, intersect, contains, etc.
- Topological relations:
- intersects, overlaps, disjoint, etc.
- Spatial orientations:
- left_of, west_of, under, etc.
- Distance information:
- close_to, within_distance, etc.
- Hierarchy of spatial relationship:
- “g_close_to”: near_by, touch, intersect, contain, etc.
- First search for rough relationship and then refine it.

Example: Spatial Association Rule Mining

- “What kinds of spatial objects are close to each other in B.C.?”
- Kinds of objects: cities, water, forests, usa_boundary, mines, etc.
- Rules mined:
- is_a(x, large_town) ^ intersect(x, highway) ⇒ adjacent_to(x, water). [7%, 85%]
- is_a(x, large_town) ^ adjacent_to(x, georgia_strait) ⇒ close_to(x, u.s.a.). [1%, 78%]
- Mining method: Apriori + multi-level, multi-dimensional association + geo-spatial algorithms (from rough to high precision: multi-resolution, multi-granularity) + constraints (constraint-based mining).

Progressive Refinement Mining of Spatial Association Rules

- Hierarchy of spatial relationship:
- “g_close_to”: near_by, touch, intersect, contain, etc.
- First search for rough relationship and then refine it.
- Two-step mining of spatial association:
- Step 1: rough spatial computation (as a filter)
- Using MBR or R-tree for rough estimation.
- Step 2: detailed spatial algorithm (as refinement)
- Apply only to those objects which have passed the rough spatial association test (no less than min_support)
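The two-step idea can be sketched in Python as a cheap MBR test followed by an exact test on the survivors. This is only an illustration under stated assumptions: objects are lists of (x, y) vertices, `exact_distance` is a simple vertex-to-vertex stand-in for a detailed spatial algorithm, and the min_support check on the rough results is left out. All function names are hypothetical.

```python
from math import hypot

def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a vertex list."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

def mbr_distance(a, b):
    """Smallest distance between two rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return hypot(dx, dy)

def exact_distance(p, q):
    """Smallest vertex-to-vertex distance (stand-in for the detailed algorithm)."""
    return min(hypot(x1 - x2, y1 - y2) for x1, y1 in p for x2, y2 in q)

def close_pairs(objects, others, limit):
    """Step 1: rough MBR filter. Step 2: refine only the pairs that passed."""
    rough = [(a, b) for a in objects for b in others
             if mbr_distance(mbr(objects[a]), mbr(others[b])) <= limit]
    return [(a, b) for a, b in rough
            if exact_distance(objects[a], others[b]) <= limit]
```

Because the MBR distance never exceeds the exact distance, the filter step cannot discard a pair that the refinement step would have accepted.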

Automatic Extraction of Image Content Features

- Allows search by image content like colors, textures, etc.
- Extracted features: window colors and locales, color layout, texture, color histogram, thumbnails
- [Figure: content-based search over a multimedia database; multimedia data mining also draws on keywords and descriptions]

Locales and their Features

- Visual: Color(X, color), Size(X, size), Texture(X, texture), Shape(X, shape)
- H-next-to(X,Y), V-next-to(X,Y), Overlap(X,Y), Include(X,Y)
- Movement: Motion(X, motion), Speed(X, speed)
- Location: Vertical(X, v), Horizontal(X, h)

Mining Multi-Media Association Rules

- Associate color, theme, location (relationship), texture, or even moving information, with multimedia objects.
- Need # of occurrences instead of Boolean existence.
- Two red squares and one blue circle imply the theme “air-show”
- Need spatial relationships
- A blue object on top of a white square object is associated with a brown bottom.
- Need multi-resolution and progressive refinement mining
- It is expensive to explore detailed associations among objects at high resolution
- It is crucial to ensure the completeness of search at multi-resolution space.

Different Resolution Hierarchy

Spatial Relationships from Layout

Property P1 on-top-of Property P2

Property P1 next-to Property P2

From Coarse to Fine Resolution Mining

Progressively mine finer resolutions only on candidate frequent item-sets

Progressive Resolution Refinement

[Figure labels: feature localization, minimum bounding circles, tile size; coarse resolution to fine resolution]

i = 0; D_0 = D;

while (i < maxResLevel) do {

    R_i = {sufficiently frequent item-sets at resolution i}

    i = i + 1; D_i = Filter(D_{i-1}, R_{i-1});

}
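The loop above can be sketched in Python as follows. The sketch makes several assumptions that are not in the slides: each item is a tuple of labels from coarse to fine, such as ("red", "dark"); resolution level i looks at the first i+1 labels; only single items are mined rather than full item-sets; and `Filter` simply drops items whose coarser label was not frequent.

```python
from collections import Counter

def frequent_items(db, level, min_count):
    """R_i: labels truncated to level+1 components, seen in >= min_count transactions."""
    counts = Counter()
    for t in db:
        counts.update({item[:level + 1] for item in t})
    return {label for label, n in counts.items() if n >= min_count}

def refine(db, max_level, min_count):
    """Mine from coarse to fine resolution, filtering the data after each level."""
    results = []
    for level in range(max_level):
        r = frequent_items(db, level, min_count)
        results.append(r)
        # Filter(D_i, R_i): keep only items whose coarse label was frequent.
        db = [{item for item in t if item[:level + 1] in r} for t in db]
    return results
```

Each finer level is mined only on data that survived the coarser level, which is the point of the progressive scheme.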


Based on Jian Pei, Jiawei Han and Yiwen Yin, “Mining Frequent Patterns Without Candidate Generation”, Proc. ACM SIGMOD’2000, May 2000

Apriori: A “light-house” for mining frequent patterns

- Subsequent studies on association mining
- partitioning (’95), sampling (’96), dynamic itemset counting (’97), tree-projection (’99), incremental and parallel algorithms, etc.
- Apriori: a “light-house” for mining other frequent-pattern based knowledge
- Generalized, multi-level, quantitative, clustering, distance-based association
- Sequential patterns, temporal or cyclic associations, partial periodicity
- Correlation and causality

Is Apriori Fast Enough? — Performance Bottlenecks

- The core of the Apriori algorithm:
- Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets
- Use database scan and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
- Huge candidate sets:
- $10^4$ frequent 1-itemsets will generate $10^7$ candidate 2-itemsets
- To discover a frequent pattern of size 100, e.g., $\{a_1, a_2, \ldots, a_{100}\}$, one needs to generate $2^{100} \approx 10^{30}$ candidates.
- Multiple scans of database:
- Needs (n + 1) scans, where n is the length of the longest pattern
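The candidate counts can be checked with a few lines of Python. This is just the arithmetic, assuming candidate 2-itemsets are all unordered pairs of frequent items and the candidates for a size-100 pattern are its non-empty subsets.

```python
from math import comb

# Candidate 2-itemsets from 10^4 frequent 1-itemsets: all unordered pairs.
pairs = comb(10**4, 2)

# Non-empty subsets of a 100-item pattern.
subsets = 2**100 - 1

print(f"{pairs:.1e} candidate 2-itemsets, {subsets:.2e} subsets")
```

The pair count comes out at roughly 5 × 10^7, the same order of magnitude as the figure on the slide.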

Our Approach: Mining Frequent Patterns Without Candidate Generation

- Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
- highly condensed, but complete for frequent pattern mining
- avoid costly database scans
- Develop an efficient, FP-tree-based frequent pattern mining method
- A divide-and-conquer methodology: decompose mining tasks into smaller ones
- Avoid candidate generation: sub-database test only!

How to Construct FP-tree from a Transactional Database?

| TID | Items bought | (Ordered) frequent items |
| --- | --- | --- |
| 100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p} |
| 200 | {a, b, c, f, l, m, o} | {f, c, a, b, m} |
| 300 | {b, f, h, j, o} | {f, b} |
| 400 | {b, c, k, s, p} | {c, b, p} |
| 500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p} |

min_support = 0.5

Steps:

- Scan DB once, find frequent 1-itemset (single item pattern)
- Order frequent items in frequency descending order
- Scan DB again, construct FP-tree

Header Table (item: frequency; each entry also holds the head of the item's node-links): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

FP-tree:

- {} (root)
  - f:4
    - c:3
      - a:3
        - m:2
          - p:2
        - b:1
          - m:1
    - b:1
  - c:1
    - b:1
      - p:1
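The three steps can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: `Node`, `build_fp_tree` and the `tie_break` parameter are hypothetical names, min_support is given as an absolute count, and ties between equally frequent items are broken by the caller-supplied order (alphabetical if none is given).

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count, tie_break=None):
    # Scan 1: count items and keep the frequent ones.
    counts = Counter(i for t in transactions for i in set(t))
    freq = {i: n for i, n in counts.items() if n >= min_count}
    # Order by descending frequency; ties follow tie_break (an assumption).
    tie = list(tie_break) if tie_break else sorted(freq)
    order = sorted(freq, key=lambda i: (-freq[i], tie.index(i)))
    rank = {i: r for r, i in enumerate(order)}
    root = Node(None, None)
    header = {i: [] for i in order}          # item -> list of its nodes (node-links)
    # Scan 2: insert each transaction's frequent items in f-list order.
    for t in transactions:
        node = root
        for i in sorted((i for i in set(t) if i in rank), key=rank.get):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header, freq

# The five transactions from the slide; min_support 0.5 of 5 means count >= 3.
db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
root, header, freq = build_fp_tree(db, 3, tie_break="fcabmp")
```

Passing `tie_break="fcabmp"` reproduces the item order used in the slide; a different tie-breaking rule would give a differently shaped but equally valid tree.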

Benefits of the FP-tree Structure

- Completeness:
- never breaks a long pattern of any transaction
- preserves complete information for frequent pattern mining
- Compactness
- reduce irrelevant information—infrequent items are gone
- frequency descending ordering: more frequent items are more likely to be shared
- never larger than the original database (not counting node-links and count fields)
- Example: For Connect-4 DB, compression ratio could be over 100

Mining Frequent Patterns Using FP-tree

- General idea (divide-and-conquer)
- Recursively grow frequent pattern path using the FP-tree
- Method
- For each item, construct its conditional pattern-base, and then its conditional FP-tree
- Repeat the process on each newly created conditional FP-tree
- Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)

Major Steps to Mine FP-tree

- Construct conditional pattern base for each node in the FP-tree
- Construct conditional FP-tree from each conditional pattern-base
- Recursively mine conditional FP-trees and grow frequent patterns obtained so far
- If the conditional FP-tree contains a single path, simply enumerate all the patterns

Step 1: From FP-tree to Conditional Pattern Base

- Starting at the frequent header table in the FP-tree
- Traverse the FP-tree by following the link of each frequent item
- Accumulate all of transformed prefix paths of that item to form a conditional pattern base

[Figure: the header table and FP-tree from the construction example]

Conditional pattern bases

| Item | Cond. pattern base |
| --- | --- |
| c | f:3 |
| a | fc:3 |
| b | fca:1, f:1, c:1 |
| m | fca:2, fcab:1 |
| p | fcam:2, cb:1 |
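The conditional pattern bases in this example can be reproduced with a short Python sketch. As a simplifying assumption it works directly on the ordered frequent-item lists instead of following node-links in the tree; for this example both routes give the same prefix paths and counts. The function name is hypothetical.

```python
from collections import Counter

# Ordered frequent items of the five example transactions.
ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def conditional_pattern_bases(ordered_transactions):
    """Map each item to a Counter of the prefix paths that precede it."""
    bases = {}
    for t in ordered_transactions:
        for pos, item in enumerate(t):
            if pos:  # an empty prefix contributes nothing
                bases.setdefault(item, Counter())[t[:pos]] += 1
    return bases

bases = conditional_pattern_bases(ordered)
```

Item f never has a non-empty prefix, so it does not appear in `bases` at all.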

Properties of FP-tree for Conditional Pattern Base Construction

- Node-link property
- For any frequent item $a_i$, all the possible frequent patterns that contain $a_i$ can be obtained by following $a_i$'s node-links, starting from $a_i$'s head in the FP-tree header
- Prefix path property
- To calculate the frequent patterns for a node $a_i$ in a path P, only the prefix sub-path of $a_i$ in P needs to be accumulated, and its frequency count should carry the same count as node $a_i$.

Step 2: Construct Conditional FP-tree

- For each pattern-base
- Accumulate the count for each item in the base
- Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree (a single path): {} → f:3 → c:3 → a:3

All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

Mining Frequent Patterns by Creating Conditional Pattern-Bases

| Item | Conditional pattern-base | Conditional FP-tree |
| --- | --- | --- |
| p | {(fcam:2), (cb:1)} | {(c:3)}\|p |
| m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}\|m |
| b | {(fca:1), (f:1), (c:1)} | Empty |
| a | {(fc:3)} | {(f:3, c:3)}\|a |
| c | {(f:3)} | {(f:3)}\|c |
| f | Empty | Empty |

Step 3: Recursively mine the conditional FP-tree

- m-conditional FP-tree: {} → f:3 → c:3 → a:3
- Cond. pattern base of “am”: (fc:3); am-conditional FP-tree: {} → f:3 → c:3
- Cond. pattern base of “cm”: (f:3); cm-conditional FP-tree: {} → f:3
- Cond. pattern base of “cam”: (f:3); cam-conditional FP-tree: {} → f:3

Single FP-tree Path Generation

- Suppose an FP-tree T has a single path P
- The complete set of frequent patterns of T can be generated by enumeration of all the combinations of the sub-paths of P
- Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so all frequent patterns concerning m are m, fm, cm, am, fcm, fam, cam, fcam
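Enumerating the combinations along a single path is short enough to sketch directly. In this hypothetical helper, `path` is the list of (item, count) nodes from the root down, `suffix` is the pattern the tree is conditioned on, and the support of each combination is taken as the smallest count among the chosen nodes.

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    """path: [(item, count), ...] from root to leaf; suffix: (items, count)."""
    suffix_items, suffix_count = suffix
    patterns = {suffix_items: suffix_count}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = "".join(i for i, _ in combo) + suffix_items
            patterns[items] = min(c for _, c in combo)
    return patterns

# m-conditional FP-tree: {} -> f:3 -> c:3 -> a:3
m_patterns = single_path_patterns([("f", 3), ("c", 3), ("a", 3)], ("m", 3))
```

A path of k nodes yields 2^k patterns including the suffix itself, which is why no further recursion is needed once a conditional tree is a single path.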

Principles of Frequent Pattern Growth

- Pattern growth property
- Let $\alpha$ be a frequent itemset in DB, B be $\alpha$'s conditional pattern base, and $\beta$ be an itemset in B. Then $\alpha \cup \beta$ is a frequent itemset in DB iff $\beta$ is frequent in B.
- “abcdef ” is a frequent pattern, if and only if
- “abcde ” is a frequent pattern, and
- “f ” is frequent in the set of transactions containing “abcde ”
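The pattern growth property leads to a compact recursive miner. The sketch below is a simplification, not FP-growth as published: it recurses on conditional pattern bases held as plain lists of (items, count) pairs instead of building FP-trees, and it assumes every entry lists its items in the same global order.

```python
from collections import Counter

def pattern_growth(db, min_count, suffix=()):
    """db: list of (ordered items, count); yields (itemset, support) pairs."""
    counts = Counter()
    for items, n in db:
        for i in set(items):
            counts[i] += n
    for item, support in counts.items():
        if support < min_count:
            continue
        pattern = (item,) + suffix
        yield frozenset(pattern), support
        # Conditional pattern base of `pattern`: whatever precedes `item`.
        base = [(items[:items.index(item)], n) for items, n in db if item in items]
        yield from pattern_growth(base, min_count, pattern)

ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
found = dict(pattern_growth([(t, 1) for t in ordered], 3))
```

Each pattern is grown only inside the conditional base of its suffix, so no candidate is ever generated and then tested against the whole database.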

Why Is Frequent Pattern Growth Fast?

- Our performance study shows
- FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
- Reasoning
- No candidate generation, no candidate test
- Use compact data structure
- Eliminate repeated database scan
- Basic operation is counting and FP-tree building

FP-growth vs. Apriori: Scalability With the Support Threshold

Data set T25I20D10K

FP-growth vs. Apriori: Scalability With Number of Transactions

Data set T25I20D100K (1.5%)

FP-growth vs. Tree-Projection: Scalability with Support Threshold

Data set T25I20D100K

FP-growth vs. Tree-Projection: Scalability with Number of Transactions

Data set T25I20D100K (1%)

Scaling FP-growth in Large Databases: Partition Projection

- Construction of FP-trees for projected databases
- first partition a database into a set of projected DBs
- then construct an FP-tree and mine it for each projected DB
- Parallel projection vs. partition projection
- Parallel projection: project each transaction in parallel to the corresponding projected databases
- Partition projection: project each transaction to only one projected database
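The difference between the two projection schemes can be sketched on the ordered example transactions. The function names are hypothetical, and the later step of partition projection, in which a mined projected database passes its transactions on to the remaining ones, is not shown.

```python
def parallel_projection(ordered_transactions):
    """Copy every transaction into the projected DB of each of its items."""
    proj = {}
    for t in ordered_transactions:
        for pos, item in enumerate(t):
            if pos:  # skip empty prefixes
                proj.setdefault(item, []).append(t[:pos])
    return proj

def partition_projection(ordered_transactions):
    """Send every transaction only to the projected DB of its last item."""
    proj = {}
    for t in ordered_transactions:
        proj.setdefault(t[-1], []).append(t[:-1])
    return proj

ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
par = parallel_projection(ordered)
part = partition_projection(ordered)
```

Parallel projection needs one scan but stores each transaction several times; partition projection stores each transaction once at any moment, at the cost of propagating it later.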

Parallel Projection

- Tran. DB: fcamp, fcabm, fb, cbp, fcamp
- p-proj DB: fcam, cb, fcam
- m-proj DB: fcab, fca, fca
- b-proj DB: fca, f, c
- a-proj DB: fc, fc, fc
- c-proj DB: f, f, f
- f-proj DB: (empty)
- am-proj DB: fc, fc, fc
- cm-proj DB: f, f, f
- …

Projected Database Vs. Conditional Pattern Bases

| Item | Projected databases | Conditional pattern bases |
| --- | --- | --- |
| p | {fcam, cb, fcam} | {fcam:2, cb:1} |
| m | {fca, fcab, fca} | {fca:2, fcab:1} |
| b | {fca, f, c} | {fca:1, f:1, c:1} |
| a | {fc, fc, fc} | {fc:3} |
| c | {f, f, f} | {f:3} |
| f | Empty | Empty |

Partition Projection

- Tran. DB: fcamp, fcabm, fb, cbp, fcamp
- p-proj DB: fcam, cb, fcam
- m-proj DB: fcab, fca, fca
- b-proj DB: f, …
- a-proj DB: fc, …
- c-proj DB: f, …
- f-proj DB: …
- am-proj DB: fc, fc, fc
- cm-proj DB: f, f, f
- …

Scaling FP-growth in Large Databases: Other Techniques

- Construction of a disk-resident FP-tree
- Use B+ tree structure: page header: shared prefix path
- Mining in group accessing mode: exhaust node-links on the same pages before swapping
- Node-link free FP-trees: project prefix sub-paths of all the nodes into corresponding conditional pattern-base
- Materialization of an FP-tree
- Incremental updates of an FP-tree

Summary: Mining frequent patterns without candidate generation

- Apriori, a popular frequent pattern mining algorithm, represents one mining methodology
- An interesting alternative is mining frequent patterns without candidate generation
- With compressed FP-tree structure, and partitioned database mining methodology, the performance of frequent pattern mining can be improved substantially

Based on Jian Pei, Jiawei Han and Runying Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”, Proc. ACM SIGMOD’2000 Data Mining Workshop, May 2000

Mining Frequent Itemsets

- Mining frequent itemsets is essential for many data mining tasks, e.g. association, etc.
- Problem:
- Often generates a large number of frequent itemsets and rules
- Hard to understand or select rules
- Harm efficiency

What Is Frequent Closed Itemset?

Support threshold = 2

Tran. DB: abc, abc, abcd, abcdef, abcdef

| Frequent itemsets | Frequent closed itemsets |
| --- | --- |
| a:5, b:5, c:5, ab:5, ac:5, bc:5, abc:5 | abc:5 |
| abcd:3 | abcd:3 |
| abcde:2, abcdf:2, abcdef:2 | abcdef:2 (max-pattern) |

Rules:

- abc → d (sup=3, conf=60%)
- abcd → ef (sup=2, conf=66%)
- abc → def (sup=2, conf=40%) is redundant!
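The definition can be checked by brute force on this small database. The sketch below is not CLOSET; it simply counts the support of every itemset and keeps those with no proper superset of equal support, which is only feasible for a handful of items.

```python
from itertools import combinations

def frequent_closed_itemsets(transactions, min_count):
    """Brute force: closed means no proper superset has the same support."""
    items = sorted(set().union(*transactions))
    support = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            n = sum(1 for t in transactions if set(combo) <= set(t))
            if n >= min_count:
                support[frozenset(combo)] = n
    return {s: n for s, n in support.items()
            if not any(s < other and n == m for other, m in support.items())}

closed = frequent_closed_itemsets(["abc", "abc", "abcd", "abcdef", "abcdef"], 2)
```

With a support threshold of 2 the result contains exactly abc, abcd and abcdef, from which the support of every other frequent itemset can be derived.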

Why Mining Frequent Closed Itemsets?

- Mining frequent closed itemsets
- Has the same power as mining the complete set of frequent itemsets,
- Substantially reduces redundant rules to be generated
- First proposed by Pasquier et al. in ICDT’99 (Also, Information Systems, Vol.24, No.1, 1999)

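min_sup = 1, min_conf = 50%

TDB: ($a_1a_2 \ldots a_{100}$), ($a_1a_2 \ldots a_{50}$)

- $2^{100}-1$ frequent itemsets: $a_1$, …, $a_{100}$, $a_1a_2$, …, $a_{99}a_{100}$, …, $a_1a_2 \ldots a_{100}$. A tremendous number of association rules!
- 2 frequent closed itemsets: $a_1a_2 \ldots a_{100}$ and $a_1a_2 \ldots a_{50}$
- 1 rule: $a_1a_2 \ldots a_{50} \rightarrow a_{51}a_{52} \ldots a_{100}$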

How to Mine Frequent Closed Itemsets?

- A-Close [PBTL99]
- Using the A-priori framework
- Pruning redundancies in candidates
- Post-processing to generate complete but non-duplicate result
- ChARM [ZaHs00]
- Exploring a vertical data format
- Finding frequent closed itemsets by computing intersections of sets of transaction ids for itemsets
- CLOSET: our method presented here

How CLOSET Works? An Example

| TID | Items |
| --- | --- |
| 10 | a, c, d, e, f |
| 20 | a, b, e |
| 30 | c, e, f |
| 40 | a, c, d, f |
| 50 | c, e, f |

Step 1. Find frequent items

min_sup = 2

List of frequent items in support descending order: f_list = <c:4, e:4, f:4, a:3, d:2>

Divide Search Space

- All frequent closed itemsets can be divided into 5 non-overlapping subsets based on f_list
- The ones containing d
- The ones containing a but no d
- The ones containing f but no a nor d
- The ones containing e but no f, a nor d
- The ones containing only c

Find Frequent Closed Itemsets Containing d

TDB (items ordered by f_list): cefad, ea, cef, cfad, cef

f_list: <c:4, e:4, f:4, a:3, d:2>

- TDB|d (d:2): cefa, cfa. F.C.I.: cfad:2
- TDB|a (a:3): cef, e, cf. F.C.I.: a:3
- TDB|ea (ea:2): c. F.C.I.: ea:2
- TDB|f (f:4): ce:3, c. F.C.I.: cf:4, cef:3
- TDB|e (e:4): c:3. F.C.I.: e:4

Local frequent items: c, f, a

Every transaction having d also contains c, f and a

Find Frequent Closed Itemsets Containing a but No d

- Frequent closed itemsets containing a but no d can be further partitioned into subsets
- Ones having af but no d
- Ones having ae but no d nor f
- Ones having ac but no d, e nor f

[Figure: the projected databases of TDB as before, with these annotations]

- TDB|a (a:3): cef, e, cf. F.C.I.: a:3
- sup(fa) = sup(ca) = sup(cfad): no FCI having fa or ca but no d
- TDB|ea (ea:2): c. F.C.I.: ea:2

Finding Frequent Closed Itemsets

- Other subsets of frequent closed itemsets can be found similarly
- In summary, the set of frequent closed itemsets is {acdf:2, a:3, ae:2, cf:4, cef:3, e:4}
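The summary can be cross-checked with a small closure-based search. This is a swapped-in technique for verification, not the CLOSET algorithm: the closure of an itemset is taken as the intersection of all transactions containing it, and new closed itemsets are found by extending known ones with one more item. The function name is hypothetical and min_count is assumed to be at least 1.

```python
def closed_itemsets(transactions, min_count):
    """Enumerate frequent closed itemsets via closures (not CLOSET itself)."""
    items = sorted(set().union(*transactions))
    found = {}
    stack = [frozenset([i]) for i in items]
    while stack:
        x = stack.pop()
        cover = [set(t) for t in transactions if x <= set(t)]
        if len(cover) < min_count:
            continue
        # Closure: the items shared by every transaction that contains x.
        c = frozenset(set.intersection(*cover))
        if c not in found:
            found[c] = len(cover)
            stack.extend(c | {i} for i in items if i not in c)
    return found

tdb = ["acdef", "abe", "cef", "acdf", "cef"]
result = closed_itemsets(tdb, 2)
```

For example, the closure of {d} is acdf, which is the observation that every transaction having d also contains c, f and a.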

Optimization Techniques

- Using FP-tree structure:
- Compress transactional & conditional databases
- If the support of itemset Y equals the number of transactions in the X-conditional database, then XY is a potential frequent closed itemset

- Extract frequent closed itemsets from single prefix-path of FP-tree (see next)
- Prune based on sub-pattern match

Optimization: Handle Single Prefix Path

- Benefits
- Directly identify frequent closed itemsets
- Reduce the size of the remaining FP-tree to be examined
- Reduce the levels of recursions

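[Figure: a single prefix path root → a:7 → b:7 → c:7 → d:5 → e:4 → f:4, from which the frequent closed itemsets abc:7, abcd:5 and abcdef:4 are identified directly]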

Scaling CLOSET in Large Database

- Using projected databases
- Partition-based projection

[Figure: TDB (cefad, ea, cef, cfad, cef) with f_list <c:4, e:4, f:4, a:3, d:2>, projected into TDB|d, TDB|a, TDB|ea, TDB|f and TDB|e as in the example]

Performance Study

- Test takers
- A-Close
- ChARM
- CLOSET
- Datasets
- Synthetic dataset T25I20D100k with 10k items
- Connect-4
- Pumsb

Compactness of Frequent Closed Itemsets

- Example: Dataset Connect-4

| Support | #FCI | #FI | #FI/#FCI |
| --- | --- | --- | --- |
| 64179 (95%) | 812 | 2205 | 2.72 |
| 60801 (90%) | 3486 | 27127 | 7.78 |
| 54046 (80%) | 15107 | 533975 | 35.35 |
| 47290 (70%) | 35875 | 4129839 | 115.12 |

Mining Max-patterns

- Max-pattern: a frequent pattern with no proper frequent super-pattern
- MaxMiner approach: Looking ahead
- R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD’98
- In the kth pass, check candidate k-itemsets and the longest patterns that are potentially frequent
- Meets difficulties when the frequent item set is large and in hybrid situations

Example of MaxMiner

- Suppose a, b, c, d are the frequent 1-items
- In the 2nd scan, check the counts of
- ab, ac, ad, abcd
- bc, bd, bcd
- cd
- Benefits
- Once a long pattern is found, no counting for sub-pattern is needed
- But …
- If $a_1, a_2, \ldots, a_{1000}$ are frequent 1-items, what will speculating on counting $a_1a_2 \ldots a_{1000}$ return?

The groups of itemsets checked together in the 2nd scan are called candidate groups

Summary: CLOSET and MaxSet

- CLOSET is an FP-tree-based database projection method for efficient mining of frequent closed itemsets in large databases
- Applying FP-tree structure
- Developing techniques to identify frequent closed itemsets quickly
- Exploring a partition-based projection mechanism for scalable mining
- CLOSET can be extended straightforwardly to mine max-patterns

Based on J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining", Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.

What Is Sequential Pattern Mining?

- Given a set of sequences, find all of the frequent subsequences

A sequence database (each parenthesized group, e.g. (bd), is one element of a sequence):

| Seq. ID | Sequence |
| --- | --- |
| 10 | <(bd)cb(ac)> |
| 20 | <(bf)(ce)b(fg)> |
| 30 | <(ah)(bf)abf> |
| 40 | <(be)(ce)d> |
| 50 | <a(bd)bcb(ade)> |

<ad(ae)> is a subsequence of <a(bd)bcb(ade)>

Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern
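The subsequence relation and the support count can be sketched in Python. As an assumed encoding, a sequence is a list of strings with one string per element, so <(bd)cb(ac)> becomes ["bd", "c", "b", "ac"]; the function names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """Each pattern element must be a subset of a distinct, later sequence element."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, db):
    """Number of sequences in db that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in db)

sdb = [
    ["bd", "c", "b", "ac"],            # 10
    ["bf", "ce", "b", "fg"],           # 20
    ["ah", "bf", "a", "b", "f"],       # 30
    ["be", "ce", "d"],                 # 40
    ["a", "bd", "b", "c", "b", "ade"], # 50
]
```

Matching each pattern element greedily against the earliest possible sequence element is sufficient to decide containment.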

Why Sequential Pattern Mining?

- An important data mining research problem with broad applications
- Analysis of customer purchase patterns
- DNA analysis
- Scientific processes
- Consequences of natural disasters
- Web log mining
- Etc.

How to Mine Sequential Patterns: Conventional Apriori-Like Methods

- Apriori heuristic: Any super-sequence of an infrequent sequence can never be frequent!
- Apriori-like method: a multi-pass, candidate-generation-and-test approach
- Candidate generation: generate length-k candidate sequences from length-(k-1) sequential patterns
- A length-k sequence becomes a candidate iff all of its length-(k-1) subsequences are frequent
- Test: test the support of candidate sequences by scanning the sequence database

GSP: An Apriori-Like Method

- The number of scans is at least the maximum length of seq. patterns
- Though it benefits from Apriori pruning, it still generates a good number of candidates
- Some candidates do not appear in the database at all

Scans on the example sequence database:

- 1st scan: 8 cand., 6 length-1 seq. pat. Candidates: <a> <b> <c> <d> <e> <f> <g> <h>
- 2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all. Candidates: <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
- 3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all. Candidates: <abb> <aab> <aba> <baa> <bab> …
- 4th scan: 8 cand., 6 length-4 seq. pat. Candidates: <abba> <(bd)bc> …
- 5th scan: 1 cand., 1 length-5 seq. pat. Candidate: <(bd)cba>

[Figure callouts: some candidates cannot pass the support threshold; some candidates are not in the DB at all]

Bottlenecks of Apriori-Like Methods

- A huge set of candidates could be generated
- 1,000 frequent length-1 sequences generate $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$ length-2 candidates!
- Many scans of database in mining
- Encounter difficulty when mining long sequential patterns
- Exponential number of short candidates
- A length-100 sequential pattern needs $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$ candidate sequences!
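The length-2 candidate count can be checked with a short calculation. It assumes that n frequent items give n × n candidates of the form <xy> (x and y may be equal) plus n(n−1)/2 candidates of the form <(xy)>; the function name is hypothetical.

```python
def length2_candidates(n):
    """n*n ordered candidates <xy> plus n*(n-1)/2 unordered candidates <(xy)>."""
    return n * n + n * (n - 1) // 2

print(length2_candidates(6), length2_candidates(1000))
```

For the 6 frequent items of the GSP example this gives the 51 candidates reported for the 2nd scan, and for 1,000 frequent items it gives about 1.5 million.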

Can we extend FP-growth to sequential pattern mining?

- Frequent pattern tree and FP-growth (SIGMOD'2000):
- A successful algorithm for mining frequent (unordered) itemsets
- Can we extend FP-growth to sequential pattern mining?
- A straightforward construction of sequential-pattern tree does not work well.
- A level-by-level projection does not achieve high performance either
- An interesting method is to explore alternative-level projection
- Our performance study shows very good performance on it.
- J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining", Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining(KDD'00), August 2000.

FreeSpan: Frequent Pattern-projected Sequential Pattern Mining

- A divide-and-conquer approach
- Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns
- Mine each projected database to find its patterns
- Two alternatives of database projections
- Level-by-level projection
- Alternative-level projection

Divide-and-Conquer: Mapping Into Projected Sequence Databases

- Find frequent items from database
- List of frequent items in support descending order is called f_list
- All sequential patterns can be divided into several subsets without overlap

Sequence Database SDB

< (bd) c b (ac) >

< (bf) (ce) b (fg) >

< (ah) (bf) a b f >

< (be) (ce) d >

< a (bd) b c b (ade) >

f_list: b:5, c:4, a:3, d:3, e:3, f:2

All seq. pat. can be divided into 6 subsets:

- Seq. pat. containing item f
- Those containing e but no f
- Those containing d but no e nor f
- Those containing a but no d, e or f
- Those containing c but no a, d, e or f
- Those containing only item b

Mine Sequential Patterns Using Projected Databases

- The complete set of sequential patterns containing item i but no items following i in f_list can be found in the i-projected database
- A sequence s is projected as $s_i$ to the i-projected database if s contains at least one occurrence of item i
- $s_i$ is a copy of s with all infrequent items, and every frequent item j that follows i in f_list, removed
- Example: <(ah)(bf)abf> is projected to f-projected database as <a(bf)abf>, and to a-projected database as <abab>, and to b-projected database as <bb>
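The projection rule can be sketched as a small function. The list-of-strings encoding of a sequence (one string per element) and the name `project` are assumptions made for the illustration; elements that become empty after removal are dropped.

```python
F_LIST = ["b", "c", "a", "d", "e", "f"]   # frequent items, support descending

def project(sequence, item, f_list=F_LIST):
    """Return s_i for the item-projected database, or None if item is absent."""
    if not any(item in element for element in sequence):
        return None
    # Keep the item itself and the frequent items that precede it in f_list.
    keep = set(f_list[:f_list.index(item) + 1])
    projected = ("".join(x for x in element if x in keep) for element in sequence)
    return [e for e in projected if e]

s = ["ah", "bf", "a", "b", "f"]           # <(ah)(bf)abf>
```

Calling `project(s, "f")`, `project(s, "a")` and `project(s, "b")` yields the three projections named in the example.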

Parallel vs. Partition Projection

- Parallel projection
- Scan database once, form all projected dbs at a time
- May derive many and rather large projected dbs if sequence on average contains many frequent items
- Partition projection
- Project a sequence to the projected database of the last frequent item in it
- “Propagate” sequences on-the-fly

Example of Database Projection

Seq. DB: …, < (bf) (ce) b (fg) >, < (ah) (bf) a b f >, …

[Figure: partition projection and parallel projection of these sequences, showing the following projected databases]

- f-proj db: <(bf)(ce)bf>, <a(bf)abf>
- e-proj db: <b(ce)b>, …
- d-proj db: …
- a-proj db: …, <abab>

Mining Sequential Patterns by Database Projection

Sequence database: < (bd) c b (ac) >, < (bf) (ce) b (fg) >, < (ah) (bf) a b f >, < (be) (ce) d >, < a (bd) b c b (ade) >

- Freq. items: b:5, c:4, a:3, d:3, e:3, f:2
- Seq. patterns: <b>, <c>, <a>, <d>, <e>, <f>

f-projected database: < (bf) (ce) b f >, < a (bf) a b f >

- Freq. items: b, f
- Seq. patterns: <bf>, <fb>, <(bf)>, <ff>, <(bf)f>, <fbf>
- One more scan for {b, b, f}: <(bf)b>, <bbf>, <(bf)bf>

Procedure:

- When projecting the f-projected DB
- Find local freq. items other than f
- Seq. pat. of multiple f’s
- Scan f-proj. db once
- Find seq. pat. containing {f, b}, including multiple f’s
- Count multiple b’s
- One more scan finds seq. pat. containing two b’s and one f

[Figure: projection tree with branches labelled f, e, d, a, c and b, leading to the e-, d-, a-, c- and b-projected databases]

Mining by Level-by-level Projected Databases

- Algorithm
- Scan database once, find frequent items and get f_list
- Recursively do database projection level by level
- Pros and cons
- Benefits: only need to find frequent items in each projected database, instead of exploring candidate sequence generation
- The number of combinations actually explored is much smaller than the number of possible combinations
- Cost: partition and projection of databases
- Works well in sparse databases

Mining by Alternative Level Projected Databases

- Postpone the generation of projected databases, take each database as a level-shared, combined projected database
- Algorithm
- Scan database, find freq. items and get f_list
- Recursively do alternative-level projection
- Construct frequent item matrix
- Generate length-2 sequential patterns and annotations on item repeating patterns and projected databases
- Scan database to generate item-repeating patterns and projected databases

Frequent Item Matrix

- A triangular matrix F[j, k], where 1<=j<=m and 1<=k<=j, m is the number of frequent items
- F[j, j] has only one counter, recording the appearance of sequence <jj>
- F[j,k] has 3 counters (A, B, C)
- A records patterns <jk>
- B records patterns <kj>
- C records patterns <(jk)>
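The counters can be reproduced on the running example with a brute-force sketch. It recounts each pattern separately instead of filling the matrix in one scan, uses an assumed list-of-strings encoding (one string per element), and returns the triple as (<xy>, <yx>, <(xy)>) with x taken as the item that comes earlier in f_list, which appears to be the ordering of the example entries such as (3, 2, 0) for b and a. Function names are hypothetical.

```python
def contains(sequence, pattern):
    """True if pattern (a list of elements) is a subsequence of sequence."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def count(pattern, db):
    return sum(contains(s, pattern) for s in db)

def matrix_entry(x, y, db):
    """One counter for <xx> on the diagonal, else (<xy>, <yx>, <(xy)>)."""
    if x == y:
        return count([x, x], db)
    return (count([x, y], db), count([y, x], db), count([x + y], db))

sdb = [
    ["bd", "c", "b", "ac"],
    ["bf", "ce", "b", "fg"],
    ["ah", "bf", "a", "b", "f"],
    ["be", "ce", "d"],
    ["a", "bd", "b", "c", "b", "ade"],
]
```

With min_sup = 2, the entry for b and a then yields the length-2 patterns <ba>:3 and <ab>:2, while <(ab)> is dropped.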

Frequent Item Matrix: Example

Sequence Database SDB: < (bd) c b (ac) >, < (bf) (ce) b (fg) >, < (ah) (bf) a b f >, < (be) (ce) d >, < a (bd) b c b (ade) >

|   | b | c | a | d | e | f |
| --- | --- | --- | --- | --- | --- | --- |
| b | 4 |  |  |  |  |  |
| c | (4, 3, 0) | 1 |  |  |  |  |
| a | (3, 2, 0) | (2, 1, 1) | 2 |  |  |  |
| d | (2, 2, 2) | (2, 2, 0) | (1, 2, 1) | 1 |  |  |
| e | (3, 1, 1) | (1, 1, 2) | (1, 0, 1) | (1, 1, 1) | 1 |  |
| f | (2, 2, 2) | (1, 1, 0) | (1, 1, 0) | (0, 0, 0) | (1, 1, 0) | 2 |

Generate Length-2 Sequential Patterns

- For each counter, if the value in the counter is no less than min_sup, output the corresponding sequential pattern
- Example: the counters (3, 2, 0) for items b and a generate <ba>:3 and <ab>:2

Generating Annotations on Item-repeating Patterns

- For row j
- If f[j, j]>=min_sup, generate <jj+>
- The count of <jjj>, <jjjj>, … should be registered in the next round
- For a column i<>j, if f[i, i]>=min_sup, generate i+
- There are potentially more than one i appearing in the sequential pattern
- If f[j, j]>=min_sup, generate j+
- If only one of the three counters of f[i, j] is frequent, sequence is used as the annotation; Otherwise, set is used
- Enhance string filtering

Generating Annotation on Item-repeating Patterns: Example

- Generate <b+e>
- Generate {b+f+}

Generating Annotations on Projected Databases

- For row j
- For each i<j, if f[i, j], f[j, k] and f[i, k] (k<i) may form a pattern generating triple, k should be added to i’s projected column set
- If there is a choice between sequence or set, sequence is preferred
- Example: generate <(ce)>:{b}, indicating that the <(ce)>-projected database is generated with {b} as the only item included

Generate Length-2 Patterns and Annotations

| Item | Length-2 seq. pat. | Ann. on rep. items | Ann. on proj. DBs |
| --- | --- | --- | --- |
| f | <bf>:2, <fb>:2, <(bf)>:2 | {b+f+} | None |
| e | <be>:3, <(ce)>:2 | <b+e> | <(ce)>:{b} |
| d | <bd>:2, <db>:2, <(bd)>:2, <cd>:2, <dc>:2, <da>:2 | {b+d}, <da+> | <da>:{b,c}, {cd}:{b} |
| a | <ba>:3, <ab>:2, <ca>:2, <aa>:2 | <aa+>, {a+b+}, <ca+> | <ca>:{b} |
| c | <bc>:4, <cb>:3 | {b+c} | None |
| b | <bb>:4 | <bb+> | None |

Projected Databases and Sequential Patterns

| Annotation | Proj. DB | Seq. pat. |
| --- | --- | --- |
| <(ce)>:{b} | <b(ce)b>, <b(ce)> | <b(ce)>:2 |
| <da>:{b, c} | <(bd)cb(ac)>, <(bd)bcba> | <(bd)a>:2, <dca>:2, <dba>:2, <(bd)ca>:2, <(bd)ba>:2, <dcba>:2, <(bd)cba>:2 |
| {cd}:{b} | <(bd)cbc>, <bcd>, <(bd)bcbd> | <bcd>:2, <(bd)c>:2, <dcb>:2, <(bd)cb>:2, <(bd)bc>:2 |
| <ca>:{b} | <bcba>, <bbcba> | <bca>:2, <cba>:2, <bcba>:2 |

Performance Study

- Data sets: generated by the procedure described in [sa96b]
- #Items: 10,000
- Test takers
- GSP
- Improved GSP: using pattern growth to find length-2 sequential patterns
- Freespan1: level-by-level projection
- FreeSpan: alternative level projection

Scalability With the Number of Sequences

- FreeSpan and improved GSP are tested
- Both algorithms are linearly scalable
- FreeSpan is much better

Why FreeSpan Outperforms Apriori-like Methods?

- Projects a large sequence database recursively into a set of small projected sequence databases based on the currently mined frequent sets
- The alternative-level projection in FreeSpan reduces the cost of scanning multiple projected databases and takes advantage of Apriori-like 3-way candidate filtering

Summary: FreeSpan

- FreeSpan: An interesting, scalable and efficient sequential pattern mining method
- Related work and future directions
- Mining other kinds of time-related knowledge in the spirit of FreeSpan
- Efficiently mining long sequential patterns, such as DNA analysis
- Constraint-based sequential pattern mining

Research Issues in Frequent Pattern Mining

- FP-tree based mining method
- Quantitative association analysis
- Collaborative filtering using fascicles and FP-tree based method
- Max-patterns, multi-level, multi-dimensional frequent pattern mining methods
- Sequential pattern mining methods
- Mining partial periodicity and max-sequential patterns in the spirit of FreeSpan
- DNA sequential pattern mining
- Constraint-based sequential pattern mining

References

- R. Agarwal, C. Aggarwal and V.V.V. Prasad. A tree projection algorithm for generation of frequent itemsets. In Journal of Parallel and Distributed Computing, (to appear), 2000
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. VLDB’94, Chile, September 1994
- R. Agrawal and R. Srikant “Mining sequential patterns”, In Proc. ICDE’95, Taiwan, March 1995.
- R.J. Bayardo. Efficiently mining long patterns from databases. In Proc. SIGMOD’98, WA, June 1998
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97
- D. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In PDIS’96
- D. Cheung, J. Han, V. Ng, and C. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. In ICDE’96

- E. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In SIGMOD’97
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
- J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining", In Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
- J. Han, J. Pei, and Y. Yin: “Mining frequent patterns without candidate generation”. In ACM-SIGMOD’2000, Dallas, TX, May 2000.
- H. Mannila, H. Toivonen and A.I. Verkamo. Efficient algorithms for discovering association rules. In Proc. KDD’94, WA, July 1994
- H. Miller and J. Han (eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis, 2000.
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95

- N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. ICDT’99, Israel, January 1999.
- N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. In Information Systems, Vol. 24, No. 1, 1999
- J. Pei and J. Han, "Can We Push More Constraints into Frequent Pattern Mining?", In Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
- J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB’95
- R. Srikant and R. Agrawal. “Mining sequential patterns: Generalizations and performance improvements”, In Proc. EDBT’96, France, March 1996.

- M. Tamura and M. Kitsuregawa. Dynamic Load Balancing for Parallel Association Rule Mining on Heterogeneous PC Cluster Systems. In VLDB 1999
- H. Toivonen. Sampling large databases for association rules. In VLDB’96
- M.J. Zaki and C. Hsiao. ChARM: An efficient algorithm for closed association rule mining. Tech. Rep. 99-10, Computer Science, Rensselaer Polytechnic Institute, 1999.
- M. Zaki, S. Parthasarathy, and M. Ogihara. Parallel algorithms for discovery of association rules. In Data Mining and Knowledge Discovery Vol. 1. No.4, 1997
- O. R. Zaïane, J. Han, and H. Zhu, "Mining Recurrent Items in Multimedia with Progressive Resolution Refinement", Proc. 2000 Int. Conf. Data Engineering (ICDE'00), San Diego, CA, March 2000.

http://db.cs.sfu.ca/

Thank you !!!
