Mining Sequential Patterns: Generalizations and Performance Improvements

Mining Sequential Patterns: Generalizations and Performance Improvements • R. Srikant, R. Agrawal • IBM Almaden Research Center • Advisor: Dr. Hsu • Presented by: M.H. Lin

Outline • Motivation • Objective • Introduction • Problem Statement • The New Algorithm: GSP • Performance Evaluation • Conclusion • Personal Opinion

Motivation • The problem of mining sequential patterns was recently introduced. • Limitations of AprioriAll [Agrawal, 1995]: • Absence of time constraints • Rigid definition of a transaction • Absence of taxonomies

Objective • We present GSP, a new algorithm that discovers these generalized sequential patterns. • We empirically compare the performance of GSP with that of the AprioriAll algorithm.

Introduction • Instance: • A database of sequences, called data-sequences • Each sequence is a list of transactions ordered by transaction-time • Each transaction is a set of items • Definitions: • A sequential pattern consists of a list of itemsets • Support: the number of data-sequences that contain the pattern • Problem: • To discover all the sequential patterns with a user-specified minimum support

Example Of A Sequential Pattern • Database of a book club: each data-sequence corresponds to all the book selections of a given customer, and each transaction contains the books selected by that customer in one order • A sequential pattern: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’

Features of A Sequential Pattern • E.g.: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’ • Maximum and/or minimum time gaps between adjacent elements • E.g.: the time between buying ‘Foundation’ and then buying ‘Foundation and Empire’ and ‘Ringworld’ should be within 3 months • A sliding time window over the sequence-pattern elements • E.g.: with a window of one week, a customer buys BK-a on Monday, BK-b on Saturday, and BK-c on the next Sunday • This data-sequence supports the pattern “BK-a and BK-b, then BK-c” • User-defined taxonomies • Example coming soon…

A User-defined Taxonomy • A customer who bought Foundation, then Perfect Spy, would support the following patterns: • Foundation, then Perfect Spy • Asimov, then Perfect Spy • Science Fiction, then Le Carre • …

The Old Algorithm: AprioriAll • A 3-phase algorithm • Phase 1: finds all frequent itemsets with minimum support • Phase 2: transforms the database so that each transaction contains only the frequent itemsets • Phase 3: finds sequential patterns • Pros: • Can discover all frequent sequential patterns • Cons: • Computationally expensive in both space and time • Not feasible to incorporate sliding windows

Problem Statement • Definitions: • Let I = {i1, i2, …, im} be a set of literals, called items • Let T be a directed acyclic graph on the literals. • An itemset is a non-empty set of items • A sequence is an ordered list of itemsets • We denote a sequence s by <s1 s2 … sn>, where sj is an itemset. • We denote an element of a sequence by (x1, x2, …, xm), where xj is an item. • A sequence <a1 a2 … an> is a subsequence of another sequence <b1 b2 … bm> if there exist integers i1 < i2 < … < in such that a1 ⊆ b_{i1}, a2 ⊆ b_{i2}, …, an ⊆ b_{in}. • E.g.: <(3)(4,5)(8)> is a subsequence of <(7)(3,8)(9)(4,5,6)(8)> • E.g.: <(3)(5)> is not a subsequence of <(3,5)>
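
The subsequence definition and the support measure can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the names `is_subsequence` and `support` are made up here, and sequences are assumed to be lists of Python sets. Matching each pattern element against the earliest possible element of the other sequence is enough to decide the subsequence relation.

```python
from typing import List, Set

Sequence = List[Set[int]]  # assumed representation: ordered list of itemsets


def is_subsequence(a: Sequence, b: Sequence) -> bool:
    """Return True if a is a subsequence of b (each a_j is a subset of some b_ij, i1 < i2 < ...)."""
    pos = 0  # next position in b that may still be used
    for element in a:
        # Skip elements of b that do not contain this element of a.
        while pos < len(b) and not element <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next element of a must match strictly later in b
    return True


def support(pattern: Sequence, database: List[Sequence]) -> int:
    """Number of data-sequences that contain the pattern (no taxonomy, no time constraints)."""
    return sum(1 for d in database if is_subsequence(pattern, d))
```

With the two examples from the slide, `is_subsequence([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}])` holds, while `is_subsequence([{3}, {5}], [{3, 5}])` does not.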

Problem Statement (contd.) • A data-sequence contains a sequence s if s is a subsequence of the data-sequence. • Plus taxonomies: • A transaction T contains an item x ∈ I if x is in T or x is an ancestor of some item in T. • Plus sliding windows: • A data-sequence d = <d1 … dm> contains a sequence s = <s1 … sn> if there exist integers l1 ≤ u1 < l2 ≤ u2 < … < ln ≤ un such that • 1. si is contained in the union of the transactions d_{li}, …, d_{ui}, 1 ≤ i ≤ n, and • 2. transaction-time(d_{ui}) – transaction-time(d_{li}) ≤ window-size, 1 ≤ i ≤ n • Plus time constraints: • 3. transaction-time(d_{li}) – transaction-time(d_{u(i-1)}) > min-gap, 2 ≤ i ≤ n, and • 4. transaction-time(d_{ui}) – transaction-time(d_{l(i-1)}) ≤ max-gap, 2 ≤ i ≤ n.
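
The four conditions above can be checked directly with a small backtracking search. The sketch below is a brute-force illustration of the definition, not the counting procedure GSP uses; the function name `contains`, the `(time, itemset)` representation, and the day numbers in the usage note are assumptions made for this example.

```python
from typing import List, Optional, Set, Tuple

Transaction = Tuple[float, Set[str]]  # assumed: (transaction-time, items), sorted by time


def contains(data: List[Transaction], pattern: List[Set[str]],
             window: float = 0, min_gap: float = 0,
             max_gap: float = float("inf")) -> bool:
    """Brute-force check of conditions 1-4 by trying every choice of l_i <= u_i."""
    times = [t for t, _ in data]

    def search(i: int, prev_l: Optional[int], prev_u: Optional[int]) -> bool:
        if i == len(pattern):
            return True
        start = 0 if prev_u is None else prev_u + 1  # enforces u_(i-1) < l_i
        for l in range(start, len(data)):
            # Condition 3: time(d_li) - time(d_u(i-1)) > min-gap
            if prev_u is not None and times[l] - times[prev_u] <= min_gap:
                continue
            covered: Set[str] = set()
            for u in range(l, len(data)):
                # Condition 2: time(d_ui) - time(d_li) <= window-size
                if times[u] - times[l] > window:
                    break
                # Condition 4: time(d_ui) - time(d_l(i-1)) <= max-gap
                if prev_l is not None and times[u] - times[prev_l] > max_gap:
                    break
                covered |= data[u][1]
                # Condition 1: s_i is contained in the union of d_li .. d_ui
                if pattern[i] <= covered and search(i + 1, l, u):
                    return True
        return False

    return search(0, None, None)
```

For instance, taking Monday as day 1, Saturday as day 6, and the next Sunday as day 14, the data-sequence `[(1, {"BK-a"}), (6, {"BK-b"}), (14, {"BK-c"})]` contains the pattern `[{"BK-a", "BK-b"}, {"BK-c"}]` with `window=7`, but not with `window=0`.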

Problem Definition • Input: • Database D: data-sequences • Taxonomy T: a DAG, not necessarily a tree • User-specified min-gap and max-gap time constraints • A user-specified sliding-window size • A user-specified minimum support • Goal: • To find all sequences whose support is greater than the user-specified minimum support

Example • Minimum support: 2 data-sequences • With AprioriAll: • <(Ringworld)(Ringworld Engineers)> • A sliding window of 7 days adds the pattern • <(Foundation, Ringworld)(Ringworld Engineers)> • With a max-gap of 30 days: • both patterns are dropped • Adding the taxonomy, with no sliding window or time constraints, adds one pattern: • <(Foundation)(Asimov)>

GSP: Basic Structure • Phase 1: makes the first pass over the database • To yield all the 1-element frequent sequences • Phase 2: the kth pass: • starts with the seed set found in the (k-1)th pass and uses it to generate candidate sequences, each of which has one more item than a seed sequence; • makes a new pass over D to find the support of these candidate sequences • The frequent candidates become the seed for the next pass • Phase 3: terminates when • no more frequent sequences are found, or • no candidate sequences are generated

GSP: implementation • Generating Candidates: • To generate as few candidates as possible while maintaining completeness • Counting Candidates: • To determine the candidate sequence’s support • Implementing Taxonomies

Candidate Generation • Definitions: • k-sequence: a sequence with k items • Lk: the set of frequent k-sequences • Ck: the set of candidate k-sequences • Goal: given the set of all frequent (k-1)-sequences, generate a candidate set that contains all frequent k-sequences • Algorithm: • Join Phase: join Lk-1 with Lk-1. s1 can join with s2 if the subsequence obtained by dropping the first item of s1 is the same as the subsequence obtained by dropping the last item of s2 • Prune Phase: delete candidate sequences that have a contiguous (k-1)-subsequence whose support count is less than the minimum support

Candidate Generation: Example • Join phase: • <(1,2)(3)> joins with <(2)(3,4)> => <(1,2)(3,4)> • <(1,2)(3)> joins with <(2)(3)(5)> => <(1,2)(3)(5)> • Prune phase: • <(1,2)(3)(5)> is dropped because its contiguous subsequence <(1)(3)(5)> is not in L3
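
The join and prune phases can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: sequences are tuples of sorted item tuples, the helper names are made up, and the special case of joining L1 with L1 (where the new item must also be added inside the same element) is not handled. A contiguous (k-1)-subsequence is taken to be the result of dropping one item from the first element, from the last element, or from any element with at least two items. The set `L3` used in the usage note is an assumed example chosen to be consistent with the joins and the prune shown on the slide.

```python
from typing import Iterator, Optional, Set, Tuple

Seq = Tuple[Tuple[int, ...], ...]  # assumed: tuple of elements, each a sorted tuple of items


def drop_item(seq: Seq, e: int, j: int) -> Seq:
    """Remove item j of element e; an element left empty disappears."""
    elem = seq[e][:j] + seq[e][j + 1:]
    return seq[:e] + ((elem,) if elem else ()) + seq[e + 1:]


def join(s1: Seq, s2: Seq) -> Optional[Seq]:
    """Join s1 with s2 if (s1 minus first item) equals (s2 minus last item)."""
    if drop_item(s1, 0, 0) != drop_item(s2, len(s2) - 1, len(s2[-1]) - 1):
        return None
    item = s2[-1][-1]
    if len(s2[-1]) == 1:
        return s1 + ((item,),)  # the item was a separate element in s2
    return s1[:-1] + (tuple(sorted(s1[-1] + (item,))),)  # the item joins the last element


def contiguous_subsequences(seq: Seq) -> Iterator[Seq]:
    """Yield the contiguous subsequences obtained by dropping exactly one item."""
    last = len(seq) - 1
    for e, elem in enumerate(seq):
        if e in (0, last) or len(elem) >= 2:
            for j in range(len(elem)):
                yield drop_item(seq, e, j)


def generate_candidates(frequent: Set[Seq]) -> Set[Seq]:
    """Join phase followed by prune phase."""
    joined = set()
    for s1 in frequent:
        for s2 in frequent:
            candidate = join(s1, s2)
            if candidate is not None:
                joined.add(candidate)
    return {c for c in joined
            if all(sub in frequent for sub in contiguous_subsequences(c))}
```

Taking `L3 = {((1, 2), (3,)), ((1, 2), (4,)), ((1,), (3, 4)), ((1, 3), (5,)), ((2,), (3, 4)), ((2,), (3,), (5,))}`, the join phase produces `<(1,2)(3,4)>` and `<(1,2)(3)(5)>`, and the prune phase drops the latter because `<(1)(3)(5)>` is missing from `L3`.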

Counting Candidates • Problem: given a set of candidate sequences C and a data-sequence d, find all sequences in C that are contained in d. • Two techniques are used: • Hash-tree data structure: to reduce the number of candidates in C that need to be checked. • Transforming the representation of the data-sequence d: to efficiently find whether a specific candidate is a subsequence of d.

Hash-Tree Structure • Purpose: reducing the number of candidates • Leaf node: a list of sequences • Interior node: a hash table • Operations: • Adding candidate sequences to the hash-tree • Finding the candidates contained in a data-sequence • Min-gap • Max-gap • Sliding window size

Representation Transformation • Purpose: to efficiently find the first occurrence of an element • Transform the data-sequences into transaction-links, where each link is identified by one item • E.g.: max-gap = 30, min-gap = 5, window-size = 0, <(1,2)(3)(4)> • E.g.: window-size = 7, find (2,6) after time = 20

Implementing Taxonomies • Basic Idea: • Replace each data-sequence d with an “extended sequence” d’, where each transaction di’ contains all the items in the corresponding transaction di, as well as all their ancestors. • E.g.: <(Foundation, Ringworld)(Second Foundation)> => <(Foundation, Ringworld, Asimov, Niven, Science Fiction)(Second Foundation, Asimov, Science Fiction)> • Optimizations: • Pre-compute the ancestors of each item, and drop infrequent ancestors before a new pass • Do not count patterns with an element that contains both an item x and its ancestor y • Problem: redundancy
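
The basic idea of the extended sequence can be sketched as follows. The `PARENTS` mapping is an assumed encoding of the book taxonomy (each item maps to its direct parents, which allows a DAG), inferred from the example above; the function names are illustrative and the two optimizations are not included.

```python
from typing import Dict, List, Set

# Assumed taxonomy: item -> direct parents (a DAG, so an item may have several parents).
PARENTS: Dict[str, List[str]] = {
    "Foundation": ["Asimov"],
    "Second Foundation": ["Asimov"],
    "Ringworld": ["Niven"],
    "Asimov": ["Science Fiction"],
    "Niven": ["Science Fiction"],
}


def ancestors(item: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All ancestors of an item, found by walking the parent links."""
    seen: Set[str] = set()
    stack = list(parents.get(item, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen


def extend_sequence(seq: List[Set[str]], parents: Dict[str, List[str]]) -> List[Set[str]]:
    """Add to every transaction the ancestors of all items it contains."""
    return [set(t).union(*(ancestors(x, parents) for x in t)) for t in seq]
```

Applied to `[{"Foundation", "Ringworld"}, {"Second Foundation"}]`, this yields the extended sequence from the example: the first transaction gains Asimov, Niven, and Science Fiction, and the second gains Asimov and Science Fiction.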

Performance Evaluation • Comparison of GSP and AprioriAll • Result: GSP is 2 to 20 times faster • Contributing factors: • Fewer candidates • Directly finding the candidates • Scale-up: • GSP scales linearly with the number of data-sequences • Effects of time constraints and sliding windows: • no performance degradation was observed

Experiment Result

Experiment Result(contd.)

Conclusion • GSP is a Generalized Sequence Mining Algorithm • Discovering all the sequential patterns • Good Customizability • Has been incorporated into IBM’s data mining product

Personal Opinion • Hash-tree Structure: main memory limitation • Multi-pass over the database • Apply GSP to CIS data
