
Mining Sequential Patterns: Generalizations and Performance Improvements

Mining Sequential Patterns: Generalizations and Performance Improvements. R. Srikant R. Agrawal IBM Almaden Research Center Advisor: Dr. Hsu Presented by: M.H. Lin. Outline. Motivation Objective Introduction Problem Statement The New Algorithm: GSP Performance Evaluation Conclusion


Presentation Transcript


  1. Mining Sequential Patterns: Generalizations and Performance Improvements • R. Srikant, R. Agrawal • IBM Almaden Research Center • Advisor: Dr. Hsu • Presented by: M.H. Lin

  2. Outline • Motivation • Objective • Introduction • Problem Statement • The New Algorithm: GSP • Performance Evaluation • Conclusion • Personal Opinion

  3. Motivation • The problem of mining sequential patterns was recently introduced. • Limitations of the AprioriAll algorithm [Agrawal, 1995]: • Absence of time constraints • Rigid definition of a transaction • Absence of taxonomies

  4. Objective • To present GSP, a new algorithm that discovers these generalized sequential patterns • To compare the performance of GSP empirically with that of the AprioriAll algorithm

  5. Introduction • Instance • A database of sequences, called data-sequences • Each data-sequence is a list of transactions ordered by transaction-time • Each transaction is a set of items • Definitions: • A sequential pattern consists of a list of itemsets • Support: the number of data-sequences that contain the pattern • Problem: • To discover all the sequential patterns with a user-specified minimum support

  6. Example Of A Sequential Pattern • Book-club database: each data-sequence corresponds to all the book selections of one customer, and each transaction contains the books that customer selected in one order • A sequential pattern: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’
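
The support definition can be sketched in a few lines of Python. This is only an illustration of the basic problem (no time constraints, sliding windows, or taxonomies); the customer names and their purchase histories are invented for the example, and `contains` and `support` are hypothetical helper names, not part of the presentation.

```python
def contains(data_sequence, pattern):
    """True if every pattern element is a subset of a distinct transaction,
    with the matched transactions appearing in the same order as the pattern."""
    transactions = iter(data_sequence)
    # any() consumes the shared iterator, so each element is matched
    # strictly after the transaction that matched the previous element.
    return all(any(element <= t for t in transactions) for element in pattern)


def support(database, pattern):
    """Number of data-sequences (customers) that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database.values())


# Hypothetical book-club data: one list of transactions per customer.
database = {
    "alice": [{"Foundation"}, {"Foundation and Empire", "Ringworld"}, {"Second Foundation"}],
    "bob": [{"Foundation", "Ringworld"}, {"Foundation and Empire"}, {"Second Foundation"}],
    "carol": [{"Foundation"}, {"Dune"}, {"Foundation and Empire", "Ringworld", "Dune"}, {"Second Foundation"}],
    "dave": [{"Second Foundation"}, {"Foundation"}],
}
pattern = [{"Foundation"}, {"Foundation and Empire", "Ringworld"}, {"Second Foundation"}]

print(support(database, pattern))                  # 2
print(support(database, pattern) / len(database))  # 0.5
```

In this made-up database only two of the four customers support the pattern, so it would be reported for any minimum support of 50% or less.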

  7. Features of A Sequential Pattern • E.g.: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’ • Maximum and/or minimum time gaps between adjacent elements • E.g.: the time between buying ‘Foundation’ and then buying ‘Foundation and Empire’ and ‘Ringworld’ should be within 3 months • A sliding time window over the sequence-pattern elements • E.g.: with a window of one week, the data-sequence Mo: BK-a, Sa: BK-b, next Su: BK-c supports the pattern “BK-a” and “BK-b”, then “BK-c” • User-defined taxonomies • Example on the next slide

  8. A User-defined Taxonomy • A customer who bought Foundation, then Perfect Spy, would support the following patterns: • Foundation, then Perfect Spy • Asimov, then Perfect Spy • Science Fiction, then Le Carre • …

  9. The Old Algorithm: AprioriAll • A 3-phase algorithm • Phase 1: finds all frequent itemsets with the minimum support • Phase 2: transforms the database so that each transaction contains only the frequent itemsets • Phase 3: finds sequential patterns • Pros: • Can discover all frequent sequential patterns • Cons: • Computationally expensive in both space and time • Not feasible to incorporate sliding windows

  10. Problem Statement • Definitions: • Let I = {i_1, i_2, …, i_m} be a set of literals, called items • Let T be a directed acyclic graph on the literals • An itemset is a non-empty set of items • A sequence is an ordered list of itemsets • We denote a sequence s by <s_1 s_2 … s_n>, where s_j is an itemset • We denote an element of a sequence by (x_1, x_2, …, x_m), where x_j is an item • A sequence <a_1 a_2 … a_n> is a subsequence of another sequence <b_1 b_2 … b_m> if there exist integers i_1 < i_2 < … < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, …, a_n ⊆ b_{i_n} • E.g.: <(3)(4,5)(8)> is a subsequence of <(7)(3,8)(9)(4,5,6)(8)> • E.g.: <(3)(5)> is not a subsequence of <(3,5)>
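
The subsequence definition can be checked mechanically. The sketch below assumes a sequence is represented as a Python list of sets; `is_subsequence` is a hypothetical helper name, and the two calls reproduce the examples on the slide.

```python
def is_subsequence(a, b):
    """Return True if sequence a is a subsequence of sequence b.

    Both are lists of itemsets (sets). Greedy matching is sufficient:
    picking the earliest element of b that is a superset of the current
    element of a never rules out a match for the remaining elements.
    """
    j = 0
    for element in a:
        # Advance to the first remaining element of b that contains `element`.
        while j < len(b) and not element <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # the next element of a must map to a strictly later element of b
    return True


print(is_subsequence([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
print(is_subsequence([{3}, {5}], [{3, 5}]))                                   # False
```

The second call fails because items 3 and 5 sit in the same element of the longer sequence, while the pattern requires them in two different elements.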

  11. Problem Statement (contd.) • A data-sequence contains a sequence s if s is a subsequence of the data-sequence • Plus taxonomies: • A transaction T contains an item x ∈ I if x is in T or x is an ancestor of some item in T • Plus sliding windows: • A data-sequence d = <d_1 … d_m> contains a sequence s = <s_1 … s_n> if there exist integers l_1 ≤ u_1 < l_2 ≤ u_2 < … < l_n ≤ u_n such that • 1. s_i is contained in the union of d_{l_i}, …, d_{u_i}, 1 ≤ i ≤ n, and • 2. transaction-time(d_{u_i}) - transaction-time(d_{l_i}) ≤ window-size, 1 ≤ i ≤ n • Plus time constraints: • 3. transaction-time(d_{l_i}) - transaction-time(d_{u_{i-1}}) > min-gap, 2 ≤ i ≤ n, and • 4. transaction-time(d_{u_i}) - transaction-time(d_{l_{i-1}}) ≤ max-gap, 2 ≤ i ≤ n
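
Conditions 1 to 4 can be tested directly with an exhaustive search over the index pairs (l_i, u_i). The sketch below is a brute-force check written for clarity, not the procedure GSP itself uses. It assumes a data-sequence is a list of (transaction-time, itemset) pairs sorted by time with distinct times; the function name `contains` and the day numbers in the example are assumptions made for illustration.

```python
def contains(data_seq, pattern, window_size=0, min_gap=0, max_gap=float("inf")):
    """Check whether data_seq contains pattern under the four conditions above.

    data_seq: list of (time, set_of_items), sorted by time.
    pattern:  list of sets of items.
    """
    times = [t for t, _ in data_seq]

    def search(i, start, prev_l, prev_u):
        if i == len(pattern):
            return True
        for l in range(start, len(data_seq)):
            # Condition 3: gap to the end of the previous element's window.
            if prev_u is not None and times[l] - times[prev_u] <= min_gap:
                continue
            merged = set()
            for u in range(l, len(data_seq)):
                # Condition 2: all merged transactions fit in the sliding window.
                if times[u] - times[l] > window_size:
                    break
                # Condition 4: distance from the start of the previous element.
                if prev_l is not None and times[u] - times[prev_l] > max_gap:
                    break
                merged |= data_seq[u][1]
                # Condition 1: the element is covered by the merged transactions.
                if pattern[i] <= merged and search(i + 1, u + 1, l, u):
                    return True
        return False

    return search(0, 0, None, None)


# Slide 7 example with assumed day numbers: Monday = 1, Saturday = 6, next Sunday = 14.
d = [(1, {"BK-a"}), (6, {"BK-b"}), (14, {"BK-c"})]
p = [{"BK-a", "BK-b"}, {"BK-c"}]

print(contains(d, p, window_size=7))              # True
print(contains(d, p, window_size=0))              # False
print(contains(d, p, window_size=7, max_gap=10))  # False
```

With a one-week window the first two purchases are merged into a single element; without a window no single transaction holds both books, and a max-gap of 10 days is violated because 14 - 1 = 13.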

  12. Problem Definition • Input: • Database D: data-sequences • Taxonomy T: a DAG, not necessarily a tree • User-specified min-gap and max-gap time constraints • A user-specified sliding-window size • A user-specified minimum support • Goal: • To find all sequences whose support is greater than the user-specified minimum support

  13. Example • Minimum support: 2 data-sequences • With the AprioriAll problem definition: • <(Ringworld)(Ringworld Engineers)> • A sliding window of 7 days adds the pattern • <(Foundation, Ringworld)(Ringworld Engineers)> • A max-gap of 30 days: • both patterns are dropped • Adding the taxonomy, with no sliding window or time constraints, one of the patterns added is • <(Foundation)(Asimov)>

  14. GSP: Basic Structure • Phase 1: makes the first pass over the database • To yield all the 1-element frequent sequences • Phase 2: the kth pass: • Starts with the seed set found in the (k-1)th pass and generates candidate sequences, each of which has one more item than a seed sequence • Makes a new pass over D to find the support of these candidate sequences • The frequent candidates become the seed set for the next pass • Phase 3: terminates when • no more frequent sequences are found, or • no candidate sequences are generated

  15. GSP: implementation • Generating Candidates: • To generate as few candidates as possible while maintaining completeness • Counting Candidates: • To determine the candidate sequence’s support • Implementing Taxonomies

  16. Candidate Generation • Definitions: • k-sequence: a sequence with k items • L_k: the set of frequent k-sequences • C_k: the set of candidate k-sequences • Goal: given the set of all frequent (k-1)-sequences, generate a candidate set that includes all frequent k-sequences • Algorithm: • Join phase: join L_{k-1} with L_{k-1}; s1 can join with s2 if the subsequence obtained by dropping the first item of s1 is the same as the subsequence obtained by dropping the last item of s2 • Prune phase: delete candidate sequences that have a contiguous (k-1)-subsequence whose support count is less than the minimum support

  17. Candidate Generation: Example • Join phase: • <(1,2)(3)> joins with <(2)(3,4)> => <(1,2)(3,4)> • <(1,2)(3)> joins with <(2)(3)(5)> => <(1,2)(3)(5)> • Prune phase: • <(1,2)(3)(5)> is dropped because its contiguous subsequence <(1)(3)(5)> is not in L3
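
The join and prune steps can be sketched as follows. Sequences are represented as tuples of sorted item tuples; the helper names are hypothetical; and the full set `L3` is an assumption, chosen so that it is consistent with the joins and the pruned candidate shown on the slide. The sketch covers only the case where the seed sequences have at least two items, so the special handling needed when joining 1-sequences is left out.

```python
def drop_item(seq, elem_idx, item_idx):
    """Remove one item from seq; drop the element if it becomes empty."""
    element = seq[elem_idx][:item_idx] + seq[elem_idx][item_idx + 1:]
    middle = (element,) if element else ()
    return seq[:elem_idx] + middle + seq[elem_idx + 1:]


def join(s1, s2):
    """Join s1 with s2 if s1 minus its first item equals s2 minus its last item."""
    if drop_item(s1, 0, 0) != drop_item(s2, len(s2) - 1, len(s2[-1]) - 1):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) == 1:
        return s1 + ((last,),)              # last item was a separate element in s2
    return s1[:-1] + (s1[-1] + (last,),)    # last item joins the last element of s1


def contiguous_subsequences(seq):
    """All contiguous subsequences with one item fewer: drop an item from the
    first or last element, or from any element that has at least two items."""
    result = set()
    for e, element in enumerate(seq):
        if e in (0, len(seq) - 1) or len(element) >= 2:
            for i in range(len(element)):
                result.add(drop_item(seq, e, i))
    return result


def generate_candidates(frequent):
    joined = set()
    for s1 in frequent:
        for s2 in frequent:
            candidate = join(s1, s2)
            if candidate is not None:
                joined.add(candidate)
    # Prune: keep a candidate only if all its contiguous subsequences are frequent.
    return {c for c in joined if contiguous_subsequences(c) <= frequent}


# Assumed set of frequent 3-sequences, consistent with the slide's example.
L3 = {
    ((1, 2), (3,)), ((1, 2), (4,)), ((1,), (3, 4)),
    ((1, 3), (5,)), ((2,), (3, 4)), ((2,), (3,), (5,)),
}
print(generate_candidates(L3))  # {((1, 2), (3, 4))}
```

The join produces <(1,2)(3,4)> and <(1,2)(3)(5)>; the second is pruned because <(1)(3)(5)> is missing from `L3`.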

  18. Counting Candidates • Problem: given a set of candidate sequences C and a data-sequence d, find all sequences in C that are contained in d • Two techniques are used: • Hash-tree data structure: reduces the number of candidates in C that need to be checked • Transforming the representation of the data-sequence d: makes it efficient to test whether a specific candidate is a subsequence of d

  19. Hash-Tree Structure • Purpose: reducing the number of candidates • Leaf node: a list of sequences • Interior node: a hash table • Operations: • Adding candidate sequences to the hash-tree • Finding the candidates contained in a data-sequence • Min-gap • Max-gap • Sliding window size

  20. Representation Transformation • Purpose: to efficiently find the first occurrence of an element • Transform the data-sequences into transaction-links; each link is identified by one item • E.g.: max-gap = 30, min-gap = 5, window-size = 0, <(1,2)(3)(4)> • E.g.: window-size = 7, find (2,6) after time = 20
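
One way to read this transformation is as a per-item list of transaction-times, which lets the first occurrence of an element be found with binary searches instead of a scan. The sketch below follows that reading. The data-sequence is an assumption, since the slide's figure is not part of the transcript, and `build_index` and `find_element` are hypothetical names; only the query (element (2,6), window-size 7, after time 20) comes from the slide.

```python
from bisect import bisect_left, bisect_right


def build_index(data_seq):
    """Map each item to the sorted list of times of the transactions containing it."""
    index = {}
    for time, items in data_seq:  # data_seq is sorted by transaction-time
        for item in items:
            index.setdefault(item, []).append(time)
    return index


def find_element(index, element, after, window_size):
    """Return (start, end) of the earliest-ending occurrence of `element`
    strictly after time `after`, or None if there is none."""
    lower, strict = after, True
    while True:
        picked = []
        for item in element:
            item_times = index.get(item, [])
            # First time > lower on the first round, >= lower on later rounds.
            search = bisect_right if strict else bisect_left
            pos = search(item_times, lower)
            if pos == len(item_times):
                return None
            picked.append(item_times[pos])
        start, end = min(picked), max(picked)
        if end - start <= window_size:
            return start, end
        # No valid occurrence can start before end - window_size: retry from there.
        lower, strict = end - window_size, False


# Assumed data-sequence: (transaction-time, items).
d = [(10, {1, 2}), (25, {4, 6}), (45, {3}), (50, {1, 2}),
     (65, {3}), (90, {2, 4}), (95, {6})]
index = build_index(d)

print(index[2])                                              # [10, 50, 90]
print(find_element(index, {2, 6}, after=20, window_size=7))  # (90, 95)
```

For this data the search first pairs item 2 at time 50 with item 6 at time 25, finds them too far apart, restarts from 50 - 7 = 43, and after one more round settles on times 90 and 95.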

  21. Implementing Taxonomies • Basic idea: • Replace each data-sequence d with an “extended sequence” d’, where each transaction d’_i contains all the items in the corresponding transaction d_i, as well as all their ancestors • E.g.: <(Foundation, Ringworld)(Second Foundation)> => <(Foundation, Ringworld, Asimov, Niven, Science Fiction)(Second Foundation, Asimov, Science Fiction)> • Optimizations: • Pre-compute the ancestors of each item, and drop infrequent ancestors before a new pass • Do not count patterns with an element that contains both an item x and its ancestor y • Problem: redundancy • E.g.
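
The extended-sequence idea can be sketched as below. The taxonomy is stored as a child-to-parents mapping so that it can be a DAG rather than a tree; the mapping shown is inferred from the slide's example, and `ancestors` and `extend_sequence` are hypothetical helper names.

```python
def ancestors(item, parents):
    """All ancestors of item in a taxonomy given as {child: [parent, ...]}."""
    found = set()
    stack = list(parents.get(item, ()))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents.get(node, ()))
    return found


def extend_sequence(data_sequence, parents):
    """Add to every transaction the ancestors of each of its items."""
    return [
        set(transaction).union(*(ancestors(x, parents) for x in transaction))
        for transaction in data_sequence
    ]


# Taxonomy inferred from the slide's example (child -> parents).
parents = {
    "Foundation": ["Asimov"],
    "Second Foundation": ["Asimov"],
    "Ringworld": ["Niven"],
    "Asimov": ["Science Fiction"],
    "Niven": ["Science Fiction"],
}

d = [{"Foundation", "Ringworld"}, {"Second Foundation"}]
for transaction in extend_sequence(d, parents):
    print(sorted(transaction))
# ['Asimov', 'Foundation', 'Niven', 'Ringworld', 'Science Fiction']
# ['Asimov', 'Science Fiction', 'Second Foundation']
```

Computing `ancestors` once per item and caching the result corresponds to the first optimization listed on the slide.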

  22. Performance Evaluation • Comparison of GSP and AprioriAll • Result: GSP is 2 to 20 times faster than AprioriAll • Contributing factors: • Fewer candidates • Directly finding the candidates contained in a data-sequence • Scale-up: • GSP scales linearly with the number of data-sequences • Effects of time constraints and sliding windows: • There was no performance degradation

  23. Experiment Result

  24. Experiment Result(contd.)

  25. Experiment Result(contd.)

  26. Experiment Result(contd.)

  27. Experiment Result(contd.)

  28. Conclusion • GSP is a Generalized Sequence Mining Algorithm • Discovering all the sequential patterns • Good Customizability • Has been incorporated into IBM’s data mining product

  29. Personal Opinion • Hash-tree Structure: main memory limitation • Multi-pass over the database • Apply GSP to CIS data
