PrefixSpan : Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth

PrefixSpan : Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth Jian Pei, Jiawei Han, BehzadMortazavi-Asl, Helen Pinto, Qiming Chen, UmeshwarDayal, Mei-Chun Hsu 17th ICDE’01 • 組員：沈郁棋簡志佳吳永斌曾文彥沈家譽

Outline • Introduction • FreeSpan • PrefixSpan algorithm • Improvement of PrefixSpan • Experimental results and performance • Conclusion • References • Discussions

Introduction • What is a sequence and sequential pattern mining? • Applications of sequential pattern mining • Previous method: • GSP algorithm • FreeSpan algorithm • New method proposed in this paper: • PrefixSpan

Sequence and subsequence • A sequence is an ordered list of itemset • Each sequence consists of a list elements and each element consists of a set of items • Example: α = <a(abc)(ac)d(cf)> is a sequence • An element may contain an item or many items. Items within an element are unordered and we list them alphabetically. Elements Item

Sequence and subsequence (Cont.) • If α = <a(abc)(ac)d(cf)> is a sequence, then β = <a(bc)df> is a subsequence of α. • The number of items in a sequence is the length of the sequence. For example, the length of α is 9. Transaction database Sequence database v.s.

Sequential Pattern Mining • Proposed by R. Agrawal and R. Srikant in 1995 • Given a user-specified minimum support threshold, sequential pattern mining is to find all of the frequent subsequences. • Applications: • Analyses of customer purchase behavior • Web access pattern • Scientific experiments • Disease treatments • Natural disasters • DNA sequences

SID • Sequence • 10 • <(bd)cb(ac)> • 20 • <(bf)(ce)b(fg)> • 30 • <(ah)(bf)abf> • 40 • <(be)(ce)d> • 50 • <a(bd)bcb(ade)> GSP Algorithm • An Apriori-like method • Example: Seed set 1st scan: 8 cand. 6 length-1 seq. pat. <a> <c> <d> <e> <f> <g> <h> 2nd scan: 51 cand. 19 length-2 seq. pat. <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)> Seed set <abb> <aab> <aba> <baa> <bab> … 3rd scan: 46 cand. 19 length-3 seq. pat. min_sup=2

Disadvantages of GSP • Potentially huge set of candidate sequences • Apriori-based method may generate a large set of candidate sequences which contains all the possible permutations of the elements. • For example, if there are 1000 frequent sequences of length-1, , the number of candidates will be Derived from the set Derived from the set ,

Disadvantages of GSP (Cont.) • Multiple scans of databases • The length of each candidate sequence grows by one at each database scan. • To find <(abc) (abc) (abc) (abc) (abc)>, GSP must scan the database at least 15 times.

Disadvantages of GSP (Cont.) • Difficulties at mining long sequential patterns • There is only a single sequence of length 100, min_sup=1 • length-1 candidate sequences : • length-2 candidate sequences : • length-3 candidate sequences : • Total

FreeSpan : Frequent Pattern-Projected Sequential Pattern Mining • A divide-and-conquer approach • Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns. • Mining each projected database to find its patterns.

FreeSpan • finding f_list: • finding the set of frequent items, and they are listed in support descending order. Items: a,b,c,d,e,f,g (item, support): (a,4),(b,4),(c,4),(d,3),(e,3),(f,3),(g,1) Support threshold:2 f_list = a:4, b:4, c:4, d:3, e:3, f:3 SID: sequence_id

FreeSpan f_list = a:4, b:4, c:4, d:3, e:3, f:3 The complete set of sequential patterns in sequence database can be divided into 6 disjoint subsets: The ones containing only the item a. The ones containing item b but containing no items after b in f_list. The ones containing item c but containing no items after c in f_list. The ones containing item d but containing no items after d in f_list. The ones containing item e but containing no items after e in f_list. The ones containing item f.

FreeSpan Finding sequential patterns containing only item a. By scanning sequence database once, the only two sequential patterns are found: <a>, <aa> <aa> support:2= frequent threshold: 2 <a> =<(a)> support:4> frequent threshold: 2 <aaa> support:1<= frequent threshold: 2

FreeSpan Finding sequential patterns containing item b but no item after b in f_list. This can be achieved by constructing the {b}-projected database. Finding sequences in sequence database containing item b. <a(abc)(ac)d(cf)> <(ad)c(bc)(ae)> <(ef)(ab)(df)cb> <eg(af)cbc> Removing all items after b in f_list(a:4, b:4, c:4, d:3, e:3, f:3). <a(ab)(a)> <(a)(b)(a)> <(ab)b> <(a)b> <a(ab)a> <aba> <(ab)b> <ab>

FreeSpan By scanning the projected base once more, all sequential patterns containing item b but no item after b in f_list are found: {b}-projected database <a(ab)a> <aba> <(ab)b> <ab> , <ab>, <ba>, <(ab)> Support=3 And then, finding other subsets of sequential patterns.

FreeSpan • FreeSpanis more efficient than GSP. • The major cost of FreeSpan is to deal with projected databases. If a pattern appears in each sequence of adatabase, its projected database does not shrink . {f}-projected database f_list = a:4, b:4, c:4, d:3, e:3, f:3 <a(abc)(ac)d(cf)> <(ef)(ab)(df)cb> <e(af)cbc>

PrefixSpan (Prefix-Projected Sequential Pattern Growth)

Prefix • Given two sequences α=<a1a2…an> and β=<b1b2…bm>, m≤n. • Sequence β is called a prefix of α if and only if: • bi= ai for i ≤ m-1; • bm ⊆ am; • Example : • α =<a(abc)(ac)d(cf)> • β =<a(abc)a> α=<a1a2…am-1 am…..an> β=<b1b2…bm-1bm >

Postfix (Projection) • Let α’ =<a1,a2…an> be the projection of α w.r.t. prefix • β=<a1a2…am-1a’m> (m ≤n) • Sequence γ=<a’’mam+1…an> is called the postfix of α w.r.t. prefix β, denoted as γ= α/ β, where a’’m=(am - a’m). • We also denote α =β⋅ γ. • Example: • α’ =<a(abc)(ac)d(cf)>, • β =<a(abc)a>, • γ=<(_c)d(cf)>.

PrefixSpan – Algorithm • Input of the algorithm : A sequence database S, and the minimum support threshold min_support. • Output of the algorithm: The complete set of sequential patterns. • Subroutine: PrefixSpan(α, L, S|α). • Parameters: • α: sequential pattern, • L: the length of α; • S|α: the α-projected database, if α ≠<>; otherwise; the sequence database S. • Call PrefixSpan(<>,0,S).

PrefixSpan - Example • Step 1: Find length-1 sequential patterns. (min_sup=2) length-1 sequential patterns : <a><c><d><e><f>

Step 2: divide search space. Length-1 sequential patterns <a>, , <c>, <d>, <e>, <f> Having prefix <c>, …, <f> Length-2 sequential patterns <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2 … … … … … Support=1 < min_sup <aaa> is not sequential pattern Terminate!!

Efficiency of PrefixSpan • No candidate sequence needs to be generated • Projected databases keep shrinking • Major cost of PrefixSpan: constructing projected databases -> Can be improved by bi-level projections

Improvement of PrefixSpan

Bi-level Projection • The major cost of PrefixSpan is to construct projected databases. • A bi-level projection scheme is proposed to reduce the number and the size of projected databases.

Bi-level Projection - Example • Scan S to find the length-1 sequential patterns:<a>, , <c>, <d>, <e>, <f>.

Bi-level Projection - Example • Instead of constructing projected databases for each length-1 sequential pattern, we construct a 6*6 lower triangular matrix M, as shown in Table 3. Sequence <cc> appears in three sequences in S.

Bi-level Projection - Example • Instead of constructing projected databases for each length-1 sequential pattern, we construct a 6*6 lower triangular matrix M, as shown in Table 3. supports(<ac>) = 4, supports(<ca>) = 2 and supports(<(ac)>) = 1.

Bi-level Projection - Example • The <ab>-projected database contains three sequences: <(_c)(ac)(cf)>, <(_c)a> and <c>. • Three frequent items are found: <a>, <c> and <(_c)>. Length-2 pattern <(_c)a> can be generated

Bi-level Projection - Example • Using level-by-levelprojection, to find the complete set of 53 sequential patterns, 53 projected databases are constructed. • Only 22 projected databases are constructed by bi-level projection.

Optimization “do we need to include every item in a postfix in the projected databases?”

Optimization • Consider the <ac>-projected database… Exclude item d from <ac>-projected database!

Pseudo-Projection • By examining a set of projected databases, one can observe that postfixes of a sequence often appear repeatedly in recursive projected databases. • Sequence <a(abc)(ac)d(cf)> has postfixes <(abc)(ac)d(cf)> and <(_c)(ac)d(cf)> as projections in <a>- and <ab>-projected databases, respectively.

Pseudo-Projection • Every projection consists of two pieces of information: pointer to the sequence in database and offset of the postfix in the sequence.

Pseudo-Projection • For example, suppose the sequence database S in Table 1 can be held in main memory. • When constructing <a>-projected database, the projection of sequence s1 = <a(abc)(ac)d(cf)> consists two pieces: a pointer to s1 and offset set to 2, i.e., postfix (abc)(ac)d. • <ab>-projected database contains a pointer to s1 and offset set to 4.

Experimental Results and Performance

Dataset • The number of items is set to 1000. • There are 10000 sequences in the data set. • The average number of items within elements is set to 8. • The average number of elements in a sequence is set to 8.

Performance GSP：The GSP algorithm FreeSpan：FreeSpan with alternative level projection PrefixSpan-1：PrefixSpan withlevel-by-level projection PrefixSpan-2：PrefixSpan withbi-level projection • Both FreeSpan and PrefixSpan win GSP • PrefixSpan methods are more efficient than FreeSpan • When the support threshold is low, PrefixSpan-1 requires a major effort to generate projected databases. • The performance curves of PrefixSpan-1 and PrefixSpan-2 are close when support threshold is not low Figure 1

Performance (Cont.) • This figure shows that using pseudo-projections for the projected databases that can be held in main memory improves efficiency of PrefixSpan further. • pseudo-projection improves performance when the projected database can be held in main memory • A related question becomes: • “Can such a method be extended to disk-based processing?” Figure 2

Performance (Cont.) • This figure shows the I/O costs of PrefixSpan-1 and PrefixSpan-2 as well as of their pseudo-projection variations • Dataset • The number of items is set to 1000. • There are 1 million sequences in the data set. • The average number of items within elements is set to 8. • The average number of elements in a sequence is set to 8. Figure 3

Performance (Cont.) • This figure shows the scalability of PrefixSpan-1 and PrefixSpan-2 with respect to the number of sequences. • Both are linearly scalable • Since the support threshold is set to 0.20%, PrefixSpan-2 performs better. Figure 4

Conclusions • PrefixSpanis efficient pattern growth method. • The performance of PrefixSpan is batter then GSP and FreeSpan. • Prefix-projection reduces the size of projected database and leads to efficient processing. • Bi-level projection and pseudo-projection improve sequential pattern mining efficiency.

Reference • [1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB’94), pages 487–499, Santiago, Chile, Sept. 1994. • [2] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering (ICDE’95), pages 3–14, Taipei, Taiwan, Mar. 1995. • [3] C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32–38, 1998. • [4] M. Garofalakis, R. Rastogi, and K. Shim. Spirit: Sequential pattern mining with regular expression constraints. In Proc. 1999 Int. Conf. Very Large Data Bases (VLDB’99), pages 223–234, Edinburgh, UK, Sept. 1999. • [5] J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In Proc. 1999 Int. Conf. Data Engineering (ICDE’99), pages 106–115, Sydney, Australia, Apr. 1999. • [6] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. Freespan: Frequent pattern-projected sequential pattern mining. In Proc. 2000 Int. Conf. Knowledge Discovery and Data Mining (KDD’00), pages 355–359, Boston, MA, Aug. 2000. • [7] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’00), pages 1– 12, Dallas, TX, May 2000. • [8] H. Lu, J. Han, and L. Feng. Stock movement and ndimensional inter-transaction association rules. In Proc. 1998 SIGMOD Workshop Research Issues on Data Mining and Knowledge Discovery (DMKD’98), pages 12:1–12:7, Seattle, WA, June 1998. • [9] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259–289, 1997. • [10] B. O¨ zden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In Proc. 1998 Int. Conf. Data Engineering (ICDE’98), pages 412–421, Orlando, FL, Feb. 1998. • [11] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. 5th Int. Conf. Extending Database Technology (EDBT’96), pages 3–17, Avignon, France, Mar. 1996.

Discussions • The database is static. If any data is added for mining, there will be some problems when doing PrefixSpan. • If the database is dynamic, it will be too large. We have to consider old data while mining patterns. • PrefixSpan cannot take duration into account for an item.

Discussions • The database is static. If any data is added for mining, there will be some problems when doing PrefixSpan. • Incremental database, data steam • If the database is dynamic, it will be too large. We have to consider old data while mining patterns. • Progressive database • PrefixSpancannot take duration into account for an item. • Time interval database

PrefixSpan : Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth

PrefixSpan : Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth

Presentation Transcript

Sequential imperfect-information games Case study: Poker

Phonology: The Sound Pattern of Language

Analyze Projected Job Growth in Public Safety

Web Mining

Data Mining using Fractals and Power laws

LINEAR SEQUENTIAL MACHINE AND REDUCTION OF LINEAR SEQUENTIAL MACHINE

Data Mining using Fractals and Power laws

Mining text and data on chemicals

Design Patterns

Monte F. Hancock, Jr. Chief Scientist Celestech, Inc.

Mining Billion-node Graphs: Patterns, Generators and Tools

Design Patterns

CHAPTER 5 Synchronous Sequential Logic

Nursing Health Assessments

Design Patterns for Parallel Programming

Design patterns

Spatial Data Mining

Data Mining: Concepts and Techniques

Patterns for the People