Loading in 2 Seconds...

PrefixSpan : Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth

Loading in 2 Seconds...

- 339 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'PrefixSpan : Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth' - mahola

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### PrefixSpan : Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth

Jian Pei, Jiawei Han, BehzadMortazavi-Asl, Helen Pinto,

Qiming Chen, UmeshwarDayal, Mei-Chun Hsu

17th ICDE’01

- 組員：沈郁棋 簡志佳吳永斌 曾文彥沈家譽

Outline

- Introduction
- FreeSpan
- PrefixSpan algorithm
- Improvement of PrefixSpan
- Experimental results and performance
- Conclusion
- References
- Discussions

Introduction

- What is a sequence and sequential pattern mining?
- Applications of sequential pattern mining
- Previous method:
- GSP algorithm
- FreeSpan algorithm
- New method proposed in this paper:
- PrefixSpan

Sequence and subsequence

- A sequence is an ordered list of itemset
- Each sequence consists of a list elements and each element consists of a set of items
- Example: α = <a(abc)(ac)d(cf)> is a sequence
- An element may contain an item or many items. Items within an element are unordered and we list them alphabetically.

Elements

Item

Sequence and subsequence (Cont.)

- If α = <a(abc)(ac)d(cf)> is a sequence, then β = <a(bc)df> is a subsequence of α.
- The number of items in a sequence is the length of the sequence. For example, the length of α is 9.

Transaction database

Sequence database

v.s.

Sequential Pattern Mining

- Proposed by R. Agrawal and R. Srikant in 1995
- Given a user-specified minimum support threshold, sequential pattern mining is to find all of the frequent subsequences.
- Applications:
- Analyses of customer purchase behavior
- Web access pattern
- Scientific experiments
- Disease treatments
- Natural disasters
- DNA sequences

- Sequence

- 10

- <(bd)cb(ac)>

- 20

- <(bf)(ce)b(fg)>

- 30

- <(ah)(bf)abf>

- 40

- <(be)(ce)d>

- 50

- <a(bd)bcb(ade)>

- An Apriori-like method
- Example:

Seed set

1st scan: 8 cand. 6 length-1 seq. pat.

<a> <b> <c> <d> <e> <f> <g> <h>

2nd scan: 51 cand. 19 length-2 seq. pat.

<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>

Seed set

<abb> <aab> <aba> <baa> <bab> …

3rd scan: 46 cand. 19 length-3 seq. pat.

min_sup=2

Disadvantages of GSP

- Potentially huge set of candidate sequences
- Apriori-based method may generate a large set of candidate sequences which contains all the possible permutations of the elements.
- For example, if there are 1000 frequent sequences of length-1, , the number of candidates will be

Derived from the set

Derived from the set

,

Disadvantages of GSP (Cont.)

- Multiple scans of databases
- The length of each candidate sequence grows by one at each database scan.
- To find <(abc) (abc) (abc) (abc) (abc)>, GSP must scan the database at least 15 times.

Disadvantages of GSP (Cont.)

- Difficulties at mining long sequential patterns
- There is only a single sequence of length 100, min_sup=1
- length-1 candidate sequences :
- length-2 candidate sequences :
- length-3 candidate sequences :
- Total

FreeSpan : Frequent Pattern-Projected Sequential Pattern Mining

- A divide-and-conquer approach
- Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns.
- Mining each projected database to find its patterns.

- finding f_list:
- finding the set of frequent items, and they are listed in support descending order.

Items: a,b,c,d,e,f,g

(item, support): (a,4),(b,4),(c,4),(d,3),(e,3),(f,3),(g,1)

Support threshold:2

f_list = a:4, b:4, c:4, d:3, e:3, f:3

SID: sequence_id

f_list = a:4, b:4, c:4, d:3, e:3, f:3

The complete set of sequential patterns in sequence database can be divided into 6 disjoint subsets:

The ones containing only the item a.

The ones containing item b but containing no items after b in f_list.

The ones containing item c but containing no items after c in f_list.

The ones containing item d but containing no items after d in f_list.

The ones containing item e but containing no items after e in f_list.

The ones containing item f.

Finding sequential patterns containing only item a.

By scanning sequence database once, the only two

sequential patterns are found: <a>, <aa>

<aa>

support:2= frequent threshold: 2

<a> =<(a)>

support:4> frequent threshold: 2

<aaa>

support:1<= frequent threshold: 2

Finding sequential patterns containing item b but no item after b in f_list.

This can be achieved by constructing the {b}-projected database.

Finding sequences in sequence database containing item b.

<a(abc)(ac)d(cf)>

<(ad)c(bc)(ae)>

<(ef)(ab)(df)cb>

<eg(af)cbc>

Removing all items after b in f_list(a:4, b:4, c:4, d:3, e:3, f:3).

<a(ab)(a)>

<(a)(b)(a)>

<(ab)b>

<(a)b>

<a(ab)a>

<aba>

<(ab)b>

<ab>

By scanning the projected base once more,

all sequential patterns containing item b but no item

after b in f_list are found:

{b}-projected database

<a(ab)a>

<aba>

<(ab)b>

<ab>

<b>, <ab>, <ba>, <(ab)>

Support=3

And then, finding other subsets of sequential patterns.

- FreeSpanis more efficient than GSP.
- The major cost of FreeSpan is to deal with projected databases. If a pattern appears in each sequence of adatabase, its projected database does not shrink .

{f}-projected database

f_list = a:4, b:4, c:4, d:3, e:3, f:3

<a(abc)(ac)d(cf)>

<(ef)(ab)(df)cb>

<e(af)cbc>

(Prefix-Projected Sequential Pattern Growth)

Prefix

- Given two sequences α=<a1a2…an> and β=<b1b2…bm>, m≤n.
- Sequence β is called a prefix of α if and only if:
- bi= ai for i ≤ m-1;
- bm ⊆ am;
- Example :
- α =<a(abc)(ac)d(cf)>
- β =<a(abc)a>

α=<a1a2…am-1 am…..an>

β=<b1b2…bm-1bm >

Postfix (Projection)

- Let α’ =<a1,a2…an> be the projection of α w.r.t. prefix
- β=<a1a2…am-1a’m> (m ≤n)
- Sequence γ=<a’’mam+1…an> is called the postfix of α w.r.t. prefix β, denoted as γ= α/ β, where a’’m=(am - a’m).
- We also denote α =β⋅ γ.
- Example:
- α’ =<a(abc)(ac)d(cf)>,
- β =<a(abc)a>,
- γ=<(_c)d(cf)>.

PrefixSpan – Algorithm

- Input of the algorithm :

A sequence database S, and the minimum support threshold min_support.

- Output of the algorithm:

The complete set of sequential patterns.

- Subroutine: PrefixSpan(α, L, S|α).
- Parameters:
- α: sequential pattern,
- L: the length of α;
- S|α: the α-projected database, if α ≠<>; otherwise; the sequence database S.
- Call PrefixSpan(<>,0,S).

PrefixSpan - Example

- Step 1: Find length-1 sequential patterns. (min_sup=2)

length-1 sequential patterns : <a><b><c><d><e><f>

Length-1 sequential patterns

<a>, <b>, <c>, <d>, <e>, <f>

Having prefix <c>, …, <f>

Length-2 sequential

patterns

<aa>:2, <ab>:4, <(ab)>:2,

<ac>:4, <ad>:2, <af>:2

…

… …

…

…

Support=1 < min_sup

<aaa> is not sequential pattern

Terminate!!

Efficiency of PrefixSpan

- No candidate sequence needs to be generated
- Projected databases keep shrinking
- Major cost of PrefixSpan: constructing projected databases

-> Can be improved by bi-level projections

Bi-level Projection

- The major cost of PrefixSpan is to construct projected databases.
- A bi-level projection scheme is proposed to reduce the number and the size of projected databases.

Bi-level Projection - Example

- Scan S to find the length-1 sequential patterns:<a>, <b>, <c>, <d>, <e>, <f>.

Bi-level Projection - Example

- Instead of constructing projected databases for each length-1 sequential pattern, we construct a 6*6 lower triangular matrix M, as shown in Table 3.

Sequence <cc> appears in three sequences in S.

Bi-level Projection - Example

- Instead of constructing projected databases for each length-1 sequential pattern, we construct a 6*6 lower triangular matrix M, as shown in Table 3.

supports(<ac>) = 4, supports(<ca>) = 2 and supports(<(ac)>) = 1.

Bi-level Projection - Example

- The <ab>-projected database contains three sequences: <(_c)(ac)(cf)>, <(_c)a> and <c>.
- Three frequent items are found: <a>, <c> and <(_c)>.

Length-2 pattern <(_c)a> can be generated

Bi-level Projection - Example

- Using level-by-levelprojection, to find the complete set of 53 sequential patterns, 53 projected databases are constructed.
- Only 22 projected databases are constructed by bi-level projection.

Optimization

“do we need to include every item in a postfix in the projected databases?”

Pseudo-Projection

- By examining a set of projected databases, one can observe that postfixes of a sequence often appear repeatedly in recursive projected databases.
- Sequence <a(abc)(ac)d(cf)> has postfixes <(abc)(ac)d(cf)> and <(_c)(ac)d(cf)> as projections in <a>- and <ab>-projected databases, respectively.

Pseudo-Projection

- Every projection consists of two pieces of information: pointer to the sequence in database and offset of the postfix in the sequence.

Pseudo-Projection

- For example, suppose the sequence database S in Table 1 can be held in main memory.
- When constructing <a>-projected database, the projection of sequence s1 = <a(abc)(ac)d(cf)> consists two pieces: a pointer to s1 and offset set to 2, i.e., postfix (abc)(ac)d.
- <ab>-projected database contains a pointer to s1 and offset set to 4.

Dataset

- The number of items is set to 1000.
- There are 10000 sequences in the data set.
- The average number of items within elements is set to 8.
- The average number of elements in a sequence is set to 8.

Performance

GSP：The GSP algorithm

FreeSpan：FreeSpan with alternative level projection

PrefixSpan-1：PrefixSpan withlevel-by-level projection

PrefixSpan-2：PrefixSpan withbi-level projection

- Both FreeSpan and PrefixSpan win GSP
- PrefixSpan methods are more efficient than FreeSpan
- When the support threshold is low, PrefixSpan-1 requires a major effort to generate projected databases.
- The performance curves of PrefixSpan-1 and PrefixSpan-2 are close when support threshold is not low

Figure 1

Performance (Cont.)

- This figure shows that using pseudo-projections for the projected databases that can be held in main memory improves efficiency of PrefixSpan further.
- pseudo-projection improves performance when the projected database can be held in main memory
- A related question becomes:
- “Can such a method be extended to disk-based processing?”

Figure 2

Performance (Cont.)

- This figure shows the I/O costs of PrefixSpan-1 and PrefixSpan-2 as well as of their pseudo-projection variations

- Dataset
- The number of items is set to 1000.
- There are 1 million sequences in the data set.
- The average number of items within elements is set to 8.
- The average number of elements in a sequence is set to 8.

Figure 3

Performance (Cont.)

- This figure shows the scalability of PrefixSpan-1 and PrefixSpan-2 with respect to the number of sequences.
- Both are linearly scalable
- Since the support threshold is set to 0.20%, PrefixSpan-2 performs better.

Figure 4

Conclusions

- PrefixSpanis efficient pattern growth method.
- The performance of PrefixSpan is batter then GSP and FreeSpan.
- Prefix-projection reduces the size of projected database and leads to efficient processing.
- Bi-level projection and pseudo-projection improve sequential pattern mining efficiency.

Reference

- [1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB’94), pages 487–499, Santiago, Chile, Sept. 1994.
- [2] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering (ICDE’95), pages 3–14, Taipei, Taiwan, Mar. 1995.
- [3] C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32–38, 1998.
- [4] M. Garofalakis, R. Rastogi, and K. Shim. Spirit: Sequential pattern mining with regular expression constraints. In Proc. 1999 Int. Conf. Very Large Data Bases (VLDB’99), pages 223–234, Edinburgh, UK, Sept. 1999.
- [5] J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In Proc. 1999 Int. Conf. Data Engineering (ICDE’99), pages 106–115, Sydney, Australia, Apr. 1999.
- [6] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. Freespan: Frequent pattern-projected sequential pattern mining. In Proc. 2000 Int. Conf. Knowledge Discovery and Data Mining (KDD’00), pages 355–359, Boston, MA, Aug. 2000.
- [7] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’00), pages 1– 12, Dallas, TX, May 2000.
- [8] H. Lu, J. Han, and L. Feng. Stock movement and ndimensional inter-transaction association rules. In Proc. 1998 SIGMOD Workshop Research Issues on Data Mining and Knowledge Discovery (DMKD’98), pages 12:1–12:7, Seattle, WA, June 1998.
- [9] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259–289, 1997.
- [10] B. O¨ zden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In Proc. 1998 Int. Conf. Data Engineering (ICDE’98), pages 412–421, Orlando, FL, Feb. 1998.
- [11] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. 5th Int. Conf. Extending Database Technology (EDBT’96), pages 3–17, Avignon, France, Mar. 1996.

Discussions

- The database is static. If any data is added for mining, there will be some problems when doing PrefixSpan.
- If the database is dynamic, it will be too large. We have to consider old data while mining patterns.
- PrefixSpan cannot take duration into account for an item.

Discussions

- The database is static. If any data is added for mining, there will be some problems when doing PrefixSpan.
- Incremental database, data steam
- If the database is dynamic, it will be too large. We have to consider old data while mining patterns.
- Progressive database
- PrefixSpancannot take duration into account for an item.
- Time interval database

Download Presentation

Connecting to Server..