PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth

Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu

17th ICDE'01

  • Group members: 沈郁棋, 簡志佳, 吳永斌, 曾文彥, 沈家譽

Outline
  • Introduction
  • FreeSpan
  • PrefixSpan algorithm
  • Improvement of PrefixSpan
  • Experimental results and performance
  • Conclusion
  • References
  • Discussions
Introduction
  • What is a sequence and sequential pattern mining?
  • Applications of sequential pattern mining
  • Previous methods:
    • GSP algorithm
    • FreeSpan algorithm
  • New method proposed in this paper:
    • PrefixSpan
Sequence and subsequence
  • A sequence is an ordered list of itemsets.
  • Each sequence consists of a list of elements, and each element consists of a set of items.
  • Example: α = <a(abc)(ac)d(cf)> is a sequence with five elements: a, (abc), (ac), d and (cf).
  • An element may contain one item or many items. Items within an element are unordered, and we list them alphabetically.

Sequence and subsequence (Cont.)
  • If α = <a(abc)(ac)d(cf)> is a sequence, then β = <a(bc)df> is a subsequence of α.
  • The number of items in a sequence is the length of the sequence. For example, the length of α is 9.
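
The definitions above can be sketched in Python. This is only an illustration, not code from the slides: a sequence is modelled as a list of sets (one set per element), and the helper names `is_subsequence` and `length` are assumptions made for this example.

```python
def is_subsequence(beta, alpha):
    """Return True if beta is a subsequence of alpha.

    Both arguments are lists of sets; each set is one element (itemset).
    """
    pos = 0
    for elem in beta:
        # Greedily match each element of beta to the earliest superset in alpha.
        while pos < len(alpha) and not elem <= alpha[pos]:
            pos += 1
        if pos == len(alpha):
            return False
        pos += 1
    return True


def length(seq):
    """Number of items in the sequence, summed over its elements."""
    return sum(len(elem) for elem in seq)


alpha = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]  # <a(abc)(ac)d(cf)>
beta = [{"a"}, {"b", "c"}, {"d"}, {"f"}]                         # <a(bc)df>

print(is_subsequence(beta, alpha))  # True
print(length(alpha))                # 9
```

The greedy left-to-right match is sufficient here because matching an element as early as possible never rules out a later match.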

[Figure: transaction database vs. sequence database]

Sequential Pattern Mining
  • Proposed by R. Agrawal and R. Srikant in 1995
  • Given a user-specified minimum support threshold, sequential pattern mining finds all frequent subsequences, i.e. the subsequences whose support is at least the threshold.
  • Applications:
    • Analyses of customer purchase behavior
    • Web access patterns
    • Scientific experiments
    • Disease treatments
    • Natural disasters
    • DNA sequences
GSP Algorithm
  • An Apriori-like method
  • Example (min_sup = 2):

SID   Sequence
10    <(bd)cb(ac)>
20    <(bf)(ce)b(fg)>
30    <(ah)(bf)abf>
40    <(be)(ce)d>
50    <a(bd)bcb(ade)>

  • 1st scan: 8 candidates, 6 length-1 sequential patterns
    • Candidates: <a> <b> <c> <d> <e> <f> <g> <h>
  • 2nd scan: 51 candidates, 19 length-2 sequential patterns
    • Candidates: <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
  • 3rd scan: 46 candidates, 19 length-3 sequential patterns
    • Candidates: <abb> <aab> <aba> <baa> <bab> …
  • The patterns found in each scan form the seed set that generates the candidates for the next scan.
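
The first scan of this example can be reproduced with a short Python sketch. It is an illustration only: the dictionary layout and variable names are assumptions, and each element is written as a string of single-letter items.

```python
from collections import Counter

# The five-sequence database of the GSP example, keyed by SID.
db = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}
min_sup = 2

support = Counter()
for seq in db.values():
    # An item is counted at most once per sequence.
    support.update({item for elem in seq for item in elem})

frequent = sorted(item for item, n in support.items() if n >= min_sup)
print(len(support))  # 8 candidates
print(frequent)      # ['a', 'b', 'c', 'd', 'e', 'f']
```

Items g and h each occur in a single sequence, which is why only six of the eight candidates survive.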

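Disadvantages of GSP
  • Potentially huge set of candidate sequences
    • An Apriori-based method may generate a large set of candidate sequences, because it considers all possible permutations of the elements.
    • For example, if there are 1000 frequent sequences of length 1, <a_1>, <a_2>, …, <a_1000>, the number of length-2 candidates will be

$$1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$$

    • The first term is derived from the set <a_1a_1>, <a_1a_2>, …, <a_1000a_1000>; the second term is derived from the set <(a_1a_2)>, <(a_1a_3)>, …, <(a_999a_1000)>.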
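
The candidate count can be checked with a few lines of Python. This sketch assumes the counting used in the original PrefixSpan paper: n × n ordered pairs <xy> plus n(n-1)/2 unordered pairs <(xy)> within one element. The function name `length2_candidates` is made up for this example.

```python
def length2_candidates(n):
    """Length-2 candidates generated from n frequent length-1 sequences.

    n * n sequences <xy> (including <xx>) plus
    n * (n - 1) / 2 single-element sequences <(xy)>.
    """
    return n * n + n * (n - 1) // 2


print(length2_candidates(6))     # 51, as in the second scan of the GSP example
print(length2_candidates(1000))  # 1499500
```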

Disadvantages of GSP (Cont.)
  • Multiple scans of databases
    • The length of each candidate sequence grows by one at each database scan.
    • To find <(abc) (abc) (abc) (abc) (abc)>, GSP must scan the database at least 15 times.
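Disadvantages of GSP (Cont.)
  • Difficulties in mining long sequential patterns
    • Suppose there is only a single sequence of length 100, <a_1a_2…a_100>, and min_sup = 1.
    • Length-1 candidate sequences: $100$
    • Length-2 candidate sequences: $100 \times 100 + \frac{100 \times 99}{2} = 14{,}950$
    • Length-3 candidate sequences: $\binom{100}{3} = 161{,}700$
    • Total: more than $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$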
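
The numbers for this example can be verified numerically. The sketch below assumes the candidate counts stated in the original PrefixSpan paper (100 length-1 candidates, 100 × 100 + 100 × 99 / 2 length-2 candidates, C(100, 3) length-3 candidates, and a lower bound of the sum of C(100, i) for the total); the variable names are illustrative.

```python
from math import comb

n = 100  # length of the single sequence in the database

length1 = n
length2 = n * n + n * (n - 1) // 2
length3 = comb(n, 3)
# Lower bound on the total number of candidates over all lengths.
lower_bound = sum(comb(n, i) for i in range(1, n + 1))

print(length1, length2, length3)  # 100 14950 161700
print(lower_bound == 2**n - 1)    # True
print(f"{lower_bound:.2e}")       # 1.27e+30
```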
FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining

  • A divide-and-conquer approach
    • Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns.
    • Mine each projected database to find its patterns.
FreeSpan

  • Finding f_list:
      • Find the set of frequent items and list them in support-descending order.
  • Items: a, b, c, d, e, f, g
  • (item, support): (a,4), (b,4), (c,4), (d,3), (e,3), (f,3), (g,1)
  • Support threshold: 2
  • f_list = a:4, b:4, c:4, d:3, e:3, f:3
  • SID: sequence_id
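
As a rough sketch, f_list can be computed in one pass over the database. The four sequences used here are the ones listed on the {b}-projected database slide; the list-of-strings encoding, the variable names, and the alphabetical tie-break among items with equal support are assumptions of this example.

```python
from collections import Counter

# Each sequence is a list of elements; each element is a string of items.
db = [
    ["a", "abc", "ac", "d", "cf"],     # <a(abc)(ac)d(cf)>
    ["ad", "c", "bc", "ae"],           # <(ad)c(bc)(ae)>
    ["ef", "ab", "df", "c", "b"],      # <(ef)(ab)(df)cb>
    ["e", "g", "af", "c", "b", "c"],   # <eg(af)cbc>
]
min_sup = 2

support = Counter()
for seq in db:
    # Count each item once per sequence, however often it occurs.
    support.update(set("".join(seq)))

# Keep frequent items, sorted by descending support, then alphabetically.
f_list = sorted(
    ((item, n) for item, n in support.items() if n >= min_sup),
    key=lambda pair: (-pair[1], pair[0]),
)
print(f_list)
# [('a', 4), ('b', 4), ('c', 4), ('d', 3), ('e', 3), ('f', 3)]
```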

FreeSpan

f_list = a:4, b:4, c:4, d:3, e:3, f:3

The complete set of sequential patterns in the sequence database can be divided into 6 disjoint subsets:

  • The ones containing only item a.
  • The ones containing item b but no items after b in f_list.
  • The ones containing item c but no items after c in f_list.
  • The ones containing item d but no items after d in f_list.
  • The ones containing item e but no items after e in f_list.
  • The ones containing item f.

FreeSpan

Finding sequential patterns containing only item a.

By scanning the sequence database once, the only two such sequential patterns are found: <a> and <aa>.

  • <a> = <(a)>: support 4 > threshold 2, frequent
  • <aa>: support 2 = threshold 2, frequent
  • <aaa>: support 1 < threshold 2, not frequent

FreeSpan

Finding sequential patterns containing item b but no item after b in f_list.

This can be achieved by constructing the {b}-projected database:

  • Find the sequences in the sequence database that contain item b.
  • Remove all items after b in f_list (a:4, b:4, c:4, d:3, e:3, f:3), as well as infrequent items.

Original sequence      Projected sequence
<a(abc)(ac)d(cf)>      <a(ab)(a)> = <a(ab)a>
<(ad)c(bc)(ae)>        <(a)(b)(a)> = <aba>
<(ef)(ab)(df)cb>       <(ab)b>
<eg(af)cbc>            <(a)b> = <ab>
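
The construction of the {b}-projected database can be sketched as follows. This is a simplified illustration of the two steps on this slide, not the FreeSpan implementation; the function name `project` and the list-of-strings encoding are assumptions.

```python
def project(db, item, f_list):
    """Build a FreeSpan-style {item}-projected database (simplified sketch).

    Keeps only sequences containing `item`, and within them only the items
    up to and including `item` in f_list. Infrequent items are not in f_list,
    so they are dropped as well.
    """
    keep = set(f_list[: f_list.index(item) + 1])
    projected = []
    for seq in db:
        if not any(item in elem for elem in seq):
            continue
        reduced = ["".join(x for x in elem if x in keep) for elem in seq]
        # Elements that lost all their items disappear from the sequence.
        projected.append([elem for elem in reduced if elem])
    return projected


db = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]
f_list = ["a", "b", "c", "d", "e", "f"]

print(project(db, "b", f_list))
# [['a', 'ab', 'a'], ['a', 'b', 'a'], ['ab', 'b'], ['a', 'b']]
```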

FreeSpan

Scanning the {b}-projected database once more finds all sequential patterns containing item b but no item after b in f_list.

{b}-projected database

<a(ab)a>

<aba>

<(ab)b>

<ab>

<b>, <ab>, <ba>, <(ab)>

Support=3

The other subsets of sequential patterns are then found in the same way.

FreeSpan

  • FreeSpan is more efficient than GSP.
  • The major cost of FreeSpan is dealing with projected databases. If a pattern appears in every sequence of a database, its projected database does not shrink.
  • Example: with f_list = a:4, b:4, c:4, d:3, e:3, f:3, the {f}-projected database keeps every item of its sequences except the infrequent item g:
      • <a(abc)(ac)d(cf)>
      • <(ef)(ab)(df)cb>
      • <e(af)cbc>


PrefixSpan

(Prefix-Projected Sequential Pattern Growth)

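Prefix
  • Given two sequences α = <a_1a_2…a_n> and β = <b_1b_2…b_m>, with m ≤ n.
  • Sequence β is called a prefix of α if and only if:
    • b_i = a_i for i ≤ m-1;
    • b_m ⊆ a_m;
    • all items in (a_m - b_m) are alphabetically after those in b_m.
  • Example:
    • α = <a(abc)(ac)d(cf)>
    • β = <a(abc)a>
  • Aligned element by element: α = <a_1a_2…a_{m-1}a_m…a_n>, β = <b_1b_2…b_{m-1}b_m>.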
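
A prefix test can be sketched in Python. The first two checks follow the conditions on this slide; the third check follows the definition in the original PrefixSpan paper, which additionally requires that the items of a_m left out of b_m come alphabetically after those in b_m. The function name and the encoding (each element a string of sorted items) are assumptions of this example.

```python
def is_prefix(beta, alpha):
    """Return True if beta is a prefix of alpha.

    Both are lists of elements; each element is a string of sorted items.
    """
    m = len(beta)
    if m == 0 or m > len(alpha):
        return False
    # Condition 1: the first m-1 elements are identical.
    if beta[:-1] != alpha[: m - 1]:
        return False
    last_b, last_a = set(beta[-1]), set(alpha[m - 1])
    # Condition 2: the last element of beta is contained in a_m.
    if not last_b <= last_a:
        return False
    # Condition 3 (from the paper): leftover items sort after those in b_m.
    return all(x > max(last_b) for x in last_a - last_b)


alpha = ["a", "abc", "ac", "d", "cf"]          # <a(abc)(ac)d(cf)>
print(is_prefix(["a", "abc", "a"], alpha))     # True
print(is_prefix(["a", "ab"], alpha))           # True
print(is_prefix(["a", "bc"], alpha))           # False: a would be left over before b
print(is_prefix(["a", "b"], alpha))            # False
```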

Postfix (Projection)
  • Let α' = <a_1a_2…a_n> be the projection of α w.r.t. prefix β = <a_1a_2…a_{m-1}a'_m> (m ≤ n).
  • Sequence γ = <a''_m a_{m+1}…a_n> is called the postfix of α w.r.t. prefix β, denoted as γ = α/β, where a''_m = (a_m - a'_m).
  • We also denote α = β⋅γ.
    • Example:
      • α' = <a(abc)(ac)d(cf)>,
      • β = <a(abc)a>,
      • γ = <(_c)d(cf)>.
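
The postfix definition can be illustrated with a small Python sketch. It assumes that β is already known to be a prefix of the given sequence, and it marks the leftover part of the split element with a leading underscore, as the slides do. The names `postfix` and `show` are illustrative.

```python
def postfix(alpha, beta):
    """Return alpha / beta, assuming beta is a prefix of alpha.

    Sequences are lists of elements; each element is a string of sorted items.
    """
    m = len(beta)
    # a''_m = a_m - a'_m: the items of the m-th element not used by the prefix.
    rest = "".join(x for x in alpha[m - 1] if x not in beta[-1])
    head = ["_" + rest] if rest else []
    return head + alpha[m:]


def show(seq):
    """Format a sequence in the <a(bc)d> notation used on the slides."""
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in seq) + ">"


alpha = ["a", "abc", "ac", "d", "cf"]          # <a(abc)(ac)d(cf)>
print(show(postfix(alpha, ["a", "abc", "a"])))  # <(_c)d(cf)>
print(show(postfix(alpha, ["a"])))              # <(abc)(ac)d(cf)>
print(show(postfix(alpha, ["a", "ab"])))        # <(_c)(ac)d(cf)>
```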
PrefixSpan – Algorithm
  • Input of the algorithm: a sequence database S and the minimum support threshold min_support.
  • Output of the algorithm: the complete set of sequential patterns.
  • Subroutine: PrefixSpan(α, L, S|α).
  • Parameters:
    • α: a sequential pattern;
    • L: the length of α;
    • S|α: the α-projected database if α ≠ <>; otherwise, the sequence database S.
  • Initial call: PrefixSpan(<>, 0, S).
PrefixSpan - Example
  • Step 1: Find length-1 sequential patterns. (min_sup=2)

length-1 sequential patterns : <a><b><c><d><e><f>

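  • Step 2: Divide the search space.
    • Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
    • The search space is divided by prefix: patterns having prefix <a>, patterns having prefix <b>, and patterns having prefix <c>, …, <f>.
    • Length-2 sequential patterns having prefix <a>: <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2
    • Mining continues recursively on each projected database. For example, <aaa> has support 1 < min_sup, so <aaa> is not a sequential pattern and that branch terminates.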
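
The recursion in this example can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: instead of copying postfixes it stores, for each sequence, the position where the leftmost match of the current prefix ends, and all names (`prefixspan`, `mine`, `s_ext`, `i_ext`) are made up for this sketch. Patterns are returned as tuples of tuples, so `(('a',), ('b',))` stands for <ab> and `(('a', 'b'),)` for <(ab)>.

```python
from collections import defaultdict


def prefixspan(db, min_sup):
    """Mine all sequential patterns; db is a list of sequences (lists of sets)."""
    patterns = {}

    def mine(prefix, proj):
        # proj: sequence index -> position of the element where the leftmost
        # match of `prefix` ends (-1 for the empty prefix).
        last = set(prefix[-1]) if prefix else None
        s_ext = defaultdict(dict)  # item appended as a new element
        i_ext = defaultdict(dict)  # item added to the last element
        for sid, pos in proj.items():
            seq = db[sid]
            for j in range(pos + 1, len(seq)):
                for item in seq[j]:
                    s_ext[item].setdefault(sid, j)
            if last:
                for j in range(pos, len(seq)):
                    if last <= seq[j]:
                        for item in seq[j]:
                            if item > max(last):
                                i_ext[item].setdefault(sid, j)
        for item, new_proj in sorted(s_ext.items()):
            if len(new_proj) >= min_sup:
                grown = prefix + ((item,),)
                patterns[grown] = len(new_proj)
                mine(grown, new_proj)
        for item, new_proj in sorted(i_ext.items()):
            if len(new_proj) >= min_sup:
                grown = prefix[:-1] + (prefix[-1] + (item,),)
                patterns[grown] = len(new_proj)
                mine(grown, new_proj)

    mine((), {sid: -1 for sid in range(len(db))})
    return patterns


db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]
patterns = prefixspan(db, min_sup=2)
print(patterns[(("a",), ("b",))])          # 4, i.e. <ab>:4
print(patterns[(("a", "b"),)])             # 2, i.e. <(ab)>:2
print((("a",), ("a",), ("a",)) in patterns)  # False, <aaa> has support 1
```

No candidate sequences are generated: each recursive call only counts items that actually occur after the current prefix in the projected sequences.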

Efficiency of PrefixSpan
  • No candidate sequence needs to be generated
  • Projected databases keep shrinking
  • Major cost of PrefixSpan: constructing projected databases

-> Can be improved by bi-level projections

Bi-level Projection
  • The major cost of PrefixSpan is to construct projected databases.
  • A bi-level projection scheme is proposed to reduce the number and the size of projected databases.
Bi-level Projection - Example
  • Scan S to find the length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>.
  • Instead of constructing projected databases for each length-1 sequential pattern, we construct a 6 × 6 lower triangular matrix M, as shown in Table 3.
    • A diagonal cell holds one count. For example, M[c, c] = 3 because sequence <cc> appears in three sequences in S.
    • Every other cell holds three counts. For example, M[a, c] = (4, 2, 1) because support(<ac>) = 4, support(<ca>) = 2 and support(<(ac)>) = 1.
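
The cells of M can be reproduced with a brute-force Python sketch. It recomputes each support by a full scan, which is simpler but less efficient than filling the matrix in a single pass; the function names and the dictionary layout keyed by item pairs are assumptions of this example.

```python
def support(db, pattern):
    """Number of sequences in db that contain pattern as a subsequence."""
    def contains(seq):
        pos = 0
        for elem in pattern:
            while pos < len(seq) and not set(elem) <= set(seq[pos]):
                pos += 1
            if pos == len(seq):
                return False
            pos += 1
        return True
    return sum(contains(seq) for seq in db)


def s_matrix(db, items):
    """Lower triangular matrix over frequent items, stored as a dict."""
    m = {}
    for i, x in enumerate(items):
        m[x, x] = support(db, [x, x])              # diagonal: <xx>
        for y in items[:i]:
            m[y, x] = (support(db, [y, x]),        # <yx>
                       support(db, [x, y]),        # <xy>
                       support(db, [y + x]))       # <(yx)>
    return m


db = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]
M = s_matrix(db, ["a", "b", "c", "d", "e", "f"])
print(M["c", "c"])  # 3
print(M["a", "c"])  # (4, 2, 1)
```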

Bi-level Projection - Example
  • The <ab>-projected database contains three sequences: <(_c)(ac)(cf)>, <(_c)a> and <c>.
  • Three frequent items are found: <a>, <c> and <(_c)>.
  • From these, the length-2 pattern <(_c)a> can be generated.

Bi-level Projection - Example
  • Using level-by-level projection, 53 projected databases are constructed to find the complete set of 53 sequential patterns.
  • Only 22 projected databases are constructed by bi-level projection.
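Optimization
  • "Do we need to include every item in a postfix in the projected databases?"
  • Consider the <ac>-projected database. Sequence <cd> appears in only one sequence of S, so it is not a sequential pattern; by the Apriori property, no pattern that extends <ac> with d can be a sequential pattern either.
  • Therefore item d can be excluded from the <ac>-projected database.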
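
The check behind this exclusion can be sketched in Python. The reasoning, an Apriori-style test described in the original paper, is that extending <ac> with d yields either <acd> or <a(cd)>, and neither can be frequent unless <cd> or <(cd)> is. The helper `support` and the variable names are assumptions of this sketch.

```python
def support(db, pattern):
    """Number of sequences in db that contain pattern as a subsequence."""
    def contains(seq):
        pos = 0
        for elem in pattern:
            while pos < len(seq) and not set(elem) <= set(seq[pos]):
                pos += 1
            if pos == len(seq):
                return False
            pos += 1
        return True
    return sum(contains(seq) for seq in db)


db = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]
min_sup = 2

sup_cd = support(db, ["c", "d"])   # <cd>: needed for <acd>
sup_cd_elem = support(db, ["cd"])  # <(cd)>: needed for <a(cd)>
exclude_d = sup_cd < min_sup and sup_cd_elem < min_sup

print(sup_cd, sup_cd_elem)  # 1 0
print(exclude_d)            # True
```

In the bi-level scheme these supports would be read from the matrix M rather than recomputed by scanning the database.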

Pseudo-Projection
  • By examining a set of projected databases, one can observe that postfixes of a sequence often appear repeatedly in recursive projected databases.
  • For example, sequence <a(abc)(ac)d(cf)> has postfixes <(abc)(ac)d(cf)> and <(_c)(ac)d(cf)> as projections in the <a>- and <ab>-projected databases, respectively.
  • Instead of copying postfixes, every projection can be stored as two pieces of information: a pointer to the sequence in the database and the offset of the postfix in the sequence.
  • For example, suppose the sequence database S in Table 1 can be held in main memory.
    • When constructing the <a>-projected database, the projection of sequence s1 = <a(abc)(ac)d(cf)> consists of two pieces: a pointer to s1 and an offset set to 2, i.e., the postfix <(abc)(ac)d(cf)>.
    • The <ab>-projected database contains a pointer to s1 and an offset set to 4, i.e., the postfix <(_c)(ac)d(cf)>.
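
A pseudo-projection entry can be sketched as a (pointer, offset) pair. The sketch below assumes offsets are 1-based item positions, which is consistent with the offsets 2 and 4 in the example; the dictionary used as the in-memory database and the function name `render_postfix` are illustrative.

```python
def render_postfix(seq, offset):
    """Render the postfix of seq that starts at the 1-based item offset."""
    flat = [(e, item) for e, elem in enumerate(seq) for item in elem]
    start = offset - 1
    groups = {}
    for e, item in flat[start:]:
        groups.setdefault(e, []).append(item)
    # The first element is partial if the item just before `start` shares it.
    partial = 0 < start < len(flat) and flat[start - 1][0] == flat[start][0]
    parts = []
    for k, items in enumerate(groups.values()):
        text = "".join(items)
        if k == 0 and partial:
            text = "_" + text
        parts.append(text if len(text) == 1 else f"({text})")
    return "<" + "".join(parts) + ">"


# The sequence is stored once; projections only reference it.
database = {"s1": ["a", "abc", "ac", "d", "cf"]}
proj_a = [("s1", 2)]    # <a>-projected database
proj_ab = [("s1", 4)]   # <ab>-projected database

for pointer, offset in proj_a + proj_ab:
    print(pointer, offset, render_postfix(database[pointer], offset))
# s1 2 <(abc)(ac)d(cf)>
# s1 4 <(_c)(ac)d(cf)>
```

Because only two integers are kept per projected sequence, no postfix is physically copied.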
Dataset
  • The number of items is set to 1000.
  • There are 10000 sequences in the data set.
  • The average number of items within elements is set to 8.
  • The average number of elements in a sequence is set to 8.
Performance
  • Algorithms compared:
    • GSP: the GSP algorithm
    • FreeSpan: FreeSpan with alternative level projection
    • PrefixSpan-1: PrefixSpan with level-by-level projection
    • PrefixSpan-2: PrefixSpan with bi-level projection
  • Both FreeSpan and PrefixSpan outperform GSP.
  • The PrefixSpan methods are more efficient than FreeSpan.
  • When the support threshold is low, PrefixSpan-1 requires a major effort to generate projected databases.
  • The performance curves of PrefixSpan-1 and PrefixSpan-2 are close when the support threshold is not low.

Figure 1

Performance (Cont.)
  • This figure shows that using pseudo-projection for the projected databases that can be held in main memory further improves the efficiency of PrefixSpan.
  • A related question becomes: "Can such a method be extended to disk-based processing?"

Figure 2

Performance (Cont.)
  • This figure shows the I/O costs of PrefixSpan-1 and PrefixSpan-2, as well as of their pseudo-projection variations.
  • Dataset:
    • The number of items is set to 1000.
    • There are 1 million sequences in the data set.
    • The average number of items within elements is set to 8.
    • The average number of elements in a sequence is set to 8.

Figure 3

Performance (Cont.)
  • This figure shows the scalability of PrefixSpan-1 and PrefixSpan-2 with respect to the number of sequences.
  • Both are linearly scalable
  • Since the support threshold is set to 0.20%, PrefixSpan-2 performs better.

Figure 4

Conclusions
  • PrefixSpan is an efficient pattern-growth method.
    • PrefixSpan performs better than GSP and FreeSpan.
    • Prefix-projection reduces the size of projected databases and leads to efficient processing.
    • Bi-level projection and pseudo-projection improve sequential pattern mining efficiency.
References
  • [1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB’94), pages 487–499, Santiago, Chile, Sept. 1994.
  • [2] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering (ICDE’95), pages 3–14, Taipei, Taiwan, Mar. 1995.
  • [3] C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32–38, 1998.
  • [4] M. Garofalakis, R. Rastogi, and K. Shim. Spirit: Sequential pattern mining with regular expression constraints. In Proc. 1999 Int. Conf. Very Large Data Bases (VLDB’99), pages 223–234, Edinburgh, UK, Sept. 1999.
  • [5] J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In Proc. 1999 Int. Conf. Data Engineering (ICDE’99), pages 106–115, Sydney, Australia, Apr. 1999.
  • [6] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. Freespan: Frequent pattern-projected sequential pattern mining. In Proc. 2000 Int. Conf. Knowledge Discovery and Data Mining (KDD’00), pages 355–359, Boston, MA, Aug. 2000.
  • [7] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’00), pages 1–12, Dallas, TX, May 2000.
  • [8] H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. In Proc. 1998 SIGMOD Workshop Research Issues on Data Mining and Knowledge Discovery (DMKD’98), pages 12:1–12:7, Seattle, WA, June 1998.
  • [9] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259–289, 1997.
  • [10] B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In Proc. 1998 Int. Conf. Data Engineering (ICDE’98), pages 412–421, Orlando, FL, Feb. 1998.
  • [11] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. 5th Int. Conf. Extending Database Technology (EDBT’96), pages 3–17, Avignon, France, Mar. 1996.
Discussions
  • PrefixSpan assumes a static database. If data is added after mining, PrefixSpan cannot handle it directly.
    • Related settings: incremental databases, data streams
  • If the database is dynamic, it can grow too large, because old data must still be considered while mining patterns.
    • Related setting: progressive databases
  • PrefixSpan cannot take the duration of an item into account.
    • Related setting: time-interval databases