One Small Step for You, One Giant Leap for Me

Jen-Wei Huang 黃仁暐 jwhuang@gmail.com National Taiwan University

* http://www.wretch.cc/blog/EtudeBIKE
* http://www.giant-bicycles.com/zh-TW/
* http://cape7.pixnet.net/blog
* http://www.wretch.cc/blog/orzboyz
* http://blog.sina.com.tw/9winds/
* http://atomcinema.pixnet.net/blog
* http://www.amazon.com
* http://www.hq.nasa.gov/office/pao/History/ap11ann/kippsphotos/apollo.html

A General Model for Sequential Pattern Mining with a Progressive Database Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou and Ming-Syan Chen National Taiwan University * IEEE Trans. on Knowledge and Data Engineering, Vol. 20, No. 6, June 2008

Outline • Introduction • Preliminaries • Algorithm Pisa • Experiments • Conclusions • Q & A

Introduction to SPM • Sequential pattern mining (SPM): “Mining of frequently occurring patterns related to time or other sequences.” (J. Han, Data Mining: Concepts and Techniques) • “Given a set of sequences, find the complete set of frequent subsequences.” (J. Pei, PrefixSpan) • Example: which items will a customer buy after having bought certain other items?

Time-related data • Customers’ buying behavior • Natural phenomena • Sensor network data • Web access patterns • Stock price changes • DNA sequence applications

Definition • Let I = {x1, x2, ..., xn} be a set of distinct items. • An element e, denoted by (xi xj ...), is a subset of I whose items appear in a sequence at the same time. • A sequence s, denoted by <e1, e2, ..., em>, is an ordered list of elements. • A sequence database Db contains a set of sequences, and |Db| is the number of sequences in Db.

Definition • A sequence α = <a1, a2, ..., an> is a subsequence of another sequence β = <b1, b2, ..., bm> if there exist integers 1 ≤ i1 < i2 < ... < in ≤ m such that a1 ⊆ b_{i1}, a2 ⊆ b_{i2}, ..., and an ⊆ b_{in}.
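The subsequence definition can be checked mechanically. The following is a minimal sketch, assuming items are single-character strings and each element is a frozenset of items; the names `is_subsequence` and `seq` are illustrative and do not come from the slides.

```python
from typing import FrozenSet, List, Sequence

Element = FrozenSet[str]


def is_subsequence(alpha: Sequence[Element], beta: Sequence[Element]) -> bool:
    """Return True if alpha is a subsequence of beta."""
    j = 0  # next position in beta that may still be matched
    for a in alpha:
        # Greedily take the earliest element of beta that contains a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the matched indices i1 < i2 < ... must strictly increase
    return True


def seq(*elements: str) -> List[Element]:
    """Hypothetical helper: seq("A", "BC") builds the sequence <(A) (B C)>."""
    return [frozenset(e) for e in elements]


print(is_subsequence(seq("A", "C"), seq("AB", "D", "C")))  # <(A)(C)> in <(AB)(D)(C)>
print(is_subsequence(seq("AC"), seq("A", "C")))            # (A C) must sit in one element
```

Matching each element of α against the earliest possible element of β is sufficient: if any valid index assignment exists, the greedy one exists too.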

Definition • Sequential pattern mining can be defined as follows: • "Given a sequence database, Db, and a user-defined minimum support, min_sup, find the complete set of subsequences whose occurrence frequencies are ≥ min_sup ∗ |Db|."
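To make the threshold concrete, here is a small sketch that counts how many sequences of a database contain a pattern and compares the count with min_sup ∗ |Db|. The database `toy_db` is made up for illustration, and the function names are assumptions rather than anything defined in the slides.

```python
from typing import FrozenSet, Sequence, Tuple

Element = FrozenSet[str]
Seq = Tuple[Element, ...]


def is_subsequence(alpha: Sequence[Element], beta: Sequence[Element]) -> bool:
    """alpha is a subsequence of beta (elements matched in order by set inclusion)."""
    it = iter(beta)  # a shared iterator keeps the matched positions increasing
    return all(any(a <= b for b in it) for a in alpha)


def support_count(pattern: Seq, db: Sequence[Seq]) -> int:
    """Number of sequences in db that contain the pattern."""
    return sum(1 for s in db if is_subsequence(pattern, s))


def is_frequent(pattern: Seq, db: Sequence[Seq], min_sup: float) -> bool:
    """Occurrence frequency >= min_sup * |Db|."""
    return support_count(pattern, db) >= min_sup * len(db)


def seq(*elements: str) -> Seq:
    """Hypothetical helper: seq("A", "BC") builds <(A) (B C)>."""
    return tuple(frozenset(e) for e in elements)


# Made-up database of four sequences.
toy_db = [seq("A", "B"), seq("A", "C", "B"), seq("B", "A"), seq("AB")]
pattern = seq("A", "B")
print(support_count(pattern, toy_db), is_frequent(pattern, toy_db, 0.5))
```

Only the first two sequences contain <(A) (B)>: in the third the order is reversed, and in the fourth A and B occur in the same element.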

Three Categories • Depending on how the underlying database is managed, sequential pattern mining can be divided into three categories, namely sequential pattern mining with • a static database • an incremental database • a progressive database

How to Do Sequential Pattern Mining on a Static Database: An Overview

How? • Apriori-like algorithms: AprioriAll (Agrawal et al.), GSP (R. Srikant et al.) • Partition-based algorithms: FreeSpan (J. Han et al.), PrefixSpan (J. Pei et al.) • Vertical-format algorithms: SPADE (Zaki et al.), SPAM (Ayres et al.)

Apriori-like Algorithms • 1. Sort phase: sort the database with customer ID as the primary key and transaction time as the secondary key. • 2. Litemset phase: count the support of each itemset, that is, the fraction of customers who bought the itemset; itemsets that meet the minimum support are the large itemsets (litemsets).
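The first two phases can be sketched as follows. This is an illustrative toy, not the AprioriAll implementation: it assumes transactions are `(customer_id, time, items)` tuples, enumerates every subset of every transaction (only viable for tiny inputs), and uses made-up data and function names.

```python
from collections import defaultdict
from itertools import combinations


def sort_phase(transactions):
    """Group transactions into per-customer sequences ordered by time."""
    sequences = defaultdict(list)
    # Customer id is the primary sort key, transaction time the secondary key.
    for cid, _time, items in sorted(transactions, key=lambda t: (t[0], t[1])):
        sequences[cid].append(frozenset(items))
    return dict(sequences)


def litemset_phase(sequences, min_sup):
    """Return {itemset: support} for itemsets bought by >= min_sup of customers."""
    counts = defaultdict(int)
    for elements in sequences.values():
        seen = set()  # each customer counts at most once per itemset
        for element in elements:
            items = sorted(element)
            for k in range(1, len(items) + 1):
                for combo in combinations(items, k):
                    seen.add(frozenset(combo))
        for itemset in seen:
            counts[itemset] += 1
    n = len(sequences)
    return {s: c / n for s, c in counts.items() if c >= min_sup * n}


# Made-up transactions, deliberately out of order.
transactions = [
    (2, 3, {"c"}),
    (1, 2, {"b"}),
    (1, 1, {"a"}),
    (2, 1, {"a", "b"}),
    (3, 1, {"a"}),
]
sequences = sort_phase(transactions)
litemsets = litemset_phase(sequences, min_sup=0.5)
print(sequences)
print(litemsets)
```

Note that customer 1 bought `a` and `b` in different transactions, so only customer 2 supports the itemset `{a, b}` and it does not reach the 50% threshold.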

Apriori-like Algorithms • 3. Transformation phase: replace each transaction with the litemsets it contains, giving customer sequences of the form
C01: <(1,5) (2) (3) (4)>
C02: <(1) (3) (4) (3,5)>
C03: <(1) (2) (3) (4)>
C04: <(1) (3) (5)>
C05: <(4) (5)>

Apriori-like Algorithms • 4. Mining phase: run an Apriori-like algorithm over the transformed sequences. • 5. Maximal phase: find the maximal sequential patterns.

Therefore, the frequent sequential patterns are: <1 2> <3 4> <3 5> <3 6> <3 7> <4 6> <5 6> <7 6> <3 4 6> <3 5 6> <3 7 6>
According to the mappings, the original frequent sequential patterns are: <10 20> <30 40> <30 70> <30 90> <30 {40 70}> <40 90> <70 90> <{40 70} 90> <30 40 90> <30 70 90> <30 {40 70} 90>

Because <30 40> and <30 70> are contained in <30 {40 70}>, <40 90> and <70 90> are contained in <{40 70} 90>, and <30 40 90> and <30 70 90> are contained in <30 {40 70} 90>, these patterns are not maximal. <30 90>, <30 {40 70}> and <{40 70} 90> are in turn contained in <30 {40 70} 90>, so the final maximal sequential patterns are: <10 20> <30 {40 70} 90>
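A minimal sketch of the maximal phase, assuming the standard definition that a pattern is maximal when no other frequent pattern contains it. The input is the list of eleven original frequent sequential patterns given above; the helper names are illustrative.

```python
def is_subsequence(alpha, beta):
    """alpha is contained in beta (elements matched in order by set inclusion)."""
    it = iter(beta)  # a shared iterator keeps the matched positions increasing
    return all(any(a <= b for b in it) for a in alpha)


def maximal_patterns(patterns):
    """Keep only the patterns that are not contained in any other pattern."""
    return [
        p for p in patterns
        if not any(p != q and is_subsequence(p, q) for q in patterns)
    ]


def seq(*elements):
    """Hypothetical helper: seq({30}, {40, 70}) builds <30 {40 70}>."""
    return tuple(frozenset(e) for e in elements)


frequent = [
    seq({10}, {20}),
    seq({30}, {40}),
    seq({30}, {70}),
    seq({30}, {90}),
    seq({30}, {40, 70}),
    seq({40}, {90}),
    seq({70}, {90}),
    seq({40, 70}, {90}),
    seq({30}, {40}, {90}),
    seq({30}, {70}, {90}),
    seq({30}, {40, 70}, {90}),
]

for pattern in maximal_patterns(frequent):
    print([sorted(element) for element in pattern])
```

The quadratic pairwise check is fine for a handful of patterns; a real implementation would typically process patterns from longest to shortest to prune earlier.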

Related Work • Static database: AprioriAll (Agrawal et al.), GSP (R. Srikant et al.), SPADE (Zaki et al.), FreeSpan (J. Han et al.), PrefixSpan (J. Pei et al.), SPAM (Ayres et al.)

Related Work • Incremental database: ISM (Parthasarathy et al.), IncSP (Lin et al.), ISE (Masseglia et al.), IncSpan (Cheng et al.), MILE (Chen et al.)

Motivation • The assumption of having a static database may not hold in practice. • Real-world data change on the fly. • Sequential patterns found in an incremental database may be of little interest to users. • Users are usually more interested in recent data than in old data.

Motivation • If a sequence receives no newly arriving elements, it still stays in the database and undesirably contributes to |Db|. • As a result, new sequential patterns that appear frequently in recent sequences may not be reported as frequent sequential patterns.

Definition -- Period of Interest • The Period of Interest (POI) is a sliding window whose length is a user-specified time interval and which advances continuously as time goes by. • Sequences having elements whose timestamps fall into the POI contribute to |Db| for the current sequential patterns.
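The effect of the POI on |Db| can be illustrated with a short sketch. It assumes a sequence is stored as a list of `(timestamp, element)` pairs and that the window covers timestamps p through q inclusive; the function name `sub_database` and the data are made up for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

TimedSeq = List[Tuple[int, FrozenSet[str]]]


def sub_database(db: Dict[str, TimedSeq], p: int, q: int) -> Dict[str, TimedSeq]:
    """Keep only elements with p <= timestamp <= q; drop sequences left empty."""
    window = {}
    for sid, elements in db.items():
        kept = [(t, e) for t, e in elements if p <= t <= q]
        if kept:  # only sequences with an element inside the POI count
            window[sid] = kept
    return window


# Made-up progressive database.
toy_db = {
    "S01": [(1, frozenset("A")), (2, frozenset("B"))],
    "S02": [(3, frozenset("C")), (7, frozenset("A"))],
    "S03": [(8, frozenset("D"))],
}

poi = 5
for p in (1, 4):
    q = p + poi - 1
    print(f"Db{p},{q}: |Db| = {len(sub_database(toy_db, p, q))}")
```

With the window at timestamps 1 to 5, S03 has not arrived yet; with the window at 4 to 8, S01 has become obsolete and no longer contributes to |Db|.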

[Figure: example progressive sequence database. Rows: sequences S01 to S06 (SID); columns: timestamps t1 to t10 (time). Sliding sub-databases Db1,5, Db2,6, Db3,7, Db4,8, Db5,9 and Db6,10, with POI = 5 and min_supp = 0.5.]

Outline • Introduction • Preliminaries • Algorithm Pisa • Experiments • Conclusions • Q & A

Progressive Sequential Pattern • The progressive sequential pattern mining problem is defined as follows: • "Given a progressive sequence database, a user-specified period of interest, and a user-defined minimum support threshold, find the complete set of frequent subsequences whose occurrence frequencies are greater than or equal to the minimum support times the number of sequences in every period of interest of the database."

Naïve Algorithm • Use a conventional static sequential pattern mining algorithm to mine sequential patterns separately from the sub-database of every POI • e.g., Db1,5, Db2,6, Db3,7, Db4,8, Db5,9, etc. • For a sequence database whose elements appear in an interval of n timestamps, the total number of POIs in this interval is (n − POI + 1).
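The naïve approach can be sketched as a loop over all POI windows. The static miner is passed in as a parameter (`mine_static` is a placeholder, not a specific algorithm from the slides), sequences are assumed to be lists of `(timestamp, element)` pairs, and the data are made up.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

TimedSeq = List[Tuple[int, FrozenSet[str]]]


def poi_windows(n: int, poi: int) -> List[Tuple[int, int]]:
    """All windows (p, q) of length poi inside timestamps 1..n."""
    return [(p, p + poi - 1) for p in range(1, n - poi + 2)]


def mine_every_poi(db: Dict[str, TimedSeq], n: int, poi: int,
                   mine_static: Callable[[list], object]) -> dict:
    """Re-run a static miner from scratch on the sub-database of each POI."""
    results = {}
    for p, q in poi_windows(n, poi):
        window_db = []
        for elements in db.values():
            kept = [e for t, e in elements if p <= t <= q]
            if kept:
                window_db.append(kept)
        results[(p, q)] = mine_static(window_db)
    return results


print(poi_windows(10, 5))  # n - POI + 1 = 6 windows

# Made-up data; len stands in for a real static miner and just reports |Db|.
toy_db = {
    "S01": [(1, frozenset("A")), (2, frozenset("B"))],
    "S02": [(3, frozenset("C")), (7, frozenset("A"))],
    "S03": [(8, frozenset("D"))],
}
print(mine_every_poi(toy_db, n=10, poi=5, mine_static=len))
```

Every window is mined independently, so work done for one POI is never reused for the next one even though consecutive windows overlap in POI − 1 timestamps.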

Prior Work • The only prior work on progressive databases is GSP+ and MFS+, proposed by Zhang and based on the static algorithms GSP and MFS (also derived by the same authors). • However, these algorithms still have to re-mine each sub-database using the static algorithms GSP and MFS. • Moreover, the performance improvement of GSP+ and MFS+ over GSP and MFS is only within 15%, as reported by their authors.

Algorithm DirApp • Stands for Direct Append. • Consists of two procedures • Progressively Updating • abbreviated as PrUp • Immediately Filtering • abbreviated as ImFi

Procedure PrUp • When progressively reading newly incoming elements, Procedure PrUp can • update each sequence in the sequence database • generate candidate sequential patterns • calculate the occurrence frequencies of all candidate sequential patterns in the current POI.

Procedure ImFi • DirApp uses Procedure ImFi to • filter out obsolete data from the existing sequence database • prune away obsolete candidate sequential patterns from the candidate set • report the most up-to-date frequent sequential patterns to the user in every POI.
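The data-maintenance side of these two procedures can be illustrated with a simplified sketch. This is not DirApp itself: candidate generation, frequency counting and candidate pruning are omitted, and the class and method names are hypothetical. It only shows appending newly arrived elements at each timestamp and filtering out elements that have fallen outside the POI.

```python
from collections import deque


class SlidingSequenceDB:
    """Keeps, per sequence, only the elements inside the current POI."""

    def __init__(self, poi: int):
        self.poi = poi
        self.sequences = {}  # sid -> deque of (timestamp, element)

    def advance(self, now: int, arrivals: dict) -> int:
        """Append elements arriving at `now`, drop obsolete ones, return |Db|."""
        # Updating step: append each newly arriving element to its sequence.
        for sid, items in arrivals.items():
            self.sequences.setdefault(sid, deque()).append((now, frozenset(items)))
        # Filtering step: the POI now covers timestamps now - poi + 1 .. now.
        oldest = now - self.poi + 1
        for sid in list(self.sequences):
            elements = self.sequences[sid]
            while elements and elements[0][0] < oldest:
                elements.popleft()
            if not elements:  # sequence no longer contributes to |Db|
                del self.sequences[sid]
        return len(self.sequences)


db = SlidingSequenceDB(poi=3)
sizes = [
    db.advance(1, {"S01": "A"}),
    db.advance(2, {"S02": "BC"}),
    db.advance(3, {}),
    db.advance(4, {"S01": "D"}),
    db.advance(5, {}),
]
print(sizes)
```

Because elements arrive in timestamp order, each sequence's obsolete elements are always at the front of its deque, so filtering costs time proportional to the number of elements removed.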

[Figure: DirApp example. The progressive sequence database (sequences S01 to S06 over timestamps t1 to t10) and a step-by-step trace, steps (1) to (6), of an example sequence with elements A, B, C, AD, B along the time axis.]
