AutoPlait: Automatic Mining of Co-evolving Time Sequences

AutoPlait: Automatic Mining of Co-evolving Time Sequences Yasuko Matsubara (Kumamoto University) Yasushi Sakurai (Kumamoto University) Christos Faloutsos (CMU) Y. Matsubara et al.

Motivation Given: co-evolving time-series • e.g., MoCap (leg/arm sensors) “Chicken dance” Time Y. Matsubara et al.

Motivation Given: co-evolving time-series • e.g., MoCap (leg/arm sensors) “Chicken dance” Q. Any distinct patterns? Time Q. If yes, how many? Q. What kind? Y. Matsubara et al.

Motivation Challenges: co-evolving sequences • Unknown # of patterns (e.g., beaks) • Different durations … Time Y. Matsubara et al.

Motivation Challenges: co-evolving sequences • Unknown # of patterns (e.g., beaks) • Different durations • Q. Can we summarize it automatically ? … Time Y. Matsubara et al.

Motivation Goal:find patterns that agree with human intuition Time Y. Matsubara et al.

Motivation Goal:find patterns that agree with human intuition AutoPlait: “fully-automatic” mining algorithm Y. Matsubara et al.

Importance of “fully-automatic” No magic numbers! ... because, Big data mining: -> we cannot afford human intervention!! Manual sensitive to the parameter tuning long tuning steps (hours, days, …) Automatic (no magic numbers) - no expert tuning required Y. Matsubara et al.

Outline • Motivation • Problem definition • Compression & summarization • Algorithms • Experiments • AutoPlait at work • Conclusions Y. Matsubara et al.

Problem definition Key concepts • Bundle: • Segment: • Regime: • Segment-membership: given hidden hidden hidden Y. Matsubara et al.

Problem definition • Bundle : set of d co-evolving sequences given Bundle X (d=4) Y. Matsubara et al.

Problem definition • Segment: convert X -> m segments, S hidden s1 s2 s8 s3 s4 s5 s6 s7 Segment (m=8) Y. Matsubara et al.

Problem definition • Regime: segment groups: : model of regime r hidden Regimes (r=4) beaks wings Y. Matsubara et al.

Problem definition • Segment-membership: assignment hidden F = { 2, 4, 1, 3, 2, 4, 1, 3 } Segment-membership (m=8) Y. Matsubara et al.

Problem definition • Given: bundle X Y. Matsubara et al.

Problem definition • Given: bundle X • Find: compact description C of X Y. Matsubara et al.

Problem definition • Given: bundle X • Find: compact description C of X m segments r regimes Segment-membership Y. Matsubara et al.

Main ideas Goal: compact description of X without user intervention!! Challenges: Q1. How to generate ‘informative’ regimes ? Q2. How to decide # of regimes/segments ? Y. Matsubara et al.

Main ideas Goal: compact description of X without user intervention!! Challenges: Q1. How to generate ‘informative’ regimes ? Idea (1): Multi-level chain model Q2. How to decide # of regimes/segments ? Idea (2): Model description cost Y. Matsubara et al.

Idea (1): MLCM: multi-level chain model Q1. How to generate ‘informative’ regimes ? beaks claps Model wings Sequences Regimes Y. Matsubara et al.

Idea (1): MLCM: multi-level chain model Q1. How to generate ‘informative’ regimes ? Idea (1): Multi-level chain model • HMM-based probabilistic model • with “across-regime” transitions beaks claps Model wings Sequences Regimes Y. Matsubara et al.

Idea (1): MLCM: multi-level chain model Details Single HMM parameters across-regime transition prob. r regimes (HMMs) Y. Matsubara et al.

Idea (1): MLCM: multi-level chain model Details state 1 Single HMM parameters across-regime transition prob. r regimes (HMMs) state 2 state 2 state 1 Regime switch Regimes r=2 Regime 1 (k=3) Regime 2 (k=2) state 3 Regime2 “wings” Regime1 “beaks” Y. Matsubara et al.

Idea (2): model description cost Q2. How to decide # of regimes/segments ? Idea (2): Model description cost • Minimize coding cost • find “optimal” # of segments/regimes Y. Matsubara et al.

Idea (2): model description cost Idea: Minimize encoding cost! CostM(M) + Costc(X|M) min ( ) Coding cost Model cost (# of r, m) Good compression Good description Y. Matsubara et al.

Idea (2): model description cost Details Total cost of bundle X, given C Y. Matsubara et al.

Idea (2): model description cost Details Total cost of bundle X, given C # of segments/regimes duration/dimensions segment- membership F Coding cost of X given Θ Model description cost of Θ segment lengths Y. Matsubara et al.

AutoPlait Overview Iteration 0 r=1, m=1 Start! r=1 1 2 3 4 5 6 7 Iteration 0 1 2 3 4 5 6 7 Iteration Y. Matsubara et al.

AutoPlait Overview Iteration 1 r=2, m=4 r=1 Split r=2 0 1 2 3 4 5 6 7 Iteration Y. Matsubara et al.

AutoPlait Overview Iteration 1 r=2, m=4 Iteration 2 r=3, m=6 Split r=1 r=2 r=3 0 1 2 3 4 5 6 7 Iteration Y. Matsubara et al.

AutoPlait Overview Iteration 1 r=2, m=4 Iteration 2 r=3, m=6 r=1 Split r=2 r=4 Iteration 4 r=4, m=8 1 2 3 4 5 6 7 Iteration r=3 0 1 2 3 4 5 6 7 Iteration Y. Matsubara et al.

AutoPlait Algorithms • CutPointSearch Find good cut-points/segments • RegimeSplit Estimate good regime parameters Θ • AutoPlait Search for the best number of regimes (r=2,3,4…) Inner-most loop Inner loop Outer loop Y. Matsubara et al.

1.CutPointSearch Inner-most loop Given: - bundle - parameters of two regimes Find: cut-points of segment sets S1, S2, Y. Matsubara et al.

1.CutPointSearch Inner-most loop Details state 1 state 2 DP algorithm to compute likelihood: state 3 switch?? switch?? state 1 state 2 Y. Matsubara et al.

1.CutPointSearch Inner-most loop Details Theoretical analysis Scalability - It takes time (only single scan) - n: length of X - d: dimension of X - k: # of hidden states in regime Accuracy It guarantees the optimal cut points - (Details in paper) Y. Matsubara et al.

2.RegimeSplit Innerloop Given: - bundle Find: two regimes 1. find cut-points of segment sets: S1, S2 2. estimate parameters of two regimes: Y. Matsubara et al.

2.RegimeSplit Innerloop Details Two-phase iterative approach • Phase 1: (CutPointSearch) • Split segments into two groups : • Phase 2: (BaumWelch) • Update model parameters: Phase 1 Phase 2 Y. Matsubara et al.

3.AutoPlait Outer loop Given: - bundle Find: r regimes (r=2, 3, 4, … ) - i.e., find full parameter set Y. Matsubara et al.

3.AutoPlait Outer loop Split regimes r=2,3,…, as long as cost keeps decreasing - Find appropriate # of regimes Y. Matsubara et al.

3.AutoPlait Outer loop Split regimes r=2,3,…, as long as cost keeps decreasing - Find appropriate # of regimes r=2, m=4 r=4, m=8 r=3, m=6 Y. Matsubara et al.

Experiments We answer the following questions… Q1. Sense-making Can it help us understand the given bundles? Q2. Accuracy How well does it find cut-points and regimes? Q3. Scalability How does it scale in terms of computational time? Y. Matsubara et al.

Q1. Sense-making MoCap data AutoPlait (NO magic numbers) • DynaMMo (Li et al., KDD’09) pHMM (Wang et al., SIGMOD’11) Y. Matsubara et al.

Q1. Sense-making MoCap data AutoPlait (NO magic numbers) Y. Matsubara et al.

Q2. Accuracy (a) Segmentation (b) Clustering Target AutoPlait needs “no magic numbers” Y. Matsubara et al.

Q3. Scalability Wall clock time vs. data size (length) : n AutoPlait scales linearly, i.e., O(n) pHMM DynaMMo AutoPlait Y. Matsubara et al.

AutoPlait at work AutoPlait is capable of various applications, e.g., App1. Model analysis - Web-click sequences App2. Event discovery - Google Trend data Y. Matsubara et al.

AutoPlait: Automatic Mining of Co-evolving Time Sequences

AutoPlait: Automatic Mining of Co-evolving Time Sequences

Presentation Transcript

Data Mining

CS583 – Data Mining and Text Mining

CS490D: Introduction to Data Mining Prof. Chris Clifton

7. Sequence Mining

Data Mining Group1

Web Mining

Chapter 6 Web Content Mining

How do we know if a population is evolving?

Automatic methods for functional annotation of sequences

Mining Time Series Data

Data Mining: Concepts and Techniques Mining data streams

SEQUENCES

Evolving Opportunities in the Russian Mining Sector Toronto

DEMON: Mining and Monitoring Evolving Data

EECS 800 Research Seminar Mining Biological Data

Humane Data Mining: The Next Frontier

Automatic and Reliable Functional Annotation of Proteins

Arithmetic Sequences

Training for SECU LIFE ST, 2011

Data Mining Algorithms for Large-Scale Distributed Systems

WHAT IS DATA MINING?

Sequences Revision