Mining serial episode rules with time lags over multiple data streams

Mining Serial Episode Rules with Time Lags over Multiple Data Streams. Tung-Ying Lee, En Tzu Wang, Dept. of CS, National Tsing Hua Univ. (Taiwan); Arbee L.P. Chen, Dept. of CS, National Chengchi Univ. (Taiwan). DaWaK’08.


Mining Serial Episode Rules with Time Lags over Multiple Data Streams

Tung-Ying Lee, En Tzu Wang

Dept. of CS, National Tsing Hua Univ. (Taiwan)

Arbee L.P. Chen

Dept. of CS, National Chengchi Univ. (Taiwan)

DaWaK’08


Outline

  • Introduction

  • Related work

  • Preliminaries

    • Support of a serial episode

    • Support/ confidence of a serial episode rule

    • Data structure used in the algorithms

  • Algorithms

    • LossyDL

    • TLT

  • Experiments

  • Conclusions


Introduction

  • In many applications, data are generated as continuous data streams.

    • Continuously measuring the flow and occupancy of a road to assess its congestion produces data streams

    • Example rule: when roads A and B have heavy traffic, road C will most likely be congested 5 minutes later

    • The flow and occupancy values coming from the roads are treated as a multi-stream environment, from which serial episode rules are mined

    • Serial episode rules with time lags (SER): $X \xrightarrow{lag} Y$


Related Work

  • Finding episodes/episode rules from static time series data has been studied for decades

  • Finding episodes over data streams

    • Serial episodes [SSDBM04]

    • Episodes [KDD07]

[Figure: an episode, a serial episode, and a serial episode rule (precursor, time lag L, successor), drawn over items A to E]


Preliminaries

  • Environment: a centralized system collecting n synchronized data streams DS1, DS2, …, DSn

    • n-tuple event: a set of items coming from all streams at the same time

    • itemset: a subset of an n-tuple event

    • serial episode: described as an ordered list of itemsets

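Example (figure labels: n-tuple event, itemset {gA}, serial episode (aA)(bB)):

time: 1  2  3  4  5  6  7  8

DS1:  a  b  b  c  g  a  b  f

DS2:  A  B  S  G  A  B  A  F

DSn:  (not shown)

With the two streams shown, the event at time 5 contains the itemset {gA}, and the serial episode (aA)(bB) occurs at times 1 to 2.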


Preliminaries (cont.)

  • Minimal occurrence: given a serial episode S, a time interval [a, b] is a minimal occurrence of S, if

    • S occurs in [a, b]

    • S does not occur in any proper subintervals of [a, b]

    • If $(b - a + 1) \le T$, where T is a time bound given by users, [a, b] is valid

  • MO(S): the set of all minimal occurrences of S

  • Supp(S): the number of valid minimal occurrences of S

[Figure: example over streams DS1 and DS2 with time bound T = 3]
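The definitions of minimal occurrence and support can be sketched in a few lines of Python. This is a brute-force illustration rather than the authors' algorithm: the function names (`occurs`, `minimal_occurrences`, `support`) are hypothetical, and the data are assumed to be the two example streams DS1 and DS2 from the earlier Preliminaries slide.

```python
def occurs(events, episode, a, b):
    """True if the itemsets of `episode` appear in order, at strictly
    increasing time points, somewhere inside the interval [a, b]."""
    i = 0
    for t in range(a, b + 1):
        if i < len(episode) and episode[i] <= events[t]:
            i += 1  # greedy match: each time point serves at most one itemset
    return i == len(episode)


def minimal_occurrences(events, episode):
    """All intervals [a, b] in which the episode occurs while it does not
    occur in [a + 1, b] or [a, b - 1] (hence in no proper subinterval)."""
    n = max(events)
    result = []
    for a in range(1, n + 1):
        for b in range(a, n + 1):
            if (occurs(events, episode, a, b)
                    and not occurs(events, episode, a + 1, b)
                    and not occurs(events, episode, a, b - 1)):
                result.append((a, b))
    return result


def support(occurrences, t_bound):
    """Number of valid minimal occurrences, i.e. those with b - a + 1 <= T."""
    return sum(1 for a, b in occurrences if b - a + 1 <= t_bound)


# Assumed example data: time point -> 2-tuple event (as a set of items).
ds1, ds2 = "abbcgabf", "ABSGABAF"
events = {t: {x, y} for t, (x, y) in enumerate(zip(ds1, ds2), start=1)}

mo_ab = minimal_occurrences(events, [{"a"}, {"b"}])  # serial episode (a)(b)
print(mo_ab, support(mo_ab, t_bound=3))
```

The check against `[a + 1, b]` and `[a, b - 1]` is enough because every proper subinterval lies inside one of those two intervals.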


Preliminaries (cont.)

  • A SER is $R: S_1 \xrightarrow{Lag = L} S_2$

  • $Supp(R) = |\{[a, b] \mid [a, b] \in MO(S_1) \wedge [a, b] \text{ is valid} \wedge \exists [c, d] \in MO(S_2) \text{ s.t. } [c, d] \text{ is valid} \wedge (c - a) = L\}|$

  • Conf(R) = Supp(R)/Supp(S1)

[Figure: example of a SER over streams DS1 and DS2, annotated with "4" and time bound T = 3]
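As a rough illustration of the two measures, the sketch below counts the valid minimal occurrences of the precursor whose start is followed, exactly L time points later, by the start of a valid minimal occurrence of the successor. The function names and the two occurrence lists are assumptions made for the example, not data from the slides.

```python
def is_valid(occurrence, t_bound):
    """A minimal occurrence [a, b] is valid if b - a + 1 <= T."""
    a, b = occurrence
    return b - a + 1 <= t_bound


def rule_support(mo_s1, mo_s2, lag, t_bound):
    """Supp(R): valid [a, b] in MO(S1) with a valid [c, d] in MO(S2), c - a = lag."""
    starts_s2 = {c for (c, d) in mo_s2 if is_valid((c, d), t_bound)}
    return sum(
        1 for (a, b) in mo_s1
        if is_valid((a, b), t_bound) and a + lag in starts_s2
    )


def rule_confidence(mo_s1, mo_s2, lag, t_bound):
    """Conf(R) = Supp(R) / Supp(S1); returns 0.0 when Supp(S1) is 0."""
    supp_s1 = sum(1 for occ in mo_s1 if is_valid(occ, t_bound))
    if supp_s1 == 0:
        return 0.0
    return rule_support(mo_s1, mo_s2, lag, t_bound) / supp_s1


# Hypothetical minimal occurrence lists for a precursor S1 and a successor S2.
mo_s1 = [(1, 2), (6, 7)]
mo_s2 = [(1, 1), (5, 5), (7, 7)]

print(rule_support(mo_s1, mo_s2, lag=4, t_bound=3))     # 1
print(rule_confidence(mo_s1, mo_s2, lag=4, t_bound=3))  # 0.5
```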


Preliminaries (cont.)

  • Problem Formulation: given 4 parameters

    • the maximum time lag (Lmax)

    • the minimum support (minsup)

    • the minimum confidence (minconf)

    • the time bound (T)

  • Find all SERs, e.g. $R: S_1 \xrightarrow{Lag = L} S_2$, satisfying

    • $L \le L_{max}$

    • $Supp(R) \ge N \times minsup$ (N: the number of received n-tuple events)

    • $Conf(R) \ge minconf$

    • Calculating supports for serial episodes and SERs must take T into account


Preliminaries (cont.)

  • Using the prefix tree for keeping serial episodes

  • S: a serial episode, X: an item

    • S+X: X follows S

    • S+_X: X and the last itemset in S appear at the same time

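Example prefix tree:

Level 0: Root

Level 1: A, B (children of Root)

Level 2: _B and B (children of A); the path A, _B is the serial episode (AB), and the path A, B is the serial episode (A)(B)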
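A minimal sketch of such a prefix tree follows, assuming that the items inside one itemset are stored in sorted order and that a leading underscore marks the S+_X extension. The class and function names are hypothetical.

```python
class Node:
    """One prefix-tree node; the path from the root spells a serial episode."""

    def __init__(self):
        self.children = {}  # label ("X" or "_X") -> Node
        self.support = 0


def path_labels(episode):
    """Turn an ordered list of itemsets into prefix-tree labels.

    The first item of each itemset gets a plain label (S+X: X follows S);
    every further item of the same itemset gets "_X" (S+_X: same time point).
    """
    labels = []
    for itemset in episode:
        items = sorted(itemset)  # assumed canonical order inside an itemset
        labels.append(items[0])
        labels.extend("_" + x for x in items[1:])
    return labels


def insert(root, episode):
    """Create the nodes for `episode` if needed and return its last node."""
    node = root
    for label in path_labels(episode):
        node = node.children.setdefault(label, Node())
    return node


root = Node()
insert(root, [{"A", "B"}])    # serial episode (AB)
insert(root, [{"A"}, {"B"}])  # serial episode (A)(B)
print(path_labels([{"A", "B"}]), path_labels([{"A"}, {"B"}]))
```

Both episodes share the level-1 node A and differ only in the level-2 label, which is what lets the tree store (AB) and (A)(B) side by side.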


LossyDL

  • The concept of LossyDL: keeping the valid minimal occurrences of a serial episode for generating rules

  • Example: at time point 3, a 2-tuple event (BC) arrives, with T = 3

    • Each item in the current 2-tuple event needs to be processed (traversing in a bottom-up order)

    • Processing C can generate (B)(C): [2, 3] and (BC): [3, 3]

    • The last two minimal occurrences need to be checked, e.g. [1, 3] is not minimal

  • Using Lossy Counting [VLDB02], whenever $N \equiv 0 \pmod{\lceil 1/\epsilon \rceil}$, the oldest minimal occurrence is removed

[Figure: prefix tree nodes A and B with minimal occurrence lists such as [1, 1], [2, 2], [3, 3], [1, 2], [2, 3]]
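The Lossy Counting step can be read as follows: after every ⌈1/ε⌉ received events, each kept list loses its oldest minimal occurrence, and empty lists are dropped. The sketch below shows only that eviction step under this reading. The class name, the string keys for episodes, and the decision to delete empty lists are assumptions, and the prefix-tree update itself is not modelled.

```python
import math
from collections import defaultdict, deque


class OccurrenceStore:
    """Keeps a list of valid minimal occurrences per serial episode."""

    def __init__(self, epsilon):
        self.width = math.ceil(1 / epsilon)  # bucket width of Lossy Counting
        self.n = 0                           # number of received events, N
        self.occurrences = defaultdict(deque)

    def add(self, episode, interval):
        """Record a new valid minimal occurrence of `episode`."""
        self.occurrences[episode].append(interval)

    def end_of_event(self):
        """Call once per received n-tuple event."""
        self.n += 1
        if self.n % self.width == 0:
            for episode in list(self.occurrences):
                self.occurrences[episode].popleft()  # remove the oldest one
                if not self.occurrences[episode]:
                    del self.occurrences[episode]    # nothing left to keep


store = OccurrenceStore(epsilon=0.5)  # bucket width 2, chosen for illustration
store.add("(B)(C)", (2, 3))
store.add("(BC)", (3, 3))
store.add("(B)(C)", (4, 5))
store.end_of_event()
store.end_of_event()                  # N = 2: eviction happens here
print(dict(store.occurrences))
```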


LossyDL (Rule Generation)

  • Mining SERs

    • Any two serial episodes with supports $\ge (minsup - \epsilon) \times N$ are checked to see whether any of their minimal occurrences can be combined; Supp(R) can then be computed

    • Each $R: S_1 \xrightarrow{Lag = L} S_2$ is returned if

      • $Supp(R) \ge (minsup - \epsilon) \times N$, and

      • $(Supp(R) + \epsilon N) / Supp(S_1) \ge minconf$
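The two return conditions amount to a simple threshold test on the approximate counts. The sketch below is one way to write it. The function name and the numeric values are hypothetical, and the relaxed thresholds assume the usual Lossy Counting reading in which a kept count may undercount by at most εN.

```python
def passes_lossydl_thresholds(supp_r, supp_s1, n, minsup, minconf, epsilon):
    """Return True if rule R: S1 -> S2 meets both relaxed thresholds.

    supp_r  : approximate Supp(R) from the kept minimal occurrences
    supp_s1 : approximate Supp(S1)
    n       : number of received n-tuple events (N)
    """
    if supp_s1 == 0:
        return False
    support_ok = supp_r >= (minsup - epsilon) * n
    confidence_ok = (supp_r + epsilon * n) / supp_s1 >= minconf
    return support_ok and confidence_ok


# Hypothetical numbers: N = 1000, minsup = 0.05, epsilon = 0.005, minconf = 0.6.
print(passes_lossydl_thresholds(50, 90, 1000, 0.05, 0.6, 0.005))   # True
print(passes_lossydl_thresholds(40, 90, 1000, 0.05, 0.6, 0.005))   # False
print(passes_lossydl_thresholds(46, 100, 1000, 0.05, 0.6, 0.005))  # False
```

In the first call the support threshold is 45 and the relaxed confidence is 55/90, so both tests pass. The second call fails on support and the third on confidence.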


TLT

  • A lot of minimal occurrences are kept in LossyDL, but only the last two are used while updating

    • Keeping supports instead of the minimal occurrence lists

    • How to generate rules without the minimal occurrence lists?

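    • Answer: use the following observations to prune the insignificant rules

  • Observations

    • For $X \xrightarrow{L} (AB)$ and $X \xrightarrow{L} A$, obviously $Supp(X \xrightarrow{L} A) \ge Supp(X \xrightarrow{L} (AB))$: $X \xrightarrow{L} (AB)$ is not significant if $X \xrightarrow{L} A$ fails either minsup or minconf

    • For $(AB) \xrightarrow{L} (CD)$ and $A \xrightarrow{L} C$, obviously $Supp(A \xrightarrow{L} C) \ge Supp((AB) \xrightarrow{L} (CD))$: $(AB) \xrightarrow{L} (CD)$ is not significant if $Supp(A \xrightarrow{L} C) < Supp(AB) \times minconf$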


TLT (cont.)

  • Observations (cont.):

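    • Given a SER $(A)(B) \xrightarrow{5} (CD)$ and T = 3

      • $A \xrightarrow{1} B$ or $A \xrightarrow{2} B$, that is, $A \xrightarrow{p} B$ with $0 < p < T$ ($T - 1$ types)

      • $A \xrightarrow{1} B \xrightarrow{4} (CD)$, $A \xrightarrow{2} B \xrightarrow{3} (CD)$, that is, $A \xrightarrow{p} B \xrightarrow{L-p} (CD)$

      • $Supp(A \xrightarrow{p} B \xrightarrow{L-p} (CD)) \le \min(Supp(A \xrightarrow{p} B), Supp(B \xrightarrow{L-p} C))$

      • $(A)(B) \xrightarrow{5} (CD)$ is not significant if

        • $\sum_{p} \min(Supp(A \xrightarrow{p} B), Supp(B \xrightarrow{L-p} C)) < Supp((A)(B)) \times minconf$

  • Using the observations to prune insignificant rules

  • Time lag table (TLT)

    • $A \xrightarrow{L} B$ is a reduced SER if A and B are single items

    • For finding $S_1 \xrightarrow{L_{max}} S_2$, the reduced SERs needed have a time lag of at most $L_{max} + T - 1$ (from the first itemset of the precursor to the last itemset of the successor)

    • $L_{max} + T - 1$ Time Lag Tables are used to keep the supports of reduced SERs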
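The pruning tests only need supports of reduced SERs, so they can be sketched as lookups into a table keyed by (precursor item, successor item, lag). The dictionary layout, the function names, and the numbers below are assumptions for illustration. They are not the paper's data structures.

```python
def prune_by_single_items(supp_a_to_c, supp_precursor, minconf):
    """Observation on (AB) ->L (CD): prune when
    Supp(A ->L C) < Supp(AB) * minconf."""
    return supp_a_to_c < supp_precursor * minconf


def prune_by_split(reduced_supp, a, b, c, lag, t_bound, supp_precursor, minconf):
    """Observation on (A)(B) ->L (CD): the sum over p = 1 .. T-1 of
    min(Supp(A ->p B), Supp(B ->(L-p) C)) bounds the rule support from above;
    prune when that bound is below Supp((A)(B)) * minconf."""
    bound = sum(
        min(reduced_supp.get((a, b, p), 0),
            reduced_supp.get((b, c, lag - p), 0))
        for p in range(1, t_bound)
    )
    return bound < supp_precursor * minconf


# Hypothetical supports of reduced SERs: (precursor, successor, lag) -> support.
reduced_supp = {
    ("A", "B", 1): 10, ("B", "C", 4): 4,
    ("A", "B", 2): 7,  ("B", "C", 3): 9,
}

# (A)(B) ->5 (CD) with T = 3 and Supp((A)(B)) = 20: bound = 4 + 7 = 11.
print(prune_by_split(reduced_supp, "A", "B", "C", 5, 3, 20, 0.6))  # True
print(prune_by_split(reduced_supp, "A", "B", "C", 5, 3, 20, 0.5))  # False
```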


TLT (cont.)

  • The support and the last two minimal occurrences of a serial episode are kept in the prefix tree

    • Keeping supports instead of keeping minimal occurrence lists

    • Keeping the last two minimal occurrences for updating the supports

    • Whenever $N \equiv 0 \pmod{\lceil 1/\epsilon \rceil}$, all supports are decreased by 1

  • In addition, the last $L_{max} + T - 1$ n-tuple events are kept for updating the Time Lag Tables
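One plausible way to maintain the Time Lag Tables from a buffer of recent events is sketched below: when a new event arrives, every item in it is paired with every item of the event received `lag` time points earlier, and the matching table entry is incremented. The class name, the one-`Counter`-per-lag layout, and the example streams (DS1 and DS2 from the Preliminaries slide) are assumptions. The periodic decrement of supports is left out.

```python
from collections import Counter, deque


class TimeLagTables:
    """Supports of reduced SERs x ->lag y for lag = 1 .. Lmax + T - 1."""

    def __init__(self, l_max, t_bound):
        self.max_lag = l_max + t_bound - 1
        self.tables = {lag: Counter() for lag in range(1, self.max_lag + 1)}
        self.recent = deque(maxlen=self.max_lag)  # newest event is last

    def add_event(self, event):
        """Update all tables with a newly received n-tuple event."""
        for lag, past in enumerate(reversed(self.recent), start=1):
            for x in past:
                for y in event:
                    self.tables[lag][(x, y)] += 1
        self.recent.append(frozenset(event))

    def support(self, x, y, lag):
        return self.tables[lag][(x, y)]


tlt = TimeLagTables(l_max=2, t_bound=2)  # keeps lags 1 to 3
for x, y in zip("abbcgabf", "ABSGABAF"):
    tlt.add_event({x, y})

print(tlt.support("a", "b", 1), tlt.support("a", "b", 2), tlt.support("A", "B", 1))
```

A bounded `deque` keeps memory proportional to the number of lags rather than to the stream length, which matches the slide's point that only the last few events are needed.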


TLT (Rule Generation)

  • Mining SERs

    • Any two serial episodes with supports $\ge (minsup - \epsilon) \times N$ form a candidate SER

      • A candidate SER will be returned if it can pass the pruning rules from the above observations


Experiments

  • Two real datasets

    • PDOMEI: the dataset contains the dryness and climate indices derived by experts, usually used to predict droughts

      • Four streams, 28 distinct items

    • Traffic: the dataset is the Twin Cities’ traffic data near 50th St. during the first week of Feb. 2006

      • Three streams, 55 distinct items

  • Parameter setting

    • $\epsilon = 0.1 \times minsup$

    • Lmax = 10


Conclusions

  • We address the problem of finding significant serial episode rules with time lags over multiple data streams and propose two methods to solve it. TLT is more space-efficient, but LossyDL has higher precision

  • In the near future, we will combine these two methods into a hybrid method to investigate the balance between memory space and precision

  • Moreover, we will try to extend the problem of finding serial episode rules to that of finding general episode rules

