Mining serial episode rules with time lags over multiple data streams

Mining Serial Episode Rules with Time Lags over Multiple Data Streams. Tung-Ying Lee, En Tzu Wang Dept. of CS, National Tsing Hua Univ. (Taiwan) Arbee L.P. Chen Dept. of CS, National Chengchi Univ. (Taiwan) DaWaK’08. Outline. Introduction Related work Preliminaries

Mining Serial Episode Rules with Time Lags over Multiple Data Streams

Tung-Ying Lee, En Tzu Wang

Dept. of CS, National Tsing Hua Univ. (Taiwan)

Arbee L.P. Chen

Dept. of CS, National Chengchi Univ. (Taiwan)

DaWaK’08


Outline

  • Introduction

  • Related work

  • Preliminaries

    • Support of a serial episode

    • Support/ confidence of a serial episode rule

    • Data structure used in the algorithms

  • Algorithms

    • LossyDL

    • TLT

  • Experiments

  • Conclusions


Introduction

  • In many applications, data are generated in the form of continuous data streams.

    • Continuously measuring the flow and occupancy of a road, in order to characterize its congestion, produces data streams

    • Example rule: when roads A and B have heavy traffic, road C will most likely be congested 5 minutes later

    • The flow and occupancy values coming from the roads are treated as a multi-stream environment, from which serial episode rules are mined

    • Serial episode rules with time lags (SER): $X \xrightarrow{\text{lag}} Y$


Related Work

  • Finding episodes/episode rules from static time series data has been studied for decades

  • Finding episodes over data streams

    • Serial episodes [SSDBM04]

    • Episodes [KDD07]

[Figure: examples of an episode, a serial episode, and a serial episode rule (a precursor followed, after a time lag L, by a successor)]


Preliminaries

  • Environment: a centralized system collecting n synchronized data streams DS1, DS2, …, DSn

    • n-tuple event: a set of items coming from all streams at the same time

    • itemset: a subset of an n-tuple event

    • serial episode: described as an ordered list of itemsets

time:  1  2  3  4  5  6  7  8
DS1:   a  b  b  c  g  a  b  f
DS2:   A  B  S  G  A  B  A  F
DSn:   ...

n-tuple event: the items arriving on all streams at one time point (one column above)

e.g. itemset {gA} (time 5)

e.g. serial episode (aA)(bB) (times 1 and 2)


Preliminaries (cont.)

  • Minimal occurrence: given a serial episode S, a time interval [a, b] is a minimal occurrence of S, if

    • S occurs in [a, b]

    • S does not occur in any proper subintervals of [a, b]

    • If $(b - a + 1) \leq T$, where $T$ is a time bound given by users, $[a, b]$ is valid

  • MO(S): the set of all minimal occurrences of S

  • Supp(S): the number of valid minimal occurrences of S

[Figure: minimal occurrences over streams DS1 and DS2, with time bound T = 3]
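The definitions above can be made concrete with a small sketch. This is not the authors' streaming algorithm; it is an offline illustration, and the function names `minimal_occurrences` and `support` are hypothetical. It assumes each n-tuple event is a Python set of items, and that consecutive itemsets of a serial episode must occur at strictly increasing time points. The data are the DS1 and DS2 values from the example slide.

```python
def minimal_occurrences(events, episode):
    """Return the minimal occurrences (a, b), 1-based and inclusive.

    events  -- list of sets; events[t - 1] is the n-tuple event at time t
    episode -- list of sets (itemsets) that must occur in this order
    """
    n = len(events)
    candidates = []
    for start in range(n):
        if not episode[0] <= events[start]:
            continue
        t = start
        # Greedily match the remaining itemsets as early as possible.
        for itemset in episode[1:]:
            t += 1
            while t < n and not itemset <= events[t]:
                t += 1
            if t >= n:
                break
        else:
            candidates.append((start + 1, t + 1))

    minimal = []
    for i, (a, b) in enumerate(candidates):
        # A later start that reaches the same end makes (a, b) non-minimal.
        if i + 1 < len(candidates) and candidates[i + 1][1] <= b:
            continue
        minimal.append((a, b))
    return minimal


def support(events, episode, time_bound):
    """Supp(S): number of minimal occurrences no wider than the time bound T."""
    return sum(1 for a, b in minimal_occurrences(events, episode)
               if b - a + 1 <= time_bound)


# Streams from the example slide (DS1 lowercase, DS2 uppercase).
events = [{x, y} for x, y in zip("abbcgabf", "ABSGABAF")]
mo_aA_bB = minimal_occurrences(events, [{"a", "A"}, {"b", "B"}])
mo_b_A = minimal_occurrences(events, [{"b"}, {"A"}])
```

Here `mo_b_A` contains only the interval (3, 5): the occurrence in [2, 5] is discarded because the proper subinterval [3, 5] also contains the episode.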


Preliminaries (cont.)

  • A SER is $R: S_1 \xrightarrow{\text{Lag} = L} S_2$

  • Supp(R): $|\{[a, b] \mid [a, b] \in MO(S_1),\ [a, b] \text{ valid},\ \exists\, [c, d] \in MO(S_2),\ [c, d] \text{ valid, s.t. } (c - a) = L\}|$

  • Conf(R) = Supp(R)/Supp(S1)

[Figure: example of a SER over streams DS1 and DS2, with time bound T = 3]
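As a hedged illustration of Supp(R) and Conf(R), the sketch below computes both from two lists of minimal occurrences. The function names are hypothetical, the occurrence lists are assumed to be given as (start, end) pairs, and returning 0.0 when the precursor has no valid occurrence is a choice made here, not something stated on the slide.

```python
def is_valid(occurrence, time_bound):
    """An occurrence [a, b] is valid when its width b - a + 1 is at most T."""
    a, b = occurrence
    return b - a + 1 <= time_bound


def rule_support(mo_s1, mo_s2, lag, time_bound):
    """Count valid [a, b] in MO(S1) with a valid [c, d] in MO(S2), c - a = lag."""
    valid_starts_s2 = {c for (c, d) in mo_s2 if is_valid((c, d), time_bound)}
    return sum(1 for (a, b) in mo_s1
               if is_valid((a, b), time_bound) and a + lag in valid_starts_s2)


def rule_confidence(mo_s1, mo_s2, lag, time_bound):
    """Conf(R) = Supp(R) / Supp(S1); 0.0 if S1 has no valid occurrence (assumption)."""
    supp_s1 = sum(1 for occ in mo_s1 if is_valid(occ, time_bound))
    if supp_s1 == 0:
        return 0.0
    return rule_support(mo_s1, mo_s2, lag, time_bound) / supp_s1


# Illustrative occurrence lists (made up for this sketch).
mo_precursor = [(1, 2), (6, 7)]
mo_successor = [(1, 1), (5, 5), (7, 7)]
supp_r = rule_support(mo_precursor, mo_successor, lag=4, time_bound=3)
conf_r = rule_confidence(mo_precursor, mo_successor, lag=4, time_bound=3)
```

Only the precursor occurrence starting at time 1 has a successor starting exactly 4 time points later, so one of the two precursor occurrences supports the rule.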


Preliminaries (cont.)

  • Problem Formulation: given 4 parameters

    • the maximum time lag (Lmax)

    • the minimum support (minsup)

    • the minimum confidence (minconf)

    • the time bound (T)

  • Find all SERs, e.g. $R: S_1 \xrightarrow{\text{Lag} = L} S_2$, satisfying

    • $L \leq \mathit{Lmax}$

    • $\mathrm{Supp}(R) \geq N \times \mathit{minsup}$ ($N$: the number of received n-tuple events)

    • $\mathrm{Conf}(R) \geq \mathit{minconf}$

    • Calculating supports for serial episodes and SERs must take $T$ into account


Preliminaries (cont.)

  • Using the prefix tree for keeping serial episodes

  • S: a serial episode, X: an item

    • S+X: X follows S

    • S+_X: X and the last itemset in S appear at the same time

[Figure: prefix tree. Level 0: Root. Level 1: A, B. Level 2 (children of A): _B, representing serial episode (AB), and B, representing serial episode (A)(B)]
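A minimal sketch of such a prefix tree is given below. The class and function names (`EpisodeNode`, `insert_episode`) are hypothetical, and the `support` field is only a placeholder for whatever each algorithm stores per node. Following the slide's notation, a child label starting with `_` means the item occurs at the same time as the last itemset of its parent episode.

```python
class EpisodeNode:
    """One node of the prefix tree; the path from the root spells an episode."""

    def __init__(self, label=None):
        self.label = label      # e.g. "B" (follows parent) or "_B" (same time)
        self.children = {}      # child label -> EpisodeNode
        self.support = 0        # placeholder for per-episode bookkeeping

    def extend(self, item, same_time):
        """Return the child for S+X (same_time=False) or S+_X (same_time=True)."""
        label = "_" + item if same_time else item
        return self.children.setdefault(label, EpisodeNode(label))


def insert_episode(root, episode):
    """Insert a serial episode given as a list of item tuples (one per itemset)."""
    node = root
    for itemset in episode:
        for position, item in enumerate(itemset):
            node = node.extend(item, same_time=position > 0)
    return node


root = EpisodeNode()
node_ab = insert_episode(root, [("A", "B")])        # serial episode (AB)
node_a_b = insert_episode(root, [("A",), ("B",)])   # serial episode (A)(B)
insert_episode(root, [("B",)])                      # serial episode (B)
```

Both (AB) and (A)(B) share the level-1 node A and differ only in the level-2 child they reach, which mirrors the tree in the figure.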


LossyDL

  • The concept of LossyDL: keeping the valid minimal occurrences of a serial episode for generating rules

  • Example: at time point 3, a 2-tuple event (BC) arrives, with T = 3

    • Each item in the current 2-tuple event needs to be processed (traversing in a bottom-up order)

    • Processing C can generate (B)(C): [2, 3] and (BC): [3, 3]

    • Only the last two minimal occurrences need to be checked

[Figure: prefix tree annotated with minimal occurrence lists ([1, 1], [2, 2], [3, 3], [1, 2], [2, 3]); the interval [1, 3] is marked as not minimal]

  • Using Lossy Counting [VLDB02], whenever $N \equiv 0 \pmod{1/\varepsilon}$, the oldest minimal occurrence is removed
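The periodic removal step can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the class name `OccurrenceStore` is hypothetical, the bucket width is taken as the ceiling of 1/ε, and dropping an episode once its list becomes empty is a choice made here for the sketch.

```python
import math
from collections import deque


class OccurrenceStore:
    """Keeps a list of minimal occurrences per episode, pruned Lossy-Counting style."""

    def __init__(self, epsilon):
        self.bucket_width = math.ceil(1 / epsilon)  # prune every 1/epsilon events
        self.n_events = 0                           # N: events received so far
        self.occurrences = {}                       # episode -> deque of (a, b)

    def add(self, episode, interval):
        self.occurrences.setdefault(episode, deque()).append(interval)

    def end_of_event(self):
        """Call once per received n-tuple event, after its occurrences were added."""
        self.n_events += 1
        if self.n_events % self.bucket_width != 0:
            return
        for episode in list(self.occurrences):
            kept = self.occurrences[episode]
            kept.popleft()                  # remove the oldest minimal occurrence
            if not kept:                    # assumption: forget empty episodes
                del self.occurrences[episode]


store = OccurrenceStore(epsilon=0.5)        # bucket width 2, for illustration
store.add("A", (1, 1))
store.end_of_event()                        # N = 1, nothing removed
store.add("A", (2, 2))
store.add("B", (2, 2))
store.end_of_event()                        # N = 2, oldest occurrences removed
```

After the second event, episode A keeps only (2, 2) and episode B, whose single occurrence was removed, is no longer tracked.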


LossyDL (Rule Generation)

  • Mining SERs

    • Any two serial episodes with supports $\geq (\mathit{minsup} - \varepsilon) \times N$ are checked to see whether any of their minimal occurrences can be combined. Then, Supp(R) can be computed

    • Each $R: S_1 \xrightarrow{\text{Lag} = L} S_2$ is returned if

      • $\mathrm{Supp}(R) \geq (\mathit{minsup} - \varepsilon) \times N$, and

      • $(\mathrm{Supp}(R) + \varepsilon N)/\mathrm{Supp}(S_1) \geq \mathit{minconf}$
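The two acceptance tests can be written as a small predicate. This sketch assumes the thresholds are the usual Lossy Counting relaxations, (minsup − ε) × N for support and Supp(R) + εN in the confidence numerator; the function name `lossydl_accepts` is hypothetical, and rejecting rules whose precursor support is zero is an assumption made to avoid a division by zero.

```python
def lossydl_accepts(supp_r, supp_s1, n_events, minsup, minconf, epsilon):
    """Return True if a rule passes the relaxed support and confidence tests."""
    if supp_s1 == 0:
        return False  # assumption: no precursor occurrences, no rule
    support_ok = supp_r >= (minsup - epsilon) * n_events
    # Stored counts may undercount by up to epsilon * N, hence the added slack.
    confidence_ok = (supp_r + epsilon * n_events) / supp_s1 >= minconf
    return support_ok and confidence_ok


# Illustrative numbers (made up): N = 1000, minsup = 0.02, epsilon = 0.002.
accepted = lossydl_accepts(19, 40, 1000, minsup=0.02, minconf=0.4, epsilon=0.002)
too_rare = lossydl_accepts(15, 40, 1000, minsup=0.02, minconf=0.4, epsilon=0.002)
too_weak = lossydl_accepts(19, 40, 1000, minsup=0.02, minconf=0.6, epsilon=0.002)
```

With these numbers the relaxed support threshold is about 18, so a stored support of 19 passes while 15 does not, and the relaxed confidence is (19 + 2)/40 = 0.525.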


TLT

  • A lot of minimal occurrences are kept in LossyDL, but only the last two are used while updating

    • Keeping supports instead of the minimal occurrence lists

    • How can rules be generated without the minimal occurrence lists?

    • Answer: use the following observations to prune the insignificant rules

  • Observations

    • For $X \xrightarrow{L} (AB)$ and $X \xrightarrow{L} A$, obviously $\mathrm{Supp}(X \xrightarrow{L} A) \geq \mathrm{Supp}(X \xrightarrow{L} (AB))$: $X \xrightarrow{L} (AB)$ is not significant if $X \xrightarrow{L} A$ fails to satisfy either minsup or minconf

    • For $(AB) \xrightarrow{L} (CD)$ and $A \xrightarrow{L} C$, obviously $\mathrm{Supp}(A \xrightarrow{L} C) \geq \mathrm{Supp}((AB) \xrightarrow{L} (CD))$: $(AB) \xrightarrow{L} (CD)$ is not significant if $\mathrm{Supp}(A \xrightarrow{L} C) < \mathrm{Supp}(AB) \times \mathit{minconf}$
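The second pruning condition follows from a one-line bound on the confidence. Written out, using only the quantities named in the observations and the definition Conf(R) = Supp(R)/Supp(S1), the reasoning is roughly:

```latex
\mathrm{Conf}\big((AB) \xrightarrow{L} (CD)\big)
  = \frac{\mathrm{Supp}\big((AB) \xrightarrow{L} (CD)\big)}{\mathrm{Supp}(AB)}
  \le \frac{\mathrm{Supp}\big(A \xrightarrow{L} C\big)}{\mathrm{Supp}(AB)}
  < \mathit{minconf}
\quad \text{whenever} \quad
\mathrm{Supp}\big(A \xrightarrow{L} C\big) < \mathrm{Supp}(AB) \times \mathit{minconf}.
```

So the support of the reduced rule alone is enough to rule out the larger rule, without consulting any minimal occurrence list.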


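TLT (cont.)

  • Observations (cont.):

    • Given a SER $(A)(B) \xrightarrow{5} (CD)$, and $T = 3$

      • $A \xrightarrow{1} B$ or $A \xrightarrow{2} B$, that is, $A \xrightarrow{p} B$ with $0 < p < T$ ($T - 1$ types)

      • $A \xrightarrow{1} B \xrightarrow{4} (CD)$, $A \xrightarrow{2} B \xrightarrow{3} (CD)$, that is, $A \xrightarrow{p} B \xrightarrow{L - p} (CD)$

      • $\mathrm{Supp}(A \xrightarrow{p} B \xrightarrow{L - p} (CD)) \leq \min(\mathrm{Supp}(A \xrightarrow{p} B), \mathrm{Supp}(B \xrightarrow{L - p} C))$

      • $(A)(B) \xrightarrow{5} (CD)$ is not significant, if

        • $\sum_{p} \min(\mathrm{Supp}(A \xrightarrow{p} B), \mathrm{Supp}(B \xrightarrow{L - p} C)) < \mathrm{Supp}((A)(B)) \times \mathit{minconf}$

  • Using the observations to prune insignificant rules

  • Time lag table (TLT)

    • $A \xrightarrow{L} B$ is a reduced SER, if A and B are single items

    • For finding $S_1 \xrightarrow{\mathit{Lmax}} S_2$, the reduced SERs having a time lag of at most $\mathit{Lmax} + T - 1$ (from the first itemset of the precursor to the last itemset of the successor) are needed

    • Using $\mathit{Lmax} + T - 1$ Time Lag Tables to keep the supports of reduced SERs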
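The bound in the last observation can be sketched in a few lines. The function names and the dictionary layout (supports of reduced SERs keyed by their time lag) are assumptions made for illustration; the supports below are made-up numbers, not results from the paper.

```python
def support_upper_bound(supp_a_to_b, supp_b_to_c, lag, time_bound):
    """Sum over p = 1 .. T-1 of min(Supp(A ->p B), Supp(B ->(lag-p) C)).

    supp_a_to_b -- dict: gap p     -> Supp(A ->p B)
    supp_b_to_c -- dict: lag value -> Supp(B ->lag C)
    """
    return sum(min(supp_a_to_b.get(p, 0), supp_b_to_c.get(lag - p, 0))
               for p in range(1, time_bound))


def is_pruned(bound, supp_precursor, minconf):
    """The rule cannot reach minconf if the bound is below Supp((A)(B)) * minconf."""
    return bound < supp_precursor * minconf


# Rule (A)(B) ->5 (CD) with T = 3, as on the slide; supports are illustrative.
a_to_b = {1: 4, 2: 2}          # Supp(A ->1 B), Supp(A ->2 B)
b_to_c = {4: 3, 3: 5}          # Supp(B ->4 C), Supp(B ->3 C)
bound = support_upper_bound(a_to_b, b_to_c, lag=5, time_bound=3)
pruned = is_pruned(bound, supp_precursor=10, minconf=0.6)
```

With these numbers the bound is min(4, 3) + min(2, 5) = 5, which is below 10 × 0.6 = 6, so the candidate rule would be discarded.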


TLT (cont.)

  • The support and the last two minimal occurrences of a serial episode are kept in the prefix tree

    • Keeping supports instead of keeping minimal occurrence lists

    • Keeping the last two minimal occurrences for updating the supports

    • Whenever $N \equiv 0 \pmod{1/\varepsilon}$, all supports are decreased by 1

  • In addition, the last $\mathit{Lmax} + T - 1$ n-tuple events are kept for updating the Time Lag Tables
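One way the Time Lag Tables might be maintained is sketched below. This is an assumption-laden illustration: the class name `TimeLagTables` is hypothetical, one table is kept per lag from 1 to Lmax + T − 1, the buffer holds that many previous events (excluding the current one), and the periodic decrement of supports is omitted. Since single items have minimal occurrences of width 1, the support of a reduced SER A →lag B is simply the number of time points where A is followed by B exactly `lag` steps later.

```python
from collections import Counter, deque


class TimeLagTables:
    """Supports of reduced SERs (single item -> single item), one table per lag."""

    def __init__(self, l_max, time_bound):
        self.max_lag = l_max + time_bound - 1
        self.recent = deque(maxlen=self.max_lag)   # previous events, oldest first
        self.tables = {lag: Counter() for lag in range(1, self.max_lag + 1)}

    def add_event(self, event):
        """Update every table with the pairs (past item, current item)."""
        for lag, past in enumerate(reversed(self.recent), start=1):
            for a in past:
                for b in event:
                    self.tables[lag][(a, b)] += 1
        self.recent.append(frozenset(event))


tlt = TimeLagTables(l_max=1, time_bound=2)         # tables for lags 1 and 2
for event in [{"a", "A"}, {"b", "B"}, {"b", "S"}, {"c", "G"}]:
    tlt.add_event(event)
```

For example, item a at time 1 is followed by b at times 2 and 3, so the pair (a, b) has support 1 in both the lag-1 and lag-2 tables.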


TLT (Rule Generation)

  • Mining SERs

    • Any two serial episodes with supports $\geq (\mathit{minsup} - \varepsilon) \times N$ form a candidate SER

      • A candidate SER is returned if it passes the pruning rules derived from the above observations


Experiments

  • Two real datasets

    • PDOMEI: the dataset contains the dryness and climate indices derived by experts, usually used to predict droughts

      • Four streams with 28 distinct items

    • Traffic: the dataset is the Twin Cities traffic data near 50th St. during the first week of Feb. 2006

      • Three streams with 55 distinct items

  • Parameter setting

    • $\varepsilon = 0.1 \times \mathit{minsup}$

    • Lmax = 10


Conclusions

  • We address the problem of finding significant serial episode rules with time lags over multiple data streams and propose two methods to solve it. TLT is more space-efficient, whereas LossyDL achieves higher precision

  • In the near future, we will combine these two methods into a hybrid method to investigate the balance between memory space and precision

  • Moreover, we will try to extend the problem of finding serial episode rules to that of finding general episode rules

