


### Mining Serial Episode Rules with Time Lags over Multiple Data Streams

Tung-Ying Lee, En Tzu Wang

Dept. of CS, National Tsing Hua Univ. (Taiwan)

Arbee L.P. Chen

Dept. of CS, National Chengchi Univ. (Taiwan)

DaWaK’08

Outline
• Introduction
• Related work
• Preliminaries
• Support of a serial episode
• Support/confidence of a serial episode rule
• Data structure used in the algorithms
• Algorithms
• LossyDL
• TLT
• Experiments
• Conclusions
Introduction
• In many applications, data are generated in the form of continuous data streams
• Continuously monitoring the flow and occupancy of a road to assess its congestion produces data streams
• When roads A and B have heavy traffic, road C will most likely be congested 5 minutes later
• The flow and occupancy values coming from the roads are treated as a multi-stream environment, from which serial episode rules are mined
• Serial episode rules with time lags (SER): $X \xrightarrow{lag} Y$
Related Work
• Finding episodes/episode rules from static time series data has been studied for decades
• Finding episodes over data streams
• Serial episodes [SSDBM04]
• Episodes [KDD07]

[Figure: examples of an episode, a serial episode, and a serial episode rule; the rule links a precursor to a successor through a time lag L]

Preliminaries
• Environment: a centralized system collecting n synchronized data streams DS1, DS2, …, DSn
• n-tuple event: a set of items coming from all streams at the same time
• itemset: a subset of an n-tuple event
• serial episode: described as an ordered list of itemsets

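Example: the serial episode (aA)(bB) occurs at times 1 to 2; {gA} is an itemset of the event at time 5; each column is one n-tuple event.

| time | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|------|---|---|---|---|---|---|---|---|
| DS1  | a | b | b | c | g | a | b | f |
| DS2  | A | B | S | G | A | B | A | F |
| DSn  |   |   |   |   |   |   |   |   |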

Preliminaries (cont.)
• Minimal occurrence: given a serial episode S, a time interval [a, b] is a minimal occurrence of S, if
• S occurs in [a, b]
• S does not occur in any proper subinterval of [a, b]
• If $(b - a + 1) \le T$, where T is a time bound given by users, [a, b] is valid
• MO(S): the set of all minimal occurrences of S
• Supp(S): the number of valid minimal occurrences of S

[Figure: minimal occurrences marked on DS1 and DS2, time bound T = 3]
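The definitions above can be sketched in a few lines of Python. This is only an illustrative offline scan, not the streaming algorithm of the talk; the function names `minimal_occurrences` and `support` are assumptions, and the sample events are the DS1/DS2 values from the earlier example slide.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int]


def minimal_occurrences(events: List[Set[str]], episode: List[Set[str]]) -> List[Interval]:
    """Return all minimal occurrences [a, b] (1-based times) of a serial episode."""
    result: List[Interval] = []
    previous_start = None
    for end, event in enumerate(events):
        if not episode[-1] <= event:
            continue  # the last itemset must appear exactly at `end`
        # Match the remaining itemsets backwards, each as late as possible.
        t = end
        for itemset in reversed(episode[:-1]):
            t -= 1
            while t >= 0 and not itemset <= events[t]:
                t -= 1
            if t < 0:
                break
        if t < 0:
            continue  # no occurrence ends at `end`
        # Latest starts never decrease with `end`; an unchanged start means a
        # shorter occurrence with the same start already exists.
        if t != previous_start:
            result.append((t + 1, end + 1))
        previous_start = t
    return result


def support(events: List[Set[str]], episode: List[Set[str]], T: int) -> int:
    """Supp(S): the number of minimal occurrences with b - a + 1 <= T."""
    return sum(1 for a, b in minimal_occurrences(events, episode) if b - a + 1 <= T)


# Sample data: the two streams shown on the earlier slide (times 1..8).
events = [{x, y} for x, y in zip("abbcgabf", "ABSGABAF")]
```

For instance, `minimal_occurrences(events, [{"b"}, {"a"}])` yields the single interval (3, 6), which spans four time points and is therefore not valid under T = 3.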

Preliminaries (cont.)
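• An SER is a rule $R: S_1 \xrightarrow{Lag = L} S_2$
• $\mathrm{Supp}(R) = |\{[a, b] \mid [a, b] \in \mathrm{MO}(S_1) \wedge [a, b] \text{ is valid} \wedge \exists [c, d] \in \mathrm{MO}(S_2) \text{ s.t. } [c, d] \text{ is valid} \wedge (c - a) = L\}|$
• $\mathrm{Conf}(R) = \mathrm{Supp}(R)/\mathrm{Supp}(S_1)$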

[Figure: example of an SER over DS1 and DS2, time bound T = 3; the figure carries the label 4]
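As a rough illustration of Supp(R) and Conf(R), the sketch below works directly on lists of minimal occurrences. The helper names and the sample intervals are hypothetical; only the counting rule (a valid occurrence of S1 starting at a, paired with a valid occurrence of S2 starting at a + L) comes from the definition above.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def valid(occurrences: List[Interval], T: int) -> List[Interval]:
    """Keep only the occurrences [a, b] with b - a + 1 <= T."""
    return [(a, b) for a, b in occurrences if b - a + 1 <= T]


def rule_support(mo_s1: List[Interval], mo_s2: List[Interval], lag: int, T: int) -> int:
    """Count valid occurrences of S1 that have a valid occurrence of S2 starting `lag` later."""
    successor_starts = {c for c, _ in valid(mo_s2, T)}
    return sum(1 for a, _ in valid(mo_s1, T) if a + lag in successor_starts)


def rule_confidence(mo_s1: List[Interval], mo_s2: List[Interval], lag: int, T: int) -> float:
    """Conf(R) = Supp(R) / Supp(S1); defined as 0.0 here when Supp(S1) is 0."""
    supp_s1 = len(valid(mo_s1, T))
    if supp_s1 == 0:
        return 0.0
    return rule_support(mo_s1, mo_s2, lag, T) / supp_s1


# Hypothetical minimal-occurrence lists for a precursor and a successor.
mo_precursor = [(1, 2), (6, 7)]
mo_successor = [(1, 1), (5, 5), (7, 7)]
```

With these lists, a lag of 4 and T = 3, only the precursor occurrence starting at time 1 finds a successor starting at time 5, so the confidence is one half.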

Preliminaries (cont.)
• Problem Formulation: given 4 parameters
• the maximum time lag (Lmax)
• the minimum support (minsup)
• the minimum confidence (minconf)
• the time bound (T)
• Find all SERs, e.g. $R: S_1 \xrightarrow{Lag = L} S_2$, satisfying
• $L \le L_{max}$
• $\mathrm{Supp}(R) \ge N \times \mathit{minsup}$ (N: the number of received n-tuple events)
• $\mathrm{Conf}(R) \ge \mathit{minconf}$
• Calculating supports for serial episodes and SERs must take T into account
Preliminaries (cont.)
• Using the prefix tree for keeping serial episodes
• S: a serial episode, X: an item
• S+X: X follows S
• S+_X: X and the last itemset in S appear at the same time

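[Figure: prefix tree with Root at level 0, items A and B at level 1, and child nodes B and _B at level 2; the path A, _B represents serial episode (AB) and the path A, B represents serial episode (A)(B); nodes are annotated with intervals such as [2, 3] and [1, 3]]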
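One possible way to encode such a prefix tree is sketched below. The `Node` class, the `_X` label convention for same-time children, and the choice to sort items inside an itemset are assumptions made for illustration; the slides only state that S+X and S+_X are distinct children of S.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Node:
    label: str
    children: Dict[str, "Node"] = field(default_factory=dict)
    # Minimal occurrences (LossyDL) or the last two of them (TLT) could be stored here.
    occurrences: List[Tuple[int, int]] = field(default_factory=list)


def path_labels(episode: List[Set[str]]) -> List[str]:
    """Map a serial episode to tree labels: 'X' starts a new itemset, '_X' joins the last one."""
    labels: List[str] = []
    for itemset in episode:
        items = sorted(itemset)  # assumed canonical order inside an itemset
        labels.append(items[0])
        labels.extend("_" + item for item in items[1:])
    return labels


def insert(root: Node, episode: List[Set[str]]) -> Node:
    """Walk or create the path for `episode` and return its node."""
    node = root
    for label in path_labels(episode):
        node = node.children.setdefault(label, Node(label))
    return node


root = Node("Root")
node_ab_same_time = insert(root, [{"A", "B"}])   # serial episode (AB)
node_a_then_b = insert(root, [{"A"}, {"B"}])     # serial episode (A)(B)
```

Both episodes share the level-1 node A and differ only in the child label, `_B` versus `B`, which mirrors the two kinds of extension on the slide.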

LossyDL

• The concept of LossyDL: keeping the valid minimal occurrences of a serial episode for generating rules

At time point 3, a 2-tuple event (BC) arrives; T = 3

Each item in the current 2-tuple event needs to be processed (traversing in a bottom-up order)

Processing C can generate (B)(C): [2, 3] and (BC): [3, 3]

Only the last two minimal occurrences need to be checked

[Figure: prefix tree nodes A and B with occurrence intervals [1, 1], [2, 2] and [3, 3]; a child node B holds [1, 2], and the candidate [1, 3] is marked as not minimal]

Using Lossy Counting [VLDB02], whenever $N \equiv 0 \pmod{1/\varepsilon}$, the oldest minimal occurrence is removed
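The update step on this slide might look roughly like the sketch below. It is one reading of the slide rather than the authors' code: `extend` and `prune_oldest` are hypothetical names, the parent and child lists stand for occurrence lists stored in prefix-tree nodes, and dropping an episode once its list is empty is an added assumption.

```python
import math
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]


def extend(parent_occs: List[Interval], child_occs: List[Interval],
           now: int, same_time: bool, T: int) -> Optional[Interval]:
    """Try to record an occurrence of the child episode (S+X or S+_X) ending at `now`."""
    for a, b in reversed(parent_occs[-2:]):  # only the last two are inspected
        # S+_X needs S to end at `now`; S+X needs S to end before `now`.
        if (b == now) if same_time else (b < now):
            break
    else:
        return None  # no usable occurrence of the parent episode
    if now - a + 1 > T:
        return None  # exceeds the time bound, so not valid
    if child_occs and child_occs[-1][0] >= a:
        return None  # a shorter occurrence with this start exists: not minimal
    child_occs.append((a, now))
    return (a, now)


def prune_oldest(table: Dict[str, List[Interval]], n: int, epsilon: float) -> None:
    """Whenever n is a multiple of 1/epsilon, drop the oldest occurrence of each episode."""
    if n % math.ceil(1 / epsilon) != 0:
        return
    for episode in list(table):
        table[episode].pop(0)
        if not table[episode]:
            del table[episode]  # assumption: episodes without occurrences are discarded


# State just before time 3 in the slide's example, plus B's occurrence at time 3.
occs_a = [(1, 1)]
occs_b = [(2, 2), (3, 3)]
occs_a_then_b = [(1, 2)]
```

Calling `extend(occs_b, [], 3, False, 3)` gives (2, 3) for (B)(C), `extend(occs_b, [], 3, True, 3)` gives (3, 3) for (BC), and `extend(occs_a, occs_a_then_b, 3, False, 3)` rejects [1, 3] because [1, 2] is already stored.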

LossyDL (Rule Generation)
• Mining SERs
• Any two serial episodes with supports $\ge (\mathit{minsup} - \varepsilon) \times N$ are checked to see whether any of their minimal occurrences can be combined; then Supp(R) can be computed
• Each $R: S_1 \xrightarrow{Lag = L} S_2$ is returned if
• $\mathrm{Supp}(R) \ge (\mathit{minsup} - \varepsilon) \times N$, and
• $(\mathrm{Supp}(R) + \varepsilon N)/\mathrm{Supp}(S_1) \ge \mathit{minconf}$
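The two acceptance conditions can be written as a small predicate. The function name `lossydl_accepts` and the numbers in the usage note are hypothetical; the thresholds follow the relaxed conditions listed above, where the support bound is lowered by ε and the confidence numerator is raised by εN to compensate for removed occurrences.

```python
def lossydl_accepts(supp_rule: int, supp_s1: int, n: int,
                    minsup: float, minconf: float, epsilon: float) -> bool:
    """Check the relaxed support and confidence conditions for a candidate SER."""
    if supp_s1 == 0:
        return False  # confidence is undefined without precursor occurrences
    enough_support = supp_rule >= (minsup - epsilon) * n
    enough_confidence = (supp_rule + epsilon * n) / supp_s1 >= minconf
    return enough_support and enough_confidence
```

For example, with N = 1000, minsup = 0.05, ε = 0.005 and minconf = 0.6, a rule with support 50 and precursor support 90 passes (55/90 is about 0.61), while the same rule with precursor support 100 fails the confidence test.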
TLT
• A lot of minimal occurrences are kept in LossyDL, but only the last two are used while updating
• Keeping supports instead of the minimal occurrence lists
• How to generate rules without the minimal occurrence lists?
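• Answer: use the following observations to prune the insignificant rules
• Observations
• For $X \xrightarrow{L} (AB)$ and $X \xrightarrow{L} A$, obviously $\mathrm{Supp}(X \xrightarrow{L} A) \ge \mathrm{Supp}(X \xrightarrow{L} (AB))$: $X \xrightarrow{L} (AB)$ is not significant if $X \xrightarrow{L} A$ fails either minsup or minconf
• For $(AB) \xrightarrow{L} (CD)$ and $A \xrightarrow{L} C$, obviously $\mathrm{Supp}(A \xrightarrow{L} C) \ge \mathrm{Supp}((AB) \xrightarrow{L} (CD))$: $(AB) \xrightarrow{L} (CD)$ is not significant if $\mathrm{Supp}(A \xrightarrow{L} C) < \mathrm{Supp}(AB) \times \mathit{minconf}$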
TLT (cont.)
• Observations (cont.):
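• Given an SER $(A)(B) \xrightarrow{5} (CD)$, and T = 3
• $A \xrightarrow{1} B$ or $A \xrightarrow{2} B$, that is $A \xrightarrow{p} B$, $0 < p < T$
• $A \xrightarrow{1} B \xrightarrow{4} (CD)$, $A \xrightarrow{2} B \xrightarrow{3} (CD)$, that is $A \xrightarrow{p} B \xrightarrow{L-p} (CD)$
• $\mathrm{Supp}(A \xrightarrow{p} B \xrightarrow{L-p} (CD)) \le \min(\mathrm{Supp}(A \xrightarrow{p} B), \mathrm{Supp}(B \xrightarrow{L-p} C))$
• $(A)(B) \xrightarrow{5} (CD)$ is not significant, if
• $\sum_{p} \min(\mathrm{Supp}(A \xrightarrow{p} B), \mathrm{Supp}(B \xrightarrow{L-p} C)) < \mathrm{Supp}((A)(B)) \times \mathit{minconf}$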
• Using the observations to prune insignificant rules
• Time lag table (TLT)
• $A \xrightarrow{L} B$ is a reduced SER, if A and B are single items
• To find $S_1 \xrightarrow{L_{max}} S_2$, reduced SERs with a time lag of at most $L_{max} + T - 1$ are needed (measured from the first itemset of the precursor to the last itemset of the successor)
• $L_{max} + T - 1$ Time Lag Tables are used to keep the supports of reduced SERs
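The pruning test built from these observations can be sketched as follows. The layout of `tables` (a dict from time lag to a dict of item pairs and supports), the function names, and the sample counts are all assumptions for illustration; the bound itself is the sum over p of the smaller of the two reduced-SER supports, as in the observation for $(A)(B) \xrightarrow{L} (CD)$.

```python
from typing import Dict, Tuple

TimeLagTables = Dict[int, Dict[Tuple[str, str], int]]


def upper_bound(tables: TimeLagTables, a: str, b: str, c: str, lag: int, T: int) -> int:
    """Upper bound on Supp((a)(b) -> (c...)) with total time lag `lag`."""
    total = 0
    for p in range(1, T):  # p is the gap between a and b, 0 < p < T
        supp_ab = tables.get(p, {}).get((a, b), 0)
        supp_bc = tables.get(lag - p, {}).get((b, c), 0)
        total += min(supp_ab, supp_bc)
    return total


def can_prune(tables: TimeLagTables, a: str, b: str, c: str, lag: int, T: int,
              supp_precursor: int, minconf: float) -> bool:
    """The candidate is insignificant if even the upper bound misses the confidence target."""
    return upper_bound(tables, a, b, c, lag, T) < supp_precursor * minconf


# Hypothetical supports of reduced SERs, indexed by time lag.
tables = {
    1: {("A", "B"): 6},
    2: {("A", "B"): 3},
    3: {("B", "C"): 2},
    4: {("B", "C"): 5},
}
```

With lag 5 and T = 3 the bound is min(6, 5) + min(3, 2) = 7, so a precursor support of 20 and minconf of 0.5 would prune the candidate, since 7 is below 10.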
TLT (cont.)
• The support and the last two minimal occurrences of a serial episode are kept in the prefix tree
• Keeping supports instead of keeping minimal occurrence lists
• Keeping the last two minimal occurrences for updating the supports
• Whenever $N \equiv 0 \pmod{1/\varepsilon}$, all supports are decreased by 1
• In addition, the last $L_{max} + T - 1$ n-tuple events are kept for updating the Time Lag Tables
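A minimal sketch of this bookkeeping is given below, under several assumptions: the class name `TLTState` is hypothetical, only single-item supports are tracked in place of the full prefix tree, the periodic decrement is applied to those supports only, and entries that reach zero are dropped.

```python
import math
from collections import Counter, deque
from typing import Set


class TLTState:
    def __init__(self, l_max: int, T: int, epsilon: float) -> None:
        self.max_lag = l_max + T - 1
        self.window = deque(maxlen=self.max_lag)  # the last Lmax+T-1 events
        # One table per time lag: (x, y) -> support of the reduced SER x -> y.
        self.tables = {lag: Counter() for lag in range(1, self.max_lag + 1)}
        self.supports = Counter()  # stand-in for supports kept in the prefix tree
        self.width = math.ceil(1 / epsilon)
        self.n = 0

    def process(self, event: Set[str]) -> None:
        self.n += 1
        # Pair every item seen `lag` steps ago with every item arriving now.
        for lag, past in enumerate(reversed(self.window), start=1):
            for x in past:
                for y in event:
                    self.tables[lag][(x, y)] += 1
        for item in event:
            self.supports[item] += 1
        self.window.append(frozenset(event))
        if self.n % self.width == 0:  # N is a multiple of 1/epsilon
            for key in list(self.supports):
                self.supports[key] -= 1
                if self.supports[key] <= 0:
                    del self.supports[key]


state = TLTState(l_max=2, T=2, epsilon=0.25)
for ev in [{"A"}, {"B"}, {"A", "C"}, {"B"}]:
    state.process(ev)
```

The bounded `deque` discards events older than the largest lag automatically, which keeps the memory for table updates independent of the stream length.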
TLT (Rule Generation)
• Mining SERs
• Any two serial episodes with supports $\ge (\mathit{minsup} - \varepsilon) \times N$ form the candidate SERs
• A candidate SER will be returned if it can pass the pruning rules from the above observations
Experiments
• Two real datasets
• PDOMEI: the dataset contains the dryness and climate indices derived by experts, usually used to predict droughts
• Four streams, 28 distinct items
• Traffic: the dataset is Twin Cities' traffic data collected near 50th St. during the first week of Feb. 2006
• Three streams, 55 distinct items
• Parameter setting
• $\varepsilon = 0.1 \times \mathit{minsup}$
• Lmax = 10
Conclusions
• We address the problem of finding significant serial episode rules with time lags over multiple data streams and propose two methods to solve it. TLT is more space-efficient, whereas LossyDL has higher precision
• In the near future, we will combine these two methods into a hybrid method to investigate the balance between memory space and precision
• Moreover, we will try to extend the problem of finding serial episode rules to that of finding general episode rules