Mining Serial Episode Rules with Time Lags over Multiple Data Streams


Tung-Ying Lee, En Tzu Wang

Dept. of CS, National Tsing Hua Univ. (Taiwan)

Arbee L.P. Chen

Dept. of CS, National Chengchi Univ. (Taiwan)

DaWaK’08

- Introduction
- Related work
- Preliminaries
- Support of a serial episode
- Support/confidence of a serial episode rule
- Data structure used in the algorithms

- Algorithms
- LossyDL
- TLT

- Experiments
- Conclusions

- In many applications, data are generated as continuous data streams
- For example, continuously monitoring the flow and occupancy of a road to characterize its congestion condition forms data streams
- A typical rule: when roads A and B have heavy traffic, road C will most likely be congested 5 minutes later
- The flow and occupancy values coming from several roads are regarded as a multi-stream environment, from which serial episode rules are mined
- Serial episode rules with time lags (SER): X →[lag] Y

- Finding episodes/episode rules from static time series data has been studied for decades
- Finding episodes over data streams
- Serial episodes [SSDBM04]
- Episodes [KDD07]

[Figure: an episode, a serial episode, and a serial episode rule; in the rule, a precursor is followed by a successor after a time lag L]

- Environment: a centralized system collecting n synchronized data streams DS1, DS2, …, DSn
- n-tuple event: a set of items coming from all streams at the same time
- itemset: a subset of an n-tuple event
- serial episode: described as an ordered list of itemsets

time:  1  2  3  4  5  6  7  8
DS1:   a  b  b  c  g  a  b  f
DS2:   A  B  S  G  A  B  A  F
…
DSn:   (items not shown)

Each column is an n-tuple event. For example, {gA} (time 5) is an itemset, and (aA)(bB) (times 1 to 2) is a serial episode.

Preliminaries (cont.)

- Minimal occurrence: given a serial episode S, a time interval [a, b] is a minimal occurrence of S, if
- S occurs in [a, b]
- S does not occur in any proper subintervals of [a, b]
- If (b-a+1) ≤ T, where T is a time bound given by users, [a, b] is valid

- MO(S): the set of all minimal occurrences of S
- Supp(S): the number of valid minimal occurrences of S
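
The definitions above can be made concrete with a small sketch. The Python below is an illustration only, not the authors' implementation: it models the two example streams as one item set per time point, and the names `EVENTS`, `earliest_end`, `minimal_occurrences` and `support` are assumptions introduced here. It finds minimal occurrences by brute force and counts those whose length is at most T.

```python
# Example streams from the slides: DS1 (lowercase items) and DS2 (uppercase items).
DS1 = ["a", "b", "b", "c", "g", "a", "b", "f"]
DS2 = ["A", "B", "S", "G", "A", "B", "A", "F"]
EVENTS = [{x, y} for x, y in zip(DS1, DS2)]  # EVENTS[t - 1] is the 2-tuple event at time t


def earliest_end(events, episode, start):
    """Earliest time b such that `episode` occurs in [start, b] with its first
    itemset matched exactly at `start`; None if no such occurrence exists."""
    if not episode[0] <= events[start - 1]:
        return None
    t = start
    for itemset in episode[1:]:
        t += 1  # each later itemset must appear strictly later
        while t <= len(events) and not itemset <= events[t - 1]:
            t += 1
        if t > len(events):
            return None
    return t


def minimal_occurrences(events, episode):
    """MO(S): intervals in which S occurs and that contain no smaller occurrence."""
    candidates = []
    for a in range(1, len(events) + 1):
        b = earliest_end(events, episode, a)
        if b is not None:
            candidates.append((a, b))
    return [
        (a, b)
        for (a, b) in candidates
        if not any((c, d) != (a, b) and a <= c and d <= b for (c, d) in candidates)
    ]


def support(events, episode, T):
    """Supp(S): number of minimal occurrences [a, b] with (b - a + 1) <= T."""
    return sum(1 for a, b in minimal_occurrences(events, episode) if b - a + 1 <= T)
```

With these streams, the episode (b)(A) occurs in [2, 5] and in [3, 5]; only [3, 5] is minimal, and it is valid for T = 3 but not for T = 2.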

[Figure: example over streams DS1 and DS2 with time bound T = 3]

Preliminaries (cont.)

- A SER is R: S1 →[Lag = L] S2
- Supp(R) = |{[a, b] | [a, b] ∈ MO(S1), [a, b] is valid, and ∃ [c, d] ∈ MO(S2) such that [c, d] is valid and (c-a) = L}|
- Conf(R) = Supp(R)/Supp(S1)
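
As a hedged illustration of these two measures, the sketch below computes them from lists of minimal occurrences that are assumed to be already available. The function names and the sample occurrence lists are assumptions made for this example; the lag is measured from the start of the precursor occurrence to the start of the successor occurrence, as in the definition.

```python
def is_valid(interval, T):
    """An occurrence [a, b] is valid when its length (b - a + 1) is at most T."""
    a, b = interval
    return b - a + 1 <= T


def rule_support(mo_s1, mo_s2, lag, T):
    """Count valid occurrences [a, b] of S1 for which some valid occurrence
    [c, d] of S2 starts exactly `lag` time points later (c - a == lag)."""
    valid_starts_s2 = {c for (c, d) in mo_s2 if is_valid((c, d), T)}
    return sum(
        1 for (a, b) in mo_s1 if is_valid((a, b), T) and a + lag in valid_starts_s2
    )


def rule_confidence(mo_s1, mo_s2, lag, T):
    """Conf(R) = Supp(R) / Supp(S1); defined as 0.0 here when Supp(S1) is 0."""
    supp_s1 = sum(1 for iv in mo_s1 if is_valid(iv, T))
    if supp_s1 == 0:
        return 0.0
    return rule_support(mo_s1, mo_s2, lag, T) / supp_s1


# Hypothetical inputs: minimal occurrences of a precursor S1 and a successor S2.
MO_S1 = [(1, 2), (6, 7)]
MO_S2 = [(1, 1), (5, 5), (7, 7)]
```

For lag 4 and T = 3, only the occurrence [1, 2] of S1 is matched (by [5, 5]), so the support is 1 and the confidence is 1/2.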

[Figure: example of a SER over streams DS1 and DS2 with time bound T = 3]

Preliminaries (cont.)

- Problem Formulation: given 4 parameters
- the maximum time lag (Lmax)
- the minimum support (minsup)
- the minimum confidence (minconf)
- the time bound (T)

- Find all SERs, e.g. R: S1 →[Lag = L] S2, satisfying
- L ≤ Lmax
- Supp(R) ≥ N × minsup (N: the number of received n-tuple events)
- Conf(R) ≥ minconf
- Calculating supports for serial episodes and SERs must take T into account

Preliminaries (cont.)

- Using the prefix tree for keeping serial episodes
- S: a serial episode, X: an item
- S+X: X follows S
- S+_X: X and the last itemset in S appear at the same time
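
A minimal sketch of such a prefix tree follows, assuming that each child edge is keyed by an item plus a flag telling whether the item joins the parent's last itemset (the "_X" case) or starts a new one. The class and function names, and the choice to sort items inside an itemset to get a canonical path, are assumptions of this example rather than details given in the slides.

```python
class PrefixNode:
    """One node per serial episode; the path from the root spells the episode."""

    def __init__(self):
        self.children = {}     # (item, joined) -> PrefixNode; joined=True means "_item"
        self.occurrences = []  # room for minimal occurrences [a, b] of this episode


def path_of(episode):
    """Flatten an episode into edge keys, e.g. (AB) -> A, _B and (A)(B) -> A, B."""
    path = []
    for itemset in episode:
        for position, item in enumerate(sorted(itemset)):
            path.append((item, position > 0))
    return path


def insert(root, episode):
    """Return the node for `episode`, creating missing nodes on the way."""
    node = root
    for key in path_of(episode):
        node = node.children.setdefault(key, PrefixNode())
    return node


root = PrefixNode()
node_ab = insert(root, [("A", "B")])      # serial episode (AB)
node_a_b = insert(root, [("A",), ("B",)])  # serial episode (A)(B)
```

Both episodes share the level-1 node A and differ only at level 2, where (AB) follows the "_B" edge and (A)(B) follows the "B" edge.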

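[Figure: prefix tree. Level 0: Root. Level 1: nodes A and B. Level 2: under A, node _B represents serial episode (AB) and node B represents serial episode (A)(B). Nodes carry occurrence intervals such as [2, 3] and [1, 3].]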

LossyDL

- The concept of LossyDL: keeping the valid minimal occurrences of a serial episode for generating rules

Example: at time point 3, a 2-tuple event (BC) arrives, with T = 3

Each item in the current 2-tuple event is processed by traversing the prefix tree in bottom-up order

Processing C can generate (B)(C): [2, 3] and (BC): [3, 3]

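[Figure: prefix tree nodes A and B with occurrence intervals [1, 1], [2, 2] and [3, 3]; for a child node B, the interval [1, 2] is kept and the candidate [1, 3] is rejected as not minimal]

Only the last two minimal occurrences need to be checked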

Using Lossy Counting [VLDB02], whenever N ≡ 0 (mod 1/ε), where ε is the error parameter, the oldest minimal occurrence is removed
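
The bookkeeping described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the LossyDL procedure itself: the class `OccurrenceStore` and its methods are hypothetical names, the minimality test compares a new interval only with the most recent stored one (the slides say the last two are checked), the period is taken as ⌈1/ε⌉, and episodes whose list becomes empty are dropped.

```python
import math
from collections import deque


class OccurrenceStore:
    def __init__(self, eps):
        self.width = math.ceil(1 / eps)  # assumed period between removals
        self.n = 0                       # N: number of n-tuple events received
        self.occ = {}                    # episode label -> deque of (a, b), oldest first

    def record(self, episode, a, b):
        """Store [a, b] unless it contains the most recent stored occurrence,
        in which case it is not minimal (e.g. [1, 3] after [1, 2])."""
        occs = self.occ.setdefault(episode, deque())
        if occs:
            c, d = occs[-1]
            if a <= c and d <= b:
                return False
        occs.append((a, b))
        return True

    def end_of_event(self):
        """Call once per received n-tuple event; removes the oldest occurrence
        of every episode whenever N is a multiple of the period."""
        self.n += 1
        if self.n % self.width == 0:
            for episode in list(self.occ):
                self.occ[episode].popleft()
                if not self.occ[episode]:
                    del self.occ[episode]
```

For example, with ε = 0.5 the period is 2, so after two events each stored list loses its oldest interval.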

LossyDL (Rule Generation)

- Mining SERs
- Any two serial episodes with supports ≥ (minsup - ε) × N are checked to see whether any of their minimal occurrences can be combined; Supp(R) is then computed
- Each R: S1 →[Lag = L] S2 is returned if
- Supp(R) ≥ (minsup - ε) × N, and
- (Supp(R) + εN)/Supp(S1) ≥ minconf
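
The reporting test can be written as a short predicate. The sketch below assumes the relaxed thresholds read as Supp(R) ≥ (minsup - ε)N and (Supp(R) + εN)/Supp(S1) ≥ minconf, which is the usual way Lossy Counting compensates for undercounting; the function name and the sample numbers are hypothetical.

```python
def is_reported(supp_r, supp_s1, n, minsup, minconf, eps):
    """Return True when a rule passes both relaxed thresholds.

    supp_r  : stored (possibly undercounted) support of the rule R
    supp_s1 : stored support of the precursor S1
    n       : number of n-tuple events received so far
    """
    if supp_s1 <= 0:
        return False
    passes_support = supp_r >= (minsup - eps) * n
    # eps * n is added because the stored count may undercount the true support.
    passes_confidence = (supp_r + eps * n) / supp_s1 >= minconf
    return passes_support and passes_confidence
```

With N = 1000, minsup = 0.05, ε = 0.005 and minconf = 0.6, the support threshold is 45 and a rule with stored support 50 and precursor support 90 passes, since (50 + 5)/90 is about 0.61.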

TLT

- A lot of minimal occurrences are kept in LossyDL, but only the last two are used while updating
- Keeping supports instead of the minimal occurrence lists
- How to generate rules without the minimal occurrence lists?
- Answer: use the following observations to prune the insignificant rules

- Observations
- X →[L] (AB) and X →[L] A: obviously Supp(X →[L] A) ≥ Supp(X →[L] (AB)), so X →[L] (AB) is not significant if X →[L] A fails either minsup or minconf
- (AB) →[L] (CD) and A →[L] C: obviously Supp(A →[L] C) ≥ Supp((AB) →[L] (CD)), so (AB) →[L] (CD) is not significant if Supp(A →[L] C) < Supp(AB) × minconf

TLT (cont.)

- Observations (cont.):
- Given a SER (A)(B) →[5] (CD) and T = 3
- The precursor occurs as A →[1] B or A →[2] B, that is A →[p] B with 0 < p < T (T-1 types)
- The rule then occurs as A →[1] B →[4] (CD) or A →[2] B →[3] (CD), that is A →[p] B →[L-p] (CD)
- Supp(A →[p] B →[L-p] (CD)) ≤ min(Supp(A →[p] B), Supp(B →[L-p] C))
- (A)(B) →[5] (CD) is not significant if
- Σp min(Supp(A →[p] B), Supp(B →[L-p] C)) < Supp((A)(B)) × minconf

- Using the observations to prune insignificant rules
- Time lag table (TLT)
- A →[L] B is a reduced SER if A and B are single items
- For finding S1 →[L] S2 with L ≤ Lmax, the reduced SERs involved have a time lag of at most Lmax+T-1 (from the first itemset of the precursor to the last itemset of the successor)
- Lmax+T-1 time lag tables keep the supports of the reduced SERs
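
To make the pruning bound tangible, the sketch below builds time lag tables offline from a stored list of events and evaluates the sum of minima for a rule of the form (A)(B) →[L] (C…). It is an illustration under assumptions: the function names are hypothetical, the tables are built in one batch pass rather than incrementally over a stream, and only the case L ≥ T is handled.

```python
from collections import Counter

# Example streams from the slides, one 2-tuple event per time point.
DS1 = ["a", "b", "b", "c", "g", "a", "b", "f"]
DS2 = ["A", "B", "S", "G", "A", "B", "A", "F"]
EVENTS = [{x, y} for x, y in zip(DS1, DS2)]


def build_time_lag_tables(events, max_lag):
    """tables[lag][(x, y)] = number of time points t with x at t and y at t + lag,
    i.e. the support of the reduced SER x ->[lag] y."""
    tables = {lag: Counter() for lag in range(1, max_lag + 1)}
    for t, event in enumerate(events):
        for lag in range(1, max_lag + 1):
            if t + lag >= len(events):
                break  # larger lags would also fall outside the data
            for x in event:
                for y in events[t + lag]:
                    tables[lag][(x, y)] += 1
    return tables


def support_upper_bound(tables, a, b, c, lag, T):
    """Sum over p = 1 .. T-1 of min(Supp(a ->[p] b), Supp(b ->[lag-p] c))."""
    assert lag >= T, "this sketch only covers lags of at least T"
    return sum(
        min(tables[p][(a, b)], tables[lag - p][(b, c)]) for p in range(1, T)
    )


def can_prune(tables, a, b, c, lag, T, supp_precursor, minconf):
    """The rule is not significant when the bound is below Supp(precursor) x minconf."""
    return support_upper_bound(tables, a, b, c, lag, T) < supp_precursor * minconf


L_MAX, T = 4, 3
TABLES = build_time_lag_tables(EVENTS, L_MAX + T - 1)
```

On the example streams, the bound for (a)(b) →[4] (A) is min(2, 1) + min(1, 1) = 2. With a precursor support of 2 and minconf = 0.6 the threshold is 1.2, so this candidate is kept for further checking.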

TLT (cont.)

- The support and the last two minimal occurrences of a serial episode are kept in the prefix tree
- Supports are kept instead of minimal occurrence lists
- The last two minimal occurrences are kept for updating the supports
- Whenever N ≡ 0 (mod 1/ε), all supports are decreased by 1

- In addition, the last Lmax+T-1 n-tuple events are kept for updating the time lag tables
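
A rough sketch of this state is given below. It is not the authors' code: the class `TLTState` and its methods are hypothetical, the period is taken as ⌈1/ε⌉, and dropping episodes whose support reaches zero is an assumption borrowed from Lossy Counting. The bounded buffer holds the last Lmax+T-1 events.

```python
import math
from collections import deque


class TLTState:
    def __init__(self, eps, l_max, T):
        self.width = math.ceil(1 / eps)            # assumed period between decrements
        self.n = 0                                 # N: number of events received
        self.supports = {}                         # episode label -> stored support
        self.recent = deque(maxlen=l_max + T - 1)  # last Lmax+T-1 n-tuple events

    def add_support(self, episode, count=1):
        """Increase the stored support of an episode (e.g. on a new valid occurrence)."""
        self.supports[episode] = self.supports.get(episode, 0) + count

    def on_event(self, event):
        """Buffer the event; every `width` events, decrease all supports by 1."""
        self.recent.append(event)
        self.n += 1
        if self.n % self.width == 0:
            self.supports = {
                episode: s - 1 for episode, s in self.supports.items() if s - 1 > 0
            }
```

Because `deque(maxlen=...)` discards the oldest entry automatically, the buffer never holds more than Lmax+T-1 events, which matches the window needed to update the time lag tables.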

TLT (Rule Generation)

- Mining SERs
- Any two serial episodes with supports ≥ (minsup - ε) × N form a candidate SER
- A candidate SER will be returned if it can pass the pruning rules from the above observations


Experiments

- Two real datasets
- PDOMEI: the dataset contains the dryness and climate indices derived by experts, usually used to predict droughts
- Four streams; number of distinct items = 28

- Traffic: the dataset is the Twin Cities' traffic data near 50th St. during the first week of Feb. 2006
- Three streams; number of distinct items = 55

- Parameter setting
- ε = 0.1 × minsup
- Lmax = 10

Conclusions

- We address the problem of finding significant serial episode rules with time lags over multiple data streams and propose two methods to solve it. TLT is more space-efficient, but LossyDL has high precision
- In the near future, we will combine these two methods into a hybrid method to investigate the balance between memory space and precision
- Moreover, we will try to extend the problem of finding serial episode rules to that of finding general episode rules