# Mining of Frequent Patterns from Sensor Data

### Mining of Frequent Patterns from Sensor Data

Presented by: Ivy Tong Suk Man

Supervisor: Dr. B C M Kao

20 August, 2003

• Outline of the Presentation

• Motivation

• Problem Definition

• Algorithm

• Apriori with data transformation

• Interval-List Apriori

• Experimental Results

• Conclusion

[Figure: a temperature timeline — 25ºC from t=0, 27ºC from t=1, 28ºC from t=5, 26ºC from t=10]

Motivation

• Continuous items

• reflect values from an entity that changes continuously in the external environment.

• Update ⇒ change of state of the real entity

• Initial temperature: 25ºC at t=0s

• Sequence of updates: <timestamp, new_temp>

<1s, 27ºC>, <5s, 28ºC>, <10s, 26ºC>, <14s,..> …

• t=0s to 1s, 25ºC

t=1s to 5s, 27ºC

t=5s to 10s, 28ºC

• What is the average temperature from t=0s to 10s?

• Ans: (25×1 + 27×4 + 28×5)/10 = 27.3ºC
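The average above can be sketched as a time-weighted sum over the piecewise-constant reading; `time_weighted_average` is a hypothetical helper, not part of the presented algorithms:

```python
# Time-weighted average of a piecewise-constant reading, as in the slide's
# temperature example. `time_weighted_average` is a hypothetical helper.

def time_weighted_average(initial, updates, t_end):
    """initial: value at t=0; updates: list of (timestamp, new_value);
    t_end: end of the averaging window."""
    total = 0.0
    prev_t, prev_v = 0, initial
    for t, v in updates:
        if t > t_end:
            break
        total += prev_v * (t - prev_t)  # value held over [prev_t, t)
        prev_t, prev_v = t, v
    total += prev_v * (t_end - prev_t)  # final segment up to t_end
    return total / t_end

# 25ºC at t=0, then <1s,27ºC>, <5s,28ºC>, <10s,26ºC>; window [0s, 10s]
avg = time_weighted_average(25, [(1, 27), (5, 28), (10, 26)], 10)
# (25*1 + 27*4 + 28*5) / 10 = 27.3
```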

• Time is a component in some applications

• E.g. stock price quotes, network traffic data

• “Sensors” are used to monitor some conditions, for example:

• Prices of stocks: by getting quotations from a finance website

• Weather: measuring temperature, humidity, air pressure, wind, etc.

• We want to find correlations of the readings among a set of sensors

• Goal: To mine association rules from sensor data

• How different is it from mining association rules from market basket data?

• Time component

When searching for association rules in market basket data, the time field is usually ignored, as there is no temporal correlation between transactions

• Streaming data

Data arrives continuously, possibly infinitely, and in large volume

Notations

• We have a set of sensors R = {r1,r2,…,rm}

• Each sensor ri has a set of numerical states Vi

• Assume binary states for all sensors

• Vi = {0,1} ∀i s.t. ri ∈ R

• Dataset D: a sequence of updates of sensor state in the form of <ts, ri, vi> where ri ∈ R, vi ∈ Vi

• ts : timestamp of the update

• ri: sensor to be updated

• vi: new value of the state of ri

• For sensors with binary states

• update in form of <ts, ri> as the new state can be inferred by toggling the old state

• R={A,B,C,D,E,F}

• Initial states: all off

• D: <1,A>, <2,B>, <4,D>, <5,A>, <6,E>, <7,F>, <8,E>, <10,A>, <11,F>, <13,C>

[Figure: ON/OFF timelines for sensors A–F over t ∈ [0,15] — A: ON in [1,5) and from t=10; B: ON from t=2; C: ON from t=13; D: ON from t=4; E: ON in [6,8); F: ON in [7,11)]

• An association rule is a rule, satisfying certain support and confidence restrictions, in the form X ⇒ Y, where X ⊆ R, Y ⊆ R and X ∩ Y = ∅

• Association rule X ⇒ Y has confidence c:

In c% of the time when the sensors in X are ON (with state = 1), the sensors in Y are also ON

• Association rule X ⇒ Y has support s:

In s% of the total length of the history, the sensors in X and Y are all ON

• TLS(X) denotes the Total LifeSpan of X

• the total length of time during which all sensors in X are ON

• T – total length of the history

• Sup(X) = TLS(X)/T

Conf(X ⇒ Y) = Sup(X ∪ Y) / Sup(X)

• Example:

T = 15s

TLS(A)=9, TLS(AB)=8

Sup(A) = 9/15 = 60%

Sup(AB) =8/15 = 53%

Conf(A ⇒ B) = 8/9 = 89%
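As a quick check of these definitions, a minimal sketch (the helper names are illustrative, not from the slides):

```python
# Support and confidence computed from total lifespans (TLS), following the
# definitions above. Helper names are illustrative.

def support(tls, total_history):
    """Sup(X) = TLS(X) / T."""
    return tls / total_history

def confidence(tls_xy, tls_x):
    """Conf(X => Y) = Sup(X u Y) / Sup(X) = TLS(X u Y) / TLS(X)."""
    return tls_xy / tls_x

# Slide example: T = 15, TLS(A) = 9, TLS(AB) = 8
T = 15
sup_A = support(9, T)        # 9/15 = 0.60
sup_AB = support(8, T)       # 8/15, about 0.53
conf_A_B = confidence(8, 9)  # 8/9, about 0.89
```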


• Transform & Apriori

• Transform the sequence of updates to the form of market basket data

• At each point of update

• take a snapshot of the states of all sensors

• Output all sensors with state=on as a transaction

• Attach a weight to each transaction:

Weight(transaction) = Lifespan(this update) = timestamp(next update) − timestamp(this update)

D: <1,A>, <2,B>, <4,D>, <5,A>, <6,E>, <7,F>, <8,E>, <10,A>, <11,F>, <13,C>

Algorithm A - Example

Taking a snapshot of the sensor states at each update (end of history = 15s), the transformed database D’ is:

| Timestamp | ON sensors | Weight (lifespan) |
|-----------|------------|-------------------|
| 1 | A | 1 |
| 2 | A, B | 2 |
| 4 | A, B, D | 1 |
| 5 | B, D | 1 |
| 6 | B, D, E | 1 |
| 7 | B, D, E, F | 1 |
| 8 | B, D, F | 2 |
| 10 | A, B, D, F | 1 |
| 11 | A, B, D | 2 |
| 13 | A, B, C, D | 2 |

• Apply Apriori on the transformed dataset D’

• Drawbacks:

• A lot of redundancy

• Adjacent transactions may be very similar, differing only by the single sensor whose state was updated
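The snapshot transformation described above can be sketched as a single pass, assuming binary sensors that start OFF and toggle on each update (`transform` is an illustrative name, not from the slides):

```python
# Sketch of the snapshot transformation (Algorithm A): turn a stream of
# <ts, sensor> toggle updates into weighted market-basket transactions.

def transform(updates, t_end):
    """updates: list of (ts, sensor); returns list of (frozenset_of_ON, weight)."""
    on = set()
    transactions = []
    for i, (ts, sensor) in enumerate(updates):
        on ^= {sensor}  # toggle the updated sensor's state
        next_ts = updates[i + 1][0] if i + 1 < len(updates) else t_end
        transactions.append((frozenset(on), next_ts - ts))  # weight = lifespan
    return transactions

# Slide example, end of history = 15s
D = [(1, 'A'), (2, 'B'), (4, 'D'), (5, 'A'), (6, 'E'),
     (7, 'F'), (8, 'E'), (10, 'A'), (11, 'F'), (13, 'C')]
Dp = transform(D, 15)
# First transaction: ({'A'}, 1); last: ({'A','B','C','D'}, 2)
```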

• Interval-List Apriori

• Uses an “interval-list” format

• <X, interval1, interval2, interval3, … >

where intervali is an interval during which all sensors in X are ON.

• TLS(X) = Σi (intervali.h − intervali.l)

• Example:

<A, [1,5), [10,15)> — TLS(A) = (5−1) + (15−10) = 9

• Step 1:

For each ri R,

build a list of interval in which ri is ON by scanning the sequence of updates

• Calculate the TLS of each ri

• If TLS(ri)  min_sup, put ri into L1

• Initial states: all off

• D:

<1,A>, <2,B>, <4,D>, <5,A>, <6,E>, <7,F>, <8,E>, <10,A>, <11,F>, <13,C>

• The interval lists start empty; each update either opens an interval (the sensor turns ON) or closes the current one (the sensor turns OFF). Intervals still open at the end of history (T = 15s) are closed at t = 15:

• <A, [1,5), [10,15)>

• <B, [2,15)>

• <C, [13,15)>

• <D, [4,15)>

• <E, [6,8)>

• <F, [7,11)>
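Step 1 can be sketched as one scan over the update stream, again assuming binary sensors that start OFF and toggle on each update; the function names are illustrative:

```python
# Sketch of Step 1: build per-sensor ON interval lists in a single scan of
# the toggle-update stream; intervals open at the end of history are closed.

def build_interval_lists(updates, t_end):
    intervals = {}   # sensor -> list of [l, h) pairs
    open_at = {}     # sensor -> timestamp at which it turned ON
    for ts, sensor in updates:
        if sensor in open_at:                      # ON -> OFF: close interval
            intervals.setdefault(sensor, []).append((open_at.pop(sensor), ts))
        else:                                      # OFF -> ON: open interval
            open_at[sensor] = ts
    for sensor, start in open_at.items():          # still ON at end of history
        intervals.setdefault(sensor, []).append((start, t_end))
    return intervals

def tls(interval_list):
    """Total lifespan: sum of interval lengths."""
    return sum(h - l for l, h in interval_list)

D = [(1, 'A'), (2, 'B'), (4, 'D'), (5, 'A'), (6, 'E'),
     (7, 'F'), (8, 'E'), (10, 'A'), (11, 'F'), (13, 'C')]
lists = build_interval_lists(D, 15)
# lists['A'] == [(1, 5), (10, 15)], tls(lists['A']) == 9
```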

• Step 2:

• Find all larger frequent sensor-sets

• Similar to the Apriori frequent-itemset property:

• Any subset of a frequent sensor-set must be frequent.

• Method:

• Generate candidates of size i+1 from frequent sensor-sets of size i.

• Approach used: join two size-i frequent sensor-sets to obtain a candidate of size i+1 if they agree on i−1 sensors

• May also prune candidates that have subsets that are not large.

• Count the support by merging (intersecting) the interval lists of the two size-i frequent sensor-sets

• If sup ≥ min_sup, put the candidate into Li+1

• Repeat the process until the candidate set is empty

• Example:

• <A, [1,5), [10,15)>

• <B, [2,15)>

• <AB, [2,5),[10,15)>
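The join that produces AB above can be sketched as a linear merge of the two sorted interval lists; `intersect` is an illustrative name:

```python
# Sketch of the interval-list join: intersect two sorted lists of [l, h)
# intervals with a linear merge.

def intersect(xs, ys):
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        l = max(xs[i][0], ys[j][0])
        h = min(xs[i][1], ys[j][1])
        if l < h:
            out.append((l, h))      # overlapping part: both sets are ON
        if xs[i][1] < ys[j][1]:     # advance the list whose interval ends first
            i += 1
        else:
            j += 1
    return out

# Slide example: A = [1,5), [10,15); B = [2,15)  ->  AB = [2,5), [10,15)
ab = intersect([(1, 5), (10, 15)], [(2, 15)])
```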

[Figure: lattice of sensor-sets with lifespans (T = 15, minimum support count = 3) — size 1: A:9, B:13, C:2, D:11, E:2, F:4 (C and E pruned); size 2: AB:8, AD:6, AF:1, BD:11, BF:4 (AF pruned); size 3: ABD:6]

• When generating a candidate sensor-set C of size i from two size i-1 sensor-sets LA and LB (subsets of C), we also construct the interval list of C by intersecting the interval lists of LA and LB.

• Joining the two interval lists (of length m and n) is a key step in our algorithm

• A simple linear scan requires O(m+n) time

• There are i different size-(i−1) subsets of C

which two to pick?

• Method 1:

• Choose two lists with fewest no of intervals

• ⇒ store the number of intervals for each sensor-set

• Method 2:

• Choose two lists with smallest count (TLS)

• Intuitively, a shorter lifespan implies fewer intervals

• Easier to implement

• The lifespan is already available from checking whether the sensor-set is frequent

• Data generation

• Simulate data generated by a set of n binary sensors

• Make use of a standard market basket data

• With n sensors, each of which can be either on or off

⇒ 2^n possible combinations of sensor states

• Assign a probability to each of the combinations

• How to assign the probabilities?

• Let N be the number of occurrences in the market basket data of the transaction that contains exactly the sensors that are ON

• E.g. Consider R={A,B,C,D,E,F}

• Suppose we want to assign prob to the sensor state AC (only A and C are ON)

• N is the number of transactions that contain exactly A and C and no other sensors

• Assign prob = N/|D|, where |D| is the size of the market basket dataset

• Note: the market basket data must be sufficiently large

• so that combinations that occur very infrequently are not given ZERO probability

• Generating sensor set data

• Choose the initial state (at t=0s)

• Randomly

• According to the probabilities assigned

• Pick the combination with highest probability assigned

⇒ first sensor-set state

• What is the next set of sensor-set states?

• For simplicity, in our model, only one sensor can be updated at a time

• For any two adjacent updates, the sensor-set states at the two time instants differ by only one sensor

=> change only one sensor state

⇒ n possible next combinations, obtained by toggling each of the n sensor states

• We normalize the probabilities of the n combinations by their sum

• Pick the next set of sensor-set states according to the normalized probabilities

• Inter-arrival time of updates: exponential distribution
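The generation loop above can be sketched as follows; `prob` here is a hypothetical placeholder for the probabilities derived from the market-basket counts (a uniform stand-in is used for illustration):

```python
# Sketch of the synthetic sensor-data generator: one sensor toggles per
# update, the next state is drawn from the normalized probabilities over the
# n single-toggle neighbours, and inter-arrival times are exponential.
import random

def generate_updates(sensors, prob, n_updates, rate=1.0, seed=0):
    """sensors: list of names; prob: maps a frozenset of ON sensors to a
    probability (hypothetical stand-in for the market-basket-derived one)."""
    rng = random.Random(seed)
    state = frozenset()          # initial state: all sensors OFF
    t = 0.0
    updates = []
    for _ in range(n_updates):
        # n candidate next states, each differing from `state` by one toggle
        neighbours = [state ^ {s} for s in sensors]
        weights = [prob(nb) for nb in neighbours]
        nxt = rng.choices(neighbours, weights=weights)[0]  # normalized draw
        toggled = next(iter(state ^ nxt))                  # the changed sensor
        t += rng.expovariate(rate)                         # exponential gap
        updates.append((t, toggled))
        state = nxt
    return updates

# Hypothetical uniform probabilities, for illustration only:
ups = generate_updates(['A', 'B', 'C'], lambda s: 1.0, 5)
```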

• 8,000,000 transactions

• 100 items

• number of maximal potentially large itemsets = 2000

• average transaction length: 10

• average length of maximal large itemsets: 4

• length of the maximal large itemsets: 11

• minimum support: 0.05%

• length of the maximal large itemsets: ?

• Algorithms:

• Apriori: cached mode

• IL-apriori:

• (a) random-join (IL-apriori)

• (b) join-by-smallest lifespan (IL-apriori-S)

• (c) join-by-fewest-no-of-intervals (IL-apriori-C)

• Performance of algorithms (larger support):

• All IL-apriori variants outperform cached Apriori

• Performance (lower support):

• More candidates ⇒ joining interval lists becomes expensive for IL-apriori

• More long frequent sensor-sets

• Apriori has to match the candidates by scanning through the DB

• IL-apriori-C and IL-apriori-S greatly reduce the time spent joining the lists

• Amounts of memory usage - peak memory usage

• Cached Apriori – stores the whole database

• IL-apriori – stores a large number of interval lists when the number of candidates grows large

(min_sup = 0.02%)

• Apriori is faster in the first 3 passes

• Running time for IL-apriori drops sharply afterwards

• Apriori has to scan over the whole database

• IL-apriori (C/S) needs to join relatively short interval-lists in later passes

(min_sup = 0.02%)

• Memory requirement for IL-apriori is a lot higher when there are more frequent sensor-set interval lists to join

(min_sup = 0.05%)

• Runtime for all algorithms increases linearly with total number of transactions

(min_sup = 0.05%)

• Memory required by all algorithms increases with the number of transactions.

• The rate of increase is faster for IL-apriori

• An interval-list method for mining sensor data was described

• The two interval-list joining strategies are quite effective in reducing running time

• Memory requirement is quite high

• Future Work

• Other methods for joining interval-lists

• Trade-off between time and space

• Extending to the streaming case

• Consider approaches other than the Lossy Counting algorithm (Manku and Motwani, VLDB ’02)