On Demand Classification of Data Streams


Charu C. Aggarwal

Jiawei Han

Philip S. Yu

Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA, Aug. 2004

Speaker: Pei-Min Chou

Date: 2005/04/01

- Introduction
- Supervised Micro-cluster
- Snapshot
- Maintenance of Supervised Micro-clusters
- Training Data Stream
- Classification on Demand
- Empirical Results

- Advances in data storage make it possible to record data that often grows without limit; such data is referred to as a data stream
- One-pass mining model
does not recognize changes in the stream, and it is too expensive to keep track of the entire history

- The accuracy of a static classification model is likely to drop when there is a sudden burst in the stream
- Our model
simultaneous training and testing streams are used for dynamic classification of data sets

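- Supervised micro-clusters are created only from the training data, and each one contains points of the same class
- Data streams
- Multi-dimensional points $X_1, \ldots, X_k, \ldots$ arriving with time stamps $T_1, \ldots, T_k, \ldots$
- Each point contains d dimensions, i.e., $X_i = (x_i^1, \ldots, x_i^d)$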

- A micro-cluster for n points is defined as a (2*d + 4) tuple: $(CF2^x, CF1^x, CF2^t, CF1^t, n, class\_id)$

- $CF2^x$ - the sum of the squares of the data values (one entry per dimension)
- $CF1^x$ - the sum of the data values (one entry per dimension)
- $CF2^t$ - the sum of the squares of the time stamps
- $CF1^t$ - the sum of the time stamps
- $n$ - the number of data points
- $class\_id$ - the class label of that micro-cluster
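A minimal sketch of the (2*d + 4) tuple as a data structure may make the bookkeeping concrete. The class name `MicroCluster` and its method names are illustrative assumptions, not names from the paper; only the stored fields follow the definition above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MicroCluster:
    """Supervised micro-cluster summary (hypothetical helper class)."""
    d: int                                           # number of dimensions
    class_id: int                                    # class label shared by all points
    cf2x: List[float] = field(default_factory=list)  # per-dimension sum of squares
    cf1x: List[float] = field(default_factory=list)  # per-dimension sum
    cf2t: float = 0.0                                # sum of squared time stamps
    cf1t: float = 0.0                                # sum of time stamps
    n: int = 0                                       # number of points

    def __post_init__(self):
        # Start with zeroed per-dimension sums when none are supplied.
        if not self.cf2x:
            self.cf2x = [0.0] * self.d
        if not self.cf1x:
            self.cf1x = [0.0] * self.d

    def add(self, x, t):
        """Fold one point x (length d) with time stamp t into the summary."""
        for j, v in enumerate(x):
            self.cf2x[j] += v * v
            self.cf1x[j] += v
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def centroid(self):
        """Centroid recovered from the stored sums alone."""
        return [s / self.n for s in self.cf1x]

    def as_tuple(self):
        """Flatten to the (2*d + 4) tuple."""
        return (*self.cf2x, *self.cf1x, self.cf2t, self.cf1t, self.n, self.class_id)
```

Because every field is a running sum, a point can be added in O(d) time without keeping the point itself.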

- Keeping track of history is not too expensive when only snapshots are stored
- A snapshot stores the behavior of the micro-clusters at a particular moment in time
- A snapshot taken at time t is stored in frame number i if (t mod $2^i$) = 0 but (t mod $2^{i+1}$) ≠ 0
- When a frame reaches max capacity, the oldest snapshot in this frame is removed
- Geometric time frame
- Frame numbers vary from 0 to a value no larger than $\log_2(T)$,
where T is the maximum length of the stream

- Maximum number of snapshots stored
= (max capacity) * $\log_2(T)$
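The frame rule can be sketched in a few lines. This is an illustration under stated assumptions: the names `frame_number` and `GeometricTimeFrame` are hypothetical, time stamps are assumed to be positive integers, and a snapshot is any object the caller supplies.

```python
from collections import deque


def frame_number(t):
    """Return i such that t mod 2^i == 0 but t mod 2^(i+1) != 0 (t >= 1)."""
    i = 0
    while t % (2 ** (i + 1)) == 0:
        i += 1
    return i


class GeometricTimeFrame:
    """Keeps at most max_capacity snapshots per frame (hypothetical helper)."""

    def __init__(self, max_capacity):
        self.max_capacity = max_capacity
        self.frames = {}  # frame number -> deque of (t, snapshot)

    def store(self, t, snapshot):
        i = frame_number(t)
        frame = self.frames.setdefault(i, deque(maxlen=self.max_capacity))
        # A deque with maxlen silently drops its oldest entry when full.
        frame.append((t, snapshot))

    def stored_times(self):
        return sorted(t for frame in self.frames.values() for t, _ in frame)
```

With `max_capacity=2` and snapshots taken at t = 1..8, frame 0 keeps only the two most recent odd time stamps, while the sparser higher frames still hold older ones, so recent history is kept at fine granularity and older history at coarse granularity.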


- Uses nearest neighbor and k-means algorithms
- The initial micro-clusters are created by an offline process
- Offline component: answers various user queries based on the stored summary statistics

- When a new data point $X_{i_k}$ arrives, it is either added to a micro-cluster, or a new micro-cluster is created
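The add-or-create decision can be sketched as follows. The slide does not state the acceptance test, so this sketch assumes one: a point joins the nearest micro-cluster of its own class if it lies within `boundary_factor` times that cluster's RMS radius, and single-point clusters use a fixed `default_radius`. Both parameters and all function names are hypothetical, and deleting or merging clusters to stay within a memory budget is not covered.

```python
import math


def new_cluster(x, t, class_id):
    """Create a micro-cluster summary holding a single point."""
    return {"cf2x": [v * v for v in x], "cf1x": list(x),
            "cf2t": t * t, "cf1t": t, "n": 1, "class_id": class_id}


def centroid(mc):
    return [s / mc["n"] for s in mc["cf1x"]]


def rms_radius(mc):
    """RMS deviation from the centroid, computed from the stored sums only."""
    n = mc["n"]
    var = sum(s2 / n - (s1 / n) ** 2 for s2, s1 in zip(mc["cf2x"], mc["cf1x"]))
    return math.sqrt(max(var, 0.0))


def insert_point(clusters, x, t, class_id, boundary_factor=2.0, default_radius=1.0):
    """Absorb x into the nearest same-class cluster, or start a new cluster."""
    same_class = [mc for mc in clusters if mc["class_id"] == class_id]
    if same_class:
        nearest = min(same_class, key=lambda mc: math.dist(x, centroid(mc)))
        # Assumed rule: single-point clusters have no spread yet.
        radius = rms_radius(nearest) if nearest["n"] > 1 else default_radius
        if math.dist(x, centroid(nearest)) <= boundary_factor * radius:
            for j, v in enumerate(x):
                nearest["cf2x"][j] += v * v
                nearest["cf1x"][j] += v
            nearest["cf2t"] += t * t
            nearest["cf1t"] += t
            nearest["n"] += 1
            return nearest
    mc = new_cluster(x, t, class_id)
    clusters.append(mc)
    return mc
```

Restricting the search to clusters of the same class keeps every micro-cluster pure with respect to its class label.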

- Construct
- Find the correct time horizon
- Choose the value of kfit
- Decide whether a large or small horizon should be chosen

- Test

- Macro-clusters are created over a user-specified time horizon h
- Let S(tc) be the snapshot of micro-clusters at time tc,
and S(tc-h) the snapshot of micro-clusters at time tc-h

- The new set of micro-clusters N(tc, h) is created by subtracting S(tc-h) from S(tc)
- Subtractive property
- Let C1 and C2 be two sets of points such that $C_1 \supseteq C_2$.
Then $CFT(C_1 - C_2) = CFT(C_1) - CFT(C_2)$
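The subtractive property holds because every entry of the cluster feature vector is a plain sum over points. A small sketch can check this numerically; the helper names `cft` and `subtract` are illustrative, and the class label is left out because it is identical on both sides.

```python
def cft(points):
    """Cluster feature vector (CF2x, CF1x, CF2t, CF1t, n) for (x, t) pairs."""
    d = len(points[0][0])
    cf2x = [sum(x[j] ** 2 for x, _ in points) for j in range(d)]
    cf1x = [sum(x[j] for x, _ in points) for j in range(d)]
    cf2t = sum(t * t for _, t in points)
    cf1t = sum(t for _, t in points)
    return (cf2x, cf1x, cf2t, cf1t, len(points))


def subtract(a, b):
    """Component-wise difference of two cluster feature vectors."""
    return ([p - q for p, q in zip(a[0], b[0])],
            [p - q for p, q in zip(a[1], b[1])],
            a[2] - b[2], a[3] - b[3], a[4] - b[4])


# C2 is the older part of C1; the difference describes only the newer points.
c1 = [((1, 2), 1), ((3, 4), 2), ((5, 6), 3)]
c2 = c1[:1]
recent = subtract(cft(c1), cft(c2))
```

Applied to snapshots, the same subtraction on a micro-cluster present in both S(tc) and S(tc-h) leaves the statistics of only the points that arrived within the horizon h.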

- A small portion of the training stream is used for horizon fitting; this portion is the horizon fitting stream segment
- kfit: the number of points in the horizon fitting stream segment; the value can be as small as 1% of the data

- The remaining portion of the training stream is used for the creation and maintenance of the class-specific micro-clusters

- The horizon is chosen by its classification accuracy
- The process is executed periodically to adapt to changes in the stream
- kfit should be small enough so that the points in it reflect the immediate locality of tc
- Qfit: the points that arrived in a pre-specified number of time units
- They are a part of the training stream
- Their class labels are known a priori

- Nearest neighbor procedure for each point X ∈ Qfit
- Find the closest micro-cluster in N(tc, h) to X
- Compare the class label of that micro-cluster with the true label of X

- The accuracy of every time horizon tracked by the geometric time frame is determined
- The p time horizons which provide the greatest dynamic classification accuracy are selected and denoted
by $H = \{h_1, \ldots, h_p\}$
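The horizon-fitting step can be sketched as scoring each candidate horizon on Qfit and keeping the top p. This is a sketch under assumptions: each horizon's micro-clusters are represented only by (centroid, class label) pairs, Euclidean distance is used, and the names `nearest_label`, `accuracy` and `best_horizons` are hypothetical.

```python
import math


def nearest_label(clusters, x):
    """Class label of the micro-cluster whose centroid is closest to x."""
    _, label = min(clusters, key=lambda c: math.dist(x, c[0]))
    return label


def accuracy(clusters, qfit):
    """Fraction of labeled Qfit points whose nearest cluster has the true label."""
    hits = sum(1 for x, true_label in qfit if nearest_label(clusters, x) == true_label)
    return hits / len(qfit)


def best_horizons(clusters_by_horizon, qfit, p):
    """Return the p horizons with the highest accuracy on Qfit."""
    ranked = sorted(clusters_by_horizon,
                    key=lambda h: accuracy(clusters_by_horizon[h], qfit),
                    reverse=True)
    return ranked[:p]


# Toy data: horizon -> list of (centroid, class label) for N(tc, h).
clusters_by_horizon = {
    1: [((0, 0), "a"), ((10, 10), "b")],
    2: [((0, 0), "b"), ((10, 10), "a")],
    4: [((0, 0), "a"), ((10, 10), "a")],
}
qfit = [((1, 1), "a"), ((9, 9), "b")]
selected = best_horizons(clusters_by_horizon, qfit, p=2)
```

In this toy data horizon 1 classifies both Qfit points correctly, horizon 4 one of them, and horizon 2 none, so the two selected horizons are 1 and 4.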

- At first sight: the smallest horizon seems best
- Stable stream: a large horizon works better

- Classification of the test stream is a separate process which is executed continuously throughout the algorithm
- For each test instance Xt, the nearest neighbor classification process is applied using each horizon hi ∈ H
- This results in the determination of p class labels
- The majority class among these p class labels is reported as the relevant class
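Classifying a test point then amounts to one nearest-neighbor lookup per selected horizon followed by a vote. The sketch below makes the same assumptions as before: micro-clusters are (centroid, class label) pairs, distance is Euclidean, and `classify` is a hypothetical name. Ties go to the label seen first, which is a choice made here rather than something stated in the slides.

```python
import math
from collections import Counter


def nearest_label(clusters, x):
    """Class label of the micro-cluster whose centroid is closest to x."""
    _, label = min(clusters, key=lambda c: math.dist(x, c[0]))
    return label


def classify(clusters_by_horizon, horizons, x):
    """Majority vote over the labels predicted under each selected horizon."""
    votes = [nearest_label(clusters_by_horizon[h], x) for h in horizons]
    return Counter(votes).most_common(1)[0][0]


# Toy data: horizon -> list of (centroid, class label) for N(tc, h).
clusters_by_horizon = {
    1: [((0, 0), "a"), ((10, 10), "b")],
    2: [((0, 0), "b"), ((10, 10), "a")],
    4: [((0, 0), "a"), ((10, 10), "a")],
}
label = classify(clusters_by_horizon, horizons=[1, 2, 4], x=(1, 1))
```

Voting across several horizons lets the classifier tolerate a rapidly evolving stream in which different horizons disagree on the label.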

- Test environment: Pentium III, 512 MB memory, Windows XP
- Both real and synthetic data sets
- Advantages
- Much higher classification accuracy
- Good scalability in terms of dimensionality and the number of class labels
- Stable processing rate
- Space-efficient