- 95 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about ' On Demand Classification of Data Streams' - kai-rosario

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### On Demand Classification of Data Streams

Charu C. Aggarwal

Jiawei Han

Philip S. Yu

Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD\'04), Seattle, WA, Aug. 2004

Speaker: Pei-Min Chou

Date:2005/04/01

Outline

- Introduction
- Supervised Micro-cluster
- Snapshot
- Maintenance Supervised Micro-cluster
- Training Data Stream
- Classification on Demand
- Empirical Results

Introduction

- Advances in data storage often grow without limit referred to as data streams
- one-pass mining model

does not recognize the changes and it is too expensive to keep track of the entire history

- static classification model likely to drop when there is a sudden burst
- Our model

simultaneous training and testing streams used for dynamic classification of data sets

Supervised Micro-cluster : Modify Micro-cluster

- Only from training data and each with same class
- Data streams
- Multi-dimensional points with time stamps T1, … Tk ….
- Each point contains d dimensions, i.e.,
- A micro-cluster for n points is defined as a (2*d + 4) tuple:

- - the sum of the squares of the data values
- - the sum of the data values
- - the sum of the squares of the time stamps
- - the sum of the time stamps
- the number of data points
- -variable corresponding to class id corresponds to
- the class label of that micro-cluster

Snapshot

- not too expensive to keep track history
- storing the behavior of the micro-clusters at different moments in time
- if (t mod 2i) = 0 but (t mod 2i+1)!= 0
- reaches max capacity, the oldest snapshot in this frame is removed
- geometric time frame
- vary from 0 to a value no larger than log2(T),

T is the maximum length of the stream

- maximum number

=(max capacity)*log2(T)

Maintenance Supervised Micro-clusters

- Nearest neighbor and k-means algorithms
- The initial micro-clusters is offline process

offline ---answers various user queries based on the stored summary statistics

- When a new data point Xik arrives, it is either added to a micro-cluster, or a new micro-cluster is created

Classification on Demand

- Construct
- Find the correct time-horizon
- The value of kfit
- Large or small horizon be chosen
- Test

Find the correct time-horizon

- Macro-clusters are created over a user-specified time horizon h
- LetS(tc): the snapshot of micro-clusters at time tc

S(tc-h): the snapshot of micro-clusters at time tc-h

- The new set of micro-clusters N(tc-h) are created by subtractingS(tc-h) from S(tc)
- Subtractive property
- Let C1 and C2 be two sets of points such that

Then

Training Data Stream

- A small portion of the stream is used for the process of horizon fitting stream segment
- kfit:the number of points in the data used and the value small as 1% of the data
- remaining portion of the training stream is used for the creation and maintenance of the class-specific micro-clusters

The value of kfit

- Horizon determined classification accuracy
- Process executed periodically for changes
- kfit should be small enough so that the points in it reflect the immediate locality of tc
- Qfit :pre-specified number of time units
- a part of the training stream
- the class labels are known a-priori
- Nearest neighbor procedure (XεQfit)
- Find the closest micro-cluster in N(tc,h) to X
- compare the class label and true label

Large or small horizon be chosen

- The accuracy of all the time horizons which are tracked by the geometric time frame are determined
- The p time horizons which provide the greatest dynamic classification accuracy

by

- First sight ---smallest
- Stable stream ---large

Test

- test stream is a separate process which is executed continuously throughout the algorithm
- Insert Xt , nearest neighbor classication process is applied using each (Xt belong H)
- results in the determination class lable
- these p class labels reported as the relevant class

Empirical Results

- Pentium III,512MB,WinXP
- Both real and synthetic
- Advantage
- much higher classification accuracy
- Good scalability in terms of dimensionality and the number of class labels
- stable processing rate
- Space-efficient

Download Presentation

Connecting to Server..