on demand classification of data streams
Download
Skip this Video
Download Presentation
On Demand Classification of Data Streams

Loading in 2 Seconds...

play fullscreen
1 / 16

On Demand Classification of Data Streams - PowerPoint PPT Presentation


  • 95 Views
  • Uploaded on

On Demand Classification of Data Streams. Charu C. Aggarwal Jiawei Han Philip S. Yu. Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD\'04), Seattle, WA, Aug. 2004. Speaker: Pei-Min Chou Date:2005/04/01. Outline. Introduction Supervised Micro-cluster Snapshot

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' On Demand Classification of Data Streams' - kai-rosario


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
on demand classification of data streams

On Demand Classification of Data Streams

Charu C. Aggarwal

Jiawei Han

Philip S. Yu

Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD\'04), Seattle, WA, Aug. 2004

Speaker: Pei-Min Chou

Date:2005/04/01

outline
Outline
  • Introduction
  • Supervised Micro-cluster
  • Snapshot
  • Maintenance Supervised Micro-cluster
  • Training Data Stream
  • Classification on Demand
  • Empirical Results
introduction
Introduction
  • Advances in data storage often grow without limit referred to as data streams
  • one-pass mining model

does not recognize the changes and it is too expensive to keep track of the entire history

  • static classification model likely to drop when there is a sudden burst
  • Our model

simultaneous training and testing streams used for dynamic classification of data sets

supervised micro cluster modify micro cluster
Supervised Micro-cluster : Modify Micro-cluster
  • Only from training data and each with same class
  • Data streams
    • Multi-dimensional points with time stamps T1, … Tk ….
    • Each point contains d dimensions, i.e.,
  • A micro-cluster for n points is defined as a (2*d + 4) tuple:
  • - the sum of the squares of the data values
  • - the sum of the data values
  • - the sum of the squares of the time stamps
  • - the sum of the time stamps
  • the number of data points
  • -variable corresponding to class id corresponds to
  • the class label of that micro-cluster
snapshot
Snapshot
  • not too expensive to keep track history
  • storing the behavior of the micro-clusters at different moments in time
  • if (t mod 2i) = 0 but (t mod 2i+1)!= 0
  • reaches max capacity, the oldest snapshot in this frame is removed
  • geometric time frame
    • vary from 0 to a value no larger than log2(T),

T is the maximum length of the stream

    • maximum number

=(max capacity)*log2(T)

maintenance supervised micro clusters
Maintenance Supervised Micro-clusters
  • Nearest neighbor and k-means algorithms
  • The initial micro-clusters is offline process

offline ---answers various user queries based on the stored summary statistics

  • When a new data point Xik arrives, it is either added to a micro-cluster, or a new micro-cluster is created
classification on demand
Classification on Demand
  • Construct
    • Find the correct time-horizon
    • The value of kfit
    • Large or small horizon be chosen
  • Test
find the correct time horizon
Find the correct time-horizon
  • Macro-clusters are created over a user-specified time horizon h
  • LetS(tc): the snapshot of micro-clusters at time tc

S(tc-h): the snapshot of micro-clusters at time tc-h

  • The new set of micro-clusters N(tc-h) are created by subtractingS(tc-h) from S(tc)
  • Subtractive property
    • Let C1 and C2 be two sets of points such that

Then

training data stream
Training Data Stream
  • A small portion of the stream is used for the process of horizon fitting stream segment
    • kfit:the number of points in the data used and the value small as 1% of the data
  • remaining portion of the training stream is used for the creation and maintenance of the class-specific micro-clusters
the value of k fit
The value of kfit
  • Horizon determined classification accuracy
  • Process executed periodically for changes
  • kfit should be small enough so that the points in it reflect the immediate locality of tc
  • Qfit :pre-specified number of time units
    • a part of the training stream
    • the class labels are known a-priori
  • Nearest neighbor procedure (XεQfit)
    • Find the closest micro-cluster in N(tc,h) to X
    • compare the class label and true label
large or small horizon be chosen
Large or small horizon be chosen
  • The accuracy of all the time horizons which are tracked by the geometric time frame are determined
  • The p time horizons which provide the greatest dynamic classification accuracy

by

  • First sight ---smallest
  • Stable stream ---large
slide12
Test
  • test stream is a separate process which is executed continuously throughout the algorithm
  • Insert Xt , nearest neighbor classication process is applied using each (Xt belong H)
    • results in the determination class lable
    • these p class labels reported as the relevant class
empirical results
Empirical Results
  • Pentium III,512MB,WinXP
  • Both real and synthetic
  • Advantage
    • much higher classification accuracy
    • Good scalability in terms of dimensionality and the number of class labels
    • stable processing rate
    • Space-efficient
ad