A Framework for Clustering Evolving Data Streams

Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu

Presented by: Di Yang and Charudatta Wad

Outline
  • Background of Clustering
  • Motivation for Clustering over Streaming Data.
  • Overall Solution
  • Micro Clusters
  • Pyramid Time Frame
  • Macro Cluster
  • Cluster Maintenance
Background of Clustering
  • Definition of Clustering
    • For a given set of data points, partitioning them into one or more groups of similar objects.
    • “Similarity” is often defined with the use of some distance measure.
  • Difference between “group by” queries and clustering.
Background of Clustering
  • Some of the most popular clustering algorithms:
    • K-Means, BIRCH, CURE, density-based clustering.
  • Clustering has many applications in databases, information visualization, and data mining.
  • What are outliers?
Motivation
  • Challenge in Streaming Environment:
    • Clustering is an expensive process.
    • Resource constraints.
    • Infinite streams.
  • Does simply extending one-pass algorithms for static databases to stream processing suffice?
Motivation
  • Requirements of clustering for stream processing:
    • Statistical summary information storage.
    • Efficient update process.
    • Ability to cluster for a specific time horizon.
Overall Solution of the Paper
  • Divide the clustering process into two phases:
    • Online component: periodically stores detailed summary statistics.
    • Offline component: uses only the summary statistics to do the clustering.

Micro-Clusters
  • What is a Micro-Cluster

A Micro-Cluster is a set of individual data points that are close to each other and will be treated as a single unit in further offline Macro-clustering.

[Figure: view of micro-clusters and view of macro-clusters]

Micro-Clusters
  • What to Store in a Micro-Cluster

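Micro-cluster = cluster feature vector (CF2x, CF1x, CF2t, CF1t, n): the per-dimension sum of squares and sum of the data values, the sum of squares and sum of the time stamps, and the number of points.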

Key idea: Additivity Property
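The additivity property can be illustrated with a minimal sketch. It assumes the feature vector (CF2x, CF1x, CF2t, CF1t, n) from the CluStream paper; the class name `MicroCluster` and its methods are hypothetical names chosen for this example, not part of the presentation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MicroCluster:
    """Hypothetical container for one cluster feature vector."""
    cf2x: List[float]  # per-dimension sum of squared data values
    cf1x: List[float]  # per-dimension sum of data values
    cf2t: float        # sum of squared time stamps
    cf1t: float        # sum of time stamps
    n: int             # number of points summarized

    @classmethod
    def from_point(cls, x, t):
        # A single point is a micro-cluster with n = 1.
        return cls([v * v for v in x], list(x), t * t, t, 1)

    def __add__(self, other):
        # Additivity: the summary of a union of two disjoint point sets
        # is the component-wise sum of the two summaries.
        return MicroCluster(
            [a + b for a, b in zip(self.cf2x, other.cf2x)],
            [a + b for a, b in zip(self.cf1x, other.cf1x)],
            self.cf2t + other.cf2t,
            self.cf1t + other.cf1t,
            self.n + other.n,
        )

    def centroid(self):
        return [s / self.n for s in self.cf1x]


a = MicroCluster.from_point([1.0, 2.0], t=1.0)
b = MicroCluster.from_point([3.0, 4.0], t=2.0)
merged = a + b
print(merged.n, merged.centroid())  # 2 [2.0, 3.0]
```

Because only sums are stored, absorbing a point or merging two micro-clusters costs time proportional to the number of dimensions, independent of how many points have been summarized.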

Pyramidal Time Frame
  • The micro-clusters are stored as snapshots taken at particular moments in time.
  • The snapshots follow a pyramidal pattern.
  • When should we take a snapshot?
Pyramidal Time Frame
  • Snapshots are classified into orders ranging from 0 to log_α(T), where snapshots of order i are taken at intervals of α^i. For example, with T = 55 and α = 2, we have order 0 with interval 2^0 = 1, order 1 with interval 2^1 = 2, order 2 with interval 2^2 = 4, order 3 with interval 2^3 = 8, order 4 with interval 2^4 = 16, and order 5 with interval 2^5 = 32.
  • For a data stream, the maximum number of snapshots maintained at T time units since the beginning of the stream mining process is

(α + 1) · log_α(T), with α + 1 snapshots kept for each order.
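The retention rule can be sketched in a few lines. The sketch assumes that an order-i snapshot is taken whenever the clock is divisible by α^i and that only the last α + 1 snapshots of each order are kept; the function name `pyramidal_snapshots` is hypothetical.

```python
def pyramidal_snapshots(T, alpha):
    """Return {order: retained snapshot times} at clock time T.

    Assumes alpha >= 2 and T >= 1. Order i snapshots fall on multiples
    of alpha**i, and only the most recent alpha + 1 per order are kept.
    """
    if alpha < 2 or T < 1:
        raise ValueError("this sketch requires alpha >= 2 and T >= 1")
    orders = {}
    i = 0
    while alpha ** i <= T:
        step = alpha ** i
        last = (T // step) * step  # most recent multiple of step
        times = [last - k * step for k in range(alpha + 1)]
        orders[i] = sorted(t for t in times if t > 0)
        i += 1
    return orders


snaps = pyramidal_snapshots(55, 2)
for order, times in snaps.items():
    print(order, times)
# 0 [53, 54, 55]
# 1 [50, 52, 54]
# 2 [44, 48, 52]
# 3 [32, 40, 48]
# 4 [16, 32, 48]
# 5 [32]
```

In this example 16 snapshot slots are in use, and several of them refer to the same clock time (for instance 48 and 32), so the number of distinct snapshots on disk is smaller still.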

Why Pyramidal Pattern?
  • For any user-specified time window of h, at least one stored snapshot can be found within 2h units of the current time.

Please note: the answers are only approximate!
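To make the 2h guarantee concrete, the following sketch looks up the most recent stored snapshot at or before T - h. The list `stored` is an illustrative assumption: it holds the distinct times retained at T = 55 with α = 2 when each order keeps its last α + 1 snapshots. The helper name `snapshot_for_horizon` is hypothetical.

```python
from bisect import bisect_right


def snapshot_for_horizon(stored_times, current_time, h):
    """Return the latest stored snapshot time <= current_time - h, or None.

    stored_times must be sorted in ascending order.
    """
    target = current_time - h
    idx = bisect_right(stored_times, target)
    return stored_times[idx - 1] if idx > 0 else None


# Illustrative assumption: distinct snapshot times kept at T = 55, alpha = 2.
stored = [16, 32, 40, 44, 48, 50, 52, 53, 54, 55]
T = 55

for h in (1, 4, 10, 24):
    t_s = snapshot_for_horizon(stored, T, h)
    print(h, t_s, T - t_s)
# 1 54 1
# 4 50 5
# 10 44 11
# 24 16 39
```

In each case the snapshot found lies between h and 2h time units in the past, which is why the horizon actually analyzed is only an approximation of the one requested. For horizons longer than 39 in this example, no stored snapshot precedes T - h and the helper returns None.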

Micro Cluster Creation
  • It is assumed that the algorithm maintains a total of q micro-clusters at any moment.
  • The initial q micro-clusters are created by an offline process (k-means) at the very beginning of the data stream computation.
Online Micro Cluster Maintenance
  • How to deal with a newly arriving point?
    • Absorb it into one of the existing clusters, or
    • Create a new cluster of its own.
  • How to deal with the old clusters to make room for a new one?
    • Delete one (based on its relevance stamp), or
    • Merge two (the closest pair).
    • A merged cluster carries the IDs of all its components.

Macro-Cluster Creation
  • Based on the additivity property of the cluster feature vector.
Macro-Cluster Creation

The current time is T and the window size is h, meaning the user wants to find the clusters formed in (T-h, T).

Approach:

  • 1st step: Find the snapshot for T and get the micro-cluster set S(T).
  • 2nd step: Find the snapshot for T-h and get the micro-cluster set S(T-h).
  • 3rd step: Use S(T)-S(T-h).

Specifically, suppose we have a merged cluster with ID list (C1, C2, C3) in S(T) and a cluster with ID C1 in S(T-h). Then we use CFT(C1,C2,C3)-CFT(C1)=CFT(C2,C3), because C1 was formed before T-h and thus should not contribute to the micro-clusters formed in (T-h, T).
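The subtraction S(T)-S(T-h) can be sketched as follows. The sketch assumes each micro-cluster is a tuple of an ID set, a point count, a linear sum, and a sum of squares, and that an older cluster is subtracted from a current one when its ID set is contained in the current ID list. The function name `subtract_horizon` and the numbers are illustrative only.

```python
def subtract_horizon(current, past):
    """Subtract the snapshot at T-h from the snapshot at T.

    current, past: lists of (ids, n, ls, ss), where ids is a frozenset of
    cluster IDs, n a point count, and ls / ss per-dimension sums and
    sums of squares. Returns the clusters restricted to (T-h, T).
    """
    result = []
    for ids, n, ls, ss in current:
        remaining = set(ids)
        ls, ss = list(ls), list(ss)
        for old_ids, old_n, old_ls, old_ss in past:
            if old_ids <= ids:  # the old cluster is a component of this one
                n -= old_n
                ls = [a - b for a, b in zip(ls, old_ls)]
                ss = [a - b for a, b in zip(ss, old_ss)]
                remaining -= old_ids
        if n > 0:  # drop clusters that received no points in the window
            result.append((frozenset(remaining), n, ls, ss))
    return result


s_T = [(frozenset({"C1", "C2", "C3"}), 10, [20.0], [50.0])]
s_T_minus_h = [(frozenset({"C1"}), 4, [6.0], [10.0])]
window = subtract_horizon(s_T, s_T_minus_h)
print(window)  # one cluster: IDs {C2, C3}, n = 6, ls = [14.0], ss = [40.0]
```

The ID lists are what make the subtraction possible after merges: without them there would be no way to tell which older summary belongs inside which current one.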

Example

  • Time T: C_ID: [C1, C2, C3]
  • Time T-h: C_ID: [C1]
  • Result: C_ID: [C2, C3]

Macro-Cluster Creation
  • Run K-means on Micro-Clusters
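A minimal sketch of this step treats each micro-cluster centroid as a pseudo-point weighted by its point count. Seeding with the k heaviest micro-clusters is an assumption made here for reproducibility; the function name `weighted_kmeans` and the sample data are illustrative.

```python
def weighted_kmeans(points, weights, k, iters=20):
    """Cluster weighted pseudo-points; return (centers, assignment)."""
    # Assumption: deterministic seeding with the k heaviest pseudo-points.
    order = sorted(range(len(points)), key=lambda i: -weights[i])
    centers = [list(points[i]) for i in order[:k]]
    dim = len(points[0])
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: weighted mean of the members of each cluster.
        for c in range(k):
            members = [i for i in range(len(points)) if assign[i] == c]
            total = sum(weights[i] for i in members)
            if total > 0:
                centers[c] = [
                    sum(weights[i] * points[i][d] for i in members) / total
                    for d in range(dim)
                ]
    return centers, assign


# Micro-cluster centroids and their point counts (illustrative values).
centroids = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]]
counts = [5, 1, 3, 3]
centers, assign = weighted_kmeans(centroids, counts, k=2)
print(assign)  # [0, 0, 1, 1]
```

Weighting by point count keeps a heavily populated micro-cluster from being treated the same as one that summarizes a single point.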
How do you feel about this paper?
  • My feeling:
    • Quite fuzzy results: approximation is everywhere.
    • Nothing new: micro-clusters, k-means, cluster feature vectors, and the pyramidal time frame are all old ideas.

Counter Example

  • Time T: C_ID: [C1, C2, C3]
  • Time T-h: C_ID: [C2]
  • Result: C_ID: [C1, C3]

Advertisement
  • Di and Charu’s project deals with:
    • Deterministic clusters
    • Clusters with arbitrary shapes
    • Real expirations
    • Disk version
    • Outlier detection for free