A framework for clustering evolving data streams
1 / 21

A Framework for Clustering Evolving Data Streams - PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

A Framework for Clustering Evolving Data Streams. Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad. Outline. Background of Clustering Motivation for Clustering over Streaming Data. Overall Solution Micro Clusters Pyramid Time Frame

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Download Presentation

A Framework for Clustering Evolving Data Streams

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

A Framework for Clustering Evolving Data Streams

Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu

Presented by: Di Yang

Charudatta Wad


  • Background of Clustering

  • Motivation for Clustering over Streaming Data.

  • Overall Solution

  • Micro Clusters

  • Pyramid Time Frame

  • Macro Cluster

  • Cluster Maintenance

Background of Clustering

  • Definition of Clustering

    • For a given set of data points, partitioning them into one or more groups of similar objects.

    • “Similarity” is often defined with the use of some distance measure.

  • Difference between “group by” queries and clustering.

Background of Clustering

  • Some of the most popular clustering algorithms:

    • K- Means, BIRCH, CURE, Density Based Clustering.

  • Clustering has many applications in data bases, information visualization, data mining.

  • What are Oultiers?


  • Challenge in Streaming Environment:

    • Clustering is an expensive process.

    • Resource constraints.

    • Infinite streams.

  • Can simply extending one pass algorithms for static databases to stream processing suffice?


  • Requirements of clustering for stream processing:

    • Statistical summary information storage.

    • Efficient update process.

    • Ability to cluster for a specific time horizon,

Overall Solution of the Paper

  • Divide the clustering process to two phases

    Online Component:

    periodically stores detailed summary statistics

    Offline Component

    uses only the summary statistics to do clustering


  • What is a Micro-Cluster

    A Micro-Cluster is a set of individual data points that are close to each other and will be treated as a single unit in further offline Macro-clustering.

View of Micro-Cluster

View of Macro-Cluster


  • What to Store in a Micro-Cluster


Key idea: Additivity Property

Pyramidal Time Frame

  • The micro-clusters are stored at snapshots.

  • The snapshots follow a pyramidal pattern


  • When should we make the snapshot?

Pyramidal Time Frame

  • Snapshots are classified into different orders which can vary from 1 to log α(T). For example, T is 55, α=2, then we have orders 0 with interval 2^0=1, order 1 with interval 2^1=2, order 2 with interval 2^2=4, order 3 with interval 2^3=8, order 4 with interval 2^4=16, order 5 with interval 2^5=32.

  • For a data stream the maximum number of snap- shots maintained at T time units since the beginning of the stream mining process is

    (α + 1) log α(T). (α + 1 for each order)

Why Pyramidal Pattern?

  • For any user-specified time window of h, at least one stored snapshot can be found within 2 h units of the current time.

Please Note: Only Approximate Answers!!!

Micro Cluster Creation

  • It is assumed that a total of q micro-clusters are maintained at any moment by the algorithm.

  • This is done using an offline process (k-means) at the very beginning of the data stream computation process.

Online Micro Cluster Maintenance

  • How to deal with a new coming point?

  • Join one of the old cluster

  • Create a new cluster by its own

  • How to deal with the old clusters

  • Delete them(based on relevance stamp)

  • Merge them (merge the closest two)

A merged cluster will have all the IDs its components have

Macro-Cluster Creation

  • Based on the Additivity Property of cluster feature vector

Macro-Cluster Creation

Current Time T, the window size is h. That means the user want to find the clusters formed in (T-h, T).


  • 1st step: Find the snapshot for T, get the micro-cluster set S(T).

  • 2nd step: Find the snapshot for T-h, get the micro-cluster set S(T-h).

  • Use S(T)-S(T-h)

    Specifically, we have a merged cluster with Id list (C1, C2, C3) in S(T)

    and a cluster with Id C1 in S(T-h). Then the we use

    CFT(C1,C2,C3)-CFT(C1)=CFT(C2,C3), because C1 are formed before

    T-h, thus should not contribute to the micro-cluster formed in (T-h,T)


C_ID: [C1, C2, C3]

C_ID: [C1]

C_ID: [C2, C3]

Time: T-h

Result: T-h

Time: T

Macro-Cluster Creation

  • Run K-means on Micro-Clusters

How do you feel about this paper?

  • My feeling:

    Quite Fuzzy Results:

    Approximation is every where.

    Nothing New:

    Micro-Clusters, K-means, Cluster Feature Vectors, Pyramidal Time Frame are all old stuffs.

Counter Example

C_ID: [C1, C2, C3]

C_ID: [C2]

C_ID: [C1, C3]


Time: T

Time: T-h


  • Di and Charu’s project deals with:

  • Deterministic Clusters

  • Clusters with Arbitrary Shapes

  • RealExpirations

  • Disk Version

  • Outlier Detection by Free

  • Login