A Framework for Clustering Evolving Data Streams

A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad

Outline • Background of Clustering • Motivation for Clustering over Streaming Data. • Overall Solution • Micro Clusters • Pyramid Time Frame • Macro Cluster • Cluster Maintenance

Background of Clustering • Definition of Clustering • For a given set of data points, partitioning them into one or more groups of similar objects. • “Similarity” is often defined with the use of some distance measure. • Difference between “group by” queries and clustering.

Background of Clustering • Some of the most popular clustering algorithms: • K- Means, BIRCH, CURE, Density Based Clustering. • Clustering has many applications in data bases, information visualization, data mining. • What are Oultiers?

Motivation • Challenge in Streaming Environment: • Clustering is an expensive process. • Resource constraints. • Infinite streams. • Can simply extending one pass algorithms for static databases to stream processing suffice?

Motivation • Requirements of clustering for stream processing: • Statistical summary information storage. • Efficient update process. • Ability to cluster for a specific time horizon,

Overall Solution of the Paper • Divide the clustering process to two phases Online Component: periodically stores detailed summary statistics Offline Component uses only the summary statistics to do clustering

Micro-Clusters • What is a Micro-Cluster A Micro-Cluster is a set of individual data points that are close to each other and will be treated as a single unit in further offline Macro-clustering. View of Micro-Cluster View of Macro-Cluster

Micro-Clusters • What to Store in a Micro-Cluster = Key idea: Additivity Property

Pyramidal Time Frame • The micro-clusters are stored at snapshots. • The snapshots follow a pyramidal pattern … … Snapshot • When should we make the snapshot?

Pyramidal Time Frame • Snapshots are classified into different orders which can vary from 1 to log α(T). For example, T is 55, α=2, then we have orders 0 with interval 2^0=1, order 1 with interval 2^1=2, order 2 with interval 2^2=4, order 3 with interval 2^3=8, order 4 with interval 2^4=16, order 5 with interval 2^5=32. • For a data stream the maximum number of snapshots maintained at T time units since the beginning of the stream mining process is (α + 1) log α(T). (α + 1 for each order)

Why Pyramidal Pattern? • For any user-specified time window of h, at least one stored snapshot can be found within 2 h units of the current time. Please Note: Only Approximate Answers!!!

Micro Cluster Creation • It is assumed that a total of q micro-clusters are maintained at any moment by the algorithm. • This is done using an offline process (k-means) at the very beginning of the data stream computation process.

Online Micro Cluster Maintenance • How to deal with a new coming point? • Join one of the old cluster • Create a new cluster by its own • How to deal with the old clusters • Delete them(based on relevance stamp) • Merge them (merge the closest two) A merged cluster will have all the IDs its components have

Macro-Cluster Creation • Based on the Additivity Property of cluster feature vector

Macro-Cluster Creation Current Time T, the window size is h. That means the user want to find the clusters formed in (T-h, T). Approach: • 1st step: Find the snapshot for T, get the micro-cluster set S(T). • 2nd step: Find the snapshot for T-h, get the micro-cluster set S(T-h). • Use S(T)-S(T-h) Specifically, we have a merged cluster with Id list (C1, C2, C3) in S(T) and a cluster with Id C1 in S(T-h). Then the we use CFT(C1,C2,C3)-CFT(C1)=CFT(C2,C3), because C1 are formed before T-h, thus should not contribute to the micro-cluster formed in (T-h,T)

Example C_ID: [C1, C2, C3] C_ID: [C1] C_ID: [C2, C3] Time: T-h Result: T-h Time: T

Macro-Cluster Creation • Run K-means on Micro-Clusters

How do you feel about this paper? • My feeling: Quite Fuzzy Results: Approximation is every where. Nothing New: Micro-Clusters, K-means, Cluster Feature Vectors, Pyramidal Time Frame are all old stuffs.

Counter Example C_ID: [C1, C2, C3] C_ID: [C2] C_ID: [C1, C3] Result Time: T Time: T-h

Advertisement • Di and Charu’s project deals with: • Deterministic Clusters • Clusters with Arbitrary Shapes • RealExpirations • Disk Version • Outlier Detection by Free

A Framework for Clustering Evolving Data Streams

A Framework for Clustering Evolving Data Streams

Presentation Transcript

Semantic Streams: a Framework for Composable Semantic Interpretation of Sensor Data

A Distributed Clustering Framework for MANETS

A Probabilistic Framework for Semi-Supervised Clustering

Clustering Data Streams

Clustering Data Streams

A Distributed Framework for Collaborative Annotation of Streams

A Scalable Framework for the Collaborative Annotation of Live Data Streams

Data Streams

Data Streams

HE-Tree: a framework for detecting changes in clustering structure for categorical data streams

Algorithms for Data Streams

SCALE: a scalable framework for efficiently clustering transactional data

Clustering over Multiple Evolving Streams by Events and Correlations

Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

A Secure Clustering Algorithm for Distributed Data Streams

A General Framework for Mining Massive Data Streams

Data Mining for Data Streams

A Framework for Clustering Evolving Data Streams