
A Framework for Clustering Evolving Data Streams

A Framework for Clustering Evolving Data Streams. Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB'03). Presenter: 吳建良.

Presentation Transcript


  1. A Framework for Clustering Evolving Data Streams • Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu • Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB'03) • Presenter: 吳建良

  2. Outline • Cluster analysis: A general overview • Developed methodology • Micro-cluster analysis and maintenance • Macro-cluster analysis • Evolution analysis • Empirical results

  3. Cluster analysis: A general overview • What is cluster analysis? Grouping a set of data objects into a set of clusters such that the intra-cluster similarity is high and the inter-cluster similarity is low • New requirements in stream clustering • Generate high-quality clusters in one scan • Cluster incrementally and efficiently without losing quality • Handle multi-dimensional data • Provide the flexibility to compute clusters over a user-defined time period

  4. Developed methodology: Outline • Methodology • Divide the clustering process into online and offline components • Online: periodically stores summary statistics about the stream data • Micro-clustering: better quality than k-means • Online processing and maintenance • Pyramidal time window: register dynamic changes • Offline: answers various user queries based on the stored summary statistics

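  5. Clustering Feature: CF = (N, LS, SS) • N: number of data points • LS: linear sum of the points, $LS = \sum_{i=1}^{N} X_i$ • SS: square sum of the points, $SS = \sum_{i=1}^{N} X_i^2$ • The clustering feature vector originated from BIRCH • Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))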
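The clustering feature on this slide can be checked with a few lines of Python. This is a minimal sketch, not code from the paper: the names `CF`, `cf_of`, and `cf_add` are hypothetical, and it only reproduces the slide's example and the additive property that later slides rely on.

```python
from dataclasses import dataclass


@dataclass
class CF:
    """Clustering feature (N, LS, SS); the class name is illustrative."""
    n: int       # number of points
    ls: tuple    # per-dimension linear sum
    ss: tuple    # per-dimension sum of squares


def cf_of(points):
    """Build the clustering feature of a non-empty list of points."""
    dims = range(len(points[0]))
    return CF(
        n=len(points),
        ls=tuple(sum(p[j] for p in points) for j in dims),
        ss=tuple(sum(p[j] ** 2 for p in points) for j in dims),
    )


def cf_add(a, b):
    """Additive property: the CF of a union of disjoint sets is the sum."""
    return CF(
        a.n + b.n,
        tuple(x + y for x, y in zip(a.ls, b.ls)),
        tuple(x + y for x, y in zip(a.ss, b.ss)),
    )


points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = cf_of(points)
print(cf)  # CF(n=5, ls=(16, 30), ss=(54, 190))
```

Because every component is a plain sum, two features can be combined without revisiting the raw points, which is what makes the summary usable in a one-scan setting.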

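  6. Micro-Clusters: Design Methodology • Data streams • Multi-dimensional points $X_1, \ldots, X_k, \ldots$ arrive with time stamps $T_1, \ldots, T_k, \ldots$ • Each point contains d dimensions, i.e., $X_i = (x_i^1, \ldots, x_i^d)$ • A micro-cluster for n points is defined as a (2*d + 3) tuple $(CF2^x, CF1^x, CF2^t, CF1^t, n)$ • $CF2^x$: the per-dimension sum of the squares of the data values (d entries) • $CF1^x$: the per-dimension sum of the data values (d entries) • $CF2^t$: the sum of the squares of the time stamps • $CF1^t$: the sum of the time stamps • n: the number of data points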

  7. Pyramidal Time Frame • Snapshots • The micro-clusters are also stored at particular moments in the stream, called snapshots • Snapshots are classified into frame numbers, which can vary from 0 to log(T), where T is the clock time elapsed since the beginning of the stream • The frame number of a particular class of snapshots defines the level of granularity in time at which the snapshots are maintained

  8. Maintain Snapshot Frame Table • Rules for inserting a snapshot taken at time t into the frame table • If $t \bmod \alpha^{i} = 0$ but $t \bmod \alpha^{i+1} \neq 0$, t is inserted into frame number i • Each slot has a max_capacity. If the slot has already reached its max_capacity, the oldest snapshot is removed and the new snapshot is inserted • Example: • α = 2 • max_capacity = 3
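The insertion rule can be illustrated with the slide's example parameters (α = 2, max_capacity = 3). The following is a rough sketch under stated assumptions: the function names are hypothetical, snapshots are represented only by their clock time, and clock times are assumed to start at 1.

```python
from collections import deque


def frame_number(t, alpha):
    """Return the i with t mod alpha**i == 0 and t mod alpha**(i+1) != 0."""
    if t < 1 or alpha < 2:
        raise ValueError("expects t >= 1 and alpha >= 2")
    i = 0
    while t % (alpha ** (i + 1)) == 0:
        i += 1
    return i


def build_frame_table(current_time, alpha=2, max_capacity=3):
    """Insert snapshots for clock times 1..current_time into the table.

    Each frame keeps at most max_capacity snapshot times; a deque with
    maxlen drops the oldest entry automatically when a new one arrives.
    """
    table = {}
    for t in range(1, current_time + 1):
        slot = table.setdefault(frame_number(t, alpha),
                                deque(maxlen=max_capacity))
        slot.append(t)
    return table


table = build_frame_table(16)
for frame in sorted(table, reverse=True):
    print(frame, list(table[frame]))
```

At clock time 16 this keeps the times 16, 8, (4, 12), (6, 10, 14), and (11, 13, 15) in frames 4 down to 0, so recent history is stored densely and older history sparsely.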

  9. Micro-cluster Maintenance • The micro-clustering stage is an online statistical data collection phase that does not depend on user input • Initial creation of q micro-clusters M1 … Mq • Use the k-means clustering algorithm • q is usually significantly larger than the number of natural clusters • q is determined by the amount of available memory • Each micro-cluster is assigned a unique id when it is created

  10. Incremental Update of Micro-clusters • When a new data point $X_{i_k}$ arrives, it is either added to an existing micro-cluster or a new micro-cluster is created for it • If $X_{i_k}$ falls within the maximum boundary of its closest micro-cluster Mp, $X_{i_k}$ is added to Mp • Maximum boundary: the RMS deviation of the data points in Mp from its centroid • RMS deviation: $\sqrt{\frac{1}{n}\sum_{j=1}^{n} \lVert X_j - \bar{X} \rVert^2}$, where $\bar{X}$ is the centroid of Mp and n is the number of points in Mp • Otherwise, a new micro-cluster is created for $X_{i_k}$
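The absorb-or-create decision can be sketched as follows. This is an illustration, not the authors' implementation: `MicroCluster` and `process_point` are hypothetical names, and `singleton_radius` is an assumed fallback boundary for micro-clusters that hold a single point, whose RMS deviation is zero.

```python
import math


class MicroCluster:
    """Additive summary of points and time stamps (illustrative names)."""

    def __init__(self, point, timestamp):
        self.n = 1
        self.ls = list(point)                # per-dimension sum of values
        self.ss = [x * x for x in point]     # per-dimension sum of squares
        self.lt = timestamp                  # sum of time stamps
        self.st = timestamp ** 2             # sum of squared time stamps

    def add(self, point, timestamp):
        self.n += 1
        for j, x in enumerate(point):
            self.ls[j] += x
            self.ss[j] += x * x
        self.lt += timestamp
        self.st += timestamp ** 2

    def centroid(self):
        return [s / self.n for s in self.ls]

    def rms_deviation(self):
        # Mean squared distance to the centroid, computed from the sums.
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))


def process_point(clusters, point, timestamp, singleton_radius=1.0):
    """Add the point to its closest micro-cluster or start a new one."""
    if clusters:
        closest = min(clusters,
                      key=lambda m: math.dist(m.centroid(), point))
        # Assumption: singletons use a fixed fallback radius.
        boundary = (closest.rms_deviation() if closest.n > 1
                    else singleton_radius)
        if math.dist(closest.centroid(), point) <= boundary:
            closest.add(point, timestamp)
            return closest
    created = MicroCluster(point, timestamp)
    clusters.append(created)
    return created


clusters = []
process_point(clusters, (0.0, 0.0), 1)
process_point(clusters, (1.0, 0.0), 2)    # within the fallback radius
process_point(clusters, (10.0, 10.0), 3)  # far away: new micro-cluster
print(len(clusters), clusters[0].n)
```

The sketch omits the step that keeps the number of micro-clusters fixed after a new one is created; it only shows that the boundary test needs nothing beyond the stored sums.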

  11. Incremental Update of Micro-clusters (Contd.) • Creating a new micro-cluster means the number of existing micro-clusters must be reduced by one: delete an old cluster or merge the two closest clusters? • A micro-cluster is deleted whenever the average time stamp of its last m points is less than a given threshold • Otherwise, the two closest micro-clusters are merged by adding their corresponding cluster feature vectors • An idlist holding the ids of the two original micro-clusters is created for the merged micro-cluster

  12. Macro-Cluster Creation • Macro-clusters are created over a user-specified time horizon h • Let S(tc) be the set of micro-clusters at time tc and S(tc-h) the set of micro-clusters at time tc-h • The new set of micro-clusters N(tc-h) is created by subtracting S(tc-h) from S(tc) • Subtractive property • Let C1 and C2 be two sets of points such that $C_1 \supseteq C_2$. Then $CF(C_1 - C_2) = CF(C_1) - CF(C_2)$
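The subtraction of two snapshots can be sketched in Python. The example below makes simplifying assumptions that are not in the slides: each snapshot is a dictionary from micro-cluster id to an `(n, ls, ss)` tuple, the function name `subtract_snapshots` is hypothetical, and merged micro-clusters with idlists are not handled.

```python
def subtract_snapshots(current, past):
    """Return the micro-clusters describing only the last h time units.

    current, past: dict mapping micro-cluster id -> (n, ls, ss).
    Ids present in both snapshots have the past statistics subtracted;
    ids that end up with no points are dropped.
    """
    result = {}
    for cid, (n, ls, ss) in current.items():
        if cid in past:
            past_n, past_ls, past_ss = past[cid]
            n = n - past_n
            ls = tuple(a - b for a, b in zip(ls, past_ls))
            ss = tuple(a - b for a, b in zip(ss, past_ss))
        if n > 0:
            result[cid] = (n, ls, ss)
    return result


# Micro-cluster 1 held (3,4) and (2,6) at time tc-h and gained
# (4,5), (4,7), (3,8) afterwards; micro-cluster 2 is entirely new.
s_past = {1: (2, (5, 10), (13, 52))}
s_now = {1: (5, (16, 30), (54, 190)), 2: (2, (1, 1), (1, 1))}
horizon = subtract_snapshots(s_now, s_past)
print(horizon)
```

The result for micro-cluster 1 is exactly the feature of the three points that arrived inside the horizon, which is the subtractive property at work.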

  13. Macro-Cluster Creation (Contd.) • Each micro-cluster in N(tc-h) is treated as a pseudo-point • Each pseudo-point has a weight proportional to the number of points inside it • A k-means clustering approach is applied to this set of pseudo-points in order to create higher-level macro-clusters
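A weighted k-means over pseudo-points might look like the sketch below. It is a simplification under stated assumptions: the function name `weighted_kmeans` is hypothetical, the initial centers are passed in explicitly rather than sampled, and a fixed number of iterations stands in for a convergence test.

```python
import math


def weighted_kmeans(points, weights, centers, iterations=20):
    """Cluster pseudo-points, counting each one `weight` times.

    points:  micro-cluster centroids (the pseudo-points)
    weights: number of stream points summarized by each pseudo-point
    centers: initial centers (an assumption of this sketch)
    """
    centers = [list(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Assign each pseudo-point to its closest center.
        assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the weighted mean of its members.
        for c in range(len(centers)):
            members = [i for i, a in enumerate(assignment) if a == c]
            total = sum(weights[i] for i in members)
            if total == 0:
                continue  # leave an empty cluster's center where it is
            centers[c] = [
                sum(weights[i] * points[i][j] for i in members) / total
                for j in range(len(centers[c]))
            ]
    return centers, assignment


pseudo_points = [(0, 0), (2, 0), (10, 0), (12, 0)]
weights = [1, 3, 2, 2]
centers, assignment = weighted_kmeans(pseudo_points, weights,
                                      centers=[(0, 0), (12, 0)])
print(centers, assignment)
```

The first center lands at 1.5 rather than 1.0 because the pseudo-point at (2, 0) summarizes three times as many stream points as the one at (0, 0).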

  14. Evolution Analysis of Micro-Clusters • In many cases, it is desirable to find how the micro-clusters have changed over time • Given a user-specified time horizon h and two clock times t1 and t2 (where t1 < t2) • Compare the data arriving in (t2–h, t2) with the data arriving in (t1–h, t1)

  15. Evolution Analysis of Micro-Clusters (Contd.) • The analysis answers the following questions • Are there new clusters in the data at time t2 which were not present at time t1? • Find micro-clusters in N(t2-h) which are not present in N(t1-h) • Have some of the original clusters been lost? • Find micro-clusters in N(t1-h) which are not present in N(t2-h) • Have some of the original clusters at time t1 shifted in position and nature?
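The three questions can be answered with set operations on micro-cluster ids. This is an illustrative sketch only: `compare_windows` and `centroid` are hypothetical names, each window is assumed to be a dictionary from id to `(n, ls)`, and centroid distance is used as one possible measure of how far a retained cluster has shifted.

```python
import math


def centroid(summary):
    """Centroid of a micro-cluster summary (n, ls)."""
    n, ls = summary
    return [s / n for s in ls]


def compare_windows(n_t1, n_t2):
    """Compare the micro-clusters of an earlier and a later window.

    Returns the ids that are new in the later window, the ids that were
    lost, and the centroid shift of each id present in both windows.
    """
    added = sorted(set(n_t2) - set(n_t1))
    removed = sorted(set(n_t1) - set(n_t2))
    shift = {
        cid: math.dist(centroid(n_t1[cid]), centroid(n_t2[cid]))
        for cid in set(n_t1) & set(n_t2)
    }
    return added, removed, shift


earlier = {1: (2, (2.0, 2.0)), 2: (4, (0.0, 4.0))}
later = {2: (2, (6.0, 10.0)), 3: (1, (5.0, 5.0))}
added, removed, shift = compare_windows(earlier, later)
print(added, removed, shift)
```

Here micro-cluster 3 is new, micro-cluster 1 has been lost, and micro-cluster 2 has moved from (0, 1) to (3, 5).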

  16. Empirical Results • Data sets • Real data sets: Network Intrusion and KDD Cup 98 (Charitable Donation) • Synthetic data sets: • Gaussian distribution • Base size: 100k ~ 1000k points • Number of clusters: 4 ~ 64 • Dimensionality: 10 ~ 100

  17. Cluster Quality (Network Intrusion) • [Figure panels: Horizon H=256, Stream_speed=200; Horizon H=1, Stream_speed=2000]

  18. Cluster Quality (Charitable Donation) • [Figure panels: Horizon H=16, Stream_speed=200; Horizon H=4, Stream_speed=2000]

  19. Scalability • [Figure: Stream_speed=2000]

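  20. Sum of Square Distance (SSQ) • Assume there are a total of N points $p_1, \ldots, p_N$ in the past horizon H at current time Tc • $SSQ = \sum_{i=1}^{N} d(p_i, c_i)^2$, where $c_i$ is the centroid of the macro-cluster closest to $p_i$ and d is the distance between them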
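The quality measure is a sum of squared distances from each point to the nearest macro-cluster centroid, which can be sketched in a few lines. The function name `ssq` is hypothetical and Euclidean distance is assumed.

```python
import math


def ssq(points, centroids):
    """Sum over all points of the squared distance to the closest centroid."""
    return sum(
        min(math.dist(p, c) for c in centroids) ** 2
        for p in points
    )


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]
print(ssq(points, centroids))  # 1.0
```

A lower value means the points in the horizon sit closer to their assigned macro-clusters.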

  21. K-means clustering algorithm (K=2) • Arbitrarily choose K points as the initial cluster centers • Assign each point to the closest center • Update the cluster means • Reassign the points and update the cluster means again • [Figure: scatter plots illustrating each step, both axes from 0 to 10]
