
Clustering Data Streams

Clustering Data Streams. Chun Wei Dept Computer & Information Technology Advisor: Dr. Sprague. Data Stream. Massive data sets accumulated at an astonishing rate. Examples: Tracking network data to study change in traffic patterns and possible intrusions


Presentation Transcript


  1. Clustering Data Streams • Chun Wei, Dept. of Computer & Information Technology • Advisor: Dr. Sprague

  2. Data Stream • Massive data sets accumulated at an astonishing rate. • Examples: • Tracking network data to study change in traffic patterns and possible intrusions • Tracking meteorological data, such as temperatures

  3. NASA MISR satellite • Collects several TB of satellite imagery data daily

  4. Challenges • Compactness of data representation • Fast, incremental processing of new data points (one-pass and linear access of data) • Clear and fast identification of changes in evolving clustering models

  5. Compactness • Utilize a data structure that summarizes a group of data points, minimizing the storage space • The space required does not grow appreciably with the number of points processed

  6. Incremental Processing of Data • When clustering new data points, the algorithm should not require comparison with all the points processed in the past • The data must be processed as they are produced. A linear scan is required; random access is prohibitively expensive.

  7. Identification of Changes The algorithm must be able to: • diagnose changes in evolving data streams • distinguish outliers from data points that represent a new cluster

  8. Current Algorithms • BIRCH • STREAM • CLU-STREAM • …

  9. BIRCH • Uses CF (clustering feature) vectors to store data • $CF = (N, \sum_i X_i^2, \sum_i X_i)$, where each $X_i$ is a vector • Stores the number of points ($N$), the linear sum ($\sum_i X_i$) and the square sum ($\sum_i X_i^2$) of all data points in a micro-cluster • Sufficient to calculate centroids, radii, diameters and distances
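
The CF summary on this slide can be illustrated with a minimal Python sketch. The class name `CF` and its field names are hypothetical, and the square sum is kept as a single scalar (the sum of squared norms), which is one common reading of the notation rather than something the slide spells out. The point is that the centroid and radius come from the three stored quantities alone, and that two summaries can be merged by adding their fields.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class CF:
    """Hypothetical clustering-feature summary: (N, linear sum, square sum)."""
    n: int            # number of points summarized
    ls: List[float]   # linear sum of the points, per dimension
    ss: float         # sum of squared norms of the points (assumed scalar)

    @classmethod
    def from_point(cls, x):
        return cls(1, list(x), sum(v * v for v in x))

    def add(self, x):
        """Absorb one point without storing it."""
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other):
        """CF vectors are additive: merging is a field-wise sum."""
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        """Root-mean-square distance of the points from the centroid."""
        c = self.centroid()
        variance = self.ss / self.n - sum(v * v for v in c)
        return math.sqrt(max(variance, 0.0))  # guard against rounding below 0
```

Memory use stays constant per micro-cluster regardless of how many points have been absorbed, which is the compactness property the earlier slides ask for.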

  10. B-Tree • [Figure: example B-tree showing a root node, index entries and leaf entries]

  11. Building of CF Tree • B-Tree, with a branching factor B, a threshold T and a maximum number L of entries in a leaf node • [Figure: a non-leaf node and leaf nodes, each holding CF entries]

  12. Adjusting CF Tree • Increase the threshold T so that each leaf entry can absorb more points. T can be set as a bound on the radius or the diameter. • Leaf entries with “far fewer” points are regarded as “outliers” and written back to disk.
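
A rough sketch of the threshold mechanism follows, under several simplifying assumptions: the tree is replaced by a flat list of leaf entries, T bounds the radius, and "far fewer" is modeled by a hypothetical `outlier_fraction` of the average entry size. All function names are illustrative and not taken from BIRCH itself.

```python
import math


def cf_of(point):
    """CF entry as a tuple (n, linear_sum, square_sum)."""
    return (1, list(point), sum(v * v for v in point))


def cf_merge(a, b):
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])


def cf_centroid(cf):
    n, ls, _ = cf
    return [v / n for v in ls]


def cf_radius(cf):
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((v / n) ** 2 for v in ls), 0.0))


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def insert(entries, cf, threshold):
    """Absorb cf into the nearest entry if the merged radius stays within
    the threshold; otherwise start a new entry."""
    if entries:
        c = cf_centroid(cf)
        i = min(range(len(entries)),
                key=lambda j: dist(cf_centroid(entries[j]), c))
        merged = cf_merge(entries[i], cf)
        if cf_radius(merged) <= threshold:
            entries[i] = merged
            return
    entries.append(cf)


def rebuild(entries, new_threshold, outlier_fraction=0.25):
    """Re-insert existing entries under a larger threshold. Entries much
    smaller than average are set aside as potential outliers (assumed rule)."""
    average = sum(e[0] for e in entries) / len(entries)
    kept, outliers = [], []
    for e in entries:
        if e[0] < outlier_fraction * average:
            outliers.append(e)
        else:
            insert(kept, e, new_threshold)
    return kept, outliers
```

Because entries are re-inserted as whole CF vectors, the rebuild never needs to revisit the raw points.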

  13. STREAM • Processes data streams in batches of points • Uses weighted centroids Ci to represent the ith batch of points • Recursively clusters the weighted centroids until k clusters remain
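
The batch-then-recluster idea can be sketched as below. This is not the STREAM algorithm as published: a plain weighted k-means with deterministic farthest-first initialization stands in for its clustering step, only one level of reclustering is shown, and the names `weighted_kmeans` and `stream_cluster` are hypothetical.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_kmeans(points, weights, k, iters=20):
    """Weighted k-means; returns (centers, total weight per center)."""
    k = min(k, len(points))
    centers = [list(points[0])]
    while len(centers) < k:  # farthest-first initialization
        far = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(list(far))
    mass = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * len(points[0]) for _ in range(k)]
        mass = [0.0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda i: dist(p, centers[i]))
            mass[j] += w
            for d, v in enumerate(p):
                sums[j][d] += w * v
        centers = [[s / m for s in sums[j]] if m > 0 else centers[j]
                   for j, m in enumerate(mass)]
    keep = [j for j in range(k) if mass[j] > 0]  # drop empty centers
    return [centers[j] for j in keep], [mass[j] for j in keep]


def stream_cluster(stream, k, batch_size):
    """Summarize each batch by k weighted centroids, then cluster the
    retained centroids down to k final centers."""
    centroids, weights, batch = [], [], []
    for x in stream:
        batch.append(x)
        if len(batch) == batch_size:
            c, m = weighted_kmeans(batch, [1.0] * len(batch), k)
            centroids += c
            weights += m
            batch = []  # raw points are discarded after summarization
    if batch:
        c, m = weighted_kmeans(batch, [1.0] * len(batch), k)
        centroids += c
        weights += m
    final, _ = weighted_kmeans(centroids, weights, k)
    return final
```

Only the weighted centroids survive each batch, so memory grows with the number of batches rather than the number of points; a full implementation would recluster again whenever the retained centroids themselves become too numerous.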

  14. Problems with BIRCH & STREAM • Old data points are treated as equally important as new data points • May not be able to detect new trends in an evolving data stream

  15. CLU-STREAM • Also uses CF vectors to store data summaries • Uses time stamps to record the elapsed time from the beginning of the stream • Takes snapshots at different time stamps, favoring the most recent data (snapshot: the micro-clusters stored at a particular moment in the stream)

  16. CLU-STREAM (continued) • A snapshot contains q micro-clusters; q depends on the memory available • A new data point is assigned to one of the micro-clusters in the previous snapshot if it falls within the maximum boundary of that micro-cluster.

  17. CLU-STREAM (continued) • If a new data point fails to fit into any current cluster, a new cluster is created, and either an existing cluster is deleted or two clusters are merged. • A cluster is removed if the average time stamp of the last m data points it absorbed is the least recent.
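
The maintenance rule on the last two slides might look roughly like the sketch below. It makes several assumptions that the slides do not state: the maximum boundary is taken as a hypothetical `boundary_factor` times the cluster radius with a floor of `min_boundary` for single-point clusters, the last m time stamps are stored explicitly rather than approximated, and only the deletion branch is shown (merging two clusters is omitted).

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class MicroCluster:
    """CF-style summary plus the time stamps of recently absorbed points."""

    def __init__(self, point, t):
        self.n = 1
        self.ls = list(point)
        self.ss = sum(v * v for v in point)
        self.recent = [t]

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(v * v for v in c), 0.0))

    def absorb(self, point, t, m):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, point)]
        self.ss += sum(v * v for v in point)
        self.recent = (self.recent + [t])[-m:]  # keep only the last m stamps

    def avg_recent_time(self):
        return sum(self.recent) / len(self.recent)


def process(clusters, point, t, q, m=3, boundary_factor=2.0, min_boundary=1.0):
    """Assign point to the nearest micro-cluster if it lies within that
    cluster's boundary; otherwise create a new cluster, first evicting the
    cluster with the least recent average time stamp when q is reached."""
    if clusters:
        near = min(clusters, key=lambda c: dist(c.centroid(), point))
        boundary = max(boundary_factor * near.radius(), min_boundary)
        if dist(near.centroid(), point) <= boundary:
            near.absorb(point, t, m)
            return
    if len(clusters) >= q:
        stale = min(clusters, key=lambda c: c.avg_recent_time())
        clusters.remove(stale)
    clusters.append(MicroCluster(point, t))
```

Evicting by recency is what lets the summary follow the stream as it evolves, in contrast to the equal weighting of old and new points noted for BIRCH and STREAM.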

  18. Detect New Trends • Comparing clustering results from snapshot to snapshot reveals trends in the evolving data stream.

  19. References • Aggarwal, C. C., Han, J., Wang, J. & Yu, P. S. (2003). A Framework for Clustering Evolving Data Streams. In Proc. of the 29th VLDB Conference. • Barbara, D. (2003). Requirements for Clustering Data Streams. SIGKDD Explorations, 3 (2), 23-27. • Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. (1998). Clustering for Mining in Large Spatial Databases. Special Issue on Data Mining, KI-Journal, ScienTec Publishing, No. 1. • Guha, S., Meyerson, A., Mishra, N., Motwani, R. & O'Callaghan, L. (2003). Clustering Data Streams: Theory and Practice. IEEE Transactions on Knowledge and Data Engineering, 15 (3), 515-528. • Zhang, T., Ramakrishnan, R. & Livny, M. (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proc. of ACM SIGMOD International Conference on Management of Data.
