
BIRCH


Presentation Transcript


  1. BIRCH: An Efficient Data Clustering Method for Very Large Databases (SIGMOD '96)

  2. Introduction • Balanced Iterative Reducing and Clustering using Hierarchies • Designed for multi-dimensional datasets • Minimized I/O cost (linear: 1 or 2 scans of the data) • Full utilization of available memory • Hierarchies → used as an indexing method

  3. Terminology • Properties of a cluster • Given N d-dimensional data points • Centroid • Radius • Diameter
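The slide lists these properties without their formulas; for reference, the standard definitions used in the paper, for N d-dimensional points in a cluster, are:

```latex
% Given N d-dimensional data points \vec{X}_1, \dots, \vec{X}_N in a cluster:
\vec{X}_0 = \frac{\sum_{i=1}^{N}\vec{X}_i}{N}
\quad\text{(centroid)}

R = \left(\frac{\sum_{i=1}^{N}\bigl(\vec{X}_i-\vec{X}_0\bigr)^2}{N}\right)^{1/2}
\quad\text{(radius: average distance from the points to the centroid)}

D = \left(\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}\bigl(\vec{X}_i-\vec{X}_j\bigr)^2}{N(N-1)}\right)^{1/2}
\quad\text{(diameter: average pairwise distance within the cluster)}
```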

  4. Terminology • Distance between 2 clusters • D0 – Euclidean distance between centroids • D1 – Manhattan distance between centroids • D2 – average inter-cluster distance • D3 – average intra-cluster distance • D4 – variance increase distance
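Written out, for two clusters with N1 points {X_i, i = 1..N1} and N2 points {X_j, j = N1+1..N1+N2} and centroids X01 and X02, these five metrics are (as defined in the paper; D4 is stated here without a square root, following the usual formulation):

```latex
D_0 = \left(\bigl(\vec{X}_{01}-\vec{X}_{02}\bigr)^2\right)^{1/2}

D_1 = \sum_{k=1}^{d}\bigl|\vec{X}_{01}^{(k)}-\vec{X}_{02}^{(k)}\bigr|

D_2 = \left(\frac{\sum_{i=1}^{N_1}\sum_{j=N_1+1}^{N_1+N_2}\bigl(\vec{X}_i-\vec{X}_j\bigr)^2}{N_1 N_2}\right)^{1/2}

D_3 = \left(\frac{\sum_{i=1}^{N_1+N_2}\sum_{j=1}^{N_1+N_2}\bigl(\vec{X}_i-\vec{X}_j\bigr)^2}{(N_1+N_2)(N_1+N_2-1)}\right)^{1/2}

D_4 = \sum_{k=1}^{N_1+N_2}\bigl(\vec{X}_k-\vec{X}_0\bigr)^2
    - \sum_{i=1}^{N_1}\bigl(\vec{X}_i-\vec{X}_{01}\bigr)^2
    - \sum_{j=N_1+1}^{N_1+N_2}\bigl(\vec{X}_j-\vec{X}_{02}\bigr)^2
% where \vec{X}_0 is the centroid of the merged cluster
```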

  5. Clustering Feature • To calculate the centroid, radius, diameter, and D0–D4, not all points are needed • Only 3 values, the Clustering Feature (CF), are stored to represent a cluster • N – number of points in the cluster • LS – linear sum of the points in the cluster • SS – square sum of the points in the cluster • CFs are additive: the CF of a merged cluster is the sum of the CFs
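To illustrate why these three values suffice, here is a minimal Python sketch (the class name and API are ours, not the paper's): the centroid is LS/N, the squared radius is SS/N − ‖LS/N‖², and absorbing a point or merging two clusters reduces to adding the triples.

```python
import numpy as np

class CF:
    """Clustering Feature: the (N, LS, SS) summary of a cluster."""

    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n, self.ls, self.ss = 1, p.copy(), float(p @ p)

    def add(self, other):
        """CF additivity: merging two clusters just adds their CFs."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2  (average squared distance to the centroid)
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - c @ c, 0.0))

# CF of the single point (3, 4): N=1, LS=(3,4), SS=3^2+4^2=25
cf = CF((3, 4))
cf.add(CF((1, 2)))           # absorb another point; still only 3 values stored
print(cf.n, cf.ls, cf.ss)    # 2 [4. 6.] 30.0
```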

  6. CF Tree • Similar to a B+-tree or R-tree • Parameters • B – branching factor • L – leaf capacity • T – threshold • Leaf node – contains at most L CF entries, each of which must satisfy D < T (or R < T) • Non-leaf node – contains at most B CF entries summarizing its children • Each node must fit into one memory page
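A minimal sketch of the node layout these parameters imply (names and values are illustrative; in the paper, B and L are derived from the page size P so that each node fits in one page):

```python
from dataclasses import dataclass, field

B = 4      # branching factor: max CF entries in a non-leaf node (assumed value)
L = 4      # max CF entries in a leaf node (assumed value)
T = 0.5    # absorption threshold: each leaf CF must keep D < T (or R < T)

@dataclass
class LeafNode:
    entries: list = field(default_factory=list)   # up to L CF entries, each with D < T
    next: "LeafNode | None" = None                # leaves are chained, as in a B+-tree

@dataclass
class NonLeafNode:
    entries: list = field(default_factory=list)   # up to B (CF, child) pairs; each CF
                                                  # is the sum of the child's CF entries
```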

  7. BIRCH • Phase 1: Scan the dataset once, build a CF tree in memory • Phase 2: (Optional) Condense the CF tree into a smaller CF tree • Phase 3: Global Clustering • Phase 4: (Optional) Cluster Refining (requires a scan of the dataset)

  8. BIRCH [Figure: overview diagram of the four phases]

  9. Building CF Tree (Phase 1) • CF of a single data point (3,4) is (1, (3,4), 25) • Inserting a point into the tree • Find the path down (based on D0–D4 between the CF entries of the children in each non-leaf node) • Modify the leaf • Find the closest leaf-node entry (based on D0–D4 of the CF entries in the leaf) • Check whether it can "absorb" the new data point (threshold still satisfied after merging) • Update the CF entries on the path to the leaf • Splitting – if the leaf node is full, split it into two leaf nodes and add one more entry to the parent
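A sketch of the leaf-level part of this insertion step, using (N, LS, SS) tuples; the function name, the choice of D0, the R < T test, and the seed-based split are illustrative simplifications of the paper's procedure:

```python
import numpy as np

T, L = 0.5, 4            # illustrative threshold and leaf capacity

def centroid(cf):        # a CF entry is a tuple (N, LS, SS)
    n, ls, ss = cf
    return ls / n

def radius(cf):
    n, ls, ss = cf
    c = ls / n
    return np.sqrt(max(ss / n - c @ c, 0.0))   # R^2 = SS/N - ||LS/N||^2

def d0(a, b):            # D0: Euclidean distance between centroids
    return np.linalg.norm(centroid(a) - centroid(b))

def insert_into_leaf(entries, point):
    """Leaf-level step of phase 1: absorb the point into the closest CF
    entry if the threshold still holds, else start a new entry; split the
    leaf if it overflows. Returns one or two entry lists (two == split)."""
    p = np.asarray(point, float)
    p_cf = (1, p, float(p @ p))
    if entries:
        i = min(range(len(entries)), key=lambda k: d0(entries[k], p_cf))
        n, ls, ss = entries[i]
        merged = (n + 1, ls + p, ss + float(p @ p))
        if radius(merged) < T:                # the closest entry absorbs the point
            entries[i] = merged
            return [entries]
    entries.append(p_cf)                       # otherwise start a new CF entry
    if len(entries) <= L:
        return [entries]
    # overflow: take the farthest pair of entries as seeds and redistribute
    # the rest; the parent then gets one more entry for the new leaf
    i, j = max(((a, b) for a in range(len(entries))
                for b in range(a + 1, len(entries))),
               key=lambda ab: d0(entries[ab[0]], entries[ab[1]]))
    left, right = [entries[i]], [entries[j]]
    for k, e in enumerate(entries):
        if k not in (i, j):
            (left if d0(e, entries[i]) <= d0(e, entries[j]) else right).append(e)
    return [left, right]
```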

  10. Building CF Tree (Phase 1) [Figure: each non-leaf node entry stores the sum of the CF(N, LS, SS) entries of its children; each leaf node stores CF(N, LS, SS) entries subject to the condition D < T or R < T]

  11. Condensing CF Tree (Phase 2) • Choose a larger T (threshold) • Consider the CF entries in the leaf nodes • Reinsert the CF entries into the new tree • If an entry's new "path" comes before its original "path", move it to the new path • If the new "path" is the same as the original "path", leave it unchanged
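A simplified sketch of the rebuild idea (it re-inserts the old leaf entries into a flat list, ignoring the path-reuse rule above, which is what lets the paper build the new tree within the old tree's memory):

```python
import numpy as np

def rebuild(leaf_entries, new_T):
    """Phase-2 sketch: re-insert the old leaf CF entries with a larger
    threshold new_T, so that nearby CFs merge and the tree shrinks.
    leaf_entries: list of (N, LS, SS) tuples from the old tree's leaves."""
    new_leaves = []
    for n, ls, ss in leaf_entries:
        for i, other in enumerate(new_leaves):
            m = (other[0] + n, other[1] + ls, other[2] + ss)
            c = m[1] / m[0]
            if np.sqrt(max(m[2] / m[0] - c @ c, 0.0)) < new_T:  # R < new_T
                new_leaves[i] = m         # the existing entry absorbs this CF
                break
        else:
            new_leaves.append((n, ls, ss))  # no entry can absorb it: keep it
    return new_leaves
```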

  12. Global Clustering (Phase 3) • Consider the CF entries in the leaf nodes only • Use the centroid as the representative of each sub-cluster • Apply a traditional clustering algorithm (e.g. agglomerative hierarchical clustering (complete link == D2) or K-means or CL…) • Cluster the CFs instead of the raw data points
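For instance, a K-means pass over the leaf CF centroids might look like the following sketch (illustrative, not the paper's code); each centroid carries weight N so that large sub-clusters pull the cluster centers proportionally:

```python
import numpy as np

def global_kmeans(cfs, k, iters=20, seed=0):
    """Phase-3 sketch: run K-means on the leaf CF centroids, weighting
    each centroid by its point count N, instead of on raw data points."""
    rng = np.random.default_rng(seed)
    pts = np.array([ls / n for n, ls, ss in cfs])   # CF centroids
    w = np.array([n for n, ls, ss in cfs], float)   # weights = N
    centers = pts[rng.choice(len(pts), k, replace=False)]
    for _ in range(iters):
        # assign each CF centroid to its nearest cluster center
        labels = np.argmin(((pts[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):                          # weighted mean update
            mask = labels == j
            if mask.any():
                centers[j] = np.average(pts[mask], axis=0, weights=w[mask])
    return centers, labels
```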

  13. Cluster Refining (Phase 4) • Requires one more scan of the dataset • Use the cluster centroids found in phase 3 as seeds • Redistribute data points to their closest seeds to form the final clusters • Allows removal of outliers • Yields accurate cluster-membership information
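A minimal sketch of this redistribution pass (function and parameter names are ours; outlier removal shown as a simple distance cutoff):

```python
import numpy as np

def refine(points, seeds, outlier_cutoff=None):
    """Phase-4 sketch: one extra scan of the raw dataset -- reassign every
    point to its closest phase-3 seed, yielding exact cluster membership."""
    pts = np.asarray(points, float)
    sds = np.asarray(seeds, float)
    d2 = ((pts[:, None, :] - sds[None, :, :]) ** 2).sum(-1)  # squared distances
    labels = d2.argmin(axis=1)
    if outlier_cutoff is not None:                 # optional outlier removal:
        labels[d2.min(axis=1) > outlier_cutoff**2] = -1  # far from every seed
    return labels
```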

  14. Performance

  15. Visualization

  16. Conclusion • A clustering algorithm that takes I/O cost and memory limitations into consideration • Utilizes local information (each clustering decision is made without scanning all data points) • Not every data point is equally important for the clustering purpose
