BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a method for clustering very large databases efficiently. It minimizes I/O cost and makes full use of available memory by maintaining a height-balanced index structure, the CF tree. The algorithm proceeds in phases: building the CF tree, optionally condensing it, global clustering, and optional refinement.
BIRCH: An Efficient Data Clustering Method for Very Large Databases (SIGMOD '96)
Introduction • Balanced Iterative Reducing and Clustering using Hierarchies • Designed for large multi-dimensional datasets • Minimizes I/O cost (linear: 1 or 2 scans of the data) • Makes full use of available memory • Uses a hierarchical indexing structure (the CF tree)
Terminology • Properties of a cluster • Given N d-dimensional data points X1, …, XN: • Centroid X0 • Radius R (average distance from the points to the centroid) • Diameter D (average pairwise distance within the cluster)
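For reference, these quantities are defined in the BIRCH paper as:

```latex
\vec{X_0} = \frac{\sum_{i=1}^{N} \vec{X_i}}{N}
\qquad
R = \left( \frac{\sum_{i=1}^{N} \left( \vec{X_i} - \vec{X_0} \right)^2}{N} \right)^{1/2}
\qquad
D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} \left( \vec{X_i} - \vec{X_j} \right)^2}{N(N-1)} \right)^{1/2}
```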
Terminology • Distance between 2 clusters • D0 – Euclidean distance between centroids • D1 – Manhattan distance between centroids • D2 – average inter-cluster distance • D3 – average intra-cluster distance • D4 – variance increase distance
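Restated from the paper for reference: for clusters {Xi} (size N1, centroid X0_1) and {Yj} (size N2, centroid X0_2), with {Zk} their union (size N1+N2, centroid Z0), these distances are:

```latex
D_0 = \left\| \vec{X0_1} - \vec{X0_2} \right\|
\qquad
D_1 = \sum_{k=1}^{d} \left| X0_1^{(k)} - X0_2^{(k)} \right|
\qquad
D_2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \left( \vec{X_i} - \vec{Y_j} \right)^2}{N_1 N_2} \right)^{1/2}
```

```latex
D_3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} \left( \vec{Z_i} - \vec{Z_j} \right)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}
\qquad
D_4 = \sum_{k=1}^{N_1+N_2} \left( \vec{Z_k} - \vec{Z_0} \right)^2
    - \sum_{i=1}^{N_1} \left( \vec{X_i} - \vec{X0_1} \right)^2
    - \sum_{j=1}^{N_2} \left( \vec{Y_j} - \vec{X0_2} \right)^2
```

That is, D3 is the diameter of the merged cluster and D4 is the increase in variance caused by merging.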
Clustering Feature • To calculate the centroid, radius, diameter, and D0–D4, not all points are needed • 3 values are stored to represent a cluster (the CF) • N – number of points in the cluster • LS – linear sum of the points • SS – square sum of the points • CFs are additive: merging two clusters gives CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2) (a minimal sketch follows)
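A minimal Python sketch of the CF idea (not the paper's implementation): the triple is enough to merge clusters and recover centroid and radius without touching the raw points.

```python
import numpy as np

class CF:
    """Clustering Feature: (N, LS, SS) summarizing a set of d-dim points."""

    def __init__(self, n, ls, ss):
        self.n = n    # number of points
        self.ls = ls  # linear sum: a d-dim vector
        self.ss = ss  # square sum: a scalar, sum of squared norms

    @classmethod
    def from_point(cls, x):
        x = np.asarray(x, dtype=float)
        return cls(1, x.copy(), float(x @ x))

    def __add__(self, other):
        # CF additivity: merging two clusters just adds the summaries
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2 (average squared distance to the centroid)
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - float(c @ c), 0.0))

# Example from the slides: the CF of the single point (3, 4) is (1, (3, 4), 25)
cf = CF.from_point([3, 4])
print(cf.n, cf.ls, cf.ss)   # 1 [3. 4.] 25.0
```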
CF Tree • Similar to a B+-tree or R-tree • Parameters • B – branching factor (max entries in a non-leaf node) • L – max CF entries in a leaf node • T – threshold • Leaf node – contains at most L CF entries, each of which must satisfy D < T (or R < T) • Non-leaf node – contains at most B CF entries, one per child • Each node must fit in one page
BIRCH • Phase 1: Scan the dataset once and build a CF tree in memory • Phase 2: (Optional) Condense the CF tree into a smaller CF tree • Phase 3: Global clustering • Phase 4: (Optional) Cluster refining (requires one more scan of the dataset) (see the usage sketch below)
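As a quick way to try the algorithm end to end, scikit-learn ships a Birch implementation; a minimal sketch, with illustrative parameter values (threshold plays the role of T, branching_factor the role of B):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Illustrative data; threshold and branching_factor values are arbitrary
X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

birch = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = birch.fit_predict(X)   # builds the CF tree, then clusters globally
print(np.bincount(labels))      # cluster sizes
```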
Building CF Tree (Phase 1) • The CF of a data point (3, 4) is (1, (3, 4), 25) • Inserting a point into the tree: • Find the path – at each non-leaf node, descend to the closest child (based on D0–D4 between the point's CF and the children's CFs) • Modify the leaf – find the closest leaf entry (based on D0–D4 of the CFs in the leaf node) and check whether it can "absorb" the new data point without violating the threshold • Modify the path to the leaf – update the CFs of all nodes on the path • Splitting – if the leaf node is full, split it into two leaf nodes and add one more entry in the parent (a minimal insertion sketch follows the node diagram below)
Building CF Tree (Phase 1) • Diagram: each entry of a non-leaf node is the sum of the CF(N, LS, SS) triples of all its children; a leaf node holds CF(N, LS, SS) entries, each maintained under the condition D < T or R < T
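A simplified sketch of the leaf-level absorb-or-new-entry logic from Phase 1; the real CF tree adds non-leaf routing, node splits, and page-size limits, and all names here are illustrative:

```python
import numpy as np

def centroid(cf):
    n, ls, ss = cf
    return ls / n

def radius_if_absorbed(cf, x):
    # Radius of the cluster after tentatively adding point x
    n, ls, ss = cf
    n2, ls2, ss2 = n + 1, ls + x, ss + float(x @ x)
    c = ls2 / n2
    return np.sqrt(max(ss2 / n2 - float(c @ c), 0.0))

def insert(leaf_entries, x, T, L=8):
    """Insert point x into leaf_entries, a list of (N, LS, SS) tuples."""
    x = np.asarray(x, dtype=float)
    if leaf_entries:
        # Closest entry by D0 (Euclidean distance between centroids)
        i = min(range(len(leaf_entries)),
                key=lambda k: np.linalg.norm(centroid(leaf_entries[k]) - x))
        if radius_if_absorbed(leaf_entries[i], x) < T:
            n, ls, ss = leaf_entries[i]
            leaf_entries[i] = (n + 1, ls + x, ss + float(x @ x))  # absorb
            return
    if len(leaf_entries) >= L:
        raise RuntimeError("leaf full: a real CF tree would split the node")
    leaf_entries.append((1, x.copy(), float(x @ x)))  # new CF entry

entries = []
for p in [(3, 4), (3.1, 4.1), (10, 10)]:
    insert(entries, p, T=0.5)
print(len(entries), "leaf entries")  # nearby points absorbed, far point new
```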
Condensing CF Tree (Phase 2) • Choose a larger T (threshold) • Consider the entries in the leaf nodes • Reinsert the CF entries into the new, smaller tree • If an entry's new "path" comes "before" its original "path", move it to the new "path" • If the new "path" is the same as the original "path", leave it unchanged
Global Clustering (Phase 3) • Consider CF entries in the leaf nodes only • Use the centroid as the representative of each sub-cluster • Apply a traditional clustering algorithm (e.g. agglomerative hierarchical clustering (complete link == D2), K-means, CL…) • Cluster the CFs instead of the raw data points (a sketch follows)
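A minimal sketch of this idea, assuming Phase 1 produced a list of leaf CF entries (the values below are made up): it clusters the CF centroids, weighting each by its point count N. K-means stands in here for whichever global algorithm is chosen.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical Phase-1 output: leaf CF entries as (N, LS, SS) tuples
leaf_cfs = [(120, np.array([360.0, 480.0]), 3030.0),
            (80,  np.array([800.0, 790.0]), 15900.0),
            (50,  np.array([510.0, 40.0]),  5250.0)]

centroids = np.array([ls / n for n, ls, _ in leaf_cfs])
weights   = np.array([n for n, _, _ in leaf_cfs], dtype=float)

# Cluster the CF centroids, each weighted by its point count N
km = KMeans(n_clusters=2, n_init=10, random_state=0)
cf_labels = km.fit_predict(centroids, sample_weight=weights)
print(cf_labels)   # one label per leaf CF entry, not per data point
```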
Cluster Refining (Phase 4) • Requires scanning the dataset one more time • Use the clusters found in Phase 3 as seeds • Redistribute data points to their closest seeds and form new clusters • Allows removal of outliers • Allows acquisition of membership information (which points belong to which cluster) (a redistribution sketch follows)
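A sketch of the redistribution step, assuming the Phase-3 cluster centroids are available as seeds (the values below are illustrative):

```python
import numpy as np

# Hypothetical Phase-3 seeds (cluster centroids) and raw data points
seeds = np.array([[3.0, 4.0], [10.0, 10.0]])
X = np.array([[3.1, 4.1], [2.9, 3.8], [9.5, 10.2], [10.4, 9.9]])

# Distance from every point to every seed, then assign the closest seed
dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
labels = dists.argmin(axis=1)     # membership information
print(labels)                     # -> [0 0 1 1]

# Optional outlier removal: drop points far from every seed
mask = dists.min(axis=1) < 2.0    # threshold is illustrative
X_clean, labels_clean = X[mask], labels[mask]
```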
Conclusion • A clustering algorithm that takes I/O cost and memory limitations into account • Exploits local information (each clustering decision is made without scanning all data points) • Not every data point is equally important for clustering purposes