
BIRCH: An Efficient Data Clustering Method for Very Large Databases



  1. BIRCH: An Efficient Data Clustering Method for Very Large Databases Tian Zhang, Raghu Ramakrishnan, Miron Livny

  2. Outline of the Paper • Background • Clustering Feature and CF Tree • The BIRCH Clustering Algorithm • Performance Studies

  3. Background • Question: how do we cluster very large datasets? • Constraint: limited memory • Goal: minimize the I/O cost • Answer: BIRCH

  4. Background (single cluster) • Given N d-dimensional data points {Xi}, i = 1, ..., N • Centroid: X0 = (Σi Xi) / N • Radius: R = (Σi ||Xi − X0||² / N)^(1/2), the average distance from member points to the centroid • Diameter: D = (Σi Σj ||Xi − Xj||² / (N(N − 1)))^(1/2), the average pairwise distance within the cluster
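
As a concrete reference, here is a small NumPy sketch of the three single-cluster measures (the function names are illustrative, not from the paper):

```python
import numpy as np

def centroid(X):
    # X: an (N, d) array of data points; the centroid is the coordinate-wise mean
    return X.mean(axis=0)

def radius(X):
    # R: root-mean-square distance from the member points to the centroid
    c = centroid(X)
    return np.sqrt(((X - c) ** 2).sum(axis=1).mean())

def diameter(X):
    # D: average pairwise distance between points in the cluster
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # all squared pairwise distances
    return np.sqrt(sq.sum() / (n * (n - 1)))
```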

  5. Background (two clusters) • Given the centroids X0 and Y0: • Centroid Euclidean distance: D0 = ||X0 − Y0|| • Centroid Manhattan distance: D1 = Σk |X0(k) − Y0(k)|, summed over the d dimensions

  6. Background (two clusters) • For clusters {Xi}, i = 1..N1 and {Yj}, j = 1..N2: • Average inter-cluster distance: D2 = (Σi Σj ||Xi − Yj||² / (N1 N2))^(1/2) • Average intra-cluster distance: D3 = the diameter of the merged cluster, i.e. the average pairwise distance over all N1 + N2 points
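
All four distances can be computed directly from the raw points; a NumPy sketch with names matching D0–D3 above:

```python
import numpy as np

def d0(x0, y0):
    # Centroid Euclidean distance
    return np.linalg.norm(x0 - y0)

def d1(x0, y0):
    # Centroid Manhattan distance
    return np.abs(x0 - y0).sum()

def d2(X, Y):
    # Average inter-cluster distance between clusters X (N1, d) and Y (N2, d)
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(sq.mean())

def d3(X, Y):
    # Average intra-cluster distance: the diameter of the merged cluster
    Z = np.vstack([X, Y])
    n = len(Z)
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(sq.sum() / (n * (n - 1)))
```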

  7. Clustering Feature • CF = (N, LS, SS) • N = |C|, the number of data points • LS = Σi Xi, the linear sum of the N data points • SS = Σi Xi², the square sum of the N data points
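
In code, a CF is just a triple. A minimal sketch, folding the square sum into the single scalar Σ||Xi||² (one common representation; implementations can also keep it per dimension):

```python
import numpy as np

def make_cf(X):
    # Build the CF of a point set X, given as an (N, d) array
    N = len(X)
    LS = X.sum(axis=0)          # linear sum: a d-dimensional vector
    SS = float((X ** 2).sum())  # square sum, folded into one scalar here
    return N, LS, SS
```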

  8. CF Additivity Theorem • Assume CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CFs of two disjoint clusters; then the CF of their union is CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2) • The information stored in CFs is sufficient to compute: • Centroids • Measures of cluster compactness (radius, diameter) • Distance measures between clusters
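
Both the theorem and the derived measures are easy to check in code. A sketch building on make_cf above; the closed forms follow from expanding the squared sums:

```python
import numpy as np

def merge_cf(cf1, cf2):
    # Additivity: the CF of the union of two disjoint clusters is the
    # entrywise sum of their CFs
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

def cf_centroid(cf):
    n, ls, _ = cf
    return ls / n

def cf_radius(cf):
    # R^2 = SS/N - ||LS/N||^2, from expanding sum_i ||Xi - X0||^2 / N
    n, ls, ss = cf
    return np.sqrt(max(ss / n - (ls / n) @ (ls / n), 0.0))

def cf_diameter(cf):
    # D^2 = (2*N*SS - 2*||LS||^2) / (N*(N-1)), from expanding the double sum;
    # zero for a single point
    n, ls, ss = cf
    if n < 2:
        return 0.0
    return np.sqrt(max((2 * n * ss - 2 * (ls @ ls)) / (n * (n - 1)), 0.0))
```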

  9. CF-Tree • A height-balanced tree with two parameters: • Branching factor • B: an internal node contains at most B entries [CFi, childi] • L: a leaf node contains at most L entries [CFi] • Threshold T • The diameter of each entry in a leaf node is at most T • Leaf nodes are connected via prev and next pointers
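
A rough structural sketch of the two node types (Python classes stand in for the fixed-size disk pages the paper uses; names are illustrative):

```python
class LeafNode:
    def __init__(self):
        self.entries = []    # at most L CF entries, each with diameter <= T
        self.prev = None     # leaves are chained into a doubly linked list,
        self.next = None     # which makes scanning all leaf entries cheap

class InternalNode:
    def __init__(self):
        self.entries = []    # at most B pairs (CF_i, child_i); each CF_i
                             # summarizes all points in the subtree below child_i
```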

  10. CF tree example

  11. BIRCH Algorithm Overview

  12. CF tree construction • Transform each point p into a CF vector CFp = (1, p, ||p||²) • Set T (the threshold value, on diameter or radius) • B (branching factor) and L (max number of entries in a leaf node) are determined by the page size P, since each node must fit in one page
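
The per-point transformation is one line; a sketch consistent with the CF helpers above:

```python
import numpy as np

def point_to_cf(p):
    # A single point is the trivial cluster: N = 1, LS = p, SS = ||p||^2
    p = np.asarray(p, dtype=float)
    return 1, p, float(p @ p)
```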

  13. Phase 1 • Start a CF tree t1 with an initial threshold T • Continue scanning the data and inserting points into t1 • If memory runs out before the scan finishes: • increase T • rebuild a condensed CF tree t2 with the new T from the leaf entries of t1: if a leaf entry is a potential outlier, write it to disk; otherwise reinsert it • set t1 ← t2 and resume the scan • If disk space runs out, re-absorb the potential outliers into t1 • When the scan of the data finishes, re-absorb the potential outliers into t1
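
To make the rebuild loop concrete, here is a deliberately simplified, self-contained sketch: the CF tree is replaced by a flat list of leaf entries, and outlier spill-to-disk and the paper's threshold heuristic are omitted (it reuses the CF helpers sketched earlier; the doubling schedule for T is my assumption, not the paper's):

```python
import numpy as np

def insert_entry(leaves, cf, T):
    # Insert a CF into the closest existing entry (by centroid distance)
    # if the merged diameter still satisfies T; otherwise open a new entry.
    if leaves:
        c = cf_centroid(cf)
        i = min(range(len(leaves)),
                key=lambda k: np.linalg.norm(cf_centroid(leaves[k]) - c))
        merged = merge_cf(leaves[i], cf)
        if cf_diameter(merged) <= T:
            leaves[i] = merged
            return
    leaves.append(cf)

def phase1(points, max_leaves, T=0.0):
    # Flat stand-in for the CF tree: a list of leaf entries. When the
    # memory budget (max_leaves) is exceeded, grow T and rebuild by
    # re-inserting the existing entries under the new threshold.
    leaves = []
    for p in points:
        insert_entry(leaves, point_to_cf(p), T)
        while len(leaves) > max_leaves:
            T = max(2 * T, 1e-3)   # crude threshold schedule (assumption);
                                   # the paper uses a data-driven heuristic
            old, leaves = leaves, []
            for cf in old:
                insert_entry(leaves, cf, T)
    return leaves, T
```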

  14. Insertion Algorithm • Identify the appropriate leaf: descend from the root, recursively choosing the closest child according to the chosen distance metric • Modify the leaf: find the closest leaf entry, say Li • Li can absorb 'Ent' if the threshold T is still satisfied after merging • Otherwise, add a new entry for 'Ent' to the leaf • If the leaf is full, split the leaf node • Modify the path to the leaf: update the CF entries on the path • If the parent has space for the new entry, add it • Otherwise split the parent, and so on up to the root
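
For the split step, a sketch of the farthest-pair seeding the paper describes (reusing cf_centroid from above; assumes the leaf holds at least two entries):

```python
import numpy as np

def split_leaf(entries):
    # Split a full leaf: take the farthest pair of entries as seeds and
    # send every other entry to the seed with the closer centroid.
    cents = [cf_centroid(e) for e in entries]
    _, i, j = max((np.linalg.norm(cents[a] - cents[b]), a, b)
                  for a in range(len(entries))
                  for b in range(a + 1, len(entries)))
    left, right = [entries[i]], [entries[j]]
    for k, e in enumerate(entries):
        if k in (i, j):
            continue
        if np.linalg.norm(cents[k] - cents[i]) <= np.linalg.norm(cents[k] - cents[j]):
            left.append(e)
        else:
            right.append(e)
    return left, right
```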

  15. Phase 2 (optional) • Scans leaf entries in the initial CF tree to rebuild a smaller CF tree and remove more outliers.

  16. Phase 3: Global Clustering • Use an existing global or semi-global algorithm to cluster all the leaf entries across the boundaries of different nodes • This overcomes the following anomaly: • Anomaly: depending on the order of data input and the degree of skew, two subclusters that should not be in one cluster may be kept in the same node, and points that belong together may be split across nodes
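
A sketch of this phase under a strong simplification: scikit-learn's AgglomerativeClustering on the (unweighted) leaf-entry centroids stands in for the CF-adapted hierarchical algorithm the paper uses, and the cluster count k is assumed given:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def phase3(leaves, k):
    # Treat each leaf entry as a single point (its centroid) and cluster
    # those globally; ignoring the entry weights N_i is a simplification.
    cents = np.array([cf_centroid(cf) for cf in leaves])
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(cents)
    seeds = np.array([cents[labels == c].mean(axis=0) for c in range(k)])
    return labels, seeds
```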

  17. Phase 4 (optional) • Additional passes over the data set to improve the quality of the clusters • Uses the centroids of the clusters produced by Phase 3 as seeds • Redistributes each data point to its closest seed
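
This is essentially one k-means-style iteration over the raw data; a sketch (assumes every seed attracts at least one point):

```python
import numpy as np

def phase4(points, seeds):
    # One refinement pass: assign each raw point to its closest seed and
    # recompute the seeds as the centroids of the new assignment.
    X = np.asarray(points, dtype=float)
    d = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=-1)
    labels = d.argmin(axis=1)
    new_seeds = np.array([X[labels == c].mean(axis=0)
                          for c in range(len(seeds))])
    return labels, new_seeds
```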

  18. Comparison of BIRCH and CLARANS

  19. Summary • Compared with previous distance-based approaches (e.g., K-Means and CLARANS), BIRCH is appropriate for very large datasets • BIRCH can work with any given amount of memory
