
Birch: Balanced Iterative Reducing and Clustering using Hierarchies

By Tian Zhang, Raghu Ramakrishnan

Presented by

Vladimir Jelić 3218/10

e-mail: jelicvladimir5@gmail.com

What is Data Clustering?
  • A cluster is a closely packed group.
  • A cluster is a collection of data objects that are similar to one another and are treated collectively as a group.
  • Data clustering is the partitioning of a dataset into clusters.

Data Clustering
  • Helps to understand the natural grouping or structure in a dataset
  • Given a large set of multidimensional data:
    • The data space is usually not uniformly occupied
    • Clustering identifies the sparse and the crowded regions
    • Helps visualization

Some Clustering Applications
  • Biology – building groups of genes with related patterns
  • Marketing – partitioning the population of consumers into market segments
  • WWW – dividing web pages into genres
  • Image segmentation – for object recognition
  • Land use – identification of areas of similar land use from satellite images

Clustering Problems
  • Today many datasets are too large to fit into main memory
  • The dominating cost of any clustering algorithm is I/O, because seek times on disk are orders of magnitude higher than RAM access times

Previous Work
  • Two classes of clustering algorithms:
    • Probability-Based
      • Examples: COBWEB and CLASSIT
    • Distance-Based
      • Examples: KMEANS, KMEDOIDS, and CLARANS

Previous Work: COBWEB
  • Probabilistic approach to making decisions
  • Clusters are represented with probabilistic descriptions
  • Probabilistic representations of clusters are expensive
    • Every instance (data point) translates into a terminal node in the hierarchy, so large hierarchies tend to overfit the data

Previous Work: KMeans
  • Distance-based approach, so there must be a distance measure between any two instances
  • Sensitive to instance order
  • Instances must be stored in memory
  • All instances must be initially available
  • May have exponential run time

Previous Work: CLARANS
  • Also a distance-based approach, so there must be a distance measure between any two instances
  • The computational complexity of CLARANS is about O(n²)
  • Sensitive to instance order
  • Ignores the fact that not all data points in the dataset are equally important

Contributions of BIRCH
  • Each clustering decision is made without scanning all data points
  • BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes
  • BIRCH makes full use of available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency)

Background Knowledge (1)
  • Given a cluster of $N$ instances $\{\vec{X_i}\}$, $i = 1, \dots, N$, we define:

Centroid: $\vec{X_0} = \frac{\sum_{i=1}^{N} \vec{X_i}}{N}$

Radius (average distance from member points to the centroid): $R = \left( \frac{\sum_{i=1}^{N} (\vec{X_i} - \vec{X_0})^2}{N} \right)^{1/2}$

Diameter (average pairwise distance within the cluster): $D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X_i} - \vec{X_j})^2}{N(N-1)} \right)^{1/2}$
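These three quantities can be computed directly; here is a minimal NumPy sketch (my own illustration, not from the slides), using the five 2-d points that appear in the CF example later in the deck:

```python
import numpy as np

def centroid(X):
    """Centroid X0: the per-dimension mean of the N points."""
    return X.mean(axis=0)

def radius(X):
    """R: root mean squared distance from the points to the centroid."""
    return np.sqrt(((X - centroid(X)) ** 2).sum(axis=1).mean())

def diameter(X):
    """D: root mean square over all N*(N-1) ordered pairwise distances."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]   # all pairwise differences
    return np.sqrt((diff ** 2).sum() / (n * (n - 1)))

points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]], dtype=float)
print(centroid(points), radius(points), diameter(points))
```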

Background Knowledge (2)

For two clusters with centroids $\vec{X_{0_1}}$ and $\vec{X_{0_2}}$, holding points $\{\vec{X_i}\}_{i=1}^{N_1}$ and $\{\vec{Y_j}\}_{j=1}^{N_2}$ respectively (with $\vec{Z_k}$ ranging over the $N_1 + N_2$ points of their union), five distance metrics are defined:

Centroid Euclidean distance: $D0 = \left( (\vec{X_{0_1}} - \vec{X_{0_2}})^2 \right)^{1/2}$

Centroid Manhattan distance: $D1 = |\vec{X_{0_1}} - \vec{X_{0_2}}| = \sum_{i=1}^{d} |X_{0_1}^{(i)} - X_{0_2}^{(i)}|$

Average inter-cluster distance: $D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=1}^{N_2} (\vec{X_i} - \vec{Y_j})^2}{N_1 N_2} \right)^{1/2}$

Average intra-cluster distance: $D3 = \left( \frac{\sum_{k=1}^{N_1+N_2} \sum_{l=1}^{N_1+N_2} (\vec{Z_k} - \vec{Z_l})^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$

Variance increase distance: $D4 = \sum_{k=1}^{N_1+N_2} \left( \vec{Z_k} - \frac{\sum_{l} \vec{Z_l}}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( \vec{X_i} - \vec{X_{0_1}} \right)^2 - \sum_{j=1}^{N_2} \left( \vec{Y_j} - \vec{X_{0_2}} \right)^2$
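These metrics operate on raw points (and can also be maintained from CF summaries, as the next slides show). A short sketch of D0–D2 under the same conventions as the previous snippet; D3 and D4 follow the same pattern:

```python
import numpy as np

def d0(X, Y):
    """D0: Euclidean distance between the two cluster centroids."""
    return np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))

def d1(X, Y):
    """D1: Manhattan distance between the two cluster centroids."""
    return np.abs(X.mean(axis=0) - Y.mean(axis=0)).sum()

def d2(X, Y):
    """D2: root mean squared distance over all N1*N2 cross-cluster pairs."""
    diff = X[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2).mean())
```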

Clustering Features (CF)
    • The BIRCH algorithm builds a height-balanced tree, called the clustering feature tree (CF tree), while scanning the data set.
  • Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of objects in the cluster and LS and SS are defined on the next slide.

Clustering Feature (CF)
  • Given N d-dimensional data points in a cluster: $\{\vec{X_i}\}$ where i = 1, 2, …, N, its clustering feature is

CF = (N, LS, SS)

  • N is the number of data points in the cluster,
  • $LS = \sum_{i=1}^{N} \vec{X_i}$ is the linear sum of the N data points,
  • $SS = \sum_{i=1}^{N} \vec{X_i}^2$ is the square sum of the N data points.
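A minimal sketch (illustrative, not from the slides) of building a CF from raw points, and of recovering the centroid and radius from the CF alone, which is what makes this summary so useful:

```python
import numpy as np

def cf_from_points(X):
    """CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def cf_centroid(cf):
    n, ls, _ = cf
    return ls / n

def cf_radius(cf):
    """R^2 = SS_total/N - ||LS/N||^2, so no raw points are needed."""
    n, ls, ss = cf
    return np.sqrt(ss.sum() / n - ((ls / n) ** 2).sum())

points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]], dtype=float)
print(cf_from_points(points))   # (5, array([16., 30.]), array([54., 190.]))
```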

CF Additivity Theorem (1)
  • If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters, then the CF entry of the sub-cluster formed by merging the two disjoint sub-clusters is:

CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
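The theorem amounts to component-wise addition. A sketch reusing the hypothetical cf_from_points helper above, which also reproduces the worked example on the next slide:

```python
import numpy as np

def cf_merge(cf1, cf2):
    """CF additivity: merge two disjoint sub-clusters by summing component-wise."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

# Merging the CFs of {(3,4), (2,6)} and {(4,5), (4,7), (3,8)} reproduces
# the CF of the full five-point cluster.
a = cf_from_points(np.array([[3, 4], [2, 6]], dtype=float))
b = cf_from_points(np.array([[4, 5], [4, 7], [3, 8]], dtype=float))
print(cf_merge(a, b))   # (5, array([16., 30.]), array([54., 190.]))
```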


CF Additivity Theorem (2)

Example: for the five 2-d points (3,4), (2,6), (4,5), (4,7), (3,8):

CF = (5, (16, 30), (54, 190))

since N = 5, LS = (3+2+4+4+3, 4+6+5+7+8) = (16, 30), and SS = (3²+2²+4²+4²+3², 4²+6²+5²+7²+8²) = (54, 190).

Properties of CF-Tree
  • Each non-leaf node has at most B entries
  • Each leaf node has at most L CF entries, each of which satisfies the threshold T (e.g., its diameter must not exceed T)
  • Node size is determined by the dimensionality of the data space and the input parameter P (page size)
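For concreteness, a hypothetical node layout under these constraints (the names and default values are my own, not from BIRCH):

```python
from dataclasses import dataclass, field

B = 50     # max entries per non-leaf node
L = 50     # max CF entries per leaf node
T = 0.5    # diameter threshold each leaf entry must satisfy

@dataclass
class LeafNode:
    entries: list = field(default_factory=list)    # CF triples, len(entries) <= L

@dataclass
class NonLeafNode:
    children: list = field(default_factory=list)   # (CF, child-node) pairs, len <= B
```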

CF Tree Insertion
  • Identifying the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric
  • Modifying the leaf: test whether the leaf can absorb the new entry without violating the threshold; if there is no room, split the node
  • Modifying the path: update the CF information on the path from the root to the leaf
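A simplified sketch of this insertion logic, reusing the hypothetical LeafNode/NonLeafNode layout and cf_merge helper above; node splitting is deliberately omitted:

```python
import numpy as np

def closest(cfs, cf):
    """Index of the CF in cfs whose centroid is nearest (D0) to cf's centroid."""
    c = cf[1] / cf[0]
    return min(range(len(cfs)),
               key=lambda i: np.linalg.norm(cfs[i][1] / cfs[i][0] - c))

def cf_diameter(cf):
    """Diameter from the CF alone: the sum of pairwise squared distances
    equals 2*N*SS_total - 2*||LS||^2."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    return np.sqrt((2 * n * ss.sum() - 2 * (ls ** 2).sum()) / (n * (n - 1)))

def insert_point(node, x, T, L):
    """Insert one point into a (simplified) CF tree; splits are omitted."""
    cf = (1, x, x ** 2)
    if isinstance(node, NonLeafNode):
        i = closest([c for c, _ in node.children], cf)
        child_cf, child = node.children[i]
        insert_point(child, x, T, L)
        node.children[i] = (cf_merge(child_cf, cf), child)  # update CFs on the path
        return
    if node.entries:
        i = closest(node.entries, cf)
        merged = cf_merge(node.entries[i], cf)
        if cf_diameter(merged) <= T:          # the closest entry absorbs the point
            node.entries[i] = merged
            return
    if len(node.entries) < L:                 # otherwise start a new leaf entry
        node.entries.append(cf)
    # else: a real implementation would split this leaf node here
```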


Example of the BIRCH Algorithm

[Figure: a new subcluster, sc8, is inserted into a CF tree whose root points to leaf nodes LN1, LN2, and LN3, which hold subclusters sc1–sc8.]


Merge Operation in BIRCH

If the branching factor of a leaf node cannot exceed 3, then LN1 is split.

[Figure: LN1 is split into LN1' and LN1''; the root now points to LN1', LN1'', LN2, and LN3.]


Merge Operation in BIRCH

If the branching factor of a non-leaf node cannot exceed 3, then the root is split and the height of the CF tree increases by one.

[Figure: the root is split into non-leaf nodes NLN1 and NLN2, which point to LN1', LN1'', LN2, and LN3.]


Merge Operation in BIRCH

Assume that the subclusters are numbered according to the order of formation.

[Figure: a CF tree whose root points to leaf nodes LN1 and LN2, holding subclusters sc1–sc6.]


Merge Operation in BIRCH

If the branching factor of a leaf node cannot exceed 3, then LN2 is split.

[Figure: LN2 is split into LN2' and LN2''; the root now points to LN1, LN2', and LN2''.]


Merge Operation in BIRCH

LN2' and LN1 will be merged, and the newly formed node will be split immediately.

[Figure: the merged node is split into LN3' and LN3''; the root now points to LN2'', LN3', and LN3''.]

Birch Clustering Algorithm (1)
  • Phase 1: Scan all data and build an initial in-memory CF tree
  • Phase 2: Condense the tree to a desirable size by building a smaller CF tree
  • Phase 3: Global clustering
  • Phase 4: Cluster refining – optional; requires more passes over the data to refine the results
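scikit-learn ships an implementation of this pipeline; a minimal usage sketch (threshold and branching_factor correspond to the CF-tree parameters T and B, and n_clusters drives the Phase 3 global clustering):

```python
import numpy as np
from sklearn.cluster import Birch

# Toy data: three Gaussian blobs in 2-d.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ((0, 0), (3, 3), (0, 3))])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
```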

Birch Clustering Algorithm (2)

[Figure: overview diagram of the four BIRCH phases.]
Birch – Phase 1
  • Start with an initial threshold value and insert points into the tree
  • If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the leaf entries of the old tree, followed by the remaining data points
  • A good initial threshold is important but hard to determine
  • Outlier removal – outliers can be discarded while the tree is rebuilt
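The rebuild loop might look as follows; this is only a hedged sketch, and CFTree, its methods, and pick_larger_threshold are hypothetical placeholders, not BIRCH's actual interfaces:

```python
def phase1(points, initial_T, memory_limit):
    """Sketch of the Phase 1 rebuild loop: grow T until the tree fits."""
    T = initial_T
    tree = CFTree(T)                             # hypothetical CF-tree class
    for x in points:
        tree.insert(x)
        if tree.size() > memory_limit:           # out of memory:
            T = pick_larger_threshold(tree, T)   # heuristic, e.g. double T
            tree = tree.rebuild(T)               # reinsert the old leaf entries
    return tree
```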

Birch - Phase 2
  • Optional
  • The global clustering algorithm of Phase 3 performs well only within a certain input size range, so Phase 2 condenses the tree to prepare for Phase 3
  • BIRCH applies a (selected) clustering algorithm to the leaf entries of the CF tree, removing sparse clusters as outliers and grouping dense clusters into larger ones

Birch – Phase 3
  • Problems after Phase 1:
    • Input order affects results
    • Splitting is triggered by node size, not by natural cluster boundaries
  • Phase 3:
    • Cluster all leaf entries by their CF values using an existing algorithm
    • Algorithm used here: agglomerative hierarchical clustering
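A sketch of such a global step; leaf_entries stands in for the (N, LS, SS) triples collected from the tree and is hard-coded here for illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in for the CF triples of the tree's leaf entries.
leaf_entries = [(2, np.array([5., 10.]),  np.array([13., 52.])),
                (3, np.array([11., 20.]), np.array([41., 138.])),
                (2, np.array([19., 2.]),  np.array([181., 2.]))]

# Phase 3: agglomerative hierarchical clustering over the leaf-entry
# centroids (weighting by N is omitted in this sketch).
centroids = np.array([ls / n for n, ls, _ in leaf_entries])
labels = AgglomerativeClustering(n_clusters=2).fit_predict(centroids)
```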

Birch – Phase 4
  • Optional
  • Perform additional passes over the dataset and reassign each data point to the closest centroid produced in Phase 3
  • Recalculate the centroids and redistribute the points
  • Always converges, no matter how many times Phase 4 is repeated
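One such refinement pass might look like this (a sketch; it assumes no cluster ends up empty):

```python
import numpy as np

def phase4_pass(X, centroids):
    """Reassign each point to its nearest Phase 3 centroid, then recompute
    the centroids from the new assignment."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids
```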

Conclusions (1)
  • BIRCH runs faster than existing algorithms (CLARANS and KMEANS) on large datasets
  • Scans the whole dataset only once
  • Handles outliers better
  • Superior to other algorithms in stability and scalability

Conclusions (2)

Since each node in a CF tree can hold only a limited number of entries due to its size, a CF-tree node does not always correspond to what a user would consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.

References

T. Zhang, R. Ramakrishnan, and M. Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD 1996.

J. Oberst: Efficient Data Clustering and How to Groom Fast-Growing Trees.

P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining.
