Clustering

1. Clustering CS 157B Jonathan Silva

2. What is Clustering? Clustering finds structure in a collection of unlabeled data. A cluster is a collection of objects that are similar to one another and dissimilar to objects in other clusters. Clustering is the process of separating data into meaningful groups.

3. Formal definition: Clusters are connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points.

4. Uses for Clustering. WWW: document classification; clustering weblog data to discover similar access patterns. Marketing: finding groups of customers with similar purchasing habits; spotting trends. Biology: classifying plants and animals by common attributes. Genetics: providing information about gene activity under different conditions.

5. Uses for Clustering. City planning: identifying groups of houses according to type, value, and location. Insurance: identifying policy holders with high average claim costs; detecting fraud. Earthquake studies: clustering observed epicenters to discover danger areas for potential future earthquakes.

6. Requirements for a Clustering Algorithm: scalability; dealing with different types of attributes; discovering clusters of arbitrary shape; ability to deal with outliers and noise; insensitivity to the order of input records; interpretability and usability.

7. Types of Clustering: There are three main categories of clustering. Partitional: K-means. Hierarchical: agglomerative and divisive. Probabilistic: mixture of Gaussians.

8. K-means Clustering: A major clustering technique, widely used for its computational ease and memory efficiency. Used by most search engines. A popular unsupervised learning algorithm in data mining.

9. How K-means works: The main idea is to identify K centroids, one per cluster, each being the mean of the points in its cluster. Choose K initial centroids. Associate each point with the nearest centroid. Re-calculate each centroid as the mean of the points associated with it. Repeat the last two steps until the centroids no longer change.
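
The K-means loop described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method from the slides: the function name `kmeans`, the `init`, `max_iters`, and `seed` parameters, the use of Euclidean distance, and the choice of random data points as starting centroids are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, init=None, max_iters=100, seed=0):
    """Sketch of K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: unless `init` is given, start from k randomly chosen points.
    centroids = list(init) if init is not None else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: associate each point with its nearest centroid
        # (Euclidean distance is assumed here).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: re-calculate each centroid as the mean of its points.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[i])
        # Stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2, init=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, which is why the usage example passes `init` explicitly; with random starts it is common to run the procedure several times and keep the best grouping.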

14. Hierarchical Clustering: The traditional representation is a tree. Agglomerative clustering begins at the leaves of the tree, while divisive clustering begins at the root. Agglomerative techniques are more commonly used than divisive ones.

15. How Hierarchical Clustering works (agglomerative): Assign each item to its own cluster. Find the most similar (closest) pair of clusters and merge them into one. Compute the similarities or distances between the new cluster and each of the old clusters. Repeat the last two steps until all items belong to a single cluster.
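
The merge loop on this slide can be sketched as follows. The slide does not say how the distance between two clusters is measured, so this example assumes single linkage (the distance between their closest members) with Euclidean distance; the names `single_link` and `agglomerate` and the `target` parameter are hypothetical and exist only for illustration.

```python
import math


def single_link(a, b):
    """Cluster distance under single linkage: the closest pair of members."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, target=1):
    """Merge the closest pair of clusters until `target` clusters remain."""
    # Start with every item in its own cluster (the leaves of the tree).
    clusters = [[p] for p in points]
    merges = []  # (cluster_a, cluster_b, distance) for each merge, in order
    while len(clusters) > target:
        # Find the closest pair of clusters.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]])
        )
        merges.append(
            (clusters[i], clusters[j], single_link(clusters[i], clusters[j]))
        )
        # Merge the pair; distances involving the new cluster are
        # recomputed on the next pass through the loop.
        merged = clusters[i] + clusters[j]
        rest = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters = rest + [merged]
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerate(points, target=2)
print(clusters)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 2))
```

Read in order, the `merges` list is the tree built from the leaves upward. Recomputing every pairwise distance on each pass keeps the sketch short but is slow for large inputs; practical implementations cache the distances and update only those involving the new cluster.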

16. Agglomerative Clustering

