1. Clustering CS 157B
Jonathan Silva
2. What is Clustering? Finding structure in a collection of unlabeled data
A cluster is a collection of objects that are similar to one another and dissimilar to objects in other clusters
Clustering is the process of separating data into meaningful groups
3. Formal definition Clusters are connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points.
4. Uses for Clustering WWW: document classification; clustering weblog data to discover similar access patterns
Marketing: finding groups of customers with similar purchasing habits; spotting trends
Biology: classifying different plants and animals by common attributes
Genetics: grouping genes with similar activity patterns under different conditions
5. Uses for Clustering City planning: identifying groups of houses by type, value, and location
Insurance: identifying groups of policyholders with high average claim costs; detecting fraud
Earthquake studies: clustering observed epicenters to discover danger areas for potential future earthquakes
6. Requirements for a Clustering Algorithm Scalability
Dealing with different types of attributes
Discovering clusters with arbitrary shape
Ability to deal with outliers and noise
Insensitivity to order of input records
Interpretability and usability
7. Types of Clustering There are three main categories of clustering
Partitional: K-means
Hierarchical: Agglomerative and Divisive
Probabilistic: Mixture of Gaussians
8. K-means Clustering Widely used clustering technique because it is computationally simple and memory-efficient
Used by most search engines
Popular unsupervised learning algorithm used in data mining
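9. How K-means works The main idea is to find K centroids, one per cluster, where each centroid is the mean of the points assigned to its cluster
Start from K initial centroids and assign each point to its nearest centroid
Recalculate each centroid as the mean of the points assigned to it
Repeat the assignment and recalculation steps until the centroids no longer change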
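The K-means loop described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the course: the function name `kmeans`, the `max_iters` cap, the use of Euclidean distance, and the choice to seed the centroids with the first K points are all assumptions made for the example.

```python
import math


def kmeans(points, k, max_iters=100):
    """Cluster a list of equal-length numeric tuples into k groups."""
    # Assumption: seed the centroids with the first k points.
    # Real implementations usually pick random or spread-out seeds.
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]

    for _ in range(max_iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its points.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[i])

        # Stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return centroids, clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    centers, groups = kmeans(data, k=2)
    print(centers)
    print(groups)
```

The result depends on the initial centroids, so different seeds can produce different clusters on the same data.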
14. Hierarchical Clustering Traditional representation is a tree (a dendrogram)
Agglomerative clustering begins at the leaves of the tree, while divisive clustering begins at the root
Agglomerative techniques are more commonly used than divisive techniques
15. How Hierarchical Clustering works In the agglomerative approach, each item starts in its own cluster
Find the most similar (closest) pair of clusters and merge them into one
Compute the similarities or distances between the new cluster and each of the remaining clusters
Repeat the merge and distance steps until all items belong to a single cluster
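The agglomerative procedure above can be sketched as follows. The slide does not say how the distance between two clusters is measured, so this example assumes single linkage (the distance between the closest pair of members) with Euclidean distance. The function name `agglomerative` and the `num_clusters` stopping parameter are hypothetical additions for illustration; with the default of 1, merging continues until a single cluster remains, as on the slide.

```python
import math


def agglomerative(points, num_clusters=1):
    """Merge the closest clusters until num_clusters remain."""
    # Each item starts in its own cluster.
    clusters = [[p] for p in points]
    merges = []  # history of merged pairs, in order

    def linkage(a, b):
        # Assumption: single linkage, i.e. closest pair of members.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > num_clusters:
        # Find the closest pair of clusters.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))

        # Merge the pair into one cluster and record the step.
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)

    return clusters, merges


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
    groups, history = agglomerative(data, num_clusters=2)
    print(groups)
    print(len(history), "merges")
```

This sketch recomputes every pairwise linkage on each pass, which keeps it short but scales poorly; the recorded merge history is what a tree representation of the clustering would be built from.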
16. Agglomerative Clustering