Clustering

Presentation Transcript


    1. Clustering CS 157B Jonathan Silva

    2. What is Clustering?
        - Finding structure in a collection of unlabeled data
        - A cluster is a collection of objects that are similar to one another and dissimilar to objects in other clusters
        - Clustering is the process of separating data into meaningful groups

    3. Formal definition
        - Clusters are connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points.

    4. Uses for Clustering
        - WWW: document classification; clustering weblog data to discover similar access patterns
        - Marketing: finding groups of customers with similar purchasing habits; identifying trends
        - Biology: classifying different plants and animals by common attributes
        - Genetics: providing information about gene activity under different conditions

    5. Uses for Clustering (continued)
        - City planning: identifying groups of houses according to type, value, and location
        - Insurance: identifying groups of policyholders with high average claim costs; detecting fraud
        - Earthquake studies: clustering observed epicenters to discover danger areas for potential future earthquakes

    6. Requirements for a Clustering Algorithm
        - Scalability
        - Dealing with different types of attributes
        - Discovering clusters with arbitrary shape
        - Ability to deal with outliers and noise
        - Insensitivity to order of input records
        - Interpretability and usability

    7. Types of Clustering
        - There are three main categories of clustering:
        - Partitional: K-means
        - Hierarchical: Agglomerative and Divisive
        - Probabilistic: Mixture of Gaussians

    8. K-means Clustering
        - A major clustering technique, widely used for its computational ease and memory efficiency
        - Used by most search engines
        - A popular unsupervised learning algorithm in data mining

    9. How K-means works
        - The main idea is to identify K centroids, one per cluster, where each centroid is the mean of the points in its cluster
        - Choose K initial centroids
        - Associate each point with its nearest centroid
        - Re-calculate each centroid as the mean of the points associated with it
        - Repeat the last two steps until the centroids no longer change
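
The loop on slide 9 can be sketched in a few lines of plain Python. This is a minimal illustration, not the presenter's code: the function names, the sample points, the random choice of initial centroids, and the use of squared Euclidean distance are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Return (centroids, clusters) after running the K-means loop."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points picked at random.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        # Stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical sample data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

The result of K-means generally depends on the initial centroids, so in practice the algorithm is often run several times with different starting points and the best partition is kept.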

    14. Hierarchical Clustering
        - The traditional representation is a tree
        - Agglomerative clustering begins at the leaves of the tree (each item in its own cluster), while divisive clustering begins at the root (all items in one cluster)
        - Agglomerative techniques are more commonly used than divisive techniques

    15. How Hierarchical Clustering works
        - Assign each item to its own cluster
        - Find the most similar (closest) pair of clusters and merge them into one
        - Compute the similarities or distances between the new cluster and each of the old clusters
        - Repeat the last two steps until all items belong to a single cluster
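
The agglomerative procedure on slide 15 might be sketched as follows. The slides do not say how the distance between two clusters is measured, so this example assumes single linkage (the distance between the closest pair of members) with Euclidean distance; the function names and sample points are hypothetical. For clarity, the sketch recomputes cluster distances on every pass instead of caching them.

```python
import math
from itertools import combinations


def single_link_distance(a, b):
    """Distance between clusters a and b: their closest pair of members."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerative(points):
    """Merge clusters bottom-up; return the list of merges in order.

    Each entry is (left_cluster, right_cluster, distance).
    """
    # Start with every item in its own cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_link_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        distance = single_link_distance(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], distance))
        # Replace the two clusters with their union.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical sample data.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
merges = agglomerative(points)
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 3))
```

The recorded merges, read in order, describe the tree from the leaves up to the root. Cutting the sequence before the last few merges yields a partition with that many clusters.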

    16. Agglomerative Clustering
