
Clustering I


Presentation Transcript


  1. Clustering I

  2. The Task • Input: a collection of instances • No special class label attribute! • Output: clusters (groups) of instances, where members of a cluster are more ‘similar’ to each other than they are to members of another cluster • Similarity between instances needs to be defined • Because there are no predefined classes to predict, the task is to find useful groups of instances
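One common way to make ‘similar’ concrete for numeric attributes is a distance measure such as Euclidean distance. The sketch below is illustrative only; the slides do not prescribe a particular measure, and the function name is an assumption:

```python
import math

def euclidean_distance(a, b):
    """Dissimilarity between two instances with numeric attributes:
    the smaller the distance, the more 'similar' the instances are."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two iris-like instances described by (petal length, petal width)
print(euclidean_distance((1.4, 0.2), (4.7, 1.4)))  # ~3.51
```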

  3. Example – Iris Data

  4. Example Clusters? (plot of the iris data: Petal Width vs. Petal Length)

  5. Different Types of Clusters • Clustering is not an end in itself • It is only a means to an end defined by the goals of the parent application • A garment manufacturer might want to group people by size and shape (segmentation) • A biologist might look for groups of genes with similar functions (natural groupings) • The notion of a cluster is therefore application dependent • This means you need different clustering techniques to find different types of clusters • You also need to learn to choose a clustering technique that matches your objective • Sometimes an ideal match is not possible, so you adapt an existing technique • Sometimes a wrongly selected technique reveals unexpected clusters – which is what data mining is all about

  6. Cluster Representations • Several representations are possible: a partitioning of 2D space showing the instances, a Venn diagram (overlapping groups), a dendrogram (hierarchy), and a table of cluster probabilities

  7. Several Techniques • Partitioning Methods • K-means and K-medoids • Hierarchical Methods • Agglomerative • Density-based Methods • DBSCAN

  8. K-means Clustering • Input: the number of clusters k and a collection of n instances • Output: a set of k clusters that minimizes the squared-error criterion • Method: • Arbitrarily choose k instances as the initial cluster centers • Repeat • (Re)assign each instance to the cluster to which it is most similar, based on the mean value of the instances in the cluster • Update the cluster means (compute the mean value of the instances in each cluster) • Until no change in the assignment • Squared-error criterion: E = Σᵢ₌₁ᵏ Σ_{p ∈ Cᵢ} |p − mᵢ|², where mᵢ is the mean of cluster Cᵢ and p ranges over the instances assigned to Cᵢ
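A minimal sketch of this loop in Python, assuming instances are numeric tuples; the names kmeans, dist2, and squared_error are illustrative, not from the slides:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance |p - m|^2 used in the error criterion."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Arbitrary initial centres, then alternate (re)assignment and mean
    updates until the assignment stops changing."""
    centers = random.sample(points, k)            # arbitrarily chosen instances
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each instance to the cluster with the nearest mean
        new_assignment = [min(range(k), key=lambda c: dist2(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:          # no change -> stop
            break
        assignment = new_assignment
        # Update cluster means from the instances assigned to each cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                           # keep old centre if cluster is empty
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assignment

def squared_error(points, centers, assignment):
    """E = sum over clusters i of sum over p in C_i of |p - m_i|^2."""
    return sum(dist2(p, centers[a]) for p, a in zip(points, assignment))
```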

  9. Interactive Demo • http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

  10. Choosing the Number of Clusters • The k-means method expects the number of clusters k to be specified as part of the input • How to choose an appropriate k? • Options • Start with k = 1 and work up to a small number, selecting the result with the lowest squared error • Warning: squared error keeps decreasing as k grows, so by this measure alone the ‘best’ result puts each instance in its own cluster • Start with two clusters and split each cluster independently • When splitting a cluster, create two new ‘seeds’: one a point one standard deviation away from the cluster center, the other one standard deviation away in the opposite direction • Apply k-means to the points in the cluster with the two new seeds
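A sketch of creating the two splitting seeds, assuming numeric tuples. The slides do not say along which direction the one-standard-deviation shift is taken; shifting per attribute is an assumption here, and split_seeds is an illustrative name:

```python
import statistics

def split_seeds(cluster_points):
    """Two new seeds for splitting a cluster: the cluster centre shifted by
    one standard deviation per attribute, in opposite directions."""
    dims = list(zip(*cluster_points))             # one tuple of values per attribute
    center = [sum(d) / len(d) for d in dims]
    spread = [statistics.pstdev(d) for d in dims]
    seed_a = tuple(c + s for c, s in zip(center, spread))
    seed_b = tuple(c - s for c, s in zip(center, spread))
    return seed_a, seed_b                         # use as initial centres for k-means with k = 2
```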

  11. Choosing Initial Centroids • Randomly chosen centroids may produce poor-quality clusters • How to choose initial centroids? • Options • Take a sample of the data, apply hierarchical clustering, and extract k cluster centroids • Works well when the sample size is small • Take k well-separated points • Take a sample of the data and select the first centroid as the centroid of the sample • Each subsequent centroid is chosen as the point farthest from all of the already selected centroids
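A sketch of the last option (a farthest-point heuristic), again assuming numeric tuples; farthest_first_centroids is an illustrative name:

```python
def farthest_first_centroids(sample, k):
    """First centroid: the centroid (mean) of the sample. Each subsequent
    centroid: the sample point farthest from every centroid chosen so far."""
    def d2(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    chosen = [tuple(sum(xs) / len(sample) for xs in zip(*sample))]
    while len(chosen) < k:
        # farthest = the point whose nearest chosen centroid is as far away as possible
        chosen.append(max(sample, key=lambda p: min(d2(p, c) for c in chosen)))
    return chosen
```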

  12. Additional Issues • K-means might produce empty clusters • Need to choose an alternative centroid • Options • Choose the point farthest from the existing centroids • Choose a point from the cluster with the highest squared error • The squared error may need to be reduced further • Because k-means may converge to a local minimum • It is possible to reduce the squared error without increasing k • Solution • Apply alternating phases of cluster splitting and merging • Splitting – split the cluster with the largest squared error • Merging – redistribute the elements of a cluster to its neighbours • Outliers in the data cause problems for k-means • Options • Remove outliers • Use K-medoids instead

  13. K-medoids • Input: the number of clusters k and a collection of n instances • Output: a set of k clusters that minimizes the sum of the dissimilarities of all instances to their nearest medoids • Method: • Arbitrarily choose k instances as the initial medoids • Repeat • (Re)assign each remaining instance to the cluster with the nearest medoid • Randomly select a non-medoid instance Or • Compute the total cost S of swapping a medoid Oj with Or • If S < 0, swap Oj with Or to form the new set of k medoids • Until no change
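A PAM-style sketch of the swap test in Python, assuming instances are numeric tuples. The Manhattan dissimilarity and the fixed number of random swap trials are illustrative choices (the slides say to repeat until no change), and the function names are not from the slides:

```python
import random

def dissimilarity(p, q):
    """Manhattan distance as an example dissimilarity; any measure could be used."""
    return sum(abs(x - y) for x, y in zip(p, q))

def total_cost(points, medoids):
    """Sum of dissimilarities of all instances to their nearest medoid."""
    return sum(min(dissimilarity(p, m) for m in medoids) for p in points)

def k_medoids(points, k, trials=200):
    """Try random swaps of a medoid Oj with a non-medoid Or and keep a swap
    whenever it lowers the total cost (S < 0)."""
    medoids = random.sample(points, k)
    for _ in range(trials):
        o_r = random.choice([p for p in points if p not in medoids])
        o_j = random.choice(medoids)
        candidate = [o_r if m == o_j else m for m in medoids]
        s = total_cost(points, candidate) - total_cost(points, medoids)
        if s < 0:
            medoids = candidate
    return medoids
```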

  14. K-means vs. K-medoids • Both require k to be specified in the input • K-medoids is less influenced by outliers in the data • K-medoids is computationally more expensive • Both methods assign each instance to exactly one cluster • Fuzzy partitioning techniques relax this • Partitioning methods are generally good at finding spherical clusters • They are suitable for small and medium-sized data sets • Extensions are required to work on large data sets
