Clustering is a crucial task in data analysis, focusing on grouping instances based on similarity without predefined class labels. This guide delves into organizing data into clusters, with examples such as the Iris data set. It covers essential concepts like selecting the number of clusters (K), determining initial centroids, and handling issues like empty clusters and outliers. We'll explore techniques including K-means, K-medoids, hierarchical, and density-based methods, emphasizing their applications in various fields, from garment manufacturing to biological research.
The Task • Input: a collection of instances • No special class label attribute! • Output: clusters (groups) of instances in which members of a cluster are more ‘similar’ to each other than they are to members of other clusters • Similarity between instances needs to be defined (see the distance sketch below) • Because there are no predefined classes to predict, the task is to find useful groups of instances
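As a minimal illustration of defining similarity, the sketch below (an assumption added here, not part of the original slides) measures Euclidean distance between two Iris-style instances; any other measure, such as Manhattan or cosine distance, could be substituted.

```python
import numpy as np

# Two Iris-like instances: (sepal length, sepal width, petal length, petal width)
a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([6.3, 3.3, 6.0, 2.5])

# One possible (dis)similarity measure: Euclidean distance
print(np.linalg.norm(a - b))
```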
Example Clusters? • Scatter plot of the Iris data (Petal Width vs. Petal Length) showing candidate clusters
Different Types of Clusters • Clustering is not an end in itself • It is only a means to an end defined by the goals of the parent application • A garment manufacturer might want to group people into different sizes and shapes (segmentation) • A biologist might be looking for gene groups with similar functions (natural groupings) • Application-dependent notions of clusters are possible • This means you need different clustering techniques to find different types of clusters • You also need to choose a clustering technique that matches your objective • Sometimes an ideal match is not possible; you then adapt an existing technique • Sometimes a wrongly selected technique reveals unexpected clusters, which is what data mining is all about
Cluster Representations • Several representations are possible • Partitioning of the 2D space showing instances • Venn diagram (overlapping clusters) • Dendrogram • Table of cluster probabilities
Several Techniques • Partitioning Methods • K-means and k-medoids • Hierarchical Methods • Agglomerative • Density-based Methods • DBSCAN
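As a rough, hedged illustration of the three families listed above, the sketch below runs one representative of each from scikit-learn on the Iris measurements; the parameter values (k = 3, eps = 0.5, min_samples = 5) are illustrative assumptions rather than tuned choices.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X = load_iris().data  # 150 instances, 4 numeric attributes; no class label is used

partitioning  = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hierarchical  = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)
density_based = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # label -1 marks noise

print(partitioning[:10], hierarchical[:10], density_based[:10])
```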
K-means Clustering • Input: the number of clusters K and the collection of n instances • Output: a set of k clusters that minimizes the squared error criterion • Method: • Arbitrarily choose k instances as the initial cluster centers • Repeat • (Re)assign each instance to the cluster to which the instance is the most similar, based on the mean value of the instances in the cluster • Update cluster means (compute the mean value of the instances for each cluster) • Until no change in the assignment • Squared Error Criterion • E = ∑ᵢ₌₁ᵏ ∑_{p ∈ Cᵢ} |p − mᵢ|² • where mᵢ is the mean of cluster Cᵢ and p ranges over the instances assigned to Cᵢ
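The following is a minimal from-scratch sketch of the loop above, using NumPy and Euclidean distance; it is an illustration under simplifying assumptions (random initial centers, no handling of empty clusters) rather than a production implementation.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Arbitrarily choose k instances as the initial cluster centers
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (Re)assign each instance to the nearest cluster mean
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update cluster means (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # no change -> converged
            break
        centroids = new_centroids
    # Squared error criterion: E = sum_i sum_{p in C_i} |p - m_i|^2
    E = sum(((X[labels == i] - centroids[i]) ** 2).sum() for i in range(k))
    return labels, centroids, E
```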
Interactive Demo • http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
Choosing the Number of Clusters • The k-means method expects the number of clusters k to be specified as part of the input • How to choose an appropriate k? • Options • Start with k = 1 and work up to a small number • Select the result with the lowest squared error • Warning: taken literally, the lowest error is achieved by putting each instance in a cluster of its own, so compare errors across values of k rather than simply minimising • Or create two initial clusters and split each cluster independently • When splitting a cluster, create two new ‘seeds’ • One seed at a point one standard deviation away from the cluster center and the second one standard deviation away in the opposite direction • Apply k-means to the points in the cluster with the two new seeds
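One common way to act on the warning above is to compute the squared error for a range of k values and look for an "elbow" in the curve rather than the minimum; the sketch below uses scikit-learn's inertia_ (its name for this squared error) on the Iris data as an illustration.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ always decreases as k grows; look for where the decrease
    # levels off (the elbow) instead of taking the minimum.
    print(k, round(km.inertia_, 1))
```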
Choosing Initial Centroids • Randomly chosen centroids may produce poor-quality clusters • How to choose initial centroids? • Options • Take a sample of the data, apply hierarchical clustering, and extract k cluster centroids • Works well; small sample sizes are desirable because hierarchical clustering is expensive • Take k well-separated points • Take a sample of the data and select the first centroid as the centroid of the sample • Each subsequent centroid is chosen to be the point farthest from any of the already selected centroids
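A sketch of the last two options combined: the first centroid is the centroid of a sample, and each subsequent centroid is the point farthest from those already chosen. The sample size of 50 and the use of Euclidean distance are assumptions for illustration.

```python
import numpy as np

def farthest_first_centroids(X, k, sample_size=50, seed=0):
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    centroids = [sample.mean(axis=0)]          # first centroid: centroid of the sample
    while len(centroids) < k:
        # Distance from every instance to its nearest already-chosen centroid
        d = np.linalg.norm(X[:, None, :] - np.asarray(centroids)[None, :, :], axis=2).min(axis=1)
        centroids.append(X[d.argmax()])        # farthest point from all chosen centroids
    return np.asarray(centroids)
```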
Additional Issues • K-means might produce empty clusters • Need to choose an alternative centroid • Options • Choose the point farthest from the existing centroids • Choose a point from the cluster with the highest squared error • The squared error may need to be reduced further • Because k-means may converge to a local minimum • It is possible to reduce the squared error without increasing K • Solution • Apply alternating phases of cluster splitting and merging • Splitting: split the cluster with the largest squared error • Merging: redistribute the elements of a cluster to its neighbours • Outliers in the data cause problems for k-means • Options • Remove outliers • Use k-medoids instead
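As a small sketch of the first option for empty clusters, the helper below (a hypothetical name introduced here) moves the empty cluster's centroid to the point farthest from all existing centroids, again assuming Euclidean distance.

```python
import numpy as np

def repair_empty_cluster(X, centroids, empty_idx):
    # The point farthest from its nearest existing centroid becomes the new center
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    farthest = dists.min(axis=1).argmax()
    centroids[empty_idx] = X[farthest]
    return centroids
```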
K-medoids • Input: the number of clusters K and the collection of n instances • Output: a set of k clusters that minimizes the sum of the dissimilarities of all the instances to their nearest medoids • Method: • Arbitrarily choose k instances as the initial medoids • Repeat • (Re)assign each remaining instance to the cluster with the nearest medoid • Randomly select a non-medoid instance, Or • Compute the total cost, S, of swapping a current medoid Oj with Or • If S < 0, swap Oj with Or to form the new set of k medoids • Until no change
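A sketch of the swap test in the loop above: the total cost of a configuration is the sum of distances from each instance to its nearest medoid, and a candidate swap of medoid Oj for non-medoid Or is accepted only if it lowers that cost (S < 0). Euclidean distance and the helper names are assumptions for illustration.

```python
import numpy as np

def total_cost(X, medoid_idx):
    # Sum of distances from each instance to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def try_swap(X, medoid_idx, j, r):
    # Cost change S of swapping current medoid index j for non-medoid index r
    candidate = [r if m == j else m for m in medoid_idx]
    S = total_cost(X, candidate) - total_cost(X, medoid_idx)
    return (candidate, S) if S < 0 else (list(medoid_idx), S)
```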
K-means vs. K-medoids • Both require K to be specified in the input • K-medoids is less influenced by outliers in the data • K-medoids is computationally more expensive • Both methods assign each instance to exactly one cluster • Fuzzy partitioning techniques relax this • Partitioning methods are generally good at finding spherical clusters • They are suitable for small and medium-sized data sets • Extensions are required for working with large data sets