Clustering is a crucial task in data analysis, focusing on grouping instances based on similarity without predefined class labels. This guide delves into organizing data into clusters, with examples such as the Iris data set. It covers essential concepts like selecting the number of clusters (K), determining initial centroids, and handling issues like empty clusters and outliers. We'll explore techniques including K-means, K-medoids, hierarchical, and density-based methods, emphasizing their applications in various fields, from garment manufacturing to biological research.
The Task • Input: a collection of instances • No special class label attribute! • Output: clusters (groups) of instances in which members of a cluster are more ‘similar’ to each other than they are to members of other clusters • Similarity between instances needs to be defined (see the distance sketch below) • Because there are no predefined classes to predict, the task is to find useful groups of instances
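As a minimal illustration of defining similarity, the sketch below (an assumption added here, not part of the original slides) measures Euclidean distance between two Iris-style instances; any other measure, such as Manhattan or cosine distance, could be substituted.

```python
import numpy as np

# Two Iris-like instances: (sepal length, sepal width, petal length, petal width)
a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([6.3, 3.3, 6.0, 2.5])

# One possible (dis)similarity measure: Euclidean distance
print(np.linalg.norm(a - b))
```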
Example Clusters? • Scatter plot of the Iris data (Petal Width vs. Petal Length) showing candidate clusters
Different Types of Clusters • Clustering is not an end in itself • It is only a means to an end defined by the goals of the parent application • A garment manufacturer might want to group people into different sizes and shapes (segmentation) • A biologist might be looking for gene groups with similar functions (natural groupings) • Application-dependent notions of clusters are possible • This means you need different clustering techniques to find different types of clusters • You also need to choose a clustering technique that matches your objective • Sometimes an ideal match is not possible; you then adapt an existing technique • Sometimes a wrongly selected technique reveals unexpected clusters, which is what data mining is all about
Cluster Representations • Several representations are possible • Partitioning of the 2D space showing instances • Venn diagram (overlapping clusters) • Dendrogram • Table of cluster probabilities
Several Techniques • Partitioning Methods • K-means and k-medoids • Hierarchical Methods • Agglomerative • Density-based Methods • DBSCAN
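As a rough, hedged illustration of the three families listed above, the sketch below runs one representative of each from scikit-learn on the Iris measurements; the parameter values (k = 3, eps = 0.5, min_samples = 5) are illustrative assumptions rather than tuned choices.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X = load_iris().data  # 150 instances, 4 numeric attributes; no class label is used

partitioning  = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hierarchical  = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)
density_based = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # label -1 marks noise

print(partitioning[:10], hierarchical[:10], density_based[:10])
```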
K-means Clustering • Input: the number of clusters K and the collection of n instances • Output: a set of k clusters that minimizes the squared error criterion • Method: • Arbitrarily choose k instances as the initial cluster centers • Repeat • (Re)assign each instance to the cluster to which the instance is the most similar, based on the mean value of the instances in the cluster • Update cluster means (compute the mean value of the instances for each cluster) • Until no change in the assignment • Squared Error Criterion • E = ∑ᵢ₌₁ᵏ ∑_{p ∈ Cᵢ} |p − mᵢ|² • where mᵢ is the mean of cluster Cᵢ and p ranges over the instances assigned to Cᵢ
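The following is a minimal from-scratch sketch of the loop above, using NumPy and Euclidean distance; it is an illustration under simplifying assumptions (random initial centers, no handling of empty clusters) rather than a production implementation.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Arbitrarily choose k instances as the initial cluster centers
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (Re)assign each instance to the nearest cluster mean
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update cluster means (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # no change -> converged
            break
        centroids = new_centroids
    # Squared error criterion: E = sum_i sum_{p in C_i} |p - m_i|^2
    E = sum(((X[labels == i] - centroids[i]) ** 2).sum() for i in range(k))
    return labels, centroids, E
```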
Interactive Demo • http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
Choosing the Number of Clusters • The k-means method expects the number of clusters k to be specified as part of the input • How to choose an appropriate k? • Options • Start with k = 1 and work up to a small number • Select the result with the lowest squared error • Warning: taken literally, the lowest error is achieved by putting each instance in a cluster of its own, so compare errors across values of k rather than simply minimising • Or create two initial clusters and split each cluster independently • When splitting a cluster, create two new ‘seeds’ • One seed at a point one standard deviation away from the cluster center and the second one standard deviation away in the opposite direction • Apply k-means to the points in the cluster with the two new seeds
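One common way to act on the warning above is to compute the squared error for a range of k values and look for an "elbow" in the curve rather than the minimum; the sketch below uses scikit-learn's inertia_ (its name for this squared error) on the Iris data as an illustration.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ always decreases as k grows; look for where the decrease
    # levels off (the elbow) instead of taking the minimum.
    print(k, round(km.inertia_, 1))
```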
Choosing Initial Centroids • Randomly chosen centroids may produce poor-quality clusters • How to choose initial centroids? • Options • Take a sample of the data, apply hierarchical clustering, and extract k cluster centroids • Works well; small sample sizes are desirable because hierarchical clustering is expensive • Take k well-separated points • Take a sample of the data and select the first centroid as the centroid of the sample • Each subsequent centroid is chosen to be the point farthest from any of the already selected centroids
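A sketch of the last two options combined: the first centroid is the centroid of a sample, and each subsequent centroid is the point farthest from those already chosen. The sample size of 50 and the use of Euclidean distance are assumptions for illustration.

```python
import numpy as np

def farthest_first_centroids(X, k, sample_size=50, seed=0):
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    centroids = [sample.mean(axis=0)]          # first centroid: centroid of the sample
    while len(centroids) < k:
        # Distance from every instance to its nearest already-chosen centroid
        d = np.linalg.norm(X[:, None, :] - np.asarray(centroids)[None, :, :], axis=2).min(axis=1)
        centroids.append(X[d.argmax()])        # farthest point from all chosen centroids
    return np.asarray(centroids)
```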
Additional Issues • K-means might produce empty clusters • Need to choose an alternative centroid • Options • Choose the point farthest from the existing centroids • Choose a point from the cluster with the highest squared error • The squared error may need to be reduced further • Because k-means may converge to a local minimum • It is possible to reduce the squared error without increasing K • Solution • Apply alternating phases of cluster splitting and merging • Splitting: split the cluster with the largest squared error • Merging: redistribute the elements of a cluster to its neighbours • Outliers in the data cause problems for k-means • Options • Remove outliers • Use k-medoids instead
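As a small sketch of the first option for empty clusters, the helper below (a hypothetical name introduced here) moves the empty cluster's centroid to the point farthest from all existing centroids, again assuming Euclidean distance.

```python
import numpy as np

def repair_empty_cluster(X, centroids, empty_idx):
    # The point farthest from its nearest existing centroid becomes the new center
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    farthest = dists.min(axis=1).argmax()
    centroids[empty_idx] = X[farthest]
    return centroids
```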
K-medoids • Input: the number of clusters K and the collection of n instances • Output: a set of k clusters that minimizes the sum of the dissimilarities of all the instances to their nearest medoids • Method: • Arbitrarily choose k instances as the initial medoids • Repeat • (Re)assign each remaining instance to the cluster with the nearest medoid • Randomly select a non-medoid instance, Or • Compute the total cost, S, of swapping a current medoid Oj with Or • If S < 0, swap Oj with Or to form the new set of k medoids • Until no change
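A sketch of the swap test in the loop above: the total cost of a configuration is the sum of distances from each instance to its nearest medoid, and a candidate swap of medoid Oj for non-medoid Or is accepted only if it lowers that cost (S < 0). Euclidean distance and the helper names are assumptions for illustration.

```python
import numpy as np

def total_cost(X, medoid_idx):
    # Sum of distances from each instance to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def try_swap(X, medoid_idx, j, r):
    # Cost change S of swapping current medoid index j for non-medoid index r
    candidate = [r if m == j else m for m in medoid_idx]
    S = total_cost(X, candidate) - total_cost(X, medoid_idx)
    return (candidate, S) if S < 0 else (list(medoid_idx), S)
```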
K-means vs. K-medoids • Both require K to be specified in the input • K-medoids is less influenced by outliers in the data • K-medoids is computationally more expensive • Both methods assign each instance to exactly one cluster • Fuzzy partitioning techniques relax this • Partitioning methods are generally good at finding spherical clusters • They are suitable for small and medium-sized data sets • Extensions are required for working with large data sets