Cluster Analysis and K-means Algorithm for Grouping Data

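Cluster analysis • The concept of clustering • Centroid-based (k-centroid) clustering methods • The K-means method • Properties of K-means • C-means • Hierarchical clustering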

Cluster Analysis • Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups • Intra-cluster distances are minimized • Inter-cluster distances are maximized

Applications of Cluster Analysis • Understanding • Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations • Summarization • Reduce the size of large data sets (Figure: clustering precipitation in Australia)

What is NOT Cluster Analysis? • Supervised classification • Have class label information • Simple segmentation • Dividing students into different registration groups alphabetically, by last name • Results of a query • Groupings are a result of an external specification • Graph partitioning • Some mutual relevance and synergy, but areas are not identical

Notion of a Cluster can be Ambiguous • How many clusters? (Figure panels: the same points shown as two clusters, four clusters, and six clusters)

Types of Clusterings • A clustering is a set of clusters • Important distinction between hierarchical and partitional sets of clusters • Partitional clustering (flat) • A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset • Hierarchical clustering • A set of nested clusters organized as a hierarchical tree

A Partitional Clustering (Figure panels: original points; a partitional clustering)

Hierarchical Clustering (Figure panels: traditional hierarchical clustering with its traditional dendrogram; non-traditional hierarchical clustering with its non-traditional dendrogram)

Types of Clusters: Well-Separated • Well-separated clusters • A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. (Figure: 3 well-separated clusters)

Types of Clusters: Center-Based • Center-based • A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its own cluster than to the center of any other cluster • The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster (Figure: 4 center-based clusters)
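To make the centroid/medoid distinction concrete, here is a minimal Python sketch. The function names and the sample points are illustrative assumptions, and "most representative" is interpreted here as the data point with the smallest total Euclidean distance to the other points, which is one common definition rather than the only one.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """The actual data point with the smallest total distance to all points."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Hypothetical cluster with one far-away point.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0), (8.0, 8.0)]
print(centroid(cluster))  # (3.0, 3.25): pulled toward the far point, not a data point
print(medoid(cluster))    # (2.0, 1.0): always one of the original points
```

The centroid need not coincide with any data point, while the medoid always does, which makes the medoid less sensitive to a single distant point.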

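K-means algorithm • Step 1. Arbitrarily choose k objects as the initial cluster centers • Step 2. Repeat • Step 3. For each document d, find the center x nearest to d and assign d to the cluster represented by x • Step 4. For each cluster, recompute the mean of the features of the points in that cluster and use this mean as the new center • Step 5. Until the clusters no longer change, i.e., the process ends when no object is reassigned.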
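The five steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the use of Euclidean distance, and the choice to keep the old center when a cluster becomes empty are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: arbitrarily choose k objects as the initial centers.
    centers = rng.sample(points, k)
    assignment = None
    # Step 2: repeat.
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step 5: stop when no point was reassigned.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

# Hypothetical data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centers, labels = kmeans(data, k=2)
print(centers)
print(labels)
```

Which cluster receives index 0 depends on the random initial choice, so only the grouping of the labels is meaningful, not their numeric values.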

K-means Clustering – Details • Initial centroids are often chosen randomly. • Clusters produced vary from one run to another. • The centroid is (typically) the mean of the points in the cluster. • ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc. • K-means will converge for the common similarity measures mentioned above. • Most of the convergence happens in the first few iterations. • Often the stopping condition is changed to ‘Until relatively few points change clusters’ • Complexity is O( n * K * I * d ) • n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

The K-Means Clustering Method • Example (K=2) • Arbitrarily choose K objects as initial cluster centers • Assign each object to the most similar center • Update the cluster means • Reassign the objects and update the cluster means again (Figure: scatter plots on axes from 0 to 10 showing each step)

Two different K-means Clusterings (Figure panels: original points; optimal clustering; sub-optimal clustering)

Importance of Choosing Initial Centroids

Evaluating K-means Clusters • Most common measure is Sum of Squared Error (SSE) • For each point, the error is the distance to the nearest cluster center • To get SSE, we square these errors and sum them: $\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(m_i, x)^2$ • $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$ • It can be shown that $m_i$ corresponds to the center (mean) of the cluster • Given two clusterings, we can choose the one with the smaller error.
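As a small illustration of SSE, the sketch below sums the squared Euclidean distance from each point to the center of the cluster it is assigned to. The function name `sse` and the two hand-made clusterings are assumptions for the example.

```python
import math

def sse(points, centers, assignment):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[a]) ** 2 for p, a in zip(points, assignment))

# Hypothetical one-dimensional data embedded in 2-D.
pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]

# Clustering A: {0, 2} and {10, 12}, centers at the cluster means.
sse_a = sse(pts, [(1.0, 0.0), (11.0, 0.0)], [0, 0, 1, 1])
# Clustering B: {0} and {2, 10, 12}, centers at the cluster means.
sse_b = sse(pts, [(0.0, 0.0), (8.0, 0.0)], [0, 1, 1, 1])

print(sse_a, sse_b)  # 4.0 56.0, so clustering A is preferred
```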

Importance of Choosing Initial Centroids …

Problems with Selecting Initial Points • If there are K ‘real’ clusters, then the chance of selecting one centroid from each cluster is small. • The chance is relatively small when K is large • If the clusters are the same size, n, then $P = \frac{K!\,n^K}{(Kn)^K} = \frac{K!}{K^K}$ • For example, if K = 10, then probability = $10!/10^{10} \approx 0.00036$ • Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t • Consider an example of five pairs of clusters
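The K = 10 figure can be checked with a few lines of Python. The sketch assumes K equal-size clusters and centroids drawn uniformly at random, in which case the probability of getting exactly one centroid per cluster is K!/K^K; the function name is illustrative.

```python
import math

def prob_one_centroid_per_cluster(k):
    """Probability K!/K^K of drawing one initial centroid from each of K equal-size clusters."""
    return math.factorial(k) / k ** k

for k in (2, 5, 10):
    print(k, prob_one_centroid_per_cluster(k))
```

For K = 10 this gives about 0.00036, matching the value on the slide, and the probability keeps shrinking rapidly as K grows.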

Solutions to Initial Centroids Problem • Multiple runs, choose the one with minimum SSE • Helps, but probability is not on your side • Sample and use hierarchical clustering to determine initial centroids • Select more than k initial centroids and then select among these initial centroids • Select the most widely separated • Postprocessing
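One possible reading of "select more than k initial centroids, then select the most widely separated" is a greedy farthest-first pass over the candidate set, sketched below. The function name, the choice of the first candidate as the starting point, and the sample candidates are assumptions; the slides do not specify the selection rule.

```python
import math

def widely_separated(candidates, k):
    """Greedily pick k candidates, each as far as possible from those already chosen."""
    chosen = [candidates[0]]  # assumption: start from the first candidate
    while len(chosen) < k:
        # Pick the candidate whose nearest already-chosen point is farthest away.
        nxt = max(candidates, key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

# Hypothetical oversampled candidates (5 candidates for k = 3).
candidates = [(0.0, 0.0), (0.5, 0.0), (10.0, 0.0), (10.0, 0.5), (5.0, 9.0)]
print(widely_separated(candidates, 3))  # [(0.0, 0.0), (5.0, 9.0), (10.0, 0.0)]
```

A greedy pass like this tends to favor outliers, so in practice it is usually applied to a sample rather than to the full data set.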

Handling Empty Clusters • Basic K-means algorithm can yield empty clusters • Several strategies for choosing a replacement centroid • Choose the point that contributes most to SSE • Choose a point from the cluster with the highest SSE • If there are several empty clusters, the above can be repeated several times.

Updating Centers Incrementally • In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid • An alternative is to update the centroids after each assignment (incremental approach) • Each assignment updates zero or two centroids • More expensive • Introduces an order dependency • Never get an empty cluster • Can use “weights” to change the impact
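The "zero or two centroids" remark can be illustrated with a running-mean update: when a point moves from one cluster to another, only those two means change, and each can be updated from its old mean and size without revisiting the other members. The function name and signature below are assumptions for the sketch, which also refuses a move that would empty the source cluster.

```python
def move_point(x, src_mean, src_n, dst_mean, dst_n):
    """Move point x from a source cluster to a destination cluster.

    Returns the updated (mean, size) of both clusters, computed from the
    old means and sizes only.
    """
    if src_n <= 1:
        raise ValueError("move would leave the source cluster empty")
    new_src = tuple((m * src_n - xi) / (src_n - 1) for m, xi in zip(src_mean, x))
    new_dst = tuple((m * dst_n + xi) / (dst_n + 1) for m, xi in zip(dst_mean, x))
    return new_src, src_n - 1, new_dst, dst_n + 1

# Hypothetical clusters: source {(0,0), (2,0), (7,0)} with mean (3,0),
# destination {(8,0), (10,0)} with mean (9,0). Move the point (7,0).
result = move_point((7.0, 0.0), (3.0, 0.0), 3, (9.0, 0.0), 2)
print(result)
```

If the point stays in its current cluster, no centroid changes, which is the "zero" case.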

Pre-processing and Post-processing • Pre-processing • Normalize the data • Eliminate outliers • Post-processing • Eliminate small clusters that may represent outliers • Split ‘loose’ clusters, i.e., clusters with relatively high SSE • Merge clusters that are ‘close’ and that have relatively low SSE • Can use these steps during the clustering process • ISODATA

Limitations of K-means • K-means has problems when clusters are of differing • Sizes • Densities • Non-globular shapes • K-means has problems when the data contains outliers.

Limitations of K-means: Differing Sizes (Figure panels: original points; K-means with 3 clusters)

Limitations of K-means: Differing Density (Figure panels: original points; K-means with 3 clusters)

Limitations of K-means: Non-globular Shapes (Figure panels: original points; K-means with 2 clusters)

Overcoming K-means Limitations • One solution is to use many clusters. • K-means then finds parts of the natural clusters, which need to be put together afterwards. (Figure panels: original points; K-means clusters)

Overcoming K-means Limitations (Figure panels: original points; K-means clusters)

Fuzzy c-means clustering • In fuzzy clustering, each point has a degree of belonging to clusters, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may be in the cluster to a lesser degree than points in the center of the cluster. For each point $x$ we have a coefficient $u_k(x)$ giving the degree of being in the $k$th cluster. Usually, the sum of those coefficients for any given $x$ is defined to be 1: $\forall x:\ \sum_{k} u_k(x) = 1$

With fuzzy c-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster: $c_k = \frac{\sum_x u_k(x)^m \, x}{\sum_x u_k(x)^m}$

The degree of belonging is related to the inverse of the distance to the cluster center: $u_k(x) = \frac{1}{D(c_k, x)}$

The coefficients are then normalized and fuzzified with a real parameter $m > 1$ so that their sum is 1: $u_k(x) = \frac{1}{\sum_j \left( D(c_k, x) / D(c_j, x) \right)^{2/(m-1)}}$, where $D(x, y)$ is the distance between $x$ and $y$

For m equal to 2, this is equivalent to normalising the coefficient linearly to make their sum 1. When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means.

C-means Algorithm • Choose a number of clusters. • Randomly assign to each point coefficients for being in the clusters. • Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than $\varepsilon$, the given sensitivity threshold): • Compute the centroid for each cluster, using the formula above. • For each point, compute its coefficients of being in the clusters, using the formula above.
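The loop above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function name and its parameters (`m`, `eps`, `max_iter`, `seed`) are hypothetical, the distance is Euclidean, centroids are weighted by the coefficients raised to the power m, and the coefficient update uses the commonly cited form with exponent 2/(m-1). A point that coincides with a center is given full membership there to avoid division by zero.

```python
import math
import random

def fuzzy_c_means(points, c, m=2.0, eps=1e-4, max_iter=200, seed=0):
    rng = random.Random(seed)
    # Random initial coefficients; each point's coefficients sum to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    dim = len(points[0])
    for _ in range(max_iter):
        # Centroid of each cluster: mean of all points weighted by u^m.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(len(points))]
            w_sum = sum(w)
            centers.append(tuple(
                sum(wi * p[d] for wi, p in zip(w, points)) / w_sum
                for d in range(dim)))
        # New coefficients from relative distances to the centers.
        new_u = []
        for p in points:
            dists = [math.dist(p, ck) for ck in centers]
            if min(dists) == 0.0:
                hits = [1.0 if dk == 0.0 else 0.0 for dk in dists]
                row = [h / sum(hits) for h in hits]
            else:
                row = [1.0 / sum((dk / dj) ** (2.0 / (m - 1.0)) for dj in dists)
                       for dk in dists]
            new_u.append(row)
        # Converged when no coefficient changed by more than eps.
        change = max(abs(a - b) for ra, rb in zip(u, new_u) for a, b in zip(ra, rb))
        u = new_u
        if change <= eps:
            break
    return centers, u

# Hypothetical data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centers, memberships = fuzzy_c_means(data, c=2)
for point, row in zip(data, memberships):
    print(point, [round(v, 3) for v in row])
```

Each printed row sums to 1, and the largest coefficient in a row indicates the cluster a point belongs to most strongly.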

The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means: the minimum found is a local minimum, and the results depend on the initial choice of weights.
