Understanding K-Means Clustering Algorithm in Data Analysis

Special Topics in Learning (Tópicos Especiais em Aprendizagem) • Reinaldo Bianchi • Centro Universitário da FEI • 2012

Lecture 4, Part B

The K-means algorithm

K-Means • A well-known algorithm for clustering patterns. • Used when the number of clusters can be defined: • Choose the desired number of clusters. • Choose the cluster centres and members so as to minimise the error. • Cannot be done by search: • too many parameters.

K-Means • Algorithm: • Fix the cluster centres. • Assign each point to the nearest cluster. • Recompute each cluster centre as the mean of the points it represents. • Repeat until the centres stop moving.

K-Means • Can be used with any attribute for which a distance can be computed…

Clustering • Partitioning Clustering Approach: • a typical clustering analysis approach that partitions the data set iteratively • construct a partition of a data set to produce several non-empty clusters (usually, the number of clusters is given in advance) • in principle, partitions are achieved by minimising the sum of squared distances in each cluster
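The partitioning criterion above can be written out as a formula. This is a sketch in notation that is not in the slides: $C_k$ is the set of points in cluster $k$, $\mu_k$ its mean, and $J$ the quantity being minimised.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```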

Clustering • Given K, find a partition into K clusters that optimises the chosen partitioning criterion: • Global optimum: exhaustively enumerate all partitions • Heuristic method: K-means algorithm (MacQueen’67): • each cluster is represented by the centre of the cluster, and the algorithm converges to stable cluster centres.

Algorithm Given the cluster number K, the K-means algorithm first sets the seed points (initialisation) and is then carried out in three steps: • Step 1: Assign each object to the cluster with the nearest seed point. • Step 2: Compute new seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., mean point, of the cluster). • Step 3: Go back to Step 1; stop when no assignment changes.
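The three steps can be sketched in plain Python. This is a minimal illustration, not the implementation used in the lecture: the function name `kmeans`, the parameters `seeds` and `max_iter`, and the choice of Euclidean distance via `math.dist` are assumptions made for the example.

```python
import math


def kmeans(points, seeds, max_iter=100):
    """Cluster `points` (tuples of numbers) around the given initial `seeds`.

    Hypothetical helper: returns (labels, centres), where labels[i] is the
    index of the cluster that points[i] was assigned to.
    """
    centres = [tuple(s) for s in seeds]  # initialisation: set seed points
    k = len(centres)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each object to the cluster with the nearest seed point.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points
        ]
        # Step 3: stop when no assignment has changed.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2: recompute each seed point as the centroid (mean) of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centres


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centres = kmeans(data, seeds=[data[0], data[1]])
print(labels, centres)
```

The seeds are passed in explicitly so that a run is reproducible. A random initialisation would draw them from the data instead.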

Example • Suppose we have 4 types of medicine and each has two attributes: • pH and • weight index. • Our goal is to group these objects into K=2 groups of medicines.

Example • [Figure: the four medicines A, B, C and D plotted by their two attributes]

Step 1: Use initial seed points for partitioning • Assign each object to the cluster with the nearest seed point (Euclidean distance).

Step 2: Compute new centroids of the current partition Knowing the members of each cluster, now we compute the new centroid of each group based on these new memberships.

Step 2: Renew membership based on new centroids Compute the distance of all objects to the new centroids Assign the membership to objects

Step 3: Repeat the first two steps until convergence • Knowing the members of each cluster, we now compute the new centroid of each group based on these new memberships.

Repeat the first two steps until convergence • Compute the distance of all objects to the new centroids. • Stop, since there is no new assignment.

K-means Demo • The user sets the number of clusters they’d like (e.g. K=5). • Randomly guess K cluster centre locations. • Each data point finds out which centre it’s closest to (thus each centre “owns” a set of data points). • Each centre finds the centroid of the points it owns • …and jumps there. • …Repeat until terminated!

K-means example in Matlab

K-means example on the iPad

Relevant Issues • Efficient in computation • O(tKn), where n is the number of objects, K is the number of clusters, and t is the number of iterations. Normally, K, t << n. • Local optimum • sensitive to the initial seed points • converges to a local optimum that may be an unwanted solution
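A small sketch of the sensitivity to seed points, using a made-up data set (the four corners of a 4-by-1 rectangle) that is not from the slides. The helper names `kmeans` and `sse` are hypothetical. Both runs converge, but to different partitions with different sums of squared distances.

```python
import math


def kmeans(points, seeds):
    """Plain K-means from explicit seed points; returns (labels, centres)."""
    centres, labels = [tuple(s) for s in seeds], None
    while True:
        new = [min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
               for p in points]
        if new == labels:  # no new assignment: converged
            return labels, centres
        labels = new
        for j in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))


def sse(points, labels, centres):
    """Sum of squared distances from each point to its cluster centre."""
    return sum(math.dist(p, centres[lab]) ** 2 for p, lab in zip(points, labels))


# Illustrative data: corners of a wide, flat rectangle.
corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Seeds on opposite short sides versus seeds on the same short side.
for seeds in ([(0.0, 0.0), (4.0, 0.0)], [(0.0, 0.0), (0.0, 1.0)]):
    labels, centres = kmeans(corners, seeds)
    print(seeds, labels, sse(corners, labels, centres))
```

The first pair of seeds splits the rectangle into left and right halves. The second splits it into bottom and top rows, which is stable but has a much larger error. A common remedy is to run several initialisations and keep the partition with the smallest error.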

Relevant Issues • Other problems • Need to specify K, the number of clusters, in advance • Unable to handle noisy data and outliers (addressed by the K-Medoids algorithm) • Not suitable for discovering clusters with non-convex shapes • Applicable only when a mean is defined, so what about categorical data? (addressed by the K-Modes algorithm)

Cluster Validity • With different initial conditions, the K-means algorithm may produce different partitions for a given data set. • Which partition is the “best” one for the given data set? • In theory, there is no answer to this question, as there is no ground truth available in unsupervised learning.

Cluster Validity • Example: the ratio of the total between-cluster distance to the total within-cluster distance: • Between-cluster distance (BCD): the distance between the means of two clusters • Within-cluster distance (WCD): the sum of all distances between the data points and the mean in a specific cluster • A large BCD:WCD ratio suggests good compactness inside clusters and good separability among different clusters!
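The ratio can be sketched as follows. The slide does not say how the totals are formed, so this example assumes that the total BCD sums the distance between means over every pair of clusters and that the total WCD sums the point-to-mean distances over all clusters. The names `centroid` and `bcd_wcd_ratio` are hypothetical.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Mean point of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def bcd_wcd_ratio(clusters):
    """Total between-cluster distance divided by total within-cluster distance.

    Assumes at least one cluster has spread; if every point sits exactly on
    its cluster mean, the WCD is zero and the division fails.
    """
    means = [centroid(c) for c in clusters]
    # BCD: distance between the means, summed over all pairs of clusters.
    bcd = sum(math.dist(m1, m2) for m1, m2 in combinations(means, 2))
    # WCD: distance from every point to the mean of its own cluster.
    wcd = sum(math.dist(p, m) for c, m in zip(clusters, means) for p in c)
    return bcd / wcd


# Two candidate partitions of the same four illustrative points.
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
loose = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
print(bcd_wcd_ratio(tight), bcd_wcd_ratio(loose))
```

The partition with compact, well-separated clusters gets the larger ratio, so the ratio can be used to compare the results of different initialisations.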

Conclusion • K-means algorithm is a simple yet popular method for clustering analysis • There are several variants of K-means to overcome its weaknesses • K-Medoids: resistance to noise and/or outliers • K-Modes: extension to categorical data clustering analysis • CLARA: dealing with large data sets • Mixture models (EM algorithm): handling uncertainty of clusters

And in Matlab?

And in Matlab? • Syntax: • IDX = kmeans(X,k) • Description: • Partitions the points in the n-by-p data matrix X into k clusters. • This iterative partitioning minimizes the sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances. • Returns an n-by-1 vector IDX containing the cluster indices of each point.

RANSAC

RANSAC • RANdom SAmple Consensus. • An alternative way of finding good points from which to fit the line. • Idea: • Choose a subset uniformly at random (the support points). • Fit the line to these points. • Everything that lies far from the fit is noise. • Repeat many times and choose the best fit.
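The idea can be sketched for line fitting in plain Python. This is an illustration under stated assumptions, not the lecture's code: each random subset has two points, "far" means a perpendicular distance above a threshold `tol`, and the best fit is the one with the most nearby points. The names `fit_line`, `ransac_line`, `n_iter`, `tol` and `seed` are hypothetical.

```python
import random


def fit_line(p, q):
    """Line through p and q as (a, b, c) with a*x + b*y + c = 0 and a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    a, b = a / norm, b / norm
    return a, b, -(a * x1 + b * y1)


def ransac_line(points, n_iter=100, tol=0.5, seed=0):
    """Return (line, inliers) for the candidate line with the most inliers."""
    rng = random.Random(seed)
    best_line, best_inliers = None, []
    for _ in range(n_iter):
        # Choose a subset uniformly at random: two points define a line.
        p, q = rng.sample(points, 2)
        if p == q:
            continue  # duplicate points do not define a line
        a, b, c = fit_line(p, q)
        # Points within `tol` of the line support it; the rest count as noise.
        inliers = [(x, y) for x, y in points if abs(a * x + b * y + c) <= tol]
        if len(inliers) > len(best_inliers):
            best_line, best_inliers = (a, b, c), inliers
    return best_line, best_inliers


# Ten points on the line y = x plus three outliers (illustrative data).
pts = [(float(i), float(i)) for i in range(10)]
pts += [(2.0, 9.0), (7.0, -3.0), (0.0, 8.0)]
line, inliers = ransac_line(pts)
print(line, len(inliers))
```

A fuller version would usually refit the line to all inliers of the winning sample, for example by least squares.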

RANSAC • Issues: • How many times should it run? • As few as possible… • How large should the subset be? • As small as possible… • What counts as close? • Estimating the order of magnitude is enough… • What is a good fit? • One where the number of nearby points is so large that it is unlikely that they are all noise.

RANSAC – Example • [Figure: two candidate line fits, one with 11 supports and one with 4 supports] • How many samples do we need to draw?

RANSAC – How many samples • How many samples do we need to ensure, with probability p, that at least one of the random samples of S points is free from outliers? (w: inlier probability)
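The transcript does not preserve the formula from this slide. The standard argument, assuming each point is independently an inlier with probability w, runs as follows: one sample of S points is outlier-free with probability w^S, so N independent samples all contain an outlier with probability (1 - w^S)^N. Requiring this to be at most 1 - p gives N ≥ log(1 - p) / log(1 - w^S). A small sketch, with the function name `ransac_num_samples` being hypothetical:

```python
import math


def ransac_num_samples(p, w, s):
    """Smallest N such that 1 - (1 - w**s)**N >= p.

    p: desired probability of drawing at least one outlier-free sample
    w: probability that a single point is an inlier
    s: number of points in each sample
    """
    if not (0.0 < p < 1.0 and 0.0 < w <= 1.0 and s >= 1):
        raise ValueError("need 0 < p < 1, 0 < w <= 1 and s >= 1")
    good = w ** s  # probability that one sample is all inliers
    if good == 1.0:
        return 1  # no outliers at all: a single sample is enough
    return max(1, math.ceil(math.log(1.0 - p) / math.log(1.0 - good)))


# Line fitting (s = 2) with half the points being outliers.
print(ransac_num_samples(p=0.99, w=0.5, s=2))
```

The count grows quickly with the sample size S, which is one reason to keep the subset as small as possible.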

The RANSAC Song

Conclusion

Conclusion • We have finished covering the purely statistical machine learning methods. • K-NN, Least Squares, PCA, LDA, k-Means • Starting with the next lecture, we will look at methods that are no longer statistical, but probabilistic.

Links • Examples taken from: • www.cs.manchester.ac.uk/ugt/FCOMP24111/materials/slides/K-means.ppt
