Pattern Recognition Chapter 8: Clustering Large Data Sets

First Semester 2013
Department of Computer Science, Faculty of Science, Chiang Mai University

Learning Objectives
• What is clustering?
• k-Means Algorithm
204453: Pattern Recognition

Clustering
• Clustering: the process of grouping a set of patterns
• Clusters: a partition of a given collection of patterns into cohesive groups
• Unsupervised: works on unlabelled patterns
• Supervised: works on labelled patterns
• Similar: patterns in the same cluster
• Dissimilar: patterns in different clusters

The input-output behavior of a clustering algorithm

The input-output behavior of clustering

Cluster Distance
• Intra-cluster distance: small (similarity: high)
• Inter-cluster distance: large (similarity: low)
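The contrast between intra-cluster and inter-cluster distance can be made concrete with a small sketch. The function names, the toy points and the choice of Euclidean distance are assumptions made for illustration; the slides do not specify them.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance (Python 3.8+); an assumed metric


def mean_intra_distance(cluster):
    """Average distance over all pairs of patterns inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def mean_inter_distance(cluster_a, cluster_b):
    """Average distance over all pairs with one pattern from each cluster."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


# Hypothetical toy clusters, not the data set from the slides.
left = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
right = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

intra_left = mean_intra_distance(left)
intra_right = mean_intra_distance(right)
inter = mean_inter_distance(left, right)

# For a good partition the intra values are small relative to the inter value.
print(intra_left, intra_right, inter)
```

For these two well-separated groups, both intra-cluster averages come out far smaller than the inter-cluster average, which is the property a good clustering aims for.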

A two-dimensional data set of 10 vectors (cont.)

The ij-th entry in the matrix is the distance between Xi and Xj (threshold = 5)
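A distance matrix of this kind can be sketched as follows. The points are hypothetical (the slide's 10 vectors are not reproduced here), Euclidean distance is an assumed metric, and reading the threshold as "pairs at most this far apart count as close" is one plausible interpretation rather than something the slide states.

```python
from math import dist  # Euclidean distance (Python 3.8+); an assumed metric


def distance_matrix(patterns):
    """Return a square matrix whose [i][j] entry is the distance from pattern i to pattern j."""
    return [[dist(p, q) for q in patterns] for p in patterns]


def within_threshold(matrix, threshold):
    """List index pairs (i, j), i < j, whose distance does not exceed the threshold."""
    n = len(matrix)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] <= threshold
    ]


# Hypothetical patterns, not the data set from the slides.
points = [(1.0, 1.0), (4.0, 4.0), (10.0, 1.0)]

D = distance_matrix(points)
close_pairs = within_threshold(D, threshold=5)

for row in D:
    print(["%.2f" % value for value in row])
print(close_pairs)
```

The matrix is symmetric with zeros on the diagonal, so only the entries above the diagonal need to be inspected when applying the threshold.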

K-Means Algorithm
STEP 1: Select k out of the given n patterns as the initial cluster centres. Assign each of the remaining n – k patterns to one of the k clusters; a pattern is assigned to its closest centre/cluster.
STEP 2: Compute the cluster centres based on the current assignment of patterns.

K-Means Algorithm (cont.)
STEP 3: Assign each of the n patterns to its closest centre/cluster.
STEP 4: If there is no change in the assignment of patterns to clusters during two successive iterations, then stop; else, go to step 2.
* Selecting the initial cluster centres is a very important issue: the final partition depends on this choice.
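Steps 1 to 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: Euclidean distance, the `max_iter` safety cap, the rule of keeping the old centre when a cluster becomes empty, and the toy data are all choices made here and are not given in the slides.

```python
from math import dist  # Euclidean distance (Python 3.8+); an assumed metric


def centroid(patterns):
    """Coordinate-wise mean of a non-empty list of patterns."""
    n = len(patterns)
    return tuple(sum(coords) / n for coords in zip(*patterns))


def closest(pattern, centres):
    """Index of the centre nearest to the given pattern."""
    return min(range(len(centres)), key=lambda c: dist(pattern, centres[c]))


def k_means(patterns, initial_indices, max_iter=100):
    # STEP 1: take k of the n patterns as initial centres, then assign
    # every pattern to its closest centre (the chosen ones map to themselves).
    centres = [patterns[i] for i in initial_indices]
    labels = [closest(p, centres) for p in patterns]

    for _ in range(max_iter):  # max_iter is an assumed safety cap
        # STEP 2: recompute each centre from the current assignment.
        for c in range(len(centres)):
            members = [p for p, lab in zip(patterns, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centre
                centres[c] = centroid(members)

        # STEP 3: reassign all n patterns to the closest centre.
        new_labels = [closest(p, centres) for p in patterns]

        # STEP 4: stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels

    return labels, centres


# Hypothetical data: two well-separated groups of three patterns each.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

labels, centres = k_means(data, initial_indices=(0, 3))
print(labels)
print(centres)
```

Passing the initial centres as indices into the data mirrors Step 1, where the centres are chosen from the given patterns. Trying different `initial_indices` on less clearly separated data is a simple way to see how strongly the result can depend on the starting centres.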

Optimal partition when A, D and F are the initial means

Reference
• M. Narasimha Murty and V. Susheela Devi, Pattern Recognition: An Algorithmic Approach (Undergraduate Topics in Computer Science), Springer, 2012
