
Data Mining CSCI 307, Spring 2019 Lecture 24



Presentation Transcript


  1. Data Mining CSCI 307, Spring 2019, Lecture 24: Clustering

  2. Clustering
  • Clustering techniques apply when there is no class to be predicted
  • Aim: divide instances into “natural” groups
  • As we've seen, clusters can be:
    • disjoint vs. overlapping
    • deterministic vs. probabilistic
    • flat vs. hierarchical
  • Look at a classic clustering algorithm: k-means
  • k-means clusters are disjoint, deterministic, and flat

  Probabilistic membership example from the slide (probability that each instance belongs to clusters 1–3; see also the sketch below):

      cluster   1     2     3
      a         0.4   0.1   0.5
      b         0.1   0.8   0.1

  (The slide also includes an illustration labeled “hierarchical”.)
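As a rough illustration (not part of the slides), the difference between deterministic and probabilistic membership shows up in how assignments are stored: a single cluster index per instance versus a distribution over the clusters, as in the a/b table above. A minimal numpy sketch:

```python
import numpy as np

# Hard (disjoint, deterministic) membership: one cluster index per instance.
hard = np.array([2, 1])          # instance a -> cluster 3, instance b -> cluster 2 (0-based)

# Soft (probabilistic) membership: one probability distribution per instance,
# matching the table above (rows a and b, columns = clusters 1..3).
soft = np.array([[0.4, 0.1, 0.5],
                 [0.1, 0.8, 0.1]])

assert np.allclose(soft.sum(axis=1), 1.0)   # each row sums to 1
print(soft.argmax(axis=1))                  # most probable cluster per instance: [2 1]
```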

  3. The k-means algorithm
  How to choose k (i.e., the number of clusters) is saved for Chapter 5, after we learn how to evaluate the success of machine learning.
  To cluster the data into k groups (k is a predefined number of clusters):
  0. Choose k cluster centers (at random)
  1. Assign instances to clusters (based on distance to the cluster centers)
  2. Compute the centroids (i.e., the mean of all the instances) of the clusters
  3. Go to step 1, until convergence
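A minimal sketch of these steps in Python/numpy (the function name `kmeans`, the convergence test, and the choice of initial centers as random instances are my assumptions, not taken from the slides):

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Plain k-means. X is an (n, d) array of instances, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # 0. Choose k cluster centers at random (here: k distinct instances).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # 1. Assign each instance to its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. Recompute each centroid as the mean of the instances assigned to it
        #    (an empty cluster simply keeps its old center).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 3. Repeat until the centers stop moving (convergence).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

In practice a library implementation such as scikit-learn's KMeans would be used; it runs the same loop but adds the k-means++ seeding and the faster distance computations discussed on the next slides.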

  4. Discussion
  • The algorithm minimizes the squared distance to the cluster centers
  • The result can vary significantly, depending on the initial choice of seeds (the initial cluster centers)
  • It can get trapped in a local minimum; see the example on the next slide
  • To increase the chance of finding the global optimum: restart with different random seeds
  • k-means++: a procedure for finding better seeds, and ultimately improving the clustering (sketched below)
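A hedged sketch of the k-means++ seeding idea (D²-weighted sampling of the seeds; the function name is mine, and the slide itself does not spell out the procedure):

```python
import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """k-means++ seeding: pick the first seed uniformly at random, then pick each
    further seed with probability proportional to its squared distance to the
    nearest seed chosen so far."""
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every instance to its closest existing seed.
        d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)
```

These seeds would then be handed to the plain k-means loop above; restarting from several such seedings and keeping the run with the lowest total squared distance combines both suggestions from the slide.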

  5. Discussion (figure: the same instances clustered from two different sets of initial cluster centers, illustrating how the choice of seeds can trap k-means in a local minimum)

  6. Need ==> Faster Distance Calculations
  Use kD-trees or ball trees to speed up the process of finding the closest cluster!
  • First, build the tree, which remains static, for all the data points
  • At each node, store the number of instances and the sum of all instances
  • In each iteration, descend the tree and find which cluster each node belongs to
  • We can stop descending as soon as we find that a node belongs entirely to a particular cluster
  • Use the statistics stored at the nodes to compute the new cluster centers (a simplified sketch follows below)
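A simplified sketch of this idea, assuming a kD-tree whose nodes store a count, a vector sum, and a bounding box (roughly the "filtering" approach; all class and function names are mine, and the dominance test is one common way to decide that a whole node belongs to a single cluster):

```python
import numpy as np

class Node:
    """kD-tree node that also stores the statistics the slide mentions:
    the number of instances below it and their vector sum."""
    LEAF_SIZE = 16

    def __init__(self, X):
        self.count = len(X)
        self.vsum = X.sum(axis=0)                        # sum of all instances in this subtree
        self.lo, self.hi = X.min(axis=0), X.max(axis=0)  # bounding box of the subtree
        if len(X) <= Node.LEAF_SIZE:
            self.points, self.left, self.right = X, None, None
        else:
            dim = int(np.argmax(self.hi - self.lo))      # split on the widest dimension
            order = np.argsort(X[:, dim])
            half = len(X) // 2
            self.points = None
            self.left, self.right = Node(X[order[:half]]), Node(X[order[half:]])

def closer_for_whole_box(cstar, c, lo, hi):
    """True if every point of the box [lo, hi] is closer to center cstar than to c."""
    corner = np.where(c > cstar, hi, lo)   # the box corner pushed furthest toward c
    return np.sum((corner - cstar) ** 2) < np.sum((corner - c) ** 2)

def accumulate(node, centers, sums, counts):
    """Credit each node's stored statistics to the cluster that owns it, descending
    only when no single cluster owns the node's whole bounding box."""
    mid = (node.lo + node.hi) / 2
    cstar = int(np.argmin(np.sum((centers - mid) ** 2, axis=1)))
    if all(closer_for_whole_box(centers[cstar], c, node.lo, node.hi)
           for j, c in enumerate(centers) if j != cstar):
        sums[cstar] += node.vsum            # whole subtree belongs to cluster cstar:
        counts[cstar] += node.count         # nothing below this node needs to be examined
    elif node.points is not None:           # leaf: assign its instances individually
        dists = np.linalg.norm(node.points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centers)):
            sums[j] += node.points[labels == j].sum(axis=0)
            counts[j] += np.count_nonzero(labels == j)
    else:                                   # internal node: keep descending
        accumulate(node.left, centers, sums, counts)
        accumulate(node.right, centers, sums, counts)

def tree_kmeans_step(root, centers):
    """One k-means update computed from the tree's stored statistics."""
    sums = np.zeros_like(centers, dtype=float)
    counts = np.zeros(len(centers))
    accumulate(root, centers, sums, counts)
    new_centers = centers.copy().astype(float)
    nonempty = counts > 0
    new_centers[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return new_centers
```

The tree (`root = Node(X)`) is built once and stays static; each k-means iteration calls `tree_kmeans_step`, and whenever the dominance test succeeds the whole subtree is credited to one cluster without being visited, which is exactly the "frontier" on the next slide.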

  7. No nodes need to be examined below the frontier

  8. Choosing the number of clusters
  Big question in practice: what is the right number of clusters, i.e., what is the right value for k?
  • We cannot simply optimize the squared distance on the training data to choose k: the squared distance decreases monotonically as k increases
  • We need some measure that balances distance against the complexity of the model, e.g., one based on the MDL principle (covered later)
  Finding the right-size model using MDL becomes easier when applying a recursive version of k-means (bisecting k-means):
  • Compute A: the information required to store the data centroid, and the location of each instance with respect to this centroid
  • Split the data into two clusters using 2-means
  • Compute B: the information required to store the two new cluster centroids, and the location of each instance with respect to these two
  • If A > B, split the data and recurse (just like in other tree learners); a sketch follows below
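A hedged sketch of the bisecting recursion, reusing the `kmeans` function from the slide 3 sketch. The real MDL costs A and B measure the bits needed to code the centroid(s) plus each instance's position relative to them; the `description_length` function below is a crude Gaussian-coding stand-in for that calculation, not the textbook's exact formula:

```python
import numpy as np

def description_length(X):
    """Crude stand-in for the MDL cost of one cluster: bits to code each instance's
    offset from the centroid under a spherical Gaussian, plus a flat charge for
    storing the centroid itself. NOT the textbook's exact formula."""
    n, d = X.shape
    var = np.sum((X - X.mean(axis=0)) ** 2) / (n * d) + 1e-9
    return 0.5 * n * d * np.log2(2 * np.pi * np.e * var) + 32.0 * d

def bisecting_kmeans(X, min_size=4):
    """Recursive 2-means: split a cluster only while the split encoding B
    is cheaper than the unsplit encoding A, as described on the slide."""
    if len(X) < min_size:
        return [X]
    A = description_length(X)                     # cost of keeping the data together
    centers, labels = kmeans(X, 2)                # the kmeans sketch from slide 3
    left, right = X[labels == 0], X[labels == 1]
    if len(left) == 0 or len(right) == 0:
        return [X]
    B = description_length(left) + description_length(right)
    if A > B:                                     # splitting saves bits: recurse on both halves
        return bisecting_kmeans(left, min_size) + bisecting_kmeans(right, min_size)
    return [X]
```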
