
Data Mining CSCI 307, Spring 2019 Lecture 24



Presentation Transcript


  1. Data Mining CSCI 307, Spring 2019, Lecture 24: Clustering

  2. Clustering
  • Clustering techniques apply when there is no class to be predicted
  • Aim: divide instances into “natural” groups
  • As we've seen, clusters can be:
    • disjoint vs. overlapping
    • deterministic vs. probabilistic
    • flat vs. hierarchical
  • Look at a classic clustering algorithm: k-means
  • k-means clusters are disjoint, deterministic, and flat

  Probabilistic membership example from the slide (probability that each instance belongs to clusters 1–3; see also the sketch below):

      cluster   1     2     3
      a         0.4   0.1   0.5
      b         0.1   0.8   0.1

  (The slide also includes an illustration labeled “hierarchical”.)
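As a rough illustration (not part of the slides), the difference between deterministic and probabilistic membership shows up in how assignments are stored: a single cluster index per instance versus a distribution over the clusters, as in the a/b table above. A minimal numpy sketch:

```python
import numpy as np

# Hard (disjoint, deterministic) membership: one cluster index per instance.
hard = np.array([2, 1])          # instance a -> cluster 3, instance b -> cluster 2 (0-based)

# Soft (probabilistic) membership: one probability distribution per instance,
# matching the table above (rows a and b, columns = clusters 1..3).
soft = np.array([[0.4, 0.1, 0.5],
                 [0.1, 0.8, 0.1]])

assert np.allclose(soft.sum(axis=1), 1.0)   # each row sums to 1
print(soft.argmax(axis=1))                  # most probable cluster per instance: [2 1]
```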

  3. The k-means algorithm
  How to choose k (i.e., the number of clusters) is saved for Chapter 5, after we learn how to evaluate the success of machine learning.
  To cluster the data into k groups (k is a predefined number of clusters):
  0. Choose k cluster centers (at random)
  1. Assign instances to clusters (based on distance to the cluster centers)
  2. Compute the centroids (i.e., the mean of all the instances) of the clusters
  3. Go to step 1, until convergence
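A minimal sketch of these steps in Python/numpy (the function name `kmeans`, the convergence test, and the choice of initial centers as random instances are my assumptions, not taken from the slides):

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Plain k-means. X is an (n, d) array of instances, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # 0. Choose k cluster centers at random (here: k distinct instances).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # 1. Assign each instance to its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. Recompute each centroid as the mean of the instances assigned to it
        #    (an empty cluster simply keeps its old center).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 3. Repeat until the centers stop moving (convergence).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

In practice a library implementation such as scikit-learn's KMeans would be used; it runs the same loop but adds the k-means++ seeding and the faster distance computations discussed on the next slides.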

  4. Discussion
  • The algorithm minimizes the squared distance to the cluster centers
  • The result can vary significantly, depending on the initial choice of seeds (the initial cluster centers)
  • It can get trapped in a local minimum; see the example on the next slide
  • To increase the chance of finding the global optimum: restart with different random seeds
  • k-means++: a procedure for finding better seeds, and ultimately improving the clustering (sketched below)
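A hedged sketch of the k-means++ seeding idea (D²-weighted sampling of the seeds; the function name is mine, and the slide itself does not spell out the procedure):

```python
import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """k-means++ seeding: pick the first seed uniformly at random, then pick each
    further seed with probability proportional to its squared distance to the
    nearest seed chosen so far."""
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every instance to its closest existing seed.
        d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)
```

These seeds would then be handed to the plain k-means loop above; restarting from several such seedings and keeping the run with the lowest total squared distance combines both suggestions from the slide.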

  5. Discussion (figure: the same instances clustered from two different sets of initial cluster centers, illustrating how the choice of seeds can trap k-means in a local minimum)

  6. Need ==> Faster Distance Calculations
  Use kD-trees or ball trees to speed up the process of finding the closest cluster!
  • First, build the tree, which remains static, for all the data points
  • At each node, store the number of instances and the sum of all instances
  • In each iteration, descend the tree and find which cluster each node belongs to
  • We can stop descending as soon as we find that a node belongs entirely to a particular cluster
  • Use the statistics stored at the nodes to compute the new cluster centers (a simplified sketch follows below)
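A simplified sketch of this idea, assuming a kD-tree whose nodes store a count, a vector sum, and a bounding box (roughly the "filtering" approach; all class and function names are mine, and the dominance test is one common way to decide that a whole node belongs to a single cluster):

```python
import numpy as np

class Node:
    """kD-tree node that also stores the statistics the slide mentions:
    the number of instances below it and their vector sum."""
    LEAF_SIZE = 16

    def __init__(self, X):
        self.count = len(X)
        self.vsum = X.sum(axis=0)                        # sum of all instances in this subtree
        self.lo, self.hi = X.min(axis=0), X.max(axis=0)  # bounding box of the subtree
        if len(X) <= Node.LEAF_SIZE:
            self.points, self.left, self.right = X, None, None
        else:
            dim = int(np.argmax(self.hi - self.lo))      # split on the widest dimension
            order = np.argsort(X[:, dim])
            half = len(X) // 2
            self.points = None
            self.left, self.right = Node(X[order[:half]]), Node(X[order[half:]])

def closer_for_whole_box(cstar, c, lo, hi):
    """True if every point of the box [lo, hi] is closer to center cstar than to c."""
    corner = np.where(c > cstar, hi, lo)   # the box corner pushed furthest toward c
    return np.sum((corner - cstar) ** 2) < np.sum((corner - c) ** 2)

def accumulate(node, centers, sums, counts):
    """Credit each node's stored statistics to the cluster that owns it, descending
    only when no single cluster owns the node's whole bounding box."""
    mid = (node.lo + node.hi) / 2
    cstar = int(np.argmin(np.sum((centers - mid) ** 2, axis=1)))
    if all(closer_for_whole_box(centers[cstar], c, node.lo, node.hi)
           for j, c in enumerate(centers) if j != cstar):
        sums[cstar] += node.vsum            # whole subtree belongs to cluster cstar:
        counts[cstar] += node.count         # nothing below this node needs to be examined
    elif node.points is not None:           # leaf: assign its instances individually
        dists = np.linalg.norm(node.points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centers)):
            sums[j] += node.points[labels == j].sum(axis=0)
            counts[j] += np.count_nonzero(labels == j)
    else:                                   # internal node: keep descending
        accumulate(node.left, centers, sums, counts)
        accumulate(node.right, centers, sums, counts)

def tree_kmeans_step(root, centers):
    """One k-means update computed from the tree's stored statistics."""
    sums = np.zeros_like(centers, dtype=float)
    counts = np.zeros(len(centers))
    accumulate(root, centers, sums, counts)
    new_centers = centers.copy().astype(float)
    nonempty = counts > 0
    new_centers[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return new_centers
```

The tree (`root = Node(X)`) is built once and stays static; each k-means iteration calls `tree_kmeans_step`, and whenever the dominance test succeeds the whole subtree is credited to one cluster without being visited, which is exactly the "frontier" on the next slide.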

  7. No nodes need to be examined below the frontier

  8. Choosing the number of clusters
  Big question in practice: what is the right number of clusters, i.e., what is the right value for k?
  • We cannot simply optimize the squared distance on the training data to choose k: the squared distance decreases monotonically as k increases
  • We need some measure that balances distance against the complexity of the model, e.g., one based on the MDL principle (covered later)
  Finding the right-size model using MDL becomes easier when applying a recursive version of k-means (bisecting k-means):
  • Compute A: the information required to store the data centroid, and the location of each instance with respect to this centroid
  • Split the data into two clusters using 2-means
  • Compute B: the information required to store the two new cluster centroids, and the location of each instance with respect to these two
  • If A > B, split the data and recurse (just like in other tree learners); a sketch follows below
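A hedged sketch of the bisecting recursion, reusing the `kmeans` function from the slide 3 sketch. The real MDL costs A and B measure the bits needed to code the centroid(s) plus each instance's position relative to them; the `description_length` function below is a crude Gaussian-coding stand-in for that calculation, not the textbook's exact formula:

```python
import numpy as np

def description_length(X):
    """Crude stand-in for the MDL cost of one cluster: bits to code each instance's
    offset from the centroid under a spherical Gaussian, plus a flat charge for
    storing the centroid itself. NOT the textbook's exact formula."""
    n, d = X.shape
    var = np.sum((X - X.mean(axis=0)) ** 2) / (n * d) + 1e-9
    return 0.5 * n * d * np.log2(2 * np.pi * np.e * var) + 32.0 * d

def bisecting_kmeans(X, min_size=4):
    """Recursive 2-means: split a cluster only while the split encoding B
    is cheaper than the unsplit encoding A, as described on the slide."""
    if len(X) < min_size:
        return [X]
    A = description_length(X)                     # cost of keeping the data together
    centers, labels = kmeans(X, 2)                # the kmeans sketch from slide 3
    left, right = X[labels == 0], X[labels == 1]
    if len(left) == 0 or len(right) == 0:
        return [X]
    B = description_length(left) + description_length(right)
    if A > B:                                     # splitting saves bits: recurse on both halves
        return bisecting_kmeans(left, min_size) + bisecting_kmeans(right, min_size)
    return [X]
```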
