
Clustering

Clustering. Gilad Lerman, Math Department, UMN. Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore.


Presentation Transcript


  1. Clustering Gilad Lerman Math Department, UMN Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore

  2. What is Clustering? • Partitioning data into classes with high intra-class similarity and low inter-class similarity • Is it well-defined?

  3. What is Similarity? • Clearly a subjective or problem-dependent measure

  4. How Similar Are Clusters? • Ex1: Two clusters or one cluster?

  5. How Similar Are Clusters? • Ex2: Cluster or outliers?

  6. Sum-Squares Intra-class Similarity • Given a cluster $S$ • Mean: $c = \frac{1}{|S|}\sum_{x \in S} x$ • Within Cluster Sum of Squares: $\mathrm{WCSS}(S) = \sum_{x \in S} \|x - c\|_2^2$ • Note that
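A minimal sketch of the two quantities on this slide, assuming the standard definitions: the cluster mean is the coordinate-wise average, and the within-cluster sum of squares is the sum of squared Euclidean distances to a center. The names `cluster_mean` and `wcss` are illustrative and not taken from the slides.

```python
def cluster_mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def wcss(points, center=None):
    """Sum of squared Euclidean distances from each point to `center`.

    If no center is given, the cluster mean is used.
    """
    if center is None:
        center = cluster_mean(points)
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, center))
        for p in points
    )


# Four corners of a square; the mean is the middle of the square.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
c = cluster_mean(cluster)
print(c, wcss(cluster))
# Measuring from a corner instead of the mean gives a larger sum.
print(wcss(cluster, center=(0.0, 0.0)))
```

Comparing the two printed sums illustrates a standard fact: among all candidate centers, the mean gives the smallest sum of squared distances.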

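  7. Within Cluster Sum of Squares • For a set of clusters $S=\{S_1,\dots,S_K\}$ with centers $c_1,\dots,c_K$: $\mathrm{WCSS}(S)=\sum_{k=1}^{K}\sum_{x\in S_k}\|x-c_k\|_2^2$ • Can use $\|x-c_k\|_1$ instead of $\|x-c_k\|_2^2$ • So get Within Clusters Manhattan Distance (WCMD) • Question: how to compute/estimate c?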
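One standard answer to the question about c, offered here as an assumption about what the slide intends: for squared Euclidean distance the best center is the coordinate-wise mean, and for Manhattan distance it is the coordinate-wise median. The sketch below computes both totals over a set of clusters; `center`, `wcss`, and `wcmd` are hypothetical helper names.

```python
from statistics import mean, median


def center(points, how):
    """Coordinate-wise mean or median of a list of equal-length points."""
    agg = mean if how == "mean" else median
    return [agg(coord) for coord in zip(*points)]


def wcss(clusters):
    """Total squared Euclidean distance of points to their cluster means."""
    total = 0.0
    for pts in clusters:
        c = center(pts, "mean")
        total += sum(sum((x - ci) ** 2 for x, ci in zip(p, c)) for p in pts)
    return total


def wcmd(clusters):
    """Total Manhattan (L1) distance of points to their cluster medians."""
    total = 0.0
    for pts in clusters:
        c = center(pts, "median")
        total += sum(sum(abs(x - ci) for x, ci in zip(p, c)) for p in pts)
    return total


# Two toy clusters; the second contains an outlier at (10, 10).
clusters = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    [(5.0, 5.0), (6.0, 5.0), (10.0, 10.0)],
]
print(wcss(clusters), wcmd(clusters))
print(center(clusters[1], "mean"), center(clusters[1], "median"))
```

In the second cluster the outlier pulls the mean away from the two nearby points, while the median stays at one of them, which is one reason K-medians is often described as more robust to outliers.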

  8. Minimizing WCSS • Precise minimization is “NP-hard” • Approximate minimization for WCSS by K-means • Approximate minimization for WCMD by K-medians

  9. The K-means Algorithm • Input: Data & number of clusters (K) • Randomly guess locations of K cluster centers • For each data point – assign the nearest center • Move each center to the mean of its assigned points • Repeat till convergence ….
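The steps above can be sketched in plain Python as follows. This is a minimal illustration of the usual Lloyd-style iteration, not the code behind the slides; the `seed` and `max_iter` parameters, the choice of initial centers among the data points, and the rule for empty clusters are assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd-style K-means on a list of equal-length tuples.

    Returns (centers, labels).
    """
    rng = random.Random(seed)
    # Random initial guess: pick k distinct data points as centers.
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each center to the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, labels = kmeans(data, k=2)
print(sorted(centers), labels)
```

On this toy data the two groups are far apart, so the iteration separates them; on less clear-cut data the result can depend on the initial guess.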

  10. Demonstration: K-means/medians • Applet

  11. K-means: Pros and Cons • Pros • Often fast • Often terminates at a local minimum • Cons • May not obtain the global minimum • Depends on initialization • Need to specify K • Sensitive to outliers • Sensitive to variations in sizes and densities of clusters • Not suitable for non-convex shapes • Does not apply directly to categorical data

  12. Spectral Clustering • Idea: embed data for easy clustering • Construct weights W based on proximity (then normalize W) • Embed using eigenvectors of W
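A minimal NumPy sketch of the embedding idea. The slide does not specify the weight formula or the normalization in this transcript, so the choices here are assumptions: Gaussian weights with bandwidth `sigma`, the symmetric normalization D^{-1/2} W D^{-1/2}, and row-normalized top eigenvectors. The function name `spectral_embedding` is hypothetical.

```python
import numpy as np


def spectral_embedding(X, k, sigma=1.0):
    """Embed rows of X using the top-k eigenvectors of a normalized weight matrix.

    Assumptions (not stated on the slide): Gaussian weights with
    bandwidth `sigma`, and the normalization D^{-1/2} W D^{-1/2}.
    """
    # Pairwise squared Euclidean distances.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    d = W.sum(axis=1)
    W_norm = W / np.sqrt(np.outer(d, d))
    # eigh returns eigenvalues in ascending order; keep the k largest.
    _, vecs = np.linalg.eigh(W_norm)
    Y = vecs[:, -k:]
    # Row-normalize so each embedded point lies on the unit sphere.
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)


# Two concentric rings: a non-convex shape in the original coordinates.
theta = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
inner = np.c_[np.cos(theta), np.sin(theta)]
outer = 4.0 * inner
X = np.vstack([inner, outer])
Y = spectral_embedding(X, k=2, sigma=0.5)
print(Y[:3])
print(Y[-3:])
```

In the embedded coordinates each ring collapses to (nearly) a single point, and the two points are far apart, so a simple method such as K-means can then separate them.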

  13. Clustering vs. Classification • Clustering – find classes in an unsupervised way (often K is given though) • Classification – labels of clusters are given for some data points (supervised learning)

  14. Data 1: Face images • Facial images (e.g., of persons 5,8,10) live on different “planes” in the “image space” • They are often well-separated so that simple clustering can apply to them (but not always…) • Question: What is the high-dimensional image space? • Question: How can we present high-dim. data in 3D?

  15. Data 2: Iris Data Set • 50 samples from each of 3 species: Setosa, Versicolor, Virginica • 4 features per sample: length & width of sepal and petal

  16. Data 2: Iris Data Set

  17. Data 2: Iris Data Set • Setosa is clearly separated from the 2 others • Can’t separate Virginica and Versicolor (need training set as done by Fisher in 1936) • Question: What are other ways to visualize?

  18. Data 3: Color-based Compression of Images • Applet • Question: What are the actual data points? • Question: What does the error mean?
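One common reading of the two questions, given here as an assumption since the applet itself is not part of this transcript: the data points are the pixels, each an RGB triple, and the error is the mean squared distance between each original color and the cluster color that replaces it. A minimal NumPy sketch on a synthetic image follows; `quantize_colors`, `n_iter`, and `seed` are hypothetical names.

```python
import numpy as np


def nearest(pixels, centers):
    """Index of the nearest center for every pixel."""
    d2 = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def quantize_colors(image, k, n_iter=20, seed=0):
    """Replace each pixel by the nearest of k colors found by K-means.

    `image` is an (H, W, 3) float array. Returns (compressed, mse).
    """
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, 3)  # data points: one RGB triple per pixel
    # Initial centers: k distinct pixels chosen at random.
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest(pixels, centers)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters unchanged
                centers[j] = pixels[labels == j].mean(axis=0)
    labels = nearest(pixels, centers)
    compressed = centers[labels].reshape(image.shape)
    # Error: mean squared color distance per pixel.
    mse = float(np.mean(np.sum((image - compressed) ** 2, axis=-1)))
    return compressed, mse


# Synthetic 16x16 image with random colors in [0, 1).
image = np.random.default_rng(1).random((16, 16, 3))
one_color, mse1 = quantize_colors(image, k=1)
four_colors, mse4 = quantize_colors(image, k=4)
print(mse1, mse4)
```

With this reading, the compressed image needs only K colors plus one small index per pixel, and the error is the within-cluster sum of squares divided by the number of pixels.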

  19. Some methods for # of Clusters (with online codes) • Gap statistic • Model-based clustering • G-means • X-means • Data-spectroscopic clustering • Self-tuning clustering

  20. Your mission • Learn about clustering (theoretical results, algorithms, codes) • Focus: methods for determining # of clusters • Understand details • Compare using artificial and real data • Conclude good/bad scenarios for each (prove?) • Come up with new/improved methods • Summarize info: literature survey and possibly new/improved demos/applets • We can suggest additional questions tailored to your interest
