Clustering





Gilad Lerman

Math Department, UMN

Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore

What is Clustering?
  • Partitioning data into classes with
    • high intra-class similarity
    • low inter-class similarity
  • Is it well-defined?
What is Similarity?
  • Clearly a subjective, problem-dependent measure
How Similar Are Clusters?
  • Ex1: Two clusters or one cluster?
How Similar Are Clusters?
  • Ex2: A cluster or outliers?
Sum-Squares Intra-class Similarity
  • Given Cluster


Within Cluster Sum of Squares:

  • Note that
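The formulas on this slide are missing from the transcript. A standard formulation is given here as an assumption, not as the author's exact notation (the symbols S, x_i, N, μ and c are illustrative). The last line is the well-known fact that the cluster mean is the center minimizing the sum of squared distances:

```latex
\begin{align*}
S &= \{x_1, \dots, x_N\} \subset \mathbb{R}^d, \qquad
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \\
\mathrm{WCSS}(S) &= \sum_{i=1}^{N} \lVert x_i - \mu \rVert_2^2 \\
\mu &= \operatorname*{arg\,min}_{c \in \mathbb{R}^d} \sum_{i=1}^{N} \lVert x_i - c \rVert_2^2
\end{align*}
```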
Within Cluster Sum of Squares
  • For a set of clusters S = {S1, …, SK}
  • Can use the Manhattan (ℓ1) distance instead
  • This gives the Within-Cluster Manhattan Distance (WCMD)
  • Question: how to compute/estimate c?
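A minimal sketch of the two objectives, assuming NumPy and a labeling that is already given. The function names `wcss` and `wcmd` are illustrative. One standard answer to the question about c is assumed here: the mean for the squared Euclidean case and the coordinate-wise median for the Manhattan case, since each minimizes its own objective.

```python
import numpy as np

def wcss(points, labels):
    """Within-cluster sum of squares: squared Euclidean distance to each cluster mean."""
    total = 0.0
    for k in np.unique(labels):
        cluster = points[labels == k]
        center = cluster.mean(axis=0)            # mean minimizes squared distances
        total += ((cluster - center) ** 2).sum()
    return total

def wcmd(points, labels):
    """Within-cluster Manhattan distance: l1 distance to each coordinate-wise median."""
    total = 0.0
    for k in np.unique(labels):
        cluster = points[labels == k]
        center = np.median(cluster, axis=0)      # median minimizes l1 distances
        total += np.abs(cluster - center).sum()
    return total

# Toy data (made up for illustration): five points on a line, two clusters.
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0], [20.0, 0.0]])
labels = np.array([0, 0, 1, 1, 1])

print(wcss(points, labels))  # 58.0
print(wcmd(points, labels))  # 12.0
```

In the second cluster the point at 20 pulls the mean to 14 but leaves the median at 12, which hints at why the Manhattan version is less sensitive to outliers.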
Minimizing WCSS
  • Precise minimization is “NP-hard”
  • Approximate minimization for WCSS by K-means
  • Approximate minimization for WCMD by K-medians
The K-means Algorithm
  • Input: data and the number of clusters (K)
  • Randomly guess the locations of K cluster centers
  • Assign each data point to its nearest center
  • Move each center to the mean of the points assigned to it
  • Repeat the last two steps until convergence
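The steps above can be sketched as follows, assuming NumPy. This is an illustrative version of the usual Lloyd iteration, not code from the course. Choosing the initial centers as K random data points and keeping an empty cluster's old center are assumptions made for this sketch.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Alternate between nearest-center assignment and mean updates."""
    rng = np.random.default_rng(seed)
    # Random guess: k distinct data points serve as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):    # centers stopped moving
            break
        centers = new_centers
    return centers, labels

# Toy data (made up): two well-separated groups of three points each.
demo = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
                 [100.0, 100.0], [101.0, 101.0], [102.0, 102.0]])
centers, labels = kmeans(demo, k=2)
print(labels)   # the first three points share one label, the last three the other
```

Which group receives label 0 depends on the random start, so only the grouping is meaningful, not the label values.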
K-means: Pros and Cons
  • Pros
    • Often fast
    • Often terminates at a local minimum
  • Cons
    • May not obtain the global minimum
    • Depends on initialization
    • Need to specify K
    • Sensitive to outliers
    • Sensitive to variations in sizes and densities of clusters
    • Not suitable for non-convex shapes
    • Does not apply directly to categorical data
Spectral Clustering

Idea: embed data for easy clustering

  • Construct weights based on proximity:

  • (Normalize W)

  • Embed using eigenvectors of W
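A sketch of one common variant of this recipe, assuming NumPy. The slide's weight formula is missing from the transcript, so the Gaussian kernel with bandwidth `sigma`, the symmetric normalization D^(-1/2) W D^(-1/2), and the row normalization of the embedding are all assumptions here, not the author's stated choices.

```python
import numpy as np

def spectral_embedding(points, k, sigma=1.0):
    """Embed points using the top-k eigenvectors of a normalized affinity matrix."""
    # Proximity weights (assumed Gaussian kernel): W_ij = exp(-|x_i - x_j|^2 / (2 sigma^2)).
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization (assumed): D^(-1/2) W D^(-1/2), D = diagonal of row sums.
    inv_sqrt_deg = 1.0 / np.sqrt(W.sum(axis=1))
    W_norm = W * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]
    # eigh returns eigenvalues in ascending order, so the last k columns are the top k.
    _, vecs = np.linalg.eigh(W_norm)
    top = vecs[:, -k:]
    # Scale each embedded point to unit length.
    return top / np.linalg.norm(top, axis=1, keepdims=True)

# Toy data (made up): two tight groups far apart.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
emb = spectral_embedding(pts, k=2)
print(np.round(emb @ emb.T, 3))   # near 1 within a group, near 0 across groups
```

In the embedded space the two groups land on two nearly orthogonal directions, so a simple method such as K-means can separate them.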
Clustering vs. Classification
  • Clustering – find classes in an unsupervised way (often K is given though)
  • Classification – class labels are given for some data points (supervised learning)
Data 1: Face images
  • Facial images (e.g., of persons 5,8,10) live on different “planes” in the “image space”
  • They are often well-separated so that simple clustering can apply to them (but not always…)
  • Question: What is the high-dimensional image space?
  • Question: How can we present high-dimensional data in 3D?
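One possible answer to the two questions, offered as an assumption since the slides do not name a method: treat each h-by-w grayscale image as a vector in R^(h·w), then project onto the top three principal components (PCA). The sketch uses NumPy, and random arrays stand in for real face images.

```python
import numpy as np

def pca_project(data, n_components=3):
    """Project the rows of data onto their top principal directions."""
    centered = data - data.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
images = rng.random((12, 8, 8))              # synthetic stand-in: 12 images of 8x8 pixels
vectors = images.reshape(len(images), -1)    # each image becomes a point in R^64
coords = pca_project(vectors)                # 3D coordinates, one row per image
print(coords.shape)                          # (12, 3)
```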
Data 2: Iris Data Set
  • 50 samples from each of 3 species
  • 4 features per sample: length and width of the sepal and petal




Data 2: Iris Data Set
  • Setosa is clearly separated from the other 2 species
  • Virginica and Versicolor can’t be separated (need a training set, as done by Fisher in 1936)

  • Question: What are other ways to visualize?
Data 3: Color-based Compression of Images
  • Applet
  • Question: What are the actual data points?
  • Question: What does the error mean?
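A sketch of one plausible reading of these questions. The applet itself is not available, so this is an assumption: the data points are the pixels viewed as RGB triples, compression replaces each pixel by the nearest of K palette colors, and the error is the mean squared distance between original and replaced colors. The function name `quantize` and the tiny image are made up for illustration, and the palette is assumed to come from a clustering step such as K-means on the pixels.

```python
import numpy as np

def quantize(image, palette):
    """Replace every pixel by its nearest palette color and report the mean squared error."""
    pixels = image.reshape(-1, 3)                      # data points: one RGB triple per pixel
    d2 = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)                         # index of the nearest palette color
    compressed = palette[labels].reshape(image.shape)
    error = d2.min(axis=1).mean()                      # average squared color distance
    return compressed, labels, error

# Made-up 2x2 image: two reddish and two bluish pixels.
image = np.array([[[250.0, 0.0, 0.0], [255.0, 0.0, 0.0]],
                  [[0.0, 0.0, 250.0], [0.0, 0.0, 255.0]]])
palette = np.array([[252.5, 0.0, 0.0], [0.0, 0.0, 252.5]])   # e.g. two cluster means
compressed, labels, error = quantize(image, palette)
print(labels, error)   # [0 0 1 1] 6.25
```

Storing one palette index per pixel instead of three color values is where the compression comes from. The error is the within-cluster sum of squares divided by the number of pixels.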
Some Methods for Determining the # of Clusters (with online codes)
  • Gap statistic
  • Model-based clustering
  • G-means
  • X-means
  • Data-spectroscopic clustering
  • Self-tuning clustering
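As a starting point for the first item, here is a simplified sketch of the gap statistic, assuming NumPy. It compares log(WCSS) on the data with its average over uniform reference sets drawn in the data's bounding box. The helper names, the number of restarts and reference sets, and the omission of the standard-error rule used in the published method for picking K are all simplifications made for this sketch.

```python
import numpy as np

def lloyd_wcss(points, k, rng, n_iter=50):
    """Run K-means iterations from a random start and return the final WCSS."""
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def best_wcss(points, k, rng, n_starts):
    """Keep the lowest WCSS over several random starts."""
    return min(lloyd_wcss(points, k, rng) for _ in range(n_starts))

def gap_statistic(points, k_max, n_refs=10, n_starts=5, seed=0):
    """Return Gap(k) = mean(log WCSS of references) - log WCSS of data, for k = 1..k_max."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    lo, hi = points.min(axis=0), points.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        log_w = np.log(best_wcss(points, k, rng, n_starts))
        ref_log_w = [
            np.log(best_wcss(rng.uniform(lo, hi, size=points.shape), k, rng, n_starts))
            for _ in range(n_refs)
        ]
        gaps.append(np.mean(ref_log_w) - log_w)
    return np.array(gaps)

# Artificial data (made up): two Gaussian blobs far apart.
data_rng = np.random.default_rng(1)
data = np.vstack([data_rng.normal(0.0, 1.0, (30, 2)),
                  data_rng.normal(20.0, 1.0, (30, 2))])
gaps = gap_statistic(data, k_max=3)
print(gaps)   # a larger gap suggests a better-supported number of clusters
```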
Your mission
  • Learn about clustering (theoretical results, algorithms, codes)
  • Focus: methods for determining # of clusters
    • Understand details
    • Compare using artificial and real data
    • Identify good/bad scenarios for each (prove?)
    • Come up with new/improved methods
  • Summarize info: literature survey and possibly new/improved demos/applets
  • We can suggest additional questions tailored to your interest