Download Presentation
Clustering

Loading in 2 Seconds...

1 / 20

# Clustering - PowerPoint PPT Presentation

Clustering. Gilad Lerman Math Department, UMN. Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore. What is Clustering?. Partitioning data into classes with high intra-class similarity low inter-class similarity Is it well-defined?. What is Similarity?.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

## PowerPoint Slideshow about 'Clustering' - shanon

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Clustering

Gilad Lerman

Math Department, UMN

Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore

What is Clustering?
• Partitioning data into classes with

high intra-class similarity

low inter-class similarity

• Is it well-defined?
What is Similarity?
• Clearly, subjective measure or problem-dependent
How Similar Clusters are?
• Ex1: Two clusters or one clusters?
How Similar Clusters are?
• Ex2: Cluster or outliers
Sum-Squares Intra-class Similarity
• Given Cluster

Mean:

Within Cluster Sum of Squares:

• Note that
Within Cluster Sum of Squares
• For Set of Clusters S={S1,…,SK}
• Can use
• So get Within Clusters Manhattan Distance
• Question: how to compute/estimate c?
Minimizing WCSS
• Precise minimization is “NP-hard”
• Approximate minimization for WCSS by K-means
• Approximate minimization for WCMD by K-medians
The K-means Algorithm
• Input: Data & number of clusters (K)
• Randomly guess locations of K cluster centers
• For each center – assign nearest cluster
• Repeat till convergence ….
K-means: Pros and Cons
• Pros
• Often fast
• Often terminates at a local minimum
• Cons
• May not obtain the global minimum
• Depends on initialization
• Need to specify K
• Sensitive to outliers
• Sensitive to variations in sizes and densities of clusters
• Not suitable for non-convex shapes
• Does not apply directly to categorical data
Spectral Clustering

Idea: embed data for easy clustering

• Construct weights based on proximity:

(Normalize W )

• Embed using eigenvectors of W
Clustering vs. Classification
• Clustering – find classes in an unsupervised way (often K is given though)
• Classification – labels of clusters are given for some data points (supervised learning)
Data 1: Face images
• Facial images (e.g., of persons 5,8,10) live on different “planes” in the “image space”
• They are often well-separated so that simple clustering can apply to them (but not always…)
• Question: What is the high-dimensional image space?
• Question: How can we present high-dim. data in 3D?
Data 2: Iris Data Set
• 50 samples from each of 3 species
• 4 features per sample:

length & width of sepal and petal

Setosa

Versicolor

Virginica

Data 2: Iris Data Set
• Setosa is clearly separated from 2 others
• Can’t separate Virginica and Versicolor

(need training set as done by Fischer in 1936)

• Question: What are other ways to visualize?
Data 3: Color-based Compression of Images
• Applet
• Question: What are the actual data points?
• Question: What does the error mean?
Some methods for # of Clusters(with online codes)
• Gap statistics
• Model-based clustering
• G-means
• X-means
• Data-spectroscopic clustering
• Self-tuning clustering
Your mission
• Learn about clustering (theoretical results, algorithms, codes)
• Focus: methods for determining # of clusters
• Understand details
• Compare using artificial and real data
• Conclude good/bad scenarios for each (prove?)
• Come up with new/improved methods
• Summarize info: literature survey and possibly new/improved demos/applets
• We can suggest additional questions tailored to your interest