
Clustering methods

Course code: 175314

Part 1: Introduction

Pasi Fränti

10.3.2014

Speech & Image Processing Unit

School of Computing

University of Eastern Finland

Joensuu, FINLAND


Sample data

Sources of RGB vectors

Red-Green plot of the vectors


Sample data

Employment statistics:


Application example 1: Color reconstruction

Image with original colors

Image with compression artifacts


Application example 2: Speaker modeling for voice biometrics

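Figure: training data from the speakers Tomi, Mikko, and Matti goes through feature extraction and clustering to produce the speaker models. An unknown sample (?) goes through feature extraction and is compared against the speaker models. Best match: Matti.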


Speaker modeling

Speech data

Result of clustering


Application example 3: Image segmentation

Image with 4 color clusters

Normalized color plots according to red and green components (axes: red, green).


Application example 4: Quantization

Approximation of continuous range values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values

Quantized signal

Original signal
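As a minimal sketch of the definition above, the following uniform scalar quantizer maps values from a continuous range to a small set of integer symbols and back. The function names, the range [lo, hi], and the choice of equally wide cells are assumptions made for illustration; a clustering-based quantizer would instead place the representative values where the data is dense.

```python
def quantize(signal, levels, lo, hi):
    """Map each value in [lo, hi] to an integer symbol in 0..levels-1."""
    step = (hi - lo) / levels
    indices = []
    for v in signal:
        k = int((v - lo) / step)           # index of the cell containing v
        k = max(0, min(levels - 1, k))     # clamp values on or outside the range
        indices.append(k)
    return indices


def reconstruct(indices, levels, lo, hi):
    """Replace each symbol by the midpoint of its cell."""
    step = (hi - lo) / levels
    return [lo + (k + 0.5) * step for k in indices]


# Hypothetical original signal, quantized to 4 levels on [0, 1].
original = [0.0, 0.1, 0.49, 0.5, 0.99, 1.0]
symbols = quantize(original, levels=4, lo=0.0, hi=1.0)
quantized = reconstruct(symbols, levels=4, lo=0.0, hi=1.0)
```

With this design, the reconstruction error of any value inside the range is at most half a cell width.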


Color quantization of images

Color image

RGB samples

Clustering
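The last stage of this pipeline can be sketched as a nearest prototype mapping: once clustering has produced a small palette, every pixel is replaced by its closest palette color. The palette and pixel values below are made up for illustration, and `quantize_colors` is a hypothetical helper name, not part of any tool mentioned in the slides.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize_colors(pixels, palette):
    """Replace each RGB pixel by the nearest palette color."""
    return [min(palette, key=lambda c: sq_dist(p, c)) for p in pixels]


# Hypothetical palette (in practice the cluster centroids of the RGB samples).
palette = [(0, 0, 0), (255, 255, 255), (255, 0, 0)]
pixels = [(10, 10, 10), (250, 240, 245), (200, 30, 20)]
quantized_pixels = quantize_colors(pixels, palette)
```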


Application example 5: Clustering of spatial data


Clustered locations of users


Clustered locations of users

Timeline clustering

Clustering of photos


Clustering GPS trajectories: mobile users, taxi routes, fleet management


Conclusions from clusters

Cluster 2: Home

Cluster 1: Office


Part I: Clustering problem


Subproblems of clustering

  • Where are the clusters? (Algorithmic problem)

  • How many clusters? (Methodological problem: which criterion?)

  • Selection of attributes (Application-related problem)

  • Preprocessing the data (Practical problems: normalization, outliers)


Clustering result as partition

Partition of data

Cluster prototypes

Illustrated by Voronoi diagram

Illustrated by Convex hulls


Duality of partition and centroids

Partition of data

Cluster prototypes

Partition by nearest prototype mapping

Centroids as prototypes


Challenges in clustering

Incorrect cluster allocation

Incorrect number of clusters

Too many clusters

Clusters missing

Cluster missing


How to solve?

Solve the clustering (algorithmic problem):

  • Given input data (X) of N data vectors, and number of clusters (M), find the clusters.

  • Result given as a set of prototypes, or partition.

Solve the number of clusters (mathematical problem):

  • Define an appropriate cluster validity function f.

  • Repeat the clustering algorithm for several M.

  • Select the best result according to f.

Solve the problem efficiently (computer science problem).
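The procedure for choosing the number of clusters can be sketched as follows. The slides do not say which validity function f or which clustering algorithm to use, so this sketch assumes the Davies-Bouldin index (lower is better) for f and a small k-means with deterministic farthest-first initialization; all function names and the toy data are illustrative.

```python
import math


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(data, m, iterations=100):
    # Farthest-first initialization: an assumption made for reproducibility.
    centroids = [data[0]]
    while len(centroids) < m:
        centroids.append(max(data, key=lambda x: min(sq_dist(x, c) for c in centroids)))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(m), key=lambda j: sq_dist(x, centroids[j])) for x in data]
        if new_labels == labels:
            break                           # partition no longer changes
        labels = new_labels
        for j in range(m):
            members = [x for x, p in zip(data, labels) if p == j]
            if members:                     # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


def davies_bouldin(data, centroids, labels):
    """Validity function f (assumed): average worst-case scatter/separation ratio."""
    m = len(centroids)
    scatter = []
    for j in range(m):
        members = [x for x, p in zip(data, labels) if p == j]
        total = sum(math.sqrt(sq_dist(x, centroids[j])) for x in members)
        scatter.append(total / max(len(members), 1))
    worst = [max((scatter[i] + scatter[j]) / math.sqrt(sq_dist(centroids[i], centroids[j]))
                 for j in range(m) if j != i)
             for i in range(m)]
    return sum(worst) / m


def select_number_of_clusters(data, candidates, validity=davies_bouldin):
    # Repeat the clustering for several M and keep the M with the best (lowest) f.
    scores = {m: validity(data, *kmeans(data, m)) for m in candidates}
    return min(scores, key=scores.get), scores


# Toy data: three well-separated groups of four points each.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 0), (10, 1), (11, 0), (11, 1),
        (0, 10), (0, 11), (1, 10), (1, 11)]
best_m, scores = select_number_of_clusters(data, range(2, 6))
```

On this toy data the index is lowest for M = 3. On real data, different validity functions can disagree, which is why the choice of f is listed as a problem of its own.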


Taxonomy of clustering [Jain, Murty, Flynn, Data clustering: A review, ACM Computing Surveys, 1999.]

  • One possible classification is based on the cost function.

  • Mean square error (MSE) is well defined and the most popular cost function.


Definitions and data

Set of N data points:

$X = \{x_1, x_2, \ldots, x_N\}$

Partition of the data:

$P = \{p_1, p_2, \ldots, p_M\}$

Set of M cluster prototypes (centroids):

$C = \{c_1, c_2, \ldots, c_M\}$


Distance and cost function

Euclidean distance of data vectors:

Mean square error:
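One standard way to write these two quantities is given below. It is a sketch under stated assumptions rather than a copy of the slide: the data vectors are taken to be K-dimensional (K is a symbol introduced here), each $p_j$ is read as the set of data vectors assigned to cluster j, and the error is normalized by N. Other normalizations, such as dividing by N·K, are also in use.

```latex
d(x_i, x_j) = \sqrt{\sum_{k=1}^{K} \left( x_{ik} - x_{jk} \right)^2}

\mathrm{MSE}(C, P) = \frac{1}{N} \sum_{j=1}^{M} \sum_{x_i \in p_j} d(x_i, c_j)^2
```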


Dependency of data structures

  • Centroid condition: for a given partition (P), the optimal cluster centroids (C) for minimizing MSE are the average vectors of the clusters.

  • Optimal partition: for given centroids (C), the optimal partition assigns each data vector to its nearest centroid.
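The two conditions above can be written out as follows. This is a sketch in standard notation, assuming that each $p_j$ denotes the set of data vectors in cluster j and that d is the Euclidean distance; ties between equally near centroids are assumed to be broken arbitrarily so that every vector belongs to exactly one cluster.

```latex
c_j = \frac{1}{|p_j|} \sum_{x_i \in p_j} x_i, \qquad j = 1, \ldots, M

p_j = \left\{ x_i \in X : d(x_i, c_j) \le d(x_i, c_k) \ \text{for all } k = 1, \ldots, M \right\}
```

Each condition fixes one structure and optimizes the other, which is why alternating between them can never increase the MSE.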


Complexity of clustering

  • Number of possible clusterings:

  • The clustering problem is NP-complete [Garey et al., 1982].

  • An optimal solution can be found by branch-and-bound, in exponential time.

  • Practical solutions rely on heuristic algorithms.
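To give a feel for the size of the search space, the sketch below assumes that "number of possible clusterings" means the number of ways to split N points into M non-empty clusters, which is the Stirling number of the second kind S(N, M). The transcript does not reproduce the formula from the slide, so this interpretation and the function name `stirling2` are assumptions.

```python
def stirling2(n, m):
    """Number of ways to partition n items into m non-empty, unlabeled clusters.

    Uses the recurrence S(i, k) = k * S(i-1, k) + S(i-1, k-1) with S(0, 0) = 1.
    """
    row = [1] + [0] * m                     # row[k] = S(0, k)
    for i in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, min(i, m) + 1):
            new[k] = k * row[k] + row[k - 1]
        row = new
    return row[m]


# The count explodes quickly, which is why exhaustive search is impractical.
for n in (10, 20, 50):
    print(n, stirling2(n, 3))
```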


Cluster software

http://cs.joensuu.fi/sipu/soft/cluster2009.exe

  • Main area: working space for data

  • Input area: inputs to be processed

  • Output area: obtained results

  • Menu Process: selection of operation


Procedure to simulate k-means

Clustering image

Data set

Codebook

Partition

Open data set (file *.ts), move it into Input area

Process – Random codebook, select number of clusters

REPEAT

Move obtained codebook from Output area into Input area

Process – Optimal partition, select Error function

Move codebook into Main area, partition into Input area

Process – Optimal codebook

UNTIL DESIRED CLUSTERING
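The same REPEAT ... UNTIL loop can be sketched in code. The functions below are named after the menu entries for readability, but they are illustrative stand-ins and not the Cluster software's actual implementation. The sketch assumes squared Euclidean distance as the error function and stops when the MSE no longer improves.

```python
import random


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def random_codebook(data, m, rng):
    """'Random codebook': pick M data vectors as the initial prototypes."""
    return rng.sample(data, m)


def optimal_partition(data, codebook):
    """'Optimal partition': index of the nearest prototype for each data vector."""
    return [min(range(len(codebook)), key=lambda j: sq_dist(x, codebook[j])) for x in data]


def optimal_codebook(data, partition, old_codebook):
    """'Optimal codebook': each prototype becomes the mean vector of its cluster."""
    new = []
    for j, old in enumerate(old_codebook):
        members = [x for x, p in zip(data, partition) if p == j]
        if members:
            new.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new.append(old)                 # an empty cluster keeps its prototype
    return new


def mse(data, codebook, partition):
    return sum(sq_dist(x, codebook[p]) for x, p in zip(data, partition)) / len(data)


def simulate_kmeans(data, m, seed=0, max_rounds=100):
    rng = random.Random(seed)
    codebook = random_codebook(data, m, rng)
    history = []
    for _ in range(max_rounds):             # REPEAT
        partition = optimal_partition(data, codebook)
        codebook = optimal_codebook(data, partition, codebook)
        history.append(mse(data, codebook, partition))
        if len(history) > 1 and history[-1] >= history[-2]:
            break                           # UNTIL no further improvement
    return codebook, partition, history


# Hypothetical data set with two visible groups.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
codebook, partition, history = simulate_kmeans(data, m=2, seed=1)
```

The list `history` records the MSE after each round, so it can be inspected to decide when the clustering is good enough, mirroring the manual "UNTIL DESIRED CLUSTERING" step.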


XLMiner software

http://www.resample.com/xlminer/help/HClst/HClst_ex.htm


Example of data in XLMiner


Distance matrix & dendrogram


Conclusions

  • Clustering is a fundamental tool needed in speech and image processing.

  • Failing to do clustering properly may compromise the application analysis.

  • A good clustering tool is needed so that researchers can focus on application requirements.


Literature

  • S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 3rd edition, 2006.

  • C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

  • A.K. Jain, M.N. Murty and P.J. Flynn, Data clustering: A review, ACM Computing Surveys, 31(3): 264-323, September 1999.

  • M.R. Garey, D.S. Johnson and H.S. Witsenhausen, The complexity of the generalized Lloyd-Max problem, IEEE Transactions on Information Theory, 28(2): 255-256, March 1982.

  • F. Aurenhammer, Voronoi diagrams: A survey of a fundamental geometric data structure, ACM Computing Surveys, 23(3): 345-405, September 1991.

