
Clustering methods Course code: 175314

Part 1: Introduction. Pasi Fränti, 10.3.2014. Speech & Image Processing Unit, School of Computing, University of Eastern Finland, Joensuu, Finland.


Presentation Transcript


  1. Clustering methods. Course code: 175314. Part 1: Introduction. Pasi Fränti, 10.3.2014. Speech & Image Processing Unit, School of Computing, University of Eastern Finland, Joensuu, Finland

  2. Sample data: sources of RGB vectors; red-green plot of the vectors

  3. Sample data: employment statistics

  4. Application example 1: Color reconstruction. Image with original colors; image with compression artifacts

  5. Application example 2: Speaker modeling for voice biometrics. Training data (speakers Tomi, Mikko, Matti): feature extraction and clustering produce the speaker models. Unknown speaker (?): feature extraction, then best match against the speaker models: Matti!

  6. Speaker modeling: speech data; result of clustering

  7. Application example 3: Image segmentation. Image with 4 color clusters. Normalized color plots according to the red and green components (axes: red, green)

  8. Application example 4: Quantization. Approximation of continuous range values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values. Figure: original signal and quantized signal
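
The definition above can be illustrated with a minimal sketch, assuming a scalar signal and a fixed set of output levels. The function name `quantize` and the three levels are hypothetical choices for illustration and do not come from the slides; each sample is simply replaced by the closest level.

```python
def quantize(signal, levels):
    """Replace every sample by the closest value in a small set of levels."""
    return [min(levels, key=lambda q: abs(s - q)) for s in signal]


original = [0.12, 0.48, 0.51, 0.93, 0.27]   # made-up "continuous" samples
levels = [0.0, 0.5, 1.0]                    # hypothetical 3-symbol codebook
quantized = quantize(original, levels)
print(quantized)  # → [0.0, 0.5, 0.5, 1.0, 0.5]
```

In clustering-based quantization the levels are not fixed in advance as they are here; they would be the cluster prototypes found from the data.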

  9. Color quantization of images: color image; RGB samples; clustering

  10. Application example 5: Clustering of spatial data

  11. Clustered locations of users

  12. Timeline clustering: clustering of photos; clustered locations of users

  13. Clustering GPS trajectories: mobile users, taxi routes, fleet management

  14. Conclusions from clusters. Cluster 1: Office. Cluster 2: Home

  15. Part I: Clustering problem

  16. Subproblems of clustering • Where are the clusters? (Algorithmic problem) • How many clusters? (Methodological problem: which criterion?) • Selection of attributes (Application-related problem) • Preprocessing the data (Practical problems: normalization, outliers)

  17. Clustering result as partition. Partition of data; cluster prototypes. Illustrated by Voronoi diagram and by convex hulls

  18. Duality of partition and centroids. Partition of data: partition by nearest-prototype mapping. Cluster prototypes: centroids as prototypes

  19. Challenges in clustering • Incorrect cluster allocation • Incorrect number of clusters: too many clusters, or clusters missing

  20. How to solve? Algorithmic problem; mathematical problem; computer science problem. Solve the clustering: • Given input data (X) of N data vectors and the number of clusters (M), find the clusters. • The result is given as a set of prototypes or as a partition. Solve the number of clusters: • Define an appropriate cluster validity function f. • Repeat the clustering algorithm for several M. • Select the best result according to f. Solve the problem efficiently.

  21. Taxonomy of clustering [Jain, Murty, Flynn, Data clustering: A review, ACM Computing Surveys, 1999] • One possible classification is based on the cost function. • MSE is well defined and the most popular.

  22. Definitions and data • Set of N data points: X = {x1, x2, …, xN} • Partition of the data: P = {p1, p2, …, pM} • Set of M cluster prototypes (centroids): C = {c1, c2, …, cM}

  23. Distance and cost function • Euclidean distance of data vectors • Mean square error
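
The formulas for this slide did not survive in the transcript. The following is a reconstruction of the standard definitions, written with the symbols X, P, C, N and M from the previous slide; it may differ in detail from what the slide showed. The vector dimension K and the component notation x_{ik} are assumptions introduced here, and some texts additionally divide the MSE by K.

```latex
% Euclidean distance between two K-dimensional data vectors (K is an assumed symbol)
d(x_i, x_j) = \sqrt{\sum_{k=1}^{K} \left( x_{ik} - x_{jk} \right)^2}

% Mean square error of a clustering with clusters p_1..p_M and centroids c_1..c_M
\mathrm{MSE}(C, P) = \frac{1}{N} \sum_{j=1}^{M} \sum_{x_i \in p_j} \left\lVert x_i - c_j \right\rVert^2
```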

  24. Dependency of data structures • Centroid condition: for a given partition (P), the optimal cluster centroids (C) for minimizing MSE are the average vectors of the clusters. • Optimal partition: for given centroids (C), the optimal partition is the one that maps each data vector to its nearest centroid.
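
The two conditions on this slide can be sketched in a few lines of plain Python. This is an illustrative sketch, not code from the course: the function names are hypothetical, the partition is stored as one cluster index per data vector (an assumed representation), and the data set is made up.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def optimal_partition(X, C):
    """Optimal partition: map each data vector to the index of its nearest centroid."""
    return [min(range(len(C)), key=lambda j: sq_dist(x, C[j])) for x in X]


def optimal_centroids(X, labels, M):
    """Centroid condition: each centroid is the average vector of its cluster."""
    C = []
    for j in range(M):
        members = [x for x, l in zip(X, labels) if l == j]
        if not members:
            raise ValueError(f"cluster {j} is empty")
        C.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return C


def mse(X, C, labels):
    """Mean square error of the data with respect to the assigned centroids."""
    return sum(sq_dist(x, C[l]) for x, l in zip(X, labels)) / len(X)


X = [(0, 0), (0, 2), (10, 0), (10, 2)]      # made-up data
C = [(1, 1), (8, 1)]                        # arbitrary starting centroids
labels = optimal_partition(X, C)            # → [0, 0, 1, 1]
print(mse(X, C, labels))                    # → 3.5
C = optimal_centroids(X, labels, M=2)       # → [(0.0, 1.0), (10.0, 1.0)]
print(mse(X, C, labels))                    # → 1.0
```

Replacing the centroids by the cluster averages lowers the error from 3.5 to 1.0 without changing the partition, which is what the centroid condition predicts.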

  25. Complexity of clustering • Number of possible clusterings of N data vectors into M non-empty clusters: the Stirling number of the second kind, S(N, M). • The clustering problem is NP-complete [Garey et al., 1982]. • Optimal solution by branch-and-bound in exponential time. • Practical solutions by heuristic algorithms.
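
The formula for the number of possible clusterings is missing from the transcript. A standard way to count the partitions of N vectors into M non-empty, unlabeled clusters is the Stirling number of the second kind; the sketch below evaluates its explicit-sum form. Whether the slide used exactly this expression is an assumption, and the function name is hypothetical.

```python
from math import comb, factorial


def num_clusterings(n, m):
    """Stirling number of the second kind S(n, m):
    ways to split n items into m non-empty, unlabeled clusters."""
    total = sum((-1) ** (m - j) * comb(m, j) * j ** n for j in range(m + 1))
    return total // factorial(m)   # the sum is always divisible by m!


print(num_clusterings(4, 2))    # → 7
print(num_clusterings(10, 3))   # → 9330
```

Already for ten vectors and three clusters there are 9330 candidate partitions, which is why exhaustive search is impractical for realistic N.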

  26. Cluster software http://cs.joensuu.fi/sipu/soft/cluster2009.exe • Main area: working space for data • Input area: inputs to be processed • Output area: obtained results • Menu Process: selection of operation

  27. Procedure to simulate k-means (figure labels: clustering image, data set, codebook, partition) • Open data set (file *.ts), move it into Input area • Process – Random codebook, select number of clusters • REPEAT • Move obtained codebook from Output area into Input area • Process – Optimal partition, select Error function • Move codebook into Main area, partition into Input area • Process – Optimal codebook • UNTIL DESIRED CLUSTERING
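
The slide describes the loop as a sequence of menu operations in the Cluster software. The same loop can be sketched in plain Python as below. This is not the software's code: the function names, the optional `init` argument, the fixed seed and the stopping rule (stop when the partition no longer changes, standing in for "until desired clustering") are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kmeans(X, M, init=None, max_iter=100, seed=0):
    # "Random codebook": pick M data vectors as the initial codebook,
    # unless an explicit starting codebook is supplied.
    C = list(init) if init is not None else random.Random(seed).sample(X, M)
    labels = None
    for _ in range(max_iter):
        # "Optimal partition": nearest codevector for every data vector.
        new_labels = [min(range(M), key=lambda j: sq_dist(x, C[j])) for x in X]
        if new_labels == labels:        # partition unchanged: stop
            break
        labels = new_labels
        # "Optimal codebook": each codevector becomes its cluster average.
        for j in range(M):
            members = [x for x, l in zip(X, labels) if l == j]
            if members:                 # an empty cluster keeps its old codevector
                C[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return C, labels


X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]   # made-up data
codebook, partition = kmeans(X, 2, init=[(0, 0), (0, 1)])
print(partition)   # → [0, 0, 0, 1, 1, 1]
```

Passing `init` makes the run reproducible; with a random codebook the result can depend on the starting vectors, which is one reason the slides treat the algorithmic problem as a challenge of its own.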

  28. XLMiner software http://www.resample.com/xlminer/help/HClst/HClst_ex.htm

  29. Example of data in XLMiner

  30. Distance matrix & dendrogram

  31. Conclusions • Clustering is a fundamental tool needed in speech and image processing. • Failing to do clustering properly may distort the application analysis. • A good clustering tool is needed so that researchers can focus on application requirements.

  32. Literature • S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 3rd edition, 2006. • C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. • A.K. Jain, M.N. Murty and P.J. Flynn, Data clustering: A review, ACM Computing Surveys, 31(3): 264-323, September 1999. • M.R. Garey, D.S. Johnson and H.S. Witsenhausen, The complexity of the generalized Lloyd-Max problem, IEEE Transactions on Information Theory, 28(2): 255-256, March 1982. • F. Aurenhammer, Voronoi diagrams - a survey of a fundamental geometric data structure, ACM Computing Surveys, 23(3): 345-405, September 1991.
