
Clustering different types of data


Presentation Transcript


  1. Clustering different types of data Pasi Fränti 21.3.2017

  2. Data types: numeric, binary, categorical, text, time series

  3. Part I: Numeric data

  4. Distance measures

  5. Definition of distance metric A distance function d is a metric if the following conditions hold for all data points x, y, z: • Non-negativity: d(x, y) ≥ 0 • Identity: d(x, x) = 0 • Symmetry: d(x, y) = d(y, x) • Triangle inequality: d(x, y) ≤ d(x, z) + d(z, y)

  6. Minkowski distance Common distance metric for p-dimensional vectors Xi = (xi1, xi2, …, xip) and Xj = (xj1, xj2, …, xjp): dij = ( Σk |xik − xjk|^q )^(1/q). Special cases: q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.

  7. Distance metrics example in 2D: x1 = (2,8), x2 = (6,3). Euclidean distance: √((6−2)² + (3−8)²) = √(16 + 25) = √41 ≈ 6.4. Manhattan distance: |6−2| + |3−8| = 4 + 5 = 9.

  8. Chebyshev distance In the limit q → ∞, the Minkowski distance equals the maximum difference over the attributes: dij = maxk |xik − xjk|. Useful if the worst case must be avoided. Example (with the points of slide 7): max(|6−2|, |3−8|) = 5.
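
A minimal Python sketch (not part of the original deck) of the distance measures from slides 6-8; the test points are taken from the 2D example above.

    def minkowski(x, y, q):
        """Minkowski distance of order q between two equal-length vectors."""
        return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

    def chebyshev(x, y):
        """Limit case q -> infinity: the maximum attribute difference."""
        return max(abs(a - b) for a, b in zip(x, y))

    x1, x2 = (2, 8), (6, 3)
    print(minkowski(x1, x2, 1))   # Manhattan distance: 9.0
    print(minkowski(x1, x2, 2))   # Euclidean distance: ~6.40
    print(chebyshev(x1, x2))      # Chebyshev distance: 5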

  9. Hierarchical clustering: cost functions Three cost functions exist: • Single linkage • Complete linkage • Average linkage

  10. Single link The smallest distance between vectors in clusters i and j: d(Ci, Cj) = min { d(xi, xj) : xi ∈ Ci, xj ∈ Cj }.

  11. Complete link The largest distance between vectors in clusters i and j: d(Ci, Cj) = max { d(xi, xj) : xi ∈ Ci, xj ∈ Cj }.

  12. Average link The average distance between vectors in clusters i and j: d(Ci, Cj) = (1 / (|Ci||Cj|)) Σxi∈Ci Σxj∈Cj d(xi, xj).

  13. Cost function example [Theodoridis, Koutroumbas, 2006] Data set of seven points x1, …, x7 with distances 1, 1.1, 1.2, 1.3, 1.4, 1.5 between consecutive points; single link and complete link produce different dendrograms.
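
A rough Python sketch of the three linkage criteria from slides 10-12, assuming Euclidean distance between points; the two example clusters are made up for illustration.

    from itertools import product

    def single_link(A, B, dist):
        """Smallest pairwise distance between the two clusters."""
        return min(dist(a, b) for a, b in product(A, B))

    def complete_link(A, B, dist):
        """Largest pairwise distance between the two clusters."""
        return max(dist(a, b) for a, b in product(A, B))

    def average_link(A, B, dist):
        """Average of all pairwise distances between the two clusters."""
        return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

    euclidean = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
    A, B = [(0, 0), (1, 0)], [(4, 0), (5, 0)]
    print(single_link(A, B, euclidean))    # 3.0
    print(complete_link(A, B, euclidean))  # 5.0
    print(average_link(A, B, euclidean))   # 4.0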

  14. Part II: Binary data

  15. Hamming distance (binary and categorical data) Number of attribute positions in which two vectors differ. Distance of (1011101) and (1001001) is 2. Distance of (2143896) and (2233796) is 3. Distance between (toned) and (roses) is 3. On the 3-bit binary cube, 100 → 011 has distance 3 and 010 → 111 has distance 2.
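
A one-function sketch of the Hamming distance defined above (equal-length sequences assumed):

    def hamming(x, y):
        """Number of attribute positions in which two equal-length sequences differ."""
        return sum(a != b for a, b in zip(x, y))

    print(hamming("1011101", "1001001"))  # 2
    print(hamming("2143896", "2233796"))  # 3
    print(hamming("toned", "roses"))      # 3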

  16. Hard thresholding of centroid Thresholding each attribute of the soft centroid (0.40, 0.60, 0.75, 0.20, 0.45, 0.25) at 0.5 gives the binary centroid (0, 1, 1, 0, 0, 0).
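
A small sketch of the hard-thresholding step; the 0.5 threshold is an assumption, as the slide image shows only the soft centroid.

    soft = (0.40, 0.60, 0.75, 0.20, 0.45, 0.25)
    hard = tuple(int(v >= 0.5) for v in soft)  # assumed rounding threshold of 0.5
    print(hard)  # (0, 1, 1, 0, 0, 0)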

  17. Hard and soft centroids Bridge (binary version)

  18. Distance and distortion General distance function d(xi, cj) between a data vector and a cluster centroid; the distortion function sums this distance over all data vectors and the centroids of the clusters they are assigned to.

  19. Distortion for binary data Cost of a single attribute: the number of zeroes is q_jk, the number of ones is r_jk, and c_jk is the current centroid value for variable k of group j.

  20. Optimal centroid position The optimal centroid position depends on the metric: given the parameter of the distance function, the optimal position follows.

  21. Example of centroid location

  22. Centroid location

  23. Categorical clustering Three attributes

  24. Categorical clustering Sample 2-d data: color and shape Model A Model B Model C

  25. Hamming distance (binary and categorical data) • Number of attribute positions in which two vectors differ. • Distance of (1011101) and (1001001) is 2. • Distance of (2143896) and (2233796) is 3. • Distance between (toned) and (roses) is 3. On the 3-bit binary cube, 100 → 011 has distance 3 and 010 → 111 has distance 2.

  26. K-means variants Methods (several histogram-based): • k-modes • k-medoids • k-distributions • k-histograms • k-populations • k-representatives

  27. Category utility: Entropy-based cost functions Entropy of data set: Entropies of the clusters relative to the data:
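
A sketch of one common entropy-based cost, assuming the Shannon entropy of an attribute is computed over the whole data set and, weighted by cluster size, within each cluster; the exact cost used on the slide may differ.

    from collections import Counter
    from math import log2

    def entropy(values):
        """Shannon entropy of a list of categorical values."""
        n = len(values)
        return -sum((c / n) * log2(c / n) for c in Counter(values).values())

    data = ["A", "A", "B", "B", "C", "C"]          # one categorical attribute
    clusters = [["A", "A", "B"], ["B", "C", "C"]]  # a hypothetical 2-cluster split
    print(entropy(data))                                           # entropy of the data set: ~1.585
    print(sum(len(c) / len(data) * entropy(c) for c in clusters))  # cluster entropy relative to the data: ~0.918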

  28. Iterative algorithms

  29. K-modes clustering: distance function The distance between a vector and the cluster mode is the number of mismatching attributes, e.g. d((A, F, I), (A, D, G)) = 2 (+1 for each mismatch).

  30. K-modes clustering: prototype of cluster The prototype is the mode: the most frequent value of each attribute. For vectors (A, D, G), (B, D, H), (A, F, I) the mode starts (A, D, …); the third attribute is a tie broken arbitrarily.
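
A sketch of the mode prototype for the three example vectors; ties (here the third attribute) are broken arbitrarily.

    from collections import Counter

    def mode_prototype(vectors):
        """Most frequent value of each attribute across the cluster."""
        return tuple(Counter(column).most_common(1)[0][0] for column in zip(*vectors))

    cluster = [("A", "D", "G"), ("B", "D", "H"), ("A", "F", "I")]
    print(mode_prototype(cluster))  # ('A', 'D', 'G'); the third attribute is a three-way tie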

  31. K-medoids clustering: prototype of cluster The medoid is the vector with minimal total distance to the others. For (A, C, E), (B, C, F), (B, D, G) the total distances are 2+3=5, 2+2=4 and 3+2=5, so the medoid is (B, C, F).
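
A sketch of medoid selection using the Hamming distance, reproducing the totals 5, 4 and 5 from the slide:

    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))

    def medoid(vectors, dist):
        """Vector with the minimal total distance to all vectors of the cluster."""
        return min(vectors, key=lambda v: sum(dist(v, u) for u in vectors))

    cluster = [("A", "C", "E"), ("B", "C", "F"), ("B", "D", "G")]
    print(medoid(cluster, hamming))  # ('B', 'C', 'F'), total distance 2 + 2 = 4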

  32. K-medoids: example

  33. K-medoids: calculation

  34. K-histograms The prototype stores the frequency of each attribute value in the cluster, e.g. D 2/3, F 1/3 for the second attribute of the earlier example.
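
A sketch of the histogram prototype, assuming it stores the relative frequency of each value per attribute (the slide shows the fractions D 2/3, F 1/3):

    from collections import Counter

    def histogram_prototype(vectors):
        """Relative frequency of each attribute value within the cluster."""
        n = len(vectors)
        return [{value: count / n for value, count in Counter(column).items()}
                for column in zip(*vectors)]

    cluster = [("A", "D", "G"), ("B", "D", "H"), ("A", "F", "I")]
    print(histogram_prototype(cluster)[1])  # D: 2/3, F: 1/3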

  35. K-distributions: cost function with ε addition

  36. Example of cluster allocation: change of entropy

  37. Problem of non-convergence

  38. Results with Census dataset

  39. Literature
  Modified k-modes + k-histograms: M. Ng, M.J. Li, J.Z. Huang and Z. He, "On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm", IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), 503-507, March 2007.
  ACE: K. Chen and L. Liu, "The 'Best k' for entropy-based categorical data clustering", Int. Conf. on Scientific and Statistical Database Management (SSDBM 2005), pp. 253-262, Berkeley, USA, 2005.
  ROCK: S. Guha, R. Rastogi and K. Shim, "ROCK: A robust clustering algorithm for categorical attributes", Information Systems, Vol. 25, No. 5, pp. 345-366, 2000.
  K-medoids: L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, 1990.
  K-modes: Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, Vol. 2, No. 3, pp. 283-304, 1998.
  K-distributions: Z. Cai, D. Wang and L. Jiang, "K-Distributions: A New Algorithm for Clustering Categorical Data", Int. Conf. on Intelligent Computing (ICIC 2007), pp. 436-443, Qingdao, China, 2007.
  K-histograms: Z. He, X. Xu, S. Deng and B. Dong, "K-Histograms: An Efficient Clustering Algorithm for Categorical Dataset", CoRR, abs/cs/0509033, http://arxiv.org/abs/cs/0509033, 2005.

  40. Part IV: Text data

  41. Applications of text clustering • Query relaxation • Spell-checking • Automatic categorization • Document clustering

  42. Query relaxation Current solution: matching suffixes from the database. Alternative solution: from semantic clustering.

  43. Spell-checking Word kahvila (café): • one correct • two incorrect spellings

  44. Automatic categorization Category by clustering

  45. Document clustering Motivation: group related documents based on their content; no predefined training set (taxonomy); generate a taxonomy at runtime. Clustering process: 1) data preprocessing (tokenize, remove stop words, stem, feature extraction and lexical analysis), 2) define the cost function, 3) perform clustering.
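
A toy sketch of the preprocessing step above, assuming a simple regex tokenizer, a tiny hand-written stop-word list, and crude suffix stripping in place of a real stemmer:

    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}  # assumed minimal list

    def preprocess(document):
        """Tokenize, remove stop words, stem crudely, and return a bag-of-words vector."""
        tokens = re.findall(r"[a-z]+", document.lower())
        tokens = [t for t in tokens if t not in STOP_WORDS]
        stems = [re.sub(r"(ing|ed|es|s)$", "", t) for t in tokens]  # toy suffix stripper
        return Counter(stems)

    print(preprocess("Clustering groups related documents based on their content"))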

  46. Text clustering String similarity is the basis for clustering text data; a measure is required to calculate the similarity between two strings.

  47. String similarity Semantic: car and auto; отель and Гостиница (both mean 'hotel'); лапка and слякоть. Syntactic: automobile and auto; отель and готель; sauna and sana.
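
The slides do not name a specific syntactic measure; Levenshtein (edit) distance is a common choice and is sketched here for two of the example pairs.

    def levenshtein(s, t):
        """Minimum number of insertions, deletions and substitutions turning s into t."""
        previous = list(range(len(t) + 1))
        for i, a in enumerate(s, 1):
            current = [i]
            for j, b in enumerate(t, 1):
                current.append(min(previous[j] + 1,              # deletion
                                   current[j - 1] + 1,           # insertion
                                   previous[j - 1] + (a != b)))  # substitution
            previous = current
        return previous[-1]

    print(levenshtein("sauna", "sana"))    # 1
    print(levenshtein("отель", "готель"))  # 1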

  48. Semantic similarity Lexical database: WordNet (English). Relations via generalization (hypernym hierarchy, e.g. object → artifact → instrumentality → conveyance → vehicle → wheeled vehicle → car, auto); sets of synonyms (synsets).

  49. Similarity using WordNet [Wu and Palmer, 2004] Input: word1 = wolf, word2 = hunting dog. Output: similarity value = 0.89.
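
A sketch using NLTK's WordNet interface (an assumption; the original tool is not named on the slide). It requires the nltk package and the wordnet corpus; the synset names wolf.n.01 and hunting_dog.n.01 are my choices for the two input words.

    import nltk
    nltk.download("wordnet", quiet=True)  # one-time corpus download
    from nltk.corpus import wordnet as wn

    wolf = wn.synset("wolf.n.01")
    hunting_dog = wn.synset("hunting_dog.n.01")
    print(wolf.wup_similarity(hunting_dog))  # Wu-Palmer similarity (the slide reports 0.89 for this pair)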

  50. Hierarchical clustering by WordNet (needs improvement)
