
AMCS/CS 340: Data Mining



Presentation Transcript


  1. Clustering AMCS/CS 340: Data Mining Xiangliang Zhang King Abdullah University of Science and Technology

  2. Grouping fruits Grouping apple with apple, orange with orange, and banana with banana

  3. Give pictures to a computer

  4. Change pictures to data

  5. Change pictures to data • Each picture becomes a data point: x1, x2, x3, x4, x5, x6, x7, x8, x9, …, xn

  6. Use clustering methods • Input: the data points x1, x2, x3, x4, x5, x6, x7, x8, x9, … • Output: a clustering indicator for each point, e.g., 1, 2, 1, 2, 3, 2, 3, 1, 3, …

  7. Correct? • For the same points x1, …, x9, … the method outputs the clustering indicators 1, 2, 1, 2, 3, 2, 3, 1, 3, …; is this clustering correct?

  8. Cluster Analysis • What is Cluster Analysis? • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Methods • Clustering High-Dimensional Data • How to decide the number of clusters?

  9. What is Cluster Analysis? • Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups • Intra-cluster distances are minimized; inter-cluster distances are maximized • Unsupervised learning: no predefined classes

  10. Applications of Cluster Analysis • Understanding: as a stand-alone tool to get insight into data distribution, and as a preprocessing step for other algorithms. E.g., group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations • Summarization: reduce the size of large data sets • Image segmentation/compression • Preserving privacy (e.g., in medical data)

  11. Notion of a Cluster Can Be Ambiguous • A clustering is a set of clusters • [Figure: the same set of data points grouped into a clustering with two clusters, with four clusters, and with six clusters]

  12. Quality: What is Good Clustering? • A good clustering method will produce high-quality clusters with • high intra-class similarity • low inter-class similarity • The quality of a clustering result depends on both the implementation of a method and the similarity measure used by that method • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

  13. Quality: What is Good Clustering? (cont.) • The criteria above raise two questions: What kinds of similarity measure? What kinds of method?

  14. Similarity measure • The similarity measure depends on the characteristics of the input data: • Attribute type: binary, categorical, continuous • Sparseness • Dimensionality • Type of proximity: center-based or density-based

  15. Data Structures • Data matrix: n instances, p attributes (features) • Distance matrix (dissimilarity matrix) • Minkowski distance: $d(i,j) = \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^{q} \right)^{1/q}$ • If q = 1, d is the Manhattan distance • If q = 2, d is the Euclidean distance • Cosine measure • Correlation coefficient
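These measures are straightforward to compute directly. Below is a minimal NumPy sketch; the function names `minkowski` and `cosine_similarity` and the sample vectors are ours, not from the lecture:

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance; q=1 gives Manhattan, q=2 gives Euclidean."""
    return float(np.sum(np.abs(x - y) ** q) ** (1.0 / q))

def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 1.0])
print(minkowski(x, y, q=1))     # Manhattan: |1-2| + |2-0| + |3-1| = 5.0
print(minkowski(x, y, q=2))     # Euclidean: sqrt(1 + 4 + 4) = 3.0
print(cosine_similarity(x, y))  # ~0.598
```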

  16. Types of Clustering Methods: Partitioning Clustering • A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset • Typical methods: k-means, k-medoids, CLARANS

  17. Types of Clustering Methods: Hierarchical Clustering • A set of nested clusters organized as a hierarchical tree (dendrogram) • Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON

  18. Types of Clustering Methods: Density-Based Clustering • Based on connectivity and density functions • A cluster is a dense region of points, separated from other high-density regions by low-density regions • Typical methods: DBSCAN, OPTICS, DenClue • [Figure: six density-based clusters]

  19. Types of Clustering Methods: Grid-Based Clustering • Based on a multiple-level granularity structure • Typical methods: STING, WaveCluster, CLIQUE

  20. Types of Clustering Methods: Model-Based Clustering • A model is hypothesized for each of the clusters, and the method finds the best fit of the data to the given model • Typical methods: EM, SOM, COBWEB

  21. Cluster Analysis • What is Cluster Analysis? • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Methods • Clustering High-Dimensional Data • How to decide the number of clusters?

  22. Partitioning Algorithms: Basic Concept • Partitioning clustering method: construct a partition of a dataset D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized: $\min \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2$, where $\mu_i$ is the averaged center of cluster $C_i$ • NP-hard when k is a part of the input (even in 2 dimensions)* • Given k, finding a partition of k clusters that optimizes the SSD takes $O(n^{dk+1})$ time# • Heuristic method: k-means (also called Lloyd's method [Llo82]) • [Figure: a partitioning clustering with k = 3 clusters C1, C2, C3, each with center μi]
  * Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. (2009). "The Planar k-Means Problem is NP-Hard". Lecture Notes in Computer Science 5431: 274–285.
  # Inaba; Katoh; Imai (1994). "Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering". Proceedings of the 10th ACM Symposium on Computational Geometry.

  23. Partitioning Algorithms: Basic Concept • Partitioning clustering method: construct a partition of a dataset D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized, with each cluster represented by an actual object at its center (a medoid) • Global optimum: exhaustively enumerate all partitions • Heuristic methods: k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87) • [Figure: a partitioning clustering with k = 3 clusters C1, C2, C3, each represented by a medoid]

  24. Partitioning Algorithms • k-means • Algorithm • Issue of initial centroids, clustering evaluation • Limitations of k-means • k-medoids

  25. k-means clustering • The number of clusters, K, must be specified • Each cluster is associated with an averaged point (centroid) • Each point is assigned to the cluster with the closest centroid • The basic algorithm is very simple; a minimal sketch follows below
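To make "very simple" concrete, here is a minimal NumPy sketch of the basic (Lloyd's) algorithm; the function name `kmeans` and its arguments are our own choices, not from the slides:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means on an (n, d) array X: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids
```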

  26. k-means clustering example (K = 2) • Arbitrarily choose K objects as the initial cluster centers • Assign each object to the most similar center • Update the cluster means • Reassign objects, then update the means again • Repeat until no object is reassigned • [Figure: points on a 10 × 10 grid being assigned and reassigned over the iterations]

  27. K-means Clustering – Details • Initial centroids are often chosen randomly; clusters produced vary from one run to another • The centroid is (typically) the mean of the points in the cluster • 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc. • K-means will converge for the common similarity measures mentioned above • Most of the convergence happens in the first few iterations, so the stopping condition is often relaxed to 'until relatively few points change clusters' • Complexity is O(n × K × t × d), where n = number of points, K = number of clusters, t = number of iterations, d = number of attributes

  28. Partitioning Algorithms • k-means • Algorithm • Issue of initial centroids, clustering evaluation • Limitations of k-means • k-medoids

  29. Importance of Choosing Initial Centroids • Clusters produced vary from one run to another • [Figure: the same original points clustered differently in Run 1 and Run 2]

  30. Evaluating K-means Clustering • The most common measure is the Sum of Squared Error (SSE) • For each point, the error is the distance to the nearest centroid (the error of representing the point by that centroid) • Given two clustering results, we can choose the one with the smaller error (a small SSE helper is sketched below) • The SSE of the optimal clustering decreases as K, the number of clusters, increases • Still, a good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K • Example (figure): clustering C1 is better than clustering C2
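As a minimal sketch, SSE takes only a few lines; `labels` and `centroids` are assumed to come from a k-means run such as the one sketched earlier:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of Squared Error: squared Euclidean distance from each point
    to the centroid it is assigned to, summed over all points."""
    return float(np.sum((X - centroids[labels]) ** 2))
```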

  31. Importance of Choosing Initial Centroids • Clusters produced vary from one run to another • [Figure: the same original points clustered differently in Run 1 and Run 2, with SSE(Run 1) < SSE(Run 2)]

  32. Solutions to the Initial Centroids Problem • Multiple runs: select the run with the smallest SSE (illustrated below) • Sample the data and use hierarchical clustering to determine initial centroids
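For the multiple-runs strategy, scikit-learn's `KMeans` already implements it via its `n_init` parameter; a small usage sketch with made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # made-up data

# n_init=10 runs k-means from 10 random initializations and keeps
# the run with the smallest SSE (exposed as inertia_).
km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)
print(km.inertia_)  # SSE of the best of the 10 runs
```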

  33. Comments on the K-Means Method • Strength: relatively efficient: O(nktd), where n is # objects, k is # clusters, t is # iterations, and d is # dimensions; normally k, t, d ≪ n • Compare: PAM: O(k(n-k)²), CLARA: O(ks² + k(n-k)) • Comment: often terminates at a local optimum • Weaknesses: • Applicable only when a mean is defined; what about categorical data? • Need to specify k, the number of clusters, in advance • Unable to handle noisy data and outliers • Not suitable for discovering clusters with differing sizes, differing densities, or non-convex shapes

  34. Partitioning Algorithms • k-means • Algorithm • Issue of initial centroids, clustering evaluation • Limitations of k-means • k-medoids

  35. Limitations of K-means: Differing Sizes • [Figure: original points vs. k-means with 3 clusters]

  36. Limitations of K-means: Differing Density • [Figure: original points vs. k-means with 3 clusters]

  37. Limitations of K-means: Non-convex Shapes • [Figure: original points vs. k-means with 2 clusters]

  38. Overcoming K-means Limitations • One solution is to use many clusters: k-means then finds parts of the natural clusters, which must afterwards be put back together • [Figure: original points vs. many small k-means clusters]

  39. Overcoming K-means Limitations (cont.) • The same many-clusters idea, shown on another of the problematic data sets • [Figure: original points vs. many small k-means clusters]

  40. Overcoming K-means Limitations (cont.) • The same many-clusters idea, shown on another of the problematic data sets • [Figure: original points vs. many small k-means clusters]

  41. Limitations of K-means: Sensitive to Outliers • The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data • K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, use a medoid, the most centrally located object in the cluster (a tiny numeric illustration follows below)
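A tiny made-up numeric illustration of this sensitivity: for 1-D data the medoid under Manhattan distance coincides with the median, which stays with the bulk of the data while the mean is dragged toward the outlier.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up data with one extreme outlier

print(np.mean(x))    # 22.0 -- the mean is pulled far away from the bulk of the data
print(np.median(x))  # 3.0  -- the median (the 1-D medoid) is unaffected
```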

  42. Partitioning Algorithms • k-means • Algorithm • Issue of initial centroids, clustering evaluation • Limitations of k-means • k-medoids • PAM • CLARA • CLARANS

  43. The K-Medoids Clustering Method • Find representative objects, called medoids, in clusters • k-medoids uses the same strategy as k-means • PAM (Partitioning Around Medoids, 1987) • Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering • PAM works effectively for small data sets, but does not scale well to large data sets • CLARA (Kaufmann & Rousseeuw, 1990) • CLARANS (Ng & Han, 1994): randomized sampling

  44. A Typical K-Medoids Algorithm (PAM), K = 2 • Arbitrarily choose k objects as the initial medoids (mi, i = 1..k) • Assign each remaining object to the nearest medoid • Select a non-medoid object Oj and compute the total cost of each possible new set of medoids (Oj, {mi}, i ≠ t) • If cost({Oj, {mi, i ≠ t}}) is the smallest such cost and cost({Oj, {mi, i ≠ t}}) < cost({mi}), swap mt and Oj • Loop until no change • [Figure: the total cost drops from 20 to 18 after one swap]

  45. PAM (Partitioning Around Medoids) (1987) • PAM (Partitioning Around Medoids, Kaufman and Rousseeuw, 1987) uses real objects to represent the clusters: • 1. Select k representative objects (medoids) arbitrarily • 2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih = total_cost(replace i by h) - total_cost(no replace) • 3. If min(TCih) < 0, replace i by h, then assign each non-selected object to the most similar representative object • 4. Repeat steps 2-3 until there is no change • O(k(n-k)²) per iteration, where n is the # of data points and k is the # of clusters (a naive sketch follows below)
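A naive sketch of the swap loop, assuming Euclidean distance; the helper names `total_cost` and `pam` are ours. Each full pass evaluates O(k(n - k)) candidate swaps, each costing O(nk), matching the per-iteration cost on the slide:

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum over all points of the distance to the nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    """Naive PAM: swap a medoid with a non-medoid whenever it lowers total cost."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    best = total_cost(X, medoids)
    improved = True
    while improved:
        improved = False
        for t in range(k):               # each current medoid ...
            for h in range(len(X)):      # ... against each non-medoid
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[t] = h
                c = total_cost(X, candidate)
                if c < best:             # keep the swap only if it helps
                    medoids, best = candidate, c
                    improved = True
    # Final assignment: each object goes to its nearest medoid.
    labels = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2).argmin(axis=1)
    return labels, medoids
```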

  46. CLARA (Clustering Large Applications) (1990) • CLARA (Clustering LARge Applications, Kaufmann and Rousseeuw) is a sampling-based method: • It draws multiple samples of the data set, • applies PAM on each sample, • and returns the best clustering as the output (minimizing cost/SSE); see the sketch below • Strength: deals with larger data sets than PAM • Weaknesses: • Efficiency depends on the sample size • A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
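A sketch of the sampling wrapper, reusing the hypothetical `pam` and `total_cost` helpers from the previous block; the default sample size 40 + 2k follows Kaufman & Rousseeuw's suggestion:

```python
import numpy as np

def clara(X, k, n_samples=5, sample_size=None, seed=0):
    """Run PAM on several random samples; keep the medoids with the
    lowest total cost evaluated on the WHOLE data set."""
    rng = np.random.default_rng(seed)
    n = len(X)
    sample_size = sample_size or min(n, 40 + 2 * k)
    best_medoids, best_cost = None, np.inf
    for s in range(n_samples):
        idx = rng.choice(n, size=sample_size, replace=False)
        _, sample_medoids = pam(X[idx], k, seed=s)     # PAM on the sample only
        medoids = idx[np.asarray(sample_medoids)]      # map back to full-data indices
        cost = total_cost(X, medoids)                  # judge on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    labels = np.linalg.norm(X[:, None, :] - X[best_medoids][None, :, :], axis=2).argmin(axis=1)
    return labels, best_medoids
```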

  47. CLARANS ("Randomized" CLARA) (1994) • CLARANS (A Clustering Algorithm based on RANdomized Search, Ng and Han '94) • The clustering process can be viewed as searching a graph where every node is a potential solution, that is, a set of k medoids • Two nodes are connected as neighbors if their medoid sets differ by only one object, so each node has k(n-k) neighbors • PAM: checks every neighbor of the current node • CLARA: examines fewer neighbors, searching in subgraphs built from samples • CLARANS: searches the whole graph but draws a sample of neighbors dynamically

  48. CLARANS ("Randomized" CLARA) (1994) • CLARANS searches the whole graph but draws a sample of neighbors dynamically (a randomized-search sketch follows below) • It is more efficient and scalable than both PAM and CLARA • Focusing techniques and spatial access structures may further improve its performance (Ester et al. '95)
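A sketch of the dynamic neighbor sampling, again reusing the hypothetical `total_cost` helper from the PAM sketch; `num_local` (random restarts) and `max_neighbor` (neighbors examined before declaring a local minimum) are the two CLARANS parameters, with made-up default values:

```python
import numpy as np

def clarans(X, k, num_local=2, max_neighbor=50, seed=0):
    """Random restarts; from each start, sample random one-swap neighbors
    and move whenever one improves the total cost."""
    rng = np.random.default_rng(seed)
    n = len(X)
    best_medoids, best_cost = None, np.inf
    for _ in range(num_local):                 # num_local random restarts
        current = list(rng.choice(n, size=k, replace=False))
        cost = total_cost(X, current)
        tried = 0
        while tried < max_neighbor:
            # A random neighbor: replace one random medoid by a random non-medoid.
            t = int(rng.integers(k))
            h = int(rng.integers(n))
            if h in current:
                continue
            neighbor = current.copy()
            neighbor[t] = h
            c = total_cost(X, neighbor)
            if c < cost:                       # move to the better neighbor
                current, cost, tried = neighbor, c, 0
            else:
                tried += 1
        if cost < best_cost:                   # keep the best local minimum found
            best_medoids, best_cost = current, cost
    return best_medoids, best_cost
```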

  49. What you should know • What is clustering? • What is a partitioning clustering method? • How does k-means work? • The limitations of k-means • How does k-medoids work? • How to solve the scalability problem of k-medoids?
