Clustering

Presentation Transcript


1. Clustering
• Basic concepts with simple examples
• Categories of clustering methods
• Challenges
CSE572, CBS572: Data Mining by H. Liu

2. What is clustering?
• The process of grouping a set of physical or abstract objects into classes of similar objects.
• It is also called unsupervised learning.
• It is a common and important task with many applications.
• Examples where we need clustering?

3. Clusters and representations
• Examples of clusters
• Different ways of representing clusters:
• Division with boundaries
• Venn diagrams or spheres
• Probabilistic (e.g., each item among I1, I2, …, In belongs to clusters 1, 2, 3 with probabilities such as 0.5, 0.2, 0.3)
• Dendrograms
• Trees
• Rules

4. Differences from Classification
• How different?
• Which one is more difficult as a learning problem?
• Do we perform clustering in daily activities?
• How do we cluster?
• How to measure the results of clustering?
• With/without class labels
• Between classification and clustering: semi-supervised clustering

5. Major clustering methods
• Partitioning methods: k-Means (and EM), k-Medoids
• Hierarchical methods: agglomerative, divisive, BIRCH
• Similarity and dissimilarity of points in the same cluster and from different clusters
• Distance measures between clusters: minimum, maximum, means of clusters, average between clusters

6. How to evaluate
• Without labeled data, how can one know a clustering result is good?
• Basic or intuitive idea of clustering for clustered data points:
• Within a cluster - similarity should be high (cohesion)
• Between clusters - similarity should be low (separation)
• The relationship between the two? (Both measures are sketched below.)
• Evaluation methods:
• Labeled data - under another assumption: instances in the same cluster are of the same class. Is it reasonable to use class labels in evaluation?
• Unlabeled data - we will see below
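
A minimal sketch of these two intuitions, assuming 1-D points, Euclidean distance, and a clustering given as a dict from cluster id to points; the names cohesion and separation are illustrative, not from the slides:

    # Within-cluster cohesion and between-cluster separation for a
    # clustering such as {0: [1, 2], 1: [5, 6, 7]}.
    def mean(points):
        return sum(points) / len(points)

    def cohesion(clusters):
        # Sum of squared distances of points to their cluster mean (lower = tighter).
        return sum((p - mean(pts)) ** 2
                   for pts in clusters.values() for p in pts)

    def separation(clusters):
        # Sum of squared pairwise distances between cluster means
        # (higher = better separated).
        means = [mean(pts) for pts in clusters.values()]
        return sum((means[i] - means[j]) ** 2
                   for i in range(len(means)) for j in range(i + 1, len(means)))

    clusters = {0: [1, 2], 1: [5, 6, 7]}
    print(cohesion(clusters), separation(clusters))   # 2.5 and 20.25

A good clustering drives cohesion down while keeping separation up; optimizing either one alone is trivial (all singletons, or one big cluster), which is why the two must be traded off.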

7. Clustering -- Example 1
• For simplicity, 1-dimension objects and k=2.
• Objects: 1, 2, 5, 6, 7
• K-means (a runnable sketch follows):
• Randomly select 5 and 6 as centroids;
• => two clusters {1,2,5} and {6,7}; meanC1 = 8/3, meanC2 = 6.5
• => {1,2}, {5,6,7}; meanC1 = 1.5, meanC2 = 6
• => no change.
• Aggregate dissimilarity = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
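
A minimal k-means sketch reproducing the example, assuming 1-D points; kmeans_1d is an illustrative name, and no cluster is assumed to become empty:

    def kmeans_1d(points, centroids, max_iter=100):
        for _ in range(max_iter):
            # Assignment step: each point joins its nearest centroid.
            clusters = [[] for _ in centroids]
            for p in points:
                i = min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
                clusters[i].append(p)
            # Update step: each centroid becomes the mean of its cluster.
            new_centroids = [sum(c) / len(c) for c in clusters]
            if new_centroids == centroids:   # converged: no change
                break
            centroids = new_centroids
        return clusters, centroids

    clusters, centroids = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
    print(clusters, centroids)   # [[1, 2], [5, 6, 7]] [1.5, 6.0]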

8. Issues with k-means
• A heuristic method
• Sensitive to outliers. How to prove it?
• Determining k: trial and error (sketched below), X-means, PCA-based
• Crisp clustering; soft alternatives include EM and Fuzzy c-means
• Should not be confused with k-NN
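
A sketch of the trial-and-error route to k, reusing the kmeans_1d sketch from Example 1; initializing with the first k points is an arbitrary illustrative choice:

    def sse(clusters, centroids):
        # Aggregate dissimilarity: squared distance of each point to its centroid.
        return sum((p - c) ** 2 for cl, c in zip(clusters, centroids) for p in cl)

    points = [1, 2, 5, 6, 7]
    for k in range(1, 5):
        clusters, centroids = kmeans_1d(points, points[:k])
        print(k, sse(clusters, centroids))
    # 1 26.8 / 2 2.5 / 3 2.0 / 4 0.5 -- the big drop ends at k=2 (the "elbow")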

9. k-Medoids
• Medoid – the most centrally located point in a cluster, as a representative point of the cluster.
• In contrast, a centroid is not necessarily inside a cluster.
• An example [Figure: a data set with its initial medoids]

10. Partitioning Around Medoids
• PAM (a minimal sketch follows):
1. Given k
2. Randomly pick k instances as initial medoids
3. Assign each instance to the nearest medoid x
4. Calculate the objective function: the sum of dissimilarities of all instances to their nearest medoids
5. Randomly select a non-medoid instance y
6. Swap x with y if the swap reduces the objective function (considered for all medoids x)
7. Repeat (3-6) until no change
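
A minimal PAM sketch for 1-D data with absolute-difference dissimilarity; accepting any cost-reducing swap is one simple way to scan swaps, and the initial medoids are illustrative:

    def cost(points, medoids):
        # Objective: sum of dissimilarities of instances to their nearest medoid.
        return sum(min(abs(p - m) for m in medoids) for p in points)

    def pam(points, k):
        medoids = list(points[:k])           # illustrative initial medoids
        improved = True
        while improved:
            improved = False
            for x in list(medoids):          # try swapping each medoid x ...
                for y in points:             # ... with each non-medoid y
                    if y in medoids:
                        continue
                    candidate = [y if m == x else m for m in medoids]
                    if cost(points, candidate) < cost(points, medoids):
                        medoids = candidate
                        improved = True
        return medoids

    print(pam([1, 2, 5, 6, 7], 2))   # [6, 2]: {5,6,7} around 6, {1,2} around 2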

11. k-Means and k-Medoids
• The key difference lies in how they update the means or medoids
• k-medoids requires pairwise comparison between the k medoids and the (N-k) remaining instances
• Both require distance calculation and reassignment of instances
• Time complexity: which one is more costly?
• Dealing with outliers (e.g., an outlier 100 units away; see the sketch below)
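
A small sketch of that outlier effect, contrasting the mean with the medoid when a point 100 units away joins {1, 2, 5, 6, 7}:

    def medoid(points):
        # The point minimizing total dissimilarity to all the others.
        return min(points, key=lambda m: sum(abs(p - m) for p in points))

    points = [1, 2, 5, 6, 7]
    print(sum(points) / len(points), medoid(points))   # 4.2 5

    with_outlier = points + [100]
    print(sum(with_outlier) / len(with_outlier))       # ~20.17: the mean is dragged away
    print(medoid(with_outlier))                        # 5: the medoid barely moves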

12. Agglomerative
• Each object is viewed as a cluster (bottom up).
• Repeat until the number of clusters is small enough:
• Choose the closest pair of clusters
• Merge the two into one
• Defining "closest": centroid (mean of cluster) distance, (average) sum of pairwise distances, … (refer to the Evaluation part)
• A dendrogram is a tree that shows the clustering process.

13. Clustering -- Example 2
• For simplicity, we still use 1-dimension objects.
• Objects: 1, 2, 5, 6, 7
• Agglomerative clustering – a very frequently used algorithm (a runnable sketch follows)
• How to cluster: find the two closest objects and merge;
• => {1,2}, so we now have {1.5, 5, 6, 7};
• => {1,2}, {5,6}, so {1.5, 5.5, 7};
• => {1,2}, {{5,6},7}.
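
A minimal agglomerative sketch using centroid (mean) distance, reproducing the trace above:

    def agglomerative(points, target_k):
        clusters = [[p] for p in points]     # bottom up: one cluster per object
        mean = lambda c: sum(c) / len(c)
        while len(clusters) > target_k:
            # Choose the closest pair of clusters by centroid distance.
            pairs = [(i, j) for i in range(len(clusters))
                     for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda ij: abs(mean(clusters[ij[0]])
                                                 - mean(clusters[ij[1]])))
            clusters[i] += clusters[j]       # merge the two into one
            del clusters[j]
        return clusters

    print(agglomerative([1, 2, 5, 6, 7], 2))   # [[1, 2], [5, 6, 7]]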

14. Issues with dendrograms
• How to find proper clusters
• An alternative: divisive algorithms (top down)
• Compared with bottom-up, which is more efficient? What's the time complexity?
• How to efficiently divide the data
• A heuristic – Minimum Spanning Tree (sketched below) http://en.wikipedia.org/wiki/Minimum_spanning_tree
• Time complexity – the fastest MST algorithms run in about O(e), where e is the number of edges
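
A sketch of the MST heuristic, assuming 1-D points: build a minimum spanning tree (Kruskal-style with union-find), then delete the k-1 longest MST edges so the remaining connected components are the clusters:

    def mst_clusters(points, k):
        n = len(points)
        edges = sorted((abs(points[i] - points[j]), i, j)
                       for i in range(n) for j in range(i + 1, n))
        parent = list(range(n))              # union-find forest
        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x
        mst = []
        for w, i, j in edges:                # Kruskal: add cheapest non-cycle edge
            if find(i) != find(j):
                parent[find(i)] = find(j)
                mst.append((w, i, j))
        kept = sorted(mst)[:len(mst) - (k - 1)]   # drop the k-1 heaviest edges
        parent = list(range(n))
        for w, i, j in kept:
            parent[find(i)] = find(j)
        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(points[i])
        return list(groups.values())

    print(mst_clusters([1, 2, 5, 6, 7], 2))   # [[1, 2], [5, 6, 7]]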

15. Distance measures
• Single link: measured by the shortest edge between the two clusters
• Complete link: measured by the longest edge
• Average link: measured by the average edge length
• An example is shown after the sketch below.
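
A sketch of the three measures for 1-D clusters; any distance metric could replace abs(a - b):

    def single_link(c1, c2):
        # Shortest edge between the two clusters.
        return min(abs(a - b) for a in c1 for b in c2)

    def complete_link(c1, c2):
        # Longest edge between the two clusters.
        return max(abs(a - b) for a in c1 for b in c2)

    def average_link(c1, c2):
        # Average edge length over all cross-cluster pairs.
        return sum(abs(a - b) for a in c1 for b in c2) / (len(c1) * len(c2))

    c1, c2 = [1, 2], [5, 6, 7]
    print(single_link(c1, c2), complete_link(c1, c2), average_link(c1, c2))
    # 3 6 4.5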

16. An example to show different links
[Figure: five points B, A, E, C, D whose pairwise distances make the three linkages merge them differently]
• Single link: merge the nearest clusters measured by the shortest edge between the two: (((A B) (C D)) E)
• Complete link: merge the nearest clusters measured by the longest edge between the two: (((A B) E) (C D))
• Average link: merge the nearest clusters measured by the average edge length between the two: (((A B) (C D)) E)

17. Other Methods
• Density-based methods
• DBSCAN: a cluster is a maximal set of density-connected points
• Core points are defined using the epsilon-neighborhood and MinPts
• Direct density-reachability (e.g., P and Q, Q and M), density-reachability (P and M, and likewise P and N), and density-connectedness (any density-reachable points, e.g., P, Q, M, N) together form clusters
• Grid-based methods
• STING: the lowest level is the original data
• Statistical parameters of higher-level cells are computed from the parameters of the lower-level cells (count, mean, standard deviation, min, max, distribution)
• Model-based methods
• Conceptual clustering: COBWEB
• Category utility: intraclass similarity and interclass dissimilarity

18. Density-based
• DBSCAN – Density-Based Spatial Clustering of Applications with Noise
• It grows regions with sufficiently high density into clusters and can discover clusters of arbitrary shape in spatial databases with noise.
• Many existing clustering algorithms find only spherical shapes of clusters.
• DBSCAN defines a cluster as a maximal set of density-connected points.
• Density is defined by an area and the number of points in it.

19. Defining density and connection
[Figure: sample points P, Q, M, O, R, S illustrate the definitions below]
• ε-neighborhood of an object x (core object) (M, P, O)
• MinPts of objects within the ε-neighborhood (say, 3)
• directly density-reachable (Q from M, M from P)
• Only core objects are mutually density-reachable
• density-reachable (Q from P, P not from Q) [asymmetric]
• density-connected (O, R, S) [symmetric] for border points
• What is the relationship between DR and DC?

20. Clustering with DBSCAN
• Search for clusters by checking the ε-neighborhood of each instance x
• If the ε-neighborhood of x contains at least MinPts instances, create a new cluster with x as a core object
• Iteratively collect directly density-reachable objects from these core objects and merge density-reachable clusters
• Terminate when no new point can be added to any cluster
• DBSCAN is sensitive to the density thresholds, but it is fast
• Time complexity: O(N log N) if a spatial index is used, O(N^2) otherwise
• A minimal sketch follows.
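
A minimal DBSCAN sketch for 1-D points; eps and min_pts stand for the ε and MinPts thresholds above, and a point counts in its own neighborhood:

    def dbscan(points, eps, min_pts):
        UNVISITED, NOISE = None, -1
        labels = [UNVISITED] * len(points)
        neighbors = lambda i: [j for j in range(len(points))
                               if abs(points[i] - points[j]) <= eps]
        cluster_id = 0
        for i in range(len(points)):
            if labels[i] is not UNVISITED:
                continue
            seeds = neighbors(i)
            if len(seeds) < min_pts:           # not a core object
                labels[i] = NOISE
                continue
            labels[i] = cluster_id             # new cluster around core object i
            queue = list(seeds)
            while queue:                       # collect density-reachable objects
                j = queue.pop()
                if labels[j] == NOISE:
                    labels[j] = cluster_id     # border point: density-connected
                if labels[j] is not UNVISITED:
                    continue
                labels[j] = cluster_id
                if len(neighbors(j)) >= min_pts:   # j is core: keep expanding
                    queue.extend(neighbors(j))
            cluster_id += 1
        return labels

    print(dbscan([1, 2, 5, 6, 7, 100], eps=1.5, min_pts=2))
    # [0, 0, 1, 1, 1, -1] -- the point at 100 ends up as noise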

21. Grid: STING (STatistical INformation Grid)
• Statistical parameters of higher-level cells can easily be computed from those of lower-level cells (sketched below)
• Attribute-independent: count
• Attribute-dependent: mean, standard deviation, min, max
• Type of distribution: normal, uniform, exponential, or unknown
• Irrelevant cells can be removed
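
A sketch of that bottom-up computation, merging child cells' (count, mean, std, min, max) into the parent's; the dict layout is illustrative:

    import math

    def merge_cells(children):
        n = sum(c["n"] for c in children)                     # attribute-independent
        mean = sum(c["n"] * c["mean"] for c in children) / n  # weighted mean
        # Parent E[x^2] from each child's variance and mean, then parent std.
        ex2 = sum(c["n"] * (c["std"] ** 2 + c["mean"] ** 2) for c in children) / n
        return {"n": n, "mean": mean,
                "std": math.sqrt(ex2 - mean ** 2),
                "min": min(c["min"] for c in children),
                "max": max(c["max"] for c in children)}

    a = {"n": 2, "mean": 1.5, "std": 0.5, "min": 1, "max": 2}               # cell {1, 2}
    b = {"n": 3, "mean": 6.0, "std": math.sqrt(2 / 3), "min": 5, "max": 7}  # cell {5, 6, 7}
    print(merge_cells([a, b]))   # n=5, mean=4.2, std ~2.32, min=1, max=7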

22. BIRCH using Clustering Feature (CF) and CF tree
• A clustering feature is a triplet summarizing a sub-cluster of instances: (N, LS, SS)
• N – the number of instances, LS – linear sum, SS – square sum
• Two parameters: the branching factor (max number of children per non-leaf node) and a threshold on the size of leaf sub-clusters
• Two phases:
• Build an initial in-memory CF tree
• Apply a clustering algorithm to cluster the leaf nodes of the CF tree
• CURE (Clustering Using REpresentatives) is another example, allowing multiple representative points per cluster
• A sketch of CF arithmetic follows.
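
A sketch of CF arithmetic for 1-D data: CFs add component-wise when sub-clusters merge, and the centroid and radius fall straight out of the triplet:

    import math

    def cf(points):
        # (N, LS, SS): count, linear sum, square sum.
        return (len(points), sum(points), sum(p * p for p in points))

    def merge(cf1, cf2):
        # Additivity: the CF of a merged sub-cluster is the sum of the parts.
        return tuple(a + b for a, b in zip(cf1, cf2))

    def centroid_and_radius(c):
        n, ls, ss = c
        centroid = ls / n
        radius = math.sqrt(ss / n - centroid ** 2)   # avg distance from centroid
        return centroid, radius

    total = merge(cf([1, 2]), cf([5, 6, 7]))
    print(total)                        # (5, 21, 115)
    print(centroid_and_radius(total))   # (4.2, ~2.32)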

23. Taking advantage of the property of density
• If a unit is dense in a higher-dimensional subspace, its projections in lower-dimensional subspaces are dense too
• How to use this property?
• CLIQUE (CLustering In QUEst)
• With high-dimensional data, there are many void subspaces
• Using this property, we can start from dense lower-dimensional units (sketched below)
• CLIQUE is a density-based method that can automatically find subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
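
A sketch of how that property is used, apriori-style: find dense 1-D grid units, then test only combinations of them as candidate 2-D units; the grid width and density threshold are illustrative:

    from collections import Counter
    from itertools import product

    points = [(1.0, 1.0), (1.2, 1.4), (5.0, 6.0), (5.3, 6.1), (5.1, 5.8), (9.0, 1.0)]
    unit = lambda v: int(v)      # grid units of width 1
    threshold = 2                # a unit is "dense" with at least 2 points

    # Dense 1-D units in each dimension.
    dense_1d = [{u for u, c in Counter(unit(p[d]) for p in points).items()
                 if c >= threshold} for d in range(2)]

    # Candidate 2-D units come only from dense 1-D units; keep the dense ones.
    candidates = product(dense_1d[0], dense_1d[1])
    dense_2d = {u for u in candidates
                if sum((unit(p[0]), unit(p[1])) == u for p in points) >= threshold}
    print(dense_1d, dense_2d)   # [{1, 5}, {1, 6}] {(1, 1), (5, 6)}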

24. Chameleon
• A hierarchical clustering algorithm using dynamic modeling
• Motivated by observations on the weaknesses of CURE and ROCK
• CURE: clustering using representatives
• ROCK: clustering categorical attributes
• Based on k-NN graphs and dynamic modeling

25. Graph-based clustering
• Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points.
• The nearest neighbors of a point tend to belong to the same class as the point itself.
• This reduces the impact of noise and outliers and sharpens the distinction between clusters, as the sketch below illustrates.
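
A sketch of one such sparsification, assuming 1-D points: keep an edge only when each endpoint is among the other's k nearest neighbors, so weak links (e.g., to an outlier) are broken:

    def mutual_knn_edges(points, k):
        def knn(i):
            return sorted((j for j in range(len(points)) if j != i),
                          key=lambda j: abs(points[i] - points[j]))[:k]
        nearest = [set(knn(i)) for i in range(len(points))]
        return {(i, j) for i in range(len(points)) for j in nearest[i]
                if i < j and i in nearest[j]}

    points = [1, 2, 5, 6, 7, 100]
    print(mutual_knn_edges(points, k=2))
    # {(0, 1), (2, 3), (2, 4), (3, 4)} -- the outlier at 100 is disconnected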

26. Neural networks
• Self-organizing feature maps (SOMs)
• Subspace clustering
• CLIQUE: if a k-dimensional unit is dense, then so are its (k-1)-dimensional subspaces
• More will be discussed later
• Semi-supervised clustering: http://www.cs.utexas.edu/~ml/publication/unsupervised.html and http://www.cs.utexas.edu/users/ml/risc/

27. Challenges
• Scalability
• Dealing with different types of attributes
• Clusters with arbitrary shapes
• Automatically determining input parameters
• Dealing with noise (outliers)
• Insensitivity to the order in which instances are presented to the learner
• High dimensionality
• Interpretability and usability
