
Clustering as graph cut

Clustering as graph cut: describe the pairwise distance via a graph; clustering can then be obtained via a graph cut.


Presentation Transcript


  1. Clustering as graph cut • Describe the pairwise distance via a graph

  2. Clustering as graph cut • Describe the pairwise distance via a graph • Clustering can be obtained via graph cut

  3. Clustering as graph cut • Describe the pairwise distance via a graph • Clustering can be obtained via graph cut

  4. Clustering as graph cut • Describe the pairwise distance via a graph • Clustering can be obtained via graph cut • Cut by class label vs. cut by cluster label
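The slides illustrate the idea with figures. As a rough sketch of what "describing the pairwise distance via a graph" and cutting it mean computationally, the snippet below builds a Gaussian-kernel affinity matrix and scores a candidate clustering by the total weight of the edges it cuts. The kernel choice, `sigma`, and the function names are illustrative assumptions, not anything specified in the lecture.

```python
import numpy as np

def similarity_graph(X, sigma=1.0):
    """Dense pairwise-affinity matrix W with a Gaussian kernel on Euclidean distance."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def cut_cost(W, labels):
    """Total weight of edges whose endpoints fall in different clusters (the value of the cut)."""
    crossing = labels[:, None] != labels[None, :]
    return W[crossing].sum() / 2  # each crossing edge appears twice in W

X = np.vstack([np.random.randn(5, 2), np.random.randn(5, 2) + 5.0])
W = similarity_graph(X)
print(cut_cost(W, np.array([0] * 5 + [1] * 5)))  # a good clustering yields a small cut
```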

  5. Recap: external validation • Given the class label on each instance, compare it against the cluster label • Rand index: RI = (TP + TN) / (TP + TN + FP + FN), computed over pairs of instances
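A minimal sketch of the Rand index computed directly from its pair-counting definition (standard formula; this is not code from the course):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of instance pairs on which the class labeling and the clustering agree."""
    agree, total = 0, 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        agree += int(same_class == same_cluster)
        total += 1
    return agree / total

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: identical partitions up to renaming
```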

  6. k-means Clustering • Hongning Wang, CS@UVa

  7. Today’s lecture • k-means clustering • A typical partitional clustering algorithm • Convergence property • Expectation Maximization algorithm • Gaussian mixture model

  8. Partitional clustering algorithms • Partition instances into exactly k non-overlapping clusters • Flat-structure clustering • Users need to specify the number of clusters k • Task: identify the partition into k clusters that optimizes the chosen partition criterion

  9. Partitional clustering algorithms • Partition instances into exactly k non-overlapping clusters • Typical criterion: maximize inter-cluster distance while minimizing intra-cluster distance • Optimal solution: enumerate every possible partition into k clusters and return the one that maximizes the criterion; unfortunately, this is NP-hard • Instead, approximate it and optimize in an alternative way
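The criterion itself did not survive transcription; the standard objective that k-means optimizes, stated here as an assumption consistent with the rest of the lecture, is the within-cluster sum of squared distances:

```latex
\min_{C_1,\dots,C_k} \; J \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad \mu_j \;=\; \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i .
```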

  10. k-means algorithm
  • Input: number of clusters k, instances {x_i}, distance metric d(·,·)
  • Output: cluster membership assignments {z_i}
  • Initialize k cluster centroids (randomly, if no domain knowledge is available)
  • Repeat until no instance changes its cluster membership:
    • Assign each instance to its nearest cluster centroid (minimizes intra-cluster distance)
    • Update the k cluster centroids based on the assigned cluster memberships (maximizes inter-cluster distance)
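A compact, self-contained sketch of the algorithm above using Euclidean distance (NumPy is assumed; this is illustrative, not the course's reference implementation):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's k-means: alternate nearest-centroid assignment and centroid (mean) updates."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    assignments = None
    for _ in range(n_iters):
        # assignment step: each instance goes to its closest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # no instance changed its cluster membership
        assignments = new_assignments
        # update step: each centroid moves to the mean of its members
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assignments, centroids

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
labels, centers = kmeans(X, k=2)
```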

  11. k-means illustration

  12. k-means illustration • Voronoi diagram

  13. k-means illustration

  14. k-means illustration

  15. k-means illustration

  16. Complexity analysis
  • Decide cluster membership: O(kn) distance computations per iteration
  • Compute cluster centroids: O(n) per iteration
  • Assume k-means stops after l iterations
  • Don’t forget the complexity of each distance computation, e.g., O(d) for Euclidean distance in d dimensions
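Putting those pieces together (a reconstruction, since the slide's own expressions were lost in transcription), the overall cost is:

```latex
\underbrace{O(knd)}_{\text{assignment step}} \;+\; \underbrace{O(nd)}_{\text{centroid update}}
\;\;\text{per iteration} \;\;\Longrightarrow\;\; O(l \cdot k \cdot n \cdot d) \;\text{after } l \text{ iterations.}
```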

  17. Convergence property • Why will k-means stop? • Answer: it is a special version of the Expectation Maximization (EM) algorithm, and EM is guaranteed to converge • However, it is only guaranteed to converge to a local optimum, since k-means (EM) is a greedy algorithm

  18. Probabilistic interpretation of clustering
  • The density model of the data, p(x), is multi-modal
  • Each mode represents a sub-population, e.g., a unimodal Gaussian for each group
  • Mixture model: p(x) = Σ_k π_k p(x | z = k), where p(x | z = k) is a unimodal distribution and π_k is the mixing proportion

  19. Probabilistic interpretation of clustering
  • If the cluster label z_i is known for every x_i, estimating π_k and p(x | z = k) is easy
  • Maximum likelihood estimation
  • This is Naïve Bayes

  20. Probabilistic interpretation of clustering
  • But z_i is unknown for all x_i, so estimating π_k and p(x | z = k) is generally hard (usually a constrained optimization problem)
  • Appeal to the Expectation Maximization algorithm

  21. Introduction to EM
  • Parameter estimation when all data is observable: maximum likelihood estimator, i.e., optimize the analytic form of log p(X|θ)
  • With missing/unobservable data: data = X (observed) + Z (hidden, e.g., cluster membership)
  • Likelihood: log p(X|θ) = log Σ_Z p(X, Z|θ)
  • In most cases this is intractable to optimize directly, so approximate it!

  22. Background knowledge • Jensen’s inequality • For any convex function f and positive weights λ_i with Σ_i λ_i = 1: f(Σ_i λ_i x_i) ≤ Σ_i λ_i f(x_i)

  23. Expectation Maximization
  • Maximize the data likelihood log p(X|θ) by pushing up a lower bound obtained from Jensen’s inequality with a proposal distribution q(Z) (derivation below)
  • Components we need to tune when optimizing the lower bound: q(Z) and θ
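The lower bound the slide refers to, reconstructed in standard notation (log is concave, so Jensen's inequality applies for any proposal distribution q(Z)):

```latex
\log p(X \mid \theta)
  = \log \sum_{Z} q(Z)\,\frac{p(X, Z \mid \theta)}{q(Z)}
  \;\ge\; \sum_{Z} q(Z)\,\log \frac{p(X, Z \mid \theta)}{q(Z)}
  \;=:\; \mathcal{L}(q, \theta).
```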

  24. Intuitive understanding of EM • Figure: the data likelihood p(X|θ) and its lower bound • The lower bound is easier to optimize, and improving it is guaranteed to improve the data likelihood

  25. Expectation Maximization (cont) • Optimize the lower bound w.r.t. q(Z) • The bound decomposes into a term that is constant with respect to q plus the negative KL-divergence between q(Z) and the posterior p(Z|X, θ), as written out below
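The decomposition spelled out (a reconstruction, using the notation of the bound above):

```latex
\mathcal{L}(q, \theta)
  = \sum_{Z} q(Z)\,\log \frac{p(Z \mid X, \theta)\,p(X \mid \theta)}{q(Z)}
  = \underbrace{\log p(X \mid \theta)}_{\text{constant w.r.t. } q}
    \;-\; \mathrm{KL}\!\left(q(Z)\,\|\,p(Z \mid X, \theta)\right).
```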

  26. Expectation Maximization (cont)
  • Optimize the lower bound w.r.t. q(Z)
  • The KL-divergence is non-negative, and equals zero iff q(Z) = p(Z|X, θ)
  • A step further: when q(Z) = p(Z|X, θ), we get L(q, θ) = log p(X|θ), i.e., the lower bound is tight!
  • Other choices of q(Z) cannot lead to this tight bound, but might reduce computational complexity
  • Note: the calculation of p(Z|X, θ) is based on the current θ

  27. Expectation Maximization (cont) • Optimize the lower bound w.r.t. q(Z) • Optimal solution: q(Z) = p(Z|X, θ^(t)), the posterior distribution of Z given the current model • In k-means this corresponds to assigning each instance to its closest cluster centroid

  28. Expectation Maximization (cont) • Optimize the lower bound w.r.t. θ • With q(Z) fixed, the terms that do not involve θ are constant, so we maximize the expectation of the complete data likelihood (the Q-function) • In k-means we are not computing the full expectation, but the most probable configuration of Z, and then update the centroids given that hard assignment

  29. Expectation Maximization
  • EM tries to iteratively maximize the likelihood
  • “Complete” data likelihood: log p(X, Z|θ)
  • Starting from an initial guess θ^(0):
  • E-step (the key step): compute the expectation of the complete data likelihood, the Q-function
  • M-step: compute θ^(t+1) by maximizing the Q-function
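The two steps written out in standard notation (a reconstruction of the slide's formulas):

```latex
\text{E-step:}\quad Q(\theta; \theta^{(t)}) = \mathbb{E}_{Z \sim p(Z \mid X, \theta^{(t)})}\!\left[\log p(X, Z \mid \theta)\right],
\qquad
\text{M-step:}\quad \theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta; \theta^{(t)}).
```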

  30. An intuitive understanding of EM • In k-means: E-step = identify the cluster membership of each instance; M-step = update the centroids given those memberships • Figure: the data likelihood p(X|θ) with the lower bound (Q-function) at the current guess θ^(t) and the next guess θ^(t+1) • E-step = computing the lower bound; M-step = maximizing the lower bound

  31. Convergence guarantee
  • Proof of EM: write log p(X|θ) = log p(X, Z|θ) − log p(Z|X, θ) and take the expectation of both sides with respect to p(Z|X, θ^(t)); this gives the Q-function plus a cross-entropy term
  • The change of the log data likelihood between EM iterations then splits into a Q-function difference and a cross-entropy difference
  • By Jensen’s inequality the cross-entropy difference is non-negative, and the M-step guarantees the Q-function difference is non-negative (worked out below)
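The proof sketched above in full, reconstructed in standard notation with H(θ; θ^(t)) denoting the cross-entropy term:

```latex
\log p(X \mid \theta)
  = \underbrace{\mathbb{E}_{Z \sim p(Z \mid X, \theta^{(t)})}\!\left[\log p(X, Z \mid \theta)\right]}_{Q(\theta;\,\theta^{(t)})}
  \;+\; \underbrace{\Bigl(-\,\mathbb{E}_{Z \sim p(Z \mid X, \theta^{(t)})}\!\left[\log p(Z \mid X, \theta)\right]\Bigr)}_{H(\theta;\,\theta^{(t)})}
\\[6pt]
\log p(X \mid \theta^{(t+1)}) - \log p(X \mid \theta^{(t)})
  = \underbrace{Q(\theta^{(t+1)}; \theta^{(t)}) - Q(\theta^{(t)}; \theta^{(t)})}_{\ge 0 \text{ by the M-step}}
  \;+\; \underbrace{H(\theta^{(t+1)}; \theta^{(t)}) - H(\theta^{(t)}; \theta^{(t)})}_{=\, \mathrm{KL}\left(p(Z \mid X, \theta^{(t)}) \,\|\, p(Z \mid X, \theta^{(t+1)})\right) \,\ge\, 0}
```

Hence the log data likelihood is non-decreasing across EM iterations, which is exactly why k-means must stop.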

  32. What is not guaranteed
  • A global optimum is not guaranteed!
  • The likelihood log p(X|θ) is non-convex in most cases
  • EM boils down to a greedy algorithm: alternating ascent
  • Generalized EM • E-step: compute q(Z) = p(Z|X, θ^(t)) • M-step: choose any θ^(t+1) that improves the Q-function (not necessarily its maximizer)

  33. k-means vs. Gaussian Mixture
  • If we use Euclidean distance in k-means, we have in effect assumed p(x | z = k) is Gaussian
  • Gaussian Mixture Model (GMM): p(x) = Σ_k π_k N(x; μ_k, Σ_k), where z follows a multinomial distribution with parameters π
  • We do not consider the cluster proportions π_k in k-means
  • In k-means, we assume equal variance across clusters, so we don’t need to estimate it

  34. k-means vs. Gaussian Mixture • Soft vs. hard posterior assignment • GMM: each instance receives a posterior probability p(z = k | x) over clusters • k-means: each instance is assigned entirely to its nearest cluster
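A small contrast of the two kinds of assignment. This sketch uses scikit-learn, which is an assumption of the example, not something referenced in the lecture:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4.0])

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)           # one label per instance
soft = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)  # posterior p(z = k | x)

print(hard[:3])            # e.g. [0 0 0]
print(soft[:3].round(3))   # rows sum to 1, e.g. [[0.998 0.002] ...]
```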

  35. k-means in practice
  • Extremely fast and scalable; one of the most widely used clustering methods (among the top 10 data mining algorithms, ICDM 2006)
  • Can be easily parallelized, e.g., a Map-Reduce implementation • Mapper: assign each instance to its closest centroid • Reducer: update each centroid based on the cluster membership
  • Sensitive to initialization; prone to local optima
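A toy sketch of that mapper/reducer split in plain Python (standing in for an actual Map-Reduce framework; the function names and driver loop are illustrative assumptions):

```python
import numpy as np

def mapper(x, centroids):
    """Emit (closest-centroid id, (instance, count)) for one instance."""
    j = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
    return j, (x, 1)

def reducer(j, values):
    """New position of centroid j: the mean of all instances assigned to it."""
    total = sum(x for x, _ in values)
    count = sum(c for _, c in values)
    return total / count

# one k-means iteration over a toy dataset
X = np.random.randn(10, 2)
C = X[:2].copy()
pairs = [mapper(x, C) for x in X]
for j in {jj for jj, _ in pairs}:
    C[j] = reducer(j, [v for jj, v in pairs if jj == j])
```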

  36. Better initialization: k-means++
  • Choose the first cluster center uniformly at random
  • Repeat until all k centers have been chosen: • For each instance x, compute D(x), its distance to the closest center chosen so far • Choose a new cluster center with probability proportional to D(x)² (a new center should be far away from existing centers)
  • Run k-means with the selected centers as the initialization
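A sketch of the seeding step just described (NumPy assumed; D(x)² sampling as in the standard k-means++ recipe). The returned centers could replace the random initialization in the k-means sketch shown earlier.

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++ seeding: sample each new center with probability proportional to D(x)^2."""
    if rng is None:
        rng = np.random.default_rng()
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]              # first center: uniformly at random
    for _ in range(k - 1):
        # squared distance from every instance to its closest already-chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```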

  37. How to determine k • Vary k to optimize the clustering criterion • Internal vs. external validation • Cross validation • Abrupt change in the objective function

  38. How to determine k • Vary k to optimize the clustering criterion • Internal vs. external validation • Cross validation • Abrupt change in the objective function • Model selection criteria that penalize too many clusters, e.g., AIC, BIC
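A sketch of both recipes: the abrupt change ("elbow") in the k-means objective, and BIC for a Gaussian mixture (scikit-learn assumed; not course code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(60, 2) + offset for offset in ([0, 0], [5, 0], [0, 5])])

for k in range(1, 7):
    objective = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_  # within-cluster SSE
    bic = GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)          # lower is better
    print(f"k={k}  objective={objective:.1f}  BIC={bic:.1f}")
# pick k at the elbow of the objective curve / the minimum of BIC (both near k=3 here)
```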

  39. What you should know • k-means algorithm • An alternating greedy optimization algorithm • Convergence guarantee • EM algorithm • Hard clustering vs. soft clustering • k-means vs. GMM

  40. Today’s reading • Introduction to Information Retrieval • Chapter 16: Flat clustering • 16.4 k-means • 16.5 Model-based clustering
