
Chapter 10 Unsupervised Learning & Clustering


Presentation Transcript


  1. Pattern Classification. All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

  2. Chapter 10 Unsupervised Learning & Clustering • Introduction • Mixture Densities and Identifiability • ML Estimates • Application to Normal Mixtures • K-means algorithm • Unsupervised Bayesian Learning • Data description and clustering • Criterion function for clustering • Hierarchical clustering • The number-of-clusters problem and cluster validation • On-line clustering • Graph-theoretic methods • PCA and ICA • Low-dimensional representations and multidimensional scaling (self-organizing maps) • Clustering and dimensionality reduction

  3. Introduction • Previously, all our training samples were labeled: these samples were said to be "supervised" • Why are we interested in "unsupervised" procedures, which use unlabeled samples? • Collecting and labeling a large set of sample patterns can be costly • We can train with large amounts of (less expensive) unlabeled data, then use supervision to label the groupings found; this is appropriate for large "data mining" applications where the contents of a large database are not known beforehand Pattern Classification, Chapter 10

  4. Patterns may change slowly with time; improved performance can be achieved if classifiers running in an unsupervised mode are used • We can use unsupervised methods to identify features that will then be useful for categorization → "smart" feature extraction • We gain some insight into the nature (or structure) of the data → which set of classification labels? Pattern Classification, Chapter 10

  5. Mixture Densities & Identifiability • Assume: • the functional forms for the underlying probability densities are known • the value of an unknown parameter vector must be learned • i.e., like chapter 3 but without class labels • Specific assumptions: • The samples come from a known number c of classes • The prior probabilities P(ωj) for each class are known (j = 1, …, c) • The forms of the class-conditional densities p(x | ωj, θj) (j = 1, …, c) are known • The values of the c parameter vectors θ1, θ2, …, θc are unknown • The category labels are unknown Pattern Classification, Chapter 10

  6. The PDF for the samples is: • This density function is called a mixture density • Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ. • Once θ is known, we can decompose the mixture into its components and use a MAP classifier on the derived densities. Pattern Classification, Chapter 10
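
In the notation introduced above, with c component densities p(x | ωj, θj) and mixing parameters (priors) P(ωj), the mixture density is:

$$
p(x \mid \theta) \;=\; \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j),
\qquad \theta = (\theta_1, \ldots, \theta_c).
$$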

  7. Can θ be recovered from the mixture? • Consider the case where: • we have an unlimited number of samples • we use nonparametric techniques to find p(x | θ) for every x • If several θ result in the same p(x | θ) → we can't find a unique solution. This is the issue of solution identifiability. • Definition: Identifiability. A density p(x | θ) is said to be identifiable if θ ≠ θ′ implies that there exists an x such that p(x | θ) ≠ p(x | θ′). Pattern Classification, Chapter 10

  8. As a simple example, consider the case where x is binary and P(x | θ) is the mixture given below. Assume that P(x = 1 | θ) = 0.6 and hence P(x = 0 | θ) = 0.4. We know P(x | θ) but not θ. We can say that θ1 + θ2 = 1.2, but not what θ1 and θ2 are. Thus, we have a case in which the mixture distribution is completely unidentifiable, and therefore unsupervised learning is impossible. Pattern Classification, Chapter 10
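
With equal priors, the binary mixture in this example has the form

$$
P(x \mid \theta) \;=\; \tfrac{1}{2}\,\theta_1^{\,x}(1-\theta_1)^{1-x} \;+\; \tfrac{1}{2}\,\theta_2^{\,x}(1-\theta_2)^{1-x},
$$

so that $P(x = 1 \mid \theta) = \tfrac{1}{2}(\theta_1 + \theta_2)$, which is why only the sum $\theta_1 + \theta_2$ can be recovered from the data.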

  9. In discrete distributions, too many mixture components can be problematic: • too many unknowns • perhaps more unknowns than independent equations → identifiability can become a serious problem! Pattern Classification, Chapter 10

  10. While it can be shown that mixtures of normal densities are usually identifiable, the parameters in a simple mixture density cannot be uniquely identified if P(ω1) = P(ω2) (we cannot recover a unique θ even from an infinite amount of data!) • θ = (θ1, θ2) and θ = (θ2, θ1) are two possible vectors that can be interchanged without affecting P(x | θ). • Identifiability can be a problem, but we will always assume that the densities we are dealing with are identifiable! Pattern Classification, Chapter 10

  11. ML Estimates. Suppose that we have a set D = {x1, …, xn} of n unlabeled samples drawn independently from the mixture density p(x | θ), where θ is fixed but unknown. The ML estimate is the value of θ that maximizes the likelihood p(D | θ): Pattern Classification, Chapter 10
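
Concretely, the likelihood of the independently drawn samples and the resulting estimate are

$$
p(D \mid \theta) \;=\; \prod_{k=1}^{n} p(x_k \mid \theta),
\qquad
\hat\theta \;=\; \arg\max_{\theta}\; p(D \mid \theta).
$$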

  12. ML Estimates Then the log-likelihood is: And the gradient of the log-likelihood is: Pattern Classification, Chapter 10
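
Writing l for the log-likelihood of the n samples, the quantities referred to on the slide are

$$
l \;=\; \sum_{k=1}^{n} \ln p(x_k \mid \theta),
\qquad
\nabla_{\theta_i}\, l \;=\; \sum_{k=1}^{n} \frac{1}{p(x_k \mid \theta)}\,
\nabla_{\theta_i}\Big[\sum_{j=1}^{c} p(x_k \mid \omega_j, \theta_j)\,P(\omega_j)\Big]
\;=\; \sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\,\nabla_{\theta_i}\ln p(x_k \mid \omega_i, \theta_i),
$$

where the posterior probability of class ωi is

$$
P(\omega_i \mid x_k, \theta) \;=\; \frac{p(x_k \mid \omega_i, \theta_i)\,P(\omega_i)}{p(x_k \mid \theta)}.
$$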

  13. Since the gradient must vanish at the value of θi that maximizes l, the ML estimate must satisfy the conditions below. Pattern Classification, Chapter 10

  14. The ML estimates for P(ωi) and θi must satisfy: Pattern Classification, Chapter 10
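
In the chapter's notation (hatted quantities denote the estimates), those conditions are

$$
\hat P(\omega_i) \;=\; \frac{1}{n}\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta),
\qquad
\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)\,\nabla_{\theta_i}\ln p(x_k \mid \omega_i, \hat\theta_i) \;=\; 0,
$$

where

$$
\hat P(\omega_i \mid x_k, \hat\theta) \;=\;
\frac{p(x_k \mid \omega_i, \hat\theta_i)\,\hat P(\omega_i)}
{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat\theta_j)\,\hat P(\omega_j)}.
$$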

  15. Applications to Normal Mixtures. p(x | ωi, θi) ~ N(μi, Σi) • Case 1: the simplest case (only the mean vectors are unknown) • Case 2: the more realistic case (all parameters are unknown) Pattern Classification, Chapter 10

  16. Case 1: Multivariate Normal, Unknown Mean Vectors. θi = μi for i = 1, …, c. The likelihood condition for the i-th mean gives the ML estimate of μ = (μ1, …, μc) shown below, where P̂(ωi | xk, μ̂) is the fraction (posterior probability) with which the sample xk is attributed to the i-th class, and μ̂i is the corresponding weighted average of the samples coming from the i-th class. Pattern Classification, Chapter 10
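
The resulting condition (presumably the "equation (1)" cited on the next slide) expresses each estimated mean as a posterior-weighted average of the samples:

$$
\hat\mu_i \;=\; \frac{\displaystyle\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\mu)\, x_k}
{\displaystyle\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\mu)} . \qquad (1)
$$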

  17. Unfortunately, equation (1) does not give μ̂i explicitly • However, if we have some way of obtaining good initial estimates for the unknown means, equation (1) can be seen as an iterative process for improving the estimates Pattern Classification, Chapter 10

  18. This is a gradient ascent for maximizing the log-likelihood function • Example: Consider the simple two-component one-dimensional normal mixture (2 clusters!). Let's set μ1 = -2, μ2 = 2, draw 25 samples sequentially from this mixture, and examine the resulting log-likelihood function l(μ1, μ2). Pattern Classification, Chapter 10

  19. The maximum value of l occurs at estimates that are not far from the true values μ1 = -2 and μ2 = +2. There is another peak that has almost the same height, as can be seen from the following figure. This mixture of normal densities is identifiable. When the mixture density is not identifiable, the ML solution is not unique. Pattern Classification, Chapter 10

  20. Pattern Classification, Chapter 10
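
A small NumPy sketch of this experiment (the names and the equal-prior, unit-variance setup are simplifying assumptions, not necessarily the exact mixture used for the figure): it draws 25 samples from a two-component 1-D normal mixture with true means -2 and +2 and iterates the update of equation (1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 25 unlabeled samples from a two-component 1-D normal mixture,
# true means -2 and +2, unit variances, equal priors (assumed for simplicity).
true_mu = np.array([-2.0, 2.0])
hidden = rng.integers(0, 2, size=25)            # hidden component of each sample
x = rng.normal(loc=true_mu[hidden], scale=1.0)  # the observed samples

# Iterate the update of equation (1), starting from poor initial estimates.
mu = np.array([-0.1, 0.1])
for _ in range(100):
    # posterior P(w_i | x_k, mu) for unit-variance components with equal priors
    resp = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp /= resp.sum(axis=1, keepdims=True)
    # each new mean is the posterior-weighted average of the samples
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)  # typically close to (-2, +2), or to the nearly-as-high swapped peak (+2, -2)
```

Depending on the initial estimates, the iteration converges to one of the two nearly equal peaks of the log-likelihood discussed above.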

  21. Case 2: All Parameters Unknown • No constraints are placed on the covariance matrix • Let p(x | μ, σ²) be the two-component normal mixture: Pattern Classification, Chapter 10
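
In this example one component has unknown mean and variance while the other is a standard normal, with equal priors (the form used in the textbook's argument):

$$
p(x \mid \mu, \sigma^2) \;=\;
\frac{1}{2\sqrt{2\pi}\,\sigma}\exp\!\Big[-\tfrac{1}{2}\Big(\frac{x-\mu}{\sigma}\Big)^{2}\Big]
\;+\;
\frac{1}{2\sqrt{2\pi}}\exp\!\Big[-\tfrac{1}{2}x^{2}\Big].
$$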

  22. Suppose we set μ = x1. Then, as σ → 0, the first term blows up for the sample x1, while for the rest of the samples the second (standard normal) term keeps p(xk | μ, σ²) bounded away from zero. The likelihood can therefore be made arbitrarily large, and the maximum-likelihood solution becomes singular. Pattern Classification, Chapter 10

  23. Assumption: the MLE is well behaved at local maxima. • Consider the largest of the finite local maxima of the likelihood function and use ML estimation there. • We obtain the following local-maximum-likelihood estimates, which define an iterative scheme: Pattern Classification, Chapter 10

  24. Where: Pattern Classification, Chapter 10
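
For normal component densities, these local-maximum conditions take the following form (hatted quantities are the estimates):

$$
\hat P(\omega_i) = \frac{1}{n}\sum_{k=1}^{n}\hat P(\omega_i \mid x_k,\hat\theta),\qquad
\hat\mu_i = \frac{\sum_{k}\hat P(\omega_i \mid x_k,\hat\theta)\,x_k}{\sum_{k}\hat P(\omega_i \mid x_k,\hat\theta)},\qquad
\hat\Sigma_i = \frac{\sum_{k}\hat P(\omega_i \mid x_k,\hat\theta)\,(x_k-\hat\mu_i)(x_k-\hat\mu_i)^{t}}{\sum_{k}\hat P(\omega_i \mid x_k,\hat\theta)},
$$

where

$$
\hat P(\omega_i \mid x_k,\hat\theta) =
\frac{|\hat\Sigma_i|^{-1/2}\exp\!\big[-\tfrac12 (x_k-\hat\mu_i)^{t}\hat\Sigma_i^{-1}(x_k-\hat\mu_i)\big]\,\hat P(\omega_i)}
{\sum_{j=1}^{c} |\hat\Sigma_j|^{-1/2}\exp\!\big[-\tfrac12 (x_k-\hat\mu_j)^{t}\hat\Sigma_j^{-1}(x_k-\hat\mu_j)\big]\,\hat P(\omega_j)} .
$$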

  25. K-Means Clustering • Goal: find the c mean vectors μ1, μ2, …, μc • Replace the squared Mahalanobis distance with the squared Euclidean distance ‖xk − μ̂i‖² • Find the mean μ̂m nearest to xk and approximate the posterior as shown below • Use the iterative scheme to find μ̂1, …, μ̂c Pattern Classification, Chapter 10
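
The approximation referred to above credits each sample entirely to its nearest mean,

$$
\hat P(\omega_i \mid x_k, \hat\mu) \;\simeq\;
\begin{cases}
1 & \text{if } i = m \ \ (\hat\mu_m \text{ is the mean nearest to } x_k),\\
0 & \text{otherwise},
\end{cases}
$$

which reduces the mean-update equation to "recompute each μ̂i as the average of the samples assigned to it".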

  26. If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:

begin initialize n, c, μ1, μ2, …, μc (randomly selected)
   do classify the n samples according to the nearest μi
      recompute μi
   until no change in μi
   return μ1, μ2, …, μc
end

Complexity is O(ndcT), where d is the number of features and T the number of iterations. Pattern Classification, Chapter 10
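
A compact NumPy sketch of this loop (the function and variable names are illustrative, not from the text); each iteration costs O(ndc), giving the O(ndcT) total quoted above.

```python
import numpy as np

def k_means(X, c, max_iter=100, seed=None):
    """Cluster the (n, d) array X into c clusters; return the means and labels."""
    rng = np.random.default_rng(seed)
    n = len(X)
    mu = X[rng.choice(n, size=c, replace=False)]           # initialize with random samples
    for _ in range(max_iter):                               # at most T iterations
        # classify the n samples according to the nearest mean (squared Euclidean distance)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each mean from the samples assigned to it (keep old mean if cluster is empty)
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(c)])
        if np.allclose(new_mu, mu):                          # stop when no mean changes
            break
        mu = new_mu
    return mu, labels
```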

  27. K-means clustering applied to the data from the previous figure Pattern Classification, Chapter 10

  28. Unsupervised Bayesian Learning • Besides the ML estimate, the Bayesian estimation technique can also be used in the unsupervised case (see the ML and Bayesian methods of Chap. 3 of the textbook) • the number of classes is known • the class priors are known • the forms of the class-conditional probability densities p(x | ωj, θj) are known • However, the full parameter vector θ is unknown • Part of our knowledge about θ is contained in the prior p(θ) • the rest of our knowledge of θ is in the training samples • We compute the posterior distribution using the training samples Pattern Classification, Chapter 10

  29. We can compute p(θ | D) as seen previously • and pass through the usual formulation introducing the unknown parameter vector θ. • Hence, the best estimate of p(x | ωi) is obtained by averaging p(x | ωi, θi) over θi. • The goodness of this estimate depends on p(θ | D); this is the main issue of the problem. P(ωi | D) = P(ωi), since the selection of ωi is independent of the previous samples. Pattern Classification, Chapter 10
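
In symbols (using P(ωi | D) = P(ωi) as noted on the slide):

$$
p(x \mid D) \;=\; \sum_{j=1}^{c} P(\omega_j)\, p(x \mid \omega_j, D),
\qquad
p(x \mid \omega_i, D) \;=\; \int p(x \mid \omega_i, \theta_i)\, p(\theta \mid D)\, d\theta .
$$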

  30. From Bayes we get the posterior p(θ | D), where independence of the samples yields the likelihood, or alternatively (denoting by Dn the set of the first n samples) the recursive form given below. • If p(θ) is almost uniform in the region where p(D | θ) peaks, then p(θ | D) peaks in the same place. Pattern Classification, Chapter 10
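
The formulas referred to above:

$$
p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta},
\qquad
p(D \mid \theta) \;=\; \prod_{k=1}^{n} p(x_k \mid \theta),
$$

and, recursively in the number of samples,

$$
p(\theta \mid D^{n}) \;=\; \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta}.
$$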

  31. If the only significant peak occurs at θ = θ̂ and the peak is very sharp, then p(x | D) ≈ p(x | θ̂) and p(x | ωi, D) ≈ p(x | ωi, θ̂i). • Therefore, the ML estimate is justified. • Both approaches coincide if large amounts of data are available. • In small-sample-size problems they can agree or not, depending on the form of the distributions. • The ML method is typically easier to implement than the Bayesian one. Pattern Classification, Chapter 10

  32. Formal Bayesian solution: unsupervised learning of the parameters of a mixture density is similar to the supervised learning of the parameters of a component density. • Significant differences: identifiability and computational complexity • The issue of identifiability • With supervised learning (SL), the lack of identifiability means that we do not obtain a unique parameter vector but an equivalence class of vectors; this presents no theoretical difficulty, since all of them yield the same component density. • With unsupervised learning (UL), the lack of identifiability means that the mixture cannot be decomposed into its true components → p(x | Dn) may still converge to p(x), but p(x | ωi, Dn) will not in general converge to p(x | ωi); hence there is a theoretical barrier. • The computational complexity • With SL, the existence of sufficient statistics allows the solutions to be computationally feasible Pattern Classification, Chapter 10

  33. With UL, the samples come from a mixture density and there is little hope of finding simple exact solutions for p(D | θ) → n samples result in 2^n terms (corresponding to the ways in which the n samples can be drawn from the 2 classes). • Another way of comparing UL and SL is to consider the usual recursive equation, written so that the mixture density is explicit. Pattern Classification, Chapter 10
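
Making the mixture explicit in the recursive Bayes update gives

$$
p(\theta \mid D^{n}) \;=\;
\frac{\Big[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\,P(\omega_j)\Big]\, p(\theta \mid D^{n-1})}
{\displaystyle\int \Big[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\,P(\omega_j)\Big]\, p(\theta \mid D^{n-1})\, d\theta}.
$$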

  34. From the previous slide: if we consider the case in which P(ω1) = 1 and all the other prior probabilities are zero, corresponding to the supervised case in which all samples come from class ω1, then we get the expression below. Pattern Classification, Chapter 10
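
With P(ω1) = 1 the bracketed mixture reduces to the single component density, giving the supervised form

$$
p(\theta \mid D^{n}) \;=\;
\frac{p(x_n \mid \omega_1, \theta_1)\, p(\theta \mid D^{n-1})}
{\displaystyle\int p(x_n \mid \omega_1, \theta_1)\, p(\theta \mid D^{n-1})\, d\theta}.
$$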

  35. Comparing the two equations on the previous slides, we see that observing an additional sample changes the estimate of θ. • Ignoring the denominator, which is independent of θ, the only significant difference is that • in SL, we multiply the "prior" density for θ by the component density p(xn | ω1, θ1) • in UL, we multiply the "prior" density by the whole mixture • Assuming that the sample did come from class ω1, the effect of not knowing this category is to diminish the influence of xn in changing θ for category 1. Pattern Classification, Chapter 10

  36. Data Clustering • Structures of multidimensional patterns are important for clustering • If we know that the data come from a specific distribution, such data can be represented by a compact set of parameters (sufficient statistics) • If the samples are assumed to come from a specific distribution but actually do not, these statistics give a misleading representation of the data Pattern Classification, Chapter 10

  37. Approximation of density functions: • Mixtures of normal distributions can approximate arbitrary PDFs • In these cases, one can use parametric methods to estimate the parameters of the mixture density. • No free lunch → dimensionality issue! • Huh? Pattern Classification, Chapter 10

  38. Caveat • If little prior knowledge can be assumed, the assumption of a parametric form is meaningless: • Issue: imposing structure vs. finding structure • use a nonparametric method to estimate the unknown mixture density. • Alternatively, for subclass discovery: • use a clustering procedure • identify data points having strong internal similarities Pattern Classification, Chapter 10

  39. Similarity measures • What do we mean by similarity? • Two issues: • How do we measure the similarity between samples? • How do we evaluate a partitioning of a set into clusters? • The obvious measure of similarity/dissimilarity is the distance between samples • Samples of the same cluster should be closer to each other than to samples in different classes. Pattern Classification, Chapter 10

  40. Euclidean distance is a possible metric: • assume samples belong to the same cluster if their distance is less than a threshold d0 • Clusters defined by Euclidean distance are invariant to translations and rotations of the feature space, but not invariant to general transformations that distort the distance relationships Pattern Classification, Chapter 10

  41. Achieving invariance: • normalize the data, e.g., so that they all have zero mean and unit variance, • or use principal components for invariance to rotation • A broad class of metrics is the Minkowski metric, where q ≥ 1 is a selectable parameter: q = 1 → Manhattan or city-block metric; q = 2 → Euclidean metric • One can also use a nonmetric similarity function s(x, x′) to compare two vectors. Pattern Classification, Chapter 10
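
For d-dimensional feature vectors the Minkowski metric is

$$
d(x, x') \;=\; \Big(\sum_{k=1}^{d} \big|x_k - x'_k\big|^{q}\Big)^{1/q}, \qquad q \ge 1 .
$$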

  42. It is typically a symmetric function whose value is large when x and x′ are similar. • For example, the (normalized) inner product • In the case of binary-valued features we have, e.g., the Tanimoto distance Pattern Classification, Chapter 10
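
The two similarity functions mentioned above, in their usual form:

$$
s(x, x') \;=\; \frac{x^{t} x'}{\lVert x\rVert\,\lVert x'\rVert},
\qquad
s_{\text{Tanimoto}}(x, x') \;=\; \frac{x^{t} x'}{x^{t} x + x'^{t} x' - x^{t} x'},
$$

the latter being, for binary-valued features, the ratio of the number of shared attributes to the number of attributes possessed by x or x′.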

  43. Clustering as optimization • The second issue: how do we evaluate a partitioning of a set into clusters? • Clustering can be posed as the optimization of a criterion function • the sum-of-squared-error criterion and its variants • scatter criteria • The sum-of-squared-error criterion • Let ni be the number of samples in Di and mi the mean of those samples Pattern Classification, Chapter 10

  44. The sum of squared error is defined below. • This criterion defines clusters by their mean vectors mi → it minimizes the sum of the squared lengths of the errors x − mi. • The minimum-variance partition minimizes Je • Results: • good when the clusters form well-separated, compact clouds • bad when there are large differences in the number of samples in different clusters. Pattern Classification, Chapter 10
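
With mi the mean of the samples in Di, the criterion is

$$
m_i \;=\; \frac{1}{n_i}\sum_{x \in D_i} x,
\qquad
J_e \;=\; \sum_{i=1}^{c}\;\sum_{x \in D_i}\, \lVert x - m_i\rVert^{2}.
$$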

  45. Scatter criteria • Scatter matrices are used as in multiple discriminant analysis, i.e., the within-cluster scatter matrix SW and the between-cluster scatter matrix SB, with ST = SB + SW • Note: • ST does not depend on the partitioning • In contrast, SB and SW do depend on the partitioning • Two approaches: • minimize the within-cluster scatter • maximize the between-cluster scatter Pattern Classification, Chapter 10
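
In terms of the cluster means mi and the total mean m:

$$
S_W \;=\; \sum_{i=1}^{c}\sum_{x \in D_i}(x - m_i)(x - m_i)^{t},
\qquad
S_B \;=\; \sum_{i=1}^{c} n_i\,(m_i - m)(m_i - m)^{t},
\qquad
S_T \;=\; \sum_{x}(x - m)(x - m)^{t} \;=\; S_W + S_B .
$$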

  46. The trace (the sum of the diagonal elements) is the simplest scalar measure of a scatter matrix • it is proportional to the sum of the variances in the coordinate directions • Applied to SW, this is exactly the sum-of-squared-error criterion Je. Pattern Classification, Chapter 10
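
Explicitly,

$$
\operatorname{tr} S_W \;=\; \sum_{i=1}^{c}\sum_{x \in D_i} \lVert x - m_i\rVert^{2} \;=\; J_e .
$$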

  47. Since tr[ST] = tr[SW] + tr[SB] and tr[ST] is independent of the partitioning, no new results can be derived by minimizing tr[SB] • However, seeking to minimize the within-cluster criterion Je = tr[SW] is equivalent to maximizing the between-cluster criterion tr[SB], where m is the total mean vector. Pattern Classification, Chapter 10
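
The between-cluster criterion and the total mean referred to above:

$$
\operatorname{tr} S_B \;=\; \sum_{i=1}^{c} n_i\, \lVert m_i - m\rVert^{2},
\qquad
m \;=\; \frac{1}{n}\sum_{x} x \;=\; \frac{1}{n}\sum_{i=1}^{c} n_i\, m_i .
$$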

  48. Iterative optimization • Clustering → a discrete optimization problem • Finite data set → finite number of partitions • What is the cost of exhaustive search? Roughly c^n / c! for c clusters; not a good idea • Typically, iterative optimization is used: • start from a reasonable initial partition • redistribute samples to minimize the criterion function • → this guarantees local, not global, optimization. Pattern Classification, Chapter 10

  49. Consider an iterative procedure to minimize the sum-of-squared-error criterion Je, written as the sum over clusters of the effective per-cluster errors Ji. • Moving a sample x̂ from cluster Di to Dj changes the errors in the two clusters as shown below. Pattern Classification, Chapter 10
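
Writing x̂ for the sample being moved, the per-cluster errors change to

$$
J_j^{*} \;=\; J_j + \frac{n_j}{n_j + 1}\,\lVert \hat x - m_j\rVert^{2},
\qquad
J_i^{*} \;=\; J_i - \frac{n_i}{n_i - 1}\,\lVert \hat x - m_i\rVert^{2},
$$

with the two cluster means updated accordingly.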

  50. Hence, the transfer is advantageous if the decrease in Ji is larger than the increase in Jj. Pattern Classification, Chapter 10
