
Lecture 6



  1. Lecture 6 Statistical Lecture ─ Cluster Analysis

  2. Cluster Analysis • Grouping similar objects to produce a classification • Useful when the structure of the data is unknown a priori • Involves assessing the relative distances between points

  3. Clustering Algorithms • Partitioning : • Divide the data set into k clusters, where k needs to be specified beforehand (e.g., k-means)

  4. Clustering Algorithms • Hierarchical : • Agglomerative methods : • Start with the situation where each object forms its own little cluster, and then successively merge clusters until only one large cluster is left • Divisive methods : • Start by considering the whole data set as one cluster, and then split up clusters until each object is separate

  5. Caution • Most users are interested in the main structure of their data, consisting of a few large clusters • When forming larger clusters, agglomerative methods may make wrong decisions in the first steps (once an early merge is wrong, it cannot be undone, so the rest of the hierarchy is affected) • For divisive methods, the larger clusters are determined first, so they are less likely to suffer from wrong decisions in the early steps

  6. Agglomerative Hierarchical Clustering Procedure (1) Each observation begins in a cluster by itself (2) The two closest clusters are merged to form a new cluster that replaces the two old clusters (3) Repeat (2) until only one cluster is left The various clustering methods differ in how the distance between two clusters is computed.
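A minimal sketch of this procedure, assuming Python with NumPy and SciPy (the lecture's own example output comes from SAS PROC CLUSTER, not Python). The linkage() call carries out steps (1) to (3); its method argument selects the between-cluster distance rule discussed on the following slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))      # 20 observations, 2 variables (made-up data)

# Step (1): every observation starts as its own cluster.
# Steps (2)-(3): linkage() repeatedly merges the two closest clusters until
# only one cluster remains; "average" selects the between-cluster distance rule.
Z = linkage(X, method="average")
print(Z[:3])                      # each row: [cluster i, cluster j, distance, new size]
```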

  7. Remarks • For coordinate data, variables with large variances tend to have more effect on the resulting clusters than those with small variances • Scaling or transforming the variables might be needed • Standardization (standardize the variables to mean 0 and standard deviation 1) or principal components is useful but not always appropriate • Outliers should be removed before analysis

  8. Remarks (cont.) • Nonlinear transformations of the variables may change the number of population clusters and should therefore be approached with caution • For most applications, the variables should be transformed so that equal differences are of equal practical importance • An interval scale of measurement is required if raw data are used as input. Ordinal or ranked coordinate data are generally not appropriate

  9. Notation
n      number of observations
v      number of variables, if the data are coordinates
G      number of clusters at any given level of the hierarchy
x_i    the ith observation
C_k    the kth cluster, a subset of {1, 2, …, n}
N_k    number of observations in C_k

  10. Notation (cont.)
x̄      sample mean vector
x̄_k    mean vector for cluster C_k
||x||  Euclidean length of the vector x, that is, the square root of the sum of the squares of the elements of x
T      total sum of squares, T = Σ_{i=1}^{n} ||x_i − x̄||²
W_k    within-cluster sum of squares for C_k, W_k = Σ_{i∈C_k} ||x_i − x̄_k||²

  11. Notation (cont.)
P_G     Σ W_j, where the summation is over the G clusters at the Gth level of the hierarchy
B_kl    W_m − W_k − W_l, if C_m = C_k ∪ C_l
d(x, y) any distance or dissimilarity measure between observations or vectors x and y
D_kl    any distance or dissimilarity measure between clusters C_k and C_l

  12. Clustering Method ─ Average Linkage
The distance between two clusters is defined by
  D_kl = (1 / (N_k N_l)) Σ_{i∈C_k} Σ_{j∈C_l} d(x_i, x_j)
If d(x, y) = ||x − y||², then
  D_kl = W_k/N_k + W_l/N_l + ||x̄_k − x̄_l||²
The combinatorial formula is
  D_jm = (N_k D_jk + N_l D_jl) / N_m,  if C_m = C_k ∪ C_l

  13. Average Linkage • The distance between clusters is the average distance between pairs of observations, one in each cluster • It tends to join clusters with small variance and is slightly biased toward producing clusters with the same variance
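A small sketch of this definition (the two clusters C_K and C_L are made up for the example): the average-linkage distance is simply the mean of all N_K · N_L pairwise distances, one observation from each cluster.

```python
import numpy as np
from scipy.spatial.distance import cdist

C_K = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster C_K (2 observations)
C_L = np.array([[4.0, 0.0], [5.0, 1.0]])   # cluster C_L (2 observations)

# D_KL = (1 / (N_K * N_L)) * sum of d(x_i, x_j) over all pairs
D_KL = cdist(C_K, C_L).mean()
print(D_KL)
```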

  14. Centroid Method
The distance between two clusters is defined by
  D_kl = ||x̄_k − x̄_l||²
If d(x, y) = ||x − y||², then the combinatorial formula is
  D_jm = (N_k D_jk + N_l D_jl) / N_m − (N_k N_l D_kl) / N_m²

  15. Centroid Method • The distance between two clusters is defined as the squared Euclidean distance between their centroids or means • It is more robust to outliers than most other hierarchical methods but in other respects may not perform as well as Ward’s method or average linkage

  16. Complete Linkage
The distance between two clusters is defined by
  D_kl = max_{i∈C_k} max_{j∈C_l} d(x_i, x_j)
The combinatorial formula is
  D_jm = max(D_jk, D_jl)

  17. Complete Linkage • The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster • It is strongly biased toward producing clusters with roughly equal diameters and can be severely distorted by moderate outliers

  18. Single Linkage
The distance between two clusters is defined by
  D_kl = min_{i∈C_k} min_{j∈C_l} d(x_i, x_j)
The combinatorial formula is
  D_jm = min(D_jk, D_jl)

  19. Single Linkage • The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster • It sacrifices performance in the recovery of compact clusters in return for the ability to detect elongated and irregular clusters

  20. Ward's Minimum-Variance Method
The distance between two clusters is defined by
  D_kl = B_kl = ||x̄_k − x̄_l||² / (1/N_k + 1/N_l)
If d(x, y) = ||x − y||², then the combinatorial formula is
  D_jm = [ (N_j + N_k) D_jk + (N_j + N_l) D_jl − N_j D_kl ] / (N_j + N_m)

  21. Ward's Minimum-Variance Method • The distance between two clusters is the ANOVA sum of squares between the two clusters added up over all the variables • It tends to join clusters with a small number of observations • It is strongly biased toward producing clusters with roughly the same number of observations • It is also very sensitive to outliers

  22. Assumptions for WMVM (Ward's Minimum-Variance Method) • Multivariate normal mixture • Equal spherical covariance matrices • Equal sampling probabilities

  23. Remarks • Single linkage tends to lead to the formation of long, straggly clusters • Average linkage, complete linkage, and Ward's method often find spherical clusters even when the data appear to contain clusters of other shapes
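A small illustration of this remark on synthetic data (the two parallel bands and the random seed are assumptions of the sketch, not from the lecture): single linkage tends to recover the two elongated bands, while Ward's method, which favours compact spherical clusters, may cut across them.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
t = rng.uniform(0, 10, 100)
band1 = np.c_[t, 0.1 * rng.normal(size=100)]        # long, thin cluster
band2 = np.c_[t, 3 + 0.1 * rng.normal(size=100)]    # a parallel thin cluster
X = np.vstack([band1, band2])

for method in ("single", "ward"):
    labels = fcluster(linkage(X, method=method), t=2, criterion="maxclust")
    print(method, np.bincount(labels)[1:])           # sizes of the 2 clusters found
```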

  24. McQuitty's Similarity Analysis
The combinatorial formula is
  D_jm = (D_jk + D_jl) / 2
Median Method
If d(x, y) = ||x − y||², then the combinatorial formula is
  D_jm = (D_jk + D_jl) / 2 − D_kl / 4

  25. Kth-nearest Neighbor Method • Prespecify k • Let r_k(x) be the distance from point x to the kth nearest observation • Consider a closed sphere centered at x with radius r_k(x), say S_k(x)

  26. Kth-nearest Neighbor Method • The estimated density at x is defined by
  f(x) = k / (n · vol(S_k(x)))
• For any two observations x_i and x_j,
  d*(x_i, x_j) = (1/2) (1/f(x_i) + 1/f(x_j)),  if d(x_i, x_j) ≤ max(r_k(x_i), r_k(x_j));
  d*(x_i, x_j) = ∞,  otherwise
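A sketch of the density estimate in two dimensions, assuming f(x) = k / (n · vol S_k(x)) as filled in above (the data and the choice k = 10 are made up for the example; in 2-D the volume of S_k(x) is π r_k(x)²).

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
n, k = len(X), 10

tree = cKDTree(X)
# query returns the point itself at distance 0, so ask for k+1 neighbours;
# the last column is r_k(x), the distance to the kth nearest other observation
r_k = tree.query(X, k=k + 1)[0][:, -1]
f = k / (n * np.pi * r_k**2)        # estimated density at each observation
print(f[:5])
```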

  27. K-Means Algorithm • It is intended for use with large data sets, from approximately 100 to 100,000 observations • With small data sets, the results may be highly sensitive to the order of the observations in the data set • It combines an effective method for finding initial clusters with a standard iterative algorithm for minimizing the sum of squared distances from the cluster means

  28. K-Means Algorithm • Specify the number of clusters, say k • A set of k points called cluster seeds is selected as a first guess of the means of the k clusters • Each observation is assigned to the nearest seed to form temporary clusters • The seeds are then replaced by the means of the temporary clusters • The process is repeated until no further changes occur in the clusters
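A minimal sketch of these steps in NumPy (assumptions: the first k observations serve as seeds and no cluster becomes empty during the iterations; the fuller seed-selection rules appear on the next slides).

```python
import numpy as np

def k_means(X, k, max_iter=100):
    seeds = X[:k].copy()                              # first guess of the k means
    for _ in range(max_iter):
        # assign each observation to the nearest seed (temporary clusters)
        d2 = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # replace the seeds by the means of the temporary clusters
        new_seeds = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_seeds, seeds):             # no further changes
            break
        seeds = new_seeds
    return labels, seeds

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, means = k_means(X, k=2)
print(np.bincount(labels), means.round(2))
```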

  29. Cluster Seeds • Select the first complete (no missing values) observation as the first seed • The next complete observation that is separated from the first seed by at least the prespecified distance becomes the second seed • Later observations are selected as new seeds if they are separated from all previous seeds by at least the radius, as long as the maximum number of seeds is not exceeded

  30. Cluster Seeds If an observation is complete but fails to qualify as a new seed, two tests can be made to see if the observation can replace one of the old seeds

  31. Cluster Seeds (cont.) • An old seed is replaced if the distance between the observation and the closest seed is greater than the minimum distance between seeds. The seed to be replaced is chosen from the two seeds that are closest to each other: of these two, the one replaced is the one with the shorter distance to the closest of the remaining seeds when the other seed is replaced by the current observation

  32. Cluster Seeds (cont.) • If the observation fails the first test for seed replacement, a second test is made. The observation replaces the nearest seed if the smallest distance from the observation to all seeds other than the nearest one is greater than the shortest distance from the nearest seed to all other seeds. If this test also fails, go on to the next observation.
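A sketch of the basic acceptance rule from slide 29 (the two replacement tests on slides 31 and 32 are omitted; the function name and the radius value are illustrative, and all observations are assumed complete).

```python
import numpy as np

def select_seeds(X, k, radius):
    """Pick up to k seeds; each new seed must be at least `radius` from all previous seeds."""
    seeds = [X[0]]                                   # first (complete) observation
    for x in X[1:]:
        if len(seeds) == k:                          # maximum number of seeds reached
            break
        if all(np.linalg.norm(x - s) >= radius for s in seeds):
            seeds.append(x)
    return np.array(seeds)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
print(select_seeds(X, k=3, radius=4.0))
```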

  33. Dissimilarity Matrices An n × n dissimilarity matrix with entries d(i, j), where d(i, j) = d(j, i) measures the “difference” or dissimilarity between the objects i and j.

  34. Dissimilarity Matrices d usually satisfies • d(i, i) = 0 • d(i, j) ≥ 0 • d(i, j) = d(j, i)

  35. Dissimilarity Interval-scaled variables: continuous measurements on a (roughly) linear scale (temperature, height, weight, etc.)

  36. Dissimilarity (cont.) • The choice of measurement units strongly affects the resulting clustering • The variable with the largest dispersion will have the largest impact on the clustering • If all variables are considered equally important, the data need to be standardized first

  37. Standardization Replace each value x_if by z_if = (x_if − m_f) / s_f, where m_f is a location estimate and s_f a spread estimate for variable f. Common choices for s_f : • Mean absolute deviation (robust) : s_f = (1/n) Σ_i |x_if − m_f|, with m_f the mean • Median absolute deviation (robust) : s_f = median_i |x_if − m_f|, with m_f the median • Usual standard deviation
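A sketch of the three spread measures as standardizations of a single variable (the function name, argument names, and data are illustrative).

```python
import numpy as np

def standardize(x, method="std"):
    """Standardize one variable by one of the spread measures listed above."""
    x = np.asarray(x, dtype=float)
    if method == "mean_abs":          # mean absolute deviation (robust)
        m = x.mean()
        s = np.abs(x - m).mean()
    elif method == "median_abs":      # median absolute deviation (robust)
        m = np.median(x)
        s = np.median(np.abs(x - m))
    else:                             # usual standard deviation
        m, s = x.mean(), x.std()
    return (x - m) / s

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(standardize(x, "mean_abs"))
print(standardize(x, "median_abs"))
```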

  38. Continuous Ordinal Variables These are continuous measurements on an unknown scale, or where only the ordering is known but not the actual magnitude. • Replace the x_if by their ranks r_if ∈ {1, …, M_f} • Transform the scale to [0, 1] as follows : z_if = (r_if − 1) / (M_f − 1) • Compute the dissimilarities as for interval-scaled variables
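A sketch of the rank transformation, assuming z_if = (r_if − 1)/(M_f − 1) with M_f the highest rank, as filled in above (the data are made up for the example).

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([3.2, 1.5, 4.8, 1.5, 2.0])
r = rankdata(x, method="dense")      # ranks 1..M_f (tied values share a rank)
z = (r - 1) / (r.max() - 1)          # mapped onto [0, 1]
print(z)
```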

  39. Ratio-Scaled Variables These are positive continuous measurements on a nonlinear scale, such as an exponential scale. One example would be the growth of a bacterial population (say, with a growth function Ae^(Bt)). • Treat them simply as interval-scaled variables, though this is not recommended as it can distort the measurement scale • Treat them as continuous ordinal data • First transform the data (perhaps by taking logarithms), and then treat the results as interval-scaled variables

  40. Discrete Ordinal Variables A variable of this type has M possible values (scores) which are ordered. The dissimilarities are computed in the same way as for continuous ordinal variables.

  41. Nominal Variables • Such a variable has M possible values, which are not ordered. • The dissimilarity between objects i and j is usually defined as d(i, j) = (p − m) / p, where m is the number of variables on which i and j take the same value and p is the total number of variables
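A sketch of this simple-matching rule, d(i, j) = (p − m)/p (the objects and attribute values are made up for the example).

```python
import numpy as np

def nominal_dissimilarity(obj_i, obj_j):
    """Fraction of nominal variables on which the two objects disagree."""
    obj_i, obj_j = np.asarray(obj_i), np.asarray(obj_j)
    return np.mean(obj_i != obj_j)

print(nominal_dissimilarity(["red", "round", "small"],
                            ["red", "square", "small"]))   # (3 - 2) / 3
```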

  42. Symmetric Binary Variables Two possible values, coded 0 and 1, which are equally important (e.g., male and female). Consider the contingency table of the objects i and j :

              object j
               1      0
object i  1    a      b
          0    c      d

The dissimilarity is usually taken as the simple matching coefficient d(i, j) = (b + c) / (a + b + c + d)

  43. Asymmetric Binary Variables Two possible values, one of which carries more importance than the other. The most meaningful outcome is coded as 1, and the less meaningful outcome as 0. Typically, 1 stands for the presence of a certain attribute (e.g., a particular disease), and 0 for its absence.

  44. Asymmetric Binary Variables The dissimilarity is usually taken as the Jaccard-type coefficient, which ignores the 0-0 matches: d(i, j) = (b + c) / (a + b + c)
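A sketch covering both binary cases from the 2×2 counts a, b, c, d defined above: the simple-matching coefficient (b + c)/(a + b + c + d) for symmetric variables and the Jaccard-type coefficient (b + c)/(a + b + c) for asymmetric ones (function name and data are illustrative).

```python
import numpy as np

def binary_dissimilarity(x, y, asymmetric=False):
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    a = np.sum(x & y)      # 1-1 matches
    b = np.sum(x & ~y)     # 1-0 mismatches
    c = np.sum(~x & y)     # 0-1 mismatches
    d = np.sum(~x & ~y)    # 0-0 matches (ignored in the asymmetric case)
    denom = (a + b + c) if asymmetric else (a + b + c + d)
    return (b + c) / denom

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
print(binary_dissimilarity(x, y), binary_dissimilarity(x, y, asymmetric=True))
```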

  45. Cluster Analysis of Flying Mileages Between 10 American Cities

   0                                                              ATLANTA
 587    0                                                         CHICAGO
1212  920    0                                                    DENVER
 701  940  879    0                                               HOUSTON
1936 1745  831 1374    0                                          LOS ANGELES
 604 1188 1726  968 2339    0                                     MIAMI
 748  713 1631 1420 2451 1092    0                                NEW YORK
2139 1858  949 1645  347 2594 2571    0                           SAN FRANCISCO
2182 1737 1021 1891  959 2734 2408  678    0                      SEATTLE
 543  597 1494 1220 2300  923  205 2442 2329    0                 WASHINGTON D.C.

  46. The CLUSTER Procedure: Average Linkage Cluster Analysis
Root-Mean-Square Distance Between Observations = 1580.242

Cluster History
NCL  Clusters Joined                  FREQ   PSF   PST2   Norm RMS Dist   Tie
  9  NEW YORK        WASHINGTON D.C.    2    66.7    .       0.1297
  8  LOS ANGELES     SAN FRANCISCO      2    39.2    .       0.2196
  7  ATLANTA         CHICAGO            2    21.7    .       0.3715
  6  CL7             CL9                4    14.5   3.4      0.4149
  5  CL8             SEATTLE            3    12.4   7.3      0.5255
  4  DENVER          HOUSTON            2    13.9    .       0.5562
  3  CL6             MIAMI              5    15.5   3.8      0.6185
  2  CL3             CL4                7    16.0   5.3      0.8005
  1  CL2             CL5               10      .   16.0      1.2967
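The same flying-mileage matrix can be fed to SciPy's average linkage as a cross-check on the SAS output above (a sketch; SciPy reports raw merge distances rather than the normalized column, so the first NEW YORK-WASHINGTON D.C. merge appears as 205 rather than 0.1297 ≈ 205/1580.242).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ["ATLANTA", "CHICAGO", "DENVER", "HOUSTON", "LOS ANGELES",
          "MIAMI", "NEW YORK", "SAN FRANCISCO", "SEATTLE", "WASHINGTON D.C."]
lower = [
    [0],
    [587, 0],
    [1212, 920, 0],
    [701, 940, 879, 0],
    [1936, 1745, 831, 1374, 0],
    [604, 1188, 1726, 968, 2339, 0],
    [748, 713, 1631, 1420, 2451, 1092, 0],
    [2139, 1858, 949, 1645, 347, 2594, 2571, 0],
    [2182, 1737, 1021, 1891, 959, 2734, 2408, 678, 0],
    [543, 597, 1494, 1220, 2300, 923, 205, 2442, 2329, 0],
]
D = np.zeros((10, 10))
for i, row in enumerate(lower):
    D[i, : len(row)] = row
D = D + D.T                                    # full symmetric distance matrix

Z = linkage(squareform(D), method="average")   # condensed distances in, merge history out
print(Z)
```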

  47. Average Linkage Cluster Analysis

  48. The CLUSTER Procedure: Centroid Hierarchical Cluster Analysis
Root-Mean-Square Distance Between Observations = 1580.242

Cluster History
NCL  Clusters Joined                  FREQ   PSF   PST2   Norm Cent Dist   Tie
  9  NEW YORK        WASHINGTON D.C.    2    66.7    .       0.1297
  8  LOS ANGELES     SAN FRANCISCO      2    39.2    .       0.2196
  7  ATLANTA         CHICAGO            2    21.7    .       0.3715
  6  CL7             CL9                4    14.5   3.4      0.3652
  5  CL8             SEATTLE            3    12.4   7.3      0.5139
  4  DENVER          CL5                4    12.4   2.1      0.5337
  3  CL6             MIAMI              5    14.2   3.8      0.5743
  2  CL3             HOUSTON            6    22.1   2.6      0.6091
  1  CL2             CL4               10      .   22.1      1.173

  49. Centroid Hierarchical Cluster Analysis

  50. The CLUSTER Procedure: Single Linkage Cluster Analysis
Mean Distance Between Observations = 1417.133

Cluster History
NCL  Clusters Joined                  FREQ   Norm Min Dist   Tie
  9  NEW YORK        WASHINGTON D.C.    2       0.1447
  8  LOS ANGELES     SAN FRANCISCO      2       0.2449
  7  ATLANTA         CL9                3       0.3832
  6  CL7             CHICAGO            4       0.4142
  5  CL6             MIAMI              5       0.4262
  4  CL8             SEATTLE            3       0.4784
  3  CL5             HOUSTON            6       0.4947
  2  DENVER          CL4                4       0.5864
  1  CL3             CL2               10       0.6203
