
Lecture 6



  1. Lecture 6 Statistical Lecture ─ Cluster Analysis

  2. Cluster Analysis • Grouping similar objects to produce a classification • Useful when the structure of the data is unknown a priori • Involves assessing the relative distances between points

  3. Clustering Algorithms • Partitioning : • Divide the data set into k clusters, where k needs to be specified beforehand (e.g., k-means)

  4. Clustering Algorithms • Hierarchical : • Agglomerative methods : • Start with the situation where each object forms its own little cluster, and then successively merge clusters until only one large cluster is left • Divisive methods : • Start by considering the whole data set as one cluster, and then split up clusters until each object is separate

  5. Caution • Most users are interested in the main structure of their data, consisting of a few large clusters • When forming larger clusters, agglomerative methods may make wrong decisions in the first steps (once an early merge is wrong, it cannot be undone, so the rest of the hierarchy is affected) • For divisive methods, the larger clusters are determined first, so they are less likely to suffer from wrong decisions in the early steps

  6. Agglomerative Hierarchical Clustering Procedure (1) Each observation begins in a cluster by itself (2) The two closest clusters are merged to form a new cluster that replaces the two old clusters (3) Repeat (2) until only one cluster is left The various clustering methods differ in how the distance between two clusters is computed.
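A minimal sketch of this procedure, assuming Python with NumPy and SciPy (the lecture's own example output comes from SAS PROC CLUSTER, not Python). The linkage() call carries out steps (1) to (3); its method argument selects the between-cluster distance rule discussed on the following slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))      # 20 observations, 2 variables (made-up data)

# Step (1): every observation starts as its own cluster.
# Steps (2)-(3): linkage() repeatedly merges the two closest clusters until
# only one cluster remains; "average" selects the between-cluster distance rule.
Z = linkage(X, method="average")
print(Z[:3])                      # each row: [cluster i, cluster j, distance, new size]
```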

  7. Remarks • For coordinate data, variables with large variances tend to have more effect on the resulting clusters than those with small variances • Scaling or transforming the variables might be needed • Standardization (standardize the variables to mean 0 and standard deviation 1) or principal components is useful but not always appropriate • Outliers should be removed before analysis

  8. Remarks (cont.) • Nonlinear transformations of the variables may change the number of population clusters and should therefore be approached with caution • For most applications, the variables should be transformed so that equal differences are of equal practical importance • An interval scale of measurement is required if raw data are used as input. Ordinal or ranked coordinate data are generally not appropriate

  9. Notation
n      number of observations
v      number of variables, if the data are coordinates
G      number of clusters at any given level of the hierarchy
x_i    the ith observation
C_k    the kth cluster, a subset of {1, 2, …, n}
N_k    number of observations in C_k

  10. Notation (cont.)
x̄      sample mean vector
x̄_k    mean vector for cluster C_k
||x||  Euclidean length of the vector x, that is, the square root of the sum of the squares of the elements of x
T      total sum of squares, T = Σ_{i=1}^{n} ||x_i − x̄||²
W_k    within-cluster sum of squares for C_k, W_k = Σ_{i∈C_k} ||x_i − x̄_k||²

  11. Notation (cont.)
P_G     Σ W_j, where the summation is over the G clusters at the Gth level of the hierarchy
B_kl    W_m − W_k − W_l, if C_m = C_k ∪ C_l
d(x, y) any distance or dissimilarity measure between observations or vectors x and y
D_kl    any distance or dissimilarity measure between clusters C_k and C_l

  12. Clustering Method ─ Average Linkage
The distance between two clusters is defined by
  D_kl = (1 / (N_k N_l)) Σ_{i∈C_k} Σ_{j∈C_l} d(x_i, x_j)
If d(x, y) = ||x − y||², then
  D_kl = W_k/N_k + W_l/N_l + ||x̄_k − x̄_l||²
The combinatorial formula is
  D_jm = (N_k D_jk + N_l D_jl) / N_m,  if C_m = C_k ∪ C_l

  13. Average Linkage • The distance between clusters is the average distance between pairs of observations, one in each cluster • It tends to join clusters with small variance and is slightly biased toward producing clusters with the same variance
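A small sketch of this definition (the two clusters C_K and C_L are made up for the example): the average-linkage distance is simply the mean of all N_K · N_L pairwise distances, one observation from each cluster.

```python
import numpy as np
from scipy.spatial.distance import cdist

C_K = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster C_K (2 observations)
C_L = np.array([[4.0, 0.0], [5.0, 1.0]])   # cluster C_L (2 observations)

# D_KL = (1 / (N_K * N_L)) * sum of d(x_i, x_j) over all pairs
D_KL = cdist(C_K, C_L).mean()
print(D_KL)
```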

  14. Centroid Method
The distance between two clusters is defined by
  D_kl = ||x̄_k − x̄_l||²
If d(x, y) = ||x − y||², then the combinatorial formula is
  D_jm = (N_k D_jk + N_l D_jl) / N_m − (N_k N_l D_kl) / N_m²

  15. Centroid Method • The distance between two clusters is defined as the squared Euclidean distance between their centroids or means • It is more robust to outliers than most other hierarchical methods but in other respects may not perform as well as Ward’s method or average linkage

  16. Complete Linkage
The distance between two clusters is defined by
  D_kl = max_{i∈C_k} max_{j∈C_l} d(x_i, x_j)
The combinatorial formula is
  D_jm = max(D_jk, D_jl)

  17. Complete Linkage • The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster • It is strongly biased toward producing clusters with roughly equal diameters and can be severely distorted by moderate outliers

  18. Single Linkage
The distance between two clusters is defined by
  D_kl = min_{i∈C_k} min_{j∈C_l} d(x_i, x_j)
The combinatorial formula is
  D_jm = min(D_jk, D_jl)

  19. Single Linkage • The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster • It sacrifices performance in the recovery of compact clusters in return for the ability to detect elongated and irregular clusters

  20. Ward's Minimum-Variance Method
The distance between two clusters is defined by
  D_kl = B_kl = ||x̄_k − x̄_l||² / (1/N_k + 1/N_l)
If d(x, y) = ||x − y||², then the combinatorial formula is
  D_jm = [ (N_j + N_k) D_jk + (N_j + N_l) D_jl − N_j D_kl ] / (N_j + N_m)

  21. Ward's Minimum-Variance Method • The distance between two clusters is the ANOVA sum of squares between the two clusters added up over all the variables • It tends to join clusters with a small number of observations • It is strongly biased toward producing clusters with roughly the same number of observations • It is also very sensitive to outliers

  22. Assumptions for WMVM (Ward's Minimum-Variance Method) • Multivariate normal mixture • Equal spherical covariance matrices • Equal sampling probabilities

  23. Remarks • Single linkage tends to lead to the formation of long, straggly clusters • Average linkage, complete linkage, and Ward's method often find spherical clusters even when the data appear to contain clusters of other shapes
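A small illustration of this remark on synthetic data (the two parallel bands and the random seed are assumptions of the sketch, not from the lecture): single linkage tends to recover the two elongated bands, while Ward's method, which favours compact spherical clusters, may cut across them.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
t = rng.uniform(0, 10, 100)
band1 = np.c_[t, 0.1 * rng.normal(size=100)]        # long, thin cluster
band2 = np.c_[t, 3 + 0.1 * rng.normal(size=100)]    # a parallel thin cluster
X = np.vstack([band1, band2])

for method in ("single", "ward"):
    labels = fcluster(linkage(X, method=method), t=2, criterion="maxclust")
    print(method, np.bincount(labels)[1:])           # sizes of the 2 clusters found
```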

  24. McQuitty's Similarity Analysis
The combinatorial formula is
  D_jm = (D_jk + D_jl) / 2
Median Method
If d(x, y) = ||x − y||², then the combinatorial formula is
  D_jm = (D_jk + D_jl) / 2 − D_kl / 4

  25. Kth-nearest Neighbor Method • Prespecify k • Let r_k(x) be the distance from point x to the kth nearest observation • Consider a closed sphere centered at x with radius r_k(x), say S_k(x)

  26. Kth-nearest Neighbor Method • The estimated density at x is defined by
  f(x) = k / (n · vol(S_k(x)))
• For any two observations x_i and x_j,
  d*(x_i, x_j) = (1/2) (1/f(x_i) + 1/f(x_j)),  if d(x_i, x_j) ≤ max(r_k(x_i), r_k(x_j));
  d*(x_i, x_j) = ∞,  otherwise
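A sketch of the density estimate in two dimensions, assuming f(x) = k / (n · vol S_k(x)) as filled in above (the data and the choice k = 10 are made up for the example; in 2-D the volume of S_k(x) is π r_k(x)²).

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
n, k = len(X), 10

tree = cKDTree(X)
# query returns the point itself at distance 0, so ask for k+1 neighbours;
# the last column is r_k(x), the distance to the kth nearest other observation
r_k = tree.query(X, k=k + 1)[0][:, -1]
f = k / (n * np.pi * r_k**2)        # estimated density at each observation
print(f[:5])
```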

  27. K-Means Algorithm • It is intended for use with large data sets, from approximately 100 to 100,000 observations • With small data sets, the results may be highly sensitive to the order of the observations in the data set • It combines an effective method for finding initial clusters with a standard iterative algorithm for minimizing the sum of squared distances from the cluster means

  28. K-Means Algorithm • Specify the number of clusters, say k • A set of k points called cluster seeds is selected as a first guess of the means of the k clusters • Each observation is assigned to the nearest seed to form temporary clusters • The seeds are then replaced by the means of the temporary clusters • The process is repeated until no further changes occur in the clusters
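A minimal sketch of these steps in NumPy (assumptions: the first k observations serve as seeds and no cluster becomes empty during the iterations; the fuller seed-selection rules appear on the next slides).

```python
import numpy as np

def k_means(X, k, max_iter=100):
    seeds = X[:k].copy()                              # first guess of the k means
    for _ in range(max_iter):
        # assign each observation to the nearest seed (temporary clusters)
        d2 = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # replace the seeds by the means of the temporary clusters
        new_seeds = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_seeds, seeds):             # no further changes
            break
        seeds = new_seeds
    return labels, seeds

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, means = k_means(X, k=2)
print(np.bincount(labels), means.round(2))
```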

  29. Cluster Seeds • Select the first complete (no missing values) observation as the first seed • The next complete observation that is separated from the first seed by at least the prespecified distance becomes the second seed • Later observations are selected as new seeds if they are separated from all previous seeds by at least the radius, as long as the maximum number of seeds is not exceeded

  30. Cluster Seeds If an observation is complete but fails to qualify as a new seed, two tests can be made to see if the observation can replace one of the old seeds

  31. Cluster Seeds (cont.) • An old seed is replaced if the distance between the observation and the closest seed is greater than the minimum distance between seeds. The seed to be replaced is chosen from the two seeds that are closest to each other: of these two, the one replaced is the one with the shorter distance to the closest of the remaining seeds when the other seed is replaced by the current observation

  32. Cluster Seeds (cont.) • If the observation fails the first test for seed replacement, a second test is made. The observation replaces the nearest seed if the smallest distance from the observation to all seeds other than the nearest one is greater than the shortest distance from the nearest seed to all other seeds. If this test also fails, go on to the next observation.
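A sketch of the basic acceptance rule from slide 29 (the two replacement tests on slides 31 and 32 are omitted; the function name and the radius value are illustrative, and all observations are assumed complete).

```python
import numpy as np

def select_seeds(X, k, radius):
    """Pick up to k seeds; each new seed must be at least `radius` from all previous seeds."""
    seeds = [X[0]]                                   # first (complete) observation
    for x in X[1:]:
        if len(seeds) == k:                          # maximum number of seeds reached
            break
        if all(np.linalg.norm(x - s) >= radius for s in seeds):
            seeds.append(x)
    return np.array(seeds)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
print(select_seeds(X, k=3, radius=4.0))
```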

  33. Dissimilarity Matrices An n × n dissimilarity matrix with entries d(i, j), where d(i, j) = d(j, i) measures the “difference” or dissimilarity between the objects i and j.

  34. Dissimilarity Matrices d usually satisfies • d(i, i) = 0 • d(i, j) ≥ 0 • d(i, j) = d(j, i)

  35. Dissimilarity Interval-scaled variables: continuous measurements on a (roughly) linear scale (temperature, height, weight, etc.)

  36. Dissimilarity (cont.) • The choice of measurement units strongly affects the resulting clustering • The variable with the largest dispersion will have the largest impact on the clustering • If all variables are considered equally important, the data need to be standardized first

  37. Standardization Replace each value x_if by z_if = (x_if − m_f) / s_f, where m_f is a location estimate and s_f a spread estimate for variable f. Common choices for s_f : • Mean absolute deviation (robust) : s_f = (1/n) Σ_i |x_if − m_f|, with m_f the mean • Median absolute deviation (robust) : s_f = median_i |x_if − m_f|, with m_f the median • Usual standard deviation
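A sketch of the three spread measures as standardizations of a single variable (the function name, argument names, and data are illustrative).

```python
import numpy as np

def standardize(x, method="std"):
    """Standardize one variable by one of the spread measures listed above."""
    x = np.asarray(x, dtype=float)
    if method == "mean_abs":          # mean absolute deviation (robust)
        m = x.mean()
        s = np.abs(x - m).mean()
    elif method == "median_abs":      # median absolute deviation (robust)
        m = np.median(x)
        s = np.median(np.abs(x - m))
    else:                             # usual standard deviation
        m, s = x.mean(), x.std()
    return (x - m) / s

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(standardize(x, "mean_abs"))
print(standardize(x, "median_abs"))
```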

  38. Continuous Ordinal Variables These are continuous measurements on an unknown scale, or where only the ordering is known but not the actual magnitude. • Replace the x_if by their ranks r_if ∈ {1, …, M_f} • Transform the scale to [0, 1] as follows : z_if = (r_if − 1) / (M_f − 1) • Compute the dissimilarities as for interval-scaled variables
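A sketch of the rank transformation, assuming z_if = (r_if − 1)/(M_f − 1) with M_f the highest rank, as filled in above (the data are made up for the example).

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([3.2, 1.5, 4.8, 1.5, 2.0])
r = rankdata(x, method="dense")      # ranks 1..M_f (tied values share a rank)
z = (r - 1) / (r.max() - 1)          # mapped onto [0, 1]
print(z)
```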

  39. Ratio-Scaled Variables These are positive continuous measurements on a nonlinear scale, such as an exponential scale. One example would be the growth of a bacterial population (say, with a growth function Ae^(Bt)). • Treat them simply as interval-scaled variables, though this is not recommended as it can distort the measurement scale • Treat them as continuous ordinal data • First transform the data (perhaps by taking logarithms), and then treat the results as interval-scaled variables

  40. Discrete Ordinal Variables A variable of this type has M possible values (scores) which are ordered. The dissimilarities are computed in the same way as for continuous ordinal variables.

  41. Nominal Variables • Such a variable has M possible values, which are not ordered. • The dissimilarity between objects i and j is usually defined as d(i, j) = (p − m) / p, where m is the number of variables on which i and j take the same value and p is the total number of variables
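A sketch of this simple-matching rule, d(i, j) = (p − m)/p (the objects and attribute values are made up for the example).

```python
import numpy as np

def nominal_dissimilarity(obj_i, obj_j):
    """Fraction of nominal variables on which the two objects disagree."""
    obj_i, obj_j = np.asarray(obj_i), np.asarray(obj_j)
    return np.mean(obj_i != obj_j)

print(nominal_dissimilarity(["red", "round", "small"],
                            ["red", "square", "small"]))   # (3 - 2) / 3
```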

  42. Symmetric Binary Variables Two possible values, coded 0 and 1, which are equally important (e.g., male and female). Consider the contingency table of the objects i and j :

              object j
               1      0
object i  1    a      b
          0    c      d

The dissimilarity is usually taken as the simple matching coefficient d(i, j) = (b + c) / (a + b + c + d)

  43. Asymmetric Binary Variables Two possible values, one of which carries more importance than the other. The most meaningful outcome is coded as 1, and the less meaningful outcome as 0. Typically, 1 stands for the presence of a certain attribute (e.g., a particular disease), and 0 for its absence.

  44. Asymmetric Binary Variables The dissimilarity is usually taken as the Jaccard-type coefficient, which ignores the 0-0 matches: d(i, j) = (b + c) / (a + b + c)
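A sketch covering both binary cases from the 2×2 counts a, b, c, d defined above: the simple-matching coefficient (b + c)/(a + b + c + d) for symmetric variables and the Jaccard-type coefficient (b + c)/(a + b + c) for asymmetric ones (function name and data are illustrative).

```python
import numpy as np

def binary_dissimilarity(x, y, asymmetric=False):
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    a = np.sum(x & y)      # 1-1 matches
    b = np.sum(x & ~y)     # 1-0 mismatches
    c = np.sum(~x & y)     # 0-1 mismatches
    d = np.sum(~x & ~y)    # 0-0 matches (ignored in the asymmetric case)
    denom = (a + b + c) if asymmetric else (a + b + c + d)
    return (b + c) / denom

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
print(binary_dissimilarity(x, y), binary_dissimilarity(x, y, asymmetric=True))
```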

  45. Cluster Analysis of Flying Mileages Between 10 American Cities

   0                                                              ATLANTA
 587    0                                                         CHICAGO
1212  920    0                                                    DENVER
 701  940  879    0                                               HOUSTON
1936 1745  831 1374    0                                          LOS ANGELES
 604 1188 1726  968 2339    0                                     MIAMI
 748  713 1631 1420 2451 1092    0                                NEW YORK
2139 1858  949 1645  347 2594 2571    0                           SAN FRANCISCO
2182 1737 1021 1891  959 2734 2408  678    0                      SEATTLE
 543  597 1494 1220 2300  923  205 2442 2329    0                 WASHINGTON D.C.

  46. The CLUSTER Procedure: Average Linkage Cluster Analysis
Root-Mean-Square Distance Between Observations = 1580.242

Cluster History
NCL  Clusters Joined                  FREQ   PSF   PST2   Norm RMS Dist   Tie
  9  NEW YORK        WASHINGTON D.C.    2    66.7    .       0.1297
  8  LOS ANGELES     SAN FRANCISCO      2    39.2    .       0.2196
  7  ATLANTA         CHICAGO            2    21.7    .       0.3715
  6  CL7             CL9                4    14.5   3.4      0.4149
  5  CL8             SEATTLE            3    12.4   7.3      0.5255
  4  DENVER          HOUSTON            2    13.9    .       0.5562
  3  CL6             MIAMI              5    15.5   3.8      0.6185
  2  CL3             CL4                7    16.0   5.3      0.8005
  1  CL2             CL5               10      .   16.0      1.2967
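The same flying-mileage matrix can be fed to SciPy's average linkage as a cross-check on the SAS output above (a sketch; SciPy reports raw merge distances rather than the normalized column, so the first NEW YORK-WASHINGTON D.C. merge appears as 205 rather than 0.1297 ≈ 205/1580.242).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ["ATLANTA", "CHICAGO", "DENVER", "HOUSTON", "LOS ANGELES",
          "MIAMI", "NEW YORK", "SAN FRANCISCO", "SEATTLE", "WASHINGTON D.C."]
lower = [
    [0],
    [587, 0],
    [1212, 920, 0],
    [701, 940, 879, 0],
    [1936, 1745, 831, 1374, 0],
    [604, 1188, 1726, 968, 2339, 0],
    [748, 713, 1631, 1420, 2451, 1092, 0],
    [2139, 1858, 949, 1645, 347, 2594, 2571, 0],
    [2182, 1737, 1021, 1891, 959, 2734, 2408, 678, 0],
    [543, 597, 1494, 1220, 2300, 923, 205, 2442, 2329, 0],
]
D = np.zeros((10, 10))
for i, row in enumerate(lower):
    D[i, : len(row)] = row
D = D + D.T                                    # full symmetric distance matrix

Z = linkage(squareform(D), method="average")   # condensed distances in, merge history out
print(Z)
```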

  47. Average Linkage Cluster Analysis

  48. The CLUSTER Procedure: Centroid Hierarchical Cluster Analysis
Root-Mean-Square Distance Between Observations = 1580.242

Cluster History
NCL  Clusters Joined                  FREQ   PSF   PST2   Norm Cent Dist   Tie
  9  NEW YORK        WASHINGTON D.C.    2    66.7    .       0.1297
  8  LOS ANGELES     SAN FRANCISCO      2    39.2    .       0.2196
  7  ATLANTA         CHICAGO            2    21.7    .       0.3715
  6  CL7             CL9                4    14.5   3.4      0.3652
  5  CL8             SEATTLE            3    12.4   7.3      0.5139
  4  DENVER          CL5                4    12.4   2.1      0.5337
  3  CL6             MIAMI              5    14.2   3.8      0.5743
  2  CL3             HOUSTON            6    22.1   2.6      0.6091
  1  CL2             CL4               10      .   22.1      1.173

  49. Centroid Hierarchical Cluster Analysis

  50. The CLUSTER Procedure: Single Linkage Cluster Analysis
Mean Distance Between Observations = 1417.133

Cluster History
NCL  Clusters Joined                  FREQ   Norm Min Dist   Tie
  9  NEW YORK        WASHINGTON D.C.    2       0.1447
  8  LOS ANGELES     SAN FRANCISCO      2       0.2449
  7  ATLANTA         CL9                3       0.3832
  6  CL7             CHICAGO            4       0.4142
  5  CL6             MIAMI              5       0.4262
  4  CL8             SEATTLE            3       0.4784
  3  CL5             HOUSTON            6       0.4947
  2  DENVER          CL4                4       0.5864
  1  CL3             CL2               10       0.6203
