Cluster analysis

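Pohang University of Science and Technology (POSTECH)

Department of Industrial Engineering

Probability and Statistics Laboratory

이재현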

- Cluster analysis is a technique used for combining observations into groups or clusters such that
- Each group or cluster is homogeneous or compact with respect to certain characteristics
- Each group should be different from other groups with respect to the same characteristics

- Example
- A marketing manager is interested in identifying similar cities that can be used for test marketing
- The campaign manager for a political candidate is interested in identifying groups of voters who have similar views on important issues

- The objective of cluster analysis is to group observations into clusters such that each cluster is as homogeneous as possible with respect to the clustering variables
- Overview of cluster analysis
- Step 1: n objects measured on p variables
- Step 2: Transform to an n × n similarity (distance) matrix
- Step 3: Cluster formation (hierarchical or nonhierarchical clusters)
- Step 4: Cluster profile
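
Steps 1 and 2 of the overview can be sketched in Python as follows. The six observations and the helper names (`squared_euclidean`, `distance_matrix`) are illustrative assumptions, and squared Euclidean distance is only one possible choice of measure.

```python
# Step 1: n = 6 objects measured on p = 2 variables (hypothetical data).
data = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 20), (30, 19)]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two observations."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distance_matrix(rows):
    """Step 2: build the n x n matrix of pairwise squared distances."""
    return [[squared_euclidean(a, b) for b in rows] for a in rows]


D = distance_matrix(data)
for row in D:
    print(row)
```

The matrix is symmetric with zeros on the diagonal, so only the lower (or upper) triangle is needed for cluster formation in Step 3.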

- Measure of similarity
- Fundamental to the use of any clustering technique is the computation of a measure of similarity or distance between the respective objects.
- Distance-type measures – Euclidean distance for standardized data, Mahalanobis distance
- Matching-type measures – Association coefficients, correlation coefficients

- A procedure for forming the clusters
- Hierarchical clustering – Centroid method, Single-linkage method, Complete-linkage method, Average-linkage method, Ward’s method.
- Nonhierarchical clustering – k-means clustering

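- Minkowski metric
- The distance between observations i and j over p variables is $D_{ij} = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{r} \right)^{1/r}$
- If r = 2, it is the Euclidean distance
- If r = 1, it is the absolute (city-block) distance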
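
As a minimal sketch of the two special cases, the Minkowski metric can be written as a single Python function of the order r. The function name `minkowski` and the two observations are hypothetical.

```python
def minkowski(a, b, r):
    """Minkowski distance of order r between two equal-length observations."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)


a, b = (5, 5), (15, 14)          # hypothetical observations

euclidean = minkowski(a, b, 2)   # r = 2: sqrt(10^2 + 9^2) = sqrt(181)
absolute = minkowski(a, b, 1)    # r = 1: 10 + 9 = 19
print(euclidean, absolute)
```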


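- Euclidean distance for standardized data
- Standardizing the variables makes the distance scale invariant
- The squared Euclidean distance on each variable k is weighted by $1/s_k^2$, the reciprocal of that variable's variance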
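
The scale invariance can be checked with a short sketch. It assumes the weight on each variable is the reciprocal of its sample variance; the data and the name `standardized_sq_distance` are hypothetical.

```python
from statistics import variance

data = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 20), (30, 19)]  # hypothetical


def standardized_sq_distance(a, b, variances):
    """Squared Euclidean distance with each variable weighted by 1 / variance."""
    return sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances))


variances = [variance(col) for col in zip(*data)]
d = standardized_sq_distance(data[0], data[2], variances)

# Changing the unit of the first variable (x 1000) leaves the distance unchanged.
rescaled = [(x * 1000, y) for x, y in data]
variances_rescaled = [variance(col) for col in zip(*rescaled)]
d_rescaled = standardized_sq_distance(rescaled[0], rescaled[2], variances_rescaled)

print(d, d_rescaled)
```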

- Mahalanobis distance
- $MD_{ij}^2 = (x_i - x_j)' S^{-1} (x_i - x_j)$, where x is a p × 1 vector and S is the p × p covariance matrix
- It is designed to take into account the correlation among the variables and is also scale invariant.
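
A minimal sketch for the two-variable case (p = 2), where the inverse of S can be written out by hand, is given below. The function names and the data are hypothetical; for general p a linear algebra library would normally be used instead.

```python
from statistics import mean

data = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 20), (30, 19)]  # hypothetical


def covariance_2d(rows):
    """Return (s_xx, s_xy, s_yy), the entries of the 2 x 2 sample covariance matrix."""
    xs, ys = zip(*rows)
    mx, my = mean(xs), mean(ys)
    n = len(rows)
    s_xx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    s_yy = sum((y - my) ** 2 for y in ys) / (n - 1)
    s_xy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return s_xx, s_xy, s_yy


def mahalanobis_sq(a, b, cov):
    """Squared Mahalanobis distance (a - b)' S^{-1} (a - b) for p = 2."""
    s_xx, s_xy, s_yy = cov
    det = s_xx * s_yy - s_xy ** 2
    dx, dy = a[0] - b[0], a[1] - b[1]
    # S^{-1} = (1 / det) * [[s_yy, -s_xy], [-s_xy, s_xx]]
    return (s_yy * dx * dx - 2 * s_xy * dx * dy + s_xx * dy * dy) / det


md = mahalanobis_sq(data[0], data[2], covariance_2d(data))
print(md)
```

When the variables are uncorrelated (s_xy = 0), the expression reduces to the Euclidean distance for standardized data.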

- Association coefficients
- This type of measure is used to represent similarity for binary variables
- Similarity coefficients
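
The slide does not name a specific coefficient, so the sketch below assumes two widely used ones: the simple matching coefficient, (a + d) / (a + b + c + d), and the Jaccard coefficient, a / (a + b + c). Here a counts variables where both subjects score 1, d where both score 0, and b and c the two kinds of mismatch. The binary profiles are hypothetical.

```python
def match_counts(u, v):
    """Return (a, b, c, d) for two binary profiles of equal length."""
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)
    return a, b, c, d


def simple_matching(u, v):
    a, b, c, d = match_counts(u, v)
    return (a + d) / (a + b + c + d)


def jaccard(u, v):
    a, b, c, _ = match_counts(u, v)
    return a / (a + b + c)


u = [1, 0, 1, 1, 0]   # hypothetical binary profile of subject A
v = [1, 1, 0, 1, 0]   # hypothetical binary profile of subject B
print(simple_matching(u, v), jaccard(u, v))
```

The two coefficients differ only in whether joint absences (d) count as agreement.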

- Correlation coefficient
- The Pearson product-moment correlation coefficient is used as the measure of similarity.
- dAB = 1, dAC = 0.82
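
A minimal sketch of the correlation coefficient as a similarity measure between two subjects' profiles follows. The three profiles are hypothetical and are not the subjects A, B and C of the slide.

```python
import math


def pearson(u, v):
    """Pearson product-moment correlation between two profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    ss_u = sum((a - mu) ** 2 for a in u)
    ss_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(ss_u * ss_v)


profile_1 = [1, 2, 3, 4]
profile_2 = [2, 4, 6, 8]   # same shape as profile_1, different level
profile_3 = [4, 3, 2, 1]   # opposite shape

print(pearson(profile_1, profile_2), pearson(profile_1, profile_3))
```

Correlation compares the shape of two profiles and ignores differences in level, which is why profiles 1 and 2 are treated as perfectly similar even though their values differ.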

- Centroid method
- Each group is replaced by an average subject, which is the centroid of that group
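
The idea of replacing each group by its centroid can be sketched as follows. The three two-member clusters are hypothetical, and squared Euclidean distance between centroids is assumed as the measure.

```python
def centroid(points):
    """Average subject of a cluster: the mean of each variable."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


cluster_a = [(5, 5), (6, 6)]       # hypothetical clusters
cluster_b = [(15, 14), (16, 15)]
cluster_c = [(25, 20), (30, 19)]

ca, cb, cc = centroid(cluster_a), centroid(cluster_b), centroid(cluster_c)
d_ab = squared_euclidean(ca, cb)   # (5.5, 5.5) vs (15.5, 14.5)
d_bc = squared_euclidean(cb, cc)   # (15.5, 14.5) vs (27.5, 19.5)
print(ca, cb, cc, d_ab, d_bc)
```

Under the centroid method the pair of clusters with the smallest centroid distance is merged next, here clusters b and c.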

- Single-Linkage method
- The distance between two clusters is represented by the minimum of the distances between all possible pairs of subjects in the two clusters
- Worked example: pairwise distances of 181 and 145 give a cluster distance of 145; pairwise distances of 221 and 181 give 181

- Complete-Linkage method
- The distance between two clusters is defined as the maximum of the distances between all possible pairs of observations in the two clusters
- Worked example: pairwise distances of 181 and 145 give a cluster distance of 181; pairwise distances of 625 and 557 give 625

- Average-Linkage method
- The distance between two clusters is obtained by taking the average distance between all pairs of subjects in the two clusters
- Worked example: pairwise distances of 181 and 145 give a cluster distance of (181 + 145) / 2 = 163
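
The single-, complete- and average-linkage rules differ only in how they summarize the pairwise distances, which a short sketch makes explicit. The observations are hypothetical, chosen so that the pairwise squared distances are 181 and 145.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise(c1, c2):
    """All distances between a subject in c1 and a subject in c2."""
    return [squared_euclidean(a, b) for a in c1 for b in c2]


def single_linkage(c1, c2):
    return min(pairwise(c1, c2))


def complete_linkage(c1, c2):
    return max(pairwise(c1, c2))


def average_linkage(c1, c2):
    d = pairwise(c1, c2)
    return sum(d) / len(d)


merged = [(5, 5), (6, 6)]   # hypothetical two-subject cluster
other = [(15, 14)]          # hypothetical single-subject cluster

print(pairwise(merged, other))          # [181, 145]
print(single_linkage(merged, other))    # 145
print(complete_linkage(merged, other))  # 181
print(average_linkage(merged, other))   # 163.0
```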

- Ward’s method
- It forms clusters by maximizing within-cluster homogeneity. The within-group sum of squares is used as the measure of homogeneity, so Ward’s method tries to minimize the total within-group (within-cluster) sum of squares
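
One merge step of this criterion can be sketched as below: every candidate pair of clusters is tried, and the pair that yields the smallest total within-cluster sum of squares is chosen. The clusters and the function names are hypothetical, and the exhaustive search is for illustration only.

```python
from itertools import combinations


def within_ss(points):
    """Within-cluster sum of squares, summed over all variables."""
    n = len(points)
    center = [sum(col) / n for col in zip(*points)]
    return sum((x - c) ** 2 for p in points for x, c in zip(p, center))


def best_ward_merge(clusters):
    """Return (total_ss, i, j) for the merge with the smallest total within SS."""
    best = None
    for i, j in combinations(range(len(clusters)), 2):
        untouched = sum(within_ss(c) for k, c in enumerate(clusters) if k not in (i, j))
        total = untouched + within_ss(clusters[i] + clusters[j])
        if best is None or total < best[0]:
            best = (total, i, j)
    return best


clusters = [                      # hypothetical current partition
    [(5, 5), (6, 6)],
    [(15, 14), (16, 15)],
    [(25, 20), (30, 19)],
]
total_ss, i, j = best_ward_merge(clusters)
print(total_ss, i, j)
```

For these clusters the second and third are merged, leaving a total within-cluster sum of squares of 184.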

- Root-mean-square standard deviation (RMSSTD) of the new cluster
- RMSSTD is the pooled standard deviation of all the variables forming the cluster.
- Pooled variance = pooled SS for all the variables / pooled degrees of freedom for all the variables
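
A minimal sketch of the RMSSTD calculation follows, assuming the pooled degrees of freedom are p(n − 1) for a cluster of n observations on p variables. The two-member cluster is hypothetical.

```python
import math


def rmsstd(points):
    """Root-mean-square standard deviation of one cluster."""
    n, p = len(points), len(points[0])
    means = [sum(col) / n for col in zip(*points)]
    pooled_ss = sum((x - m) ** 2 for row in points for x, m in zip(row, means))
    pooled_df = p * (n - 1)          # (n - 1) degrees of freedom per variable
    return math.sqrt(pooled_ss / pooled_df)


cluster = [(5, 5), (6, 6)]   # hypothetical cluster: pooled SS = 1, pooled df = 2
print(rmsstd(cluster))       # sqrt(1 / 2)
```

A small RMSSTD indicates a compact, homogeneous cluster.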
- R-Squared (RS)
- RS is the ratio of SSb to SSt (SSt = SSb + SSw)
- RS of CL2 is (701.166 – 184.000) / 701.166 = 0.7376

- Semipartial R-Squared (SPR)
- The difference between the pooled SSw of the new cluster and the sum of the pooled SSw’s of the clusters joined to obtain it is called the loss of homogeneity. If the loss of homogeneity is large, the new cluster was obtained by merging two heterogeneous clusters.
- SPR is the loss of homogeneity due to combining two groups or clusters to form a new group or cluster, expressed as a proportion of SSt.
- SPR of CL2 is (183 – (1 + 13)) / 701.166 = 0.241
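
The RS and SPR arithmetic for CL2 can be reproduced with a short sketch. It uses the sums of squares quoted on the slides (SSt = 701.166, pooled SSw = 184, SSw of the new cluster = 183, SSw of the joined clusters = 1 and 13) and assumes the joined clusters' SSw values are added together before being subtracted. The function names are hypothetical.

```python
def r_squared(ss_total, ss_within):
    """RS = SSb / SSt, with SSb = SSt - SSw."""
    return (ss_total - ss_within) / ss_total


def semipartial_r_squared(ss_new, ss_joined, ss_total):
    """SPR = loss of homogeneity / SSt."""
    loss_of_homogeneity = ss_new - sum(ss_joined)
    return loss_of_homogeneity / ss_total


rs = r_squared(701.166, 184.0)
spr = semipartial_r_squared(183.0, [1.0, 13.0], 701.166)
print(round(rs, 4), round(spr, 3))   # 0.7376 0.241
```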

- Distance between clusters
- It is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged, and it is termed the centroid distance (CD)

- Summary of the statistics for evaluating cluster solution

- The data are divided into k partitions or groups, with each partition representing a cluster. The number of clusters must be known a priori.
- Steps
- Step 1: Select k initial cluster centroids or seeds, where k is the number of clusters desired.
- Step 2: Assign each observation to the cluster to which it is closest.
- Step 3: Reassign or reallocate each observation to one of the k clusters according to a predetermined stopping rule.
- Step 4: Stop if there is no reallocation of data points or if the reassignment satisfies the criteria set by the stopping rule. Otherwise go to Step 2.
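
The steps above can be sketched as a small k-means loop. This is one possible reading, not the only one: it assumes the seeds are the first k observations, that closeness is squared Euclidean distance to the cluster centroid, and that the stopping rule is "no observation changes cluster". The data and function names are hypothetical.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def k_means(data, k, max_iter=100):
    # Select the seeds: here, simply the first k observations (an assumption).
    centers = [tuple(float(x) for x in p) for p in data[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign (or reassign) each observation to its closest center.
        new_labels = [
            min(range(k), key=lambda c: squared_euclidean(p, centers[c]))
            for p in data
        ]
        # Stop when no observation was reallocated.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers


data = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 20), (30, 19)]  # hypothetical
labels, centers = k_means(data, k=3)
print(labels)    # [0, 1, 2, 2, 2, 2]
print(centers)
```

With these seeds the first two observations end up in separate clusters and the remaining four share one, which shows how strongly the result depends on the starting seeds.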

- Differences among nonhierarchical algorithms
- The method used for obtaining initial cluster centroids or seeds
- The rule used for reassigning observations

- Algorithm 1
- Steps
- Select the first k observations as the cluster centers
- Compute the centroid of each cluster
- Reassign each observation by computing its distance to each centroid

- Algorithm 2
- Steps
- Select the first k observations as the cluster centers
- Seeds are replaced by the remaining observations
- Reassign each observation by computing its distance to each cluster center

- {1}, {2}, {3}
- {1}, {2}, {3, 4}
- {1, 2}, {5}, {3, 4}
- {1, 2}, {5, 6}, {3, 4}

- Algorithm 3
- Selecting the initial seeds: let Sum(i) be the sum of the values of the variables for observation i
- Minimizes the ESS
- Change in ESS = $\frac{3[(5-27.5)^2 + (5-19.5)^2]}{2} - \frac{(5-5.5)^2 + (5-5.5)^2}{2}$, where the first term is the increase and the second term is the decrease
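
The change-in-ESS expression can be evaluated directly. The sketch below only reproduces the arithmetic as written on the slide: the observation (5, 5), the two centroids (27.5, 19.5) and (5.5, 5.5), and the factors 3/2 and 1/2 are taken from that expression, while the variable names are hypothetical.

```python
def sq_dist(point, center):
    """Squared Euclidean distance from an observation to a cluster centroid."""
    return sum((x - c) ** 2 for x, c in zip(point, center))


observation = (5, 5)
candidate_centroid = (27.5, 19.5)   # cluster the observation would move to
current_centroid = (5.5, 5.5)       # cluster the observation currently belongs to

increase = 3 * sq_dist(observation, candidate_centroid) / 2   # 3 * 716.5 / 2
decrease = sq_dist(observation, current_centroid) / 2         # 0.5 / 2
change_in_ess = increase - decrease
print(increase, decrease, change_in_ess)   # 1074.75 0.25 1074.5
```

Because the change is positive, moving the observation would raise the ESS, so a rule that minimizes the ESS would leave it in its current cluster.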

- Hierarchical methods
- Advantage: They do not require a priori knowledge of the number of clusters or of the starting partition.
- Disadvantage: Once an observation is assigned to a cluster, it cannot be reassigned to another cluster.

- Nonhierarchical methods
- The cluster centers or the initial partition has to be identified before the technique can proceed to cluster observations. The nonhierarchical clustering algorithms, in general, are very sensitive to the initial partition.
- The k-means algorithm and other nonhierarchical clustering algorithms perform poorly when random initial partitions are used. However, their performance is much better when the results from hierarchical methods are used to form the initial partition.

- Hierarchical and nonhierarchical techniques should be viewed as complementary clustering techniques rather than as competing techniques.