Cluster analysis







Presentation Transcript

### Cluster analysis

Pohang University of Science and Technology (POSTECH)

Department of Industrial Engineering

Probability and Statistics Laboratory

Jaehyun Lee

Definition
• Cluster analysis is a technique used for combining observations into groups or clusters such that
• Each group or cluster is homogeneous or compact with respect to certain characteristics
• Each group should be different from other groups with respect to the same characteristics
• Example
• A marketing manager is interested in identifying similar cities that can be used for test marketing
• The campaign manager for a political candidate is interested in identifying groups of voters who have similar views on important issues
Objective of clustering analysis
• The objective of cluster analysis is to group observations into clusters such that each cluster is as homogeneous as possible with respect to the clustering variables
• Overview of cluster analysis
• Step 1: n objects measured on p variables
• Step 2: Transform to an n × n similarity (distance) matrix
• Step 3: Cluster formation (hierarchical or nonhierarchical clusters)
• Step 4: Cluster profile
Key problem
• Measure of similarity
• Fundamental to the use of any clustering technique is the computation of a measure of similarity or distance between the respective objects.
• Distance-type measures – Euclidean distance for standardized data, Mahalanobis distance
• Matching-type measures – Association coefficients, correlation coefficients
• A procedure for forming the clusters
• Nonhierarchical clustering – k-means clustering
Similarity Measure – Distance type
• Minkowski metric: d(i, j) = (Σ_k |x_ik − x_jk|^r)^(1/r)
• If r = 2, then Euclidean distance
• If r = 1, then absolute (city-block) distance
• Consider the example below
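As a minimal sketch of the metric above, the two special cases can be computed directly (the sample vectors `a` and `b` are made-up data, not the slide's example):

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski metric: (sum_k |x_k - y_k|^r)^(1/r)."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** r) ** (1.0 / r))

# Hypothetical observations on p = 2 variables
a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(a, b, 2))  # r = 2: Euclidean distance -> 5.0
print(minkowski(a, b, 1))  # r = 1: absolute (city-block) distance -> 7.0
```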
Similarity Measure – Distance type
• Euclidean distance for standardized data
• Standardizing makes the data scale invariant
• The squared Euclidean distance is weighted by 1/s_k², the reciprocal of each variable's variance
• Mahalanobis distance: D² = (x₁ − x₂)′ S⁻¹ (x₁ − x₂), where x is a p × 1 vector and S is a p × p covariance matrix
• It is designed to take into account the correlation among the variables and is also scale invariant.
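A sketch of the squared Mahalanobis distance defined above, using a hypothetical covariance matrix `S` with correlated, unequally scaled variables:

```python
import numpy as np

def mahalanobis_sq(x, y, S):
    """Squared Mahalanobis distance: (x - y)' S^{-1} (x - y)."""
    d = np.asarray(x) - np.asarray(y)
    return float(d @ np.linalg.inv(S) @ d)

# Hypothetical p = 2 covariance matrix and two observations
S = np.array([[4.0, 2.0],
              [2.0, 3.0]])
x, y = np.array([1.0, 2.0]), np.array([3.0, 5.0])
print(mahalanobis_sq(x, y, S))  # -> 3.0
```

Because S⁻¹ rescales and decorrelates the variables, the result is unchanged if both variables and S are measured in different units, which is the scale-invariance property the slide mentions.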
Similarity Measure – Matching type
• Association coefficients
• This type of measure is used to represent similarity for binary variables
• Similarity coefficients
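As an illustration of association coefficients for binary variables, two common choices are sketched below (the binary profiles `u` and `v` are made up, not the slide's table):

```python
import numpy as np

def simple_matching(a, b):
    """Simple matching coefficient: fraction of positions that agree (0-0 or 1-1)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a == b))

def jaccard(a, b):
    """Jaccard coefficient: joint 1s / positions with at least one 1 (ignores 0-0)."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    return float((a & b).sum() / (a | b).sum())

# Hypothetical presence/absence profiles over p = 5 binary attributes
u = [1, 0, 1, 1, 0]
v = [1, 1, 1, 0, 0]
print(simple_matching(u, v))  # 3 agreements out of 5 -> 0.6
print(jaccard(u, v))          # 2 joint 1s out of 4 -> 0.5
```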
Similarity Measure – Matching type
• Correlation coefficient
• The Pearson product-moment correlation coefficient is used as the measure of similarity.
• d_AB = 1, d_AC = 0.82
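The slide's example data are not reproduced in the transcript, so the sketch below uses hypothetical profiles to show the Pearson correlation as a similarity measure between observations:

```python
import numpy as np

def pearson_similarity(a, b):
    """Pearson product-moment correlation between two observation profiles."""
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical profiles over p = 4 variables
A = [2.0, 4.0, 6.0, 8.0]
B = [1.0, 2.0, 3.0, 4.0]   # same pattern as A -> correlation 1
C = [8.0, 6.0, 4.0, 2.0]   # opposite pattern  -> correlation -1
print(pearson_similarity(A, B))
print(pearson_similarity(A, C))
```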
Hierarchical clustering
• Centroid method
• Each group is replaced by an "average subject", which is the centroid of that group
Hierarchical clustering
• The distance between two clusters is represented by the minimum of the distances between all possible pairs of subjects in the two clusters (the single-linkage rule)

[Slide example: the resulting cluster distances are 181 and 145, then 221 and 181]

Hierarchical clustering
• The distance between two clusters is defined as the maximum of the distances between all possible pairs of observations in the two clusters (the complete-linkage rule)

[Slide example: cluster distances of 181 and 145, then 625 and 557]

Hierarchical clustering
• The distance between two clusters is obtained by taking the average distance between all pairs of subjects in the two clusters (the average-linkage rule)

Example: (181 + 145) / 2 = 163
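The linkage rules described on the preceding slides (single, complete, average, centroid) are all available in SciPy; the sketch below runs each on made-up 2-D data and cuts the tree into three clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D observations forming three visually obvious groups
X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0],
              [5.5, 5.5], [9.0, 1.0], [9.5, 1.5]])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # merge history (dendrogram data)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, labels)
```

On well-separated data like this, all four rules produce the same partition; they differ when clusters are elongated or of unequal density.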

Hierarchical clustering
• Ward’s method
• It forms clusters by maximizing within-cluster homogeneity. The within-group sum of squares is used as the measure of homogeneity; Ward’s method tries to minimize the total within-group or within-cluster sum of squares
• Root-mean-square standard deviation (RMSSTD) of the new cluster
• RMSSTD is the pooled standard deviation of all the variables forming the cluster.

pooled variance = pooled SS for all the variables / pooled degrees of freedom for all the variables
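The pooled-variance definition above can be sketched directly: sum the within-cluster SS over all p variables, divide by the pooled degrees of freedom p(n − 1), and take the square root (the cluster `C` is hypothetical data):

```python
import numpy as np

def rmsstd(cluster):
    """Root-mean-square standard deviation of one cluster:
    sqrt(pooled SS over all variables / pooled degrees of freedom)."""
    X = np.asarray(cluster, dtype=float)  # n observations x p variables
    n, p = X.shape
    pooled_ss = np.sum((X - X.mean(axis=0)) ** 2)  # SS summed over all p variables
    return float(np.sqrt(pooled_ss / (p * (n - 1))))

# Hypothetical cluster: n = 3 observations on p = 2 variables
C = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
print(rmsstd(C))  # pooled SS = 2 + 8 = 10, df = 2*(3-1) = 4 -> sqrt(2.5)
```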

• R-Squared(RS)
• RS is the ratio of SSb to SSt (SSt = SSb + SSw)

RS of CL2 is (701.166 – 184.000) / 701.166 = 0.7376

• Semipartial R-Squared (SPR)
• The sum of pooled SSw’s of cluster joined to obtain the new cluster is called loss of homogeneity. If loss of homogeneity is large then the new cluster is obtained by merging two heterogeneous clusters.
• SPR is the loss of homogeneity due to combining two groups or clusters to form a new group or cluster.
• SPR of CL2 is (183 – (1 + 13)) / 701.166 = 0.241
• Distance between clusters
• It is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged, and it is termed the centroid distance (CD)
• Summary of the statistics for evaluating cluster solution
Nonhierarchical clustering
• The data are divided into k partitions or groups with each partition representing a cluster. The number of clusters must be known a priori.
• Step
• Select k initial cluster centroids or seeds, where k is the number of clusters desired.
• Assign each observation to the cluster to which it is the closest.
• Reassign or reallocate each observation to one of the k clusters according to a predetermined stopping rule.
• Stop if there is no reallocation of data points or if the reassignment satisfies the criteria set by the stopping rule. Otherwise go to Step 2.
• Nonhierarchical algorithms differ in
• the method used for obtaining initial cluster centroids or seeds
• the rule used for reassigning observations
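The four steps above can be sketched as a minimal k-means implementation; following the algorithms described next, the first k observations serve as seeds, and the stopping rule is "no reallocation" (the data `X` are made up):

```python
import numpy as np

def kmeans(X, k, max_iter=100):
    """Nonhierarchical (k-means) clustering per the steps above: seed with the
    first k observations, assign each point to its nearest centroid, recompute
    centroids, and stop when no observation is reallocated."""
    X = np.asarray(X, dtype=float)
    centroids = X[:k].copy()               # step 1: first k observations as seeds
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # step 2: assign each observation to the closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                          # step 4: stop, no reallocation occurred
        labels = new_labels
        for j in range(k):                 # step 3: recompute cluster centroids
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical data: two well-separated groups of two points each
X = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]])
labels, _ = kmeans(X, k=2)
print(labels)  # first two points share a cluster, last two share the other
```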
Nonhierarchical clustering
• Algorithm 1
• step
• Select the first k observations as cluster centers
• Compute the centroid of each cluster
• Reassign each observation by computing its distance to each cluster centroid
Nonhierarchical clustering
• Algorithm 2
• step
• Select the first k observations as cluster centers
• Seeds are replaced by the remaining observations
• Reassign each observation by computing its distance to each cluster centroid
• {1}, {2}, {3}
• {1}, {2}, {3, 4}
• {1, 2}, {5}, {3, 4}
• {1, 2}, {5, 6}, {3, 4}
Nonhierarchical clustering
• Algorithm 3
• Selecting the initial seeds

Let Sum(i) be the sum of the values of the variables for observation i

• Minimize the ESS (error sum of squares)

Change in ESS = 3[(5 − 27.5)² + (5 − 19.5)²]/2 – [(5 − 5.5)² + (5 − 5.5)²]/2


Which clustering method is best?
• Hierarchical methods
• Advantage: they do not require a priori knowledge of the number of clusters or of the starting partition.
• Disadvantage: once an observation is assigned to a cluster, it cannot be reassigned to another cluster.
• Nonhierarchical methods
• The cluster centers or the initial partition has to be identified before the technique can proceed to cluster observations. The nonhierarchical clustering algorithms, in general, are very sensitive to the initial partition.
• The k-means algorithm and other nonhierarchical clustering algorithms perform poorly when random initial partitions are used. However, their performance is much superior when the results from hierarchical methods are used to form the initial partition.
• Hierarchical and nonhierarchical techniques should be viewed as complementary clustering techniques rather than as competing techniques.
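The complementary use recommended above can be sketched with SciPy: Ward's hierarchical clustering supplies the initial partition, whose centroids then seed k-means (the three-group data are synthetic):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

# Hypothetical data: three well-separated groups of 20 observations each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

# Stage 1: Ward's hierarchical clustering provides the initial partition
init_labels = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")
seeds = np.vstack([X[init_labels == g].mean(axis=0) for g in (1, 2, 3)])

# Stage 2: k-means refines the partition starting from those centroids
centroids, labels = kmeans2(X, seeds, minit="matrix")
print(np.bincount(labels))  # observations per final cluster
```

Seeding from the hierarchical solution avoids the sensitivity to random initial partitions that the slide warns about.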