Cluster analysis

Pohang University of Science and Technology (POSTECH)

Department of Industrial Engineering

Probability and Statistics Laboratory

이재현 (Jaehyun Lee)


Definition

  • Cluster analysis is a technique used for combining observations into groups or clusters such that

    • Each group or cluster is homogeneous or compact with respect to certain characteristics

    • Each group should be different from other groups with respect to the same characteristics

  • Example

    • A marketing manager is interested in identifying similar cities that can be used for test marketing

    • The campaign manager for a political candidate is interested in identifying groups of voters who have similar views on important issues


Objective of cluster analysis

  • The objective of cluster analysis is to group observations into clusters such that each cluster is as homogeneous as possible with respect to the clustering variables

  • Overview of cluster analysis

    • Step 1: n objects measured on p variables

    • Step 2: Transform to an n × n similarity (distance) matrix

    • Step 3: Cluster formation (hierarchical or nonhierarchical clusters)

    • Step 4: Cluster profile


Key problem

  • Measure of similarity

    • Fundamental to the use of any clustering technique is the computation of a measure of similarity or distance between the respective objects.

    • Distance-type measures – Euclidean distance for standardized data, Mahalanobis distance

    • Matching-type measures – Association coefficients, correlation coefficients

  • A procedure for forming the clusters

    • Hierarchical clustering – Centroid method, Single-linkage method, Complete-linkage method, Average-linkage method, Ward’s method.

    • Nonhierarchical clustering – k-means clustering


Similarity Measure – Distance type

  • Minkowski metric: d(i, j) = [ Σ_k |x_ik − x_jk|^r ]^(1/r)

    • If r = 2, the metric is the Euclidean distance

    • If r = 1, it is the absolute (city-block) distance

  • Consider the example below
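The Minkowski metric can be sketched in a few lines of plain Python (the two sample points are hypothetical, chosen only to make the arithmetic easy to follow):

```python
def minkowski(x, y, r):
    """Minkowski distance between two equal-length vectors.

    r = 2 gives the Euclidean distance, r = 1 the absolute
    (city-block) distance.
    """
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

# Hypothetical two-variable observations for illustration.
p1, p2 = (3.0, 2.0), (7.0, 5.0)
print(minkowski(p1, p2, 2))  # Euclidean: sqrt(16 + 9) = 5.0
print(minkowski(p1, p2, 1))  # absolute: 4 + 3 = 7.0
```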


Similarity Measure – Distance type

  • Euclidean distance for standardized data

    • Standardization makes the data scale invariant

    • The squared Euclidean distance is weighted by the reciprocal of each variable's variance, 1/s_k^2

  • Mahalanobis distance: d^2 = (x_1 − x_2)' S^(−1) (x_1 − x_2)

    where x is a p × 1 vector and S is a p × p covariance matrix

    • It is designed to take into account the correlation among the variables and is also scale invariant.
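For p = 2 the Mahalanobis distance can be computed without any libraries, since a 2 × 2 covariance matrix is easy to invert by hand (a minimal sketch; the points and covariance matrix are hypothetical):

```python
def mahalanobis_2d(x, y, S):
    """Squared Mahalanobis distance d^2 = (x - y)' S^-1 (x - y)
    for p = 2 variables, where S is the 2x2 covariance matrix."""
    d = (x[0] - y[0], x[1] - y[1])
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    # Closed-form inverse of a 2x2 matrix.
    inv = ((S[1][1] / det, -S[0][1] / det),
           (-S[1][0] / det, S[0][0] / det))
    return (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))

# With an identity covariance it reduces to squared Euclidean distance.
S = ((1.0, 0.0), (0.0, 1.0))
print(mahalanobis_2d((3.0, 2.0), (7.0, 5.0), S))  # 16 + 9 = 25.0
```

Scaling up a variable's variance in S shrinks its contribution, which is exactly what makes the measure scale invariant.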


Similarity Measure – Matching type

  • Association coefficients

    • This type of measure is used to represent similarity for binary variables

    • Similarity coefficients
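Two common association coefficients for binary variables can be sketched as follows (the simple matching and Jaccard coefficients are standard choices; the transcript's own coefficient table is not reproduced here, and the example vectors are hypothetical):

```python
def simple_matching(a, b):
    """Simple matching coefficient: fraction of positions where the
    two binary vectors agree (both 1 or both 0)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def jaccard(a, b):
    """Jaccard coefficient: agreement on 1s only, ignoring positions
    where both vectors are 0."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either

a = [1, 0, 1, 1, 0]
b = [1, 1, 1, 0, 0]
print(simple_matching(a, b))  # 3 agreements out of 5 = 0.6
print(jaccard(a, b))          # 2 joint 1s / 4 positions with any 1 = 0.5
```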


Similarity Measure – Matching type

  • Correlation coefficient

    • The Pearson product-moment correlation coefficient is used as the measure of similarity.

    • d_AB = 1, d_AC = 0.82
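The Pearson coefficient itself is easy to compute directly (the profiles A and B below are illustrative, not the slide's data; two profiles that move together perfectly correlate at 1):

```python
def pearson(x, y):
    """Pearson product-moment correlation between two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# A is exactly 2 * B, so the correlation is 1 (up to rounding).
A = [2.0, 4.0, 6.0, 8.0]
B = [1.0, 2.0, 3.0, 4.0]
print(pearson(A, B))  # ~1.0
```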


Hierarchical clustering

  • Centroid method

    • Each group is replaced by an average subject, which is the centroid of that group


Hierarchical clustering

  • Single-Linkage method

    • The distance between two clusters is represented by the minimum of the distances between all possible pairs of subjects in the two clusters

      = 181 and = 145

      = 221 and = 181


Hierarchical clustering

  • Complete-Linkage method

    • The distance between two clusters is defined as the maximum of the distances between all possible pairs of observations in the two clusters

      = 181 and = 145

      = 625 and = 557


Hierarchical clustering

  • Average-Linkage method

    • The distance between two clusters is obtained by taking the average distance between all pairs of subjects in the two clusters

      and

      (181 + 145) / 2 = 163
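The three linkage rules above differ only in how they summarize the cross-cluster pairwise distances, which a short sketch makes concrete (squared Euclidean distance and the sample points are illustrative, not the slide's data):

```python
def pairwise(c1, c2, dist):
    """Distances between all cross-cluster pairs of points."""
    return [dist(a, b) for a in c1 for b in c2]

def single_linkage(c1, c2, dist):
    """Minimum over all cross-cluster pairs."""
    return min(pairwise(c1, c2, dist))

def complete_linkage(c1, c2, dist):
    """Maximum over all cross-cluster pairs."""
    return max(pairwise(c1, c2, dist))

def average_linkage(c1, c2, dist):
    """Average over all cross-cluster pairs."""
    d = pairwise(c1, c2, dist)
    return sum(d) / len(d)

# Squared Euclidean distance in two dimensions (hypothetical points).
sqdist = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(4.0, 0.0), (6.0, 0.0)]
print(single_linkage(c1, c2, sqdist))    # min of {16, 36, 9, 25} = 9.0
print(complete_linkage(c1, c2, sqdist))  # 36.0
print(average_linkage(c1, c2, sqdist))   # (16 + 36 + 9 + 25) / 4 = 21.5
```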


Hierarchical clustering

  • Ward’s method

    • It forms clusters by maximizing within-cluster homogeneity. The within-group sum of squares is used as the measure of homogeneity; that is, Ward's method tries to minimize the total within-group (within-cluster) sum of squares


Evaluating the cluster solution and determining the number of clusters

  • Root-mean-square standard deviation (RMSSTD) of the new cluster

    • RMSSTD is the pooled standard deviation of all the variables forming the cluster

      pooled variance = pooled SS for all the variables / pooled degrees of freedom for all the variables

  • R-Squared (RS)

    • RS is the ratio of SSb to SSt (SSt = SSb + SSw)

      RS of CL2 is (701.166 – 184.000) / 701.166 = 0.7376
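The RS figure quoted for CL2 can be reproduced directly from the identity SSt = SSb + SSw (a one-line helper using the SS values given above):

```python
def r_squared(sst, ssw):
    """RS = SSb / SSt; with SSt = SSb + SSw this is (SSt - SSw) / SSt."""
    return (sst - ssw) / sst

# Values quoted for cluster CL2 in the transcript.
print(round(r_squared(701.166, 184.000), 4))  # 0.7376
```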


Evaluating the cluster solution and determining the number of clusters

  • Semipartial R-Squared (SPR)

    • The difference between the SSw of the new cluster and the sum of the pooled SSw's of the clusters joined to obtain it is called the loss of homogeneity. If the loss of homogeneity is large, then the new cluster is obtained by merging two heterogeneous clusters.

    • SPR is the loss of homogeneity due to combining two groups or clusters to form a new group or cluster.

    • SPR of CL2 is (183 – (1 + 13)) / 701.166 = 0.241

  • Distance between clusters

    • It is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged, and it is termed the centroid distance (CD)
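The centroid distance is straightforward to compute from the cluster memberships (a minimal sketch; the two clusters below are hypothetical):

```python
def centroid(cluster):
    """Centroid: variable-wise mean of the observations in a cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def centroid_distance(c1, c2):
    """Euclidean distance between the centroids of two clusters (CD)."""
    a, b = centroid(c1), centroid(c2)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

c1 = [(0.0, 0.0), (2.0, 0.0)]
c2 = [(6.0, 0.0), (8.0, 0.0)]
print(centroid_distance(c1, c2))  # centroids (1, 0) and (7, 0) -> 6.0
```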


Evaluating the cluster solution and determining the number of clusters

  • Summary of the statistics for evaluating cluster solution


Nonhierarchical clustering

  • The data are divided into k partitions or groups with each partition representing a cluster. The number of clusters must be known a priori.

  • Steps

    • Step 1: Select k initial cluster centroids or seeds, where k is the number of clusters desired

    • Step 2: Assign each observation to the cluster to which it is closest

    • Step 3: Reassign or reallocate each observation to one of the k clusters according to a predetermined stopping rule

    • Step 4: Stop if there is no reallocation of data points or if the reassignment satisfies the criteria set by the stopping rule; otherwise go to Step 2

  • Nonhierarchical algorithms differ in

    • the method used for obtaining initial cluster centroids or seeds

    • the rule used for reassigning observations
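The steps above can be sketched as a minimal k-means routine in plain Python (the sample points are hypothetical; the seeds are simply the first k observations, one common choice among many):

```python
def kmeans(points, k, max_iter=100):
    """k-means following the steps above: take the first k points as
    seeds, assign each observation to its nearest centroid, recompute
    centroids, and stop when no observation is reassigned."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        # Assign each observation to its nearest centroid.
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[j])))
            if nearest != assignment[i]:
                assignment[i] = nearest
                changed = True
        # Recompute each cluster's centroid from its members.
        for j in range(k):
            members = [p for i, p in enumerate(points) if assignment[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
        if not changed:  # no reallocation: stop
            break
    return assignment, centroids

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
labels, centers = kmeans(pts, 2)
print(labels)  # [0, 0, 1, 1] — the two left and two right points separate
```

Because the seeds here are just the first k observations, the result depends on the input order, which is exactly the sensitivity to the initial partition discussed later in this deck.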


Nonhierarchical clustering

  • Algorithm 1

    • Steps

      • Select the first k observations as cluster centers

      • Compute the centroid of each cluster

      • Reassign by computing the distance of each observation to each cluster centroid


Nonhierarchical clustering

  • Algorithm 2

    • Steps

      • Select the first k observations as cluster centers

      • Seeds are replaced by the remaining observations

      • Reassign by computing the distance of each observation to each cluster centroid

  • {1}, {2}, {3}

  • {1}, {2}, {3, 4}

  • {1, 2}, {5}, {3, 4}

  • {1, 2}, {5, 6}, {3, 4}


Nonhierarchical clustering

  • Algorithm 3

    • Selecting the initial seeds: let Sum(i) be the sum of the values of the variables for observation i

    • Minimize the ESS

Change in ESS = 3[(5 − 27.5)² + (5 − 19.5)²]/2 − [(5 − 5.5)² + (5 − 5.5)²]/2


Which clustering method is best?

  • Hierarchical methods

    • Advantage: does not require a priori knowledge of the number of clusters or the starting partition

    • Disadvantage: once an observation is assigned to a cluster, it cannot be reassigned to another cluster

  • Nonhierarchical methods

    • The cluster centers or the initial partition has to be identified before the technique can proceed to cluster observations. The nonhierarchical clustering algorithms, in general, are very sensitive to the initial partition.

    • The k-means algorithm and other nonhierarchical clustering algorithms perform poorly when random initial partitions are used. However, their performance is much better when the results from hierarchical methods are used to form the initial partition.

  • Hierarchical and nonhierarchical techniques should be viewed as complementary clustering techniques rather than as competing techniques.

