
Cluster Analysis



Presentation Transcript


  1. Cluster Analysis

  2. Cluster analysis, first used by Tryon (1939), encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories.

  3. Purpose of cluster analysis: to simplify and classify data, i.e., to group similar individuals (observations) together. Objects are grouped according to certain attributes so that objects within the same cluster share the same characteristics (homogeneity), while different clusters differ from one another significantly.

  4. Viewed geometrically, members of the same cluster should be gathered close together, while members of different clusters should lie far apart from one another.

  5. Applications of cluster analysis • Education • Medicine • Sociology • Psychology • Economics • Biology

  6. Computation in cluster analysis • Based on 'distance' or 'similarity' data between observations: the smaller the 'distance' between two observations, the more alike they are in some respect, and the larger their 'similarity' measure. • Using the computed 'distance matrix' or 'similarity matrix', the N observations can be merged step by step according to certain criteria, eventually forming a few representative clusters.

  7. Distance measures: • Measured by the distance between points; the most commonly used is the Euclidean distance. • If there are N observations, each with M attributes, let X be the N*M data matrix; the Euclidean distance between points i and j is

  8. d_ij = { Σ_k (x_ik − x_jk)² }^(1/2), k = 1, …, M. If the attributes are measured in different units, the variables should be standardized before computing the Euclidean distance, so that each has mean 0 and standard deviation 1.
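A minimal sketch of the two slides above in Python: standardize each attribute to mean 0 and standard deviation 1, then compute the pairwise Euclidean distances. The data matrix here is hypothetical and only for illustration.

```python
import numpy as np

# Hypothetical N x M data matrix (N observations, M attributes), not from the slides.
X = np.array([
    [170.0, 65.0],
    [160.0, 50.0],
    [180.0, 80.0],
    [175.0, 70.0],
])

# Standardize each attribute: mean 0, standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Pairwise Euclidean distances: d_ij = sqrt(sum_k (z_ik - z_jk)^2)
diff = Z[:, None, :] - Z[None, :, :]    # shape (N, N, M)
D = np.sqrt((diff ** 2).sum(axis=-1))   # shape (N, N)

print(np.round(D, 3))
```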

  9. Similarity measures: • The greater the similarity, the smaller the dissimilarity between two observations; therefore, when working with a similarity matrix, the clusters with the largest similarity values are merged first. • The similarity between two observations can be measured with the following matching coefficient:

  10. a is the number of attributes that observations i and j both possess; b is the number of attributes that neither i nor j possesses; m is the total number of attributes. Ex: suppose i and j possess the following attributes (1 = possesses the attribute, 0 = does not):

  11. Then the matching coefficient is s_ij = (a + b) / m, i.e., the proportion of attributes on which observations i and j agree.
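A small sketch of this matching coefficient in Python. The example table from the slide does not survive in the transcript, so the attribute vectors below are hypothetical.

```python
import numpy as np

def matching_coefficient(i_attrs, j_attrs):
    """Simple matching coefficient (a + b) / m:
    a = attributes both observations possess,
    b = attributes neither possesses,
    m = total number of attributes."""
    i_attrs, j_attrs = np.asarray(i_attrs), np.asarray(j_attrs)
    a = np.sum((i_attrs == 1) & (j_attrs == 1))
    b = np.sum((i_attrs == 0) & (j_attrs == 0))
    m = len(i_attrs)
    return (a + b) / m

# Hypothetical attribute vectors (1 = has the attribute, 0 = does not).
obs_i = [1, 0, 1, 1, 0]
obs_j = [1, 1, 1, 0, 0]
print(matching_coefficient(obs_i, obs_j))  # agree on 3 of 5 attributes -> 0.6
```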

  12. Clustering methods • Non-hierarchical clustering: operates directly on the distance or similarity matrix, and can be divided into the following approaches: a. Sequential threshold: a cluster seed is chosen in advance and a threshold value is set; all observations whose distance from that seed falls within the threshold form one cluster. A new cluster seed is then selected, and the observations not yet assigned to a cluster form a second cluster, and so on (a short code sketch of this approach appears after slide 14 below).

  13. b. Parallel threshold: several cluster seeds are selected simultaneously at the outset and a threshold is set; observations are then assigned to the nearest seed according to the threshold, forming the clusters. The threshold may also be adjusted to allow more (or fewer) observations into each cluster. c. Optimizing partitioning: a criterion is chosen (e.g., minimizing the average within-cluster distance), and different partitions are tried repeatedly until the criterion measure reaches its optimum.

  14. d. K-means method: an integrated application of the above approaches. The observations are first partitioned into K clusters; the distance from each observation to each cluster centroid is computed, and each observation is assigned to the nearest cluster. The centroids of the cluster that gains the observation and the cluster that loses it are then recomputed, and the distances from each observation to the centroids are evaluated again. This is repeated until no observation needs to be reassigned.
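Below is a minimal sketch of the sequential threshold approach from item (a) above, assuming Euclidean distance, hypothetical 2-D data, and a hand-picked threshold; the rule for choosing each new seed (here simply the first unassigned observation) is an assumption, since the slides do not specify it.

```python
import numpy as np

def sequential_threshold(X, threshold):
    """Pick a seed, group every unassigned observation within `threshold`
    of it into one cluster, then pick a new seed from the remainder."""
    labels = np.full(len(X), -1)                 # -1 = not yet assigned
    cluster = 0
    while (labels == -1).any():
        seed = np.flatnonzero(labels == -1)[0]   # assumed rule: first unassigned point
        dists = np.linalg.norm(X - X[seed], axis=1)
        labels[(labels == -1) & (dists <= threshold)] = cluster
        cluster += 1
    return labels

# Hypothetical 2-D data, not from the slides.
X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8], [9.0, 0.5]])
print(sequential_threshold(X, threshold=1.0))    # -> [0 0 1 1 2]
```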

  15. Hierarchical clustering: each new cluster is formed by merging or splitting clusters from the previous level, so the analysis yields a tree structure. Among hierarchical divisive methods, a common one is the average-distance splitting method, whose steps are:

  16. First, find the observation whose average distance to all the other observations is the largest; call it the splinter group and call the remaining observations the main group. Then compute the distances between the splinter group and the main group, as well as among the observations within the main group. If an observation in the main group is farther from the other observations of the main group than from the splinter group, move it into the splinter group; otherwise, leave it in the main group.
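A rough sketch of this average-distance splitting step in Python (one split only, Euclidean distance, hypothetical data; the stopping rule and the use of average distances for the reassignment test are assumptions, since the slides only outline the procedure).

```python
import numpy as np

def average_distance_split(X):
    """Move the observation with the largest average distance to the others
    into a splinter group, then keep moving over any main-group observation
    that is closer on average to the splinter group than to the main group."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # distance matrix
    avg_to_others = D.sum(axis=1) / (n - 1)
    splinter = [int(np.argmax(avg_to_others))]
    main = [i for i in range(n) if i not in splinter]

    moved = True
    while moved and len(main) > 1:
        moved = False
        for i in list(main):
            d_main = D[i, [j for j in main if j != i]].mean()
            d_splinter = D[i, splinter].mean()
            if d_splinter < d_main:          # closer to the splinter group -> move it
                main.remove(i)
                splinter.append(i)
                moved = True
    return main, splinter

# Hypothetical 2-D data, not from the slides.
X = np.array([[0.0, 0.0], [0.3, 0.1], [0.2, 0.4], [6.0, 6.0], [6.3, 5.8]])
print(average_distance_split(X))   # main group [0, 1, 2], splinter group with 3 and 4
```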

  17. Illustration with the K-means computation steps: 1. Partition the observations into K initial clusters. 2. Compute the distance from an observation to each cluster center (its mean), usually the Euclidean distance, and assign the observation to the nearest cluster; then recompute the new centers of the cluster that gained the observation and the cluster that lost it. 3. Repeat step 2 until no observation has to be reassigned to another cluster.
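A compact sketch of these steps in Python, using the common batch variant (assign all observations, then recompute all centers). The data are hypothetical; the values of the four observations in the next slides appear only in figures and are not reproduced here.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic batch K-means: assign every observation to the nearest centroid,
    recompute centroids, and stop when no assignment changes."""
    rng = np.random.default_rng(seed)
    # Step 1: initial centers taken from k randomly chosen observations.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each observation to the nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop when no observation changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Hypothetical data with two visibly separated groups.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
labels, centers = k_means(X, k=2)
print(labels, centers, sep="\n")
```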

  18. Ex: the distribution of four observations on two variables X1 and X2. First, arbitrarily split the four observations into two clusters, say cluster [1,2] and cluster [3,4], and then compute the coordinates of the two cluster centroids (cluster [1,2] and cluster [3,4]).

  19. Next, compute the Euclidean distance from each observation to each cluster centroid and assign each observation to the nearest cluster; for example, the squared distance from observation 1 to the centroid of cluster [1,2] is D²₁[1,2] = (12 − 2)² + (8 − 6)² = 104. These computations show that observation 4 is closer to cluster [3,4] and therefore need not be reassigned, while observation 2 is also closer to cluster [3,4] and is reassigned there, giving two new clusters [1] and [2,3,4] with the following centroid coordinates:

  20. Cluster [1]: X1 = 12, X2 = 8; cluster [2,3,4]: centroid computed in the same way. Continue by computing the Euclidean distance from each observation to cluster [1] and to cluster [2,3,4].

  21. The results show that observation 1 is closest to cluster [1], and observations 2, 3, and 4 are closest to cluster [2,3,4], so no further reassignment is needed; the final K = 2 clusters are cluster [1] and cluster [2,3,4].

  22. Two-stage Cluster Sampling When Clusters Are of Unequal Size • Desired sample proportion p = n/N • a: desired number of clusters selected in the first stage • A: total number of clusters • b: sample size within each selected cluster • Ni: number of elements in cluster i

  23. Simple Two-stage Cluster Sampling • The first-stage probability p1 = a/A • The second-stage probability p2 = p / (a/A), so that p1 · p2 = p • Sample size in cluster i: ni = p2 · Ni

  24. Probability Proportional to Size • Each cluster i is selected in the first stage with probability proportional to its size Ni, roughly a · Ni / N. • Drawing the same number of elements b within every selected cluster then gives each element in the population approximately the same overall selection probability, a · b / N.

  25. Example • Draw a sample of 1,000 households from a city that contains about 200,000 households distributed among 2000 blocks of unequal but known size. • The desired sample proportion =1/200 • The desired # of clusters selected in the 1st stage=100 • How do we conduct the two-stage cluster sampling?
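A quick check of the arithmetic implied by this example, using the quantities defined on slides 22 and 23 (the per-block sample sizes would still have to be rounded in practice; the slides do not say how):

```python
# Worked numbers for the example on slide 25.
p = 1 / 200      # desired overall sampling proportion (1,000 of 200,000 households)
A = 2000         # total number of blocks (clusters)
a = 100          # blocks selected in the first stage

p1 = a / A       # first-stage probability  = 0.05
p2 = p / p1      # second-stage probability = 0.1 -> sample 10% of households per block

avg_block_size = 200_000 / A               # about 100 households per block
expected_sample = a * avg_block_size * p2  # about 1,000 households in total
print(p1, p2, expected_sample)             # 0.05 0.1 1000.0
```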

  26. What is Cluster Analysis? • Cluster Analysis is a class of statistical techniques that can be applied to data that exhibit natural groupings. • CA is an interdependence technique that makes no distinction between dependent and independent variables. • There is NO statistical significance testing in CA. • CA is best thought of as a collection of different algorithms that put objects into clusters following “well-defined similarity rules.”

  27. What is A Cluster? • A cluster is a group of relatively homogeneous cases and observations. • Clusters exhibit high internal homogeneity and high external heterogeneity.

  28. A Cluster Diagram: Drinker’s Perceptions of Alcohol

  29. Characteristics of CA • Cluster Analysis is a tool of discovery. • It discovers structures in data but does NOT explain why they exist. • CA is used when we do not have an a priori hypothesis, but when we are in the exploratory phase.

  30. How does CA differ… • From Discriminant Analysis • Discriminant analysis is a dependence technique: it predicts the probability that an object falls into one of two or more mutually exclusive categories based on several independent variables, by finding a linear combination of those variables. • Cluster analysis, in contrast, finds natural groupings based on distances among objects.

  31. From Factor Analysis • Similar to cluster analysis in that it is an interdependence technique. • The primary difference lies in the focus: variables versus objects. • Factor analysis reduces variables to a few factors; cluster analysis reduces objects to a few clusters.

  32. Cluster Analysis Methods • Three Cluster Analysis Methods • Joining (Tree Clustering) • Two-way Joining • K-means Clustering

  33. Joining (Tree Clustering) • A type of hierarchical clustering -- agglomerative • Initially, each unit is its own cluster. • Dendrogram • Many other methods

  34. At the first level, every sample xi is a singleton cluster. As the level increases, more and more samples are merged together in a hierarchical manner.

  35. It can also be viewed in terms of sets: each cluster level may contain sets that are sub-clusters of the level above, as shown in the Venn diagram.
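As an illustration of agglomerative joining and the dendrogram mentioned above, here is a short sketch using SciPy; the data are hypothetical and "average" linkage is chosen arbitrarily, since the slides do not name a specific linkage rule.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical observations; every sample starts as its own singleton cluster.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.3, 4.7], [9.0, 1.0]])

# Agglomerative joining: repeatedly merge the two closest clusters.
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree at a chosen number of clusters (here 3).
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# Draw the dendrogram (the tree of merges).
dendrogram(Z)
plt.show()
```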

  36. Two-way Joining (Hartigan, 1975) • Two-way joining tries to cluster both variables and objects at the same time. • Only useful if you expect clustering along BOTH dimensions to be meaningful. • Very rare in application.

  37. k-Means Clustering: • Begin with a preconception about the number of clusters (k). • Can be thought of as ANOVA in reverse: ANOVA evaluates between-group variance against within-group variance when computing the statistical significance of the hypothesis that the groups differ. • In k-means, the algorithm moves objects in and out of the groups so as to maximize between-group differences relative to within-group variability (the most "significant" ANOVA-like result).
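In practice k-means is usually run from a library rather than coded by hand; a minimal sketch with scikit-learn, using hypothetical data and k = 2 chosen purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data with two loose groups.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

# Fit k-means with a preconceived number of clusters k = 2.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # cluster membership of each observation
print(km.cluster_centers_)  # the two cluster centroids
print(km.inertia_)          # within-cluster sum of squares that k-means minimizes
```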

  38. It’s all about distance… • Distance Measures • Euclidean Distance • Squared Euclidean Distance • Manhattan Distance • Chebychev Distance • Power Distance

  39. EQUATION: Euclidean Distance • Basic equation for the distance measure. • distance(x, y) = { Σi (xi − yi)² }^(1/2) • A standard formula for determining the distance between two points on a plane
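The other distance measures listed on the previous slide can be written analogously; a brief sketch with their standard definitions (the power-distance exponents p and r are free choices):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def squared_euclidean(x, y):
    return np.sum((x - y) ** 2)

def manhattan(x, y):                 # city-block distance
    return np.sum(np.abs(x - y))

def chebychev(x, y):                 # largest single-coordinate difference
    return np.max(np.abs(x - y))

def power_distance(x, y, p=2, r=2):  # generalizes the others via exponents p and r
    return np.sum(np.abs(x - y) ** p) ** (1 / r)

x, y = np.array([1.0, 4.0, 2.0]), np.array([3.0, 1.0, 2.0])
for f in (euclidean, squared_euclidean, manhattan, chebychev, power_distance):
    print(f.__name__, f(x, y))
```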

  40. Fairly simple, right?

  41. In other words, how do we get from this…

  42. To this…

  43. To this…

  44. How to Determine Clusters. • Use a computer. • Call a professional.

  45. Clusters in the Real World

  46. Why is Cluster Analysis Important? • Relatively new, still-evolving technique • Highly useful for market segmentation • Segmentation = identifying groupings of customers using multivariate statistical analysis, often based on perceptions and attitudes as well as demographics and behavior. • Segmentation helps small companies attempting to carve out a niche, and large companies trying to tailor their products/services to different segments.

  47. In addition to segmentation, clusters are used to… • Design products and establish brands • Target direct mail • Make decisions about customer conversion and retention • Decide on marketing cost levels

  48. Ex: Luxury Car Customers • Demographic examples are easier to illustrate • Demographics: • Gender • Education • Age • 149 customers (objects) of a luxury car dealership

  49. Using SPSS for Clustering • Choose "TwoStep Cluster Analysis" • Basically the agglomerative technique (dendrogram). • Step One: creates very small (individual) sub-clusters. • Step Two: clusters the sub-clusters into the desired number of clusters. • Automatically finds the optimum number of clusters.
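SPSS's TwoStep procedure itself is not available in Python, but an analogous two-stage idea (a fast pre-clustering pass followed by merging the sub-clusters) can be sketched with scikit-learn's Birch; this is only an analogy to the two steps described above, not the SPSS algorithm, and the number of final clusters is set by hand here rather than chosen automatically.

```python
import numpy as np
from sklearn.cluster import Birch

# Hypothetical customer records: [age, years of education, gender coded 0/1].
X = np.array([[45, 16, 0], [52, 18, 1], [23, 12, 0],
              [31, 14, 1], [60, 20, 0], [28, 12, 1]], dtype=float)

# Step one: Birch builds many small sub-clusters (a CF tree).
# Step two: with n_clusters set, those sub-clusters are merged
# into the requested number of final clusters.
model = Birch(threshold=3.0, n_clusters=2).fit(X)
print(model.labels_)   # cluster membership of each customer
```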

  50. Two-Step CA Output What are these clusters?
