Cluster Validation

Cluster Validation

  • Cluster validation

    • Assess the quality and reliability of clustering results.

  • Why validation?

    • To avoid finding clusters formed by chance

    • To compare clustering algorithms

    • To choose clustering parameters

      • e.g., the number of clusters in the K-means algorithm


Clusters Found in Random Data

[Figure: a set of random points, together with the clusters reported on those points by DBSCAN, K-means, and complete link]



Aspects of Cluster Validation

  • Comparing the clustering results to ground truth (externally known results).

    • External Index

  • Evaluating the quality of clusters without reference to external information.

    • Use only the data

    • Internal Index

  • Determining the reliability of clusters.

    • At what confidence level can we say that the clusters are not formed by chance?

    • Statistical framework



Comparing to Ground Truth

  • Notation

    • N: number of objects in the data set;

    • P={P1,…,Pm}: the set of “ground truth” clusters;

    • C={C1,…,Cn}: the set of clusters reported by a clustering algorithm.

  • The “incidence matrix”

    • N × N (both rows and columns correspond to objects).

    • Pij = 1 if Oi and Oj belong to the same “ground truth” cluster in P; Pij=0 otherwise.

    • Cij = 1 if Oi and Oj belong to the same cluster in C; Cij=0 otherwise.



External Index

  • A pair of data objects (Oi, Oj) falls into one of the following categories:

    • SS: Cij=1 and Pij=1; (agree)

    • DD: Cij=0 and Pij=0; (agree)

    • SD: Cij=1 and Pij=0; (disagree)

    • DS: Cij=0 and Pij=1; (disagree)

  • Rand index = (SS + DD) / (SS + SD + DS + DD)

    • may be dominated by DD

  • Jaccard Coefficient = SS / (SS + SD + DS)

    • ignores DD (a counting sketch follows below)
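As an illustration, here is a minimal sketch of how the four pair categories and the two indices could be computed from two label vectors. The function and variable names (pair_counts, ground_truth, clustering) are illustrative, not from the slides.

```python
import numpy as np

def pair_counts(ground_truth, clustering):
    """Count the SS, SD, DS, DD pair categories over all object pairs (i < j)."""
    p = np.asarray(ground_truth)
    c = np.asarray(clustering)
    n = len(p)
    ss = sd = ds = dd = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_p = p[i] == p[j]   # Pij = 1?
            same_c = c[i] == c[j]   # Cij = 1?
            if same_c and same_p:
                ss += 1             # agree: together in both
            elif same_c and not same_p:
                sd += 1             # disagree
            elif not same_c and same_p:
                ds += 1             # disagree
            else:
                dd += 1             # agree: apart in both
    return ss, sd, ds, dd

def rand_index(ground_truth, clustering):
    ss, sd, ds, dd = pair_counts(ground_truth, clustering)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_coefficient(ground_truth, clustering):
    ss, sd, ds, dd = pair_counts(ground_truth, clustering)
    return ss / (ss + sd + ds)

# Toy example: five objects, ground truth vs. a reported clustering
print(rand_index([0, 0, 0, 1, 1], [0, 0, 1, 1, 1]))           # 0.6
print(jaccard_coefficient([0, 0, 0, 1, 1], [0, 0, 1, 1, 1]))  # ~0.33
```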



Internal Index

  • “Ground truth” may be unavailable

  • Use only the data to measure cluster quality

    • Measure the “homogeneity” and “separation” of clusters.

      • SSE: Sum of squared errors.

    • Calculate the correlation between the clustering result (incidence matrix) and the distance matrix.



Sum of Squared Error

  • Homogeneity is measured by the within-cluster sum of squares:

    WSS = Σ_i Σ_{x in Ci} (x - mi)², where mi is the centroid of cluster Ci.

    This is exactly the objective function of K-means.

  • Separation is measured by the between-cluster sum of squares:

    BSS = Σ_i |Ci| (m - mi)², where |Ci| is the size of cluster i and m is the centroid of the whole data set.

  • WSS + BSS = constant (the total sum of squares of the data), so the two measures trade off against each other.

  • A larger number of clusters tends to result in a smaller WSS (see the sketch below).
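A minimal sketch of the two quantities, assuming the data are held in a NumPy array X with an integer label vector; the names wss_bss, X, and labels are illustrative, not from the slides.

```python
import numpy as np

def wss_bss(X, labels):
    """Within-cluster (WSS) and between-cluster (BSS) sums of squares."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    m = X.mean(axis=0)                          # centroid of the whole data set
    wss = 0.0
    bss = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)                    # centroid of cluster k
        wss += ((Xk - mk) ** 2).sum()           # homogeneity term
        bss += len(Xk) * ((m - mk) ** 2).sum()  # separation term
    return wss, bss

# WSS + BSS equals the total sum of squares, no matter how the points are labeled
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
for labels in ([0, 0, 0, 0, 0], [0, 0, 0, 1, 1], [0, 1, 2, 3, 3]):
    wss, bss = wss_bss(X, labels)
    print(labels, round(wss, 2), round(bss, 2), round(wss + bss, 2))
```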


Sum of Squared Error

[Figure: a worked example on five points (1-5), showing the overall centroid m, the cluster centroids m1 and m2, and the SSE obtained for K = 1, K = 2, and K = 4]



Sum of Squared Error

  • SSE can also be used to estimate the number of clusters, e.g., by looking for a "knee" in the curve of SSE versus the number of clusters.



Internal Measures: SSE

  • SSE curve for a more complicated data set

[Figure: SSE of the clusters found using K-means on a more complicated data set, plotted against the number of clusters K; a sketch of how such a curve can be produced follows below]
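A minimal sketch of producing such an SSE curve with scikit-learn's KMeans, whose inertia_ attribute is the WSS/SSE of the fitted clustering. The blob data generated here is illustrative, not the data set from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: three Gaussian blobs in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=center, scale=0.05, size=(100, 2))
               for center in ([0.2, 0.2], [0.5, 0.7], [0.8, 0.3])])

# SSE (inertia) as a function of the number of clusters K
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 3))   # look for the "knee" in this curve
```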



Correlation with Distance Matrix

  • Distance Matrix

    • Dij is the distance between objects Oi and Oj.

  • Incidence Matrix

    • Cij=1 if Oi and Oj belong to the same cluster, Cij=0 otherwise

  • Compute the correlation between the two matrices

    • Only n(n-1)/2 entries (one per pair of objects) need to be calculated.

  • A strong correlation (large in magnitude) indicates good clustering; when D contains distances the correlation is negative, since objects in the same cluster tend to have small distances.



Correlation with Distance Matrix

  • Given the distance matrix D = {d11, d12, …, dnn} and the incidence matrix C = {c11, c12, …, cnn}.

  • The correlation r between D and C, computed over the n(n-1)/2 entries with i < j, is the standard (Pearson) correlation:

    r = Σ_{i<j} (dij - d̄)(cij - c̄) / sqrt( Σ_{i<j} (dij - d̄)² · Σ_{i<j} (cij - c̄)² )

    where d̄ and c̄ are the means of the dij and cij entries (a computational sketch follows below).
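A minimal sketch of this computation, using SciPy's condensed pairwise-distance format so that only the n(n-1)/2 upper-triangle entries are used. The function name matrix_correlation and the toy data are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist

def matrix_correlation(X, labels):
    """Pearson correlation between distance-matrix and incidence-matrix entries (i < j)."""
    labels = np.asarray(labels)
    n = len(labels)
    d = pdist(X)  # condensed distance matrix: one entry per pair (i < j)
    # matching condensed incidence matrix: 1 if the pair shares a cluster, else 0
    c = np.array([1.0 if labels[i] == labels[j] else 0.0
                  for i in range(n) for j in range(i + 1, n)])
    return np.corrcoef(d, c)[0, 1]

# Tight, well-separated clusters give a strongly negative correlation
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
print(matrix_correlation(X, [0, 0, 0, 1, 1, 1]))
```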



Measuring Cluster Validity Via Correlation

  • Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.

[Figure: K-means clusterings of the two data sets; the well-separated data set gives Corr = -0.9235, the random data set gives Corr = -0.5810]





Using Similarity Matrix for Cluster Validation

  • Order the similarity matrix with respect to cluster labels and inspect visually.
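A minimal sketch of this visual check, assuming matplotlib is available; the distance-to-similarity conversion used here (1 - d/dmax) and the random example data are illustrative choices, not prescribed by the slides.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def plot_sorted_similarity(X, labels):
    """Sort objects by cluster label and display the similarity matrix as an image."""
    order = np.argsort(labels)                 # group objects cluster by cluster
    D = squareform(pdist(np.asarray(X)[order]))
    S = 1.0 - D / D.max()                      # simple similarity: 1 - normalized distance
    plt.imshow(S, cmap="viridis")
    plt.colorbar(label="similarity")
    plt.xlabel("points (sorted by cluster)")
    plt.ylabel("points (sorted by cluster)")
    plt.show()                                 # good clusterings show bright diagonal blocks

# Example: random points labeled by K-means tend to show only fuzzy blocks
rng = np.random.default_rng(0)
X = rng.uniform(0.2, 0.8, size=(100, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
plot_sorted_similarity(X, labels)
```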



Using Similarity Matrix for Cluster Validation

  • Clusters in random data are not so crisp

[Figure: similarity matrix of random points, sorted by K-means cluster labels]



Using Similarity Matrix for Cluster Validation

  • Clusters in random data are not so crisp

[Figure: similarity matrix of random points, sorted by complete-link cluster labels]



Using Similarity Matrix for Cluster Validation

  • Clusters in random data are not so crisp

[Figure: similarity matrix of random points, sorted by DBSCAN cluster labels]



Reliability of Clusters

  • Need a framework to interpret any measure.

    • For example, if our evaluation measure takes the value 10, is that good, fair, or poor?

  • Statistics provide a framework for cluster validity

    • The more “atypical” a clustering result is, the more likely it represents valid structure in the data.



Statistical Framework for SSE

  • Example

    • Compare an SSE of 0.005 (obtained on the original data) against the SSEs of three-cluster solutions found in random data.

    • Histogram of the SSE over 500 random data sets, each containing 100 points whose x and y values are distributed over the range 0.2 – 0.8.

[Figure: histogram of the SSE values for the random data sets; the observed SSE = 0.005 lies well outside the histogram. A sketch of how such a reference distribution can be generated follows below.]
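A minimal sketch of generating such a reference distribution, assuming uniform random points and scikit-learn's KMeans; the specific numbers mirror the slide's description (500 data sets, 100 points each, range 0.2 – 0.8).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
sse_random = []
for _ in range(500):
    # 100 random points with x and y values in the range 0.2 - 0.8
    X = rng.uniform(0.2, 0.8, size=(100, 2))
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    sse_random.append(km.inertia_)

# If the SSE observed on the real data (e.g. 0.005) falls far below this
# distribution, the clustering is unlikely to be an artifact of chance.
print(min(sse_random), float(np.mean(sse_random)), max(sse_random))
```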



Statistical Framework for Correlation

  • Correlation of the incidence and distance matrices for the K-means clusterings of the following two data sets.

[Figure: histogram of the correlations obtained on random data sets, with the observed values Corr = -0.9235 (well-separated clusters) and Corr = -0.5810 (random data) marked]



Hyper-geometric Distribution

  • Given that M of the N genes in the data set are associated with term T, if we randomly draw n genes from the data set, what is the probability that exactly m of the selected genes are associated with T?

    P(m) = C(M, m) · C(N - M, n - m) / C(N, n), where C(a, b) denotes the binomial coefficient "a choose b".



P-Value

Based on the hypergeometric distribution, the probability of having fewer than m associated genes among the n drawn is the sum of the probabilities of drawing 0, 1, …, m - 1 associated genes. The p-value of over-representation is therefore the probability of drawing m or more:

p = 1 - Σ_{i=0..m-1} C(M, i) · C(N - M, n - i) / C(N, n)

A computational sketch follows below.
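A minimal sketch using SciPy's hypergeom distribution. Note that SciPy's positional argument order (population size, number of annotated genes, number drawn) is mapped here onto the slide's N, M, n notation, and the example numbers are purely illustrative.

```python
from scipy.stats import hypergeom

# Illustrative numbers: N genes in total, M of them annotated with term T,
# and a cluster of n genes that contains m annotated genes.
N, M, n, m = 10000, 150, 60, 8

# Probability of drawing exactly m annotated genes
p_exact = hypergeom.pmf(m, N, M, n)

# p-value of over-representation: probability of drawing m or more annotated genes
p_value = hypergeom.sf(m - 1, N, M, n)

print(p_exact, p_value)
```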

