
# Clustering Algorithms

CZ5225: Modeling and Simulation in Biology. Lecture 3: Clustering Analysis for Microarray Data I. Prof. Chen Yu Zong. Tel: 6874-6877. Email: yzchen@cz3.nus.edu.sg. http://xin.cz3.nus.edu.sg. Room 07-24, Level 7, SOC1, NUS.




Clustering Algorithms
• Be wary: confounding computational artifacts are associated with all clustering algorithms. You should always understand the basic concepts behind an algorithm before using it.
• Anything will cluster! Garbage in means garbage out.
Supervised vs. Unsupervised Learning
• Supervised: there is a teacher, class labels are known
• Support vector machines
• Backpropagation neural networks
• Unsupervised: No teacher, class labels are unknown
• Clustering
• Self-organizing maps
Gene Expression Data

Gene expression data on p genes for n mRNA samples is arranged as a matrix, with genes in rows and samples in columns:

| Gene | sample1 | sample2 | sample3 | sample4 | sample5 | … |
|------|---------|---------|---------|---------|---------|---|
| 1 | 0.46 | 0.30 | 0.80 | 1.51 | 0.90 | … |
| 2 | -0.10 | 0.49 | 0.24 | 0.06 | 0.46 | … |
| 3 | 0.15 | 0.74 | 0.04 | 0.10 | 0.20 | … |
| 4 | -0.45 | -1.03 | -0.79 | -0.56 | -0.32 | … |
| 5 | -0.06 | 1.06 | 1.35 | 1.09 | -1.09 | … |

Entry (i, j) is the expression level of gene i in mRNA sample j: log(Red intensity / Green intensity) for two-color arrays, or log(Avg. PM − Avg. MM) for oligonucleotide arrays, typically displayed on a color scale from −2 to 2.

Expression Vectors

Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types, e.g. [1.5, -0.8, 1.8, 0.5, -0.4, -1.3, 1.5, 0.8]. The same vector can be displayed as a numeric vector, a line graph, or a heatmap row.

Expression Vectors As Points in ‘Expression Space’

Each gene is a point in a space whose coordinates are the experiments:

| Gene | t1 (Experiment 1) | t2 (Experiment 2) | t3 (Experiment 3) |
|------|------|------|------|
| G1 | -0.8 | -0.3 | -0.7 |
| G2 | -0.7 | -0.8 | -0.4 |
| G3 | -0.4 | -0.6 | -0.8 |
| G4 | 0.9 | 1.2 | 1.3 |
| G5 | 1.3 | 0.9 | -0.6 |

Genes with similar expression (here G1, G2, G3) lie close together in this space.

Cluster Analysis
• Group a collection of objects into subsets or “clusters” such that objects within a cluster are more closely related to one another than objects assigned to different clusters.
How can we do this?
• What is closely related?
• Distance or similarity metric
• What is close?
• Clustering algorithm
• How do we minimize distance between objects in a group while maximizing distances between groups?
Distance Metrics
• Euclidean distance measures the straight-line distance between points
• Manhattan (city block) distance sums the distances along each dimension
• Correlation measures difference with respect to linear trends

[Figure: two points, (3.5, 4) and (5.5, 6), plotted with Gene Expression 1 and Gene Expression 2 as axes]

Clustering Gene Expression Data

Given the matrix of expression measurements (genes i in rows, conditions j in columns):
• Cluster across the rows: group together genes that behave similarly across different conditions.
• Cluster across the columns: group together conditions that behave similarly across most genes.

Clustering Time Series Data
• Measure gene expression on consecutive days
• Gene Measurement matrix
• G1= [1.2 4.0 5.0 1.0]
• G2= [2.0 2.5 5.5 6.0]
• G3= [4.5 3.0 2.5 1.0]
• G4= [3.5 1.5 1.2 1.5]
Euclidean Distance
• Distance is the square root of the sum of the squared differences between coordinates: d(x, y) = √(Σᵢ (xᵢ − yᵢ)²)
City Block or Manhattan Distance
• G1= [1.2 4.0 5.0 1.0]
• G2= [2.0 2.5 5.5 6.0]
• G3= [4.5 3.0 2.5 1.0]
• G4= [3.5 1.5 1.2 1.5]
• Distance is the sum of the absolute differences between coordinates: d(x, y) = Σᵢ |xᵢ − yᵢ|
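As a concrete illustration, both metrics can be computed directly on the G1 and G2 vectors above. This is a minimal pure-Python sketch; the function names are ours, not from the lecture:

```python
import math

# Two of the hypothetical gene expression time series from the slides
G1 = [1.2, 4.0, 5.0, 1.0]
G2 = [2.0, 2.5, 5.5, 6.0]

def euclidean(a, b):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean(G1, G2))  # ≈ 5.305
print(manhattan(G1, G2))  # ≈ 7.8
```

Note that the large difference in the last coordinate (1.0 vs. 6.0) dominates the Euclidean distance because it is squared, while Manhattan distance weights all coordinates linearly.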
Correlation Distance
• Pearson correlation measures the degree of linear relationship between two variables, with range [−1, 1]
• Distance is 1 − (Pearson correlation), with range [0, 2]
Similarity Measurements
• Pearson correlation between two profiles (vectors) X and Y:
  r(X, Y) = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / ( √(Σᵢ (Xᵢ − X̄)²) · √(Σᵢ (Yᵢ − Ȳ)²) )
• −1 ≤ Pearson correlation ≤ +1

Similarity Measurements
• Cosine correlation: like Pearson correlation, but without mean-centering:
  cos(X, Y) = Σᵢ XᵢYᵢ / (‖X‖ · ‖Y‖)
• −1 ≤ cosine correlation ≤ +1
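A minimal sketch of these similarity measures, assuming plain Python lists as profiles (function names are ours, not from the lecture):

```python
import math

def pearson(a, b):
    # Pearson correlation r, in [-1, 1]: mean-center, then normalized dot product
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a)) * math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / den

def correlation_distance(a, b):
    # 1 - r, in [0, 2]: 0 for perfectly correlated profiles
    return 1 - pearson(a, b)

def cosine_similarity(a, b):
    # like Pearson, but WITHOUT mean-centering the profiles
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

G1 = [1.2, 4.0, 5.0, 1.0]
scaled = [2 * x + 1 for x in G1]          # same linear trend as G1
print(pearson(G1, scaled))                 # ≈ 1.0
print(correlation_distance(G1, scaled))    # ≈ 0.0
```

This makes the key property of correlation distance concrete: a profile and any positively scaled and shifted copy of it have distance ≈ 0, even though their Euclidean distance is large.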

Hierarchical Clustering
• IDEA: iteratively combine genes into groups based on similar patterns of observed expression
• By combining genes with genes, or genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships
• Display the data as a heatmap with a dendrogram
• Cluster genes, samples, or both

Hierarchical Clustering

[Figure: the same clustered data shown as a Venn diagram and as a dendrogram]
Hierarchical clustering
• Merging (agglomerative): start with every measurement as a separate cluster then combine
• Splitting: make one large cluster, then split up into smaller pieces
• What is the distance between two clusters?
Distance between clusters
• Single link: the shortest distance from any member of one cluster to any member of the other cluster
• Complete link: the longest distance from any member of one cluster to any member of the other cluster
• Average link: the average of the pairwise distances between members of the two clusters
• Ward: merge the pair of clusters that gives the smallest increase in the total within-cluster sum of squares
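The first three linkage rules can be sketched directly in pure Python (function names such as `single_link` are ours; in practice a library routine like SciPy's `scipy.cluster.hierarchy.linkage` offers these as `method='single'`, `'complete'`, and `'average'`):

```python
import math

def dist(a, b):
    # Euclidean distance between two expression profiles
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(c1, c2):
    # shortest member-to-member distance
    return min(dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # longest member-to-member distance
    return max(dist(a, b) for a in c1 for b in c2)

def average_link(c1, c2):
    # average over all cross-cluster pairs
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def centroid_link(c1, c2):
    # distance between the two cluster means
    mean = lambda c: [sum(col) / len(c) for col in zip(*c)]
    return dist(mean(c1), mean(c2))

c1 = [[0.0, 0.0], [1.0, 0.0]]
c2 = [[3.0, 0.0], [5.0, 0.0]]
print(single_link(c1, c2))    # 2.0
print(complete_link(c1, c2))  # 5.0
print(average_link(c1, c2))   # 3.5
print(centroid_link(c1, c2))  # 3.5
```

The toy clusters make the differences visible: the same pair of clusters is 2.0 apart under single link but 5.0 apart under complete link.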
Hierarchical Clustering - Merging

[Figure: merging illustrated on gene expression time series, showing the distance between clusters when combined, under Euclidean distance and under Manhattan distance]

Data Standardization
• Data points are normalized with respect to mean and variance ("sphering" the data)
• After sphering, Euclidean distance and correlation distance are equivalent
• Standardization makes sense if you are interested in the shape of an effect rather than its size
• Results can be misleading for noisy data
• Every clustering method is based SOLELY on the measure of distance or similarity
• E.g. correlation measures the linear association between two genes
• What if the data are not properly transformed?
• Even good data can be ruined by the wrong choice of distance metric
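The sphering step, and the claimed equivalence of Euclidean and correlation distance after sphering, can be checked numerically. This sketch uses the population variance and our own function names; for standardized profiles the identity ‖a − b‖² = 2n(1 − r) holds exactly:

```python
import math

def standardize(v):
    # "sphere" one profile: shift to mean 0, scale to unit (population) variance
    n = len(v)
    m = sum(v) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in v) / n)
    return [(x - m) / sd for x in v]

def pearson_standardized(a, b):
    # for ALREADY-standardized profiles, correlation is just the mean product
    return sum(x * y for x, y in zip(a, b)) / len(a)

a = standardize([1.2, 4.0, 5.0, 1.0])
b = standardize([2.0, 2.5, 5.5, 6.0])
n = len(a)

d2 = sum((x - y) ** 2 for x, y in zip(a, b))   # squared Euclidean distance
r = pearson_standardized(a, b)
print(abs(d2 - 2 * n * (1 - r)) < 1e-9)        # the two notions agree
```

So after sphering, ranking gene pairs by Euclidean distance gives exactly the same ordering as ranking them by correlation distance.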

A

B

C

D

Hierarchical Clustering

Initial Data Items

Distance Matrix

A

B

C

D

Hierarchical Clustering

Initial Data Items

Distance Matrix

A

B

C

D

Hierarchical Clustering

Current Clusters

Distance Matrix

2

A

B

C

D

Hierarchical Clustering

Current Clusters

Distance Matrix

A

B

C

D

Hierarchical Clustering

Current Clusters

Distance Matrix

A

B

C

D

Hierarchical Clustering

Current Clusters

Distance Matrix

3

A

B

C

D

Hierarchical Clustering

Current Clusters

Distance Matrix

A

B

C

D

Hierarchical Clustering

Current Clusters

Distance Matrix

A

B

C

D

Hierarchical Clustering

Current Clusters

Distance Matrix

10

A

B

C

D

Hierarchical Clustering

Final Result

Distance Matrix

Hierarchical Clustering

[Figure sequence: the same step-by-step agglomerative clustering applied to Gene 1 through Gene 8, building up the dendrogram one merge at a time]

[Figure: the clustered result as a heatmap with genes as rows and samples as columns]

The Leaf Ordering Problem:

• Find the 'optimal' layout of branches for a given dendrogram architecture
• There are 2^(N−1) possible orderings of the branches for N leaves
• For a small microarray dataset of 500 genes, that is about 1.6 × 10^150 branch configurations
Hierarchical Clustering
• Pros:
• Commonly used algorithm
• Simple and quick to calculate
• Cons:
• Real genes probably do not have a hierarchical organization
Using Hierarchical Clustering
• Choose what samples and genes to use in your analysis
• Choose similarity/distance metric
• Choose clustering direction
• Calculate the dendrogram
• Choose height/number of clusters for interpretation
• Assess results
• Interpret cluster structure
1. Choose what samples/genes to include
• Very important step
• Do you want to include housekeeping genes or genes that didn’t change in your results?
• How do you handle replicates from the same sample?
• Noisy samples?
• Dendrogram is a mess if everything is included in large datasets
• Gene screening
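One simple gene-screening rule is to keep only the most variable genes, which drops flat, housekeeping-like profiles before clustering. This is a hypothetical illustration; `screen_genes` is our name, not a standard API:

```python
def variance(v):
    # population variance of one expression profile
    n = len(v)
    m = sum(v) / n
    return sum((x - m) ** 2 for x in v) / n

def screen_genes(matrix, top_k):
    # keep only the top_k most variable genes (one simple screening rule)
    return sorted(matrix, key=variance, reverse=True)[:top_k]

genes = [
    [0.1, 0.1, 0.1, 0.1],   # flat, housekeeping-like gene: variance 0
    [1.2, 4.0, 5.0, 1.0],
    [2.0, 2.5, 5.5, 6.0],
]
print(screen_genes(genes, 2))  # the flat gene is screened out
```

Other screening rules (fold-change thresholds, present/absent calls) work the same way: filter the matrix before computing any distances, so the dendrogram is not cluttered by uninformative genes.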
2. Choose distance metric
• Metric should be a valid measure of the distance/similarity of genes
• Examples
• Applying Euclidean distance to categorical data is invalid
• Correlation metric applied to highly skewed data will give misleading results
3. Choose clustering direction
• Merging (agglomerative, bottom-up)
• Divisive (top-down)
• split so that genes within each of the two clusters are as similar as possible, maximizing the distance between the clusters
Nearest Neighbor Algorithm
• The nearest neighbor algorithm is an agglomerative (bottom-up) approach.
• It starts with n nodes (n is the size of the sample), merges the two most similar nodes at each step, and stops when the desired number of clusters is reached.
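The algorithm just described can be sketched in a few lines of pure Python, using single-link (nearest neighbor) merging on the four time-series genes from earlier; the function names are ours:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor_cluster(points, k):
    """Agglomerative (bottom-up) clustering: start with each point as its own
    cluster, repeatedly merge the two clusters with the smallest single-link
    distance, and stop when k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

genes = [
    [1.2, 4.0, 5.0, 1.0],   # G1
    [2.0, 2.5, 5.5, 6.0],   # G2
    [4.5, 3.0, 2.5, 1.0],   # G3
    [3.5, 1.5, 1.2, 1.5],   # G4
]
print(nearest_neighbor_cluster(genes, 2))
```

On this data, G3 and G4 merge first (they are closest), then G1 joins them, leaving G2 as a singleton when we stop at k = 2. The O(n³)-ish brute-force search is fine for a sketch; real implementations maintain and update the distance matrix instead.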

Hierarchical Clustering
• Calculate the similarity between all possible combinations of two profiles
• Group the two most similar clusters together to form a new cluster
• Calculate the similarity between the new cluster and all remaining clusters; repeat
• Keys: the similarity measure and the clustering rule

Hierarchical Clustering

[Figure: three clusters C1, C2, C3 — which pair of clusters should be merged?]

Hierarchical Clustering (single linkage)

• Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters
• Tends to generate “long chains”

Hierarchical Clustering (complete linkage)

• Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters
• Tends to generate compact “clumps”

Hierarchical Clustering (average linkage)

• Dissimilarity between two clusters = average of the distances over all pairs of objects (one from each cluster)

Hierarchical Clustering (centroid linkage)

• Dissimilarity between two clusters = distance between the two cluster means

Which one?
• Both approaches are “step-wise” optimal: at each step the optimal split or merge is performed
• This does not mean that the final result is globally optimal
• Merging (agglomerative):
• Computationally simple
• Precise at the bottom of the tree
• Good for many small clusters
• Divisive:
• More complex, but more precise at the top of the tree
• Good for looking at large and/or few clusters
• For gene expression applications, divisive clustering makes more sense