slide1

CZ5225: Modeling and Simulation in Biology
Lecture 3: Clustering Analysis for Microarray Data I
Prof. Chen Yu Zong
Tel: 6874-6877
Email: yzchen@cz3.nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS

clustering algorithms
Clustering Algorithms
  • Be wary: confounding computational artifacts are associated with all clustering algorithms. You should always understand the basic concepts behind an algorithm before using it.
  • Anything will cluster! Garbage in means garbage out.
supervised vs unsupervised learning
Supervised vs. Unsupervised Learning
  • Supervised: there is a teacher, class labels are known
      • Support vector machines
      • Backpropagation neural networks
  • Unsupervised: No teacher, class labels are unknown
      • Clustering
      • Self-organizing maps
gene expression data
Gene Expression Data

Gene expression data on p genes for n mRNA samples (genes in rows, samples in columns):

Gene    sample1   sample2   sample3   sample4   sample5   …
1         0.46      0.30      0.80      1.51      0.90    ...
2        -0.10      0.49      0.24      0.06      0.46    ...
3         0.15      0.74      0.04      0.10      0.20    ...
4        -0.45     -1.03     -0.79     -0.56     -0.32    ...
5        -0.06      1.06      1.35      1.09     -1.09    ...

Gene expression level of gene i in mRNA sample j = log(Red intensity / Green intensity) or log(Avg. PM − Avg. MM)

expression vectors
Expression Vectors

Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.

Example vector: [1.5, -0.8, 1.8, 0.5, -0.4, -1.3, 1.5, 0.8]

The same vector can be displayed as a numeric vector, as a line graph, or as a heatmap (colour scale from -2 to +2).

expression vectors as points in expression space
Expression Vectors As Points in ‘Expression Space’

Each gene’s expression vector is a point in a space whose axes are the experiments:

Gene    t1      t2      t3
G1     -0.8    -0.3    -0.7
G2     -0.7    -0.8    -0.4
G3     -0.4    -0.6    -0.8
G4      0.9     1.2     1.3
G5      1.3     0.9    -0.6

Genes with similar expression (e.g. G1, G2 and G3) lie close together in the space spanned by Experiments 1, 2 and 3.

cluster analysis
Cluster Analysis
  • Group a collection of objects into subsets or “clusters” such that objects within a cluster are more closely related to one another than to objects assigned to different clusters.
how can we do this
How can we do this?
  • What is closely related?
      • Distance or similarity metric
      • What is close?
  • Clustering algorithm
      • How do we minimize distance between objects in a group while maximizing distances between groups?
distance metrics
Distance Metrics
  • Euclidean distance measures the overall straight-line distance between points
  • Manhattan (City Block) distance measures the distance accumulated separately in each dimension
  • Correlation measures differences with respect to linear trends

[Figure: two example points, (3.5, 4) and (5.5, 6), plotted on axes Gene Expression 1 vs. Gene Expression 2.]
clustering gene expression data
Clustering Gene Expression Data

Expression measurements form a matrix with genes i in the rows and conditions/samples j in the columns:

  • Cluster across the rows: group genes together that behave similarly across different conditions.
  • Cluster across the columns: group different conditions together that behave similarly across most genes.
clustering time series data
Clustering Time Series Data
  • Measure gene expression on consecutive days
  • Gene Measurement matrix
  • G1= [1.2 4.0 5.0 1.0]
  • G2= [2.0 2.5 5.5 6.0]
  • G3= [4.5 3.0 2.5 1.0]
  • G4= [3.5 1.5 1.2 1.5]
euclidean distance
Euclidean Distance
  • Distance is the square root of the sum of the squared differences between coordinates: d(x, y) = sqrt( Σi (xi − yi)² )
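A minimal sketch (mine, not from the slides) of this formula applied to the gene measurement matrix above, using NumPy:

```python
import numpy as np

# Gene measurement matrix from the slide: one row per gene, one column per day.
G = np.array([
    [1.2, 4.0, 5.0, 1.0],   # G1
    [2.0, 2.5, 5.5, 6.0],   # G2
    [4.5, 3.0, 2.5, 1.0],   # G3
    [3.5, 1.5, 1.2, 1.5],   # G4
])

# Euclidean distance between G1 and G2:
# square root of the sum of squared coordinate differences.
d12 = np.sqrt(np.sum((G[0] - G[1]) ** 2))
print(d12)   # ~5.30
```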
city block or manhattan distance
City Block or Manhattan Distance
  • G1= [1.2 4.0 5.0 1.0]
  • G2= [2.0 2.5 5.5 6.0]
  • G3= [4.5 3.0 2.5 1.0]
  • G4= [3.5 1.5 1.2 1.5]
  • Distance is the sum of the absolute differences between coordinates: d(x, y) = Σi |xi − yi|
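The same pair of genes under the Manhattan metric, as a quick sketch:

```python
import numpy as np

G1 = np.array([1.2, 4.0, 5.0, 1.0])
G2 = np.array([2.0, 2.5, 5.5, 6.0])

# Manhattan (city block) distance: sum of absolute coordinate differences.
d = np.sum(np.abs(G1 - G2))
print(d)   # 0.8 + 1.5 + 0.5 + 5.0 = 7.8
```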
correlation distance
Correlation Distance
  • Pearson correlation measures the degree of linear relationship between two variables and lies in [-1, 1]
  • Distance is 1 − (Pearson correlation), with range [0, 2]
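A short sketch of this correlation distance; scipy.stats.pearsonr is an assumed tool choice, not part of the lecture:

```python
import numpy as np
from scipy.stats import pearsonr

G1 = np.array([1.2, 4.0, 5.0, 1.0])
G3 = np.array([4.5, 3.0, 2.5, 1.0])

r, _ = pearsonr(G1, G3)   # Pearson correlation, in [-1, 1]
d = 1.0 - r               # correlation distance, in [0, 2]
print(r, d)
```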
similarity measurements
Similarity Measurements
  • Pearson Correlation

Two profiles (vectors) X = (x1, …, xn) and Y = (y1, …, yn):

Pearson(X, Y) = Σi (xi − mean(X)) (yi − mean(Y)) / sqrt( Σi (xi − mean(X))² · Σi (yi − mean(Y))² )

−1 ≤ Pearson correlation ≤ +1

similarity measurements1
Similarity Measurements
  • Cosine Correlation

Cosine(X, Y) = (X · Y) / (|X| |Y|),   −1 ≤ Cosine correlation ≤ +1
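A small sketch of the cosine similarity defined above (the helper function name is mine):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two expression profiles, in [-1, 1]."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity([1.2, 4.0, 5.0, 1.0], [2.0, 2.5, 5.5, 6.0]))
```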

hierarchical clustering
Hierarchical Clustering
  • IDEA: Iteratively combines genes into groups based on similar patterns of observed expression
  • By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships.
  • Display the data as a heatmap and dendrogram
  • Cluster genes, samples or both

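One common way to realise this idea in practice is SciPy's hierarchical clustering routines; the sketch below (my own, assuming SciPy and matplotlib are available) builds and draws a dendrogram for a toy gene-by-condition matrix:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Toy expression matrix: rows = genes, columns = conditions.
X = np.array([
    [1.2, 4.0, 5.0, 1.0],
    [2.0, 2.5, 5.5, 6.0],
    [4.5, 3.0, 2.5, 1.0],
    [3.5, 1.5, 1.2, 1.5],
])

Z = linkage(pdist(X, metric="euclidean"), method="average")  # iterative merge tree
dendrogram(Z, labels=["G1", "G2", "G3", "G4"])               # hierarchy of relationships
plt.show()
```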

hierarchical clustering1
Hierarchical Clustering

Venn Diagram of Clustered Data

Dendrogram

hierarchical clustering2
Hierarchical clustering
  • Merging (agglomerative): start with every measurement as a separate cluster, then repeatedly combine the closest clusters
  • Splitting (divisive): start with one large cluster, then split it up into smaller pieces
  • What is the distance between two clusters?
distance between clusters
Distance between clusters
  • Single-link: distance is the shortest distance from any member of one cluster to any member of the other cluster
  • Complete link: distance is the longest distance from any member of one cluster to any member of the other cluster
  • Average: distance is the average of all pairwise distances between members of the two clusters
  • Ward: merge the pair of clusters that gives the smallest increase in the total within-cluster sum of squares
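For illustration (my own sketch, not part of the slides), these four rules map directly onto the method argument of SciPy's linkage function:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.random.rand(20, 5)   # toy data: 20 genes x 5 conditions
D = pdist(X)                # condensed pairwise (Euclidean) distance matrix

Z_single   = linkage(D, method="single")    # shortest distance between cluster members
Z_complete = linkage(D, method="complete")  # longest distance between cluster members
Z_average  = linkage(D, method="average")   # average of all pairwise distances
Z_ward     = linkage(X, method="ward")      # smallest increase in within-cluster sum of squares
```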
hierarchical clustering merging
Hierarchical Clustering-Merging
  • Euclidean distance
  • Average linking

[Figure: gene expression time series clustered by average linkage with Euclidean distance; the axis shows the distance between clusters when they are combined.]

manhattan distance
Manhattan Distance

  • Average linking

[Figure: the same gene expression time series clustered with Manhattan distance; the axis shows the distance between clusters when they are combined.]

data standardization
Data Standardization
  • Data points are normalized with respect to mean and variance, “sphering” the data
  • After sphering, Euclidean and correlation distance are equivalent
  • Standardization makes sense if you are not interested in the size of the effects, but in the effect itself
  • Results can be misleading for noisy data (standardizing a gene whose variation is mostly noise amplifies that noise)
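A brief sketch of the "sphering" step (each gene standardized to mean 0 and variance 1); the comment notes why Euclidean and correlation distance then agree:

```python
import numpy as np

X = np.random.rand(100, 6)   # toy data: 100 genes x 6 samples

# Standardize each gene (row): subtract its mean, divide by its standard deviation.
Xs = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# For standardized rows x and y of length n:
#   ||x - y||^2 = 2 * n * (1 - correlation(x, y)),
# so Euclidean distance and correlation distance rank gene pairs identically.
```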
distance comments
Distance Comments
  • Every clustering method is based SOLELY on the chosen measure of distance or similarity
  • e.g. correlation measures the linear association between two genes
      • What if data are not properly transformed?
      • What about outliers?
      • What about saturation effects?
  • Even good data can be ruined with the wrong choice of distance metric
slide26 to slide35

Hierarchical Clustering: Single Linkage (worked example)

[Figure sequence: four initial data items A, B, C and D with their distance matrix. At each step the two closest clusters are merged (at distances of roughly 2, 3 and 10 in the original slides), the distance matrix is updated for the current clusters, and the process repeats until all items are joined in the final result.]
hierarchical clustering3 to hierarchical clustering10

Hierarchical Clustering

[Figure sequence: step-by-step construction of a dendrogram for Gene 1 through Gene 8; at each step the two most similar genes or clusters are joined, until all eight genes are connected in a single tree.]
hierarchical clustering12
Hierarchical Clustering

[Figure: genes-by-samples matrix with its dendrogram.]

The Leaf Ordering Problem:

  • Find the ‘optimal’ layout of branches for a given dendrogram architecture
  • There are 2^(N−1) possible orderings of the branches
  • For a small microarray dataset of 500 genes, that is about 1.6 × 10^150 branch configurations
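SciPy ships a routine for this problem; the sketch below (an assumption about tooling, not something the slides reference) reorders the branches so that similar leaves end up adjacent:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list
from scipy.spatial.distance import pdist

X = np.random.rand(50, 8)   # toy expression matrix: 50 genes x 8 samples
D = pdist(X)

Z = linkage(D, method="average")
Z_ordered = optimal_leaf_ordering(Z, D)   # flips branches to place similar leaves next to each other

print(leaves_list(Z_ordered))             # leaf order to use when drawing the heatmap
```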
hierarchical clustering13
Hierarchical Clustering

The Leaf Ordering Problem:

hierarchical clustering14
Hierarchical Clustering
  • Pros:
    • Commonly used algorithm
    • Simple and quick to calculate
  • Cons:
    • Real genes probably do not have a hierarchical organization
using hierarchical clustering
Using Hierarchical Clustering
  1. Choose what samples and genes to use in your analysis
  2. Choose a similarity/distance metric
  3. Choose the clustering direction
  4. Choose the linkage method
  5. Calculate the dendrogram
  6. Choose the height/number of clusters for interpretation
  7. Assess the results
  8. Interpret the cluster structure
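A hedged end-to-end sketch of steps 2 through 7 of this workflow using SciPy; the dataset, metric, linkage, and number of clusters below are placeholders, not recommendations from the lecture:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(200, 10)                      # step 1: genes x samples, already filtered

D = pdist(X, metric="correlation")               # step 2: distance metric (1 - Pearson)
Z = linkage(D, method="average")                 # steps 3-4: cluster genes, average linkage

dendrogram(Z, no_labels=True)                    # step 5: calculate and draw the dendrogram
plt.show()

labels = fcluster(Z, t=6, criterion="maxclust")  # step 6: cut the tree into 6 clusters
print(np.bincount(labels))                       # step 7: assess results (cluster sizes)
```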
choose what samples genes to include
1. Choose what samples/genes to include
  • Very important step
  • Do you want to include housekeeping genes, or genes that didn’t change, in your results?
  • How do you handle replicates from the same sample?
  • Noisy samples?
  • The dendrogram becomes a mess if everything is included in a large dataset
  • Gene screening (filtering) helps
2 choose distance metric
2. Choose distance metric
  • Metric should be a valid measure of the distance/similarity of genes
  • Examples
    • Applying Euclidean distance to categorical data is invalid
    • Correlation metric applied to highly skewed data will give misleading results
3 choose clustering direction
3. Choose clustering direction
  • Merging (agglomerative, bottom-up)
  • Divisive (top-down)
    • Split so that the genes within each cluster are as similar as possible, maximizing the distance between the clusters
nearest neighbor algorithm
Nearest Neighbor Algorithm
  • Nearest Neighbor Algorithm is an agglomerative approach (bottom-up).
  • Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.
slide61

Hierarchical Clustering

  • Key steps: similarity, then clustering

  1. Calculate the similarity between all possible combinations of two profiles.
  2. Group the two most similar clusters together to form a new cluster.
  3. Calculate the similarity between the new cluster and all remaining clusters, and repeat.
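To make the loop concrete, here is a minimal from-scratch sketch of the procedure above, using Euclidean distance and single linkage (all names are mine; a real analysis would normally use a library implementation):

```python
import numpy as np

def agglomerate(X, n_clusters):
    """Repeatedly merge the two most similar (closest) clusters, single linkage,
    until only n_clusters remain. Returns lists of row indices of X."""
    clusters = [[i] for i in range(len(X))]                       # every profile starts alone
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)    # pairwise distances

    def cluster_dist(a, b):
        # single linkage: smallest distance between any member of a and any member of b
        return min(D[i, j] for i in a for j in b)

    while len(clusters) > n_clusters:
        # 1. similarity between all possible pairs of current clusters; pick the closest pair
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]))
        # 2. merge the two most similar clusters into a new cluster
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        # 3. distances to the new cluster are recomputed by cluster_dist on the next pass
    return clusters

X = np.array([[1.2, 4.0, 5.0, 1.0],
              [2.0, 2.5, 5.5, 6.0],
              [4.5, 3.0, 2.5, 1.0],
              [3.5, 1.5, 1.2, 1.5]])
print(agglomerate(X, 2))   # e.g. [[0, 2, 3], [1]] for this toy matrix
```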

hierarchical clustering15
Hierarchical Clustering

[Figure: three current clusters C1, C2 and C3; merge which pair of clusters?]
slide63

Hierarchical Clustering

Single Linkage

Dissimilarity between two clusters = Minimum dissimilarity between the members of two clusters


Tends to generate long “chains”

slide64

Hierarchical Clustering

Complete Linkage

Dissimilarity between two clusters = Maximum dissimilarity between the members of two clusters


Tends to generate “clumps”

slide65

Hierarchical Clustering

Average Linkage

Dissimilarity between two clusters = Averaged distances of all pairs of objects (one from each cluster).


slide66

Hierarchical Clustering

Average Group Linkage

Dissimilarity between two clusters = Distance between two cluster means.


which one
Which one?
  • Both methods are “step-wise” optimal: at each step the optimal split or merge is performed
  • This doesn’t mean that the final result is optimal
  • Merging:
      • Computationally simple
      • Precise at bottom of tree
      • Good for many small clusters
  • Divisive
      • More complex, but more precise at the top of the tree
      • Good for looking at large and/or few clusters
  • For gene expression applications, divisive clustering makes more sense