Basic Gene Expression Data Analysis--Clustering

PowerPoint Slideshow about 'Basic Gene Expression Data Analysis--Clustering' - crescent

Pairwise Measures

Clustering

Motif Searching/Network Construction

Integrated Analysis (NMR/SNP/Clinic/….)

Microarray Experiment
  • mRNA from treated and control samples: RT to cDNA and label with fluor dyes
  • Mix and hybridize target to microarray
  • Spot (DNA probe): known cDNA or oligo
Collections of Experiments
  • Time course after a treatment
  • Different treatments
  • Disease cell lines
  • Data are represented in a matrix
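
As a minimal sketch of the matrix representation, assume one row per gene and one column per experiment. The gene names, experiment names, and values below are invented for illustration and do not come from the slides.

```python
# Hypothetical expression matrix: one row per gene, one column per experiment.
# All names and values are made up for illustration only.
genes = ["geneA", "geneB", "geneC"]
experiments = ["t0", "t1", "t2", "t3"]
matrix = [
    [0.1, 0.8, 1.5, 0.9],    # geneA
    [0.2, 0.9, 1.4, 1.0],    # geneB
    [1.2, 0.3, -0.5, 0.1],   # geneC
]

def profile(gene):
    """Return the expression profile (row) of one gene."""
    return matrix[genes.index(gene)]

def condition(name):
    """Return the expression of every gene in one experiment (column)."""
    col = experiments.index(name)
    return [row[col] for row in matrix]
```

Clustering genes compares rows of this matrix; clustering experiments or cell lines compares columns.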
Cluster Analysis
  • Grouping of genes with “similar” expression profiles
  • Grouping of disease cell lines/toxicants with “similar” effects on gene expression
  • Clustering algorithms
    • Hierarchical clustering
    • Self-organizing maps
    • K-means clustering
Gene Expression Clustering
  • Normalized expression data
  • Semantics of clusters: from co-expressed to co-regulated
  • Protein/protein complex
  • DNA regulatory elements

Key Terms in Cluster Analysis
  • Distance & Similarity measures
  • Hierarchical & non-hierarchical
  • Single/complete/average linkage
  • Dendrograms & ordering
Measuring Similarity of Gene Expression

For two points $(x_1, y_1)$ and $(x_2, y_2)$:

  • Euclidean (L2) distance
  • Manhattan (L1) distance
  • Lm: $(|x_1 - x_2|^m + |y_1 - y_2|^m)^{1/m}$
  • L∞: $\max(|x_1 - x_2|, |y_1 - y_2|)$
  • Inner product: $x_1 x_2 + y_1 y_2$
  • Correlation coefficient
  • Spearman rank correlation coefficient
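
The measures listed above can be sketched in plain Python as follows. This is an illustrative sketch rather than a reference implementation: the function names are my own, profiles are assumed to be equal-length sequences of numbers, and the correlation coefficient is taken to be Pearson's.

```python
import math

def minkowski(p, q, m):
    """L_m distance; m=1 gives Manhattan, m=2 gives Euclidean."""
    return sum(abs(a - b) ** m for a, b in zip(p, q)) ** (1.0 / m)

def chebyshev(p, q):
    """L-infinity distance: the largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))

def inner_product(p, q):
    """Sum of coordinate-wise products."""
    return sum(a * b for a, b in zip(p, q))

def pearson(p, q):
    """Pearson correlation coefficient.
    Raises ZeroDivisionError if either profile is constant."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return cov / (sp * sq)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            result[order[pos]] = avg
        i = j + 1
    return result

def spearman(p, q):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(p), ranks(q))
```

For example, `minkowski((0, 0), (3, 4), 2)` is the Euclidean distance between the two points and `minkowski((0, 0), (3, 4), 1)` is the Manhattan distance.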

Manhattan distance is called Hamming distance when all features are binary.

Gene Expression Levels Under 17 Conditions (1-High,0-Low)
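
A small check of this statement, as a sketch: for 0/1 profiles the sum of absolute differences equals the number of positions that differ. The two binary profiles below are invented and are not the data from the slide.

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def hamming(p, q):
    """Number of positions at which two profiles differ."""
    return sum(1 for a, b in zip(p, q) if a != b)

# Hypothetical binary profiles (1 = high, 0 = low); values are made up.
gene_a = [1, 0, 1, 1, 0, 0, 1]
gene_b = [1, 1, 1, 0, 0, 0, 1]

# The two measures agree because every coordinate difference is 0 or 1.
same = manhattan(gene_a, gene_b) == hamming(gene_a, gene_b)
```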

From Clustering to Correlation

[Figure: three panels plotting expression level against time for Gene A and Gene B]


Hierarchical Clustering

Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one fewer cluster.

3. Compute distances (similarities) between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
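
The four steps above can be sketched as a deliberately naive Python function. The names `agglomerative`, `linkage`, and `num_clusters` are assumptions introduced here. Instead of updating a cluster-level matrix in step 3, this sketch recomputes cluster distances from the item distances on every pass, which is slower but easier to follow.

```python
def agglomerative(dist, linkage=min, num_clusters=1):
    """Naive agglomerative clustering on a full N x N distance matrix.

    dist[i][j] is the distance between items i and j.
    linkage reduces the item-to-item distances between two clusters to one
    number: min gives single linkage, max gives complete linkage.
    Merging stops when num_clusters clusters remain.
    Returns (clusters, merges); merges lists (cluster, cluster, distance).
    """
    clusters = [[i] for i in range(len(dist))]   # step 1: one item per cluster
    merges = []
    while len(clusters) > num_clusters:          # step 4: repeat
        best = None
        # Step 2: find the closest pair of clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Step 3 is implicit: distances to the merged cluster are recomputed
        # from the item distances on the next pass of the loop.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

# Hypothetical example: four items at positions 0, 1, 3, 7 on a line.
positions = [0, 1, 3, 7]
example_dist = [[abs(p - q) for q in positions] for p in positions]
```

Calling `agglomerative(example_dist, num_clusters=2)` stops the merging early, which corresponds to cutting the tree at two clusters.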

Hierarchical Clustering

Normalized Expression Data

Hierarchical Clustering

3 clusters?

2 clusters?

Cluster Analysis
  • Correlation as measure of co-expression
  • Eisen et al. (1998) (PNAS, 95:14863)

[Figure: experiment over time (control, t0, t1, t2, ...) for N genes, summarized as an N*N correlation matrix]


Cluster Analysis

[Figure: N*N correlation matrix]

  • Scan matrix for maximum
  • Join genes to 1 node
  • Update matrix
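
One pass of scan, join, and update on a similarity matrix might look like the sketch below. The slide does not say how the matrix is updated, so the size-weighted average used here is an assumption, as are the function name and the `labels` and `sizes` arguments.

```python
def join_most_similar(sim, labels, sizes):
    """One pass of scan / join / update on a symmetric similarity matrix.

    sim: list of lists of similarities (e.g. correlations) between nodes.
    labels: one label per node; sizes: number of genes under each node.
    Returns a new (sim, labels, sizes) with the two most similar nodes joined.
    """
    n = len(sim)
    # Scan the matrix for the maximum off-diagonal entry.
    _, a, b = max((sim[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    keep = [k for k in range(n) if k not in (a, b)]
    # Update: similarity of the new node to each remaining node is the
    # size-weighted average of the two joined rows (an assumed update rule).
    total = sizes[a] + sizes[b]
    new_row = [(sizes[a] * sim[a][k] + sizes[b] * sim[b][k]) / total for k in keep]
    new_sim = [[sim[i][j] for j in keep] + [new_row[x]] for x, i in enumerate(keep)]
    new_sim.append(new_row + [1.0])
    # Join: the new node is labeled with the pair of labels it replaces.
    new_labels = [labels[k] for k in keep] + [(labels[a], labels[b])]
    new_sizes = [sizes[k] for k in keep] + [total]
    return new_sim, new_labels, new_sizes
```

Applying the function repeatedly until one node remains yields nested label tuples that describe the tree.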
Cluster Analysis
  • Result: Dendrogram assembling N genes
  • Points of discussion
    • similarity based, useful for co-expression
    • dependent on similarity measure?
    • useful in preliminary scans
    • biological relevance of clusters?
Distance Between Two Clusters

[Figure: two clusters with the min distance, max distance, and average distance between them marked]

  • single-link clustering (also called the connectedness or minimum method): we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.
  • complete-link clustering (also called the diameter or maximum method): we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.
  • average-link clustering: we consider the distance between one cluster and another cluster to be equal to the average distance over all pairs made of one member from each cluster.
  • Single-Link Method / Nearest Neighbor
  • Complete-Link / Furthest Neighbor
  • Distance between their centroids.
  • Average of all cross-cluster pairs.
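
The cluster-to-cluster distances described above can be sketched as follows, assuming clusters are lists of coordinate tuples and Euclidean distance between members. The function names are illustrative.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2):
    """Shortest distance between any cross-cluster pair (nearest neighbor)."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    """Longest distance between any cross-cluster pair (furthest neighbor)."""
    return max(euclidean(p, q) for p in c1 for q in c2)

def average_link(c1, c2):
    """Average distance over all cross-cluster pairs."""
    return sum(euclidean(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid(c):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(x) / len(c) for x in zip(*c))

def centroid_distance(c1, c2):
    """Distance between the centroids of the two clusters."""
    return euclidean(centroid(c1), centroid(c2))

# Hypothetical clusters on a line, for illustration only.
cluster_1 = [(0, 0), (1, 0)]
cluster_2 = [(3, 0), (5, 0)]
```

On these two made-up clusters the single-link distance is the gap between (1, 0) and (3, 0), while the complete-link distance spans (0, 0) to (5, 0).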
Single-Link Method

[Figure: worked example on a distance matrix of Euclidean distances between items a, b, c, d; the clusters shown are a,b, then a,b,c, then a,b,c,d, over steps (1) to (3)]


Complete-Link Method

[Figure: worked example on a distance matrix of Euclidean distances between items a, b, c, d; the clusters shown are a,b and c,d, then a,b,c,d, over steps (1) to (3)]


Identifying disease genes

[Figure: clustered expression data for tumor liver and non-tumor liver samples, with gene groups labeled Endothelial cells 1, Proliferation, Ribosomal proteins, and Liver-specific]

X. Chen & P.O. Brown et al., Molecular Biology of the Cell, Vol. 13, 1929-1939, June 2002


Human tumor patient and normal cells; various conditions
  • Cluster or Classify genes according to tumors
  • Cluster tumors according to genes
K-Means Clustering Algorithm

1) Select an initial partition of k clusters

2) Assign each object to the cluster with the closest center

3) Compute the new centers of the clusters

4) Repeat steps 2 and 3 until no object changes cluster
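
The steps of this algorithm can be sketched in plain Python as below. Several details are assumptions made for the sketch and are not stated in the slides: the initial partition is a random round-robin deal, distance is squared Euclidean, an empty cluster is re-seeded with a random point, and `max_iter` caps the number of passes.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means on a list of coordinate tuples; needs 1 <= k <= len(points).

    Returns (assign, centers): the cluster index of each point and the
    centers computed in the last pass.
    """
    rng = random.Random(seed)
    # 1) Initial partition: shuffle the points and deal them round-robin,
    #    so every cluster starts non-empty.
    order = list(range(len(points)))
    rng.shuffle(order)
    assign = [0] * len(points)
    for pos, idx in enumerate(order):
        assign[idx] = pos % k
    centers = []
    for _ in range(max_iter):
        # 3) Compute the center (mean) of each cluster.
        centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                # Assumed fallback: re-seed an empty cluster with a random point.
                centers.append(points[rng.randrange(len(points))])
        # 2) Assign each object to the cluster with the closest center.
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        # 4) Stop when no object changes cluster.
        if new_assign == assign:
            break
        assign = new_assign
    return assign, centers

# Hypothetical data: two well-separated groups of three points each.
example_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
```

Because the result depends on the initial partition, it is common to run the procedure with several seeds and compare the outcomes.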
K-Means Clustering
  • Basic ideas: use the cluster centre (mean) to represent each cluster, and assign data elements to the closest cluster (centre)
  • Goal: minimise the square error (intra-class dissimilarity)
  • Variations of K-Means
    • Initialisation (select the number of clusters, initial partitions)
    • Updating of centre
    • Hill-climbing (trying to move an object to another cluster)
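
The square error named in the goal above is commonly written as the within-cluster sum of squares. The notation is an assumption introduced here: $C_j$ is the set of objects in cluster $j$, $\mu_j$ is its centre, and $k$ is the number of clusters.

```latex
E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

Reassigning an object to a closer centre, or recomputing a centre as the mean of its members, never increases $E$, which is why the iteration settles.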

This method starts by taking as many components of the population as the final required number of clusters, chosen so that these points are mutually farthest apart. Next, it examines each remaining component in the population and assigns it to the cluster at minimum distance. The centroid's position is recalculated every time a component is added to a cluster, and this continues until all the components are grouped into the final required number of clusters.
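
One possible reading of this description is sketched below. The text does not say how the mutually farthest points are chosen, so the greedy farthest-first selection here is an assumption, as are the function names and the use of squared Euclidean distance.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first_seeds(points, k):
    """Greedily pick k seed indices that are far from one another.
    Starting from the first point is arbitrary (an assumption)."""
    seeds = [0]
    while len(seeds) < k:
        candidates = [i for i in range(len(points)) if i not in seeds]
        # Take the point whose nearest already-chosen seed is farthest away.
        nxt = max(candidates,
                  key=lambda i: min(sq_dist(points[i], points[s]) for s in seeds))
        seeds.append(nxt)
    return seeds

def sequential_kmeans(points, k):
    """Assign points one at a time, updating the centroid after each addition."""
    seeds = farthest_first_seeds(points, k)
    centroids = [tuple(points[s]) for s in seeds]
    counts = [1] * k
    assign = {s: c for c, s in enumerate(seeds)}
    for i, p in enumerate(points):
        if i in assign:
            continue
        # Assign the point to the cluster at minimum distance.
        c = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
        counts[c] += 1
        # Running-mean update of the centroid that received the point.
        centroids[c] = tuple(m + (x - m) / counts[c]
                             for m, x in zip(centroids[c], p))
        assign[i] = c
    return [assign[i] for i in range(len(points))], centroids

# Hypothetical data: two well-separated groups of three points each.
example_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
```

Unlike the iterative procedure, this variant makes a single pass, so the result depends on the order in which the points are examined.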

k-means Clustering : Procedure (1)

Initialization 1: specify the number of clusters k, for example k = 4

[Figure: expression matrix (dimensions labeled 5000 and 2) plotted as points; each point is called a "gene"]


k-means Clustering : Procedure (2)

Initialization 2: genes are randomly assigned to one of k clusters

k-means Clustering : Procedure (3)

Calculate the mean of each cluster

[Figure: points labeled (6,7), (3,4), (3,2), (1,2); a cluster mean is formed from a sum of the form [(6,7) + (3,4) + …]]

k-means Clustering : Procedure (4)

Each gene is reassigned to the nearest cluster (gene i to cluster c)


k-means Clustering : Procedure (5)

Iterate until the means have converged

k-means clustering : application
  • 6220 yeast genes
  • 15 time points during cell cycle
  • Result: 13 of 30 clusters had statistical significance for each biological function

[Figure: cluster profiles labeled M/G1 phase, G1 phase, M phase]

S. Tavazoie & GM Church, Nature Genetics, Vol. 22, July 1999


Computation Time and Memory Requirement (n genes and m experiments)

Computation time:
  • Hierarchical clustering: $O(m n^2 \log n)$
  • K-means clustering: $O(k t m n)$, where t is the number of iterations

Memory requirement:
  • Hierarchical clustering: $O(mn + n^2)$
  • K-means clustering: $O(mn + kn)$
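
To get a feel for the memory terms, the sketch below plugs in the sizes from the yeast application slide (6220 genes, 15 time points, 30 clusters). Big-O notation hides constant factors, so these are only counts of stored values implied by the stated terms, not measured memory use.

```python
def hierarchical_memory(n, m):
    """Stored values implied by O(mn + n^2): the data plus an n x n matrix."""
    return m * n + n * n

def kmeans_memory(n, m, k):
    """Stored values implied by O(mn + kn), as stated in the slides."""
    return m * n + k * n

# Sizes taken from the yeast cell-cycle application slide.
n, m, k = 6220, 15, 30

hier = hierarchical_memory(n, m)   # 38,781,700 values
km = kmeans_memory(n, m, k)        # 279,900 values
```

The n x n matrix dominates the hierarchical figure, which is why the gap widens quickly as the number of genes grows.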
Issues in Cluster Analysis
  • A lot of clustering algorithms
  • A lot of distance/similarity metrics
  • Which clustering algorithm runs faster and uses less memory?
  • How many clusters after all?
  • Are the clusters stable?
  • Are the clusters meaningful?
Pattern Recognition
  • Clarification of decision making processes and automating them using computers
  • Supervised
    • known number of classes
    • based on a training set
    • used to classify future observations
  • Unsupervised
    • unknown number of classes
    • no prior knowledge
    • cluster analysis = one form