

Microarray data analysis

Jeremy Glasner

Genetics 875

November 29, 2007

Why cluster data?
  • Too much information (TMI): can’t “see” patterns in the raw data
  • Reduce complexity in data sets
  • Allow “visualization” of complex data
Preliminary questions you need to ask before you start clustering
  • What genes and experiments to cluster?
  • What normalization, standardization, or transformation should be applied to data?
  • What distance function should be used?
  • What clustering method should be used?

Cluster differentially expressed genes or all genes?

Determine which changes are significant:

Fixed cutoff (fold-change>4)
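A fixed cutoff can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and treating a 4-fold decrease the same as a 4-fold increase is an assumption that the slide does not state.

```python
import math

def log2_fold_change(treated, control):
    """Log2 ratio of two positive expression values."""
    return math.log2(treated / control)

def passes_cutoff(treated, control, min_fold=4.0):
    """True if expression changed by more than min_fold in either direction.

    Assumption: up- and down-regulation are treated symmetrically.
    """
    return abs(log2_fold_change(treated, control)) > math.log2(min_fold)

print(passes_cutoff(800.0, 100.0))  # 8-fold increase
print(passes_cutoff(300.0, 100.0))  # 3-fold increase
```

Working on the log2 scale makes increases and decreases of the same magnitude comparable.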

Replication allows assessment of variability

Common statistics such as the t-test are often used for gene expression data. The significance of the test statistic is then determined by referring to the t distribution. This assumes that the data are normally distributed, which may not be true.
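As a rough illustration of the statistic involved, the sketch below computes a t statistic for one gene measured in two groups of replicates. It uses Welch's unequal-variance form, which is one common variant and not necessarily the one intended in the slides; the function name and the sample values are made up. Looking up the p-value in the t distribution is left out.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical log2 intensities for one gene: three replicates per condition
t, df = welch_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(t, df)
```

Each group needs at least two replicates, and the variances must not both be zero, or the calculation fails.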

Gene expression experiments may require thousands of statistical tests, and significance thresholds should be adjusted to reflect this. The standard Bonferroni correction multiplies each p-value by the number of tests (capping the result at 1), but it is likely too conservative.
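The Bonferroni adjustment itself is simple arithmetic. A minimal sketch, with a hypothetical function name and made-up p-values:

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Three tests, so each p-value is multiplied by 3
print(bonferroni([0.01, 0.04, 0.5]))
```

With thousands of genes the multiplier becomes very large, which is why the correction tends to be too conservative.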

Principal Components Analysis (PCA, a.k.a. SVD)

Definition: Principal Components - A set of variables, each defining a projection that captures the maximum amount of remaining variation in a dataset and is orthogonal (and therefore uncorrelated) to all previous principal components of the same dataset.

  • With 1000 genes and 10 experiments we have either 1000 data points in 10-dimensional space or 10 data points in 1000-dimensional space
  • The data, though clumped around several central points in that hyperspace, will generally tend towards one direction. If one were to draw a straight line that best describes that direction, then that line is the first principal component (PC).
  • Any variation that is not captured by that first PC is captured by subsequent orthogonal PCs.
  • Singular Value Decomposition (SVD) is a matrix factorization commonly used to compute PCA: the SVD of the mean-centered data matrix yields the same components as an eigendecomposition of the data's covariance matrix.
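To make the idea of a "direction of greatest variation" concrete, the sketch below finds the first principal component of a small dataset. It uses power iteration on the covariance matrix, a simple stand-in chosen for brevity; real analyses would normally use a library SVD routine. The function name and the data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Unit vector along the direction of greatest variance.

    rows: list of equal-length numeric lists (one row per data point).
    Limitation of this sketch: the all-ones start vector must not be
    orthogonal to the true component, and the data must not be constant.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Four points lying exactly on the line y = 2x
pc = first_principal_component([[0, 0], [1, 2], [2, 4], [3, 6]])
print(pc)  # points along (1, 2), scaled to unit length
```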



Pattern discovery - assign objects to classes

Unsupervised learning - The classes are unknown a priori and need to be “discovered” from the data, e.g. cluster analysis, class discovery, unsupervised pattern recognition.

Supervised learning - The classes are predefined and the task is to understand the basis for the classification from a set of labeled objects. This information is then used to classify future observations, e.g. classification, discriminant analysis, class prediction, supervised pattern recognition.

From: Eisen MB, Spellman PT, Brown PO and Botstein D. (1998). Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A 95, 14863-8.

Different distance measures
  • Euclidean distance- takes into account both the direction and the magnitude of the vectors
  • Manhattan distance- the sum of the absolute differences along each axis; distance is measured only along directions parallel to the axes, so there are no diagonal moves
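The two measures above differ only in how the per-coordinate differences are combined. A minimal sketch, with hypothetical function names:

```python
import math

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(u, v))

print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7
```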
More distance metrics
  • Correlation distance
  • Chebychev distance
  • Angle between vectors
  • Squared Euclidean distance
  • Standardized Euclidean distance
  • Mahalanobis distance
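Three of the metrics listed above are sketched here for illustration. The function names are hypothetical, and the correlation distance is taken to be one minus the Pearson correlation, which is a common convention but not the only one.

```python
import math

def chebyshev(u, v):
    """Largest absolute difference along any single axis."""
    return max(abs(a - b) for a, b in zip(u, v))

def correlation_distance(u, v):
    """One minus the Pearson correlation (0 = same shape, 2 = opposite)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return 1.0 - num / den

def angle(u, v):
    """Angle between two vectors, in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point overshoot
    return math.acos(max(-1.0, min(1.0, dot / norms)))

# Profiles with the same shape but different magnitude are "close"
print(correlation_distance([1, 2, 3], [2, 4, 6]))
```

Correlation distance ignores magnitude, so it groups genes whose expression varies in the same way even when their absolute levels differ.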

Differentially expressed genes varying in the same way


Hierarchical clustering of expression data

From Eisen et al., PNAS 95:14863


K-means Clustering

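K-means clustering starts from an initial assignment of items to K clusters and then proceeds by repeated application of a three-step process where:

1) the mean vector for all items in each cluster is computed

2) items are reassigned to the cluster whose center is closest to the item

3) steps 1 and 2 are repeated until no item changes cluster or the maximum number of cycles is reached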

The parameters controlling k-means clustering are:

1) the number of clusters (K)

2) the maximum number of cycles
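The procedure and its two parameters can be sketched as follows. This is a minimal illustration, not a production implementation: seeding the centers with the first K items is an assumption made to keep the example deterministic (random seeding is more usual), and the function names are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(items, k, max_cycles=100):
    """Return (labels, centers) for a list of numeric vectors."""
    centers = [list(p) for p in items[:k]]  # assumption: seed with first k items
    labels = None
    for _ in range(max_cycles):
        # Reassign each item to the cluster whose center is closest
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in items]
        if new_labels == labels:
            break  # no item changed cluster
        labels = new_labels
        # Recompute the mean vector of each non-empty cluster
        for c in range(k):
            members = [p for p, lab in zip(items, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

labels, centers = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```

The result depends on the starting centers, so running the procedure several times with different seeds is a common precaution.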


Figure: Cluster visualized as a line graph of expression profiles. Labels: acetate utilization, prp genes, fatty acid oxidation. Y-axis: log2 signal intensity. X-axis: log phase, 1 hr, 2 hr, 3 hr, 6 hr, 10 hr.


Self-Organizing Maps

Figure From: Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 1999 Mar 16;96(6):2907-12.

3-D Topo map of gene expression patterns

Kim et al., A Gene Expression Map for Caenorhabditis elegans Science 14 September 2001

  • Caenorhabditis elegans gene expression terrain map created by VxInsight at lowest resolution, showing three-dimensional representation of 44 gene mountains derived from 553 microarray hybridizations and consisting of 17,661 genes
  • The map shows correlations between gene expression profiles as distances in two dimensions and gene density in the third dimension
Heat maps of mountains

Development 130, 1621-1634 (2003)


Combining results from different methods

David J. Lockhart & Elizabeth A. Winzeler. NATURE VOL 405 15 JUNE 2000


Mapping expression data onto metabolic pathways




Machine Learning

A form of artificial intelligence that is used to classify objects into known groups.

For example:

Given a set of patients with a disease and a collection of gene expression profiles, we could train a model on the known cases and use it to predict the disease state of samples where it is unknown.

Training examples are essential for these methods.
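The train-then-predict pattern can be illustrated with a nearest-centroid classifier. This is one of the simplest possible models and is chosen only for brevity; it is not the method used in the slides, and the function names, labels, and expression values are all made up.

```python
def train_centroids(profiles, labels):
    """Average the training profiles of each class into one centroid."""
    groups = {}
    for profile, label in zip(profiles, labels):
        groups.setdefault(label, []).append(profile)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def predict(centroids, profile):
    """Return the class whose centroid is closest (squared Euclidean)."""
    return min(centroids,
               key=lambda label: sum((a - b) ** 2
                                     for a, b in zip(profile, centroids[label])))

# Hypothetical two-gene expression profiles with known status
training = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]]
status = ["healthy", "healthy", "disease", "disease"]
model = train_centroids(training, status)
print(predict(model, [4.5, 5.0]))  # disease
```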

Machine learning to predict regulatory states of genes


General strategy for machine learning


A decision tree



Transcription factor binding site identification by gene expression analysis

Typically, expression is examined in a mutant that under- or overproduces a transcriptional regulator.

Potential targets of the regulator are identified by finding significant differences in gene expression between the mutant and wild-type.

Upstream regions of these genes are searched for over-represented sequences (motifs), usually using a Gibbs sampling approach.

Once motifs are identified, a matrix describing the motif can be used to search the genome for additional potential sites.
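The final step, searching with a motif matrix, can be sketched as below. The example builds a log-odds position weight matrix from aligned sites and slides it along a sequence. The motif discovery step (Gibbs sampling) is not shown. The function names, the pseudocount of 1, the uniform background of 0.25, the example sites, and the score threshold are all assumptions for illustration.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        pwm.append({base: math.log2(((column.count(base) + pseudocount) / total)
                                    / background)
                    for base in "ACGT"})
    return pwm

def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring >= threshold.

    Assumes the sequence contains only the letters A, C, G and T.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical aligned binding sites and a short sequence to search
pwm = build_pwm(["TATAAT", "TATAAT", "TACAAT", "TATACT"])
for start, window, score in scan("GGCTATAATGCC", pwm, threshold=5.0):
    print(start, window, round(score, 2))
```

In practice both strands would be scanned, and the threshold would be chosen from the score distribution of known sites.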