By **ronda**

## Microarray data analysis


Why cluster data?

- Too much information (TMI): patterns can’t be “seen” in the raw data
- Reduce complexity in data sets
- Allow “visualization” of complex data

Preliminary questions you need to ask before you start clustering

- What genes and experiments to cluster?
- What normalization, standardization, or transformation should be applied to data?
- What distance function should be used?
- What clustering method should be used?

Cluster differentially expressed genes or all genes?

Determine which changes are significant:

Fixed cutoff (fold-change>4)

Replication allows assessment of variability

Common statistics such as the t-test are often used for gene expression data. The significance of the resulting t statistic is then determined by referring to the t distribution. This assumes that the data are normally distributed, which may not be true.

Gene expression experiments may require thousands of statistical tests, and significance should be adjusted to reflect this. A standard Bonferroni correction multiplies each p-value by the number of tests (capping the result at 1), but it is likely too conservative.
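
As a rough illustration of the fixed fold-change cutoff and the Bonferroni correction described above, here is a minimal Python sketch. The function names, the example intensities, and the example p-values are hypothetical and not taken from the presentation; the p-values are assumed to come from some earlier per-gene test.

```python
import math

def log2_fold_change(treated, control):
    """Log2 ratio of the mean treated signal to the mean control signal."""
    mean_treated = sum(treated) / len(treated)
    mean_control = sum(control) / len(control)
    return math.log2(mean_treated / mean_control)

def passes_fold_cutoff(treated, control, cutoff=4.0):
    """True if the gene changes more than `cutoff`-fold, up or down."""
    return abs(log2_fold_change(treated, control)) > math.log2(cutoff)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical raw p-values, one per gene.
raw_p = [0.00001, 0.004, 0.03, 0.5]
adjusted = bonferroni(raw_p)
for p, q in zip(raw_p, adjusted):
    print(f"raw p = {p:g}, Bonferroni-adjusted p = {q:g}")
```

With thousands of genes the multiplier becomes very large, which is why the correction is often considered too conservative for microarray data.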

Principal Components Analysis (PCA, closely related to SVD)

Definition: Principal Components - A set of variables, each defining a projection that captures the maximum remaining variation in a dataset and is orthogonal (and therefore uncorrelated) to all previous principal components of the same dataset.

- With 1000 genes and 10 experiments we have either 1000 data points in 10-dimensional space or 10 data points in 1000-dimensional space
- The data, though clumped around several central points in that hyperspace, will generally tend towards one direction. If one were to draw a straight line that best describes that direction, then that line is the first principal component (PC).
- Any variation that is not captured by that first PC is captured by subsequent orthogonal PCs.
- Singular Value Decomposition (SVD) is a standard way to compute PCA: the SVD of the mean-centered data matrix yields the same components as the eigendecomposition of the covariance matrix of the data.
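
The link between PCA and SVD can be written out briefly. The notation here is an assumption for illustration, not from the slides: $X$ is the mean-centered data matrix with $n$ rows (observations) and one column per variable, $C$ is its covariance matrix, and $X = U \Sigma V^{\top}$ is its SVD.

$$
C = \frac{1}{n-1} X^{\top} X
  = \frac{1}{n-1} V \Sigma U^{\top} U \Sigma V^{\top}
  = V \, \frac{\Sigma^{2}}{n-1} \, V^{\top}
$$

Under these assumptions the columns of $V$ are the principal component directions, and the variance captured by the $k$-th component is $\sigma_k^{2} / (n-1)$, where $\sigma_k$ is the $k$-th singular value.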

http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/PCA_1.htm

Pattern discovery - assign objects to classes

Unsupervised learning - The classes are unknown a priori and need to be “discovered” from the data, e.g. cluster analysis, class discovery, unsupervised pattern recognition

Supervised learning - The classes are predefined and the task is to understand the basis for the classification from a set of labeled objects. This information is then used to classify future observations, e.g. classification, discriminant analysis, class prediction, supervised pattern recognition

From: Eisen MB, Spellman PT, Brown PO and Botstein D. (1998). Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A 95, 14863-8.

Different distance measures

- Euclidean distance - takes into account both the direction and the magnitude of the vectors
- Manhattan distance - distance measured only along directions parallel to the coordinate axes (no diagonal moves), i.e. the sum of the absolute differences in each dimension
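
The two distances above can be sketched in a few lines of Python. The function names and the example expression profiles are hypothetical, chosen only to make the difference visible.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each axis (no diagonal moves)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two hypothetical genes measured in three experiments.
gene_a = [1.0, 2.0, 3.0]
gene_b = [4.0, 6.0, 3.0]

print(euclidean(gene_a, gene_b))  # 5.0
print(manhattan(gene_a, gene_b))  # 7.0
```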

More distance metrics

- Correlation distance
- Chebyshev distance
- Angle between vectors
- Squared Euclidean distance
- Standardized Euclidean distance
- Mahalanobis distance
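
A few of the metrics listed above can be sketched in plain Python. This is a minimal illustration, assuming profiles of equal length that are neither constant nor all zero; the function names and example profiles are hypothetical. Correlation distance is taken here as one minus the Pearson correlation, which is one common convention.

```python
import math

def pearson(a, b):
    """Pearson correlation; assumes neither profile is constant."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_distance(a, b):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated."""
    return 1.0 - pearson(a, b)

def chebyshev(a, b):
    """Largest absolute difference in any single dimension."""
    return max(abs(x - y) for x, y in zip(a, b))

def angle(a, b):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard rounding
    return math.acos(cosine)

# Same shape of response, very different magnitude.
profile_a = [1.0, 2.0, 3.0]
profile_b = [10.0, 20.0, 30.0]
print(correlation_distance(profile_a, profile_b))
print(chebyshev(profile_a, profile_b))
```

Correlation-based measures treat these two profiles as near-identical because they vary in the same way, while magnitude-sensitive measures such as Chebyshev or Euclidean distance place them far apart.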

Differentially expressed genes varying in the same way

Hierarchical clustering of expression data

From Eisen et al., PNAS 95:14863

K-means clustering proceeds by repeated application of a three-step process where:

1) the mean vector for all items in each cluster is computed

2) items are reassigned to the cluster whose center is closest to the item

3) steps 1 and 2 are repeated until no item changes cluster or the maximum number of cycles is reached

The parameters controlling k-means clustering are:

1) the number of clusters (K)

2) the maximum number of cycles
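
The loop described above might be sketched as follows. This is a minimal illustration, not the implementation used in any particular package. The initialization (evenly spaced items as starting centers) is a simplifying assumption made here so that the example is deterministic; real tools typically start from a random assignment. The names `kmeans`, `k`, and `max_cycles` are hypothetical.

```python
import math

def kmeans(items, k, max_cycles=100):
    """Return (assignment, centers) for a list of equal-length vectors."""
    # Assumption: start from k evenly spaced items as initial centers.
    centers = [list(items[i * len(items) // k]) for i in range(k)]
    assignment = None
    for _ in range(max_cycles):
        # Reassign each item to the cluster whose center is closest.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(item, centers[j]))
            for item in items
        ]
        if new_assignment == assignment:
            break  # no item changed cluster, so the process has converged
        assignment = new_assignment
        # Recompute the mean vector of all items in each cluster.
        for j in range(k):
            members = [it for it, a in zip(items, assignment) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Six hypothetical two-experiment profiles forming two obvious groups.
profiles = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centers = kmeans(profiles, k=2)
print(assignment)
```

The two parameters named above appear directly as `k` and `max_cycles`; the result can depend on the starting centers, so in practice the procedure is often run several times.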

Cluster visualized as a line graph of expression profiles

[Figure: line graph of Log2 signal intensity across the time points log phase, 1 hr, 2 hr, 3 hr, 6 hr, and 10 hr; plot labels: prp genes, propionate, fad, fatty acid oxidation]

Figure from: Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR. (1999). Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 96(6), 2907-12.

3-D Topo map of gene expression patterns

Kim et al., A Gene Expression Map for Caenorhabditis elegans Science 14 September 2001

- Caenorhabditis elegans gene expression terrain map created by VxInsight at lowest resolution, showing three-dimensional representation of 44 gene mountains derived from 553 microarray hybridizations and consisting of 17,661 genes
- Correlations of gene expression profiles are represented as distances in two dimensions, and gene density is represented in the third dimension

Heat maps of mountains

Development 130, 1621-1634 (2003)

Combining results from different methods

David J. Lockhart & Elizabeth A. Winzeler. NATURE VOL 405 15 JUNE 2000

Mapping expression data onto metabolic pathways

http://www.genome.ad.jp/kegg/

http://biocyc.org:1555/ECOLI/expression.html

Supervised machine learning: a form of artificial intelligence that is used to classify objects into known groups.

For example:

Given a set of patients with a disease and a collection of gene expression profiles, we could train a model on the known cases and then use that model to predict the disease in samples where the status is unknown.

Training examples are essential for these methods.
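
One of the simplest ways to illustrate the train-then-predict idea is a nearest-centroid classifier. It is used here only as an example technique and is not claimed to be the method behind the slides. The labels, profiles, and function names are all hypothetical.

```python
import math

def train_centroids(profiles, labels):
    """Average the training profiles of each class into one centroid."""
    groups = {}
    for profile, label in zip(profiles, labels):
        groups.setdefault(label, []).append(profile)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in groups.items()
    }

def predict(centroids, profile):
    """Assign a new profile to the class with the nearest centroid."""
    return min(centroids, key=lambda label: math.dist(profile, centroids[label]))

# Hypothetical training set: three genes measured in four labeled patients.
train_profiles = [[8, 1, 5], [9, 2, 6], [2, 7, 5], [1, 8, 4]]
train_labels = ["disease", "disease", "healthy", "healthy"]

model = train_centroids(train_profiles, train_labels)
print(predict(model, [7.5, 2, 5]))  # profile of an undiagnosed sample
print(predict(model, [2, 6, 5]))
```

Without the labeled training profiles there would be nothing to average, which is why training examples are essential for this kind of method.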

Machine learning to predict regulatory states of genes

http://www.ebi.ac.uk/microarray/Research/networks/reconstruction.html#

General strategy for machine learning

http://www.ebi.ac.uk/microarray/Research/networks/reconstruction.html#

A decision tree

http://www.ebi.ac.uk/microarray/Research/networks/reconstruction.html#

Transcription factor binding site identification by gene expression analysis

Typically, expression is examined in a mutant that under- or overproduces a transcriptional regulator.

Potential targets of the regulator are identified by finding significant differences in gene expression between the mutant and wild-type.

Upstream regions of the candidate target genes are searched for over-represented sequences (motifs), usually using a Gibbs sampling approach.

Once motifs are identified, a matrix describing the motif can be used to search the genome for additional potential sites.
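
The final step, scanning a sequence with a matrix that describes the motif, can be sketched as below. This is an illustrative log-odds scoring scheme under stated assumptions: a uniform background of 0.25 per base, a pseudocount of 1, sequences containing only A, C, G, and T, and a single strand. The example sites, sequence, threshold, and function names are hypothetical, and the motif discovery step itself (e.g. Gibbs sampling) is not shown.

```python
import math

def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Build a per-position log2-odds matrix from aligned binding sites."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical aligned sites, as might come out of a motif-finding step.
known_sites = ["TGACA", "TGACA", "TGACT", "TGTCA"]
pwm = build_log_odds(known_sites)

hits = scan("AATGACAGGCTGTCACC", pwm, threshold=4.0)
for start, window, score in hits:
    print(start, window, round(score, 2))
```

A genome-wide search would also scan the reverse strand and would need a threshold chosen with the expected number of chance matches in mind.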
