Microarray data analysis
Jeremy Glasner
Genetics 875
November 29, 2007
Why cluster data?
TMI: can’t “see” patterns in data
Reduce complexity in data sets
Allow “visualization” of complex data
Preliminary questions you need to ask before you start clustering.
Determine which changes are significant:
Fixed cutoff (fold-change>4)
Replication allows assessment of variability
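A minimal sketch of the fixed-cutoff approach, assuming raw (not log-transformed) positive intensities for one control and one treated sample. The function name, the gene names, and the choice to flag both increases and decreases are illustrative assumptions, not taken from the slides.

```python
import math

def genes_passing_fold_cutoff(control, treated, min_fold=4.0):
    """Return {gene: log2 ratio} for genes whose fold-change exceeds min_fold.

    control, treated: dicts mapping gene name -> positive raw intensity.
    Assumption: a change in either direction (up or down) counts.
    """
    log_cutoff = math.log2(min_fold)  # fold-change > 4 means |log2 ratio| > 2
    passing = {}
    for gene, control_value in control.items():
        log_ratio = math.log2(treated[gene] / control_value)
        if abs(log_ratio) > log_cutoff:
            passing[gene] = log_ratio
    return passing

# Hypothetical intensities for three made-up genes.
control = {"geneA": 100.0, "geneB": 100.0, "geneC": 100.0}
treated = {"geneA": 800.0, "geneB": 150.0, "geneC": 10.0}
passing = genes_passing_fold_cutoff(control, treated)
```

A single fixed cutoff ignores how noisy each gene is, which is why replication matters.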
Common statistics such as the t-test are often used for gene expression data. The significance of the resulting t statistic is then determined by referring to the t distribution. This assumes that the data are normally distributed, which may not be true.
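As an illustration, the t statistic for one gene measured in two groups of replicates could be computed as sketched below. The slides do not say which variant of the t-test is meant; this sketch assumes the classic two-sample Student's t-test with pooled variance, and the function name and data are made up. Converting the statistic to a p-value requires the t distribution with the returned degrees of freedom (from a table or a statistics package), which is not done here.

```python
import math
import statistics

def pooled_t_statistic(group_a, group_b):
    """Two-sample Student's t statistic (equal variances assumed).

    Returns (t, degrees_of_freedom). Each group needs at least 2 replicates.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    df = n_a + n_b - 2
    # Pooled estimate of the common variance across both groups.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_error = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / std_error, df

# Hypothetical log2 intensities for one gene, five replicates per condition.
t_value, df = pooled_t_statistic([1.0, 2.0, 3.0, 4.0, 5.0],
                                 [2.0, 3.0, 4.0, 5.0, 6.0])
```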
Gene expression experiments may require thousands of statistical tests, and significance should be adjusted to reflect this. A standard Bonferroni correction multiplies each p-value by the number of tests, but it is likely too conservative.
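The Bonferroni adjustment described above can be sketched in a few lines. The function name and the example p-values are illustrative; capping adjusted values at 1.0 is the usual convention so that they remain valid probabilities.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping at 1.0."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical p-values from three tests.
adjusted = bonferroni_adjust([0.01, 0.04, 0.5])
```

With thousands of genes the multiplier becomes very large, so only extremely small raw p-values survive, which is the sense in which the correction is conservative.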
Definition: Principal Components - A set of variables, each defining a projection that captures the maximum amount of remaining variation in a dataset and is orthogonal (and therefore uncorrelated) to the previous principal components of the same dataset.
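To make the definition concrete, here is a rough sketch that finds only the first principal component, the direction of maximum variance, by power iteration on the sample covariance matrix. Power iteration is a stand-in chosen because it needs no external libraries; real analyses normally use a linear algebra package. The function name, starting vector, and iteration count are assumptions.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return a unit vector along the direction of maximum variance.

    rows: list of equal-length numeric lists (one row per sample).
    The sign of the returned vector is arbitrary.
    """
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Arbitrary starting vector; it must not be orthogonal to the answer.
    vec = [1.0 / (j + 1) for j in range(dims)]
    for _ in range(iterations):
        product = [sum(cov[i][j] * vec[j] for j in range(dims))
                   for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:  # degenerate case: no variance along this start
            break
        vec = [x / norm for x in product]
    return vec

# Toy data lying exactly on the line y = x.
component = first_principal_component([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```

For the toy data all variation lies along the diagonal, so the two entries of the component come out equal.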
Unsupervised learning - The classes are unknown a priori and need to be “discovered” from the data, e.g. cluster analysis, class discovery, unsupervised pattern recognition
Supervised learning - The classes are predefined and the task is to understand the basis for the classification from a set of labeled objects. This information is then used to classify future observations, e.g. classification, discriminant analysis, class prediction, supervised pattern recognition
From: Eisen MB, Spellman PT, Brown PO and Botstein D. (1998). Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A 95, 14863-8.
Differentially expressed genes varying in the same way
From Eisen et al., PNAS 95:14863
K-means clustering proceeds by repeated application of a two-step process where:
1) the mean vector for all items in each cluster is computed
2) items are reassigned to the cluster whose center is closest to the item
The parameters controlling k-means clustering are:
1) the number of clusters (K)
2) the maximum number of cycles
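The procedure and its two parameters can be sketched as follows. This is an illustrative implementation, not the code behind any particular clustering program: the function names, the use of squared Euclidean distance, the random choice of K items as starting centers, and stopping early when assignments no longer change are all assumptions.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(items, k, max_cycles=100, seed=0):
    """Cluster items (lists of numbers) into k groups.

    Returns (assignment, centers), where assignment[i] is the cluster
    index of items[i].
    """
    rng = random.Random(seed)
    # Assumption: start from k distinct items picked at random.
    centers = [list(item) for item in rng.sample(items, k)]
    assignment = None
    for _ in range(max_cycles):
        # Reassign each item to the cluster whose center is closest.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(item, centers[c]))
            for item in items
        ]
        if new_assignment == assignment:
            break  # nothing moved, so further cycles change nothing
        assignment = new_assignment
        # Recompute the mean vector of every non-empty cluster.
        for c in range(k):
            members = [it for it, a in zip(items, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Toy data: two well-separated groups of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
assignment, centers = kmeans(points, k=2)
```

Because the starting centers are random, different seeds can give different clusterings on real data, and K itself has to be chosen by the analyst.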
Figure: Cluster visualized as a line graph of expression profiles (label: fatty acid oxidation). Y axis: log2 signal intensity. X axis: log phase, 1hr, 2hr, 3hr, 6hr, 10hr.
Figure From: Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 1999 Mar 16;96(6):2907-12.
Kim et al., A Gene Expression Map for Caenorhabditis elegans Science 14 September 2001
Development 130, 1621-1634 (2003)
David J. Lockhart & Elizabeth A. Winzeler. NATURE VOL 405 15 JUNE 2000
A form of artificial intelligence that is used to classify objects into known groups.
Given a set of patients with a disease and a collection of gene expression profiles, we could train a model on the known cases and then use it to predict the disease in samples where the status is unknown.
Training examples are essential for these methods.
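The train-then-predict idea can be illustrated with a very simple classifier. The slides do not name a specific method here; the nearest-centroid rule below is chosen only because it is short, and the function names, labels, and expression values are all made up.

```python
def train_centroids(profiles, labels):
    """Average the training profiles of each class into one centroid."""
    grouped = {}
    for profile, label in zip(profiles, labels):
        grouped.setdefault(label, []).append(profile)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in grouped.items()
    }

def predict(centroids, profile):
    """Return the label whose centroid is closest to the new profile."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: squared_distance(profile, centroids[lab]))

# Hypothetical two-gene expression profiles with known status.
training_profiles = [[8.0, 2.0], [7.5, 2.5], [2.0, 8.0], [2.5, 7.0]]
training_labels = ["disease", "disease", "healthy", "healthy"]
model = train_centroids(training_profiles, training_labels)
prediction = predict(model, [7.0, 3.0])
```

Without the labeled training profiles there would be nothing to average, which is the sense in which training examples are essential.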
Transcription factor binding site identification by gene expression analysis
Typically examine expression in a mutant that under- or overproduces a transcriptional regulator.
Potential targets of the regulator are identified by finding significant differences in gene expression between the mutant and wild-type.
Upstream regions of the potential target genes are searched for over-represented sequences (motifs), usually using a Gibbs sampling approach.
Once motifs are identified, a matrix describing the motif can be used to search the genome for additional potential sites.
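The final step, scanning with a motif matrix, might look like the sketch below. It assumes the motif is given as a set of aligned, equal-length example sites, builds a log-odds position weight matrix with a pseudocount and a uniform background, and scores every window of a sequence. The function names, the example sites, the pseudocount, and the background frequency are illustrative assumptions; motif discovery itself (for example by Gibbs sampling) is not shown.

```python
import math

def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from aligned sites (A/C/G/T only)."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix

def scan_sequence(sequence, matrix, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical aligned binding sites and a short sequence to search.
example_sites = ["TGACA", "TGACA", "TGACT", "TGTCA"]
weights = build_weight_matrix(example_sites)
all_windows = scan_sequence("CCTGACAGG", weights, threshold=float("-inf"))
best_hit = max(all_windows, key=lambda hit: hit[2])
```

In practice the threshold decides how many candidate sites are reported, and a genome-wide scan would also need to cover the reverse strand.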