Microarray analysis. Quantitation of Gene Expression Expression Data to Networks. Reading: Ch 16. BIO520 Bioinformatics Jim Lund. Microarray data. Image quantitation. Normalization Find genes with significant expression differences Annotation
Microarray analysis Quantitation of Gene Expression Expression Data to Networks Reading: Ch 16 BIO520 Bioinformatics Jim Lund
Microarray data • Image quantitation. • Normalization • Find genes with significant expression differences • Annotation • Clustering, pattern analysis, network analysis
Sources of Non-Biological Variation • Dye bias: differences in heat and light sensitivity, efficiency of dye incorporation • Differences in the amount of labeled cDNA hybridized to each channel in a microarray experiment (Channel is used to refer to a combination of a dye and a slide.) • Variation across replicate slides • Variation across hybridization conditions • Variation in scanning conditions • Variation among technicians doing the lab work.
Factors which impact on the signal level • Amount of mRNA • Labeling efficiencies • Quality of the RNA • Laser/dye combination • Detection efficiency of photomultiplier or CCD
[Figures: HeLa vs. HepG2 expression data]
M vs. A Plot • $M = \log(\text{Red}) - \log(\text{Green})$ • $A = \frac{\log(\text{Green}) + \log(\text{Red})}{2}$
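As a minimal sketch of the M and A definitions above, the following computes both values per spot. The slide does not state a log base; base 2 is assumed here, and the intensities are made-up illustrative numbers.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from red and green channel intensities.

    Assumes base-2 logs and positive, background-corrected intensities.
    """
    log_r = math.log2(red)
    log_g = math.log2(green)
    m = log_r - log_g          # M: log ratio of the two channels
    a = (log_g + log_r) / 2    # A: average log intensity
    return m, a

# Hypothetical (red, green) intensities for three spots.
spots = [(1200.0, 300.0), (500.0, 500.0), (80.0, 320.0)]
for red, green in spots:
    m, a = ma_values(red, green)
    print(f"R={red:7.1f}  G={green:7.1f}  M={m:+.2f}  A={a:.2f}")
```

A spot with equal red and green signal has M = 0, so after normalization the bulk of the points on an M vs. A plot is expected to scatter around the horizontal line M = 0.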
Types of normalization • To total signal (linear normalization) • LOESS (LOcally WEighted polynomial regreSSion) • To “housekeeping genes” • To genomic DNA spots (Research Genetics) or mixed cDNAs • To internal spikes
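The simplest option in the list, normalization to total signal, can be sketched as a single scale factor applied to one channel. The function name and the intensities below are hypothetical, and rescaling the green channel rather than the red one is an arbitrary choice made for illustration.

```python
def normalize_to_total_signal(red, green):
    """Scale the green channel so both channels have the same total signal.

    Returns the rescaled green intensities and the scale factor used.
    """
    factor = sum(red) / sum(green)
    return [g * factor for g in green], factor

# Hypothetical intensities: the green channel is uniformly half as bright.
red = [1200.0, 500.0, 80.0, 220.0]
green = [600.0, 250.0, 40.0, 110.0]

green_scaled, factor = normalize_to_total_signal(red, green)
print("scale factor:", factor)
print("rescaled green:", green_scaled)
```

A single factor only corrects an overall brightness difference. An intensity-dependent dye bias, which shows up as a curved trend on an M vs. A plot, is the case that LOESS normalization is meant to handle.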
Microarray analysis • Data exploration: expression of gene X? • Statistical analysis: which genes show large, reproducible changes? • Clustering: grouping genes by expression pattern. • Knowledge-based analysis: Are amine synthesis genes involved in this experiment?
Fold change: the crudest method of finding differentially expressed genes [Figure: HeLa vs. HepG2, with the regions of >2-fold expression change marked]
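A fold-change filter can be sketched in a few lines. The gene names and expression values are invented for illustration, and the 2-fold cutoff follows the slide; note that the filter uses no replicate information at all, which is why it is the crudest method.

```python
import math

def fold_change_calls(expr_a, expr_b, threshold=2.0):
    """Label genes whose expression changes by more than `threshold`-fold.

    expr_a and expr_b map gene name -> expression level in each sample.
    Genes below the cutoff are left out of the result.
    """
    cutoff = math.log2(threshold)
    calls = {}
    for gene in expr_a:
        log_ratio = math.log2(expr_b[gene] / expr_a[gene])
        if log_ratio > cutoff:
            calls[gene] = "up"
        elif log_ratio < -cutoff:
            calls[gene] = "down"
    return calls

# Hypothetical expression levels in two samples.
sample_a = {"geneA": 200.0, "geneB": 500.0, "geneC": 900.0}
sample_b = {"geneA": 800.0, "geneB": 450.0, "geneC": 300.0}

calls = fold_change_calls(sample_a, sample_b)
print(calls)
```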
What do we mean by differentially expressed? • Statistically, our gene is different from the other genes. [Figure: distribution of average ratios for all genes (x-axis: log ratio; y-axis: number of genes), compared with the distribution of measurements for the gene of interest (probability of a given value of the ratio)]
Finding differentially expressed genes: What affects our certainty that a gene is up- or down-regulated? • Number of sample points • Difference in means • Standard deviations of sample [Figure: probe signal in Sample A vs. Sample B]
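The three factors above are exactly the ingredients of a two-sample t statistic. The sketch below uses Welch's unequal-variance form, which is an assumption made here and not something the slides specify; the replicate values are invented.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Two-sample t statistic (Welch form) for one gene.

    Grows with the difference in means and the number of replicates,
    and shrinks as the within-sample variance grows.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    diff = mean(sample_b) - mean(sample_a)
    std_err = math.sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    return diff / std_err

# Same difference in means (1.0 log unit), different replicate scatter.
tight = welch_t([8.0, 8.1, 7.9], [9.0, 9.1, 8.9])
noisy = welch_t([8.0, 9.0, 7.0], [9.0, 10.0, 8.0])
print(f"tight replicates: t = {tight:.2f}")
print(f"noisy replicates: t = {noisy:.2f}")
```

Both genes show the same average change, but only the one with tight replicates gives a large t statistic.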
Practical views on statistics • With appropriate biological replicates, it is possible to select statistically meaningful genes/patterns. • Sensitivity and selectivity are inversely related, e.g. selecting more true positives WILL result in more false positives and fewer false negatives. • False negatives are lost opportunities; false positives cost $’s and waste time. • Even with conservative statistics, a typical set of experiments yields more genes/pathways/patterns than one can sensibly follow, so use conservative statistics to protect against false positives when designing follow-on experiments.
Statistical Tests • Student’s t-test • Correct for multiple testing! (Holm-Bonferroni) • False discovery rate. • Significance Analysis of Microarrays (SAM) • http://www-stat.stanford.edu/~tibs/SAM/ • ANOVA • Principal components analysis • Special methods for periodic patterns in data.
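To make the multiple-testing bullets concrete, here is a minimal sketch of the Holm-Bonferroni step-down procedure next to the Benjamini-Hochberg procedure, one common way to control the false discovery rate. Choosing Benjamini-Hochberg is an assumption, since the slide only says "false discovery rate", and the p-values are invented.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down procedure: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared to alpha/m, the next to alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # stop at the first failure
    return rejected

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical per-gene p-values from a t-test.
p_values = [0.045, 0.001, 0.2, 0.012, 0.028]
holm = holm_bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print("Holm-Bonferroni:   ", holm)
print("Benjamini-Hochberg:", bh)
```

On the same p-values the FDR procedure rejects at least as many hypotheses as Holm-Bonferroni, which reflects the trade-off between false positives and false negatives.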
Volcano plot: log(fold change) vs. p-value [Figure: x-axis log(fold change); y-axis p-value]
Pattern finding • In many cases, the patterns of differential expression are the target (as opposed to specific genes). • Clustering or other approaches for pattern identification: find genes which behave similarly across all experiments, or experiments which behave similarly across all genes. • Classification: identify genes which best distinguish 2 or more classes. • The statistical reliability of the pattern or classifier is still an issue and similar considerations apply, e.g. cluster analysis of random noise will still produce clusters, but they will be meaningless.
What is clustering? • Group similar objects together. • Genes with similar expression patterns. • Objects in the same cluster (group) are more similar to each other than objects in different clusters.
Clustering • What is clustering? • Similarity/distance metrics • Hierarchical clustering algorithms • Made popular by Stanford, e.g. [Eisen et al. 1998] • K-means • Made popular by many groups, e.g. [Tavazoie et al. 1999] • Self-organizing map (SOM) • Made popular by Whitehead, e.g. [Tamayo et al. 1999]
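Of the algorithms listed, K-means is the shortest to write down. The sketch below is a plain Lloyd-style iteration with Euclidean distance; the expression profiles are invented. Passing the starting centroids in explicitly keeps the example deterministic, whereas real tools usually pick them at random.

```python
import math

def kmeans(profiles, centroids, max_iter=100):
    """Cluster expression profiles around the given starting centroids.

    Alternates between assigning each profile to its nearest centroid and
    moving each centroid to the mean of its members, until labels stop changing.
    """
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical profiles over three experiments: two rising, two falling genes.
profiles = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [3.0, 2.0, 1.0],
    [2.9, 1.9, 1.2],
]
labels, centroids = kmeans(profiles, centroids=[profiles[0], profiles[2]])
print(labels)
```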
Typical Tools • SAM (Significance Analysis of Microarrays), Stanford • GeneSpring • Affymetrix GeneChip Operating System (GCOS) • Cluster/Treeview • R statistics package microarray analysis libraries.
How to define similarity? • Similarity metric: • A measure of pairwise similarity or dissimilarity • Examples: • Correlation coefficient • Euclidean distance [Figure: raw matrix (n genes × p experiments) converted into a similarity matrix (n genes × n genes)]
Similarity metrics • Euclidean distance • Correlation coefficient • Euclidean clustering = magnitude & direction • Correlation clustering = direction
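The difference between the two metrics can be seen on a pair of profiles with the same shape but different magnitude. This is a minimal sketch using Pearson correlation (assumed here as the correlation coefficient) and invented profiles.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical profiles: same direction of change, twice the magnitude.
gene_low = [1.0, 2.0, 3.0, 2.0]
gene_high = [2.0, 4.0, 6.0, 4.0]

dist = euclidean_distance(gene_low, gene_high)
corr = pearson_correlation(gene_low, gene_high)
print(f"Euclidean distance: {dist:.3f}")
print(f"Correlation:        {corr:.3f}")
```

Correlation treats the two genes as essentially identical because only the direction of change matters, while Euclidean distance separates them because magnitude matters too.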
Self-organizing maps (SOM) [Kohonen 1995] • Basic idea: • map high dimensional data onto a 2D grid of nodes • Neighboring nodes are more similar than points far away
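The basic idea can be sketched as a small training loop: each grid node holds a weight vector, and every data point pulls its best-matching node and that node's grid neighbors toward itself. The grid size, learning-rate and radius schedules, and the data are all assumptions chosen for illustration, not parameters from the slides.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid node (row, col) whose weight vector is closest to profile x."""
    return min(weights, key=lambda node: math.dist(x, weights[node]))

def train_som(data, rows, cols, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a rows x cols self-organizing map on a list of profiles."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per grid node, initialized from randomly chosen profiles.
    weights = {
        (r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)
    }
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs           # decays from 1 toward 0
        lr = lr0 * frac                       # shrinking learning rate
        radius = max(radius0 * frac, 0.5)     # shrinking neighborhood
        for x in rng.sample(data, len(data)):
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_dist = math.dist(node, bmu)
                influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (x[i] - w[i])
    return weights

# Hypothetical profiles: two rising and two falling genes.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [3.0, 2.0, 1.0],
    [2.9, 1.9, 1.2],
]
weights = train_som(data, rows=2, cols=2)
for profile in data:
    print(profile, "->", best_matching_unit(weights, profile))
```

Because neighbors of the best-matching node are updated together, nearby nodes end up with similar weight vectors, which is what gives the map its topology.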
Things learned from microarray gene expression experiments • Pathways not known to be involved • Ontology? • Novel genes involved in a known pathway • “like” and “unlike” tissues
Transcription Factors: Regulatory Networks • Identify co-regulated genes • Search for common motifs (transcription factor binding sites) • Evaluate known motifs/factors • Search for new ones. • Programs: MEME, etc.
mRNA-protein Correlation • YPD: should have relevant data • will yeast be typical? • Electrophoresis 18:533 • 23 proteins on 2D gels • r = 0.48 for mRNA vs. protein levels • Post-transcriptional and post-translational regulation important!
Other microarray formats • Single nucleotide polymorphism (SNP) chips • Oligos with each of 4 nt at each SNP. • Chromosomal IP chips (ChIP:chip) • Determine transcription factor binding sites • Promoter DNA on the chip. • Alternative splicing chips • Long oligos, covering alternatively spliced exons, or all exons. • Genome tiling chips
ChIP:chip: Identification of Transcription Factor Binding Sites • Cross-link transcription factors to DNA with formaldehyde. • Pull out the transcription factor of interest via immunoprecipitation with an antibody, or by tagging the factor of interest with an isolatable epitope (e.g., GST fusion). • Fractionate the DNA associated with the transcription factor, reverse the cross-links, label, and hybridize to an array of promoter DNA. • Brown et al. (2001) Nature 409:533-8
On to Proteomics: DNA → RNA → Protein