Microarray analysis

Microarray analysis: Quantitation of Gene Expression; Expression Data to Networks • Reading: Ch 16 • BIO520 Bioinformatics • Jim Lund

Microarray data • Image quantitation. • Normalization • Find genes with significant expression differences • Annotation • Clustering, pattern analysis, network analysis

Sources of Non-Biological Variation • Dye bias: differences in heat and light sensitivity, efficiency of dye incorporation • Differences in the amount of labeled cDNA hybridized to each channel in a microarray experiment (Channel is used to refer to a combination of a dye and a slide.) • Variation across replicate slides • Variation across hybridization conditions • Variation in scanning conditions • Variation among technicians doing the lab work.

Factors that affect the signal level • Amount of mRNA • Labeling efficiencies • Quality of the RNA • Laser/dye combination • Detection efficiency of photomultiplier or CCD

[Figure: HeLa vs. HepG2]

M vs. A Plot • M = log(Red) - log(Green) • A = (log(Green) + log(Red)) / 2
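The M and A definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the base-2 logarithm is an assumption (the slide only says "Log"), and the function name `ma_values` and the spot intensities are hypothetical.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities.

    Base-2 logs are an assumption; the slide only says "Log".
    """
    log_r = math.log2(red)
    log_g = math.log2(green)
    m = log_r - log_g          # M: log ratio of the two channels
    a = (log_g + log_r) / 2    # A: average log intensity
    return m, a

# Hypothetical (red, green) intensities, for illustration only.
spots = [(1024.0, 256.0), (512.0, 512.0), (64.0, 256.0)]
for red, green in spots:
    m, a = ma_values(red, green)
    print(f"red={red:7.1f} green={green:7.1f}  M={m:+.2f}  A={a:.2f}")
```

A spot with equal signal in both channels gets M = 0, so an unbiased chip pair should scatter around the horizontal line M = 0 across the whole range of A.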

M vs. A plots of chip pairs: before normalization

M vs. A plots of chip pairs: after quantile normalization
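Quantile normalization forces every chip to share the same intensity distribution. The sketch below is one simple way to do that, assuming each chip is a list of intensities for the same probes in the same order. The function name and data are hypothetical, and ties are broken by position, which is a simplification.

```python
def quantile_normalize(chips):
    """Quantile-normalize chips (lists of intensities of equal length).

    Each value is replaced by the mean, across chips, of the values
    that hold the same rank. Ties are broken by position (a simplification).
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Mean of the i-th smallest value across all chips.
    rank_means = [sum(s[i] for s in sorted_chips) / len(chips) for i in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Hypothetical intensities for four probes on two chips.
chip_a = [5.0, 2.0, 3.0, 4.0]
chip_b = [8.0, 2.0, 6.0, 4.0]
norm_a, norm_b = quantile_normalize([chip_a, chip_b])
print(norm_a)
print(norm_b)
```

After normalization both chips contain exactly the same set of values; only the assignment of values to probes differs, which preserves each chip's ranking.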

Types of normalization • To total signal (linear normalization) • LOESS (LOcally WEighted polynomial regreSSion) • To “housekeeping genes” • To genomic DNA spots (Research Genetics) or mixed cDNAs • To internal spikes
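The simplest entry in the list above, normalization to total signal, amounts to one scale factor per channel. A minimal sketch with a hypothetical function name and made-up intensities:

```python
def normalize_to_total(channel, target_total):
    """Linearly rescale a channel so its intensities sum to target_total."""
    scale = target_total / sum(channel)
    return [x * scale for x in channel]

# Hypothetical intensities: the green channel is uniformly half as bright.
red = [100.0, 300.0, 600.0]
green = [50.0, 150.0, 300.0]
green_scaled = normalize_to_total(green, sum(red))
print(green_scaled)
```

A single scale factor cannot remove intensity-dependent dye bias, which is the case that LOESS normalization is meant to handle.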

Microarray analysis • Data exploration: expression of gene X? • Statistical analysis: which genes show large, reproducible changes? • Clustering: grouping genes by expression pattern. • Knowledge-based analysis: Are amine synthesis genes involved in this experiment?

Fold change: the crudest method of finding differentially expressed genes [Figure: HeLa vs. HepG2, with the >2-fold expression change regions marked]
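A fold-change filter can be sketched as a threshold on the log ratio. The gene names and expression values here are hypothetical, and the 2-fold default simply mirrors the slide.

```python
import math

def fold_change_hits(expr_a, expr_b, threshold=2.0):
    """Return genes whose expression differs by more than `threshold`-fold
    between two samples, in either direction."""
    cutoff = math.log2(threshold)
    hits = []
    for gene in expr_a:
        log_ratio = math.log2(expr_b[gene] / expr_a[gene])
        if abs(log_ratio) > cutoff:
            hits.append(gene)
    return hits

# Hypothetical normalized expression values.
hela = {"geneA": 100.0, "geneB": 400.0, "geneC": 250.0}
hepg2 = {"geneA": 450.0, "geneB": 100.0, "geneC": 300.0}
print(fold_change_hits(hela, hepg2))
```

The filter ignores replicate variability entirely, which is why it counts as crude: a noisy low-intensity gene can pass the threshold by chance.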

What do we mean by differentially expressed? • Statistically, our gene is different from the other genes. [Figure: distribution of measurements for the gene of interest (probability of a given value of the ratio) compared with the distribution of average ratios for all genes (number of genes vs. log ratio)]

Finding differentially expressed genes: What affects our certainty that a gene is up- or down-regulated? • Number of sample points • Difference in means • Standard deviations of sample [Figure: probe signal in Sample A vs. Sample B]
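The three factors above are exactly the ingredients of a t statistic. The sketch below uses the Welch (unequal variance) form, which is an assumption since the slides do not say which variant is intended. All sample values are hypothetical.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch t statistic: difference in means divided by its standard error."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Standard error shrinks with more samples and with smaller variances.
    se = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_b - mean_a) / se

# Hypothetical log expression values; every pair has the same difference in means.
tight = welch_t([1.0, 1.1, 0.9], [2.0, 2.1, 1.9])            # small spread
noisy = welch_t([1.0, 2.0, 0.0], [2.0, 3.0, 1.0])            # large spread
more = welch_t([1.0, 2.0, 0.0] * 2, [2.0, 3.0, 1.0] * 2)     # same spread, more points
print(f"tight={tight:.2f} noisy={noisy:.2f} more_replicates={more:.2f}")
```

With the difference in means held fixed, the statistic grows when the spread shrinks or when the number of sample points increases.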

Practical views on statistics • With appropriate biological replicates, it is possible to select statistically meaningful genes/patterns. • Sensitivity and selectivity trade off against each other - e.g. selecting more true positives WILL bring in more false positives along with fewer false negatives. • False negatives are lost opportunities, false positives cost $’s and waste time. • A set of experiments analyzed with conservative statistics typically still yields more genes/pathways/patterns than one can sensibly follow up - so use conservative statistics to protect against false positives when designing follow-on experiments.

Statistical Tests • Student’s t-test • Correct for multiple testing! (Holm-Bonferroni) • False discovery rate. • Significance Analysis of Microarrays (SAM) • http://www-stat.stanford.edu/~tibs/SAM/ • ANOVA • Principal components analysis • Special methods for periodic patterns in data.
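Two of the corrections named above can be sketched directly from their definitions: the Holm-Bonferroni step-down procedure and a false discovery rate procedure. The slide does not say which FDR method is meant, so the Benjamini-Hochberg procedure is used here as an assumption. The p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (0-based rank) is compared to alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first failure
    return reject

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank k whose p-value is at most (k / m) * alpha.
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject

# Hypothetical per-gene p-values.
p = [0.001, 0.012, 0.025, 0.045, 0.2]
print("Holm:", holm_bonferroni(p))
print("BH:  ", benjamini_hochberg(p))
```

On these values the family-wise Holm correction keeps two genes while the FDR procedure keeps three, which illustrates why FDR control is the less conservative choice when thousands of genes are tested.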

Volcano plot: log(fold change) vs. p-value [Figure axes: log(fold change), p-value]

Scatter plot showing genes with significant p-values

Pattern finding • In many cases, the patterns of differential expression are the target (as opposed to specific genes) • Clustering or other approaches for pattern identification - find genes which behave similarly across all experiments or experiments which behave similarly across all genes • Classification - identify genes which best distinguish two or more classes. • The statistical reliability of the pattern or classifier is still an issue and similar considerations apply - e.g. cluster analysis of random noise will still produce clusters, and they will be meaningless.

What is clustering? • Group similar objects together. • Genes with similar expression patterns. • Objects in the same cluster (group) are more similar to each other than objects in different clusters.

Clustering • What is clustering? • Similarity/distance metrics • Hierarchical clustering algorithms • Made popular by Stanford, e.g. [Eisen et al. 1998] • K-means • Made popular by many groups, e.g. [Tavazoie et al. 1999] • Self-organizing map (SOM) • Made popular by Whitehead, e.g. [Tamayo et al. 1999]
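Of the algorithms listed above, K-means is the shortest to write down. The sketch below is a plain Lloyd-style iteration with Euclidean distance, not the implementation used in the cited papers. Initializing centroids from the first k profiles is a simplification chosen for reproducibility, and the expression profiles are hypothetical.

```python
import math

def kmeans(profiles, k, iterations=20):
    """Cluster expression profiles into k groups (Lloyd-style iteration).

    Centroids start at the first k profiles, a simplification made so the
    example is reproducible; real tools usually use random restarts.
    """
    centroids = [list(p) for p in profiles[:k]]
    assignment = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical profiles: four genes measured in three experiments.
genes = [[1.0, 2.0, 3.0], [5.0, 4.0, 1.0], [1.1, 2.1, 2.9], [5.2, 3.9, 0.8]]
assignment, centroids = kmeans(genes, k=2)
print(assignment)
```

The number of clusters k has to be chosen in advance, and the algorithm will return k clusters even when the data contain no real structure.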

Typical Tools • SAM (Significance Analysis of Microarrays), Stanford • GeneSpring • Affymetrix GeneChip Operating System (GCOS) • Cluster/Treeview • R statistics package microarray analysis libraries.

How to define similarity? • Similarity metric: a measure of pairwise similarity or dissimilarity • Examples: correlation coefficient, Euclidean distance • [Figure: raw matrix X (n genes × p experiments) converted into a similarity matrix Y (n genes × n genes)]

Similarity metrics • Euclidean distance • Correlation coefficient • Euclidean clustering = magnitude & direction • Correlation clustering = direction
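The "magnitude & direction" versus "direction" distinction is easy to see on a small example. In this sketch the three profiles are hypothetical, and the Pearson correlation is assumed to be the correlation coefficient the slide refers to.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical profiles across four experiments.
low = [1.0, 2.0, 3.0, 2.0]
high = [10.0, 20.0, 30.0, 20.0]   # same shape as `low`, ten times the magnitude
other = [2.0, 2.0, 1.0, 3.0]      # similar magnitude to `low`, different shape

print(f"low vs high : distance={euclidean(low, high):.2f}  r={pearson(low, high):+.2f}")
print(f"low vs other: distance={euclidean(low, other):.2f}  r={pearson(low, other):+.2f}")
```

Euclidean distance places `low` next to `other` because their magnitudes are close, while correlation places `low` next to `high` because their patterns rise and fall together.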

Sporulation-example

Self-organizing maps (SOM) [Kohonen 1995] • Basic idea: • map high dimensional data onto a 2D grid of nodes • Neighboring nodes are more similar than points far away
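The basic idea above can be sketched as a small training loop. This is a simplified illustration rather than Kohonen's full algorithm or any published tool: the linear decay schedules, the Gaussian neighborhood, the default parameters, and the data are all assumptions.

```python
import math
import random

def train_som(data, rows, cols, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a rows x cols self-organizing map; returns {(row, col): weights}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Each grid node starts at a randomly chosen data point.
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays from 1 toward 0
        lr = lr0 * frac                      # learning rate
        radius = max(radius0 * frac, 0.5)    # neighborhood radius on the grid
        for x in rng.sample(data, len(data)):
            # Best matching unit: the node whose weights are closest to x.
            bmu = min(weights, key=lambda node: math.dist(x, weights[node]))
            for node, w in weights.items():
                grid_d = math.dist(node, bmu)
                influence = math.exp(-(grid_d ** 2) / (2 * radius ** 2))
                # Pull the node toward x, more strongly when it is near the BMU.
                for i in range(dim):
                    w[i] += lr * influence * (x[i] - w[i])
    return weights

def best_matching_unit(weights, x):
    """Return the grid position of the node closest to profile x."""
    return min(weights, key=lambda node: math.dist(x, weights[node]))

# Hypothetical two-experiment profiles forming two loose groups.
data = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]]
som = train_som(data, rows=2, cols=2)
for profile in data:
    print(profile, "->", best_matching_unit(som, profile))
```

Because neighbors of the best matching unit are updated together, nodes that sit close on the grid end up with similar weight vectors, which is what lets the 2D grid serve as a map of the high-dimensional data.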

Self-organizing maps (SOM)

SOM Clusters

Things learned from microarray gene expression experiments • Pathways not known to be involved • Ontology? • Novel genes involved in a known pathway • “like” and “unlike” tissues

Transcription Factors / Regulatory Networks • Identify co-regulated genes • Search for common motifs (transcription factor binding sites) • Evaluate known motifs/factors • Search for new ones. • Programs: MEME, etc.
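The search for common motifs can be illustrated with a naive count of exact k-mers shared across the promoters of co-regulated genes. This is not how MEME works (MEME fits probabilistic motif models). It is only a minimal sketch, and the sequences are hypothetical.

```python
from collections import Counter

def shared_kmers(promoters, k):
    """Count, for each k-mer, how many promoter sequences contain it.

    A k-mer is counted once per sequence, however often it occurs there.
    """
    counts = Counter()
    for seq in promoters:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

# Hypothetical promoter fragments from three co-regulated genes.
promoters = ["TTACGCGTAA", "GGACGCGTCC", "CACGCGTTGA"]
counts = shared_kmers(promoters, k=6)
print(counts.most_common(1))
```

Exact matching misses degenerate binding sites, and on real promoters the counts would need to be compared against a background set of sequences before a shared k-mer is taken as a candidate motif.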

mRNA-protein Correlation • YPD: should have relevant data • will yeast be typical? • Electrophoresis 18:533 • 23 proteins on 2D gels • r = 0.48 for mRNA vs. protein • Post-transcriptional and post-translational regulation are important!

Other microarray formats • Single nucleotide polymorphism (SNP) chips • Oligos with each of 4 nt at each SNP. • Chromatin immunoprecipitation chips (ChIP:chip) • Determine transcription factor binding sites • Promoter DNA on the chip. • Alternative splicing chips • Long oligos, covering alternatively spliced exons, or all exons. • Genome tiling chips

ChIP:chip - Identification of Transcription Factor Binding Sites • Cross-link transcription factors to DNA with formaldehyde • Pull out the transcription factor of interest via immunoprecipitation with an antibody, or by tagging the factor of interest with an isolatable epitope (e.g. GST fusion). • Fractionate the DNA associated with the transcription factor, reverse the cross-links, label and hybridize to an array of promoter DNA. • Brown et al. (2001) Nature 409:533-8

ChIP:chip Analysis of TF Binding Sites

On to Proteomics: DNA → RNA → Protein
