Microarray normalization, error models, quality. Wolfgang Huber EMBL-EBI Brixen 16 June 2008. Oligonucleotide microarrays. Base Pairing. *. *. *. *. *. Oligonucleotide microarrays. GeneChip. Hybridized Probe Cell. Target - single stranded cDNA. Oligonucleotide probe. 5µm.
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Brixen 16 June 2008
Probe: one of these 25-mer oligonucleotides
Perfect match (PM): A probe exactly complementary to the target sequence
Mismatch (MM): same as PM but with a single homomeric base change for the middle (13th) base (G-C, A-T)
Probe-pair: a (PM, MM) pair.
Probe set: a collection of probe-pairs (e.g. 11) targeting the same transcript
MGED/MIAME: „probe“ is ambiguous!
Reporter: the sequence
Feature: a physical patch on the array with molecules intended to have the same reporter sequence (one reporter can be represented by multiple features)Terminology for transcription arrays
From: lymphoma dataset
Alizadeh et al., Nature 2000
arrays / dyes
Cope et al. Bioinformatics 2003
Yue et al., (Incyte Genomics) NAR (2001) 29 e41
The problem is less that these steps are ‘not perfect’; it is that they vary from array to array, experiment to experiment.
Can always increase sensitivity on the cost of specificity, or vice versa, the art is to
- optimize both, then
- find the best trade-off.
How to address measurement uncertainty (“variance”)?
How to calibrate (“normalize”) for biases between samples?Questions
o similar effect on many
o corrections can be estimated from data
o too random to be ex-plicitely accounted for
o remain as “noise”
Error modelSources of variation
amount of RNA in the biopsy
probe purity and length distribution
spotting efficiency, spot size
nq = normalize.quantiles(exprs(Dilution))
nr = apply(exprs(Dilution), 2, rank)
for(i in 1:4) plot(nr[,i], nq[,i], pch=".", log="y", xlab="rank",
ylab="quantile normalized", main=sampleNames(Dilution)[i])
+ Always works, needs no user interaction / tuning
+ Non-parametric: can correct for quite nasty non-linearities (saturation, background) in the data
- Always "works", even if data are bad / inappropriate
- Conservative: rank transformation looses information - may yield less power to detect differentially expressed genes
- Aggressive: when there is an excess of up- (or down) regulated genes, it removes not just technical, but also biological variationQuantile normalisation
Local Regression and Likelihood
An algorithm for fitting this robustly is described (roughly) in the paper. They only provided software as a compiled binary for Windows. The method has not found much use.
The idea of the log-ratio (base 2)
0: no change
+1: up by factor of 21 = 2
+2: up by factor of 22 = 4
-1: down by factor of 2-1 = 1/2
-2: down by factor of 2-2 = ¼
A unit for measuring changes in expression: assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000.
What about a change from 0 to 500?
- noise, measurement precision
time in seconds
lengths in meters
energy in Joule, etc.
Climb Mount Plose (2465 m) from Brixen (559 m) with weight of 78 kg, working against a gravitation field of strength 9.81 m/s2 :What is wrong with microarray data?
(2465 - 559) · 78 · 9.81 m kg m/s2
= 1 458 433 kg m2 s-2
= 1 458.433 kJ
Trey Ideker et al. (2000) JCB
David Rocke and Blythe Durbin:
(2001) JCB, (2002) Bioinformatics
Use for robust affine regression normalisation:
W. Huber, Anja von Heydebreck et al. (2002) Bioinformatics
hik multiplicative noise
ai per-sample offset
eik additive noise
The two component model
measured intensity = offset + gain true abundance
two practically equivalent forms (h<<1)
Xu a family of random variables with E Xu = u, Var Xu = v(u). Define
Var f(Xu ) independent of u
derivation: linear approximation
- - - f(x) = log(x)
———hs(x) = asinh(x/s)
P. Munson, 2001
D. Rocke & B. Durbin, ISMB 2002
W. Huber et al., ISMB 2002
'glog' (generalized log-ratio)
c1, c2are experiment specific parameters (~level of background noise)
a general technology in statistics:
pay a small price in bias for a large decrease of variance, so overall the mean-squared-error (MSE) is reduced.
Particularly useful if you have few replicates.
Generalized log-ratio is a shrinkage estimator for fold change
Irizarry et al. Biostatistics 2003
raw intensities x
biased background correction
unbiased background correction
mismatch probe k for gene g on chip i
i = 1,…, n one to hundreds of chips
k = 1,…, J usually 11 probe pairs
g= 1,…, G tens of thousands of probe sets.
calibrate (normalize) the measurements from different chips (samples)
summarize for each probe set the probe level data, i.e., 11 PM and MM pairs, into a single expression measure.
compare between chips (samples) for detecting differential expression.Data and notation
o sort dk = PMk -MMk
o exclude highest and lowest value
o K := those pairs within 3 standard deviations of the averageExpression measures: MAS 4.0
biis estimated using the robust method median polish (successively remove row and column medians, accumulate terms, until convergence).
Expression measures RMA: Irizarry et al. (2002)
- trimmed mean (0.15)
Huber et al., Bioinformatics 2006
4 bp tiling path over complete genome
(12 M basepairs, 16 chromosomes)
Sense and Antisense strands
6.5 Mio oligonucleotides
5 mm feature size
manufactured by Affymetrix
designed by Lars Steinmetz (EMBL & Stanford Genome Center)
remove ‘dead’ probes
siprobe specific response factor.
Estimate taken from DNA hybridization data
bi =b(si )probe specific background term. Estimation: for strata of probes with similar si, estimate b through location estimator of distribution of intergenic probes, then interpolate to obtain continuous b(s)
Estimation of b: joint distribution of (DNA, RNA) values of intergenic PM probes
log2 RNA intensity
log2 DNA intensity
Variance stabilization applied to microarray data calibration and to the quantification of differential expression. W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, M. Vingron. Bioinformatics 18 suppl. 1 (2002), S96-S104.
Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. R. Irizarry, B. Hobbs, F. Collins, …, T. Speed. Biostatistics 4 (2003) 249-264.
Error models for microarray intensities. W. Huber, A. von Heydebreck, and M. Vingron. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).
Normalization and analysis of DNA microarray data by self-consistency and local regression. T.B. Kepler, L. Crosby, K. Morgan. Genome Biology. 3(7):research0037 (2002)
Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. S. Dudoit, Y.H. Yang, M. J. Callow, T. P. Speed. Technical report # 578, August 2000 (UC Berkeley Dep. Statistics)
A Benchmark for Affymetrix GeneChip Expression Measures. L.M. Cope, R.A. Irizarry, H. A. Jaffee, Z. Wu, T.P. Speed. Bioinformatics (2003).
....many, many more...References
Robert Gentleman (Seattle)
Günther Sawitzki (Heidelberg)
Martin Vingron (Berlin)
Rafael Irizarry (Baltimore)
Terry Speed (Berkeley)
Judith Boer (Leiden)
Anke Schroth (Wiesloch)
Friederike Wilmer (Hilden)
Jörn Tödling (Cambridge)
Lars Steinmetz (Heidelberg)
Audrey Kauffmann (Cambridge)Acknowledgements