Quality control and normalization. Wolfgang Huber European Bioinformatics Institute. Anja von Heydebreck (Darmstadt) Robert Gentleman (Seattle) Günther Sawitzki (Heidelberg) Martin Vingron (Berlin) Annemarie Poustka, Holger Sültmann, Andreas Buness, Markus Ruschhaupt (Heidelberg)
European Bioinformatics Institute
Anja von Heydebreck (Darmstadt)
Robert Gentleman (Seattle)
Günther Sawitzki (Heidelberg)
Martin Vingron (Berlin)
Annemarie Poustka, Holger Sültmann, Andreas Buness, Markus Ruschhaupt (Heidelberg)
Rafael Irizarry (Baltimore)
Judith Boer (Leiden)
Anke Schroth (Heidelberg)
Friederike Wilmer (Hilden)
Can always increase sensitivity on the cost of specificity,
or vice versa,
the art is to find the best trade-off.
(It can also be possible to increase both by better choice of method / model)
But what if the gene is “off” (below detection limit) in one condition?
Fold changes are useful to describe continuous changes in expression
The idea of the log-ratio (base 2)
0: no change
+1: up by factor of 21 = 2
+2: up by factor of 22 = 4
-1: down by factor of 2-1 = 1/2
-2: down by factor of 2-2 = ¼
A unit for measuring changes in expression: assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000.
What about a change from 0 to 500?
- noise, measurement precision
The problem is less that these steps are ‘not perfect’; it is that they vary from array to array, experiment to experiment.
How to compare microarray intensities with each other?
How to address measurement uncertainty (“variance”)?
How to calibrate (“normalize”) for biases between samples?
o similar effect on many
o corrections can be estimated from data
o too random to be ex-plicitely accounted for
o remain as “noise”
amount of RNA in the biopsy
probe purity and length distribution
spotting efficiency, spot size
hik ~ N(0,s22)
ai per-sample offset
eik ~ N(0, bi2s12)
The two component model
measured intensity = offset + gain true abundance
B. Durbin, D. Rocke, JCB 2001
two practically equivalent forms (h<<1)
Xu a family of random variables with EXu=u, VarXu=v(u). Define
var f(Xu ) independent of u
derivation: linear approximation
1.) constant variance (‘additive’)
2.) constant CV (‘multiplicative’)
4.) additive and multiplicative
- - - f(x) = log(x)
———hs(x) = asinh(x/s)
P. Munson, 2001
D. Rocke & B. Durbin, ISMB 2002
W. Huber et al., ISMB 2002
s: probe strata (e.g. print-tip, region)
'glog' (generalized log-ratio)
c1, c2are experiment specific parameters (~level of background noise)
pay a small price in bias for a large decrease of variance, so overall the mean-squared-error (MSE) is reduced.
Particularly useful if you have few replicates.
= a shrinkage estimator for fold change
There are many possible choices, we chose “variance-stabilization”:
+ interpretable even in cases where genes are off in some conditions
+ can subsequently use standard statistical methods (hypothesis testing, ANOVA, clustering, classification…) without the worries about low-level variability that are often warranted on the log-scale
o Microarrays can be operated in a linear regime, where fluorescence intensity increases proportionally to target abundance (see e.g. Affymetrix dilution series)
Two reasons for non-linearity:
oAt the high intensity end:saturation/quenching. This can and should be avoided experimentally - loss of data!
oAt the low intensity end:background offsets, instead of y=k·x we have y=k·x+x0, and in the log-log plot this can look curvilinear. But this is an affine-linear effect and can be correct by affine normalization. Non-parametric methods (e.g. loess) risk overfitting and loss of power.
1. correction for systematic experimental biases
2. provision of expression values that can subsequently be used for testing, clustering, classification, modelling…
3. provision of a measure of measurement uncertainty
Quality trade-off: the better the measurements, the less need for normalization. Need for “too much” normalization relates to a quality problem.
Variance-Bias trade-off: how do you weigh measurements that have low signal-noise ratio?
- just use anyway
Logarithm is more beautiful than arsinh
It takes forever to run method XX. Referees will only accept my paper if it uses the original MAS5.
The best method is that that makes all my scatterplots look like straight, slim cigars
Normalization calculations should be based on physical/chemical model
Life would be so much easier if everybody were just using the same method, who cares which one
Comparison against a ground truth
But you have millions of numbers – need to choose the metric that measures deviation from truth.
FN/FP: do you find all the differentially expressed genes, and do you not find non-d.e. genes?
qualitative/quantitative: how well do you estimate abundance, fold-change?
Spike-In and Dilution series
… great, but how representative are they of other data?
Implicitely, from resampling / cross-validating with the actual experiment of interest
… but isn’t that too much like Münchhausen’s bootstrap?
Spike-in series: from Affymetrix 59 x HGU95A,
16 genes, 14 concentrations, complex background
Dilution series: from GeneLogic 60 x HGU95Av2,
liver & CNS cRNA in different proportions and amounts
15 quality measures regarding
Put together by Rafael Irizarry (Johns Hopkins) http://affycomp.biostat.jhsph.edu
Scatterplot, colored by PCR-plate
Two RZPD Unigene II filters (cDNA nylon membranes)
spotting pin quality decline
after delivery of 5x105 spots
after delivery of 3x105 spots
H. Sueltmann DKFZ/MGA
R RbR-Rbcolor scale by rank
another array: print-tip
color scale ~ log(G)
color scale ~ rank(G)
spotted cDNA arrays, Stanford-type
Batches: array to array differences dij = madk(hik -hjk)
arrays i=1…63; roughly sorted by time
Density representation of the scatterplot
(76,000 clones, RZPD Unigene-II filters)
See: package hexbin; also, smoothscatter in package prada
Main software from Affymetrix:
MAS - MicroArray Suite.
DAT file: Image file, ~108 pixels, ~200 MB.
CEL file: probe intensities, ~106 numbers
CDF file: Chip Description File. Describes which probes go in which probe sets (genes, gene fragments, ESTs).
1LQ file: Probe sequences and intended targets in the transcriptome
DAT image files CEL files
Each probe cell: 10x10 pixels.
Gridding: estimate location of probe cell centers.
Remove outer 36 pixels 8x8 pixels.
The probe cell signal, PM or MM, is the 75th percentile of the 8x8 pixel values.
Background: Average of the lowest 2% probe cells is taken as the background value and subtracted.
Compute also quality values.
PMijg , MMijg= Intensities for perfect match and
mismatch probe j for gene g in chip i
i = 1,…, none to hundreds of chips
j = 1,…, J usually 11 or 16 probe pairs
g= 1,…, G 6…30,000 probe sets.
calibrate (normalize) the measurements from different chips (samples)
summarize for each probe set the probe level data, i.e., 11 PM and MM pairs, into a single expression measure.
compare between chips (samples) for detecting differential expression.
Affymetrix GeneChip MAS 4.0 software uses AvDiff, a trimmed mean:
o sort dj = PMj -MMj
o exclude highest and lowest value
o J := those pairs within 3 standard deviations of the average
Instead of MM, use "repaired" version CT
CT=MM if MM<PM
= PM / "typical log-ratio"if MM>=PM
Tukey Biweight: B(x) = (1 – (x/c)^2)^2 if |x|<c, 0 otherwise
dChip fits a model for each gene
qi: expression index for gene i
fj: probe sensitivity
Maximum likelihood estimate of MBEI is used as expression measure of the gene in chip i.
Need at least 10 or 20 chips.
Current version works with PMs only.
o Estimate one global background value b=mode(MM). No probe-specific background!
o Assume: PM = strue + b
Estimate s0 from PM and b as a conditional expectation E[strue|PM, b].
o Use log2(s).
o Nonparametric nonlinear calibration ('quantile normalization') across a set of chips.
with A a set of “suitable” pairs.
Li-Wong-like: additive model
Estimate RMA = ai for chip i using robust method median polish (successively remove row and column medians, accumulate terms, until convergence). Works with d>=2
Expression measures RMA: Irizarry et al. (2002)
From: R. Irizarry et al., Biostatistics 2002
position- and sequence-specific effects wi(s):
Naef et al., Phys Rev E 68 (2003)
Bioconductor R package affy.
Two main functions: ReadAffy, expresso.
See also: gcrma, tilingArray, vsn.
Bioinformatics and computational biology solutions using R and Bioconductor, R. Gentleman, V. Carey, W. Huber, R. Irizarry, S. Dudoit, Springer (2005).
Variance stabilization applied to microarray data calibration and to the quantification of differential expression. W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, M. Vingron. Bioinformatics 18 suppl. 1 (2002), S96-S104.
Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. R. Irizarry, B. Hobbs, F. Collins, …, T. Speed. Biostatistics 4 (2003) 249-264.
Error models for microarray intensities. W. Huber, A. von Heydebreck, and M. Vingron. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).
Differential Expression with the Bioconductor Project. A. von Heydebreck, W. Huber, and R. Gentleman. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).