Quality control and normalization

Quality control and normalization Wolfgang Huber European Bioinformatics Institute

Anja von Heydebreck (Darmstadt) Robert Gentleman (Seattle) Günther Sawitzki (Heidelberg) Martin Vingron (Berlin) Annemarie Poustka, Holger Sültmann, Andreas Buness, Markus Ruschhaupt (Heidelberg) Rafael Irizarry (Baltimore) Judith Boer (Leiden) Anke Schroth (Heidelberg) Friederike Wilmer (Hilden) Acknowledgements

log-ratio Which genes are differentially transcribed? same-same tumor-normal

Statistics 101: biasaccuracy  precision variance

Basic dogma of data analysis: Can always increase sensitivity on the cost of specificity, or vice versa, the art is to find the best trade-off. X X X X X X X X X (It can also be possible to increase both by better choice of method / model)

3000 3000 x3 ? 1500 200 1000 0 ? x1.5 A A B B C C But what if the gene is “off” (below detection limit) in one condition? ratios and fold changes Fold changes are useful to describe continuous changes in expression

ratios and fold changes The idea of the log-ratio (base 2) 0: no change +1: up by factor of 21 = 2 +2: up by factor of 22 = 4 -1: down by factor of 2-1 = 1/2 -2: down by factor of 2-2 = ¼ A unit for measuring changes in expression: assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000. What about a change from 0 to 500? - conceptually - noise, measurement precision

A complex measurement process lies between mRNA concentrations and intensities The problem is less that these steps are ‘not perfect’; it is that they vary from array to array, experiment to experiment.

 How to compare microarray intensities with each other?  How to address measurement uncertainty (“variance”)?  How to calibrate (“normalize”) for biases between samples? Questions

Systematic Stochastic o similar effect on many measurements o corrections can be estimated from data o too random to be ex-plicitely accounted for o remain as “noise” Calibration Error model Sources of variation amount of RNA in the biopsy efficiencies of -RNA extraction -reverse transcription -labeling -fluorescent detection probe purity and length distribution spotting efficiency, spot size cross-/unspecific hybridization stray signal

Error models • describe the possible outcomes of a set of measurements • Outcomes depend on: • true value of the measured quantity • (abundances of specific molecules in biological sample) • measurement apparatus • (cascade of biochemical reactions, optical detection system with laser scanner or CCD camera)

Error models • Purpose: • Data compression: summary statistic instead of full empirical distribution • Quality control • Statistical inference: appropriate parametric methods have better power than non-parametric (this has practical, financial, andethical aspects)

bi per-sample normalization factor bk sequence-wise probe efficiency hik ~ N(0,s22) “multiplicative noise” ai per-sample offset eik ~ N(0, bi2s12) “additive noise”  The two component model measured intensity = offset + gain  true abundance

“multiplicative” noise “additive” noise  The two-component model raw scale log scale B. Durbin, D. Rocke, JCB 2001

Parameterization two practically equivalent forms (h<<1)

 variance stabilizing transformations Xu a family of random variables with EXu=u, VarXu=v(u). Define var f(Xu ) independent of u derivation: linear approximation

variance stabilizing transformations f(x) x

1.) constant variance (‘additive’) 2.) constant CV (‘multiplicative’) 3.) offset 4.) additive and multiplicative  variance stabilizing transformations

the “glog” transformation - - - f(x) = log(x) ———hs(x) = asinh(x/s) P. Munson, 2001 D. Rocke & B. Durbin, ISMB 2002 W. Huber et al., ISMB 2002

generalized log-ratio difference log-ratio variance: constant part proportional part glog raw scale log glog

the transformed model i: arrays k: probes s: probe strata (e.g. print-tip, region)

“usual” log-ratio 'glog' (generalized log-ratio) c1, c2are experiment specific parameters (~level of background noise)

 Variance Bias Trade-Off Estimated log-fold-change log glog Signal intensity

 Variance-bias trade-off and shrinkage estimators Shrinkage estimators: pay a small price in bias for a large decrease of variance, so overall the mean-squared-error (MSE) is reduced. Particularly useful if you have few replicates. Generalized log-ratio: = a shrinkage estimator for fold change There are many possible choices, we chose “variance-stabilization”: + interpretable even in cases where genes are off in some conditions + can subsequently use standard statistical methods (hypothesis testing, ANOVA, clustering, classification…) without the worries about low-level variability that are often warranted on the log-scale

evaluation: effects of different data transformations difference red-green rank(average)

 Normality: QQ-plot

 “Single color normalization” • n red-green arrays (R1, G1, R2, G2,… Rn, Gn) • within/between slides • for (i=1:n) • calculate Mi= log(Ri/Gi), Ai= ½ log(Ri*Gi) • normalize Mi vs Ai • normalize M1…Mn • all at once • normalize the matrix of (R, G) • then calculate log-ratios or any other contrast you like

 What about non-linear effects o Microarrays can be operated in a linear regime, where fluorescence intensity increases proportionally to target abundance (see e.g. Affymetrix dilution series) Two reasons for non-linearity: oAt the high intensity end:saturation/quenching. This can and should be avoided experimentally - loss of data! oAt the low intensity end:background offsets, instead of y=k·x we have y=k·x+x0, and in the log-log plot this can look curvilinear. But this is an affine-linear effect and can be correct by affine normalization. Non-parametric methods (e.g. loess) risk overfitting and loss of power.

 Non-linear or affine linear?

 Definitions linear affine linear genuinely non-linear

 How to compare and assess different ‘normalization’ methods? Normalization := 1. correction for systematic experimental biases 2. provision of expression values that can subsequently be used for testing, clustering, classification, modelling… 3. provision of a measure of measurement uncertainty Quality trade-off: the better the measurements, the less need for normalization. Need for “too much” normalization relates to a quality problem. Variance-Bias trade-off: how do you weigh measurements that have low signal-noise ratio? - just use anyway - ignore - shrink

 How to compare and assess different ‘normalization’ methods? Aesthetic criteria Logarithm is more beautiful than arsinh Practical critera It takes forever to run method XX. Referees will only accept my paper if it uses the original MAS5. Silly criteria The best method is that that makes all my scatterplots look like straight, slim cigars Physical criteria Normalization calculations should be based on physical/chemical model Economical/political criteria Life would be so much easier if everybody were just using the same method, who cares which one

 How to compare and assess different ‘normalization’ methods? Comparison against a ground truth But you have millions of numbers – need to choose the metric that measures deviation from truth. FN/FP: do you find all the differentially expressed genes, and do you not find non-d.e. genes? qualitative/quantitative: how well do you estimate abundance, fold-change? Spike-In and Dilution series … great, but how representative are they of other data? Implicitely, from resampling / cross-validating with the actual experiment of interest … but isn’t that too much like Münchhausen’s bootstrap?

 evaluation: a benchmark for Affymetrix genechip expression measures o Data: Spike-in series: from Affymetrix 59 x HGU95A, 16 genes, 14 concentrations, complex background Dilution series: from GeneLogic 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts o Benchmark: 15 quality measures regarding -reproducibility -sensitivity -specificity Put together by Rafael Irizarry (Johns Hopkins) http://affycomp.biostat.jhsph.edu

Quality assessment and control: an overview over some diagnostic plots and common artifacts

PCR plates Scatterplot, colored by PCR-plate Two RZPD Unigene II filters (cDNA nylon membranes)

PCR plates

PCR plates: boxplots

array batches

print-tip effects F(q) q (log-ratio)

spotting pin quality decline after delivery of 5x105 spots after delivery of 3x105 spots H. Sueltmann DKFZ/MGA

spatial effects R Rb R-Rbcolor scale by rank another array: print-tip color scale ~ log(G) color scale ~ rank(G) spotted cDNA arrays, Stanford-type

Batches: array to array differences dij = madk(hik -hjk) arrays i=1…63; roughly sorted by time

Density representation of the scatterplot (76,000 clones, RZPD Unigene-II filters) See: package hexbin; also, smoothscatter in package prada

Oligonucleotide chips

Main software from Affymetrix: MAS - MicroArray Suite. DAT file: Image file, ~108 pixels, ~200 MB. CEL file: probe intensities, ~106 numbers CDF file: Chip Description File. Describes which probes go in which probe sets (genes, gene fragments, ESTs). 1LQ file: Probe sequences and intended targets in the transcriptome Affymetrix files

DAT image files  CEL files Each probe cell: 10x10 pixels. Gridding: estimate location of probe cell centers. Signal: Remove outer 36 pixels  8x8 pixels. The probe cell signal, PM or MM, is the 75th percentile of the 8x8 pixel values. Background: Average of the lowest 2% probe cells is taken as the background value and subtracted. Compute also quality values. Image analysis

PMijg , MMijg= Intensities for perfect match and mismatch probe j for gene g in chip i i = 1,…, n one to hundreds of chips j = 1,…, J usually 11 or 16 probe pairs g= 1,…, G 6…30,000 probe sets. Tasks: calibrate (normalize) the measurements from different chips (samples) summarize for each probe set the probe level data, i.e., 11 PM and MM pairs, into a single expression measure. compare between chips (samples) for detecting differential expression. Data and notation

Affymetrix GeneChip MAS 4.0 software uses AvDiff, a trimmed mean: o sort dj = PMj -MMj o exclude highest and lowest value o J := those pairs within 3 standard deviations of the average expression measures: MAS 4.0

Instead of MM, use "repaired" version CT CT= MM if MM<PM = PM / "typical log-ratio" if MM>=PM "Signal" = Tukey.Biweight (log(PM-CT)) (…median) Tukey Biweight: B(x) = (1 – (x/c)^2)^2 if |x|<c, 0 otherwise Expression measures MAS 5.0

Quality control and normalization

Quality control and normalization

Presentation Transcript

Quality control and documentation

Quality, Quality Control

INSPECTION AND QUALITY CONTROL

QUALITY CONTROL AND SPC

QUALITY CONTROL AND SPC

Microarray normalization, error models, quality

Microarray normalization, error models, quality

Quality Assurance and Control

Microarray Pre-processing, quality control and normalization

Software Quality Assurance and Quality Control

Reliability and Quality Control

Documentation Control and Quality Control

TQM and Quality Control

Stock Control, Quality Control and Quality Assurance

Quality Control and Quality Assurance

Quality Management and Control

Managerial and Quality Control

Managerial and Quality Control

Quality and Statistical Quality Control Techniques

Stock Control, Quality Control and Quality Assurance

Documentation Control and Quality Control