Quality control and normalization



Wolfgang Huber

European Bioinformatics Institute


Anja von Heydebreck (Darmstadt)

Robert Gentleman (Seattle)

Günther Sawitzki (Heidelberg)

Martin Vingron (Berlin)

Annemarie Poustka, Holger Sültmann, Andreas Buness, Markus Ruschhaupt (Heidelberg)

Rafael Irizarry (Baltimore)

Judith Boer (Leiden)

Anke Schroth (Heidelberg)

Friederike Wilmer (Hilden)




Which genes are differentially transcribed?



Statistics 101:

precision ↔ variance

Basic dogma of data analysis:

One can always increase sensitivity at the cost of specificity,

or vice versa;

the art is to find the best trade-off.










(It can also be possible to increase both by better choice of method / model)

Ratios and fold changes

Fold changes are useful to describe continuous changes in expression.

But what if the gene is “off” (below detection limit) in one condition?

The idea of the log-ratio (base 2)

0: no change

+1: up by factor of 2^1 = 2

+2: up by factor of 2^2 = 4

-1: down by factor of 2^-1 = 1/2

-2: down by factor of 2^-2 = 1/4

A unit for measuring changes in expression: assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000.

What about a change from 0 to 500?

- conceptually

- noise, measurement precision
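A minimal sketch of the log-ratio as a unit of change. The function name `log2_ratio` is a hypothetical helper, not part of any package mentioned in these slides. It shows that 1000 to 2000 and 5000 to 10000 both give +1, while a change from 0 to 500 has no finite log-ratio.

```python
import math


def log2_ratio(before, after):
    """Base-2 log-ratio of two intensities; only defined for positive values."""
    if before <= 0 or after <= 0:
        # A gene that is "off" (0 or below detection limit) has no finite log-ratio.
        raise ValueError("log-ratio is undefined for non-positive intensities")
    return math.log2(after / before)
```

Calling `log2_ratio(1000, 2000)` and `log2_ratio(5000, 10000)` returns the same value, whereas `log2_ratio(0, 500)` raises an error. This is the conceptual problem the slide points at; the noise problem appears as soon as `before` is small but positive.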

A complex measurement process lies between mRNA concentrations and intensities

The problem is less that these steps are ‘not perfect’; it is that they vary from array to array, experiment to experiment.


How to compare microarray intensities with each other?

⇒ How to address measurement uncertainty (“variance”)?

⇒ How to calibrate (“normalize”) for biases between samples?


Sources of variation

Systematic

o similar effect on many measurements

o corrections can be estimated from data

Stochastic

o too random to be explicitly accounted for

o remain as “noise”


Error model

Sources of variation

amount of RNA in the biopsy

efficiencies of

- RNA extraction

- reverse transcription

- fluorescent detection

probe purity and length distribution

spotting efficiency, spot size

cross-/unspecific hybridization

stray signal

Error models

  • Describe the possible outcomes of a set of measurements

  • Outcomes depend on:

    • true value of the measured quantity (abundances of specific molecules in the biological sample)

    • measurement apparatus (cascade of biochemical reactions, optical detection system with laser scanner or CCD camera)

Error models

  • Purpose:

    • Data compression: summary statistic instead of full empirical distribution

    • Quality control

    • Statistical inference: appropriate parametric methods have better power than non-parametric ones (this has practical, financial, and ethical aspects)

The two-component model

measured intensity = offset + gain × true abundance

$a_i$: per-sample offset

$\varepsilon_{ik} \sim N(0, b_i^2 s_1^2)$: “additive noise”

$b_i$: per-sample normalization factor

$b_k$: sequence-wise probe efficiency

$\eta_{ik} \sim N(0, s_2^2)$: “multiplicative noise”

[Figure: the two-component model on the raw scale and on the log scale, with the “additive” and “multiplicative” noise regimes. B. Durbin, D. Rocke, JCB 2001]
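A small simulation sketch of the two-component model. The exact functional form used here, y = a + eps + b * x * exp(eta), is an assumption consistent with "offset + gain × true abundance" and with the additive and multiplicative noise terms; the parameter values and function names are made up for illustration.

```python
import math
import random
import statistics


def simulate_intensity(x, a=100.0, b=1.0, s1=20.0, s2=0.2, rng=random):
    """One draw of y = a + eps + b * x * exp(eta) (assumed model form)."""
    eps = rng.gauss(0.0, b * s1)  # additive noise with sd b * s1
    eta = rng.gauss(0.0, s2)      # multiplicative noise, acts on the log scale
    return a + eps + b * x * math.exp(eta)


def empirical_sd(x, n=5000, seed=0):
    """Standard deviation of n simulated intensities at true abundance x."""
    rng = random.Random(seed)
    draws = [simulate_intensity(x, rng=rng) for _ in range(n)]
    return statistics.stdev(draws)
```

With these illustrative parameters, `empirical_sd(0.0)` stays near the additive noise level, while `empirical_sd(10000.0)` is dominated by the multiplicative term and grows roughly in proportion to the abundance.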

Parameterization

two practically equivalent forms ($\eta \ll 1$)

Variance stabilizing transformations

$X_u$ a family of random variables with $E X_u = u$, $\mathrm{Var}\, X_u = v(u)$. Define

$$f(x) = \int^x \frac{du}{\sqrt{v(u)}}$$

⇒ $\mathrm{Var}\, f(X_u)$ is approximately independent of $u$

derivation: linear approximation

1.) constant variance (‘additive’)

2.) constant CV (‘multiplicative’)

3.) offset

4.) additive and multiplicative
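The four cases can be worked out from the linear approximation. The following is a sketch of the standard delta-method argument, not a transcription of the slide; $u_0$ (offset) and $s$ (additive noise level) are symbols introduced here for illustration, and proportionality constants are dropped.

```latex
% Linear approximation: f(X_u) \approx f(u) + f'(u)\,(X_u - u)
\operatorname{Var} f(X_u) \approx f'(u)^2 \, v(u) = \text{const}
\quad\Longrightarrow\quad
f'(u) \propto \frac{1}{\sqrt{v(u)}}

\begin{aligned}
1.)\;& v(u) = s^2                       &&\Rightarrow\; f(u) \propto u \\
2.)\;& v(u) \propto u^2                 &&\Rightarrow\; f(u) \propto \log u \\
3.)\;& v(u) \propto (u - u_0)^2         &&\Rightarrow\; f(u) \propto \log(u - u_0) \\
4.)\;& v(u) \propto (u - u_0)^2 + s^2   &&\Rightarrow\; f(u) \propto \operatorname{arsinh}\frac{u - u_0}{s}
\end{aligned}
```

Case 4 is the one that matches a model with both additive and multiplicative noise, which is why the arsinh appears in the following slides.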

The “glog” transformation

dashed line: f(x) = log(x)

solid line: h_s(x) = asinh(x/s)

P. Munson, 2001

D. Rocke & B. Durbin, ISMB 2002

W. Huber et al., ISMB 2002
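A short sketch comparing the two curves numerically. It uses the identity asinh(z) = log(z + sqrt(z^2 + 1)), so for large x the glog behaves like log(2x/s), while at x = 0 it is finite. The function names are hypothetical.

```python
import math


def glog(x, s=1.0):
    """h_s(x) = asinh(x / s); defined for zero and negative x as well."""
    return math.asinh(x / s)


def glog_explicit(x, s=1.0):
    """Same function written out with a logarithm, to show the relation to log(x)."""
    z = x / s
    return math.log(z + math.sqrt(z * z + 1.0))
```

For intensities far above `s` the difference `glog(x) - math.log(2 * x)` vanishes, so the two transformations differ only by a constant shift there; near zero the logarithm diverges and the glog does not.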

[Figure: raw scale versus generalized log scale; the variance has a constant part and a proportional part]

The transformed model

i: arrays

k: probes

s: probe strata (e.g. print-tip, region)

“usual” log-ratio

‘glog’ (generalized log-ratio)

c1, c2 are experiment-specific parameters (~level of background noise)
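A sketch of the generalized log-ratio next to the usual one. The formula used, log((x1 + sqrt(x1^2 + c1^2)) / (x2 + sqrt(x2^2 + c2^2))), is the form given in the variance stabilization paper cited in the references; treat it as an assumption, since the slide equation itself is not preserved here. Natural logarithms are used; divide by log(2) for base 2.

```python
import math


def log_ratio(x1, x2):
    """Usual log-ratio; needs both intensities to be positive."""
    return math.log(x1 / x2)


def glog_ratio(x1, x2, c1, c2):
    """Generalized log-ratio (assumed form); finite even when x1 or x2 is 0."""
    numerator = x1 + math.sqrt(x1 * x1 + c1 * c1)
    denominator = x2 + math.sqrt(x2 * x2 + c2 * c2)
    return math.log(numerator / denominator)
```

At high intensities the two agree closely, e.g. `glog_ratio(20000, 10000, 100, 100)` is nearly `log_ratio(20000, 10000)`. At intensities comparable to `c1`, `c2` the generalized log-ratio is pulled towards zero, and `glog_ratio(500, 0, 100, 100)` is still a finite number.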

Variance-bias trade-off

[Figure: estimated log-fold-change versus signal intensity]

Variance-bias trade-off and shrinkage estimators

Shrinkage estimators:

pay a small price in bias for a large decrease of variance, so overall the mean-squared-error (MSE) is reduced.

Particularly useful if you have few replicates.

Generalized log-ratio:

= a shrinkage estimator for fold change

There are many possible choices; we chose “variance-stabilization”:

+ interpretable even in cases where genes are off in some conditions

+ can subsequently use standard statistical methods (hypothesis testing, ANOVA, clustering, classification…) without the worries about low-level variability that are often warranted on the log-scale

Evaluation: effects of different data transformations

[Figure axis: difference red-green]


Normality: QQ-plot

“Single color normalization”

  • n red-green arrays (R1, G1, R2, G2, …, Rn, Gn)

  • within/between slides:

    • for (i = 1:n)

      • calculate Mi = log(Ri/Gi), Ai = ½ log(Ri*Gi)

      • normalize Mi vs Ai

    • normalize M1…Mn

  • all at once:

    • normalize the matrix of (R, G)

    • then calculate log-ratios or any other contrast you like
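The M and A quantities of the within-slide scheme can be sketched as follows. Base-2 logarithms are an assumption (the slide only says "log"), and the median centering in `center_m` is a deliberately simple stand-in for "normalize Mi vs Ai", not the intensity-dependent method a real analysis would use.

```python
import math
from statistics import median


def ma_values(red, green):
    """Return (M, A) lists for one array: M = log2(R/G), A = 0.5 * log2(R*G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


def center_m(m):
    """Illustrative normalization only: shift M so that its median is zero."""
    shift = median(m)
    return [v - shift for v in m]
```

For a spot with R = 400 and G = 100, `ma_values([400], [100])` gives M = 2, i.e. red is up by a factor of 4, and A is the average log2 intensity of the two channels.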

What about non-linear effects?

o Microarrays can be operated in a linear regime, where fluorescence intensity increases proportionally to target abundance (see e.g. Affymetrix dilution series).

Two reasons for non-linearity:

o At the high intensity end: saturation/quenching. This can and should be avoided experimentally, since it means loss of data!

o At the low intensity end: background offsets. Instead of y = k·x we have y = k·x + x0, and in the log-log plot this can look curvilinear. But this is an affine-linear effect and can be corrected by affine normalization. Non-parametric methods (e.g. loess) risk overfitting and loss of power.
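A numerical sketch of why an offset looks curvilinear in the log-log plot. For y = k·x + x0 the local log-log slope is k·x / (k·x + x0), which is well below 1 at low x and approaches 1 at high x; after subtracting the offset the slope is 1 everywhere. The values of `K` and `X0` and the function names are illustrative assumptions.

```python
import math

K, X0 = 2.0, 100.0  # illustrative gain and background offset


def measured(x):
    """Affine-linear response y = k*x + x0."""
    return K * x + X0


def corrected(x):
    """Response after affine normalization (offset removed)."""
    return measured(x) - X0


def loglog_slope(f, x, rel_step=1e-4):
    """Finite-difference slope of log f(x) against log x around x."""
    lo, hi = x * (1.0 - rel_step), x * (1.0 + rel_step)
    return (math.log(f(hi)) - math.log(f(lo))) / (math.log(hi) - math.log(lo))
```

Comparing `loglog_slope(measured, 10.0)` with `loglog_slope(measured, 1e6)` shows the apparent curvature, and `loglog_slope(corrected, 10.0)` shows that a single estimated offset removes it without any smoothing.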

Non-linear or affine linear?

Definitions


affine linear

genuinely non-linear

How to compare and assess different ‘normalization’ methods?

Normalization :=

1. correction for systematic experimental biases

2. provision of expression values that can subsequently be used for testing, clustering, classification, modelling…

3. provision of a measure of measurement uncertainty

Quality trade-off: the better the measurements, the less need for normalization. Need for “too much” normalization relates to a quality problem.

Variance-Bias trade-off: how do you weigh measurements that have low signal-noise ratio?

- just use anyway

- ignore

- shrink


Aesthetic criteria

Logarithm is more beautiful than arsinh

Practical criteria

It takes forever to run method XX. Referees will only accept my paper if it uses the original MAS5.

Silly criteria

The best method is the one that makes all my scatterplots look like straight, slim cigars

Physical criteria

Normalization calculations should be based on physical/chemical model

Economical/political criteria

Life would be so much easier if everybody were just using the same method, who cares which one


Comparison against a ground truth

But you have millions of numbers – need to choose the metric that measures deviation from truth.

FN/FP: do you find all the differentially expressed genes, and do you not find non-d.e. genes?

qualitative/quantitative: how well do you estimate abundance, fold-change?

Spike-In and Dilution series

… great, but how representative are they of other data?

Implicitly, from resampling / cross-validating with the actual experiment of interest

… but isn’t that too much like Münchhausen’s bootstrap?

Evaluation: a benchmark for Affymetrix GeneChip expression measures

o Data:

Spike-in series: from Affymetrix 59 x HGU95A,

16 genes, 14 concentrations, complex background

Dilution series: from GeneLogic 60 x HGU95Av2,

liver & CNS cRNA in different proportions and amounts

o Benchmark:

15 quality measures regarding




Put together by Rafael Irizarry (Johns Hopkins) http://affycomp.biostat.jhsph.edu

Quality assessment and control: an overview of some diagnostic plots and common artifacts

PCR plates

[Figure: scatterplot, colored by PCR plate. Two RZPD Unigene II filters (cDNA nylon membranes)]

PCR plates: boxplots

Array batches

Print-tip effects

[Figure axis: q (log-ratio)]

Spotting pin quality decline

after delivery of 5x10^5 spots

after delivery of 3x10^5 spots

H. Sueltmann DKFZ/MGA

Spatial effects

[Figure panels: R, Rb, R-Rb; color scale by rank. Another array: print-tip; color scale ~ log(G) and color scale ~ rank(G). Spotted cDNA arrays, Stanford-type]

Batches: array-to-array differences $d_{ij} = \mathrm{mad}_k(h_{ik} - h_{jk})$

arrays i = 1…63; roughly sorted by time
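A sketch of the array-to-array distance above: for each pair of arrays, take the median absolute deviation over probes k of the difference of transformed intensities. The unscaled MAD is used here as an assumption (some implementations multiply by a consistency constant), and `h` is a hypothetical list of per-array lists.

```python
from statistics import median


def mad(values):
    """Median absolute deviation from the median (no scaling constant)."""
    center = median(values)
    return median(abs(v - center) for v in values)


def mad_distance_matrix(h):
    """d[i][j] = mad over probes k of (h[i][k] - h[j][k])."""
    n = len(h)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diffs = [x - y for x, y in zip(h[i], h[j])]
            d[i][j] = d[j][i] = mad(diffs)
    return d
```

Displaying this matrix with arrays sorted by time, as on the slide, makes batches visible as blocks of small within-batch and larger between-batch distances.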

Density representation of the scatterplot

(76,000 clones, RZPD Unigene-II filters)

See: package hexbin; also, smoothscatter in package prada

Oligonucleotide chips

Affymetrix files

Main software from Affymetrix: MAS - MicroArray Suite.

DAT file: Image file, ~10^8 pixels, ~200 MB.

CEL file: probe intensities, ~10^6 numbers.

CDF file: Chip Description File. Describes which probes go in which probe sets (genes, gene fragments, ESTs).

1LQ file: Probe sequences and intended targets in the transcriptome.

Image analysis

DAT image files → CEL files

Each probe cell: 10x10 pixels.

Gridding: estimate location of probe cell centers.

Remove outer 36 pixels → 8x8 pixels.

The probe cell signal, PM or MM, is the 75th percentile of the 8x8 pixel values.

Background: Average of the lowest 2% probe cells is taken as the background value and subtracted.

Compute also quality values.

Data and notation

$PM_{ijg}$, $MM_{ijg}$ = intensities for perfect match and mismatch probe j for gene g in chip i

i = 1,…, n: one to hundreds of chips

j = 1,…, J: usually 11 or 16 probe pairs

g = 1,…, G: 6…30,000 probe sets

calibrate (normalize) the measurements from different chips (samples)

summarize for each probe set the probe level data, i.e., 11 PM and MM pairs, into a single expression measure.

compare between chips (samples) for detecting differential expression.

Expression measures: MAS 4.0

Affymetrix GeneChip MAS 4.0 software uses AvDiff, a trimmed mean:

o sort $d_j = PM_j - MM_j$

o exclude highest and lowest value

o J := those pairs within 3 standard deviations of the average
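One possible reading of the AvDiff steps, as a sketch: the average and standard deviation are computed after dropping the highest and lowest difference, and the final mean is taken over all pairs within 3 standard deviations of that average. Whether the set J is drawn from all pairs or only from the trimmed ones is an assumption here, and `avdiff` is a hypothetical function name, not the vendor implementation.

```python
import statistics


def avdiff(pm, mm, n_sd=3.0):
    """Trimmed mean of PM - MM differences for one probe set (needs >= 4 pairs)."""
    d = sorted(p - m for p, m in zip(pm, mm))
    trimmed = d[1:-1]  # exclude highest and lowest value
    center = statistics.fmean(trimmed)
    spread = statistics.stdev(trimmed)
    # J: pairs whose difference lies within n_sd standard deviations of the average
    kept = [v for v in d if abs(v - center) <= n_sd * spread]
    return statistics.fmean(kept)
```

With one grossly outlying probe pair, e.g. differences 100, 110, 120, 130, 140 and 990, the outlier is excluded and the result is the mean of the remaining five.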

Expression measures: MAS 5.0

Instead of MM, use "repaired" version CT

CT = MM if MM < PM

CT = PM / "typical log-ratio" if MM >= PM

"Signal" = Tukey.Biweight(log(PM - CT))

Tukey Biweight: B(x) = (1 - (x/c)^2)^2 if |x| < c, 0 otherwise
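A sketch of a one-step Tukey biweight average built from the weight function B(x). The centering on the median, the scaling by the MAD, the tuning constant `c = 5` and the small `eps` guard against a zero MAD are assumptions chosen for illustration; they are not taken from the slide.

```python
from statistics import median


def biweight(x, c):
    """B(x) = (1 - (x/c)^2)^2 if |x| < c, 0 otherwise."""
    return (1.0 - (x / c) ** 2) ** 2 if abs(x) < c else 0.0


def tukey_biweight(values, c=5.0, eps=1e-4):
    """Robust average: weights decay with the MAD-scaled distance from the median."""
    center = median(values)
    scale = median(abs(v - center) for v in values) + eps
    weights = [biweight((v - center) / scale, c) for v in values]
    total = sum(weights)
    if total == 0.0:
        return center  # degenerate case: fall back to the median
    return sum(w * v for w, v in zip(weights, values)) / total
```

In the MAS 5.0 setting the inputs would be the values log(PM - CT) of one probe set. Points further than `c` scaled units from the median get weight zero, so a single wild probe does not move the signal.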

Expression measures: Li & Wong

dChip fits a model for each gene:

$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$

$\theta_i$: expression index for the gene in chip i

$\phi_j$: probe sensitivity

The maximum likelihood estimate of the MBEI (model-based expression index) is used as the expression measure of the gene in chip i.

Need at least 10 or 20 chips.

Current version works with PMs only.

Expression measures: RMA (Irizarry et al., 2002)

o Estimate one global background value b = mode(MM). No probe-specific background!

o Assume: PM = $s_{true}$ + b. Estimate s ≥ 0 from PM and b as a conditional expectation E[$s_{true}$ | PM, b].

o Use log2(s).

o Nonparametric nonlinear calibration ('quantile normalization') across a set of chips.
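The quantile normalization step can be sketched as follows: every chip is forced to share the same empirical distribution, taken as the mean of the sorted values across chips. This is a minimal illustration that ignores ties and missing values; `quantile_normalize` is a hypothetical name, not the function of any package.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length lists (one per chip). Returns normalized copies."""
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across chips of the r-th smallest value
    reference = [
        sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda k: a[k])
        out = [0.0] * n_probes
        for rank, k in enumerate(order):
            out[k] = reference[rank]  # probe k gets the reference value of its rank
        normalized.append(out)
    return normalized
```

After this step all chips contain exactly the same set of values, only assigned to different probes, which is what makes the calibration nonparametric and nonlinear.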

AvDiff-like

with A a set of “suitable” pairs.

Li-Wong-like: additive model

Estimate RMA = $a_i$ for chip i using the robust method median polish (successively remove row and column medians, accumulate terms, until convergence). Works with d>=2
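A sketch of median polish on a probes × chips matrix of log2 intensities, following the description above: alternately sweep out row and column medians and accumulate them into row effects, column effects and an overall term. The layout (rows are probes, columns are chips) and the stopping rule are assumptions made for illustration.

```python
from statistics import median


def median_polish(x, max_iter=10, tol=1e-8):
    """Return (overall, row_effects, col_effects, residuals) for matrix x."""
    n_rows, n_cols = len(x), len(x[0])
    resid = [list(row) for row in x]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        change = 0.0
        for j in range(n_rows):  # remove row medians
            d = median(resid[j])
            resid[j] = [v - d for v in resid[j]]
            row_eff[j] += d
            change += abs(d)
        d = median(col_eff)  # move the median of column effects into the overall term
        col_eff = [c - d for c in col_eff]
        overall += d
        for i in range(n_cols):  # remove column medians
            d = median(resid[j][i] for j in range(n_rows))
            for j in range(n_rows):
                resid[j][i] -= d
            col_eff[i] += d
            change += abs(d)
        d = median(row_eff)  # move the median of row effects into the overall term
        row_eff = [r - d for r in row_eff]
        overall += d
        if change < tol:
            break
    return overall, row_eff, col_eff, resid
```

The per-chip expression value is then `overall + col_eff[i]`. For a perfectly additive matrix the residuals are all zero and the chip values are recovered exactly.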

Affymetrix: $I_{PM} = I_{MM} + I_{specific}$ ?


From: R. Irizarry et al., Biostatistics 2002


Sequence-dependent preprocessing

position- and sequence-specific effects $w_i(s)$:

Naef et al., Phys Rev E 68 (2003)



Software for pre-processing Affymetrix data

Bioconductor R package affy.

Background estimation.

Probe-level normalization.

Expression measures

Two main functions: ReadAffy, expresso.

See also: gcrma, tilingArray, vsn.



Bioinformatics and computational biology solutions using R and Bioconductor, R. Gentleman, V. Carey, W. Huber, R. Irizarry, S. Dudoit, Springer (2005).

Variance stabilization applied to microarray data calibration and to the quantification of differential expression. W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, M. Vingron. Bioinformatics 18 suppl. 1 (2002), S96-S104.

Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. R. Irizarry, B. Hobbs, F. Collins, …, T. Speed. Biostatistics 4 (2003) 249-264.

Error models for microarray intensities. W. Huber, A. von Heydebreck, and M. Vingron. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).

Differential Expression with the Bioconductor Project. A. von Heydebreck, W. Huber, and R. Gentleman. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).