Microarray normalization, error models, quality
Download
1 / 80

Microarray normalization, error models, quality - PowerPoint PPT Presentation


  • 87 Views
  • Uploaded on

Microarray normalization, error models, quality. Wolfgang Huber EMBL-EBI Brixen 16 June 2008. Oligonucleotide microarrays. Base Pairing. *. *. *. *. *. Oligonucleotide microarrays. GeneChip. Hybridized Probe Cell. Target - single stranded cDNA. Oligonucleotide probe. 5µm.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Microarray normalization, error models, quality' - jens


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Microarray normalization, error models, quality

Wolfgang Huber

EMBL-EBI

Brixen 16 June 2008




Oligonucleotide microarrays1

*

*

*

*

*

Oligonucleotide microarrays

GeneChip

Hybridized Probe Cell

Target - single stranded

cDNA

Oligonucleotide probe

5µm

Millions of copies of a specific

oligonucleotide probe molecule per patch

1.28cm

up to 6.5 Mio

different probe patches

Image of array after hybridisation and staining



Terminology for transcription arrays

Each target molecule (transcript) is represented by several oligonucleotides of (intended) length 25 bases

Probe: one of these 25-mer oligonucleotides

Perfect match (PM): A probe exactly complementary to the target sequence

Mismatch (MM): same as PM but with a single homomeric base change for the middle (13th) base (G-C, A-T)

Probe-pair: a (PM, MM) pair.

Probe set: a collection of probe-pairs (e.g. 11) targeting the same transcript

MGED/MIAME: „probe“ is ambiguous!

Reporter: the sequence

Feature: a physical patch on the array with molecules intended to have the same reporter sequence (one reporter can be represented by multiple features)

Terminology for transcription arrays


Image analysis
Image analysis oligonucleotides of (intended) length 25 bases

  • about 100 pixels per feature

  • segmentation

  • summarisation into one number representing the intensity level for this feature

  •  CEL file


M array data

samples: oligonucleotides of (intended) length 25 bases

mRNA from

tissue biopsies,

cell lines

fluorescent detection of the amount of sample-probe binding

arrays:

probes = gene-specific DNA strands

tissue A

tissue B

tissue C

ErbB2

0.02

1.12

2.12

VIM

1.1

5.8

1.8

ALDH4

2.2

0.6

1.0

CASP4

0.01

0.72

0.12

LAMA4

1.32

1.67

0.67

MCAM

4.2

2.93

3.31

marray data


Why do you need normalisation
Why do you need ‘normalisation’? oligonucleotides of (intended) length 25 bases


b oligonucleotides of (intended) length 25 basesias

From: lymphoma dataset

vsn package

Alizadeh et al., Nature 2000


Ma plot
MA-plot oligonucleotides of (intended) length 25 bases

M

A


log oligonucleotides of (intended) length 25 bases2 intensity

arrays / dyes


Non-linearity oligonucleotides of (intended) length 25 bases

log2

Cope et al. Bioinformatics 2003


Ratio compression
oligonucleotides of (intended) length 25 basesratio compression

nominal 3:1

nominal 1:1

nominal 1:3

Yue et al., (Incyte Genomics) NAR (2001) 29 e41


A complex measurement process lies between mrna concentrations and intensities
A complex measurement process lies between mRNA concentrations and intensities

The problem is less that these steps are ‘not perfect’; it is that they vary from array to array, experiment to experiment.


Why do you need statistics
Why do you need statistics? concentrations and intensities


Which genes are differentially transcribed

tumor-normal concentrations and intensities

log-ratio

Which genes are differentially transcribed?

same-same


Statistics 101
Statistics 101: concentrations and intensities

biasaccuracy

 precision variance


Basic dogma of data analysis
Basic dogma of data analysis concentrations and intensities

Can always increase sensitivity on the cost of specificity, or vice versa, the art is to

- optimize both, then

- find the best trade-off.

X

X

X

X

X

X

X

X

X


Questions

concentrations and intensities How to compare microarray intensities with each other?

 How to address measurement uncertainty (“variance”)?

 How to calibrate (“normalize”) for biases between samples?

Questions


Sources of variation

Systematic concentrations and intensities

Stochastic

o similar effect on many

measurements

o corrections can be estimated from data

o too random to be ex-plicitely accounted for

o remain as “noise”

Calibration

Error model

Sources of variation

amount of RNA in the biopsy

efficiencies of

-RNA extraction

-reverse transcription

-labeling

-fluorescent detection

probe purity and length distribution

spotting efficiency, spot size

cross-/unspecific hybridization

stray signal


Quantile normalisation
Quantile normalisation concentrations and intensities


Quantile normalisation1

Ben Bolstad 2001 concentrations and intensities

Quantile normalisation


data("Dilution") concentrations and intensities

nq = normalize.quantiles(exprs(Dilution))

nr = apply(exprs(Dilution), 2, rank)

for(i in 1:4) plot(nr[,i], nq[,i], pch=".", log="y", xlab="rank",

ylab="quantile normalized", main=sampleNames(Dilution)[i])


Quantile normalisation is: per array rank-transformation followed by replacing ranks with values from a common reference distribution


Quantile normalisation2

+ followed by replacing ranks with values from a common reference distribution Simple, fast, easy to implement

+ Always works, needs no user interaction / tuning

+ Non-parametric: can correct for quite nasty non-linearities (saturation, background) in the data

- Always "works", even if data are bad / inappropriate

- Conservative: rank transformation looses information - may yield less power to detect differentially expressed genes

- Aggressive: when there is an excess of up- (or down) regulated genes, it removes not just technical, but also biological variation

Quantile normalisation


Loess normalisation
loess normalisation followed by replacing ranks with values from a common reference distribution


Loess normalisation1

loess followed by replacing ranks with values from a common reference distribution(locally weighted scatterplot smoothing): an algorithm for robust local polynomial regression by W. S. Cleveland and colleagues (AT&T, 1980s) and handily available in R

"loess" normalisation


Local polynomial regression
Local polynomial regression followed by replacing ranks with values from a common reference distribution

bandwidth b


Robust regression
Robust regression followed by replacing ranks with values from a common reference distribution


C. Loader followed by replacing ranks with values from a common reference distribution

Local Regression and Likelihood

Springer Verlag


Loess normalisation2
loess normalisation followed by replacing ranks with values from a common reference distribution

  • local polynomial regression of M against A

  • 'normalised' M-values are the residuals

before

after


Local polynomial regression normalisation in 2 dimensions
local polynomial regression normalisation in >2 dimensions followed by replacing ranks with values from a common reference distribution


N dimensional local regression model for microarray normalisation
n followed by replacing ranks with values from a common reference distribution-dimensional local regression model for microarray normalisation

An algorithm for fitting this robustly is described (roughly) in the paper. They only provided software as a compiled binary for Windows. The method has not found much use.


Estimating relative expression fold changes
Estimating relative expression followed by replacing ranks with values from a common reference distribution(fold-changes)


Ratios and fold changes

3000 followed by replacing ranks with values from a common reference distribution

3000

x3

?

1500

200

1000

0

?

x1.5

A

A

B

B

C

C

But what if the gene is “off” (below detection limit) in one condition?

ratios and fold changes

Fold changes are useful to describe continuous changes in expression


Ratios and fold changes1
followed by replacing ranks with values from a common reference distributionratios and fold changes

The idea of the log-ratio (base 2)

0: no change

+1: up by factor of 21 = 2

+2: up by factor of 22 = 4

-1: down by factor of 2-1 = 1/2

-2: down by factor of 2-2 = ¼

A unit for measuring changes in expression: assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000.

What about a change from 0 to 500?

- conceptually

- noise, measurement precision


What is wrong with microarray data

Many data are measured in definite units: followed by replacing ranks with values from a common reference distribution

time in seconds

lengths in meters

energy in Joule, etc.

Climb Mount Plose (2465 m) from Brixen (559 m) with weight of 78 kg, working against a gravitation field of strength 9.81 m/s2 :

What is wrong with microarray data?

(2465 - 559) · 78 · 9.81 m kg m/s2

= 1 458 433 kg m2 s-2

= 1 458.433 kJ


Vsn normalisation
vsn normalisation followed by replacing ranks with values from a common reference distribution


Robust affine regression normalisation n dim with the additive multiplicative error model
Robust affine regression normalisation ( followed by replacing ranks with values from a common reference distributionn-dim.) with the additive-multiplicative error model

Error model:

Trey Ideker et al. (2000) JCB

David Rocke and Blythe Durbin:

(2001) JCB, (2002) Bioinformatics

Use for robust affine regression normalisation:

W. Huber, Anja von Heydebreck et al. (2002) Bioinformatics


b followed by replacing ranks with values from a common reference distributioni per-sample

gain factor

bk sequence-wise

probe efficiency

hik multiplicative noise

ai per-sample offset

eik additive noise

 The two component model

measured intensity = offset + gain  true abundance


The two component model

“multiplicative” noise followed by replacing ranks with values from a common reference distribution

“additive” noise

The two-component model

raw scale

log scale

B. Durbin, D. Rocke, JCB 2001


Parameterization
followed by replacing ranks with values from a common reference distributionParameterization

two practically equivalent forms (h<<1)


Variance stabilizing transformations
followed by replacing ranks with values from a common reference distributionvariance stabilizing transformations

Xu a family of random variables with E Xu = u, Var Xu = v(u). Define

Var f(Xu ) independent of u

derivation: linear approximation


Variance stabilizing transformation
followed by replacing ranks with values from a common reference distributionvariance stabilizing transformation

f(x)

x


Variance stabilizing transformations1

1.) constant variance (‘additive’) followed by replacing ranks with values from a common reference distribution

2.) constant CV (‘multiplicative’)

3.) offset

4.) additive and multiplicative

 variance stabilizing transformations


The glog transformation
followed by replacing ranks with values from a common reference distributionthe “glog” transformation

- - - f(x) = log(x)

———hs(x) = asinh(x/s)

P. Munson, 2001

D. Rocke & B. Durbin, ISMB 2002

W. Huber et al., ISMB 2002


generalized followed by replacing ranks with values from a common reference distribution

log-ratio

difference

log-ratio

variance:

constant part

proportional part

glog

raw scale

log

glog


difference red-green followed by replacing ranks with values from a common reference distribution

rank(average)


“usual” log-ratio followed by replacing ranks with values from a common reference distribution

'glog' (generalized log-ratio)

c1, c2are experiment specific parameters (~level of background noise)


Variance bias trade off
followed by replacing ranks with values from a common reference distributionVariance Bias Trade-Off

Estimated log-fold-change

log

glog

Signal intensity


Variance bias trade off and shrinkage estimators
followed by replacing ranks with values from a common reference distributionVariance-bias trade-off and shrinkage estimators

Shrinkage estimators:

a general technology in statistics:

pay a small price in bias for a large decrease of variance, so overall the mean-squared-error (MSE) is reduced.

Particularly useful if you have few replicates.

Generalized log-ratio is a shrinkage estimator for fold change


Background correction
Background correction followed by replacing ranks with values from a common reference distribution


Background correction1
Background correction followed by replacing ranks with values from a common reference distribution

0 pm

750 fm

1 pm

500 fm

Irizarry et al. Biostatistics 2003


Background correction rma irizarry et al 2002
Background correction: followed by replacing ranks with values from a common reference distributionRMA, Irizarry et al. (2002)


Background correction2
Background correction: followed by replacing ranks with values from a common reference distribution

raw intensities x

biased background correction

s=E[S|data]

unbiased background correction

s=x-b

glog2(s|data)

log2(s)

?


Comparison between rma and vsn background correction
Comparison between RMA and VSN background correction followed by replacing ranks with values from a common reference distribution


Summaries for affymetrix genechip probe sets
Summaries for Affymetrix genechip probe sets followed by replacing ranks with values from a common reference distribution


Data and notation

PM followed by replacing ranks with values from a common reference distributionikg , MMikg= Intensities for perfect match and

mismatch probe k for gene g on chip i

i = 1,…, n one to hundreds of chips

k = 1,…, J usually 11 probe pairs

g= 1,…, G tens of thousands of probe sets.

Tasks:

calibrate (normalize) the measurements from different chips (samples)

summarize for each probe set the probe level data, i.e., 11 PM and MM pairs, into a single expression measure.

compare between chips (samples) for detecting differential expression.

Data and notation


Expression measures mas 4 0

Affymetrix GeneChip MAS 4.0 software used followed by replacing ranks with values from a common reference distributionAvDiff, a trimmed mean:

o sort dk = PMk -MMk

o exclude highest and lowest value

o K := those pairs within 3 standard deviations of the average

Expression measures: MAS 4.0


Expression measures mas 5 0

Instead of MM, use "repaired" version CT followed by replacing ranks with values from a common reference distribution

CT = MM if MM<PM

= PM / "typical log-ratio" if MM>=PM

Signal = Weighted mean of the values log(PM-CT)

weights follow Tukey Biweight function

(location = data median,

scale a fixed multiple of MAD)

Expression measures MAS 5.0


Expression measures li wong

dChip followed by replacing ranks with values from a common reference distribution fits a model for each gene

where

fi : expression measure for the gene in sample i

qk : probe effect

fi is estimated by maximum likelihood

Expression measures: Li & Wong


dChip followed by replacing ranks with values from a common reference distribution

RMA

biis estimated using the robust method median polish (successively remove row and column medians, accumulate terms, until convergence).

Expression measures RMA: Irizarry et al. (2002)


However median and hence median polish is not always so robust

See also: Casneuf T. et al. (2007), In situ analysis of cross-hybridisation on microarrays and the inference of expression correlation. BMC Bioinformatics 2007;8(1): 461

However, median (and hence median polish) is not always so robust...

- median

- trimmed mean (0.15)


Probe effect adjustment by using gdna reference
Probe effect adjustment by using gDNA reference cross-hybridisation on microarrays and the inference of expression correlation. BMC Bioinformatics 2007;8(1): 461

Huber et al., Bioinformatics 2006


Genechip cross-hybridisation on microarrays and the inference of expression correlation. BMC Bioinformatics 2007;8(1): 461S. cerevisiae Tiling Array

4 bp tiling path over complete genome

(12 M basepairs, 16 chromosomes)

Sense and Antisense strands

6.5 Mio oligonucleotides

5 mm feature size

manufactured by Affymetrix

designed by Lars Steinmetz (EMBL & Stanford Genome Center)


RNA Hybridization cross-hybridisation on microarrays and the inference of expression correlation. BMC Bioinformatics 2007;8(1): 461


Before normalization cross-hybridisation on microarrays and the inference of expression correlation. BMC Bioinformatics 2007;8(1): 461


Probe specific response normali-zation cross-hybridisation on microarrays and the inference of expression correlation. BMC Bioinformatics 2007;8(1): 461

S/N

3.22

3.47

4.04

remove ‘dead’ probes

4.58

4.36


Probe-specific response normalization cross-hybridisation on microarrays and the inference of expression correlation. BMC Bioinformatics 2007;8(1): 461

siprobe specific response factor.

Estimate taken from DNA hybridization data

bi =b(si )probe specific background term. Estimation: for strata of probes with similar si, estimate b through location estimator of distribution of intergenic probes, then interpolate to obtain continuous b(s)


Estimation of cross-hybridisation on microarrays and the inference of expression correlation. BMC Bioinformatics 2007;8(1): 461b: joint distribution of (DNA, RNA) values of intergenic PM probes

unannotated transcripts

log2 RNA intensity

b(s)

background

log2 DNA intensity


After normalization cross-hybridisation on microarrays and the inference of expression correlation. BMC Bioinformatics 2007;8(1): 461


Quality assessment
Quality assessment cross-hybridisation on microarrays and the inference of expression correlation. BMC Bioinformatics 2007;8(1): 461


Quality assessment and control

arrayQualityMetrics package by Audrey Kauffmann cross-hybridisation on microarrays and the inference of expression correlation. BMC Bioinformatics 2007;8(1): 461

Show example report

Quality Assessment and Control


References

Bioinformatics and computational biology solutions using R and Bioconductor, R. Gentleman, V. Carey, W. Huber, R. Irizarry, S. Dudoit, Springer (2005).

Variance stabilization applied to microarray data calibration and to the quantification of differential expression. W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, M. Vingron. Bioinformatics 18 suppl. 1 (2002), S96-S104.

Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. R. Irizarry, B. Hobbs, F. Collins, …, T. Speed. Biostatistics 4 (2003) 249-264.

Error models for microarray intensities. W. Huber, A. von Heydebreck, and M. Vingron. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).

Normalization and analysis of DNA microarray data by self-consistency and local regression. T.B. Kepler, L. Crosby, K. Morgan. Genome Biology. 3(7):research0037 (2002)

Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. S. Dudoit, Y.H. Yang, M. J. Callow, T. P. Speed. Technical report # 578, August 2000 (UC Berkeley Dep. Statistics)

A Benchmark for Affymetrix GeneChip Expression Measures. L.M. Cope, R.A. Irizarry, H. A. Jaffee, Z. Wu, T.P. Speed. Bioinformatics (2003).

....many, many more...

References


Acknowledgements

Anja von Heydebreck (Darmstadt) and Bioconductor,

Robert Gentleman (Seattle)

Günther Sawitzki (Heidelberg)

Martin Vingron (Berlin)

Rafael Irizarry (Baltimore)

Terry Speed (Berkeley)

Judith Boer (Leiden)

Anke Schroth (Wiesloch)

Friederike Wilmer (Hilden)

Jörn Tödling (Cambridge)

Lars Steinmetz (Heidelberg)

Audrey Kauffmann (Cambridge)

Acknowledgements


ad