Microarray normalization, error models, quality

By **jens**

Oligonucleotide microarrays

GeneChip

Hybridized probe cell

Target: single-stranded cDNA

Oligonucleotide probe

5 µm

Millions of copies of a specific oligonucleotide probe molecule per patch

1.28 cm

up to 6.5 million different probe patches

Image of array after hybridisation and staining

Each target molecule (transcript) is represented by several oligonucleotides of (intended) length 25 bases

Probe: one of these 25-mer oligonucleotides

Perfect match (PM): A probe exactly complementary to the target sequence

Mismatch (MM): same as PM but with a single homomeric base change for the middle (13th) base (G↔C, A↔T)

Probe-pair: a (PM, MM) pair.

Probe set: a collection of probe-pairs (e.g. 11) targeting the same transcript

MGED/MIAME: "probe" is ambiguous!

Reporter: the sequence

Feature: a physical patch on the array with molecules intended to have the same reporter sequence (one reporter can be represented by multiple features)

Terminology for transcription arrays

Image analysis

- about 100 pixels per feature
- segmentation
- summarisation into one number representing the intensity level for this feature
- CEL file

samples: mRNA from tissue biopsies, cell lines

fluorescent detection of the amount of sample-probe binding

arrays: probes = gene-specific DNA strands

| | tissue A | tissue B | tissue C |
|---|---|---|---|
| ErbB2 | 0.02 | 1.12 | 2.12 |
| VIM | 1.1 | 5.8 | 1.8 |
| ALDH4 | 2.2 | 0.6 | 1.0 |
| CASP4 | 0.01 | 0.72 | 0.12 |
| LAMA4 | 1.32 | 1.67 | 0.67 |
| MCAM | 4.2 | 2.93 | 3.31 |

marray data

Why do you need 'normalisation'?

bias

[Figure: log2 intensity across arrays / dyes. From: lymphoma dataset, vsn package; Alizadeh et al., Nature 2000]

ratio compression

[Figure: nominal ratios 3:1, 1:1 and 1:3. Yue et al. (Incyte Genomics), NAR (2001) 29 e41]

A complex measurement process lies between mRNA concentrations and intensities

The problem is less that these steps are ‘not perfect’; it is that they vary from array to array, experiment to experiment.

Why do you need statistics?

[Figure: log-ratio, tumor-normal and same-same comparisons]

Which genes are differentially transcribed?

Basic dogma of data analysis

One can always increase sensitivity at the cost of specificity, or vice versa. The art is to

- optimize both, then

- find the best trade-off.


How to compare microarray intensities with each other?

How to address measurement uncertainty (“variance”)?

How to calibrate (“normalize”) for biases between samples?

Questions

Systematic

- similar effect on many measurements
- corrections can be estimated from data
- handled by: Calibration

Stochastic

- too random to be explicitly accounted for
- remain as "noise"
- handled by: Error model

Sources of variation

- amount of RNA in the biopsy
- efficiencies of RNA extraction, reverse transcription, labeling, fluorescent detection
- probe purity and length distribution
- spotting efficiency, spot size
- cross-/unspecific hybridization
- stray signal

Quantile normalisation

Ben Bolstad 2001

Quantile normalisation

    library(affy)            # exprs() and sampleNames() for AffyBatch objects
    library(affydata)        # example data set "Dilution"
    library(preprocessCore)  # normalize.quantiles()

    data("Dilution")
    nq = normalize.quantiles(exprs(Dilution))  # quantile-normalised intensities
    nr = apply(exprs(Dilution), 2, rank)       # rank of each probe within its array
    for(i in 1:4) plot(nr[,i], nq[,i], pch=".", log="y", xlab="rank",
        ylab="quantile normalized", main=sampleNames(Dilution)[i])

Quantile normalisation is: per-array rank transformation, followed by replacing ranks with values from a common reference distribution

+ Simple, fast, easy to implement

+ Always works, needs no user interaction / tuning

+ Non-parametric: can correct for quite nasty non-linearities (saturation, background) in the data

- Always "works", even if data are bad / inappropriate

- Conservative: rank transformation loses information and may yield less power to detect differentially expressed genes

- Aggressive: when there is an excess of up- (or down-) regulated genes, it removes not just technical but also biological variation
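The rank-then-replace idea can be sketched in a few lines of base R. This is a minimal illustration, not the `normalize.quantiles` implementation: the function name `quantile_normalise`, the toy matrix and the tie handling (`ties.method = "first"`) are assumptions made for this example, and the reference distribution is taken to be the mean of the sorted columns.

```r
# Minimal quantile normalisation sketch (assumed helper, not a package function).
quantile_normalise <- function(x) {
  # x: numeric matrix, rows = features, columns = arrays
  ranks  <- apply(x, 2, rank, ties.method = "first")  # per-array ranks
  sorted <- apply(x, 2, sort)                         # per-array sorted values
  ref    <- rowMeans(sorted)                          # common reference distribution
  out    <- apply(ranks, 2, function(r) ref[r])       # replace ranks by reference values
  dimnames(out) <- dimnames(x)
  out
}

# Toy data: 4 features on 3 arrays
x <- cbind(a = c(5, 2, 3, 4), b = c(4, 1, 4, 2), c = c(3, 4, 6, 8))
xn <- quantile_normalise(x)
xn
```

After the transformation every column contains the same set of values, only in a different order, which is why the method always "works" regardless of data quality.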

Quantile normalisation

loess normalisation

loess (locally weighted scatterplot smoothing): an algorithm for robust local polynomial regression by W. S. Cleveland and colleagues (AT&T, 1980s), handily available in R

"loess" normalisation

Local polynomial regression

bandwidth b

Robust regression

C. Loader, Local Regression and Likelihood, Springer Verlag

loess normalisation

- local polynomial regression of M against A
- 'normalised' M-values are the residuals
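As a rough sketch of these two steps, the following uses base R's `loess()` on simulated M and A values. The simulated intensity-dependent bias, the `span` of 0.5 and all variable names are assumptions chosen for illustration, not settings from the presentation.

```r
set.seed(1)
n <- 2000
A <- runif(n, 6, 14)                          # simulated average log2 intensity
M <- 0.4 * sin(A / 2) + rnorm(n, sd = 0.3)    # simulated log-ratio with intensity-dependent bias

# Robust local polynomial regression of M against A
fit <- loess(M ~ A, span = 0.5, degree = 2, family = "symmetric")

# 'Normalised' M-values are the residuals of the fit
M_norm <- residuals(fit)
```

Here `family = "symmetric"` requests the robust (re-descending M-estimator) fit, so that a moderate number of differentially expressed genes does not pull the curve.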

[Figure: before and after normalisation]

local polynomial regression normalisation in >2 dimensions

n-dimensional local regression model for microarray normalisation

An algorithm for fitting this robustly is described (roughly) in the paper. They only provided software as a compiled binary for Windows. The method has not found much use.

Estimating relative expression (fold-changes)

[Figure: bar charts of expression levels (0, 200, 1000, 1500, 3000) in conditions A, B and C, annotated with fold changes x3, x1.5 and "?"]

But what if the gene is “off” (below detection limit) in one condition?

ratios and fold changes

Fold changes are useful to describe continuous changes in expression

The idea of the log-ratio (base 2)

0: no change

+1: up by a factor of 2^1 = 2

+2: up by a factor of 2^2 = 4

-1: down by a factor of 2^-1 = 1/2

-2: down by a factor of 2^-2 = 1/4

A unit for measuring changes in expression: assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000.

What about a change from 0 to 500?

- conceptually

- noise, measurement precision
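A few lines of R make both points concrete: equal fold changes give equal log-ratios at any intensity level, and a gene that is "off" in one condition has no finite log-ratio. The variable name `off_to_on` is introduced here for illustration.

```r
# Equal fold changes map to the same log-ratio, whatever the absolute level.
log2(2000 / 1000)     # +1
log2(10000 / 5000)    # +1
log2(250 / 1000)      # -2

# A change from 0 to 500 units has no finite log-ratio.
off_to_on <- log2(500 / 0)
off_to_on             # Inf
```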

What is wrong with microarray data?

Many data are measured in definite units:

- time in seconds
- lengths in meters
- energy in joules, etc.

Climb Mount Plose (2465 m) from Brixen (559 m) with a weight of 78 kg, working against a gravitational field of strength 9.81 m/s^2:

(2465 - 559) m · 78 kg · 9.81 m/s^2

= 1 458 433 kg m^2 s^-2

= 1 458.433 kJ

vsn normalisation

Robust affine regression normalisation (n-dim.) with the additive-multiplicative error model

Error model:

Trey Ideker et al. (2000) JCB

David Rocke and Blythe Durbin:

(2001) JCB, (2002) Bioinformatics

Use for robust affine regression normalisation:

W. Huber, Anja von Heydebreck et al. (2002) Bioinformatics

b_i: per-sample gain factor

b_k: sequence-wise probe efficiency

h_ik: multiplicative noise

a_i: per-sample offset

e_ik: additive noise

The two-component model

measured intensity = offset + gain × true abundance
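The verbal equation can be written out with the symbols listed for this model. The slide's own formula did not survive extraction, so the following is a sketch of the additive-multiplicative form as it is usually stated in the vsn literature; $y_{ik}$ (measured intensity of probe $k$ in sample $i$) and $x_{ik}$ (true abundance) are names introduced here.

```latex
\begin{align*}
y_{ik} &= a_{ik} + b_{ik}\, x_{ik} \\
a_{ik} &= a_i + e_{ik}
  && \text{offset: per-sample offset plus additive noise} \\
b_{ik} &= b_i \, b_k \exp(h_{ik})
  && \text{gain: per-sample factor, probe efficiency, multiplicative noise}
\end{align*}
```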

[Figure: "multiplicative" noise and "additive" noise]

The two-component model

[Figure: raw scale and log scale. B. Durbin, D. Rocke, JCB 2001]

Parameterization

two practically equivalent forms (h << 1)

variance stabilizing transformations

X_u: a family of random variables with E X_u = u, Var X_u = v(u). Define

f(x) = ∫^x du / sqrt(v(u))

Then Var f(X_u) is approximately independent of u.

derivation: linear approximation
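The linear approximation is a first-order Taylor expansion of $f$ around $u$ (the delta method). The steps below are the standard textbook argument, written with the symbols $X_u$, $u$, $v(u)$ and $f$ used on this slide.

```latex
\begin{align*}
f(X_u) &\approx f(u) + f'(u)\,(X_u - u) \\
\operatorname{Var} f(X_u) &\approx f'(u)^2 \, \operatorname{Var} X_u = f'(u)^2 \, v(u) \\
f'(u)^2 \, v(u) = \text{const}
  &\;\Longrightarrow\; f'(u) \propto \frac{1}{\sqrt{v(u)}}
  \;\Longrightarrow\; f(x) = \int^{x} \frac{du}{\sqrt{v(u)}}
\end{align*}
```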

variance stabilizing transformation

[Figure: f(x) plotted against x]

1.) constant variance ('additive')

2.) constant CV (‘multiplicative’)

3.) offset

4.) additive and multiplicative
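For each of the four cases, the variance function and the transformation that stabilises it can be worked out from $f(x) = \int^x du / \sqrt{v(u)}$. The formulas on the slide were lost, so this is a reconstruction under the assumption that $s$ denotes the additive standard deviation and $u_0$ an offset; both names are introduced here.

```latex
\begin{align*}
\text{1.)}\quad & v(u) = s^2 & f(u) &\propto u \\
\text{2.)}\quad & v(u) \propto u^2 & f(u) &\propto \log u \\
\text{3.)}\quad & v(u) \propto (u + u_0)^2 & f(u) &\propto \log(u + u_0) \\
\text{4.)}\quad & v(u) \propto (u + u_0)^2 + s^2 & f(u) &\propto \operatorname{arsinh}\!\left(\frac{u + u_0}{s}\right)
\end{align*}
```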

variance stabilizing transformations

the "glog" transformation

[Figure legend: dashed line f(x) = log(x); solid line h_s(x) = asinh(x/s)]

P. Munson, 2001

D. Rocke & B. Durbin, ISMB 2002

W. Huber et al., ISMB 2002
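The two curves can be compared numerically. In this sketch `glog` is a helper defined here and `s = 100` is an arbitrary scale chosen for illustration; it shows that asinh(x/s) stays finite at zero and negative values, and that for x much larger than s it differs from log(x) only by the constant log(2/s).

```r
# Generalised log, h_s(x) = asinh(x / s); s is a hypothetical scale parameter.
glog <- function(x, s) asinh(x / s)

s <- 100
x <- c(-50, 0, 50, 1000, 100000)

glog_x <- glog(x, s)                  # finite for all x, including x <= 0
log_x  <- suppressWarnings(log(x))    # NaN for x < 0, -Inf for x = 0

# For x >> s: asinh(x / s) is close to log(x) + log(2 / s)
gap <- glog(1e5, s) - (log(1e5) + log(2 / s))
```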

[Figure: difference, log-ratio and generalized log-ratio; variance (constant part and proportional part) on the raw scale, log scale and glog scale]

[Figure: difference red-green against rank(average)]

"usual" log-ratio

'glog' (generalized log-ratio)

c1, c2 are experiment-specific parameters (~ level of background noise)
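The two quantities can be written side by side. The slide's formulas were images and are not in the transcript, so this is the form commonly given in the vsn publications, stated here as an assumption; $x_1$ and $x_2$ denote the two intensities being compared.

```latex
\begin{align*}
\text{log-ratio:} \quad & q = \log \frac{x_1}{x_2} \\
\text{glog-ratio:} \quad & h = \log \frac{x_1 + \sqrt{x_1^2 + c_1^2}}{x_2 + \sqrt{x_2^2 + c_2^2}}
\end{align*}
```

When $x_1 \gg c_1$ and $x_2 \gg c_2$, the square roots are approximately $x_1$ and $x_2$, and $h$ reduces to $q$.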

Variance Bias Trade-Off

[Figure: estimated log-fold-change against signal intensity, for log and glog]

Variance-bias trade-off and shrinkage estimators

Shrinkage estimators: a general technique in statistics. Pay a small price in bias for a large decrease in variance, so that overall the mean squared error (MSE) is reduced.

Particularly useful if you have few replicates.

The generalized log-ratio is a shrinkage estimator for fold change.

Background correction

[Figure: panels labelled 0 pM, 500 fM, 750 fM and 1 pM. Irizarry et al., Biostatistics 2003]

Background correction: RMA, Irizarry et al. (2002)

Background correction:

raw intensities x

- biased background correction: s = E[S | data], followed by log2(s)
- unbiased background correction: s = x - b, followed by glog2(s | data)

Comparison between RMA and VSN background correction

Summaries for Affymetrix genechip probe sets

PM_ikg, MM_ikg = intensities for perfect match and mismatch probe k for gene g on chip i

- i = 1,…, n: one to hundreds of chips
- k = 1,…, J: usually 11 probe pairs
- g = 1,…, G: tens of thousands of probe sets

Tasks:

- calibrate (normalize) the measurements from different chips (samples)
- summarize, for each probe set, the probe-level data (i.e., 11 PM and MM pairs) into a single expression measure
- compare between chips (samples) to detect differential expression

Data and notation

Affymetrix GeneChip MAS 4.0 software used AvDiff, a trimmed mean:

- sort d_k = PM_k - MM_k
- exclude highest and lowest value
- K := those pairs within 3 standard deviations of the average
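The three steps translate directly into R. This is a sketch, not the MAS 4.0 code: the function name `avdiff` and the toy intensities are made up, and it assumes that the average and standard deviation are computed after the highest and lowest differences have been dropped, which the slide does not specify.

```r
# AvDiff sketch: trimmed mean of PM - MM differences for one probe set.
avdiff <- function(pm, mm) {
  d <- sort(pm - mm)                      # sort d_k = PM_k - MM_k
  d <- d[-c(1, length(d))]                # exclude lowest and highest value
  keep <- abs(d - mean(d)) <= 3 * sd(d)   # K: within 3 SD of the average
  mean(d[keep])
}

# Toy probe set with 11 probe pairs, one of them an outlier
pm <- c(120, 340, 560, 410, 290, 380, 450, 300, 5000, 330, 360)
mm <- c(100, 150, 200, 180, 140, 160, 210, 130,  400, 170, 150)
avdiff(pm, mm)
```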

Expression measures: MAS 4.0

Instead of MM, use "repaired" version CT:

- CT = MM if MM < PM
- CT = PM / "typical log-ratio" if MM >= PM

Signal = weighted mean of the values log(PM - CT); weights follow the Tukey biweight function (location = data median, scale = a fixed multiple of MAD)
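A one-step Tukey biweight mean can be sketched as follows. The function name `tukey_biweight`, the tuning constant `tune = 5` and the small `eps` guarding against a zero MAD are assumptions for this illustration; the slide only says that the scale is "a fixed multiple of MAD". The input would be the values log(PM - CT) of one probe set.

```r
# One-step Tukey biweight mean (sketch; constants are assumptions).
tukey_biweight <- function(x, tune = 5, eps = 1e-4) {
  m <- median(x)                               # location: data median
  s <- median(abs(x - m))                      # unscaled MAD
  u <- (x - m) / (tune * s + eps)              # standardised distance from the median
  w <- ifelse(abs(u) <= 1, (1 - u^2)^2, 0)     # biweight: zero weight for far outliers
  sum(w * x) / sum(w)
}

vals <- c(1, 2, 3, 4, 100)    # toy values with one gross outlier
tb <- tukey_biweight(vals)
tb                            # stays near the bulk of the data, unlike mean(vals)
```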

Expression measures MAS 5.0

dChip fits a model for each gene, where

- f_i: expression measure for the gene in sample i
- q_k: probe effect

f_i is estimated by maximum likelihood
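The model equation itself is missing from the transcript. A sketch consistent with the symbols listed here, and with the multiplicative model published by Li and Wong, is given below; the error term $e_{ik}$ and its variance $\sigma^2$ are names assumed for this illustration.

```latex
\begin{equation*}
PM_{ik} - MM_{ik} = f_i \, q_k + e_{ik}, \qquad e_{ik} \sim N(0, \sigma^2)
\end{equation*}
```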

Expression measures: Li & Wong

dChip

RMA

b_i is estimated using the robust method median polish (successively remove row and column medians, accumulate terms, until convergence).
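Median polish is available in base R as `medpolish()`. The sketch below applies it to a simulated probes-by-chips matrix of log2 intensities with one outlying probe value; the simulated chip and probe effects and all variable names are assumptions, and the example covers only the summarisation step, not the background correction or normalisation that RMA applies first.

```r
chip  <- c(8, 9, 10, 11, 12)         # simulated log2 expression per chip
probe <- c(-1, 0, 0.5, 2, -0.5)      # simulated probe effects (median 0)
x <- outer(probe, chip, "+")         # rows = probes, columns = chips
x[2, 1] <- x[2, 1] + 6               # one outlying probe intensity

# Successively remove row and column medians until convergence
fit <- medpolish(x, trace.iter = FALSE)

expr_medpolish <- fit$overall + fit$col   # per-chip expression estimate
expr_mean      <- colMeans(x)             # non-robust alternative for comparison
```

In this toy example the median polish estimate recovers the simulated chip effects despite the outlier, whereas the column mean of the first chip is pulled upwards.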

Expression measures RMA: Irizarry et al. (2002)

See also: Casneuf T. et al. (2007), In situ analysis of cross-hybridisation on microarrays and the inference of expression correlation. BMC Bioinformatics 2007;8(1): 461

However, median (and hence median polish) is not always so robust...

- median

- trimmed mean (0.15)

Probe effect adjustment by using gDNA reference

Huber et al., Bioinformatics 2006

Genechip S. cerevisiae Tiling Array

4 bp tiling path over complete genome

(12 M basepairs, 16 chromosomes)

Sense and Antisense strands

6.5 million oligonucleotides

5 µm feature size

manufactured by Affymetrix

designed by Lars Steinmetz (EMBL & Stanford Genome Center)

RNA Hybridization

Before normalization

Probe specific response normalization

[Figure: S/N values 3.22, 3.47, 4.04, 4.58 and 4.36; annotation: remove 'dead' probes]

Probe-specific response normalization

- s_i: probe-specific response factor. Estimate taken from DNA hybridization data.
- b_i = b(s_i): probe-specific background term. Estimation: for strata of probes with similar s_i, estimate b through a location estimator of the distribution of intergenic probes, then interpolate to obtain a continuous b(s).

Estimation of b: joint distribution of (DNA, RNA) values of intergenic PM probes
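The normalisation formula is not in the transcript. One form consistent with these definitions, in which the background term is subtracted and the result is divided by the response factor, is sketched below; $y_i$ (raw RNA intensity of probe $i$) and $q_i$ (normalised intensity) are names introduced here, and the exact expression on the slide may differ.

```latex
\begin{equation*}
q_i = \frac{y_i - b(s_i)}{s_i}
\end{equation*}
```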

[Figure: log2 RNA intensity against log2 DNA intensity, showing b(s), background and unannotated transcripts]

After normalization

Quality assessment

arrayQualityMetrics package by Audrey Kauffmann

Show example report

Quality Assessment and Control

- R. Gentleman, V. Carey, W. Huber, R. Irizarry, S. Dudoit. Bioinformatics and computational biology solutions using R and Bioconductor. Springer (2005).
- W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, M. Vingron. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18 suppl. 1 (2002), S96-S104.
- R. Irizarry, B. Hobbs, F. Collins, …, T. Speed. Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics 4 (2003), 249-264.
- W. Huber, A. von Heydebreck, M. Vingron. Error models for microarray intensities. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & Sons (2005).
- T.B. Kepler, L. Crosby, K. Morgan. Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biology 3(7):research0037 (2002).
- S. Dudoit, Y.H. Yang, M.J. Callow, T.P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical report #578, August 2000 (UC Berkeley Dep. Statistics).
- L.M. Cope, R.A. Irizarry, H.A. Jaffee, Z. Wu, T.P. Speed. A Benchmark for Affymetrix GeneChip Expression Measures. Bioinformatics (2003).
- ....many, many more...

References

Anja von Heydebreck (Darmstadt)

Robert Gentleman (Seattle)

Günther Sawitzki (Heidelberg)

Martin Vingron (Berlin)

Rafael Irizarry (Baltimore)

Terry Speed (Berkeley)

Judith Boer (Leiden)

Anke Schroth (Wiesloch)

Friederike Wilmer (Hilden)

Jörn Tödling (Cambridge)

Lars Steinmetz (Heidelberg)

Audrey Kauffmann (Cambridge)

Acknowledgements