From colored speckles to expression values. o quality control / optimization o calibration and error modeling o data transformations o software Wolfgang Huber Dep. of Molecular Genome Analysis DKFZ Heidelberg. samples: mRNA from tissue biopsies, cell lines. probes:
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
o quality control / optimizationo calibration and error modelingo data transformationso software
Dep. of Molecular Genome
Slide: 25x75 mm
4x4 or 8x4 sectors
17...38 rows and columns per sector
Spot-to-spot: ca. 150-350 mm
to one print-tip
sample: RNA (cDNA) hybridized to the array, aka target, mobile substrate.
probe: DNA spotted on the array, aka spot, immobile substrate.
sector: rectangular matrix of spots printed using the same print-tip (or pin)
plate: set of 384 (768) spots printed with DNA from the same microtitre plate of clones
channel: data from one color (Cy3 = cyanine 3 = green, Cy5 = cyanine 5 = red).
batch: collection of microarrays with the same probe layout and quality
1. Addressing. Estimate location of spot centers.
2. Segmentation. Classify pixels as foreground (signal) or background.
Main software from Affymetrix:
MAS - MicroArray Suite.
DAT file: Image file, ~107 pixels, ~50 MB.
CEL file: probe intensities, ~400k numbers
CDF file: Chip Description File. Describes which probes go in which probe sets (genes, gene fragments, ESTs).
DAT image files CEL files
Each probe cell: 10x10 pixels.
Gridding: estimate location of cell centers.
Remove outer 36 pixels 8x8 pixels.
Probe cell signal, PM or MM, is the 75th percentile of the 8x8 pixel values.
Background: Average of the lowest 2% probe cells is taken as the background value and subtracted.
Compute also quality values.
PMijg, MMijg= Intensity for perfect match and mismatch probe j for gene g in chip i.
i = 1,…, n one to hundreds of chips
j = 1,…, J usually 16 or 20 probe pairs
g= 1,…, G 8…20,000 probe sets.
calibrate (normalize) the measurements from different chips (samples)
summarize for each probe set the probe level data, i.e., 20 PM and MM pairs, into a single expression measure.
compare between chips (samples) for detecting differential expression.
n one-color arrays (Affymetrix, nylon)
two-color spotted arrays
The point is not that these steps are ‘not perfect’; it is that they may vary from array to array, experiment to experiment.
o similar effect on many
o corrections can be estimated from data
o too random to be ex-plicitely accounted for
Error modelSources of variation
amount of RNA in the biopsy
aik, bik all unknown: need to approximately determine from data
quantity of interest
No non-linear terms:
Saturation can be avoided in the experiments.
Data from well-performed experiments shows no evidence for gross deviations from affine linearity.
The flexibility (and complexity) lies in the number of parameters: twice the number of data points!
parameters must not all be independent
make simplifiying assumptions, reduce the number of independent parameters to manageable size
Quality control: verify that assumptions hold for the data at hand
Normalization, calibration: estimate parameters for the data at hand
Error modeling: control the error bars both of measured yik and of estimated normalization parameters aik,bik
log hik ~ N(0,s22)
ai per-sample offset
Lik local background provided by image analysis
eik ~ N(0, bi2s12)
A typical set of assumptions
Sequence-wise factors bkneed not be explicitely determined if only interested in relative expression levels
The assumptions bring down the number of parameters to 2*d - the rest is modeled as noise.
Array calibration terms ai, bi same for all probes on array - could extend to include print-tip or plate effects
Probe affinities bk same for all arrays - could extend to include batch effects
=parameter estimation of model parameters
o data heteroskedastic variance stabilizing data transformation
o maximum likelihood estimator
o model holds for genes that are unchanged; differentially transcribed genes act as outliers.
o heavy-tailed data distributions
robust variant of ML estimator, à la Least Trimmed Sum of Squares regression.
o works up to breakdown point of 50%
o average over multiple probes
e.g. Affymetrix GeneChip MAS 4.0: trimmed mean
o sort dj = PMj -MMj
o exclude highest and lowest value
o J := those pairs within 3 standard deviations of the average
Instead of MM, use "repaired" version CT
CT = MM if MM<PM
= PM / "typical log-ratio" if MM>=PM
Tukey Biweight: B(x) = (1 – x2/c2)2 if |x|<c, 0 otherwise
dChip fits a model for each gene
o estimate qi by maximum likelihood.
o need at least 10 or 20 chips.
Current version works with PMs only.
o Estimate one global background value b=mode(MM). Otherwise ignore MM.
o Assume: PM = Ytrue + b
Estimate Y0 from PM and b as a conditional expectation E[strue|PM, b].
o Use log2(Y).
o Nonparametric nonlinear calibration ('quantile normalization') across a set of chips.
AvDiff – type:
with A a set of “suitable” pairs.
Li-Wong – type:
robust estimation of ai: median polish (iterative: successively remove row and column medians and accumulate terms until the process stabilizes).
affy Affymetrix pre-processing
marrayInput two-color slides
vsn calibration and variance
o open-source („reproducible research“)
o based on R (biggest statistics library in the
verify the assumptions that underly the
visualization, diagnostic plots
some examples what can go wrong
usual assumption for Affymetrix arrays:
PM intensity =
+ specific part (from labeled sample cDNA)
From: R. Irizarry et al., Biostatistics 2002
Scatterplot, colored by PCR-plate
Two RZPD Unigene II filters (cDNA nylon membranes)
after delivery of 5x105 spots
after delivery of 3x105 spots
H. Sueltmann DKFZ/MGA
R Rb R-Rbcolor scale by rank
another array: print-tip
color scale ~ log(G)
color scale ~ rank(G)
spotted cDNA arrays, Stanford-type
arrays i=1…63; roughly sorted by time
(76,000 clones, RZPD Unigene-II filters)
Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. YH Yang, S Dudoit, P Luu, DM Lin, V Peng, J Ngai and TP Speed. Nucl. Acids Res. 30(4):e15, 2002.
Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression. W.Huber, A.v.Heydebreck, H.Sültmann, A.Poustka, M.Vingron. Bioinformatics, Vol.18, Supplement 1, S96-S104, 2002.
A Variance-Stabilizing Transformation for Gene Expression Microarray Data. BP Durbin, JS Hardin, DM Hawkins, DM Rocke. Bioinformatics, Vol.18, Suppl. 1, S105-110.
Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. RA Irizarry, B Hobbs, F Collin, YD Beazer-Barclay, KJ Antonellis, U Scherf, TP Speed (2002). Biostatistics. http://biosun01.biostat.jhsph.edu/~ririzarr/papers/index.html
A more complete list of references is in:
Elementary analysis of microarray gene expression data. W. Huber, A. von Heydebreck, M. Vingron, manuscript.
MPI Molekulare Genetik
Anja von Heydebreck