Henrik Bengtsson hb@maths.lth.se Mathematical Statistics, Centre for Mathematical Sciences,

Statistical analysis of microarray data Henrik Bengtsson hb@maths.lth.se Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden Bioinformatics 2, Bioinformatics Centre, University of Copenhagen. February 19, 2004

Outline Part I – (Very short) Background • Central Dogma of Biology • Idea behind the microarray technology • Experimental design Part II - Printing, Hybridization, Scanning & Image Analysis • From clone to slide • From samples to hybridization • From scanning to raw data Part III - Preprocessing of data (IMPORTANT) • Exploratory data analysis and log transforms • Background correction • Normalization Part IV - Identifying differentially expressed genes • Cut-off by log-ratios values, the t-statistics and cut-off by T values • Multiple testing, adjusting the p-values • Validation

The cDNA microarray technology-PART I: (Very short) Background • The Central Dogma of Biology • Idea behind the microarray technology • Experimental design

The Central Dogma of Biology Idea of microarrays (and other gene expression techniques): Measure the amount of mRNA to find genes that are expressed (active). (measuring protein might be better, but is currently much harder)

History cDNA microarrays have evolved from Southern & Northern blots, with clone libraries gridded out on nylon membrane filters being an important and still widely used intermediate. Things took off about 1995 with the introduction of non-porous solid supports, such as glass - these permitted miniaturization - and fluorescence based detection. Terry Speed et al.

The cDNA Microarray Technique • Put a large number of DNA sequences or synthetic DNA oligomers onto a glass slide • - 5000-50000 gene expressions at the same time. • Measure amounts of cDNA bound to each spot • Identify genes that behave differently in different cell populations- tumor cells vs healthy cells- brain cells vs liver cells- same tissue different organisms • Time series experiments- gene expressions over time after treatment, e.g. T=5,10,30,... min vs T=0. • ...

Experimental Design • Experimental design is very important.Example: Assume we have 8 control (C1, ... C8) and 8 reference (R1, ..., R8) samples, should we comparea) C1-R1, ..., C8-R8, or b) C1-R’, ..., C8-R’ where R’ is a pool of all R, or c) something else? • Wrong design can be very wrong and the correct can be very efficient and require less arrays (= less lab work and cheaper experiments!). • Statistical theory exists and is helpful. Recently, statisticans has started to look at the microarray design problem in more detail. Progress will for be made sure! Experience from previous setups is also of great value. • We will not discuss this problem anymore in this talk.

The cDNA microarray technology-PART II: Printing, Hybridization, Scanning & Image Analysis • From clone to slide • From samples to hybridization • From scans to raw data

cDNA clones excitation red laser green laser PCR product amplification purification emission Reference sample Test sample printing RNA RNA cDNA cDNA overlay images Hybridize Overview scanning Production data: (Rfg,Gfg,Rbg,Gbg, ...)

Printing / spotting Arrayer (approx 100,000 EUR)

Microarray slide preparation J. Vallon-Christersson,Dept Oncology, Lund Univ.

Terminology: probe and target • As defined in Nature 1999: • The probes are the immobilized DNA sequences spotted on the array, i.e. spot, oligo, immobile substrate. • The targets are the labeled cDNA sequences to be hybridized to the array, i.e. mobile substrate. The opposite usage can also be seen in some references. However, think of probes as the measuring device (which you can buy), and the targets (that you provide) as what you want to measure.

Reference sample Tumor sample RNA RNA cDNA cDNA Hybridize RNA extraction & hybridization (targets) • Extract mRNA from samples • Reverse transcription of mRNA to cDNA in presence of aminoallyl-dUTP (aa-dUTP) • Label with Cy3 or Cy5 fluorescent dye (links to aa-dUTP) • Hybridize labeled cDNA to slide • Wash slide Preparation incl. array (approx 100-200 EUR/array) (probes) Figure: Hybridization chamber.

References Original cDNA microarray paper: • Mark Schena, Dari Shalon, Ronald W. Davis, and Patrick O. Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270(5235):467–470, October 1995. General: • Mark Schena. Microarrays Analysis. John Wiley & Sons, Inc., Hoboken, New Jersey, 2003. • David J. Duggan, Michael Bittner, Yidong Chen, and Paul Meltzer & Jeffrey M. Trent. Expression profiling using cDNA microarrays. Nature Genetics, 21(1 Supplement):10–14, January 1999.

Scanning Terry Speed et al.

Software (approx 0-5,000 EUR) Scanner (approx 50,000-100,000 EUR)

excitation red laser green laser emission overlay images Two-channel scanning  higher frequency, more energy  lower frequency, less energy

Combined color image for visualization

Image Analysis Terry Speed et al.

Signal quantification • AddressingLocate spot centers. • SegmentationClassification of pixels either as signal or background (using circles, seeded region growing or other). • Signal quantificationa) foreground estimatesb) background estimatesc) ... (shape, size etc) Terry Speed et al.

Does one size fit all? Terry Speed et al.

Segmentation and quantification of spot regions Seeded region growing Fixed-sized or adaptive-sized circles Inside the boundaries is foreground (the spots) and outside is the background. We estimate both... Terry Speed et al.

Robust signal estimates:mean vs median • When a statistician talks about a robust measure he/she normally means a measure that is not sensitive to outliers, e.g.x = (8, 85, 7, 9, 5, 4, 13, 6, 8) • The mean of all x’s, i.e. (x1+x2+...+xK)/K, will be affected if one of the measures, say x2, is wrong;mean(x) = 16.11 • The median of all x’s, i.e. the middle value of (x1+x2+...+xK), will not be affected if a “few” (< 50%) are wrong; median(x)= 8.0 For this reason we prefer to use median instead of the mean in cases were we expect artifacts. (If there are a lot of measurements and the errors are symmetrically distributed the median will give the same result as the mean without outliers.)

Typical data output

cDNA clones excitation red laser green laser PCR product amplification purification emission Reference sample Test sample printing RNA RNA cDNA cDNA overlay images Hybridize Summary scanning Production data: (Rfg,Gfg,Rbg,Gbg, ...)

References Image analysis: • Yee Hwa Yang, Michael Buckley, Sandrine Dudoit, and Terry Speed. Comparison of methods for image analysis on cDNA microarray data. Technical Report 584, Department of Statistics, University of California at Berkeley, Nov 2000. • Anders Bengtsson. Microarray image analysis: Background estimation using region and filtering techniques. Master’s Theses in Mathematical Sciences, Mathematical Statistics, Centre for Mathematical Sciences, Lund Institute of Technology, Sweden, December 2003. 2003:E40.

The cDNA microarray technology-PART III: Preprocessing of data (IMPORTANT) • Exploratory Data Analysis and log-transforms • Background correction • Normalization

Exploratory Data Analysis Terry Speed et al.

Scatter plot: R vs G non-differentially expressed genes along the diagonal: R = G up-regulated genes Most genes have low gene expression levels. What happens here? down-regulated genes “Observed” data {(R,G)}n=1..5184: R = signal in red channel, G = signal in green channel

Scatter plot: log2R vs log2G non-differentially expressed genes are still along the diagonal: log2R = log2G up-regulated genes Low gene expression levels are “blown up”. down-regulated genes “Observed” data {(log2R,log2G)}n=1..5184: R = signal in red channel, G = signal in green channel

Scatter plot: M vs A (recommended) up-regulated genes Note: M vs A is basically a rotation of the log2R vs log2G scatter plot. non-differentially expressed genes are now along the horizontal line: M = 0  log2R - log2G = 0 Why: Now the quantity of interest, i.e. the fold change, is contained in one variable, namely M! If M > 0, up-regulated. If M < 0, down-regulated. down-regulated genes Transformed data {(M,A)}n=1..5184: M = log2(R) - log2(G) (minus) A = ½·[log2(R) + log2(G)] (add)

Details on M vs A Log-ratios: M = log2(R) – log2(G) = [logarithmic rules] = log2(R/G) Average log-intensities: A = ½·[log2(R) + log2(G)] = [logarithmic rules] = ½·log2(R·G) There is a one-to-one relationship between (M,A) and (R,G): R=(22A+M)1/2, G=(22A-M)1/2

More on why log and why M vs A? • It makes the distribution symmetric around zero  • Logs stretch out region we are most interested in and makes the distribution more normal.  • Easier to see artifacts of the data, .e.g. as intensity dependent variation and dye-bias.  Before: After: • Log base 2 because the raw data is binary data (max intensity is 216-1 = 65535). It is also naturally to think of 2-, 4-, 8-fold etc up and down regulated genes. For the actual analysis, any log-base will do.

Summary of (R,G)  (log2R,log2G)  (M,A) log(R) vs log(G) M vs A R vs G R= red channel signalG = green channel signal M = log2(R/G) (log-ratio), A = ½·log2(R·G) (log-intensity) plot(rg, log=FALSE); plot(rg); ma <- as.MAData(rg); plot(ma)

Density plots of all signals Example: Signals in the red channel seem to be slightly weaker than the signals in the green channel, compare with M vs A plot: ...more green. log2R= red channel signallog2G = green channel signal plotDensity(rg)

Spatial plot of log-ratios (M values) Print-tip 1 Print-tip 8 Print-tip 9 Print-tip 16 plotSpatial(ma)

Print-tip box plot of log-ratios 1 16 boxplot(ma, groupBy=“printtip”)

Printing order of spots 1 Hmm... why the horizontal stripes?  16 6384 spots printed onto 9 slides in total 399 print turns using 4x4 print-tips...

Print-order plot of log-ratios The spots are order according to when they were spotted/dipped onto the glass slide(s). Note that it takes hours/days to print all spots on all arrays. plotPrintorder(ma)

Print-dip plot of log-ratios Average values of the 16 log-ratios at each dip from each of the 399 print turns. plotDiporder(ma)

References Exploratory data analysis for microarrays: • Yee Hwa Yang, Sandrine Dudoit, Percy Luu, and Terence P Speed. Normalization for cDNA microarray data. In Michael L. Bittner, Yidong Chen, Andreas N. Dorsel, and Edward R. Dougherty, editors, Proceedings of SPiE, volume 4266 of Microarrays: Optical Technologies and Informatics, pages 141–152, San Jose, California, June 2001. The International Society for Optical Engineering. • Henrik Bengtsson. Identification and normalization of plate effects in cDNA microarray data. Preprints in Mathematical Sciences 2002:28, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2002. • Gordon Smyth and Terry Speed, METHODS: Selecting Candidate Genes from DNA Array Screens, Dec 2003

Background correction and background estimates revisited... Terry Speed et al.

Alt. 1: Local background estimates One for the red and one for the green channel. Axon GenePix Pro - size of bg region depends on size of the spot. Give too much variation! Fixed-size regions - same number of background pixels give same variation for all estimate. A. Bengtsson, Lund University, 2003.

Alt. 2: Filtered background estimates One for the red and one for the green channel. In general, filter by moving a window (mask) over the image: 1) At each position replace the value in the center (red dot) by, say, the median pixel intensity of all spots in the window. 2) Move to next pixel and so on. 3) Result: (more sophisticated filters than the median filter are used in practice) Idea: With filters we do not have to know the where the spots are. More robust. Also known as: Morphological filters A. Bengtsson, Lund University, 2003.

Background subtraction • The foreground region of the spots are assumed to be contaminated with background noise/signal. • Sources of background: probe unspecifically sticking on slide, irregular / dirty slide surface, dust, noise in the scanner measurement. • From the image analysis we have foreground and background estimates for each spot in both channels: (Rfg, Rbg, Gfg, Gbg) • For each spot on the slide we do background subtraction Red intensity: R = Rfg – Rbg Green intensity: G = Gfg - Gbg • Not considered: real cross-hybridisation and unspecific hybridisation to the probe. Still not known if this is a real problem. Example of an extremely bad array. Terry Speed et al.

Which background estimate should we use? Axon GenePix Pro - size of bg region depends on size of the spot. Give too much variation! Filtered estimates - no spot segmentation. Fixed-size regions - same number of background pixels give same variation for all estimate. Fixed-sized are better than variable-sized region methods, but filtered estimates may be even better. Should we do background correction at all? (A. Bengtsson, 2003) is a good start, but more research is needed.

Background correction or not? background subtracted non-background subtracted M = log2(Rfg/Gfg) Mbg = log2({Rfg-Rbg} / {Gfg-Gbg}) -seems better, but...

Background problem still not solved! Foreground & background Background subtracted GenePix background: -Curvature in the other direction now! Too much background correction?! Spotbackground:(morphological opening) -Still curvature left. Too little background correction?!

References Background estimation and correction: • Yee Hwa Yang, Michael Buckley, Sandrine Dudoit, and Terry Speed. Comparison of methods for image analysis on cDNA microarray data. Technical Report 584, Department of Statistics, University of California at Berkeley, Nov 2000. • Anders Bengtsson. Microarray image analysis: Background estimation using region and filtering techniques. Master’s Theses in Mathematical Sciences, Mathematical Statistics, Centre for Mathematical Sciences, Lund Institute of Technology, Sweden, December 2003. 2003:E40. • Henrik Bengtsson, Göran Jönsson, and Johan Vallon-Christersson. Calibration and assessment of channel-specific biases in microarray data with extended dynamical range. Preprints in Mathematical Sciences 2003:37, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2003. • Charles Kooperberg, Thomas G. Fazzio, Jeffrey J. Delrow, and Thoshio Tsukiyama. Improved background correction for spotted DNA microarrays. Journal of Computational Biology, 9:55–66, 2002.

Normalization Terry Speed et al.

Henrik Bengtsson hb@maths.lth.se Mathematical Statistics, Centre for Mathematical Sciences,

Henrik Bengtsson hb@maths.lth.se Mathematical Statistics, Centre for Mathematical Sciences,

Presentation Transcript

Henrik Bengtsson hb@maths.lth.se (MSc Computer Science, PhD candidate in Statistics) Mathematical Statistics Centre for

MATHEMATICAL SCIENCES

Henrik Bengtsson hb@maths.lth.se Mathematical Statistics, Centre for Mathematical Sciences

Henrik Bengtsson hb@maths.lth.se Mathematical Statistics, Centre for Mathematical Sciences

Mathematical Statistics

Mathematical Statistics

MATHEMATICAL SCIENCES

MATHEMATICAL SCIENCES

NSF Division of Mathematical Sciences Statistics Program

Henrik Bengtsson hb@maths.lth.se Mathematical Statistics Centre for Mathematical Sciences

MATHEMATICAL SCIENCES

MATHEMATICAL SCIENCES

MATHEMATICAL SCIENCES

MATHEMATICAL SCIENCES

MATHEMATICAL SCIENCES

MATHEMATICAL SCIENCES

MATHEMATICAL SCIENCES

Mathematical Statistics

African Institute for Mathematical Sciences

Centre for Mathematical Sciences Cambridge 5-16 September 2011

MATHEMATICAL SCIENCES

MATHEMATICAL SCIENCES