High-dimensional data analysis: Microarrays and multiple testing

High-dimensional data analysis: Microarrays and multiple testing. Mark van de Wiel (1, 2). 1. Dep. of Mathematics, VU University Amsterdam. 2. Dep. of Biostatistics & Dep. of Pathology, VU University medical center, Amsterdam.


Presentation Transcript


  1. High-dimensional data analysis: Microarrays and multiple testing. Mark van de Wiel (1, 2). 1. Dep. of Mathematics, VU University Amsterdam. 2. Dep. of Biostatistics & Dep. of Pathology, VU University medical center, Amsterdam.

  2. Genomics: a short history (1) Some history: 1. Watson & Crick: double helix structure of DNA (1953). Source: http://ghr.nlm.nih.gov/handbook/illustrations/

  3. Genomics: a short history (2) 2. Human Genome Project: identification of all 20,000-25,000 human genes (1990-2003). THE WHITE HOUSE, Office of the Press Secretary. For Immediate Release, June 25, 2000: PRESIDENT CLINTON ANNOUNCES THE COMPLETION OF THE FIRST SURVEY OF THE ENTIRE HUMAN GENOME. Hails Public and Private Efforts Leading to This Historic Achievement. June 26, 2000: Today, at a historic White House event with British Prime Minister Tony Blair, President Clinton announced that the international Human Genome Project and Celera Genomics Corporation have both completed an initial sequencing of the human genome -- the genetic blueprint for human beings.

  4. Genomics: a short history (3) 3a. 1961: DNA hybridisation discovered. 3b. 1994: Introduction of robotics (Hoheisel et al.). 3c. 1995: First microarray publication (Schena et al.). 3d. 1997: First whole-genome microarray experiments (DeRisi et al.). 3e. 1999: First publication on microarrays for cancer classification (Golub et al.): leukemia / Affymetrix arrays.

  5. Central dogma. DNA is the same in each cell (tumours are an exception). The function of the cell is determined by proteins. The path from DNA to proteins goes via messenger RNA (mRNA). DNA is transcribed to mRNA according to the needs of that cell. mRNA contains the instructions for what proteins to build. DNA → mRNA → protein. Microarrays measure the amount of mRNA.

  6. Microarrays (1) Source: http://research.yale.edu/ysm/ Source: http://www.cottongenomics.org/

  7. Microarrays (2) 1. Isolation of mRNA (single-stranded; transcribed from genes). 2. Labeling with a color molecule. 3. The chip contains probes which uniquely correspond to genes. 4. Hybridization to the chip. 5. A laser reads the labeled molecules. 6. Image analysis converts colors to numbers (intensities). 7. Result: data matrix with 2 intensities (one per dye) for each array.

  8. The result • Number of rows (e.g. 44,000) is determined by the number of probes (> number of genes) • More genes than samples: high-dimensional setting

  9. Statistical issues before data analysis 1. Design of the experiment (not discussed). 2. Quality control (not discussed). 3. Normalization. Data are visualized by an MA plot. Use of different dyes (colours) may lead to a non-linear dye bias. This needs to be removed since it is artificial. M = log2(R/G) = log2(R) - log2(G); A = log2(R*G) = log2(R) + log2(G), where R and G are the red and green intensities.
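As a small illustration of the M and A definitions on this slide, the sketch below computes both values for a single spot. The function name `ma_values` and the example intensities are hypothetical; A is taken as the sum of the log-intensities, exactly as written above (other texts scale it by 1/2).

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    m = math.log2(red) - math.log2(green)  # log-ratio: M = log2(R/G)
    a = math.log2(red) + math.log2(green)  # log-product: A = log2(R*G)
    return m, a


# Hypothetical spot: red is four times as bright as green.
m, a = ma_values(red=1024.0, green=256.0)
print(m, a)
```

An MA plot is then a scatter plot of M against A over all spots on one array.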

  10. Normalization. Purpose: remove artificial dye effects to obtain unbiased M values. Most popular method: Loess. Assumption: mean M value equals 0 for all intensity ranges. Algorithm: • Sort the A values: A'1, ..., A'p. • For each A'i, define the window Wi = [A'i - L, A'i + L]. • Within each Wi, fit the linear regression M = a + bA + ε. • Compute the prediction M'i(pred) = ai + bi A'i. • Subtract M'i(pred) from M'i.
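The windowed regression steps on this slide can be sketched as follows. This is a minimal, unweighted version under stated assumptions: the function name `window_normalize` and the parameter `half_width` (the L of the slide) are hypothetical, and a full loess fit would additionally weight points by their distance to A'i, which is omitted here.

```python
from bisect import bisect_left, bisect_right


def window_normalize(a_vals, m_vals, half_width):
    """Subtract a locally fitted straight line from every M value.

    For spot i, M = a + b*A is fitted by least squares on all spots whose
    A value lies in [A_i - half_width, A_i + half_width].
    """
    # Sort spots by A so that each window is a contiguous slice.
    order = sorted(range(len(a_vals)), key=lambda i: a_vals[i])
    a_sorted = [a_vals[i] for i in order]
    m_sorted = [m_vals[i] for i in order]

    normalized = [0.0] * len(a_vals)
    for k, i in enumerate(order):
        start = bisect_left(a_sorted, a_sorted[k] - half_width)
        stop = bisect_right(a_sorted, a_sorted[k] + half_width)
        wa, wm = a_sorted[start:stop], m_sorted[start:stop]

        n = len(wa)  # at least 1: the spot itself is always in its window
        mean_a, mean_m = sum(wa) / n, sum(wm) / n
        sxx = sum((x - mean_a) ** 2 for x in wa)
        sxy = sum((x - mean_a) * (y - mean_m) for x, y in zip(wa, wm))
        slope = sxy / sxx if sxx > 0 else 0.0  # flat line if A is constant

        predicted = mean_m + slope * (a_sorted[k] - mean_a)
        normalized[i] = m_sorted[k] - predicted  # keep the input order
    return normalized
```

On data where the dye bias is exactly linear in A, the normalized M values come out at zero, in line with the assumption that the mean M value is 0 at every intensity.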

  11. Loess (figure): before and after normalization.

  12. After normalization: log2-ratios for further analysis. Ratios cancel out the experimental spot effect; the log gives a symmetric scale. However, nowadays log-intensities (both dyes) are used more and more often.

  13. Data • Type of response • Nominal, e.g. tumor type: R = {Benign, Malignant} • Ordinal, e.g. stage of a tumor: R = {1, 2, 3, 4} • Continuous, e.g. disease severity score: R = ℝ+ • Censored, e.g. survival: R = ℝ+ × {0, 1}

  14. Typical data analyses for microarrays (1) Multivariate: unsupervised (clustering, principal component analysis); classification (statistical learning, discriminant analysis, supervised clustering); multivariate regression with penalty for overfitting (e.g. Lasso / Ridge regression); prognostic multivariate survival models.

  15. Typical data analyses for microarrays (2) Univariate: inference (hypothesis testing). Expression of each gene is related to clinical response using, for example, ANOVA, linear regression, Cox regression (survival), or permutation (nonparametric) tests. Hybrid: inference for sets of genes that are functionally related.
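To make the univariate, gene-by-gene approach concrete, here is a sketch of a two-sided permutation test applied to each gene separately. The choice of the difference in group means as test statistic, the function name `permutation_pvalue`, and the gene names and expression values are all assumptions for illustration, not taken from the slides.

```python
import random


def permutation_pvalue(group1, group2, n_perm=2000, rng=None):
    """Two-sided permutation p-value for a difference in group means."""
    rng = rng or random.Random(0)  # fixed seed: reproducible sketch
    pooled = list(group1) + list(group2)
    n1 = len(group1)

    def mean_diff(values):
        return sum(values[:n1]) / n1 - sum(values[n1:]) / (len(values) - n1)

    observed = abs(mean_diff(pooled))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the samples
        # Small tolerance guards against floating-point summation order.
        if abs(mean_diff(pooled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one keeps the p-value above 0


# Hypothetical log-expression values: gene -> (condition 1, condition 2).
expression = {
    "geneA": ([1.0, 1.1, 0.9, 1.2, 1.0], [3.0, 3.1, 2.9, 3.2, 3.0]),
    "geneB": ([2.0, 2.5, 1.5, 2.2, 1.8], [2.1, 2.4, 1.6, 2.0, 1.9]),
}
pvalues = {gene: permutation_pvalue(g1, g2) for gene, (g1, g2) in expression.items()}
print(pvalues)
```

Each gene yields one raw p-value, so a full array produces tens of thousands of them.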

  16. Two-step ANOVA (1) Indices: a: array; c: condition; d: dye; g: gene. Model (1) is the normalization model; it only includes a gene factor in the residual u. That is, the residual u contains all gene-specific factors. Model (2) is the differential expression model.

  17. Two-step ANOVA (2) Use of the two-step ANOVA: first fit (1) on all data, then estimate the residuals u for each gene, then fit (2) for each gene separately. Main advantage over a one-level model: computational, since a one-level model would require fitting many parameters simultaneously in one ANOVA. Computation of raw p-values is the same as for the usual ANOVA.

  18. Multiple Testing, Motivation. Histogram of 20,000 p-values generated under H0. Even when all 20,000 null hypotheses are true, we expect 20,000 × 0.05 = 1,000 p-values smaller than α = 0.05!
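The count on this slide can be checked with a short simulation. The sketch assumes that p-values under a true null hypothesis are uniformly distributed on [0, 1], which holds for continuous test statistics; the function name `count_false_positives` and the seed are hypothetical.

```python
import random


def count_false_positives(n_tests=20000, alpha=0.05, seed=1):
    """Count p-values below alpha when every null hypothesis is true."""
    rng = random.Random(seed)
    pvalues = [rng.random() for _ in range(n_tests)]  # uniform under H0
    return sum(p < alpha for p in pvalues)


# The count fluctuates around n_tests * alpha = 1000 from seed to seed.
print(count_false_positives())
```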

  19. Multiple Testing. Illustration of Benjamini-Hochberg procedure
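A minimal sketch of the Benjamini-Hochberg step-up procedure: sort the m p-values, find the largest rank k with p(k) ≤ k·q/m, and reject the k hypotheses with the smallest p-values. The function name `benjamini_hochberg`, the FDR level `q`, and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where H0 is rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])

    # Largest 1-based rank k whose sorted p-value lies on or below the line k*q/m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank

    # Step-up: reject everything up to rank k_max, even p-values that
    # individually lie above their own threshold.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Hypothetical raw p-values for ten genes.
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(raw, q=0.05))
```

With these ten values only the two smallest p-values are rejected, whereas an unadjusted cut-off at 0.05 would have flagged five.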

  20. Multiple Testing M
