
Carlo Colantuoni – ccolantu@jhsph

Summer Inst. Of Epidemiology and Biostatistics, 2008: Gene Expression Data Analysis 8:30am-12:30pm in Room W2017. Carlo Colantuoni – ccolantu@jhsph.edu. http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2008.htm. Class Outline.


Presentation Transcript


  1. Summer Inst. Of Epidemiology and Biostatistics, 2008: Gene Expression Data Analysis, 8:30am-12:30pm in Room W2017. Carlo Colantuoni – ccolantu@jhsph.edu. http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2008.htm

  2. Class Outline • Basic Biology & Gene Expression Analysis Technology • Data Preprocessing, Normalization, & QC • Measures of Differential Expression • Multiple Comparison Problem • Clustering and Classification • The R Statistical Language and Bioconductor • GRADES – independent project with Affymetrix data. http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2008.htm

  3. Class Outline - Detailed
     • Basic Biology & Gene Expression Analysis Technology
        • The Biology of Our Genome & Transcriptome
        • Genome and Transcriptome Structure & Databases
        • Gene Expression & Microarray Technology
     • Data Preprocessing, Normalization, & QC
        • Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
        • Background correction (PM-MM, RMA, GCRMA)
        • Global Mean Normalization
        • Loess Normalization
        • Quantile Normalization (RMA & GCRMA)
        • Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
        • Quality Control: PCA and MDS for dimension reduction
     • Measures of Differential Expression
        • Basic Statistical Concepts
        • T-tests and Associated Problems
        • Significance analysis in microarrays (SAM) [& Empirical Bayes]
        • Complex ANOVAs (limma package in R)
     • Multiple Comparison Problem
        • Bonferroni
        • False Discovery Rate Analysis (FDR)
     • Differential Expression of Functional Gene Groups
        • Functional Annotation of the Genome
        • Hypergeometric test?, χ², KS, pDens, Wilcoxon Rank Sum
        • Gene Set Enrichment Analysis (GSEA)
        • Parametric Analysis of Gene Set Enrichment (PAGE)
        • geneSetTest
     • Notes on Experimental Design
     • Clustering and Classification
        • Hierarchical clustering
        • K-means
        • Classification
        • LDA (PAM), kNN, Random Forests
        • Cross-Validation
     • Additional Topics
        • The R Statistical Language
        • Bioconductor
        • Affymetrix data processing example!

  4. DAY #2: • Intensity Comparison & Ratio vs. Intensity Plots • Log transformation • Background correction (Affymetrix, 2-color, other) • Normalization: global and local mean centering • Normalization: quantile normalization • Batches, plates, pins, hybs, washes, and other artifacts • QC: PCA and MDS for dimension reduction

  5. Microarray Data Quantification [Figure axes: Log Intensity vs. Log Intensity]

  6. Microarray Data Quantification [Figure axes: Log Ratio vs. Log Intensity]

  7. Logarithmic Transformation: if log_z(x) = y, then z^y = x. Logarithm math refresher: log(x) + log(y) = log(x * y); log(x) - log(y) = log(x / y).
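The identities on this slide can be checked numerically. The snippet below is a minimal sketch with made-up intensities (800 and 200 are illustrative values, not data from the course); it also shows why log ratios are convenient: a 4-fold increase and a 4-fold decrease land at +2 and -2 on the log2 scale.

```python
import math

def log_ratio(x, y, base=2.0):
    """Return log_base(x / y), computed as a difference of logs."""
    return math.log(x, base) - math.log(y, base)

# Illustrative intensities (hypothetical values).
x, y = 800.0, 200.0

# Product rule: log(x) + log(y) = log(x * y).
sum_of_logs = math.log2(x) + math.log2(y)
log_of_product = math.log2(x * y)

# Quotient rule applied to fold changes: up- and down-regulation by the
# same factor are symmetric around zero on the log scale.
up = log_ratio(800.0, 200.0)    # 4-fold increase
down = log_ratio(200.0, 800.0)  # 4-fold decrease

print(sum_of_logs, log_of_product, up, down)
```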

  8. Intensity vs. Intensity: LINEAR Intensity Distribution: LINEAR

  9. Intensity vs. Intensity: LOG Intensity Distribution: LOG

  10. Intensity vs. Intensity: LINEAR

  11. Intensity vs. Intensity: LOG

  12. Microarray Data Quantification Int vs. Int: LINEAR Int vs. Int: LOG Ratio vs. Int: LOG

  13. Background Subtraction

  14. Before Hybridization Sample 1 Sample 2 Array 2 Array 1

  15. After Hybridization Array 2 Array 1

  16. More Realistic - Before Sample 1 Sample 2 Array 2 Array 1

  17. More Realistic - After Array 2 Array 1

  18. No label poly C

  19. Intensity distributions for the no-label and Yeast DNA

  20. Why Adjust for Background? The presence of background noise is clear from the fact that the minimum PM intensity is not 0 and that the geometric mean of the probesets with no spike-in is around 200 units.

  21. Why Adjust for Background? When expression is large relative to background: (E1 + B) ≈ E1, so (E1 + B) / (E2 + B) ≈ E1 / E2. When expression is small relative to background: (E1 + B) ≈ B, so (E1 + B) / (E2 + B) ≈ 1. Local slope decreases as nominal concentration decreases! By using the log-scale transformation before analyzing microarray data, investigators have, implicitly or explicitly, assumed a multiplicative measurement error model (Dudoit et al., 2002; Newton et al., 2001; Kerr et al., 200; Wolfinger et al., 2001). The fact, seen in Figure 2, that observed intensity increases linearly with concentration in the original scale but not in the log-scale suggests that background noise is additive with non-zero mean. Durbin et al. (2002), Huber et al. (2002), Cui, Kerr, and Churchill (2003), and Irizarry et al. (2003a) have proposed additive-background-multiplicative-measurement-error models for intensities read from microarray scanners.
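The two regimes above can be made concrete with a small calculation. This is a sketch under assumed numbers: the background of 200 units echoes the figure quoted on the previous slide, while the true 4-fold change and the expression levels are hypothetical.

```python
import math

def observed_log2_ratio(e1, e2, background):
    """log2 of the observed ratio when the same additive background
    contaminates both measurements: (E1 + B) / (E2 + B)."""
    return math.log2((e1 + background) / (e2 + background))

B = 200.0        # assumed additive background, in intensity units
TRUE_FOLD = 4.0  # hypothetical true fold change, i.e. a true log2 ratio of 2

for e2 in (10.0, 100.0, 1000.0, 10000.0):
    e1 = TRUE_FOLD * e2
    print(f"E2={e2:>7.0f}  observed log2 ratio = "
          f"{observed_log2_ratio(e1, e2, B):.3f}  (true = 2.000)")
```

At high expression the observed log ratio approaches the true value of 2; at low expression it collapses toward 0, which is the compression the slide describes.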

  22. Affymetrix GeneChip Design. Reference sequence (5’ to 3’): …TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT… Perfect match (PM): GTACTACCCAGTCTTCCGGAGGCTA (NSB & SB). Mismatch (MM): GTACTACCCAGTGTTCCGGAGGCTA (NSB).
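The relationship between the three sequences on this slide can be verified directly: the PM probe is the base-by-base complement of a 25-base stretch of the reference, and the MM probe differs from the PM only at the middle (13th) base, which is replaced by its complement. The helper names below are illustrative and not part of any Affymetrix software.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq):
    """Base-by-base complement (no reversal)."""
    return "".join(COMPLEMENT[base] for base in seq)

def mismatch_probe(pm):
    """Build the MM probe by complementing the middle base of the PM probe.
    For a 25-mer, index 12 is the 13th base."""
    mid = len(pm) // 2
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

# Sequences copied from the slide.
reference = "TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT"
pm = "GTACTACCCAGTCTTCCGGAGGCTA"

mm = mismatch_probe(pm)
target = complement(pm)  # the stretch of the reference that the PM pairs with

print("MM probe:          ", mm)
print("Target in reference:", target in reference)
```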

  23. Why not subtract MM?

  24. Why not subtract MM?

  25. Why not subtract MM?

  26. Background: Solutions

  27. Affymetrix GeneChip Design. Reference sequence (5’ to 3’): …TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT… Perfect match (PM): GTACTACCCAGTCTTCCGGAGGCTA (NSB & SB). Mismatch (MM): GTACTACCCAGTGTTCCGGAGGCTA (NSB).

  28. Motivation: PM - MM. The hope is that: PM = B + S; MM = B; so PM – MM = S. But this is not correct!

  29. Simulation • We create some feature-level data for two replicate arrays • Then compute Y = log(PM - k*MM) for each array • We make an MA plot using the Ys for each array • We make an observed concentration versus known concentration plot • We do this for various values of k. The following “movie” shows k moving from 0 to 1.
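The simulation steps listed above can be sketched as follows. This is not the instructor's simulation: the data-generating model (log-normal additive background with median 200, multiplicative noise on the signal, and an MM that carries background only), all parameter values, and the function names are assumptions chosen for illustration. Probe pairs where PM - k*MM is not positive are dropped, since the log is undefined there.

```python
import math
import random
import statistics

def simulate_array(signals, rng, bg_median=200.0, bg_sigma=0.3, noise_sigma=0.1):
    """Return one array as a list of (PM, MM) pairs.
    Assumed model: PM = B_pm + S * noise, MM = B_mm (background only)."""
    pairs = []
    for s in signals:
        b_pm = bg_median * math.exp(rng.gauss(0.0, bg_sigma))
        b_mm = bg_median * math.exp(rng.gauss(0.0, bg_sigma))
        pm = b_pm + s * math.exp(rng.gauss(0.0, noise_sigma))
        pairs.append((pm, b_mm))
    return pairs

def ma_values(array1, array2, k):
    """Compute M (difference) and A (average) of Y = log2(PM - k*MM)
    for two replicate arrays, skipping non-positive differences."""
    m_vals, a_vals = [], []
    for (pm1, mm1), (pm2, mm2) in zip(array1, array2):
        d1, d2 = pm1 - k * mm1, pm2 - k * mm2
        if d1 <= 0 or d2 <= 0:
            continue
        y1, y2 = math.log2(d1), math.log2(d2)
        m_vals.append(y1 - y2)
        a_vals.append((y1 + y2) / 2)
    return m_vals, a_vals

rng = random.Random(0)
# Known "true" levels spread evenly on the log2 scale (hypothetical range).
signals = [2 ** rng.uniform(4, 14) for _ in range(5000)]
array1 = simulate_array(signals, rng)
array2 = simulate_array(signals, rng)

spread = {}
for k in (0, 0.25, 0.5, 0.75, 1):
    m_vals, _ = ma_values(array1, array2, k)
    spread[k] = statistics.pstdev(m_vals)
    print(f"k={k:<4}  pairs kept={len(m_vals):>4}  SD of M between replicates={spread[k]:.3f}")
```

Because the two arrays are replicates, every M value should ideally be 0, so the standard deviation of M measures noise. Under these assumptions it grows as k moves from 0 to 1, even though MM here is an idealized background-only measurement.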

  30. k=0 [Figure panels: Log2(Ratio) vs. Log2(Intensity); Observed level (log2) vs. Known level (log2)]

  31. k=1/4 [Figure panels: Log2(Ratio) vs. Log2(Intensity); Observed level (log2) vs. Known level (log2)]

  32. k=1/2 [Figure panels: Log2(Ratio) vs. Log2(Intensity); Observed level (log2) vs. Known level (log2)]

  33. k=3/4 [Figure panels: Log2(Ratio) vs. Log2(Intensity); Observed level (log2) vs. Known level (log2)]

  34. k=1 [Figure panels: Log2(Ratio) vs. Log2(Intensity); Observed level (log2) vs. Known level (log2)]

  35. Real Data MAS 5.0 RMA

  36. RMA: The Basic Idea. PM = B + S. Observed: PM. Of interest: S. Pose a statistical model and use it to predict S from the observed PM.

  37. The Basic Idea: PM = B + S • A mathematically convenient, useful model • B ~ Normal(μ, σ²), S ~ Exponential(α) • No MM • Borrowing strength across probes
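Under a Normal-plus-Exponential model of this kind, "predicting S from the observed PM" has a closed form: the conditional mean of S given PM. The sketch below assumes the Normal background has mean mu and standard deviation sigma, that the Exponential signal has rate alpha (these parameter names are chosen here), and that the background is not truncated at zero. How mu, sigma, and alpha are estimated from the data is not covered, and the numeric values are hypothetical.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal CDF via erfc, which is stable in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def background_adjust(pm, mu, sigma, alpha):
    """E[S | PM = pm] when PM = B + S, B ~ Normal(mu, sigma^2),
    S ~ Exponential(rate=alpha). The posterior of S is a normal with mean
    a = pm - mu - sigma^2 * alpha and sd sigma, truncated below at 0."""
    a = pm - mu - sigma * sigma * alpha
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)

# Hypothetical parameter values for illustration only.
mu, sigma, alpha = 100.0, 20.0, 0.01

for pm in (50.0, 100.0, 150.0, 1000.0):
    print(f"PM={pm:>6.0f}  adjusted signal={background_adjust(pm, mu, sigma, alpha):8.2f}")
```

Unlike PM - MM, this adjusted value is always positive, so it can be log-transformed without discarding probes. For bright probes it is essentially PM minus a constant; for dim probes it shrinks smoothly toward zero.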

  38. MAS 5.0

  39. RMA: Notice improved precision but worse accuracy

  40. Problem • Global background correction ignores probe-specific NSB • MM have problems • Another possibility: Use probe sequence

  41. Probe-specific Background

  42. G-C content effect in PM’s. Any given probe will have some propensity to non-specific binding. As described in Section 2.3 and demonstrated in Figure 3, this tends to be directly related to its G-C content. We propose a statistical model that describes the relationship between the PM, MM, and probes of the same G-C content. [Figure caption: Boxplots of log intensities from the array hybridized to Yeast DNA for strata of probes defined by their G-C content. Probes with 6 or fewer G-C are grouped together, as are probes with 20 or more. Smooth density plots are shown for the strata with G-C contents of 6, 10, 14, and 18.]
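The stratification in the figure caption is straightforward to compute: count the G and C bases in each probe, then pool everything at or below 6 into one stratum and everything at or above 20 into another. A minimal sketch follows; the function names are illustrative, and apart from the PM probe from the GeneChip design slide the sequences are made up.

```python
from collections import defaultdict

def gc_count(probe):
    """Number of G or C bases in a probe sequence."""
    return sum(base in "GC" for base in probe.upper())

def gc_stratum(probe, low=6, high=20):
    """Stratum label: the G-C count, with counts <= low pooled at `low`
    and counts >= high pooled at `high`."""
    return min(max(gc_count(probe), low), high)

def group_by_stratum(probes):
    """Map each stratum label to the list of probes that fall in it."""
    strata = defaultdict(list)
    for probe in probes:
        strata[gc_stratum(probe)].append(probe)
    return dict(strata)

probes = [
    "GTACTACCCAGTCTTCCGGAGGCTA",  # PM probe from the GeneChip design slide
    "ATATATATATATATATATATATATA",  # hypothetical AT-only probe
    "GCGCGCGCGCGCGCGCGCGCGCGCG",  # hypothetical GC-only probe
]

for stratum, members in sorted(group_by_stratum(probes).items()):
    print(stratum, members)
```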

  43. General Model (GCRMA) [Equation not preserved in transcript; terms labeled NSB and SB] We can calculate: [formula not preserved] Because of the variance associated with the measured MM intensities, we argue that one data point is not enough to obtain a useful adjustment. In this paper we propose using probe sequence information to select other probes that can serve the same purpose as the MM pair. We do this by defining subsets of the existing MM probes with similar hybridization properties.
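One simple reading of "subsets of the existing MM probes with similar hybridization properties" is to pool MM probes by G-C content and summarize each pool, so that a PM probe's background is estimated from many MM probes rather than from its single MM partner. The sketch below does exactly that and nothing more. It is not the GCRMA algorithm, and the sequences other than the slide's MM and PM probes, as well as all intensities, are made up.

```python
import math
import statistics
from collections import defaultdict

def gc_stratum(seq, low=6, high=20):
    """G-C count, pooled at `low` and `high` as in the boxplot slide."""
    gc = sum(base in "GC" for base in seq.upper())
    return min(max(gc, low), high)

def stratum_background(mm_probes):
    """Map each G-C stratum to the median log2 intensity of its MM probes."""
    pools = defaultdict(list)
    for seq, intensity in mm_probes:
        pools[gc_stratum(seq)].append(math.log2(intensity))
    return {stratum: statistics.median(vals) for stratum, vals in pools.items()}

# (sequence, intensity) pairs; intensities are hypothetical.
mm_probes = [
    ("ATATATATATGTATATATATATATA", 90.0),
    ("ATATATACATGTATATATATATATA", 110.0),
    ("GTACTACCCAGTGTTCCGGAGGCTA", 300.0),  # MM probe from the design slide
    ("GCGCTACCCAGTGTTCCGGAGATTA", 340.0),
]

background = stratum_background(mm_probes)

# Background estimate for a PM probe: look up the pool with the same G-C stratum.
pm_seq = "GTACTACCCAGTCTTCCGGAGGCTA"
pm_background_log2 = background[gc_stratum(pm_seq)]
print(f"stratum {gc_stratum(pm_seq)}: background = {pm_background_log2:.3f} (log2 units)")
```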

  44. The MA plot shows log fold change as a function of mean log expression level. A set of 14 arrays representing a single experiment from the Affymetrix spike-in data is used for this plot. A total of 13 sets of fold changes are generated by comparing the first array in the set to each of the others. Genes are symbolized by numbers representing the nominal log2 fold change for the gene. Non-differentially expressed genes with observed fold changes larger than 2 are plotted in red. All other probesets are represented with black dots. The smooth lines are 3 SDs away, with SD depending on log expression.
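The two coordinates of an MA plot are simple to compute from a pair of arrays: M is the log2 fold change and A is the mean log2 expression. The sketch below uses three made-up probesets and flags those whose observed fold change exceeds 2 in either direction (|M| > 1), which corresponds to the red points described above when the gene is not truly differentially expressed.

```python
import math

def ma_transform(x, y):
    """Return (M, A) lists for paired intensities from two arrays:
    M = log2(x) - log2(y), A = (log2(x) + log2(y)) / 2."""
    m = [math.log2(a) - math.log2(b) for a, b in zip(x, y)]
    a = [(math.log2(p) + math.log2(q)) / 2 for p, q in zip(x, y)]
    return m, a

# Hypothetical intensities for three probesets on two arrays.
array1 = [100.0, 500.0, 1600.0]
array2 = [100.0, 200.0, 3200.0]

m_vals, a_vals = ma_transform(array1, array2)

# Observed fold change larger than 2 in either direction means |M| > 1.
flagged = [abs(m) > 1.0 for m in m_vals]

for m, a, f in zip(m_vals, a_vals, flagged):
    print(f"M={m:+.3f}  A={a:.3f}  fold change > 2: {f}")
```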

  45. Another sequence effect in PM’s and MM’s. Naef & Magnasco (2003), Physical Review E 68, 011906.

  46. Another sequence effect in PM’s and MM’s. We show in Fig. 2 joint probability distributions of PMs and MMs, obtained from all probe pairs in a large set of experiments. Actually, two separate probability distributions are superimposed: in red, the distribution for all probe pairs whose 13th letter is a purine, and in cyan those whose 13th letter is a pyrimidine. The plot clearly shows two distinct branches in two colors, corresponding to the basic distinction between the shapes of the bases: purines are large, double-ringed nucleotides while pyrimidines have smaller single rings. This underscores that by replacing the middle letter of the PM with its complementary base, the situation on the MM probe is that the middle letter always faces itself, leading to two quite distinct outcomes according to the size of the nucleotide. If the letter is a purine, there is no room within an undistorted backbone for two large bases, so this mismatch distorts the geometry of the double helix, incurring a large steric and stacking cost. But if the letter is a pyrimidine, there is room to spare, and the bases just dangle. The only energy lost is that of the hydrogen bonds. Naef & Magnasco (2003), Physical Review E 68, 011906.

  47. C and T are pyrimidines (and small), A and G are purines (and large).
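Splitting probe pairs into the two branches described by Naef & Magnasco only requires looking at the 13th letter of each 25-mer. The sketch below is illustrative (the function name is not from any package) and uses the PM and MM probes from the GeneChip design slide as input.

```python
PURINES = frozenset("AG")      # large, double-ringed bases
PYRIMIDINES = frozenset("CT")  # smaller, single-ringed bases

def middle_base_class(probe):
    """Classify the middle base of a probe (the 13th letter of a 25-mer)."""
    base = probe[len(probe) // 2].upper()
    if base in PURINES:
        return "purine"
    if base in PYRIMIDINES:
        return "pyrimidine"
    raise ValueError(f"unexpected base {base!r}")

pm = "GTACTACCCAGTCTTCCGGAGGCTA"
mm = "GTACTACCCAGTGTTCCGGAGGCTA"

print("PM middle base:", pm[12], "->", middle_base_class(pm))
print("MM middle base:", mm[12], "->", middle_base_class(mm))
```

Because the MM is built by complementing the middle base, a pyrimidine in the PM always becomes a purine in the MM and vice versa.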

  48. Why not subtract MM?
