
Carlo Colantuoni – ccolantu@jhsph

Summer Inst. of Epidemiology and Biostatistics, 2009: Gene Expression Data Analysis, 8:30am-12:00pm in Room W2017. Carlo Colantuoni – ccolantu@jhsph.edu. http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2009.htm


Presentation Transcript


  1. Summer Inst. of Epidemiology and Biostatistics, 2009: Gene Expression Data Analysis, 8:30am-12:00pm in Room W2017. Carlo Colantuoni – ccolantu@jhsph.edu http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2009.htm

  2. Class Outline • Basic Biology & Gene Expression Analysis Technology • Data Preprocessing, Normalization, & QC • Measures of Differential Expression • Multiple Comparison Problem • Clustering and Classification • The R Statistical Language and Bioconductor • GRADES – independent project with Affymetrix data. http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2009.htm

  3. Class Outline - Detailed
     • Basic Biology & Gene Expression Analysis Technology
        • The Biology of Our Genome & Transcriptome
        • Genome and Transcriptome Structure & Databases
        • Gene Expression & Microarray Technology
     • Data Preprocessing, Normalization, & QC
        • Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
        • Background correction (PM-MM, RMA, GCRMA)
        • Global Mean Normalization
        • Loess Normalization
        • Quantile Normalization (RMA & GCRMA)
        • Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
        • Quality Control: PCA and MDS for dimension reduction
     • Measures of Differential Expression
        • Basic Statistical Concepts
        • T-tests and Associated Problems
        • Significance analysis of microarrays (SAM) [& Empirical Bayes]
        • Complex ANOVAs (limma package in R)
     • Multiple Comparison Problem
        • Bonferroni
        • False Discovery Rate Analysis (FDR)
     • Differential Expression of Functional Gene Groups
        • Functional Annotation of the Genome
        • Hypergeometric test?, χ², KS, pDens, Wilcoxon Rank Sum
        • Gene Set Enrichment Analysis (GSEA)
        • Parametric Analysis of Gene Set Enrichment (PAGE)
        • geneSetTest
     • Notes on Experimental Design
     • Clustering and Classification
        • Hierarchical clustering
        • K-means
        • Classification
        • LDA (PAM), kNN, Random Forests
        • Cross-Validation
     • Additional Topics
        • The R Statistical Language
        • Bioconductor
        • Affymetrix data processing example!

  4. DAY #3:
     • Measures of Differential Expression:
        • Review of basic statistical concepts
        • T-tests and associated problems
        • Significance analysis of microarrays (SAM)
        • (Empirical Bayes)
        • Complex ANOVAs (“limma” package in R)
     • Multiple Comparison Problem:
        • Bonferroni
        • FDR
     • Differential Expression of Functional Gene Groups
     • Notes on Experimental Design

  5. Slides from Rob Scharpf

  6. Fold-Change? T-Statistics? Some genes are more variable than others.

  7. Slides from Rob Scharpf

  8. Slides from Rob Scharpf

  9. Slides from Rob Scharpf

  10. Slides from Rob Scharpf

  11. [Figure with two labels, each beginning “distribution of”; the rest of each label is missing from the transcript.] Slides from Rob Scharpf

  12. Slides from Rob Scharpf

  13. X1-X2 is normally distributed if X1 and X2 are normally distributed – is this the case in microarray data? Slides from Rob Scharpf

  14. Problem 1: T-statistic not t-distributed. Implication: p-values/inference incorrect

  15. P-values by permutation • The assumptions used to derive a statistic's null distribution often do not hold closely enough to yield useful p-values (e.g. when t-statistics are not t-distributed). • An alternative is to use permutations.

  16. p-values by permutations. We focus on one gene only. For the bth iteration, b = 1, …, B: permute the n data points for the gene (x); the first n1 are referred to as “treatments”, the remaining n2 as “controls”. Calculate the corresponding two-sample t-statistic, tb. After all B permutations are done: p = #{b: |tb| ≥ |tobserved|} / B. This does not yet address the issue of multiple tests!
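The permutation recipe on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example expression values, the choice of the Welch form of the two-sample t-statistic, and B = 2000 are all hypothetical and not taken from the course material.

```python
import random
import statistics


def t_statistic(treat, ctrl):
    """Two-sample t-statistic (Welch form, an assumption) for one gene."""
    n1, n2 = len(treat), len(ctrl)
    se = (statistics.variance(treat) / n1 + statistics.variance(ctrl) / n2) ** 0.5
    return (statistics.mean(treat) - statistics.mean(ctrl)) / se


def permutation_p_value(x, n1, B=2000, seed=0):
    """x holds all n values for one gene: first n1 treatments, then controls."""
    rng = random.Random(seed)
    t_obs = t_statistic(x[:n1], x[n1:])
    hits = 0
    for _ in range(B):
        perm = rng.sample(x, len(x))  # random relabelling of the n data points
        t_b = t_statistic(perm[:n1], perm[n1:])
        if abs(t_b) >= abs(t_obs):
            hits += 1
    return hits / B  # p = #{b: |t_b| >= |t_observed|} / B


# Hypothetical log-expression values: 5 treatments followed by 5 controls
gene = [8.1, 7.9, 8.4, 8.2, 8.0, 7.1, 7.3, 6.9, 7.2, 7.0]
p = permutation_p_value(gene, n1=5)
print(p)
```

With only 5 + 5 samples there are just 252 distinct relabellings, so the smallest attainable p-value is limited by the sample size, not by B.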

  17. Another problem with t-tests: The volcano plot shows, for a particular test, negative log p-value against the effect size (M).

  18. Remember this?

  19. Problem 2: t-statistic bigger for genes with smaller standard error estimates. Implication: Ranking might not be optimal

  20. Problem 2 • With low N, SD estimates are unstable • Solutions: • Significance Analysis of Microarrays (SAM) • Empirical Bayes methods and Stein estimators

  21. Significance analysis of microarrays (SAM) • A clever adaptation of the t-ratio to borrow information across genes • Implemented in Bioconductor in the siggenes package. Reference: Significance analysis of microarrays applied to the ionizing radiation response, Tusher et al., PNAS 2001

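  22. SAM d-statistic • For gene i: $d_i = \frac{\bar{x}_{i1} - \bar{x}_{i2}}{s_i + s_0}$, where $\bar{x}_{i1}$ = mean of sample 1, $\bar{x}_{i2}$ = mean of sample 2, $s_i$ = standard deviation of repeated measurements for gene i, and $s_0$ = exchangeability factor estimated using all genes.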

  23. The exchangeability factor is chosen to minimize the average coefficient of variation (CV) of d across all genes.
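A rough sketch of how a d-statistic of this form damps genes with very small standard deviations is given below. Everything specific here is an assumption for illustration: the helper names, the three toy genes, and above all the use of the median of the per-gene s values as the exchangeability factor, which is a simple stand-in and not the CV-minimizing choice described on the slide.

```python
import statistics


def diff_and_s(values, n1):
    """Mean difference and pooled standard error for one gene."""
    g1, g2 = values[:n1], values[n1:]
    n2 = len(g2)
    m1, m2 = statistics.mean(g1), statistics.mean(g2)
    ss = sum((v - m1) ** 2 for v in g1) + sum((v - m2) ** 2 for v in g2)
    s = ((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2)) ** 0.5
    return m1 - m2, s


def sam_d(genes, n1, s0=None):
    """d_i = (mean1 - mean2) / (s_i + s0) for every gene."""
    parts = {name: diff_and_s(vals, n1) for name, vals in genes.items()}
    if s0 is None:
        # ASSUMPTION: median of s_i as a placeholder exchangeability factor
        s0 = statistics.median(s for _, s in parts.values())
    return {name: diff / (s + s0) for name, (diff, s) in parts.items()}


# Hypothetical data: 3 samples per group for each gene
genes = {
    "A": [10.0, 10.2, 9.8, 8.0, 8.2, 7.8],   # large difference, moderate spread
    "B": [6.00, 6.01, 5.99, 5.90, 5.91, 5.89],  # tiny difference, tiny spread
    "C": [7.0, 7.5, 6.5, 7.1, 6.6, 7.6],     # no real difference
}

t = {}
for name, vals in genes.items():
    diff, s = diff_and_s(vals, 3)
    t[name] = diff / s  # ordinary t-statistic for comparison

d = sam_d(genes, n1=3)
print(t)
print(d)
```

Genes A and B have essentially the same ordinary t-statistic, but B's comes from a tiny standard deviation; adding s0 to the denominator pulls B far below A in the ranking.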

  24. A fix for this problem: scatter plots of relative difference (d) vs. standard deviation (s) of repeated expression measurements. [Figure panels: relative difference for a permutation of the data that was balanced between cell lines 1 and 2; random fluctuations in the data, measured by balanced permutations (for cell lines 1 and 2).]

  25. SAM produces a modified T-statistic (d), and has an approach to the multiple comparison problem.

  26. Selected genes: Beyond expected distribution

  27. eBayes: Borrowing Strength • An advantage of having tens of thousands of genes is that we can try to learn about typical standard deviations by looking at all genes • Empirical Bayes gives us a formal way of doing this • “Shrinkage” of variance estimates toward a “prior”: moderated t-statistics – eliminates extreme stats due to small variances. • Implemented in the limma package in R. In addition, limma provides methods for more complex experimental designs beyond simple, two-sample designs.
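The shrinkage idea can be illustrated with the usual moderated-variance formula, in which the gene's own variance and a prior variance are averaged with weights given by their degrees of freedom. The sketch below is not limma itself: the function name is made up, and the prior variance and prior degrees of freedom are simply fixed by hand here, whereas in an empirical Bayes analysis they would be estimated from all genes.

```python
def moderated_t(diff, s2, df, n1, n2, prior_s2, prior_df):
    """t-statistic using a variance shrunk toward a prior value.

    s2_tilde = (prior_df * prior_s2 + df * s2) / (prior_df + df)
    Setting prior_df = 0 recovers the ordinary pooled-variance t-statistic.
    """
    s2_tilde = (prior_df * prior_s2 + df * s2) / (prior_df + df)
    return diff / (s2_tilde * (1 / n1 + 1 / n2)) ** 0.5


# Hypothetical gene: small difference, very small sample variance, 3 vs 3 samples
diff, s2, df = 0.1, 0.0001, 4

t_ordinary = moderated_t(diff, s2, df, 3, 3, prior_s2=0.04, prior_df=0)
# ASSUMPTION: prior_s2 and prior_df are illustrative values, not estimates
t_moderated = moderated_t(diff, s2, df, 3, 3, prior_s2=0.04, prior_df=4)

print(t_ordinary, t_moderated)
```

The extreme ordinary statistic, which is driven entirely by the tiny variance estimate, shrinks to an unremarkable value once the variance is pulled toward the prior.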

  28. The Multiple Comparison Problem (some slides courtesy of John Storey)

  29. Hypothesis Testing • Test for each gene: Null Hypothesis: no differential expression. • Two types of errors can be committed • Type I error or false positive (say that a gene is differentially expressed when it is not, i.e., reject a true null hypothesis). • Type II error or false negative (fail to identify a truly differentially expressed gene, i.e., fail to reject a false null hypothesis).

  30. Hypothesis testing: Once you have a given score for each gene, how do you decide on a cut-off? p-values are most common. How do we decide on a cut-off when we are looking at many 1000’s of “tests”? Are 0.05 and 0.01 appropriate? How many false positives would we get if we applied these cut-offs to long lists of genes?

  31. Multiple Comparison Problem • Even if we have good approximations of our p-values, we still face the multiple comparison problem. • When performing many independent tests, p-values no longer have the same interpretation.

  32. Bonferroni Procedure: α = 0.05; # tests = 1000; adjusted α = 0.05 / 1000 = 0.00005, or equivalently adjusted p = p × 1000.
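The arithmetic on this slide can be written as a small function. The function name and the example p-values are hypothetical; capping adjusted p-values at 1 is a common convention that the slide does not state.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-test threshold and the Bonferroni-adjusted p-values."""
    m = len(p_values)
    threshold = alpha / m                            # e.g. 0.05 / 1000
    adjusted = [min(1.0, p * m) for p in p_values]   # p * m, capped at 1
    return threshold, adjusted


# Hypothetical p-values for 1000 tests: three small ones, the rest unremarkable
p_values = [0.00001, 0.0004, 0.03] + [0.5] * 997
threshold, adjusted = bonferroni(p_values)
n_significant = sum(p <= threshold for p in p_values)
print(threshold, adjusted[:3], n_significant)
```

Only the first p-value survives: 0.0004 would look convincing in a single test but becomes 0.4 after multiplying by 1000.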

  33. Bonferroni Procedure: Too conservative. How else can we interpret many 1000’s of observed statistics? Instead of evaluating each statistic individually, can we assess a list of statistics: FDR (Benjamini & Hochberg 1995)

  34. FDR • Given a cut-off statistic, FDR gives us an estimate of the proportion of hits in our list of differentially expressed genes that are false. Null = Equivalent Expression; Alternative = Differential Expression

  35. False Discovery Rate • The “false discovery rate” measures the proportion of false positives among all genes called significant: FDR = (# false positives) / (# genes called significant) • This is usually appropriate because one wants to find as many truly differentially expressed genes as possible with relatively few false positives • The false discovery rate gives an estimate of the rate at which further biological verification will result in dead-ends

  36. Distribution of Statistics (N=90). [Figure: distributions of the permuted and observed statistics; x-axis: statistic.]

  37. Distribution of Statistics: FDR = False Pos. / Total Pos. = Permuted / Observed. [Figure: distributions of the permuted and observed statistics; x-axis: statistic.]

  38. Distribution of p-values (N=90). [Figure: distributions of the observed and permuted p-values; x-axis: p-value.]

  39. FDR = False Positives / Total Positive Calls. This FDR analysis requires enough samples in each condition to estimate a statistic for each gene: observed statistic distribution. And enough samples in each condition to permute many times and recalculate this statistic: null statistic distribution. What if we don’t have this?
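One way to read the ratio above is: at a given cut-off, count how many observed statistics exceed it (total positive calls), and estimate the false positives as the average number of permuted statistics that exceed it per permutation. The sketch below follows that reading; the function name and all numbers are invented, and returning 0 when nothing is called significant is a convention chosen here.

```python
def estimate_fdr(observed, permuted, cutoff):
    """observed: one statistic per gene; permuted: one list of statistics per permutation."""
    total_calls = sum(abs(s) >= cutoff for s in observed)
    if total_calls == 0:
        return 0.0  # ASSUMPTION: define FDR as 0 when no gene is called
    # Average count of null (permuted) statistics beyond the cut-off
    false_calls = sum(abs(s) >= cutoff for row in permuted for s in row) / len(permuted)
    return min(1.0, false_calls / total_calls)


# Hypothetical statistics for 10 genes and 2 permutations of the sample labels
observed = [2.5, -1.9, 0.3, 1.2, -0.4, 3.1, 0.8, -2.2, 0.1, 1.0]
permuted = [
    [0.5, -1.1, 0.2, 0.9, -0.3, 1.3, -0.7, 0.4, 0.0, -0.9],
    [-0.2, 0.6, 1.0, -1.4, 0.3, 0.1, -0.5, 0.8, -1.2, 0.7],
]

fdr_strict = estimate_fdr(observed, permuted, cutoff=1.25)
fdr_loose = estimate_fdr(observed, permuted, cutoff=1.0)
print(fdr_strict, fdr_loose)
```

Lowering the cut-off adds more genes to the list but lets in proportionally more null-like statistics, so the estimated FDR rises.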

  40. FDR = 0.05 Beyond ±0.9

  41. FDR = 0.05 Beyond ±0.9

  42. False Positive Rate versus False Discovery Rate • False positive rate is the rate at which truly null genes are called significant • False discovery rate is the rate at which significant genes are truly null
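The difference between the two rates is only in the denominator, which a small worked example makes concrete. The gene counts below are invented for illustration, as is the function name.

```python
def error_rates(false_pos, true_neg, true_pos):
    """False positive rate and false discovery rate from outcome counts."""
    fpr = false_pos / (false_pos + true_neg)  # among truly null genes
    fdr = false_pos / (false_pos + true_pos)  # among genes called significant
    return fpr, fdr


# Hypothetical experiment: 10,000 genes, 9,000 truly null, 1,000 truly changed.
# At a p-value cut-off of 0.05 we expect 9,000 * 0.05 = 450 false positives;
# suppose 600 of the truly changed genes are also detected.
fpr, fdr = error_rates(false_pos=450, true_neg=8550, true_pos=600)
print(fpr, fdr)
```

Here the false positive rate is the nominal 5%, yet about 43% of the 1,050 genes on the significant list are truly null.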

  43. False Positive Rate and P-values • The p-value is a measure of significance in terms of the false positive rate (aka Type I error rate) • P-value is defined to be the minimum false positive rate at which the statistic can be called significant • Can be described as the probability a truly null statistic is “as or more extreme” than the observed one

  44. False Discovery Rate and Q-values • The q-value is a measure of significance in terms of the false discovery rate • Q-value is defined to be the minimum false discovery rate at which the statistic can be called significant • Can be described as the probability a statistic “as or more extreme” is truly null
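As a rough illustration of "the minimum false discovery rate at which the statistic can be called significant", the sketch below computes Benjamini-Hochberg adjusted p-values. This is a swapped-in simplification: it corresponds to a q-value calculation in which the proportion of truly null genes is assumed to be 1, which makes it conservative. The function name and the example p-values are hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (q-value-like, assuming all-null proportion = 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for 10 genes
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
q = bh_adjust(p_values)
called = [i for i, v in enumerate(q) if v <= 0.05]
print(q)
print(called)
```

Five of these genes have p < 0.05, but only the first two have an adjusted value below 0.05, so only those two would be called at an FDR of 5%.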

  45. Power and Sample Size Calculations are Hard • Need to specify: • α (Type I error rate, false positives) or FDR • σ (stdev: will be sample- and gene-specific) • Effect size (how do we estimate?) • Power (1-β, β = Type II error rate) • Sample Size • Some papers: • Mueller, Parmigiani et al. JASA (2004) • Rich Simon’s group Biostatistics (2005) • Tibshirani. A simple method for assessing sample sizes in microarray experiments. BMC Bioinformatics. 2006 Mar 2;7:106.
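To show how the listed quantities interact for a single gene, here is a sketch using the textbook normal-approximation formula for a two-sided, two-sample comparison, n per group = 2σ²(z₁₋α/₂ + z₁₋β)² / δ². This is a simplification and not the method of any of the papers cited on the slide: it treats σ as known, ignores the t-distribution correction that matters at small n, and handles multiple testing only through the choice of α. The function name and input values are assumptions.

```python
import math
from statistics import NormalDist


def samples_per_group(alpha, power, sigma, delta):
    """Normal-approximation sample size per group for one gene.

    alpha: two-sided Type I error rate; power: 1 - beta;
    sigma: per-gene standard deviation; delta: effect size (same units as sigma).
    """
    z = NormalDist().inv_cdf
    n = 2 * (sigma * (z(1 - alpha / 2) + z(power)) / delta) ** 2
    return math.ceil(n)


# Hypothetical gene: sigma = 0.5 on the log2 scale, two-fold change (delta = 1)
n_single_test = samples_per_group(alpha=0.05, power=0.8, sigma=0.5, delta=1.0)
n_stringent = samples_per_group(alpha=0.001, power=0.8, sigma=0.5, delta=1.0)
print(n_single_test, n_stringent)
```

Tightening α to account for many tests roughly doubles the required samples per group in this example, and a noisier gene (larger σ) would push it higher still.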

  46. Beyond Individual Genes: Functional Gene Groups • Borrow statistical power across entire dataset • Integrate preexisting biological knowledge • Beyond threshold enrichment
