
Detecting Differentially Expressed Genes

Pengyu Hong, 09/13/2005


Presentation Transcript


  1. Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005

  2. Background (Microarray). Figure labels: Extract RNA; Cells.

  3. Background. Figure labels: Extract RNA; Cells.

  4. Background. Figure labels: Extract RNA; Cells.

  5. Background. Figure labels: Extract RNA; Cells.

  6. Background. Figure labels: Extract RNA; Cells; 10^4+ genes.

  7. Background. Figure labels: Extract RNA; Cells; 10^4+ genes.

  8. Background. Figure labels: Extract RNA; Cells; 10^4+ genes.

  9. Background. Sources of noise: biological variability and technical variability. Steps: biological sample; RNA extraction (total RNA or mRNA); amplification (in vitro transcription); label samples; hybridization; washing and staining; scanning. • Microarrays are highly noisy. • Use replicated experiments to make inferences about differential expression for the population from which the biological samples originate.

  10. Background. Normalization; calculate gene expression index.

  11. An Example: 5 normal samples and 9 myeloma (MM) samples; 12,558 genes (rows).

  12. Genes of Interest • Statistical significance: that the observed differential expression is unlikely to be due to chance. • Scientific significance: that the observed level of differential expression is of sufficient magnitude to be of biological relevance.

  13. Parametric Test: t-test. Statistical significance in the two-group problem. Group 1 (N samples): $X_1, X_2, \dots, X_N$. Group 2 (M samples): $Y_1, Y_2, \dots, Y_M$. Assume $X_i \sim \mathrm{Normal}(\mu_1, \sigma^2)$ and $Y_j \sim \mathrm{Normal}(\mu_2, \sigma^2)$. Null hypothesis: Group 1 is the "same" as Group 2 (i.e., $\mu_1 = \mu_2$).

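  14. Parametric Test: t-test. Statistical significance in the two-group problem. $X_i \sim \mathrm{Normal}(\mu_1, \sigma^2)$, $Y_j \sim \mathrm{Normal}(\mu_2, \sigma^2)$. Null hypothesis: $\mu_1 = \mu_2$. Test the null hypothesis with the test statistic $t = (\bar{X} - \bar{Y}) / (s_p \sqrt{1/N + 1/M})$, where $s_p^2 = ((N-1) s_X^2 + (M-1) s_Y^2) / (N+M-2)$ is the pooled sample variance; under the null hypothesis, $t \sim t(N+M-2)$.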
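
The equal-variance two-sample t statistic can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `pooled_t` and the expression values are hypothetical and not taken from the slides.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t statistic assuming both groups share one variance."""
    n, m = len(x), len(y)
    # Pooled estimate of the common variance sigma^2.
    sp2 = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)
    t = (mean(x) - mean(y)) / sqrt(sp2 * (1 / n + 1 / m))
    return t, n + m - 2  # statistic and its degrees of freedom


# Hypothetical expression values, for illustration only.
group1 = [5.1, 4.9, 5.3, 5.0, 5.2]
group2 = [5.9, 6.1, 6.0, 5.8, 6.2, 6.0]
t, df = pooled_t(group1, group2)
print(f"t = {t:.3f}, df = {df}")
```

A negative `t` here simply reflects that the first group has the smaller mean; the sign flips if the groups are swapped.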

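  15. If variances are unequal ($\sigma_1 \neq \sigma_2$): $X_i \sim \mathrm{Normal}(\mu_1, \sigma_1^2)$, $Y_j \sim \mathrm{Normal}(\mu_2, \sigma_2^2)$. Test statistic: $t' = (\bar{X} - \bar{Y}) / \sqrt{s_X^2/N + s_Y^2/M}$. (1) When N+M > 30, this is approximately normal. (2) When $\sigma_1 \gg \sigma_2$, this is approximately t(df = N−1). (3) In general, Welch approximation: $t' \sim t(df')$, where $df' = (s_X^2/N + s_Y^2/M)^2 / \left[ (s_X^2/N)^2/(N-1) + (s_Y^2/M)^2/(M-1) \right]$.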
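
For the unequal-variance case, a minimal sketch of the Welch statistic and its approximate degrees of freedom follows. It assumes the standard Welch-Satterthwaite formula for df'; the function name `welch_t` and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    n, m = len(x), len(y)
    # Squared standard errors of the two group means.
    vx, vy = variance(x) / n, variance(y) / m
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, df


# Hypothetical values: the second group is much more dispersed.
t, df = welch_t([5.1, 4.9, 5.3, 5.0, 5.2], [3.0, 9.0, 6.5, 1.2, 8.8, 7.1])
print(f"t' = {t:.3f}, df' = {df:.2f}")
```

The approximate df' always lies between min(N−1, M−1) and N+M−2, which is why it is usually not an integer.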

  16. Wilcoxon rank sum test. Consider row 7 of the MM study:

      Group 1 values: 16 253 633 1008 708
      Group 2 values: 36 72 28 14 33 19 49 58 23
      Group 1 ranks:  13 4 3 1 2
      Group 2 ranks:  8 5 10 14 9 12 7 6 11
      Group 1 rank sum = 23

      This test is more appropriate than the t-tests when the underlying distribution is far from normal. (But it requires large group sizes.)
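
The rank sum for this row can be reproduced with a short sketch. It assumes the first five values form one group and the remaining nine the other (consistent with the 5 + 9 sample design and the reported rank sum), that rank 1 goes to the largest value as in the slide, and that there are no ties. The function name `rank_sum` is hypothetical.

```python
def rank_sum(g1, g2):
    """Sum of the ranks of g1 within the pooled data (rank 1 = largest)."""
    pooled = g1 + g2
    # Assumes no tied values; ties would need averaged ranks.
    order = sorted(pooled, reverse=True)
    rank = {value: i + 1 for i, value in enumerate(order)}
    return sum(rank[value] for value in g1)


group1 = [16, 253, 633, 1008, 708]
group2 = [36, 72, 28, 14, 33, 19, 49, 58, 23]
print(rank_sum(group1, group2))  # 23
```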

  17. P-value • p-value = P(|T| > |t|) is calculated based on the distribution of T under the null hypothesis. • The p-value is a function of the test statistic and can be viewed as a random variable. • E.g., p-value = 2(1 − F(|t*|)), where F is the cdf of t(N+M−2). • A small p-value represents evidence against the null hypothesis → differentially expressed in our case.

  18. Permutation test • A non-parametric way of computing a p-value for any test statistic. • In the MM study, each gene has (14 choose 5) = 2002 different test values obtainable from permuting the group labels. • Under the null hypothesis that the distributions for the two groups are identical, all these test values are equally probable. What is the probability of getting a test value at least as extreme as the observed one? This is the permutation p-value.

  19. Permutation technique. Compute $TS_0$; compute $TS_1$; compute $TS_2$; compute $TS_3$; … The set of $TS_i$ forms the empirical distribution of the test statistic TS.
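
An exact permutation p-value for one gene can be sketched by enumerating every relabeling. The example below is an illustration under stated assumptions: it uses the row 7 values from the Wilcoxon slide split as 5 + 9, a difference of means as the test statistic (any statistic could be substituted), and a two-sided notion of "at least as extreme". The names `permutation_p_value` and `mean_diff` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(g1, g2, stat):
    """Exact two-sided permutation p-value for an arbitrary statistic."""
    pooled = g1 + g2
    observed = abs(stat(g1, g2))
    n_total = n_extreme = 0
    # Every way of choosing which samples receive the group-1 label.
    for idx in combinations(range(len(pooled)), len(g1)):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        n_total += 1
        if abs(stat(a, b)) >= observed:
            n_extreme += 1
    return n_extreme / n_total, n_total


def mean_diff(a, b):
    return mean(a) - mean(b)


group1 = [16, 253, 633, 1008, 708]
group2 = [36, 72, 28, 14, 33, 19, 49, 58, 23]
p, n_perm = permutation_p_value(group1, group2, mean_diff)
print(f"{n_perm} relabelings, permutation p-value = {p:.4f}")
```

Because the observed labeling is itself one of the relabelings, the p-value can never be smaller than 1/2002 for this design.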

  20. Scientific Significance • Fold change FC = • May not be high when statistical significance is high. • Not an appropriate measure if the dispersion is not taken into consideration.

  21. Conservative fold change: CFC = max(25th percentile of sample 1 / 75th percentile of sample 2, 25th percentile of sample 2 / 75th percentile of sample 1).

  22. Sample 1: Normal(100, 1). Sample 2: Normal(103, 1). CFC = 1.0164.
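
The conservative fold change can be sketched both for observed samples and for the two normal distributions in this example. The sketch assumes the second parameter of Normal(·, 1) is the standard deviation (variance and standard deviation coincide at 1), and it uses the standard library's inclusive quartile method; other percentile conventions give slightly different sample values. The names `cfc` and `cfc_normal` are hypothetical.

```python
from statistics import NormalDist, quantiles


def cfc(sample1, sample2):
    """Conservative fold change from two samples of expression values."""
    q1_a, _, q3_a = quantiles(sample1, n=4, method="inclusive")
    q1_b, _, q3_b = quantiles(sample2, n=4, method="inclusive")
    return max(q1_a / q3_b, q1_b / q3_a)


def cfc_normal(mu1, sd1, mu2, sd2):
    """Same quantity using the theoretical quartiles of two normals."""
    a, b = NormalDist(mu1, sd1), NormalDist(mu2, sd2)
    return max(a.inv_cdf(0.25) / b.inv_cdf(0.75),
               b.inv_cdf(0.25) / a.inv_cdf(0.75))


print(round(cfc_normal(100, 1, 103, 1), 4))  # 1.0164
```

The result stays close to 1 even though the two distributions barely overlap, which illustrates why a gene can be statistically significant without a large fold change.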

  23. Figure panels: CFC = 2.89; CFC = 3.53; CFC = 1.45; CFC = 1.07.

  24. P-values and FC contain different information.

  25. Gene Selection and Ranking • A high threshold of statistical significance → select genes with p-values smaller than a threshold. • The selected genes are ordered according to their scientific significance (i.e., ranked by fold change).
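
The two-step procedure (filter by p-value, then rank by fold change) can be sketched as follows. The gene names, p-values, fold changes, and the 0.01 threshold are all hypothetical values chosen for illustration.

```python
def select_and_rank(genes, p_threshold=0.01):
    """Keep genes below the p-value threshold, ordered by fold change."""
    selected = [g for g in genes if g["p"] < p_threshold]
    return sorted(selected, key=lambda g: g["fc"], reverse=True)


# Hypothetical per-gene results, for illustration only.
genes = [
    {"name": "geneA", "p": 0.001, "fc": 1.2},
    {"name": "geneB", "p": 0.200, "fc": 4.0},
    {"name": "geneC", "p": 0.004, "fc": 3.1},
    {"name": "geneD", "p": 0.009, "fc": 2.0},
]
ranked = [g["name"] for g in select_and_rank(genes)]
print(ranked)  # ['geneC', 'geneD', 'geneA']
```

Here geneB has the largest fold change but is excluded because it fails the significance filter, while geneA passes the filter but ranks last.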

  26. The False Positive Rate (FPR) • If we select genes with p-value < 0.01, then the probability of making a positive call when the gene is in fact not differential is less than 0.01. Thus selection by p-value controls the FPR. • However, if we have 12,000 genes in a microarray, then an FPR of 0.01 still allows up to 120 false positives. To make a sensible decision, we must take multiple comparisons into consideration.

  27. Dealing with Multiple Comparisons • Bonferroni inequality: to control the family-wise error rate for testing m hypotheses at level α, we need to control the FPR for each individual test at α/m. • Then P(false rejection of at least one hypothesis) ≤ α, or equivalently P(no false rejection) ≥ 1 − α. • This is appropriate for some applications (e.g., testing a new drug versus several existing ones), but is too conservative for our task of gene selection.
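
A minimal sketch of Bonferroni selection follows. The p-values and the choice α = 0.05 are hypothetical, and the function name `bonferroni_select` is an illustration rather than anything from the slides.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Indices of tests rejected at family-wise level alpha."""
    cutoff = alpha / len(p_values)  # per-test threshold alpha / m
    rejected = [i for i, p in enumerate(p_values) if p < cutoff]
    return rejected, cutoff


# Hypothetical p-values for five genes.
p_values = [0.00001, 0.004, 0.03, 0.0000002, 0.5]
rejected, cutoff = bonferroni_select(p_values, alpha=0.05)
print(rejected)  # [0, 1, 3]

# With 12,000 genes the per-gene threshold becomes very small.
print(0.05 / 12000)
```

The second print shows how strict the per-gene threshold gets at microarray scale (roughly 4 in a million), which is the sense in which the correction is conservative for gene selection.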
