Statistical Methods for Rare Variant Association Test Using Summarized Data

Qunyuan Zhang, Ingrid Borecki, Michael A. Province. Division of Statistical Genomics.


Presentation Transcript


  1. Statistical Methods for Rare Variant Association Test Using Summarized Data. Qunyuan Zhang, Ingrid Borecki, Michael A. Province. Division of Statistical Genomics.

  2. Motivation. Next-generation sequencing => rare variants. Two types of data:
      • Individual level
      • Summarized level: pooled DNA sequencing; public data (as control)

  3. Models for Individual-level Data: single-variant test (regular GWAS); collective/group test; burden/collapsing test.

  4. Methods for Individual-level Data: CMC (Li and Leal, 2008), WSS (Madsen and Browning, 2009), VT (Price et al., 2010), aSum (Han and Pan, 2010), KBAC (Liu and Leal, 2010), RBT (Ionita-Laza et al., 2011), PWST (Zhang et al., 2011), SKAT (Wu et al., 2011), EREC (Lin et al., 2011), …

  5. Methods for Summarized Data

  6. An Example of Summarized Data. Source: Jonathan C. Cohen et al., Science 305, 869 (2004).

  7. An Example of Summarized Data (cont.)

  8. Q-Q Plots of Existing Methods (under the null). [Panels: EFT, TFT, CAST, C-alpha.]
      • EFT and C-alpha: inflated with false positives.
      • TFT and CAST: no inflation, but need to assume a single direction of effects.
      • Objective: more general, non-inflated, powerful methods …

  9. Structure of Summarized Data. [Figure: one table per variant: variant 1, variant 2, variant 3, …, variant i, …, variant k.] Strategy: instead of testing the total frequency/number, we test the randomness of all tables.
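To make the contrast between "total frequency/number" and "all tables" concrete, here is a minimal sketch. The table layout (variant copies vs. remaining sampled units, for cases and controls) and the function names are assumptions for illustration, since the slide's figure did not survive in the transcript.

```python
def variant_tables(case_counts, ctrl_counts, n_case, n_ctrl):
    """Build one 2x2 table per variant (assumed layout, not taken from the slide):
    ((copies in cases, n_case - copies), (copies in controls, n_ctrl - copies))."""
    return [((a, n_case - a), (c, n_ctrl - c))
            for a, c in zip(case_counts, ctrl_counts)]


def collapsed_totals(case_counts, ctrl_counts):
    """Total variant count in cases and in controls, summed over all variants."""
    return sum(case_counts), sum(ctrl_counts)


# Hypothetical counts: variant 1 is seen only in cases, variant 2 only in controls.
case_counts = [6, 0]
ctrl_counts = [0, 6]

tables = variant_tables(case_counts, ctrl_counts, n_case=100, n_ctrl=100)
totals = collapsed_totals(case_counts, ctrl_counts)
```

In this made-up example the collapsed totals are identical for cases and controls, so a test on totals sees nothing, while each per-variant table is clearly non-random. That is the situation with effects in both directions.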

  10. Exact Probability Test (EPT)
      1. Calculate the probability of each table based on the hypergeometric distribution.
      2. Calculate the logarithm of the joint probability (L) for multiple tables.
      3. Enumerate all possible table combinations and their L scores.
      4. Calculate the p-value: P = Prob.( )
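The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The tables are treated as independent, each variant's margins are held fixed, and, because the transcript leaves the argument of Prob.( ) empty, the p-value is assumed to be the total null probability of all table combinations whose L is no larger than the observed L. The function names are hypothetical.

```python
from itertools import product
from math import comb, exp, log


def table_probs(m, n_case, n_ctrl):
    """Hypergeometric probability that a of the m variant copies fall in cases,
    for a = 0..m (math.comb returns 0 for impossible splits)."""
    total = comb(n_case + n_ctrl, m)
    return [comb(n_case, a) * comb(n_ctrl, m - a) / total for a in range(m + 1)]


def ept_p_value(case_counts, ctrl_counts, n_case, n_ctrl):
    """Exact p-value by full enumeration (assumed definition, see lead-in)."""
    totals = [a + c for a, c in zip(case_counts, ctrl_counts)]
    per_variant = [table_probs(m, n_case, n_ctrl) for m in totals]

    # Step 2: log joint probability of the observed tables.
    l_obs = sum(log(p[a]) for p, a in zip(per_variant, case_counts))

    # Steps 3-4: enumerate every combination of case counts and accumulate
    # the probability of those at least as unlikely as the observed one.
    p_value = 0.0
    for combo in product(*(range(m + 1) for m in totals)):
        probs = [p[a] for p, a in zip(per_variant, combo)]
        if any(q == 0.0 for q in probs):
            continue  # impossible table, contributes nothing
        l = sum(log(q) for q in probs)
        if l <= l_obs + 1e-9:  # small tolerance for floating-point ties
            p_value += exp(l)
    return p_value
```

The number of combinations is the product of (m + 1) over variants, so the enumeration grows quickly with the number of variants and their counts. This fits the slides' remark that EPT needs more computer time.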

  11. Likelihood Ratio Test (LRT): binomial distribution; maximum likelihood estimation.
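The slide gives only these two keywords, so the following is one plausible reading rather than the authors' formulation. For each variant, the counts in cases and controls are assumed binomial; the null hypothesis shares one frequency between the groups, the alternative fits one per group by maximum likelihood, and the summed statistic is compared with a chi-square distribution with one degree of freedom per variant. All function names are hypothetical.

```python
from math import erfc, exp, gamma, log, sqrt


def _xlogp(x, p):
    """x * log(p), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(p)


def binom_loglik(x, n, p):
    """Binomial log-likelihood without the constant binomial coefficient."""
    return _xlogp(x, p) + _xlogp(n - x, 1.0 - p)


def chi2_sf(x, df):
    """Chi-square upper tail for integer df, via the incomplete-gamma recurrence."""
    h = x / 2.0
    a, q = (1.0, exp(-h)) if df % 2 == 0 else (0.5, erfc(sqrt(h)))
    while a < df / 2.0:
        q += h ** a * exp(-h) / gamma(a + 1.0)
        a += 1.0
    return q


def lrt(case_counts, ctrl_counts, n_case, n_ctrl):
    """Return (statistic, df, asymptotic p-value) under the assumptions above."""
    stat = 0.0
    for x, y in zip(case_counts, ctrl_counts):
        p0 = (x + y) / (n_case + n_ctrl)          # MLE under the null
        ll0 = binom_loglik(x, n_case, p0) + binom_loglik(y, n_ctrl, p0)
        ll1 = (binom_loglik(x, n_case, x / n_case)  # MLEs under the alternative
               + binom_loglik(y, n_ctrl, y / n_ctrl))
        stat += 2.0 * (ll1 - ll0)
    df = len(case_counts)
    return stat, df, chi2_sf(stat, df)
```

Because the p-value relies on a large-sample chi-square approximation, some miscalibration with rare variants and small N would not be surprising in a test of this kind.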

  12. Q-Q Plots of EPT and LRT (under the null). [Panels: EPT N=500, LRT N=500, EPT N=3000, LRT N=3000.]

  13. Power Comparison (significance level = 0.00001). Simulation: logistic model; N = 500, 1000, 3000; 50% cases, 50% controls; 5-15 variants; MAF < 1%. Positive causal: 80% (OR = 2 to 6); neutral: 20%; negative causal: 0%. [Plots: power vs. sample size.]

  14. Power Comparison (significance level = 0.00001). Simulation: logistic model; N = 500, 1000, 3000; 50% cases, 50% controls; 5-15 variants; MAF < 1%. Positive causal: 60% (OR = 2 to 6); neutral: 20%; negative causal: 20% (OR = 1/6 to 1/2). [Plot: power vs. sample size.]

  15. Power Comparison (significance level = 0.00001). Simulation: logistic model; N = 500, 1000, 3000; 50% cases, 50% controls; 5-15 variants; MAF < 1%. Positive causal: 40% (OR = 2 to 6); neutral: 20%; negative causal: 40% (OR = 1/6 to 1/2). [Plot: power vs. sample size.]
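A simulation of this kind can be sketched as below. The slides state only the logistic model, the sample sizes, the 50/50 case-control split, the MAF bound and the odds-ratio ranges. Everything else here is an assumption made for illustration: the baseline log-odds, the uniform draw of MAFs, the draw of odds ratios uniformly on the log scale, and the rejection sampling used to reach equal numbers of cases and controls.

```python
import math
import random


def simulate_summary(n, n_variants, frac_pos, frac_neg, rng,
                     baseline_log_odds=-2.0, maf_range=(0.001, 0.01)):
    """Simulate per-variant allele counts in n/2 cases and n/2 controls.
    baseline_log_odds and maf_range are assumed values, not from the slides."""
    n_pos = round(frac_pos * n_variants)
    n_neg = round(frac_neg * n_variants)
    betas = []
    for j in range(n_variants):
        if j < n_pos:                      # risk variants, OR in [2, 6]
            betas.append(rng.uniform(math.log(2), math.log(6)))
        elif j < n_pos + n_neg:            # protective variants, OR in [1/6, 1/2]
            betas.append(-rng.uniform(math.log(2), math.log(6)))
        else:                              # neutral variants
            betas.append(0.0)
    mafs = [rng.uniform(*maf_range) for _ in range(n_variants)]

    half = n // 2
    counts = {True: [0] * n_variants, False: [0] * n_variants}
    sampled = {True: 0, False: 0}
    while sampled[True] < half or sampled[False] < half:
        # Genotype = number of minor alleles (0, 1 or 2) at each variant.
        geno = [(rng.random() < f) + (rng.random() < f) for f in mafs]
        log_odds = baseline_log_odds + sum(b * g for b, g in zip(betas, geno))
        is_case = rng.random() < 1.0 / (1.0 + math.exp(-log_odds))
        if sampled[is_case] >= half:
            continue                       # this group is already full
        sampled[is_case] += 1
        for j, g in enumerate(geno):
            counts[is_case][j] += g
    return counts[True], counts[False]


rng = random.Random(1)
case_counts, ctrl_counts = simulate_summary(200, 5, 0.8, 0.0, rng)
```

Power at a given significance level would then be estimated by repeating the simulation many times and recording how often a test applied to the returned counts falls below that level.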

  16. Application. Cases: 460 ovarian cancer cases, germline exome data, from TCGA. Controls: ~3500 individuals from the NHLBI exome project. [Plot: -log10 p-values of 933 cancer-related genes.]

  17. Individual-level Data Based Methods vs. Summarized Data Based Methods. An interesting question: if we have individual-level data but choose to perform a summarized data based analysis, will there be any power gain or loss?

  18. Power Comparison: individual-level data vs. summarized data (N = 1000, significance level = 0.00001). Individual-level data based methods: CMC (Li & Leal, 2008), SKAT (Wu et al., 2011). [Plot: power vs. variant proportion, positive : neutral : negative (%).]

  19. Conclusions
      • EFT and C-alpha produce inflated p-values.
      • TFT and CAST produce correct p-values, but lose power in detecting bi-directional effects.
      • EPT produces correct p-values and maintains power regardless of effect directions, but requires more computer time.
      • LRT produces slightly biased p-values for small N, which can be improved by larger N. It has power similar to EPT and requires less computer time, making it a good alternative for large datasets.
      • If no confounders need to be modeled, there is no significant loss of power in the use of summarized data.
      (This study has not been published.)

  20. Acknowledgements: Dr. Li Ding, Charles Lu, Krishna-Latha Kanchi (for providing the TCGA and NHLBI exome data).
