1 / 64

Design and Analysis of Genome-wide Association Studies

Design and Analysis of Genome-wide Association Studies. David Evans. Methods of gene hunting. rare, monogenic (linkage). Effect Size. common, complex (association). Frequency. Historical gene mapping. Glazier et al, Science (2002). Too optimistic about sample size.

colum
Download Presentation

Design and Analysis of Genome-wide Association Studies

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Design and Analysis of Genome-wide Association Studies David Evans

  2. Methods of gene hunting rare, monogenic (linkage) Effect Size common, complex (association) Frequency

  3. Historical gene mapping Glazier et al, Science (2002).

  4. Too optimistic about sample size Linkage not powerful enough! Reasons for Failure Inadequate Marker Coverage (Candidate gene studies)

  5. Linkage Marker Gene1 Linkage disequilibrium Mode of inheritance Linkage Association Gene2 Complex Phenotype Individual environment Gene3 Common environment Polygenic background Reasons for Failure? Weiss & Terwilliger (2000) Nat Genet

  6. Enabling Genome-wide Association Studies HAPlotype MAP High throughput genotyping Large cohorts

  7. Wellcome Trust Case Control Consortium

  8. 8q24 ARTS 6q251 FTO 2q36 INS 18p11 12q24 8q24 2q35 LSP1 IL2RA FTO 8q24 Successes… MSMB Breast ca Prostate ca Colorectal ca Crohn’s Dis IBD T1D T2D Obesity Triglycerides CAD Coeliac Dis Rheumatoid A Ankylosing S Alzheimer Dis Glaucoma Asthma 10q21 TCF2 Common genes = common etiology? JAZF1 PTPN2 WFS1 IFIH1 Some in gene deserts CTBP2 KCNJ11 FAM92B CTLA4 Large relative risks does not = success HNF1B NKX2-3 TCF7L2 ERBB3 11q13 Drug targets NCF4 CD25 SLC30A8 3des. PTPN22 PHOX2B PPARG KIAAA035D NUDT11 IRGM HHEX SLC22A3 BSN IGF2BP2 MAPKI3 KLK3 ATG16L1 TNRC9 CDKAL1 IL21 5p13 IL23R LMTK2 SMAD7 9p21 FGFR2 NOD2 IL2 PTPN22 GALP LOXL1 ORMDL IL23R GCKR 9p21

  9. Study Design

  10. Case- Control Studies (Multiplicative model;r2 =1; RRAa = 1.2; α = 5 x 10-7)

  11. Most efficient ratio is 1:1 Sometimes difficult to recruit cases, in this situation power can still be increased by ascertaining controls In the hypothetical situation of an infinite number of controls, only half the number of cases would be required Most increase in power occurs when the number of controls is 3 - 5 times the number of cases Case to Control Ratio

  12. Early age of onset Family cases Quantitative traits- Extreme cases BUT must be careful… Other Strategies to Increase Power Minimize phenotypic heterogeneity (500 individuals taken from top and bottom; α = 5 x 10-7)

  13. Misclassifying cases is not the same as misclassifying controls. The effect of each depends on the prevalence of disease For example, for diseases where prevalence less than 10% much more important to ensure cases are truly affected than controls are really unaffected Use of historic controls (but note stratification; batch effects; platform differences) Phenotypic Misclassification Misclassification in psychiatric genetics Random misclassification should not affect type I error but will decrease power

  14. TDT vs Case Control p = 0.1; RAA = RAa = 2 Number of units similar for each => 2/3 Number of individuals for TDT Prevalence affects CC power but not TDT power

  15. Quantitative Traits Little power lost by analyzing families relative to singletons It may be efficient to genotype only some individuals in larger pedigrees Pedigrees allow error checking, within family tests, parent-of-origin analyses, joint linkage and association etc Visscher et al. (2008) EJHG

  16. Genotyping Platform

  17. Selecting Markers: Strategies common gene centric all common nonsynonymous rare common Frequency too difficult? agnostic genes coding Function Focus

  18. Some Commercial Alternatives… Affymetrix SNP array 5.0 (500K) Affymetrix SNP array 6.0 (1.8M) Illumina 317K -> Illumina 370K Illumina 550K -> Illumina 610K Illumina 1M Illumina Human Exon 510S Illumina Human NS_12 Beadchip (15K)

  19. Ideal tag sets 500,000 tags SNPs to tag all common variation in CEU at r2 > 0.8 Diminishing returns as coverage increases (e.g 250K tags 85% of genome) Linear relationship for “singleton” SNPs How many SNPs to tag the genome? Barrett & Cardon (2006) Nat Genet

  20. How Do The Chips Do? Anderson et al. (2008) Nature Genetics Some of the difference in coverage can be recovered through imputation If sample size limited, but funding not, use chip with best coverage If cost limited but sample size not use Illumina 300K? (Cost efficiency)

  21. Rare SNPs are not tagged well by common SNPs! Hapmap and SNP chips biased towards common variants Most SNPs are Rare

  22. What about nsSNP chips? Non-synonymous SNPs produce changes in amino acid sequence Most common nsSNPs tagged by existing genome-wide products Little to add to genome-wide chips in terms of identifying common variants May help identify rare variants of intermediate penetrance Evans et al. (2008) EJHG

  23. “Cleaning” Data

  24. Genotypes are not raw data Trade off between stringency and call rate (no universal value) Raw intensities of ALL putative associations should be checked!

  25. Hardy Weinberg Equilibrium Allele frequency Mendelian Inconsistencies SNP Quality Control Missing Data Rate (SNPs, Individuals, cases vs controls)

  26. Sample Heterozygosity

  27. Sample Gender

  28. Association Analysis

  29. SNP marker data can be represented in 2x3 table. Test of association where X2 has χ2 distribution with 2 degrees of freedom under null hypothesis. Genotypic tests • Sensitive to genotyping error • Often not as powerful as trend test

  30. Each individual contributes two counts to 2x2 table. Test of association where X2 has χ2 distribution with 1 degrees of freedom under null hypothesis. Assumes cases and controls in HWE Assumes multiplicative disease model Allele-based tests

  31. Logistic regression framework • Model case/control status within a logistic regression framework. • Let πi denote the probability that individual i is a case, given their genotype Gi. • Logit link function where

  32. Indicator variables • Represent genotypes of each individual by indicator variables:

  33. Likelihood calculations • Log-likelihood of case-control data given marker genotypes where yi = 1 if individual i is a case, and yi = 0 if individual i is a control. • Maximise log-likelihood over β parameters, denoted . • Models fitted using PLINK. • Additive model equivalent to Armitage test for trend

  34. Model comparison • Compare models via deviance, having a χ2 distribution with degrees of freedom given by the difference in the number of model parameters.

  35. Covariates • It is straightforward to incorporate covariates in the logistic regression model: • age, gender, and other environmental risk factors. • genotypes at unlinked markers to control for population stratification. • Generalisation of link function, e.g. for additive model: where Xij is the response of individual i to the jth covariate, and γj is the corresponding covariate regression coefficient.

  36. Controlling for Population Stratification

  37. Population structure Marchini, Nat Genet (2004)

  38. Genomic control 2 No stratification Test locus Unlinked ‘null’ markers 2 Stratification  adjust test statistic ‘λ’ is inflation factor (=1 if no inflation)

  39. QQ plots McCarthy et al. (2008) Nature Genetics

  40. Population structure -  Genomic control- genome-wide inflation of median test statistic

  41. Crohn’s collection center Center 3:  = 1.77 All others:  = 1.09

  42. Crohn’s Multidimensional Scaling

  43. Principal Components Analysis • Principal Components Analysis is a data reduction technique where many variables are reduced to a few “principal components”: • Each component describes as much variability as possible • Components are orthogonal and describe consecutively smaller proportions of the variance • First few components reflect population ancestry • Genotypes and phenotypes are adjusted by amounts attributable to ancestry along each component by computing residuals of linear regressions • Association statistics are computed using ancestry adjusted genotypes and phenotypes

  44. Geographic Interpretation North-west Europe South-east Europe

  45. Imputation

  46. Imputation Recombination Rate 1 1 0 1 1 0 1 0 1 1 0 1 1 0………. 1 1 0 1 1 0 1 0 1 1 0 1 1 0………. HapMap Phase II 1 1 0 1 1 0 1 0 1 1 0 1 1 0………. 1 1 0 1 1 0 1 0 1 1 0 1 1 0………. 2 1 1 2 ? 2 1 ?? 1 ? 2 2 0………. 2 1 1 2 ? 2 2 ?? 0 ? 2 1 0………. Cases 2 ? 1 2 ? 2 1 ?? 1 ? 1 1 0………. 2 1 2 1 ? 2 2 ?? 1 ? 1 1 0………. 2 1 1 1 ? 2 1 ?? 1 ? 2 2 0………. 2 1 1 1 ? 2 2 ?? 0 ? 2 2 0………. Controls 1 0 1 2 ? 2 1 ?? 1 ? 1 1 ?………. 2 1 2 1 ? 2 2 ?? 1 ?1 1 ?……….

  47. Imputation

  48. Interpretation and Prioritizing SNPs

  49. Asymptotic P values • “The probability of observing the test result or a more extreme value than the test result under the null hypothesis” • The p value is NOT the probability that the null hypothesis is true • The probability that the null/alternate hypothesis is true is a function of the evidence contained in the data (p value), the power of the test, and the prior probability that the association is true/false • The p value is a fluid measure of the strength of evidence against the null hypothesis that was designed to be interpreted in conjunction with other (pre-existing) evidence

More Related