Selecting Initial GWAS and replication studies David Hunter Harvard School of Public Health

Selecting Initial GWAS and replication studies David Hunter Harvard School of Public Health Brigham and Women’s Hospital Broad Institute of MIT and Harvard

Initial Study for GWAS • Cases and controls well matched with respect to ancestry to minimize population stratification (restriction to one self-identified group) • Genomic control or other methods e.g. Eigenstrat (Price et al, 2006), may compensate for looser matching

Control of population stratificatione.g. hair colorin Nurses’ Health Study (European ancestry) Chi-squared inflation factors and Q-Q plots of –log10 p-values with no adjustment for population stratification and adjusting for the top four and fifty eigenvectors (Price et al, 2006) 45, 19 and 19 SNPs (respectively) with p<10-7 not shown Kraft P, unpublished

Article Nature 447, 661-678 (7 June 2007) | doi:10.1038/nature05911; Received 26 March 2007; Accepted 11 May 2007 Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls The Wellcome Trust Case Control Consortium

Conclusions Broad matching on ancestry and region adequate for discovery of strongest hits Statistical methods for control of population stratification (within populations of European ancestry) adequate to assist in discovery of strongest hits Will more rigorous designs permit discovery of weaker associations? When signal-noise is low, how does noise due to multiple comparisons compare with noise due to poor matching of controls? False negatives the biggest problem (can deal with false +ves via replication).

Criteria for follow-up of initial reports of genotype–phenotype associations Replication studies should be of sufficient sample size to convincingly distinguish the proposed effect from no effect Replication studies should preferably be conducted in independent data sets, to avoid the tendency to split one well-powered study into two less conclusive ones The same or a very similar phenotype should be analysed A similar population should be studied, and notable differences between the populations studied in the initial and attempted replication studies should be described Similar magnitude of effect and significance should be demonstrated, in the same direction, with the same SNP or a SNP in perfect or very high linkage disequilibrium with the prior SNP (r2 close to 1.0) Statistical significance should first be obtained using the genetic model reported in the initial study When possible, a joint or combined analysis should lead to a smaller P-value than that seen in the initial report A strong rationale should be provided for selecting SNPs to be replicated from the initial study, including linkage-disequilibrium structure, putative functional data or published literature Replication reports should include the same level of detail for study design and analysis plan as reported for the initial study Chanock, Maniolo et al. Nature, June 7th 2007

Initial Study for GWAS: technical issues • Standard advice – case and control samples handled exactly the same at every stage • Source of DNA • Blood/buffy coat mostly good results • Buccal cell variable results (Feigelson et al. CEBP, 2007 - encouraging) • Whole genome amplified DNA (Affy OK, Illumina in development)

Replication studies For statistical replication, prefer: • Similar phenotype • Similar ancestry For generalizability, prefer • Different populations • Different ancestry backgrounds (may also help with fine mapping)

Study design? Prospective • Protect from survivor bias • Protect from selection bias • Interpretability of gene-environment analyses • Possibility of interpretable biomarkers

Study quality? Importance depends on strength of signal • To date – little apparent relation between probability of replication and quality • May matter more for weak signals • Sample size may trump quality (within limits)

NCI BPC3 Results: 7909 cases, 8683 controls Rs1447295: Overall p, trend 4 x 10-19 Schumacher et al. Can Res, April 2007

a, rs2981582; b, rs3803662; c, rs889312; d, rs13281615; and e, rs3817198 FGFR2 Forest plots of the per-allele odds ratios for each of the five SNPs reaching genome-wide significance for breast cancer. Easton et al. Nature, May 2007

Cancer Genetic Markers of Susceptibility (CGEMS): http://cgems.cancer.gov

General Strategy for Multistage analysis of Prostate & Breast Cancer Initial GWAS Study 1150 cases/1150 controls 540,000 Tag SNPs Follow-up Study #1 4500 cases/ 4500 controls ~28,000 SNPs Follow-up Study #2 3500 cases/ 3500 controls at least1,500 SNPs 30 ±20 loci Fine Mapping

Committed Studies CGEMS Breast Cancer NHS (GWAS) PLCO WHI Polish C/C ACS EPIC MEC Prostate Cancer PLCO (GWAS) ACS HPFS PHS ATBC CeRePP EPIC MEC

CGEMS: caBIG Posting Pre-Computed Analysis Pre-computed Analysis No Restrictions Raw Genotype Case/control Age (in 5 yrs) Family Hx (+/-) Registration http://cgems.cancer.gov/data

Association Tests • Prostate 10/06 • Breast 04/07 • ~528,000 SNPs • Illumina 550k • Instant Replication! http://cgems.cancer.gov

Additional In silico replication possibilities dbGAP ncbi.nlm.nih.gov/dbgap Framingham nhlbi.nih.gov/about/framingham WTCCC wtccc.org.uk DGI broad.mit.edu/diabetes

Chromosomes 1 2 3 4 5 6 7 8 p q p q -2 -3 -4 -5 22 X 9 10 11 12 13 14 15 16 17 18 19 20 21 p q p q -2 Log10(p-value) -3 -4 FGFR2 -5 -6

The six SNPs with the smallest P values of the 528,173 tested among 1,145 cases of postmenopausal invasive breast cancer and 1,141 controls (full results available at http://cgems.cancer.gov ). • SNP ID Χ2* P* ORhet* ORhomo* Chromosome Gene • rs10510126 25.37 0.0000031 0.59 0.62 10 • rs1219648 23.56 0.0000076 1.24 1.81 10 FGFR2 • rs17157903 23.39 0.0000083 1.60 0.79 7 RELN • rs2420946 23.17 0.0000095 1.25 1.81 10 FGFR2 • rs7696175 22.40 0.0000137 1.38 0.86 4 TLR1,TLR6 • rs12505080 21.99 0.0000168 1.21 0.52 4 • *From analyses adjusting for age, matching factors (see Methods), and three eigenvectors of the principal components identified by Eigenstrat. P value obtained by a score test with 2df. Hunter et al, Nat Gen, May 2007

Scatterplot of P values for the FGFR2 locus from the GWAS.

Results of associations of rs1219648 in the Nurses Health Study, Nurses’ Health Study 2, and the PLCO study. • Study Population Allele Frequency ORhet ORhomo Ptrend • (N cases/N controls) Cases Controls (95% CI) (95% CI) • (%) (%) • Nurses’ Health Study (1,145/1,141) • 45.54 38.47 1.24 1.81 2.0 x 10-6 • (1.04-1.50) (1.43-2.31) • Nurses’ Health Study 2 (302/594) • 48.18 40.57 1.29 1.93 0.002 • (0.95-1.75) (1.31-2.86) • PLCO (919/922) 44.50 41.49 1.06 1.22 0.13 • (0.86-1.30) (0.94-1.58) • ACS CPS-II (555/556) 44.95 37.41 1.32 2.06 0.0002 • (1.02-1.72) (1.42-2.97) • Pooled estimates (2,921/3,213) 1.20 1.64 1.1 x 10-10 • (1.07-1.34) (1.42-1.90)

Results of associations of rs1219648 in the Nurses Health Study, Nurses’ Health Study 2, and the PLCO study. • Study Population Allele Frequency ORhet ORhomo Ptrend • (N cases/N controls) Cases Controls (95% CI) (95% CI) • (%) (%) • Nurses’ Health Study (1,145/1,141) • 45.54 38.47 1.24 1.81 2.0 x 10-6 • (1.04-1.50) (1.43-2.31) • Nurses’ Health Study 2 (302/594) • 48.18 40.57 1.29 1.93 0.002 • (0.95-1.75) (1.31-2.86) • PLCO (919/922) 44.50 41.49 1.06 1.22 0.13 • (0.86-1.30) (0.94-1.58) • ACS CPS-II (555/556) 44.95 37.41 1.32 2.06 0.0002 • (1.02-1.72) (1.42-2.97) • Pooled estimates (2,921/3,213) 1.20 1.64 1.1 x 10-10 • (1.07-1.34) (1.42-1.90) UNFINISHED AGENDA Where is the causal variant? What does this tell us about mechanisms of breast carcinogenesis?

THE HITS KEEP COMING…. UNFINISHED EPIDEMIOLOGIC/PUBLIC HEALTH AGENDA Gene-environment interaction, what do the genes tell us about environmental exposures? Gene-gene interaction Pathway analysis Clinical implications – risk stratification for screening? Intervention? Health policy implications? Much of the substrate data – publicly available or relatively cheap.

NHS/HPFS/PHS GENETIC STUDIES Immaculata De Vivo NHS/HPFS: Peter Kraft Sue Hankinson Hardeep Ranu Shelley Tworoger Crystal Arnone Eric Rimm Carolyn Guo Frank Hu Pati Soule Meir Stampfer Craig Labadie Walt Willett Carolyn Guo Frank Speizer Jiali Han Charles Fuchs Monica Macgrath Ed Giovannucci Chunyan He Andy Chan, Debra Patrick Dennett Schaumberg David Cox Fran Grodstein, Jae Tim Niu Hee Kang Aditi Hazra PHS: Jing Ma Fred Schumacher Mike Gaziano, P Ridker

Harvard cohorts EPIC cohorts ACS cohort Multiethnic Cohort PLCO cohort ATBC cohort BROAD INSTITUTE NCI Core Gen Facility CEPH NCI BPC3 STEERING COMMITTEE: Harvard David Hunter, Michael Gaziano, Julie Buring, Graham Colditz, Walter Willett EPIC,CEPH, Cambridge Elio Riboli, Rudolf Kaaks, Federico Canzian, Gilles Thomas, ACS Michael Thun, Heather Feigelson, Jeanne Calle NCI Richard Hayes, Demetrius Albanes, Bob Hoover, Stephen Chanock; Program - Mukesh Verma MEC & Broad Brian Henderson, Laurence Kolonel, David Altshuler, Malcolm Pike SECRETARIAT: David Hunter, Elio Riboli GENOMICS subgroup: David Altshuler (Chair) Steve Chanock Gilles Thomas Genotyping subgroup: Chris Haiman (Chair) Federico Canzian Alison Dunning Steve Chanock David Cox David Hunter Loic LeMarchand James Mackay STATISTICS subgroup: Dan Stram (Chair) Peter Kraft Rudolf Kaaks Paul Pharoah Malcolm Pike Gilles Thomas Shalom Wacholder PUBLICATIONS COMMITTEE: Michael Thun (Chair) Elio Riboli Brian Henderson David Hunter Graham Colditz Richard Hayes Demetrius Albanes

CGEMS Acknowledgements HSPH David Hunter Peter Kraft Fred Schumacher David Cox ACS Heather Feigelson Carmen Rodriguez Eugenia Calle Michael Thun PLCO Regina Ziegler Chris Berg Saundra Buys Chris MacCarty • NCI • Stephen Chanock • Gilles Thomas • Robert Hoover • Joseph Fraumeni • Daniela Gerhard • Kevin Jacobs • Zhaoming Wang • Meredith Yeager • Robert Welch • Richard Hayes • Sholom Wacholder • Nilanjan Chatterjee • Kai Yu • Margaret Tucker • Marianne Rivera-Silva • NCICB

Selecting initial and replication samples from existing studies I. What studies of the same phenotype exist? II. Can a consortium or collaborative approach provide a study with adequate power for the initial GWAS, along with pre-planned replication studies? III. Do any of these studies have pre-existing data that would increase power e.g. “free” controls for a prior GWAS of another phenotype? IV. Is the phenotype defined in the same or similar manner? V. Are covariate data available, and defined similarly? VI. Do any of the studies have additional phenotypic information e.g. biomarkers that would create opportunities for “added value” analyses, if these are the subjects of the GWAS?

Selecting Initial GWAS and replication studies David Hunter Harvard School of Public Health