Resolving membership in a study in shared aggregate genetics data. David W. Craig, Ph.D. Investigator & Associate Director Neurogenomics Division dcraig@tgen.org. Genome-wide Association Studies.
Genome-wide Association Studies
• Genome-wide association studies (GWAS) genotype millions of single nucleotide polymorphisms (SNPs) across thousands of individuals.
• SNPs are typically biallelic and genotypes are diploid:
  • CC/CT/TT
  • 00/01/11
• Due to ancestral meiotic recombination, SNPs are not independent of neighboring variants. They are often in linkage disequilibrium (LD).
• Because of LD, a SNP may be associated with disease only through its correlation with a different, functional variant.
• Summary stats for a SNP across hundreds/thousands of individuals:
  • Allele frequencies: 33% C / 77% T for cases and 45% C / 55% T for controls
  • P = 10^-8
  • Genotype counts: CC=508 / CT=250 / TT=108
  • OR = 1.8
(Nature Reviews Genetics)
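As a minimal sketch of how the summary statistics on this slide relate to each other, the Python below derives allele frequencies from diploid genotype counts and computes an allelic odds ratio. The function names are hypothetical, and the odds-ratio inputs in the last line are illustrative values, not figures from the talk.

```python
def allele_frequencies(n_cc, n_ct, n_tt):
    """Return (freq_C, freq_T) from diploid genotype counts.

    Each individual carries two alleles, so CC contributes two C alleles
    and CT contributes one.
    """
    total_alleles = 2 * (n_cc + n_ct + n_tt)
    freq_c = (2 * n_cc + n_ct) / total_alleles
    return freq_c, 1.0 - freq_c


def allelic_odds_ratio(case_freq, control_freq):
    """Odds of carrying the allele in cases divided by the odds in controls."""
    case_odds = case_freq / (1.0 - case_freq)
    control_odds = control_freq / (1.0 - control_freq)
    return case_odds / control_odds


# Genotype counts from the slide: CC=508 / CT=250 / TT=108
freq_c, freq_t = allele_frequencies(508, 250, 108)
print(f"C = {freq_c:.1%}, T = {freq_t:.1%}")  # C = 73.1%, T = 26.9%

# Illustrative (assumed) allele frequencies, not taken from the slide
print(allelic_odds_ratio(0.55, 0.45))
```

These few numbers per SNP are exactly the kind of aggregate data the rest of the talk is concerned with.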
Resolving Identity from aggregate genetics data
• GWAS are expensive, requiring genotyping of thousands of individuals.
• They often require consortiums of consortiums.
• Sharing individual-level data was and is a challenge.
• Sharing summary-level metadata is a reasonable option.
• In 2007, summary allele frequencies and genotype counts were routinely placed on the web for all SNPs.
• In 2008, after broad deliberation with the scientific community, we published a forensics paper showing that even crude estimates of allele frequency are enough to resolve individuals.
• "Resolve" is the term we use on purpose. "Identify" has multiple meanings, particularly in GWAS.
Example Aggregate Data
SNP         % A allele (~500 cases)   % A allele (~500 controls)
rs903252    25%                       26%
rs232323    15%                       15%
rs323555    29%                       29%
rs232343    73%                       75%
rs233432    21%                       22%
rs234312    5.1%                      5.1%
rs163232    3.1%                      2.8%
rs8392731   15%                       16%
rs238764    7.3%                      7.1%
rs383745    45%                       54%
Other SNP aggregate data types: genotypes, odds ratios, p-values, etc.
Visual example (SNP data as visualized)
[Figure: 250,000 pixels; genotype shading AA = 1.0, AB = 0.5, BB = 0]
After merging, individual images still resolvable
[Figure panels: No Adjustment; Auto Contrast & Smooth Filter]
Conceptual Approach
SNP        Reference Data Set   Data Set in Question   Person of Interest   Directional score
rs903252   25%                  35%                    100%                 +10
rs232323   15%                  13%                    50%                  -2
rs323555   29%                  39%                    100%                 +10
rs232343   73%                  51%                    0%                   +22
rs233432   21%                  32%                    100%                 +11
rs234312   5%                   15%                    50%                  +10
rs163232   3%                   0%                     0%                   +3
…          …                    …                      …                    …
Equations (one approach of many!)
Directional scores D for the seven SNPs in the Conceptual Approach table: +10, -2, +10, +22, +11, +10, +3
mean(D) = 9.1
sd(D) = 7.4
s = 7 (number of SNPs)
T = mean(D) / ( sd(D) / √s )
3.2 ≈ 9.1 / ( 7.4 / √7 )
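The worked numbers on this slide can be reproduced with a short Python sketch. It assumes the directional score is |person - reference| - |person - data set in question|, which is consistent with every row of the slide's table. The variable and function names are hypothetical, and this is one approach of many rather than a definitive implementation.

```python
import math
import statistics

# Per SNP: (reference %, data set in question %, person of interest %),
# copied from the slide's table.
snps = {
    "rs903252": (25, 35, 100),
    "rs232323": (15, 13, 50),
    "rs323555": (29, 39, 100),
    "rs232343": (73, 51, 0),
    "rs233432": (21, 32, 100),
    "rs234312": (5, 15, 50),
    "rs163232": (3, 0, 0),
}


def directional_score(reference, mixture, person):
    """Positive when the person is closer to the data set in question
    than to the reference data set (assumed form of the score)."""
    return abs(person - reference) - abs(person - mixture)


scores = [directional_score(*values) for values in snps.values()]
mean_d = statistics.mean(scores)
sd_d = statistics.stdev(scores)  # sample standard deviation
s = len(scores)                  # number of SNPs
t_stat = mean_d / (sd_d / math.sqrt(s))

print(scores)  # [10, -2, 10, 22, 11, 10, 3]
print(round(mean_d, 1), round(sd_d, 1), round(t_stat, 1))  # 9.1 7.4 3.2
```

With only seven SNPs the statistic is modest. The point of the approach is that T grows roughly with √s, so the very large number of SNPs in a GWAS lets a consistent small shift toward the data set in question become detectable.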
Impact
• NIH policy was changed.
• Summary-level data are no longer freely available on the web in a distributed, unrestricted manner.
• Additional papers refined the math and described limitations.
Managing Risk
• Distributing results of studies on human subjects inherently increases the risk of a person being identifiable.
• Context is important. The concept of Positive Predictive Value (PPV) can provide a measure.
• PPV can also account for 'at-risk' populations.
• We are currently working with NIH on guidance for measuring the risk posed by a given dataset.
• The approaches leveraged a critical concept of directionality, specific to genotype data and frequency tables.
• P-values represent a fundamentally different data type with low information content.
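To illustrate why context matters for PPV, here is a small Python sketch using the standard Bayes definition of positive predictive value. The slides give no formula or numbers, so the function name and every input value below are assumptions chosen only for illustration.

```python
def positive_predictive_value(sensitivity, specificity, prior):
    """Probability that a person flagged as a study member really is one.

    sensitivity: P(flagged | member)
    specificity: P(not flagged | non-member)
    prior:       P(member) before testing, which depends on context,
                 e.g. how large the 'at-risk' population is
    """
    true_positives = sensitivity * prior
    false_positives = (1.0 - specificity) * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical test characteristics, evaluated under two different priors.
for prior in (0.001, 0.5):
    ppv = positive_predictive_value(0.99, 0.999, prior)
    print(f"prior = {prior}: PPV = {ppv:.3f}")
```

The same test gives a PPV near 0.5 when the person is drawn from a large population and near 1.0 when the person is already known to come from a small at-risk group, which is the sense in which PPV accounts for context.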
The era of whole-genome sequencing is approaching
• SNPs are common variants, usually defined as having a frequency greater than 1%.
• Whole-genome sequencing and exome sequencing inherently measure rare variants.
• Rare variants can be highly informative, particularly in combination.
• Approaches need to be explored for summarizing results without revealing identity.
Acknowledgements
• Lab
  • Jennifer Dinh
  • Szabolcs Szelinger
  • Holly Benson
  • Meredith Sanchez-Castillo
  • Brooke Hjelm
• Informatics
  • Nils Homer, Ph.D.
  • Tyler Izatt
  • Jessica Aldrich
  • Alexis Christoforides
  • Ahmet Kurdoglu
  • James Long
  • Shripad Sinari
• Funding
  • NINDS U24NS051872
  • State of Arizona
  • NHGRI U01HG005210
  • This work: ENDGAME (NHLBI U01 HL086528)