## COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS


Eran Halperin
November 10, 2009

**Genetic Factors**

Diagram labels: Genetic Factors, Complex disease, Environmental Factors.

Multiple genes may affect the disease. Therefore, the effect of every single gene may be negligible.

**Each chromosome ‘is’ a sequence over the alphabet {A,G,C,T} (base pairs)**

Copy from mother: ………ACCAGGACGA……
Copy from father: ………ACCAGGACGA……

**Facts about our genome**

• 23 pairs of chromosomes.
• X and Y are the sex chromosomes (XX for women, XY for men).
• 3,300,000,000 base pairs in the human genome.

**The Human Genome Project**

“What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.”

“But our work previously has shown… that having one genetic code is important, but it's not all that useful.”

“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”

Washington, DC, June 26, 2000

**The Vision of Personalized Medicine**

Genetic and epigenetic variants + measurable environmental/behavioral factors would be used for personalized treatment and diagnosis.

**Example: Warfarin**

An anticoagulant drug, useful in the prevention of thrombosis.

• Warfarin was originally used as rat poison.
• Optimal dose varies across the population.
• Genetic variants (VKORC1 and CYP2C9) affect the variation of the personalized optimal dose.

**Association Studies**

Genetic variants such as Single Nucleotide Polymorphisms (SNPs) are tested for association with the trait.

**Where should we look first?**

SNP = Single Nucleotide Polymorphism

person 1: ….AAGCTAAATTTG….
person 2: ….AAGCTAAGTTTG….
person 3: ….AAGCTAAGTTTG….
person 4: ….AAGCTAAATTTG….
person 5: ….AAGCTAAGTTTG….
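The SNP in the five example sequences above can be located mechanically by scanning the aligned sequences column by column. The following is a minimal sketch, not part of the original slides; the function name `find_snps` is hypothetical, and positions are 0-based offsets within the short fragment shown, not genome coordinates.

```python
def find_snps(sequences):
    """Map each variable position to the sorted list of alleles seen there."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    snps = {}
    for pos in range(length):
        alleles = {seq[pos] for seq in sequences}
        if len(alleles) > 1:  # more than one letter observed: a polymorphic site
            snps[pos] = sorted(alleles)
    return snps


# The five example sequences from the slide, without the surrounding dots.
people = [
    "AAGCTAAATTTG",  # person 1
    "AAGCTAAGTTTG",  # person 2
    "AAGCTAAGTTTG",  # person 3
    "AAGCTAAATTTG",  # person 4
    "AAGCTAAGTTTG",  # person 5
]
snps = find_snps(people)
print(snps)  # {7: ['A', 'G']}
```

Only one column differs between the five people, and it carries exactly two letters.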
Each common SNP has only two possible letters (alleles).

**Disease Association Studies**

SNP = Single Nucleotide Polymorphism

Figure annotation: Associated SNP (high Relative Risk)

Cases:
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC

Figure annotation: Associated SNP (lower Relative Risk)

Controls:
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC

**Preliminary Definitions**

• SNP: single nucleotide polymorphism. A genetic variant which may carry a different ‘value’ for different individuals.
• Allele: the variant’s value (A, G, C, or T).
• Most SNPs are bi-allelic: only two alleles are observed in the population.
• Risk allele: the allele which is more common in cases than in controls (denoted R).
• Nonrisk allele: the allele which is more common in controls than in cases (denoted N).

**Relative Risk**

• Risk = G: chances of developing type II diabetes: 30%
• Nonrisk = A: chances of developing type II diabetes: 20%
• Relative Risk: Pr(D|R)/Pr(D|N) = 0.30/0.20 = 1.5

**Other Structural Variants**

• Inversion
• Deletion
• Copy number variant

**Published Genome-Wide Associations through 6/2009**

439 published GWA at p < 5 x 10^-8

NHGRI GWA Catalog, www.genome.gov/GWAStudies

**Public Genotype Data Growth**

Figure: data sets placed on a timeline running from 2001 to 2006.

• Daly et al., Nature Genetics: 103 SNPs, 40,000 genotypes
• Gabriel et al., Science: 3000 SNPs, 400,000 genotypes
• TSC Data, Nucleic Acids Research: 35,000 SNPs, 4,500,000 genotypes
• Perlegen Data, Science: 1,570,000 SNPs, 100,000,000 genotypes
• NCBI dbSNP, Genome Research: 3,000,000 SNPs, 286,000,000 genotypes
• HapMap Phase 2: 5,000,000+ SNPs, 600,000,000+ genotypes

**Chance or Real Association?**

Cases:
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC

Figure annotation: Associated SNP (lower Relative Risk)

Controls:
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC

**How does it work?**

• For every SNP we can construct a contingency table: [table not preserved in the transcript]

**Hypothesis testing**

• Null hypothesis: Pr(R|case) = Pr(R|control)
• Alternative hypothesis: Pr(R|case) ≠ Pr(R|control)
• The model assumes that all individuals are independent (unrelated), and therefore our sample is a random sample from a Binomial distribution.
• Cases sampled from distribution X ~ B(n, Pr(R|cases))
• Controls sampled from distribution Y ~ B(n, Pr(R|controls))

**Hypothesis testing, cont.**

• When n is large, B(n,p) ~ N(np, np(1-p)).
• Under the null hypothesis: [formula not preserved in the transcript]

**P-value**

• Z is called a test statistic (a z-score in this case).
• We can calculate Z* for our data, and then calculate (using the normal approximation): p-value = Pr(|Z| > |Z*|)
• Often we take [formula not preserved in the transcript], which is [formula not preserved in the transcript]

**The curse of dimensionality: corrections of multiple testing**

• In a typical Genome-Wide Association Study (GWAS), we test millions of SNPs.
• If we set the p-value threshold for each test to be 0.05, by chance we will “find” about 5% of the SNPs to be associated with the disease.
• This needs to be corrected.

**Bonferroni Correction**

• If the number of tests is n, we set the threshold to be 0.05/n.
• A very conservative test. If the tests are independent, it is reasonable to use it. If the tests are correlated, it can be too strict.
• Example: If all SNPs are identical, then we lose a lot of power; the false positive rate is reduced, but so is the power.

**Challenge 1: Population Substructure**

**Population Substructure**

• Imagine that all the cases are collected from Africa, and all the controls are from Europe.
• Many association signals are going to be found.
• The vast majority of them are false. Why?
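The thought experiment above can be simulated with the z-score test from the hypothesis-testing slides. This is a rough sketch under stated assumptions: the slide's own formula for the test statistic was not preserved, so the code uses the standard pooled two-proportion z-score, which may differ in detail from what the lecture showed. The allele frequencies, sample sizes, seed, and function names are all hypothetical.

```python
import math
import random


def two_proportion_z(risk_cases, n_cases, risk_controls, n_controls):
    """Pooled two-proportion z-score comparing risk-allele frequencies."""
    p_cases = risk_cases / n_cases
    p_controls = risk_controls / n_controls
    pooled = (risk_cases + risk_controls) / (n_cases + n_controls)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    return (p_cases - p_controls) / std_err


def two_sided_p(z):
    """Pr(|Z| > |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))


def sample_allele_count(rng, n_alleles, freq):
    """Number of risk alleles among n_alleles independent draws."""
    return sum(rng.random() < freq for _ in range(n_alleles))


rng = random.Random(0)        # fixed seed, for reproducibility only
n = 2000                      # alleles sampled per group (hypothetical)
freq_a, freq_b = 0.30, 0.50   # hypothetical frequencies of a SNP with no effect on disease

# Confounded design: every case from population A, every control from population B.
z_confounded = two_proportion_z(sample_allele_count(rng, n, freq_a), n,
                                sample_allele_count(rng, n, freq_b), n)

# Matched design: cases and controls both drawn from population A.
z_matched = two_proportion_z(sample_allele_count(rng, n, freq_a), n,
                             sample_allele_count(rng, n, freq_a), n)

bonferroni = 0.05 / 1_000_000  # Bonferroni threshold for one million tests
print("confounded:", z_confounded, two_sided_p(z_confounded) < bonferroni)
print("matched:   ", z_matched, two_sided_p(z_matched) < bonferroni)
```

In the confounded design the SNP passes even the Bonferroni threshold although it has no effect on the disease; the test is only picking up the frequency difference between the two populations.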
Different evolutionary forces: drift, selection, mutation, migration, population bottleneck.

**Evolution Theory**

• Mutations add to genetic variation.
• Natural selection controls the frequency of certain traits and alleles.
• Genetic drift.

**Mutations**

AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGA
AGAGCAGTCCACAGGTATAGCCTACATGAGATCGACATGAGA

Estimated probability of a mutation in a single generation is 10^-8.

**Other ‘mutations’: recombination**

Figure labels: Copy 1, Copy 2, child chromosome.

Probability r_i (~10^-8) for recombination in position i.

**Natural Selection**

• Example: being lactose tolerant is advantageous in northern Europe, hence there is positive selection in the LCT gene, which leads to different allele frequencies in LCT.

**Genetic Drift**

• Even without selection, the allele frequencies in the population are not fixed across time.
• Consider the case where we assume Hardy-Weinberg Equilibrium (HWE), that is, individuals are mating randomly in the population.
• Suppose that at the first generation the allele frequencies are p_0 (of a) and q_0 = 1 - p_0 (of A).
• Under HWE, E[p_{k+1}] = p_k, but Var[p_{k+1}] > 0, so the next generation will typically have p_{k+1} ≠ p_k.

**The rate of the drift**

• N: effective population size (if all individuals are entirely unrelated, then N is the total population size).
• Under an assumption of constant population size, if X_k counts the number of occurrences of a at generation k, then X_{k+1} ~ B(N, p_k).
• E[p_{k+1}] = E[X_{k+1}]/N = p_k.
• Var[p_{k+1}] = p_k(1-p_k)/N.
• The effect of genetic drift depends on the time and the effective population size.
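The binomial update in the bullets above is easy to simulate. The sketch below is an illustration, not the lecture's code: it draws X_{k+1} ~ B(N, p_k) by summing N Bernoulli trials, checks the one-generation variance p_k(1-p_k)/N empirically, and then compares how far the frequency wanders for two values of N. The population sizes, generation count, replicate counts, seed, and function names are hypothetical.

```python
import random
import statistics


def next_frequency(rng, n, p):
    """One generation of drift: draw X ~ B(n, p) and return X / n."""
    count = sum(rng.random() < p for _ in range(n))
    return count / n


def drift(rng, n, p0, generations):
    """Allele frequency after the given number of generations of pure drift."""
    p = p0
    for _ in range(generations):
        p = next_frequency(rng, n, p)
    return p


rng = random.Random(1)  # fixed seed, for reproducibility only
p0 = 0.5

# One generation with N = 100: compare the sample variance with p0(1 - p0)/N.
one_step = [next_frequency(rng, 100, p0) for _ in range(5000)]
empirical_var = statistics.pvariance(one_step)
expected_var = p0 * (1 - p0) / 100
print("empirical:", empirical_var, "expected:", expected_var)

# Twenty generations: spread of the final frequency for two population sizes.
final_small = [drift(rng, 20, p0, 20) for _ in range(200)]
final_large = [drift(rng, 500, p0, 20) for _ in range(200)]
spread_small = statistics.pstdev(final_small)
spread_large = statistics.pstdev(final_large)
print("spread with N=20:", spread_small, "spread with N=500:", spread_large)
```

Once the frequency reaches 0 or 1 in this model it stays there, because every later draw is made with that same probability.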
Small population increases the effect.

**Bottleneck effect**

Figure: effective population size plotted against time. During the bottleneck, genetic drift’s rate is higher.

**The Wright-Fisher Model**

• Generation 1: allele frequency 1/9
• Generation 2: allele frequency 1/9
• Generation 3: allele frequency 1/9
• Generation 4: allele frequency 1/3

**Ancestral population**

Diagram labels: Ancestral population, migration, genetic drift, different allele frequencies.

**Population Substructure**

• Imagine that all the cases are collected from Africa, and all the controls are from Europe.
• Many association signals are going to be found.
• The vast majority of them are false. What can we do about it?

**Principal Component Analysis**

• Dimensionality reduction.
• Based on linear algebra (Singular Value Decomposition).
• Intuition: find the ‘most important’ features of the data by projecting the data on the axis with the largest variance.
• Plotting the data on a one-dimensional line for which the spread is maximized.
• In our case, we want to look at two dimensions at a time.
• The original data has many dimensions: each SNP corresponds to one dimension.
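As a rough illustration of the projection idea, the sketch below finds the largest-variance axis of a small simulated genotype matrix and projects every individual onto it. Several things here are assumptions rather than the lecture's method: the slides mention Singular Value Decomposition, whereas this code uses plain power iteration on X^T X so that it runs with the standard library alone; genotypes are coded as 0, 1, or 2 copies of one allele; and the two populations, their allele frequencies, the sample sizes, and all function names are hypothetical.

```python
import math
import random


def center_columns(rows):
    """Subtract each column's mean so that every SNP has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def project(rows, axis):
    """Coordinate of each row along the given axis."""
    return [sum(x * w for x, w in zip(row, axis)) for row in rows]


def first_principal_axis(centered, rng, iterations=100):
    """Power iteration on X^T X: unit vector along the largest-variance axis."""
    dims = len(centered[0])
    v = [rng.gauss(0, 1) for _ in range(dims)]  # random starting direction
    for _ in range(iterations):
        scores = project(centered, v)                         # X v
        v = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(dims)]                            # X^T (X v)
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


def sample_genotypes(rng, n_people, n_snps, freq):
    """Genotypes coded 0, 1, or 2: copies of one allele at each SNP."""
    return [[(rng.random() < freq) + (rng.random() < freq) for _ in range(n_snps)]
            for _ in range(n_people)]


rng = random.Random(2)                       # fixed seed, for reproducibility only
pop_a = sample_genotypes(rng, 20, 50, 0.2)   # hypothetical population A
pop_b = sample_genotypes(rng, 20, 50, 0.8)   # hypothetical population B

centered = center_columns(pop_a + pop_b)     # rows: individuals, columns: SNPs
axis = first_principal_axis(centered, rng)
scores = project(centered, axis)
scores_a, scores_b = scores[:20], scores[20:]
print("population A:", min(scores_a), max(scores_a))
print("population B:", min(scores_b), max(scores_b))
```

With frequencies this far apart, the two populations fall on opposite sides of the first axis. A second axis for a two-dimensional plot could be obtained by removing the first component from the data and repeating the iteration.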