
SNP and Haplotype Analysis Algorithms and Applications

Eran Halperin, International Computer Science Institute, Berkeley, California.


Presentation Transcript


  1. SNP and Haplotype Analysis Algorithms and Applications. Eran Halperin, International Computer Science Institute, Berkeley, California. CPM 2006

  2. “Computational Genetics”

  3. The Human Genome Project. “What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.” “But our work previously has shown… that having one genetic code is important, but it's not all that useful.” (referring to comparative genomics). “I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…” Washington, DC, June 26, 2000.

  4. Individually Tailored Medicine. People react to different drugs in different ways. The vision: a simple DNA test would help to determine which medicine to prescribe.

  5. The International HapMap Project • An international consortium that aims to genotype the genomes of 270 individuals from four different populations. • Launched in 2002. The first phase was finished in October (Nature, 2005).

  6. Motivation. Complex disease: Genetic Factors (50%) and Environmental Factors (50%). Multiple genes may affect the disease. Therefore, the effect of every single gene may be negligible.

  7. Disease Association Studies: The search for genetic factors • Comparing the DNA contents of two populations: • Cases - individuals carrying the disease. • Controls - background population. A significant discrepancy between the two populations is evidence of a causal gene.

  8. Where should we look? SNP = Single Nucleotide Polymorphism. Usually SNPs are bi-allelic (only two letters appear). [Figure: aligned sequences with the associated SNP marked.]
     Cases:
     AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
     AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
     AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
     AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
     AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
     AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
     AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
     AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
     Controls:
     AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
     AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
     AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
     AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
     AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
     AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
     AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
     AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
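The case/control comparison on slides 7 and 8 can be illustrated with a standard 2x2 allele-count chi-square statistic. This is a minimal sketch and not necessarily the test used in the talk; the function name `allele_chi_square` is hypothetical. The example counts are what one site in the slide 8 sequences appears to show (GATC[G/T]GTAG: 7 G and 1 T among the cases, 1 G and 7 T among the controls).

```python
def allele_chi_square(case_a, case_b, control_a, control_b):
    """Pearson chi-square statistic (no continuity correction) for a 2x2
    table of allele counts: rows = cases/controls, columns = allele a/b."""
    n = case_a + case_b + control_a + control_b
    row_cases = case_a + case_b
    row_controls = control_a + control_b
    col_a = case_a + control_a
    col_b = case_b + control_b
    denom = row_cases * row_controls * col_a * col_b
    if denom == 0:
        return 0.0  # an empty margin means the site carries no signal
    return n * (case_a * control_b - case_b * control_a) ** 2 / denom


# Counts assumed from the slide 8 example site (G vs. T).
print(allele_chi_square(7, 1, 1, 7))  # 9.0
```

A larger statistic means a larger discrepancy between the two populations at that site; with identical allele counts in cases and controls the statistic is 0.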

  9. Genotyping Technology • Extracting the allele information for a SNP from a DNA sample. • Considerable reductions in genotyping cost over the last couple of years. • Current cost allows for the genotyping of 500,000 SNPs for ~$1000 (compared to ~50 cents per SNP 3-4 years ago).

  10. Computational Challenges

  11. Haplotypes • SNPs in physical proximity are correlated. • A sequence of alleles along a chromosome is called a haplotype.

  12. Haplotype Block Structure (Daly et al., 2001). Block 6 from Chromosome 5q31.

  13. Haplotypes as Proxies for Rare SNPs. Common haplotypes: • 011000111 (23% of population) • 000001111 (55% of population) • 111111111 (14% of population). Figure labels at the tag SNPs: 000, 001, 111.
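To see why a few tag SNPs can stand in for a whole haplotype, the sketch below searches for a smallest set of sites whose values already tell the three common haplotypes of slide 13 apart. It is an illustration only: the function name `smallest_distinguishing_sites` is hypothetical, the search is brute force, and the set it finds need not be the one drawn on the slide.

```python
from itertools import combinations


def smallest_distinguishing_sites(haplotypes):
    """Return the first smallest tuple of 0-based site indices such that
    the haplotypes, restricted to those sites, are pairwise distinct."""
    n_sites = len(haplotypes[0])
    n_distinct = len(set(haplotypes))
    for size in range(n_sites + 1):
        for sites in combinations(range(n_sites), size):
            projected = {tuple(h[i] for i in sites) for h in haplotypes}
            if len(projected) == n_distinct:
                return sites
    return None


common = ["011000111", "000001111", "111111111"]
print(smallest_distinguishing_sites(common))  # (0, 1)
```

Once such sites are typed, the remaining sites of a common haplotype can be looked up rather than genotyped.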

  14. Tag SNP Selection • Input: a set of genotypes. • Goal: find a set of t tag SNPs such that, using these SNPs only, the error rate for the prediction of all other SNPs is minimized. Formulation by [H., Kimmel, Shamir, ’05] (STAMPA).

  15. Correlations between SNPs. [Slide figure: aligned DNA sequences for cases and controls, with the label “Tag SNPs”.]

  16. Basic Assumption. Given two SNPs j and k, the probabilities of the values at any intermediate SNPs do not change if we know the values of additional distal ones. [Figure: SNP j, intermediate SNPs, SNP k.]

  17. STAMPA (Selection of TAg SNPs to Maximize Prediction Accuracy) 1. Put aside one test genotype. Use the rest of the data to develop a majority rule for each pair of SNPs to predict the values of the intermediate SNPs. 2. The average prediction error over all test genotypes gives a score to the pair j and k. 3. Apply dynamic programming to obtain the best set of tag SNPs. [Figure: SNP j, intermediate SNPs, SNP k, test genotype.]
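Steps 1 and 2 above can be sketched as a leave-one-out score for a single pair of SNPs. This is a simplified illustration under stated assumptions, not the STAMPA implementation: the name `pair_score` is hypothetical, genotypes are plain strings with one character per SNP, the majority vote falls back to all training genotypes when none matches the test genotype at j and k, and the dynamic program of step 3 is omitted.

```python
from collections import Counter


def pair_score(genotypes, j, k):
    """Leave-one-out error rate when the SNPs strictly between j and k
    are predicted by majority vote from the values at j and k.
    Expects at least two genotypes."""
    errors = 0
    total = 0
    for t, test in enumerate(genotypes):
        train = genotypes[:t] + genotypes[t + 1:]
        for i in range(j + 1, k):
            # Vote among training genotypes that agree with the test
            # genotype at both flanking SNPs.
            votes = Counter(g[i] for g in train
                            if g[j] == test[j] and g[k] == test[k])
            if not votes:  # assumed fallback: vote over all training data
                votes = Counter(g[i] for g in train)
            prediction = votes.most_common(1)[0][0]
            errors += prediction != test[i]
            total += 1
    return errors / total if total else 0.0


correlated = ["0000", "0000", "1111", "1111", "0000", "1111"]
print(pair_score(correlated, 0, 3))  # 0.0
```

A low score means SNPs j and k predict everything between them well, which is what makes the pair a good candidate for adjacent tag SNPs.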

  18. Comparison: STAMPA vs. ldSelect. 52 sets of Yoruba genotypes (Gabriel et al., 2002). [Plot: x markers = STAMPA; second marker = ldSelect.]

  19. The haplotype ancestral structure of two subtypes of NHL. The trees are automatically generated by HAP (H., Eskin, ’04).

  20. Phasing Haplotypes • Cost-effective genotyping technology gives genotypes and not haplotypes. Genotype (at each site the mother-chromosome and father-chromosome alleles are unordered): A {T,G} {C,A} C G {C,A}. Possible phases: ATACGA / AGCCGC; AGACGA / ATCCGC; ATCCGA / AGACGC; …
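The ambiguity on slide 20 can be made explicit by enumerating every pair of haplotypes consistent with a genotype. The sketch below is illustrative; the name `enumerate_phases` and the input encoding (one string per site, one letter for a homozygous site, two distinct letters for a heterozygous site) are assumptions made for this example.

```python
from itertools import product


def enumerate_phases(genotype):
    """Return all unordered haplotype pairs compatible with the genotype."""
    het = [i for i, site in enumerate(genotype) if len(set(site)) == 2]
    phases = []
    # Fix the orientation of the first heterozygous site so that each
    # unordered pair is produced exactly once.
    for flips in product((0, 1), repeat=max(len(het) - 1, 0)):
        choice = dict(zip(het, (0,) + flips))
        h1 = "".join(site[choice.get(i, 0)]
                     for i, site in enumerate(genotype))
        h2 = "".join(site[1 - choice[i]] if i in choice else site[0]
                     for i, site in enumerate(genotype))
        phases.append((h1, h2))
    return phases


for pair in enumerate_phases(["A", "TG", "CA", "C", "G", "CA"]):
    print(pair)
```

With h heterozygous sites there are 2^(h-1) possible phases, so the three heterozygous sites of the slide's genotype give four candidates, three of which are listed on the slide.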

  21. Public Genotype Data Growth (timeline axis: 2001 2002 2003 2004 2005 2006)
     • Daly et al., Nature Genetics: 103 SNPs, 40,000 genotypes
     • Gabriel et al., Science: 3000 SNPs, 400,000 genotypes
     • TSC Data, Nucleic Acids Research: 35,000 SNPs, 4,500,000 genotypes
     • NCBI dbSNP, Genome Research: 3,000,000 SNPs, 286,000,000 genotypes
     • Perlegen Data, Science: 1,570,000 SNPs, 100,000,000 genotypes
     • HapMap Phase 2: 5,000,000+ SNPs, 600,000,000+ genotypes
     - HAP’s speed allows it to phase whole-genome datasets. - HAP is very accurate (Marchini et al., 2006).

  22. HAP Phasing Model • A directed phylogenetic tree. • {0,1} alphabet. • Each site mutates at most once. • No recombination. • Goal: Finding a phase that fits the tree model. Formulation: [Gusfield, 2003]. Tree (each edge is labelled with the mutating site): 00000 -> 01000 (site 2); 01000 -> 11000 (site 1); 01000 -> 01001 (site 5); 11000 -> 11100 (site 3); 11100 -> 11110 (site 4).
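Whether a set of {0,1} haplotypes fits such a tree can be checked without building the tree. The sketch below uses the classical characterisation for a tree rooted at the all-zero haplotype: for every pair of sites, the sets of haplotypes carrying the mutated allele must be either disjoint or nested. The function name is hypothetical and this is a check on haplotypes only, not HAP's phasing algorithm.

```python
from itertools import combinations


def fits_directed_perfect_phylogeny(haplotypes):
    """True if the haplotypes fit a tree rooted at 00...0 in which every
    site mutates at most once and there is no recombination."""
    n_sites = len(haplotypes[0])
    # For each site, the set of haplotypes that carry allele 1.
    carriers = [frozenset(r for r, h in enumerate(haplotypes) if h[s] == "1")
                for s in range(n_sites)]
    for a, b in combinations(carriers, 2):
        if a & b and not (a <= b or b <= a):
            return False  # overlapping but not nested: no such tree
    return True


tree_nodes = ["00000", "01000", "11000", "01001", "11100", "11110"]
print(fits_directed_perfect_phylogeny(tree_nodes))  # True
```

The haplotypes 10, 01 and 11 together fail the check, because the two sites would each have to mutate once and yet co-occur on one branch only.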

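  23. Example. Tree (each edge is labelled with the mutating site): 00000 -> 01000 (site 2); 01000 -> 11000 (site 1); 01000 -> 01001 (site 5); 11000 -> 11100 (site 3); 01001 -> 01011 (site 4). Genotypes: 02022, 22200, 21222, 21200, 02000, 01022. Haplotypes: 00000, 01000, 11100, 01011. Given the tree and the haplotypes, the phase is unique.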
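The uniqueness claim on slide 23 can be checked by brute force. The sketch below assumes the usual genotype encoding, which is consistent with the slide's data: 0 and 1 are homozygous sites and 2 is a heterozygous site. The function names are hypothetical.

```python
from itertools import combinations_with_replacement


def genotype_of(h1, h2):
    """Genotype produced by a pair of {0,1} haplotypes (2 = heterozygous)."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def explanations(genotype, haplotypes):
    """All unordered haplotype pairs from the list that yield the genotype."""
    return [(h1, h2)
            for h1, h2 in combinations_with_replacement(haplotypes, 2)
            if genotype_of(h1, h2) == genotype]


haplotypes = ["00000", "01000", "11100", "01011"]
genotypes = ["02022", "22200", "21222", "21200", "02000", "01022"]
for g in genotypes:
    print(g, explanations(g, haplotypes))
```

Each of the six genotypes is explained by exactly one pair, for example 02022 by 00000 and 01011.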

  24. Phasing via Greedy • A simple heuristic: • Find a haplotype that is compatible with as many genotypes as possible. • Assign the haplotype for these genotypes. • Continue with the rest of the genotypes. • Intuition: Haplotypes with missing data.

  25. Haplotypes with missing data. Goal: Find a maximum likelihood phase.
     Input      Output
     111*11*1   11111111
     00*01*1*   00001111
     01*000*0   01000010
     11*11*11   11111111
     *111**00   11110000
     1111*11*   11111111
     01*00010   01000010
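The greedy heuristic can be tried on the missing-data example of slide 25. The sketch below is a toy version under stated assumptions: all names are hypothetical, the largest compatible group is found by brute force (only reasonable for a handful of patterns), and positions that no pattern in a group resolves are left as `*` rather than filled in as on the slide.

```python
from itertools import combinations


def compatible(p, q):
    """Two patterns agree wherever both are known ('*' = missing)."""
    return all(a == b or "*" in (a, b) for a, b in zip(p, q))


def merge(patterns):
    """Combine mutually compatible patterns into one haplotype."""
    merged = []
    for column in zip(*patterns):
        known = [c for c in column if c != "*"]
        merged.append(known[0] if known else "*")
    return "".join(merged)


def greedy_cover(patterns):
    """Repeatedly assign one haplotype to the largest compatible group."""
    remaining = list(range(len(patterns)))
    assignment = {}
    while remaining:
        best = None
        for size in range(len(remaining), 0, -1):
            for group in combinations(remaining, size):
                if all(compatible(patterns[i], patterns[j])
                       for i, j in combinations(group, 2)):
                    best = group
                    break
            if best:
                break
        haplotype = merge([patterns[i] for i in best])
        for i in best:
            assignment[i] = haplotype
        remaining = [i for i in remaining if i not in best]
    return [assignment[i] for i in range(len(patterns))]


slide_input = ["111*11*1", "00*01*1*", "01*000*0", "11*11*11",
               "*111**00", "1111*11*", "01*00010"]
for before, after in zip(slide_input, greedy_cover(slide_input)):
    print(before, "->", after)
```

On this input the sketch groups the patterns into the same four classes as the slide's output: three patterns share 11111111, two share one haplotype, and two stand alone.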

  26. Greedy Analysis (H., Karp, 2005) • Maximum likelihood == minimum entropy solution. • Entropy(Greedy) < Entropy(OPT) + 3. • Can be viewed as a variant of set cover.
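The entropy being compared is that of the empirical distribution of the assigned haplotypes. A minimal sketch, assuming entropy is measured in bits and using the hypothetical name `phasing_entropy`, evaluated on the output column of slide 25:

```python
from collections import Counter
from math import log2


def phasing_entropy(haplotypes):
    """Shannon entropy (in bits) of the haplotype frequencies."""
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


slide_output = ["11111111", "00001111", "01000010", "11111111",
                "11110000", "11111111", "01000010"]
print(round(phasing_entropy(slide_output), 3))
```

A solution that reuses few haplotypes for many individuals has low entropy, which is why minimizing entropy favours the same solutions as the greedy rule of covering as many genotypes as possible with one haplotype.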

  27. Mother, Father, Child Trios • Advantages: • Better phasing results (Marchini et al., ’06). • Population stratification (Spielman et al., ’93). • Disadvantage: • 50% more expensive (and thus, reduces power).

  28. Inferring Haplotypes From Trios. Assumption: No recombination. Haplotype pairs as they are resolved step by step (? = unresolved):
     Parent 1 (122112): 1??11? / 1??11? -> 10?11? / 11?11? -> 10011? / 11111?
     Parent 2 (210022): ?100?? / ?100?? -> 1100?? / 0100?? -> 11000? / 01001?
     Child (120222):    1?0??? / 1?0??? -> 100??? / 110??? -> 10011? / 11000?
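The per-site reasoning behind slide 28 can be sketched as follows. The encoding is assumed from the slide's data (0 and 1 are homozygous sites, 2 is heterozygous), the names `flip` and `phase_trio` are hypothetical, and the sketch assumes no recombination, no genotyping errors, and Mendelian-consistent input.

```python
def flip(allele):
    return "1" if allele == "0" else "0"


def phase_trio(parent1, parent2, child):
    """Return (transmitted1, untransmitted1, transmitted2, untransmitted2);
    the child's haplotypes are the two transmitted ones. '?' = unresolved."""
    t1, u1, t2, u2 = [], [], [], []
    for p1, p2, c in zip(parent1, parent2, child):
        a = p1 if p1 != "2" else None  # allele transmitted by parent 1
        b = p2 if p2 != "2" else None  # allele transmitted by parent 2
        if c != "2":
            a = b = c                  # homozygous child fixes both
        elif a is not None and b is None:
            b = flip(a)                # heterozygous child: the other differs
        elif b is not None and a is None:
            a = flip(b)
        for parent, sent, t_out, u_out in ((p1, a, t1, u1), (p2, b, t2, u2)):
            t_out.append(sent if sent is not None else "?")
            if parent != "2":
                u_out.append(parent)
            elif sent is not None:
                u_out.append(flip(sent))
            else:
                u_out.append("?")      # all three heterozygous
    return tuple("".join(h) for h in (t1, u1, t2, u2))


print(phase_trio("122112", "210022", "120222"))
# ('10011?', '11111?', '11000?', '01001?')
```

Only sites where both parents and the child are heterozygous remain unresolved, which is the last site in the slide's example.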

  29. Genotyping Trios via DNA Pools [Beckman, Abel, Braun, H.]. [Figure: mother (M), father (F), and child (C).]

  30. Configuration                              1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
      Mother transmitted allele                  A  A  A  A  A  A  A  A  G  G  G  G  G  G  G  G
      Mother untransmitted allele                A  A  A  A  G  G  G  G  A  A  A  A  G  G  G  G
      Father transmitted allele                  A  A  G  G  A  A  G  G  A  A  G  G  A  A  G  G
      Father untransmitted allele                A  G  A  G  A  G  A  G  A  G  A  G  A  G  A  G
      Father and Child pool – allele frequency   0  1  2  3  0  1  2  3  1  2  3  4  1  2  3  4
      Mother and Child pool – allele frequency   0  0  1  1  1  1  2  2  2  2  3  3  3  3  4  4
      • Every configuration has a different pair of values. • Except for configurations 7 and 10 (het-het-het).
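The table on slide 30 can be regenerated by counting G alleles in each pool. In this sketch an allele is coded 1 for G and 0 for A, and a pool's "allele frequency" is taken to be the number of G alleles among the four chromosomes in the pool; the function name is hypothetical.

```python
from collections import defaultdict
from itertools import product


def pool_signature(mt, mu, ft, fu):
    """G-allele counts in the (father + child, mother + child) pools.
    mt/mu: mother transmitted/untransmitted; ft/fu: the same for father."""
    child = mt + ft
    return (ft + fu + child, mt + mu + child)


signatures = defaultdict(list)
# product() varies the last allele fastest, matching the slide's ordering.
for index, alleles in enumerate(product((0, 1), repeat=4), start=1):
    signatures[pool_signature(*alleles)].append(index)

ambiguous = [configs for configs in signatures.values() if len(configs) > 1]
print(ambiguous)  # [[7, 10]]
```

The two pool measurements therefore identify the trio's configuration in every case except when mother, father, and child are all heterozygous.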

  31. Genotyping Unrelated Individuals. Edge size: pool size (accuracy). Vertex degree: amount of DNA used.

  32. An algebraic view

  33. For every $m$, what is the largest $n$ such that $m$ equations uniquely determine the $n$ variables over $\{0,1,2\}$? That is, for every $m$, what is the largest $n$ for which there exists $A \in \{0,1\}^{m \times n}$ such that for all $x, x' \in \{0,1,2\}^n$, $Ax = Ax' \Rightarrow x = x'$?
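For small instances the condition can be tested directly by enumerating all genotype vectors. The sketch below is a brute-force check with a hypothetical function name; it takes the pooling matrix as a list of 0/1 rows and is exponential in n.

```python
from itertools import product


def uniquely_determines(A):
    """True if x -> Ax is injective on {0,1,2}^n for the 0/1 matrix A."""
    n = len(A[0])
    seen = set()
    for x in product((0, 1, 2), repeat=n):
        y = tuple(sum(a * v for a, v in zip(row, x)) for row in A)
        if y in seen:
            return False  # two different x give the same pool readings
        seen.add(y)
    return True


print(uniquely_determines([[1, 0], [0, 1]]))  # True: one pool per individual
print(uniquely_determines([[1, 1]]))          # False: (0,1) and (1,0) collide
```

The interesting matrices are those with fewer rows (pools) than columns (individuals) that still pass this check.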

  34. Lower Bound • A random matrix $A$. • For every $x \in \{-2,-1,0,1,2\}^n$, $A_i x = 0$ with probability $O(k^{-0.5})$, where $k$ is the number of non-zero elements of $x$. • Since the rows are independent, the probability that $Ax = 0$ is $O(k^{-m/2})$. • Using the union bound, $n = \Omega(m \log m)$.

  35. Upper Bound • Counting argument: • There are at most $(2n)^m$ different values that $Ax$ can take. • There are $3^n$ values for $x$. • $3^n < (2n)^m$, and so $n < O(m \log m)$.
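The last step of the counting argument can be spelled out as follows. This is a sketch of the standard manipulation, using the slide's bound $(2n)^m$ as given; the final implication uses the fact that $n / \log(2n) < c\,m$ forces $n = O(m \log m)$.

```latex
\begin{align*}
3^n < (2n)^m
  &\;\Longrightarrow\; n \log 3 < m \log(2n) \\
  &\;\Longrightarrow\; \frac{n}{\log(2n)} < \frac{m}{\log 3} \\
  &\;\Longrightarrow\; n = O(m \log m).
\end{align*}
```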

  36. Further Challenges • Population stratification • In case/control studies and in family based studies. • Admixed populations. • Other pooling schemes • Practical considerations: error rates, missing data, scalability, etc. • Inferring evolutionary processes (e.g. selection, recombination rate, haplotype ancestry, etc.).

  37. Summary • Exciting times in genetics: changes in medicine may be felt in our lifetime. • An opportunity for computer scientists to have a huge impact. • Interdisciplinary work is needed. It involves computer science, statistics, genetics, biology, and medicine.

  38. Acknowledgement. UCSD: Eleazar Eskin. Tel-Aviv U.: Ron Shamir, Gad Kimmel, Noga Alon. HIIT: Matti Kaariainen. Sequenom Inc.: Andreas Braun, Ken Abel. Perlegen Sciences: David Hinds, David Cox. UC Berkeley: Richard Karp, Chris Skibola. MPI: Rene Beier. CHORI: Kenny Beckman.

  39. Thank you for your attention!
