
Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology

Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology. Ion Mandoiu University of Connecticut. Outline. HMM model of haplotype diversity Applications Phasing Error detection Imputation Genotype calling from low-coverage sequencing data Conclusions.


Presentation Transcript


  1. Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology Ion Mandoiu University of Connecticut

  2. Outline • HMM model of haplotype diversity • Applications • Phasing • Error detection • Imputation • Genotype calling from low-coverage sequencing data • Conclusions

  3. Single Nucleotide Polymorphisms • Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs) • High density in the human genome: ≈ 1 × 10^7 SNPs out of a total of 3 × 10^9 base pairs • Example sequences (upper case marks highlighted positions): … ataggtccCtatttcgcgcCgtatacacgggActata … … ataggtccGtatttcgcgcCgtatacacgggTctata … … ataggtccCtatttcgcgcCgtatacacgggTctata …

  4. Haplotypes and Genotypes • Diploids: two homologous copies of each autosomal chromosome • One inherited from mother and one from father • Haplotype: description of SNP alleles on a chromosome • 0/1 vector: 0 for major allele, 1 for minor • Genotype: description of alleles on both chromosomes • 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles • Two haplotypes per individual genotype, e.g. 011100110 + 001000010 = 021200210
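
The 0/1/2 encoding above can be illustrated with a minimal Python sketch. The function name `genotype_from_haplotypes` is an assumption made for illustration; the input strings are the example from the slide.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string.

    Encoding from the slide: 0 or 1 when both chromosomes carry the same
    allele, 2 when the two chromosomes carry different alleles.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    genotype = []
    for a, b in zip(h1, h2):
        if a not in "01" or b not in "01":
            raise ValueError("haplotypes must be 0/1 strings")
        genotype.append(a if a == b else "2")
    return "".join(genotype)


print(genotype_from_haplotypes("011100110", "001000010"))  # prints 021200210
```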

  5. Sources of Haplotype Diversity: Mutation The International HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437, 1299-1320. 2005.

  6. Sources of Haplotype Diversity: Recombination

  7. Haplotype Structure in Human Populations

  8. HMM Model of Haplotype Frequencies • [Figure: HMM with hidden founder states F1, F2, …, Fn emitting observed alleles H1, H2, …, Hn] • Fi = founder haplotype at locus i, Hi = observed allele at locus i • P(Fi), P(Fi | Fi-1) and P(Hi | Fi) estimated from reference genotype or haplotype data • For a given haplotype h, P(H=h|M) can be computed in O(nK^2) time using the forward algorithm, where n is the number of SNPs and K the number of founder haplotypes • Similar models proposed in [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06]
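
As a rough illustration of the forward computation of P(H=h|M), here is a minimal pure-Python sketch. The parameter layout (`init`, `trans`, `emit`) and the toy numbers are assumptions for illustration, not the authors' implementation; a practical version would also rescale or work in log space to avoid underflow for large n.

```python
def haplotype_probability(h, init, trans, emit):
    """Forward algorithm for P(H = h | M) in O(n K^2) time.

    h          : list of observed alleles (0 or 1), one per locus
    init[k]    : P(F1 = k)
    trans[i][j][k] : P(founder k at locus i+1 | founder j at locus i), 0-indexed loci
    emit[i][k][a]  : P(allele a at locus i | founder k at locus i)
    """
    K = len(init)
    # alpha[k] = P(h[0..i], F_i = k)
    alpha = [init[k] * emit[0][k][h[0]] for k in range(K)]
    for i in range(1, len(h)):
        alpha = [
            sum(alpha[j] * trans[i - 1][j][k] for j in range(K)) * emit[i][k][h[i]]
            for k in range(K)
        ]
    return sum(alpha)


# Toy model with K = 2 founders and n = 3 loci (hypothetical numbers).
init = [0.6, 0.4]
trans = [[[0.9, 0.1], [0.2, 0.8]]] * 2
emit = [[[0.95, 0.05], [0.1, 0.9]]] * 3
print(haplotype_probability([0, 0, 1], init, trans, emit))
```

The nested sum over founders j for each founder k at each of the n loci is where the O(nK^2) bound comes from.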

  9. Outline • HMM model of haplotype diversity • Applications • Phasing • Error detection • Imputation • Genotype calling from low-coverage sequencing data • Conclusions

  10. Genotype Phasing • Given genotype g: 0010212, which haplotype pair explains it: (h1: 0010111, h2: 0010010) or (h3: 0010011, h4: 0010110)?

  11. Maximum Likelihood Genotype Phasing • [Figure: two HMM chains, F1 … Fn emitting H1 … Hn and F'1 … F'n emitting H'1 … H'n, jointly determining genotypes G1 … Gn] • Maximum likelihood genotype phasing: given g, find (h1,h2) = argmax_{h1+h2=g} P(h1|M) P(h2|M)
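
The objective above can be made concrete with a brute-force sketch that enumerates every haplotype pair compatible with g and keeps the pair maximizing P(h1|M) P(h2|M). The function names and the toy frequency table standing in for P(h|M) are hypothetical. The enumeration is exponential in the number of heterozygous sites, so it is only usable on tiny examples.

```python
from itertools import product


def compatible_pairs(g):
    """All unordered haplotype pairs (h1, h2) with h1 + h2 = g."""
    het = [i for i, x in enumerate(g) if x == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i] = b                          # allele on the first chromosome
            h2[i] = "1" if b == "0" else "0"   # complementary allele on the second
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


def ml_phasing(g, hap_prob):
    """Pair maximizing hap_prob(h1) * hap_prob(h2); hap_prob plays the role of P(h|M)."""
    return max(compatible_pairs(g), key=lambda p: hap_prob(p[0]) * hap_prob(p[1]))


# Hypothetical haplotype probabilities standing in for the HMM.
toy = {"0010111": 0.3, "0010010": 0.3, "0010011": 0.1, "0010110": 0.1}
print(compatible_pairs("0010212"))
print(ml_phasing("0010212", lambda h: toy.get(h, 0.0)))
```

For g = 0010212 the enumeration yields exactly the two candidate pairs from the genotype phasing slide.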

  12. Computational Complexity • [KMP08] Cannot approximate max_{h1+h2=g} P(h1|M) P(h2|M) within a factor of O(n^(1/2-ε)), unless ZPP=NP • [Rastas et al.] give Viterbi and random sampling based heuristics that yield phasing accuracy comparable to the best existing methods (PHASE)

  13. Outline • HMM model of haplotype diversity • Applications • Phasing • Error detection • Imputation • Genotype calling from low-coverage sequencing data • Conclusions

  14. Genotyping Errors • A real problem despite advances in technology & typing algorithms • 1.1% of 20 million dbSNP genotypes typed multiple times are inconsistent [Zaitlen et al. 2005] • Systematic errors (e.g., assay failure) typically detected by departure from HWE [Hosking et al. 2004] • In pedigrees, some errors detected as Mendelian Inconsistencies (MIs) • Many errors remain undetected • As much as 70% of errors are Mendelian consistent for mother/father/child trios [Gordon et al. 1999]

  15. Likelihood Sensitivity Approach to Error Detection in Trios • Mother: genotype 0 1 2 1 0 2, phased as h1 = 0 1 1 1 0 0 and h2 = 0 1 0 1 0 1 • Father: genotype 0 2 2 1 0 2, phased as h3 = 0 0 0 1 0 1 and h4 = 0 1 1 1 0 0 • Child: genotype 0 2 2 1 0 2, phased as h1 = 0 1 1 1 0 0 and h3 = 0 0 0 1 0 1 • Likelihood of best phasing for original trio T

  16. Likelihood Sensitivity Approach to Error Detection in Trios • Mother: genotype 0 1 2 1 0 2, phased as h'1 = 0 1 0 1 0 1 and h'2 = 0 1 1 1 0 0 • Father: genotype 0 2 2 1 0 2, phased as h'3 = 0 0 0 1 0 0 and h'4 = 0 1 1 1 0 1 • Child (original genotype 0 2 2 1 0 2): phased as h'1 = 0 1 0 1 0 1 and h'3 = 0 0 0 1 0 0 • Likelihood of best phasing for modified trio T' vs. likelihood of best phasing for original trio T?

  17. Likelihood Sensitivity Approach to Error Detection in Trios • Is a given genotype in the trio (Mother 0 1 2 1 0 2, Father 0 2 2 1 0 2, Child 0 2 2 1 0 2) an error? • Large change in likelihood suggests likely error • Flag genotype as an error if L(T')/L(T) > R, where R is the detection threshold (e.g., R=10^4)
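
The flagging rule L(T')/L(T) > R can be sketched as follows. This is a minimal sketch under stated assumptions: the trio is a dict of genotype strings, T' is formed by replacing one genotype entry with each alternative value, and `likelihood` is any caller-supplied function. The `toy_likelihood` below is a hypothetical stand-in, not the HMM-based trio likelihood from the talk.

```python
def flag_errors(trio, likelihood, threshold=1e4):
    """Return (member, locus) pairs whose best single-entry change raises
    the likelihood by more than `threshold`, i.e. L(T')/L(T) > R."""
    base = likelihood(trio)
    flagged = []
    for member, g in trio.items():
        for i, current in enumerate(g):
            best = 0.0
            for alt in "012":
                if alt == current:
                    continue
                modified = dict(trio)
                modified[member] = g[:i] + alt + g[i + 1:]
                best = max(best, likelihood(modified))
            # Compare as best > R * base to avoid dividing by a zero likelihood.
            if best > threshold * base:
                flagged.append((member, i))
    return flagged


def toy_likelihood(trio, common="0110"):
    """Hypothetical stand-in: each mismatch to a 'common' genotype costs 10^-5."""
    mismatches = sum(a != b for g in trio.values() for a, b in zip(g, common))
    return 1e-5 ** mismatches


trio = {"mother": "0110", "father": "0110", "child": "0100"}
print(flag_errors(trio, toy_likelihood))  # prints [('child', 2)]
```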

  18. Alternate Likelihood Functions • [KMP08] Cannot approximate L(T) within O(n^(1/4-ε)), unless ZPP=NP • Efficiently Computable Likelihood Functions • Viterbi probability • Probability of Viterbi Haplotypes • Total Trio Probability

  19. Comparison with FAMHAP (Children)

  20. Comparison with FAMHAP (Parents)

  21. Outline • HMM model of haplotype diversity • Applications • Phasing • Error detection • Imputation • Genotype calling from low-coverage sequencing data • Conclusions

  22. Genome-Wide Association Studies • Powerful method for finding genes associated with complex human diseases • Large number of markers (SNPs) typed in cases and controls • Disease causal SNPs unlikely to be typed directly • Significant statistical power gained by performing imputation of untyped HapMap genotypes [WTCCC’07]

  23. HMM Based Genotype Imputation • Train HMM using the haplotypes from the related HapMap population or a small cohort typed at high density • Compute the probability of each missing genotype given the typed genotype data → gi is imputed as [equation not preserved in transcript]
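
The imputation formula itself is not preserved in this transcript. The sketch below assumes one common rule: call the genotype value with the highest posterior probability, and leave the genotype missing when that probability falls below a confidence threshold. The function name, the threshold parameter and the example posteriors are all hypothetical.

```python
def impute_genotype(posterior, threshold=0.0):
    """Pick the most probable genotype value, or None if it is not confident enough.

    posterior : dict mapping genotype value (0, 1 or 2) to its probability
                given the typed genotype data
    threshold : minimum posterior probability required to make a call
    """
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= threshold else None


posterior = {0: 0.70, 1: 0.10, 2: 0.20}
print(impute_genotype(posterior, threshold=0.6))  # prints 0
print(impute_genotype(posterior, threshold=0.8))  # prints None
```

Under this assumption, raising the threshold trades a higher missing data rate for more accurate calls.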

  24. Experimental Results • Estimates of the allele 0 frequency based on Imputation vs. Illumina 15k

  25. Experimental Results • Accuracy and missing data rate for imputed genotypes at different thresholds

  26. Outline • HMM model of haplotype diversity • Applications • Phasing • Error detection • Imputation • Genotype calling from low-coverage sequencing data • Conclusions

  27. Ultra-High Throughput Sequencing • New massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to Sanger sequencing • Roche / 454 Genome Sequencer FLX: 100 Mb/run, 400bp reads • Illumina / Solexa Genome Analyzer 1G: 1000 Mb/run, 35bp reads • Applied Biosystems SOLiD: 3000 Mb/run, 25-35bp reads

  28. Probabilistic Model • [Figure: two HMM chains, F1 … Fn emitting H1 … Hn and F'1 … F'n emitting H'1 … H'n, jointly determining genotypes G1 … Gn; each genotype Gi generates the reads Ri,1 … Ri,ci covering locus i]

  29. Model Training • Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father • P(gi|hi,h’i) set to 1 if hi+h’i=gi and to 0 otherwise • Conditional probabilities for sets of reads are defined in terms of the probability that read r has an error at locus i [equations not preserved in transcript]
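
Since the read-likelihood equations are missing from the transcript, the following sketch uses a simple, commonly used error model as an assumption rather than the authors' exact formula: a read matching a homozygous genotype has probability 1 - ε, a mismatching read has probability ε, and under a heterozygous genotype each read is equally likely to show either allele. Error probabilities are derived from Phred base quality scores via ε = 10^(-Q/10). Function names are hypothetical.

```python
def phred_to_error(q):
    """Convert a Phred base quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def read_set_likelihood(reads, genotype):
    """P(reads at one locus | genotype) under the assumed error model.

    reads    : list of (allele, error_prob) with allele in {0, 1}
    genotype : 0 or 1 for homozygous major/minor, 2 for heterozygous
    """
    p = 1.0
    for allele, eps in reads:
        if genotype == 2:
            p *= 0.5            # 0.5 * (1 - eps) + 0.5 * eps
        elif allele == genotype:
            p *= 1 - eps        # read agrees with the homozygous genotype
        else:
            p *= eps            # read must be a sequencing error
    return p


reads = [(0, phred_to_error(20)), (0, phred_to_error(20)), (1, phred_to_error(10))]
for g in (0, 1, 2):
    print(g, read_set_likelihood(reads, g))
```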

  30. Multilocus Genotyping Problem • GIVEN: • Shotgun read sets r=(r1, r2, … , rn) • Base quality scores • HMMs for populations of origin for mother/father • FIND: • Multilocus genotype g*=(g*1,g*2,…,g*n) with maximum posterior probability, i.e., g*=argmax_g P(g | r)

  31. Posterior Decoding Algorithm • For each i = 1..n, compute g*i = argmax_{gi} P(gi | r) = argmax_{gi} P(gi, r) • Return g* = (g*1, g*2, …, g*n) • Joint probabilities can be computed using a forward-backward algorithm • Direct implementation gives O(m+nK^4) time, where • m = number of reads • n = number of SNPs • K = number of founder haplotypes in HMMs • Runtime reduced to O(m+nK^3) using speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]
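
Posterior decoding with forward-backward can be sketched on a generic HMM. This is a simplified illustration, not the authors' algorithm: it decodes hidden states rather than genotypes, uses one transition matrix for all loci, and does no rescaling against underflow. All names and numbers are assumptions. With S hidden states the cost is O(n S^2); in the two-chain model the states are pairs of founders, so S = K^2, which is where the O(nK^4) term of a direct implementation comes from.

```python
def posterior_decode(init, trans, like):
    """Most probable hidden state at each position, by forward-backward.

    init[s]     : prior probability of state s at the first position
    trans[s][t] : transition probability from state s to state t
    like[i][s]  : probability of the data observed at position i given state s
    """
    n, S = len(like), len(init)
    fwd = [[0.0] * S for _ in range(n)]
    bwd = [[1.0] * S for _ in range(n)]
    for s in range(S):
        fwd[0][s] = init[s] * like[0][s]
    for i in range(1, n):
        for t in range(S):
            fwd[i][t] = like[i][t] * sum(fwd[i - 1][s] * trans[s][t] for s in range(S))
    for i in range(n - 2, -1, -1):
        for s in range(S):
            bwd[i][s] = sum(trans[s][t] * like[i + 1][t] * bwd[i + 1][t] for t in range(S))
    # fwd[i][s] * bwd[i][s] is the joint probability of all data and state s at position i.
    return [max(range(S), key=lambda s: fwd[i][s] * bwd[i][s]) for i in range(n)]


init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
like = [[0.9, 0.1], [0.6, 0.4], [0.1, 0.9]]
print(posterior_decode(init, trans, like))  # prints [0, 0, 1]
```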

  32. Genotyping Accuracy on Watson Reads

  33. Outline • HMM model of haplotype diversity • Applications • Phasing • Error detection • Imputation • Genotype calling from low-coverage sequencing data • Conclusions

  34. Conclusions • HMM model of haplotype diversity provides a powerful framework for addressing central problems in population genetics & genetic epidemiology • Enables significant improvements in accuracy by exploiting the high amount of linkage disequilibrium in human populations • Despite hardness results, heuristics such as posterior or Viterbi decoding perform well in practice • Highly scalable runtime (linear in #SNPs and #individuals/reads) • Software available at http://www.engr.uconn.edu/~ion/SOFT/

  35. Acknowledgements • Sanjiv Dinakar, Jorge Duitama, Yözen Hernández, Justin Kennedy, Bogdan Pasaniuc • NSF funding (awards IIS-0546457 and DBI-0543365)
