
Constrained Hidden Markov Models for Population-based Haplotyping

Application of Probabilistic ILP II, FP6-508861 www.aprill.org. Constrained Hidden Markov Models for Population-based Haplotyping. Niels Landwehr Joint work with Taneli Mielikäinen, Lauri Eronen, Hannu Toivonen, Heikki Mannila University of Freiburg / University of Helsinki.


Presentation Transcript


  1. Application of Probabilistic ILP II, FP6-508861 www.aprill.org Constrained Hidden Markov Models for Population-based Haplotyping Niels Landwehr Joint work with Taneli Mielikäinen, Lauri Eronen, Hannu Toivonen, Heikki Mannila University of Freiburg / University of Helsinki

  2. Outline • Population-based haplotype reconstruction • Infer haplotypes from genotypes: reconstruct hidden phase of genetic data • Important problem in biology/medicine: e.g. disease association studies • An approach using constrained HMMs • Sparse Markov chains to represent conserved haplotype fragments • HMM model that can be learned directly from genotype data • Experimental results

  3. Human Genome and SNPs • DNA sequences of six individuals, differing at three SNP (marker) positions • Individuals 1, 3, 4: ...GATATTCGTACGGATGTTTCCA... • Individuals 2, 5, 6: ...GATGTTCGTACTGATGTCTCCA...

  4. Haplotypes • Alleles at the three SNPs for individuals 1 to 6: AGT GTC AGT AGT GTC GTC

  5. Haplotypes • The same haplotypes in binary coding for individuals 1 to 6: 101 010 101 101 010 010

  6. Why Haplotypes? • Haplotypes • define our genetic individuality • contribute to risk factors of complex diseases (e.g., diabetes) • Disease Association Studies (Gene Mapping): • find genetic difference between a case and a control population • Identifying SNPs responsible for disease might help find a cure • Also useful for • Linkage disequilibrium studies: Summarize genetic variation • Understanding evolution of human populations

  7. WetLab: only genotype information (two alleles for each SNP, but chromosome origin is unknown) • The problem: Haplotypes not directly observable • Example: the paternal and maternal haplotypes 1 1 0 0 1 and 0 0 0 1 1 are observed only as the genotype {0,1} {0,1} {0} {0,1} {1}
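A minimal sketch of how phase is lost when going from a haplotype pair to a genotype. The function name `genotype` and the list encoding are illustrative assumptions; the allele values are the five-SNP example from this slide.

```python
def genotype(h1, h2):
    """Return the unordered allele pair at each SNP; the phase is discarded."""
    return [frozenset((a, b)) for a, b in zip(h1, h2)]

# Example values from the slide (which strand is paternal does not matter).
strand_a = [1, 1, 0, 0, 1]
strand_b = [0, 0, 0, 1, 1]
g = genotype(strand_a, strand_b)

# Swapping the two strands gives the same genotype: chromosome origin is unknown.
same_when_swapped = genotype(strand_b, strand_a) == g
```

Heterozygous sites appear as `{0, 1}` and homozygous sites as `{0}` or `{1}`, matching the notation used on the slides.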

  8. Population-based Haplotype Reconstruction • Given the genotypes of several individuals, infer for every individual the most likely underlying haplotype pair • Hidden data reconstruction problem using probabilistic model: exploit patterns in the haplotypes (linkage disequilibrium) • Examples for Individual 1, Individual 2, Individual 3, … (haplotype pair → genotype): • 11010101 / 00111100 → {0,1} {0,1} {0,1} {1} {0,1} {1} {0} {0,1} • 00111101 / 11111101 → {0,1} {0,1} {1} {1} {1} {1} {0} {1} • 11010101 / 00111101 → {0,1} {0,1} {0,1} {1} {0,1} {1} {0} {1}
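To make the size of the hidden search space concrete, the following sketch enumerates every haplotype pair compatible with one genotype. The name `compatible_pairs` and the set-based encoding are assumptions for illustration; the example genotype is the one used later on slide 15.

```python
from itertools import product

def compatible_pairs(genotype):
    """Enumerate all unordered haplotype pairs that produce the given genotype."""
    het = [i for i, g in enumerate(genotype) if len(g) == 2]
    # Homozygous sites are fixed; heterozygous sites get overwritten below.
    base = [min(g) for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs

example = [{0, 1}, {1}, {0, 1}, {0}]
pairs = compatible_pairs(example)
```

A genotype with k heterozygous sites (k at least 1) has 2^(k-1) unordered resolutions, so exhaustive enumeration is only feasible for short marker maps.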

  9. Haplotype Reconstruction Problem (CS Perspective) Input: A set G of genotypes Output: A set H of corresponding haplotype pairs such that

  10. Population-based Haplotype Reconstruction • Given a model M for the distribution of haplotypes, can infer most likely resolution: Hardy-Weinberg equilibrium • Need to estimate this model from available genotype data
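A hedged sketch of how Hardy-Weinberg equilibrium is usually applied in this setting. The notation (h_1, h_2 for the two haplotypes, g for the genotype, P_M for the haplotype distribution under model M) is assumed here and is not taken from the slides: the two haplotypes of an individual are treated as independent draws from M, and the most likely resolution is the compatible pair with the largest product of haplotype probabilities.

```latex
P_M(h_1, h_2) = P_M(h_1)\, P_M(h_2), \qquad
(\hat{h}_1, \hat{h}_2) = \arg\max_{(h_1, h_2)\ \text{compatible with}\ g} P_M(h_1)\, P_M(h_2)
```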

  11. Prior Work on Haplotype Reconstruction • Competitive application domain for several years: many systems developed • characterized by the statistical model and learning/reconstruction algorithms employed • Special-purpose statistical models • Approximate Coalescent (PHASE 2001,2003,2005) • Block-based (Gerbil 2004,2005) • Variable-length MC (HaploRec 2004,2006) • Founder-based (HIT 2005) • Local clusters (fastPHASE 2006)

  12. Prior Work on Haplotype Reconstruction • Special-purpose learning/reconstruction algorithms • MCMC variant • Approximate EM + partition ligation • … • Our approach: • Model haplotypes using (sparse) Markov chains • Natural extension to a Hidden Markov Model on genotypes • Directly learnable from genotype data (standard Baum-Welch)

  13. Constrained HMMs for haplotyping • Modeling haplotypes • Standard Markov chain • More general: order k Markov chain • Figure: path for haplotype 0,1,1,0
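As a rough illustration of the order-1 case, the probability of a haplotype under a Markov chain is the initial probability times one transition probability per marker step. The function name, the parameter layout and all numeric values below are made up for illustration and do not come from the presentation.

```python
from itertools import product

def haplotype_probability(h, initial, transitions):
    """P(h) under a first-order Markov chain with per-position transitions.

    initial[a] is P(first allele = a); transitions[t][a][b] is the
    probability of allele b at marker t+1 given allele a at marker t.
    """
    p = initial[h[0]]
    for t in range(1, len(h)):
        p *= transitions[t - 1][h[t - 1]][h[t]]
    return p

# Hypothetical parameters for four markers (three transition steps).
initial = [0.5, 0.5]
transitions = [[[0.9, 0.1], [0.2, 0.8]]] * 3

p_0110 = haplotype_probability([0, 1, 1, 0], initial, transitions)
# Sanity check: probabilities of all 2^4 haplotypes sum to one.
total = sum(haplotype_probability(list(h), initial, transitions)
            for h in product((0, 1), repeat=4))
```

An order-k chain would condition each allele on the previous k alleles instead of one, which is where the exponential growth in model size comes from.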

  14. Constrained HMMs for haplotyping • Modeling genotypes • Hidden phase (order of pair): Hidden Markov Model • States: pairs of states of the underlying Markov chain (state of the maternal/paternal sequence) • Output symbol: unordered pair • Path in the model: sample two haplotypes, output corresponding genotype • Have to enforce Hardy-Weinberg equilibrium • Parameter tying constraints on transition probabilities • Algorithms • Learning: standard Baum-Welch • Reconstruction of most likely haplotype pair: Viterbi
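The construction above can be sketched for the order-1 case as follows. This is an illustrative simplification, not the authors' implementation: states are ordered allele pairs, a state emits a genotype symbol only if its alleles match, and the pair transition is the product of two copies of the same single-chain transition (the parameter tying that encodes Hardy-Weinberg independence). Function names and parameter values are assumptions.

```python
from itertools import product

def viterbi_phase(genotype, initial, transitions):
    """Most likely haplotype pair for a genotype under a pair-state HMM."""
    states = list(product((0, 1), repeat=2))

    def emits(state, g):
        # Deterministic emission: the unordered pair must equal the genotype.
        return set(state) == set(g)

    best = {s: (initial[s[0]] * initial[s[1]], [s])
            for s in states if emits(s, genotype[0])}
    for t in range(1, len(genotype)):
        trans = transitions[t - 1]
        new_best = {}
        for s in states:
            if not emits(s, genotype[t]):
                continue
            # Tied parameters: both strands use the same transition table.
            candidates = [(p * trans[prev[0]][s[0]] * trans[prev[1]][s[1]], path + [s])
                          for prev, (p, path) in best.items()]
            new_best[s] = max(candidates, key=lambda c: c[0])
        best = new_best
    prob, path = max(best.values(), key=lambda c: c[0])
    return [s[0] for s in path], [s[1] for s in path], prob

# Hypothetical parameters: alleles tend to persist between adjacent markers.
initial = [0.5, 0.5]
transitions = [[[0.9, 0.1], [0.2, 0.8]]] * 2
h1, h2, prob = viterbi_phase([{0, 1}, {0, 1}, {0}], initial, transitions)
```

In this first-order sketch the competing resolutions tie whenever a homozygous site separates two heterozygous ones, because a single-marker history cannot carry phase across it. That is one way to see why longer histories are useful.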

  15. Constrained HMMs for haplotyping • Example: paths for genotype {0,1},{1},{0,1},{0}

  16. Sparse Markov Modeling (SpaMM) • Higher-order models (long history) needed: exponential size of model • However, out of the possible history blocks, only few occur in data (conserved fragments) • Idea: Sparse model, iterative structure learning algorithm to identify conserved fragments (Apriori-style) Initialize first-order-model() em-training( ) repeat regularize-and-extend( ) em-training( ) until
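The sparsity argument can be illustrated with a simplified, Apriori-style count over already-phased haplotypes. This is not the SpaMM procedure itself, which learns from genotypes with EM; the function name, the `min_count` threshold and the string encoding are assumptions. Fragments are extended by one marker only if their shorter prefix survived the previous round.

```python
from collections import Counter

def frequent_fragments(haplotypes, min_count, max_order):
    """Level-wise search for haplotype fragments seen at least min_count times.

    Returns one dict per fragment length, mapping (start, fragment) to count.
    """
    n = len(haplotypes[0])
    counts = Counter((i, h[i:i + 1]) for h in haplotypes for i in range(n))
    current = {k: c for k, c in counts.items() if c >= min_count}
    levels = [current]
    for order in range(2, max_order + 1):
        counts = Counter()
        for h in haplotypes:
            for i in range(n - order + 1):
                frag = h[i:i + order]
                # Only extend fragments whose prefix was kept in the last round.
                if (i, frag[:-1]) in current:
                    counts[(i, frag)] += 1
        current = {k: c for k, c in counts.items() if c >= min_count}
        if not current:
            break
        levels.append(current)
    return levels

# The six binary haplotypes from slide 5.
levels = frequent_fragments(["101", "010", "101", "101", "010", "010"],
                            min_count=2, max_order=3)
```

On this toy input only 2 of the 8 possible length-3 fragments are retained, which mirrors the observation that few history blocks actually occur in data.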

  17. SpaMM Model (order 1) • Iteration: extend order of model by 1, prune unlikely parts • Avoids combinatorial explosion of model size • Initial model: standard Markov chain of order 1

  18. SpaMM Model (order 2) • Iteration: extend order of model by 1, prune unlikely paths • Avoids combinatorial explosion of model size

  19. SpaMM Model (order 3) • Iteration: extend order of model by 1, prune unlikely paths • Avoids combinatorial explosion of model size

  20. SpaMM Model (order 4) • Iteration: extend order of model by 1, prune unlikely paths • Avoids combinatorial explosion of model size

  21. SpaMM Model (order 5) • Iteration: extend order of model by 1, prune unlikely paths • Avoids combinatorial explosion of model size

  22. SpaMM Model (order 6) • Iteration: extend order of model by 1, prune unlikely paths • Avoids combinatorial explosion of model size

  23. SpaMM Model (final) • Final model: Model structure encodes conserved fragments • Concise representation of all haplotypes with non-zero probability

  24. Experimental Evaluation • Real world population data • Correct haplotypes have been inferred from trios • Daly dataset: 103 SNP markers for 174 individuals • Yoruba population: 100 datasets, 500 SNP markers each, 60 individuals • Problem Setting: • Given the set of genotypes, algorithm outputs most likely haplotype pairs • Difference to real haplotype pairs is measured in switch distance (# recombinations needed to transform pairs, normalized)
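A small sketch of the switch distance as it is commonly computed. The slide does not state the normalization; dividing by the number of heterozygous sites minus one is an assumption made here, as is the function name. The inferred pair is assumed to be consistent with the same genotype as the true pair.

```python
def switch_distance(true_pair, inferred_pair):
    """Fraction of adjacent heterozygous sites where the inferred phase flips."""
    t1, t2 = true_pair
    i1, _ = inferred_pair
    het = [k for k in range(len(t1)) if t1[k] != t2[k]]
    if len(het) < 2:
        return 0.0  # Phase is trivially correct with fewer than two het sites.
    # At each heterozygous site: does inferred strand 1 follow true strand 1?
    follows = [i1[k] == t1[k] for k in het]
    switches = sum(a != b for a, b in zip(follows, follows[1:]))
    return switches / (len(het) - 1)

truth = ((0, 0, 0, 0), (1, 1, 1, 1))
d_one_switch = switch_distance(truth, ((0, 0, 1, 1), (1, 1, 0, 0)))
d_swapped = switch_distance(truth, ((1, 1, 1, 1), (0, 0, 0, 0)))
```

Reporting the pair in the opposite order costs nothing, since only changes in relative phase between neighbouring heterozygous sites are counted.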

  25. Results: Haplotype Reconstruction • Many well-engineered systems • Smart priors, averaging over several random restarts of EM, ... • SpaMM: proof-of-concept implementation, not tuned

  26. Results: Haplotype Reconstruction • PHASE most accurate, then fastPHASE, then SpaMM • however, PHASE too slow for long maps • SpaMM beats fastPHASE without averaging • overall, competitive accuracy

  27. Results: Runtime • Runtime in seconds for phasing 100 markers (log. scale) • SpaMM scales linearly in #markers • like fastPHASE, HaploRec, HIT • unlike PHASE, Gerbil

  28. Results: Genotype imputation • Most haplotyping methods can also predict missing genotype values • for SpaMM, can be read off Viterbi path

  29. Results: Genotype imputation • fastPHASE best known method • Again, SpaMM beats fastPHASE without averaging

  30. Conclusions • SpaMM: new haplotyping method • sparse Markov chains to encode conserved haplotype fragments • Constrained HMM for modeling genotypes • Apriori-style structure learning algorithm • Simple, accurate, interpretable output • Future work • Accuracy can probably be improved using standard techniques (EM random restarts, averaging, ...)

  31. Thanks!
