
Combinatorial Algorithms for Maximum Likelihood Tag SNP Selection and Haplotype Inference

Combinatorial Algorithms for Maximum Likelihood Tag SNP Selection and Haplotype Inference. Ion Mandoiu University of Connecticut CS&E Department. Outline. Biological background Maximum likelihood tag SNP selection Maximum likelihood population haplotyping Ongoing and future work.


Presentation Transcript


  1. Combinatorial Algorithms for Maximum Likelihood Tag SNP Selection and Haplotype Inference Ion Mandoiu University of Connecticut CS&E Department

  2. Outline • Biological background • Maximum likelihood tag SNP selection • Maximum likelihood population haplotyping • Ongoing and future work

  3. Genomic Variation and SNPs • Human Genome ≈ 3 × 10^9 base pairs • Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs) • Single base changes in the genome sequence that occur in a significant proportion (more than 1 percent) of the population • Most SNPs are bi-allelic • Total #SNPs ≈ 1 × 10^7 • Difference b/w any two individuals ≈ 3 × 10^6 SNPs (≈ 0.1% of entire genome)

  4. Haplotypes and Genotypes • Diploid organisms: cells have two homologous sets of chromosomes • Haplotype: description of SNP alleles on a chromosome • 0/1 vector, e.g., 00110101 (0 is for major, 1 is for minor allele) • Genotype: combined description of SNP alleles on pairs of homologous chromosomes • 0/1/2 vector, e.g., 01122110 (0=0+0, 1=1+1, 2=0+1 or 1+0) • Each genotype with k 2’s can be explained by 2^(k-1) pairs of haplotypes • Example: haplotypes 11011001 + 00010101 give genotype 22012201
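The 0/1/2 encoding and the 2^(k-1) count on this slide can be checked with a short Python sketch. The function names `genotype` and `explaining_pairs` are illustrative and do not come from the talk.

```python
from itertools import product

def genotype(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype (2 = heterozygous site)."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))

def explaining_pairs(g):
    """Enumerate all unordered haplotype pairs that explain genotype g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    pairs = set()
    # Each heterozygous site can be phased two ways; fixing the phase of
    # one haplotype determines the other.
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.add(frozenset(("".join(h1), "".join(h2))))
    return pairs

print(genotype("11011001", "00010101"))
print(len(explaining_pairs("22012201")))
```

Each of the 2^k phase assignments is counted twice (once per ordering of the pair), which is where the 2^(k-1) figure for k ≥ 1 comes from.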

  5. Computational Challenges • Limitations of current technologies: • High cost per (user selected) SNP → Tag SNP selection problem • Find genotypes, not haplotypes → Haplotype inference problem • Effective solutions require combining accurate probabilistic models with scalable combinatorial optimization techniques!

  6. Outline • Biological background • Maximum likelihood tag SNP selection • Maximum likelihood population haplotyping • Ongoing and future work

  7. Two-Stage Sampling Methodology • Pilot Study • All SNPs of interest are genotyped in a small sample of the population • Common haplotypes are inferred using statistical methods • A set of tag SNPs is selected • Population Study • Tag SNPs are genotyped in remaining population • Statistical methods are used to infer haplotypes over the tag SNPs • Haplotypes over the tag SNPs are extrapolated to full haplotypes

  8. Flow 1: Haplotype-Extrapolation • Pilot Study: Population Sample → Genotypes (all SNPs) → Phasing → Sample haplotypes (with frequencies) → Tag SNP Selection → Tag SNP Set • Population Study: Remaining Population → Genotypes (tag SNPs) → Phasing → Haplotype pairs (tag SNPs) → Extrapolation → Haplotype pairs (all SNPs)

  9. Flow 2: Genotype-Extrapolation • Pilot Study: Population Sample → Genotypes (all SNPs) → Phasing → Sample haplotypes (with frequencies) → Tag SNP Selection → Tag SNP Set • Population Study: Remaining Population → Genotypes (tag SNPs) → Extrapolation → Genotypes (all SNPs) → Phasing → Haplotype pairs (all SNPs)

  10. Previous Works on Tag SNP Selection • Statistical correlation based methods • Poor control over the number of tag SNPs • [Bafna et al. 03] Informative SNP Set Problem • Find set of k SNPs with maximum “informativeness” • [Sebastiani et al. 03] Best Enumeration of SNP Tags (BEST) • Finds minimum number of SNPs that distinguishes all given haplotypes • No control over the number of tag SNPs!

  11. Fully Informative Tag SNP Set Selection by Integer Programming • Given: haplotypes h1, h2, …, hm over n SNPs • Find: minimum number of tag SNPs • Such that: every two distinct haplotypes differ in at least one tag SNP • Integer Program Formulation • 0/1 variable xj for every SNP • xj = 1 if SNP j is selected as a tag SNP • xj = 0 otherwise • Can be solved efficiently using general purpose solvers such as CPLEX • In practice significantly faster than BEST
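The transcript does not preserve the constraints of the integer program. One natural reading of the "such that" condition is a covering constraint: for every pair of distinct haplotypes, the xj summed over the SNPs where the two differ must be at least 1. The Python sketch below solves the same problem by exhaustive search instead of an IP solver, so it only suits small instances; `min_tag_snps` and `distinguishes` are hypothetical names.

```python
from itertools import combinations

def distinguishes(haps, tags):
    """True if every two distinct haplotypes differ in at least one tag SNP."""
    projected = {tuple(h[j] for j in tags) for h in haps}
    return len(projected) == len(haps)

def min_tag_snps(haplotypes):
    """Smallest set of SNP indices separating all distinct haplotypes.

    Brute force over subsets in order of increasing size; this stands in
    for the integer program, it is not the solver-based method of the talk.
    """
    haps = sorted(set(haplotypes))
    n = len(haps[0])
    for k in range(n + 1):
        for tags in combinations(range(n), k):
            if distinguishes(haps, tags):
                return list(tags)
    return None  # not reached: all n SNPs always separate distinct haplotypes

print(min_tag_snps(["0011", "0101", "1001", "1111"]))
```

Four haplotypes need at least two binary tag SNPs, and here SNPs 0 and 1 already give four distinct projections.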

  12. Extrapolation Approaches • [Halperin et al. 05] • Each SNP genotype predicted individually • Only immediate neighbor tag SNPs used in prediction • [He&Zelikovsky 06] • Each SNP genotype predicted individually • All tag SNPs used in prediction • Maximum likelihood • Pick the most likely full genotype compatible with short genotype over tag SNPs • Full genotype predicted in a single step

  13. Tag Selection for Maximum Likelihood Genotype Extrapolation • Idea: Select K tag SNPs maximizing correct prediction probability [Figure: haplotypes h1, h2, h3, …, hn with Tag SNP 1 and Tag SNP 2]
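The idea can be sketched in Python under two assumptions that are not stated in the transcript: genotype probabilities follow from haplotype frequencies by random mating, and the best K tags are found by exhaustive search. For a tag set, the predictor returns the most likely full genotype compatible with the observed tag genotype, so the probability of a correct prediction is the sum, over tag genotypes, of the largest compatible full-genotype probability. All function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def genotype(h1, h2):
    """0/1/2 genotype of a haplotype pair (2 = heterozygous site)."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))

def genotype_probs(hap_freq):
    """Genotype distribution under random mating (assumption of this sketch)."""
    probs = defaultdict(float)
    for h1, p1 in hap_freq.items():
        for h2, p2 in hap_freq.items():
            probs[genotype(h1, h2)] += p1 * p2
    return dict(probs)

def correct_prediction_prob(geno_prob, tags):
    """Sum over tag genotypes of the most likely compatible full genotype."""
    best = defaultdict(float)
    for g, p in geno_prob.items():
        key = tuple(g[j] for j in tags)
        best[key] = max(best[key], p)
    return sum(best.values())

def select_tags(hap_freq, k):
    """Exhaustively pick the k tag SNPs maximizing correct prediction probability."""
    geno_prob = genotype_probs(hap_freq)
    n = len(next(iter(hap_freq)))
    return max(combinations(range(n), k),
               key=lambda tags: correct_prediction_prob(geno_prob, tags))

def extrapolate(geno_prob, tags, tag_genotype):
    """Most likely full genotype compatible with the genotype over the tags."""
    compatible = {g: p for g, p in geno_prob.items()
                  if "".join(g[j] for j in tags) == tag_genotype}
    return max(compatible, key=compatible.get)

# Toy frequencies (made up): SNPs 0 and 1 are perfectly correlated.
freq = {"000": 0.4, "110": 0.4, "001": 0.1, "111": 0.1}
tags = select_tags(freq, 2)
print(tags, extrapolate(genotype_probs(freq), tags, "20"))
```

In the toy data one of the two correlated SNPs is redundant, so a tag set containing SNP 2 and either of the other two predicts every genotype correctly.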

  14. Tag Selection for Maximum Likelihood Genotype Extrapolation

  15. Experimental Setup • Synthetic datasets generated following [Forton et al. 05] • - 2 populations (European and West African) + 2 genomic regions (IL8 and 5q31) • - For each of the 4 population/region combinations, we used haplotypes and frequencies inferred in [Forton et al. 05] from the real data to generate 5 datasets containing between 200 and 1000 individuals • - Fixed block size of 10 SNPs • - For each dataset we picked 5 random samples of size 50 • Maximum likelihood (ML) flows 1 and 2 were compared to the Multivariate Linear Regression (MLR) algorithm of [He&Zelikovsky 06] • Genotype frequencies were estimated either from the haplotype frequencies used to generate the datasets (pop) or from haplotype frequencies inferred from the sample using PHASE (phase)

  16. Haplotype Accuracy

  17. Genotype Accuracy

  18. Outline • Biological background • Maximum likelihood tag SNP selection • Maximum likelihood population haplotyping • Ongoing and future work

  19. Population Haplotyping Problem Given the set G of genotypes observed in a population of individuals, infer a set H of haplotypes explaining G • Numerous approaches: entropy minimization, perfect phylogeny, Bayesian networks, pure parsimony, … • Maximum likelihood approach: • Estimate for each haplotype h its probability ph in the population under study • Find set H that explains G and has maximum likelihood

  20. Graph Theoretical Reformulation • Haplotypes → graph vertices • - Weight of vertex h = -log(ph) • Genotypes → edge colors • Edge (h, h’) with color g iff g can be explained by haplotypes h and h’ • Minimum Weight Multi-Colored Subgraph Problem (MWMCSP): Find min-weight set of vertices that induce at least one edge of each given color [Figure: example graph on haplotype vertices 0001, 1101, 0101, 1001 with edges colored by genotypes 2201, 2101, 1201]
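A small Python sketch of the reformulation, using the haplotype and genotype labels visible in the slide's figure. The haplotype probabilities are made up for illustration, and the exhaustive search is a stand-in that only works for tiny instances. Homozygous genotypes are handled as self-loops, which is an assumption of this sketch.

```python
import math
from itertools import combinations, combinations_with_replacement

def genotype(h1, h2):
    """0/1/2 genotype of a haplotype pair; this is the color of edge (h1, h2)."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))

def min_weight_multicolored(hap_prob, genotypes):
    """Brute-force MWMCSP: cheapest vertex set inducing an edge of every color."""
    weight = {h: -math.log(p) for h, p in hap_prob.items()}
    required = set(genotypes)
    best, best_w = None, math.inf
    haps = sorted(hap_prob)
    for r in range(1, len(haps) + 1):
        for subset in combinations(haps, r):
            # Colors of all edges induced by the subset, self-loops included.
            colors = {genotype(a, b)
                      for a, b in combinations_with_replacement(subset, 2)}
            if required <= colors:
                w = sum(weight[h] for h in subset)
                if w < best_w:
                    best, best_w = set(subset), w
    return best, best_w

# Probabilities below are illustrative, not taken from the talk.
probs = {"0001": 0.1, "1101": 0.4, "0101": 0.3, "1001": 0.2}
print(min_weight_multicolored(probs, ["2201", "2101", "1201"]))
```

Minimizing the sum of -log(ph) over the selected vertices is the same as maximizing the product of the selected haplotype probabilities, which is how the likelihood objective becomes a vertex-weight objective.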

  21. Approximation Algorithms • [Lancia et al. 02] • Algorithms with approximation factors of […] (for unweighted version) and q, where n is the number of genotypes and q is the maximum number of haplotype pairs compatible with a genotype • [Huang et al. 05] • O(log n) approximation using semidefinite programming, but big O constant hides factor of q • [Hassin&Segev 05] • Greedy algorithm with approximation factor of […] • [Hajiaghayi et al. 06] • LP-rounding algorithm with approximation factor of […]

  22. Integer Program Formulation • Extends formulation of [Gusfield 03] • 0/1 variable xu for every vertex u • xu is set to 1 if u is selected, 0 otherwise • 0/1 variable ye for every edge e • ye is set to 1 if e is induced by selected vertices, 0 otherwise
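The objective and constraints of the program are not preserved in the transcript. A formulation consistent with the variable definitions above and with the MWMCSP statement might look as follows, where w_u = -log(p_u) is the vertex weight, E is the edge set, and E_g denotes the edges of color g (this notation is an assumption, not the slide's):

```latex
\begin{align*}
\text{minimize}\quad   & \sum_{u} w_u \, x_u \\
\text{subject to}\quad & y_e \le x_u, \quad y_e \le x_v
                         && \text{for every edge } e = (u, v) \in E \\
                       & \sum_{e \in E_g} y_e \ge 1
                         && \text{for every genotype (color) } g \\
                       & x_u,\, y_e \in \{0, 1\}
\end{align*}
```

The first pair of constraints lets an edge count only when both of its endpoints are selected, and the covering constraint requires at least one induced edge per genotype.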

  23. Outline • Biological background • Maximum likelihood tag SNP selection • Maximum likelihood population haplotyping • Ongoing and future work

  24. Haplotype Frequency Estimation • Accurate haplotype frequency estimation becomes key to overall accuracy of likelihood maximization methods • Important to capture frequencies of haplotypes that may not appear in the sample: phasing and counting gives poor estimates • Existing high-quality algorithms, e.g., Haplofreq [Halperin&Hazan 05], do not scale well in runtime

  25. HMM-Based Frequency Estimation • Hidden Markov Models (HMMs) are uniquely suited for modeling haplotype frequencies in a population • Recently used very successfully in haplotype inference [Rastas et al. 05], disease association [Kimmel&Shamir 05] • Main computational bottleneck: HMM training based on genotype data

  26. HMM-Based Frequency Estimation • Good compromise in context of two-stage experiments • Sample consisting of trios (child, mother, and father) • Sample phased using fast trio-aware phasing method (e.g., entropy phasing [Pasaniuc&M 06]) • HMM trained on resulting (highly accurate) haplotypes • Haplotype frequencies computed efficiently using k-shortest paths algorithm
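To illustrate how an HMM assigns a frequency to every haplotype, including ones absent from the sample, here is a Python sketch of the forward algorithm on a toy model. The model structure (K hidden states per SNP, one transition matrix per adjacent SNP pair, per-state probability of emitting allele 1) and all numbers are assumptions made for illustration. The slide computes frequencies with a k-shortest paths algorithm; this sketch does not implement that and enumerates haplotypes directly.

```python
from itertools import product

def haplotype_prob(h, init, trans, emit):
    """Forward algorithm: probability that the HMM generates haplotype h.

    init[k]        probability of starting in state k at SNP 0
    trans[j][i][k] probability of moving from state i at SNP j to state k at SNP j+1
    emit[j][k]     probability that state k emits allele 1 at SNP j
    """
    def e(j, k):
        return emit[j][k] if h[j] == "1" else 1.0 - emit[j][k]

    K = len(init)
    alpha = [init[k] * e(0, k) for k in range(K)]
    for j in range(1, len(h)):
        alpha = [sum(alpha[i] * trans[j - 1][i][k] for i in range(K)) * e(j, k)
                 for k in range(K)]
    return sum(alpha)

# Toy 3-SNP model with two hidden states (all values are illustrative).
init = [0.6, 0.4]
trans = [[[0.9, 0.1], [0.2, 0.8]]] * 2
emit = [[0.05, 0.95]] * 3

freqs = {"".join(b): haplotype_prob("".join(b), init, trans, emit)
         for b in product("01", repeat=3)}
for h, p in sorted(freqs.items(), key=lambda kv: -kv[1])[:3]:
    print(h, round(p, 4))
```

Because every 0/1 string gets a nonzero probability, the model addresses the point on the earlier slide that phasing and counting gives poor estimates for haplotypes missing from the sample.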

  27. Other Problems • Identification of genotyping errors by likelihood maximization [Becker et al. 06] • Pedigree reconstruction and kinship analysis • Population structure • Bicriteria tag SNP selection: likelihood maximization and genotyping cost optimization

  28. Acknowledgments • J. Jun, B. Pasaniuc (UCONN) • M.T. Hajiaghayi (CMU), K. Jain (Microsoft Research), L.C. Lau (U. Toronto), A. Russell (UCONN), V.V. Vazirani (Georgia Tech) • Funding from NSF (CAREER Award IIS-0546457) and UCONN Research Foundation
