
Yun S. Song, Yufeng Wu and Dan Gusfield University of California, Davis

Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event. Yun S. Song, Yufeng Wu and Dan Gusfield, University of California, Davis. Haplotyping Problem. Diploid organisms have two (not necessarily identical) copies of each chromosome.


Presentation Transcript


  1. Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event. Yun S. Song, Yufeng Wu and Dan Gusfield, University of California, Davis. WABI 2005

  2. Haplotyping Problem • Diploid organisms have two (not necessarily identical) copies of each chromosome. • A single copy is a haplotype, a vector of Single Nucleotide Polymorphisms (SNPs). • SNP: a site where two types of nucleotides occur frequently, encoded as 0 or 1. • The mixed description of the two copies is a genotype, a vector over 0, 1, 2. • If both haplotypes are 0, the genotype is 0. • If both haplotypes are 1, the genotype is 1. • If one is 0 and the other is 1, the genotype is 2.

  3. Haplotypes and Genotypes • Haplotype Inference (HI) Problem: given a set of n genotypes, infer n haplotype pairs that form the given genotypes.
     Sites:                          1 2 3 4 5 6 7 8 9
     Two haplotypes per individual:  0 1 1 1 0 0 1 1 0
                                     1 1 0 1 0 0 1 0 0
     Merge the haplotypes.
     Genotype for the individual:    2 1 2 1 0 0 1 2 0
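The merge rule from slide 2, applied to the example on slide 3, can be sketched as follows. The function names `merge_haplotypes` and `count_resolutions` are illustrative and not from the talk. The second function counts how many unordered haplotype pairs are consistent with one genotype, which is why HI needs a genetic model to choose among them.

```python
from typing import List, Sequence


def merge_haplotypes(h1: Sequence[int], h2: Sequence[int]) -> List[int]:
    """Return the genotype (values 0, 1, 2) formed by two 0/1 haplotypes."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    # Equal alleles keep their value; differing alleles are recorded as 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


def count_resolutions(genotype: Sequence[int]) -> int:
    """Number of distinct unordered haplotype pairs that merge to this genotype."""
    k = sum(1 for g in genotype if g == 2)  # heterozygous sites
    # Each 2-site can be phased two ways; swapping the pair halves the count.
    return 1 if k == 0 else 2 ** (k - 1)


# The two haplotypes shown on the slide.
h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = merge_haplotypes(h1, h2)
print(genotype)                     # matches the genotype row on the slide
print(count_resolutions(genotype))  # candidate pairs for this one genotype
```

With three heterozygous sites, this single genotype already admits four candidate haplotype pairs.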

  4. Perfect Phylogeny Haplotyping (PPH) • Finding the original haplotypes in nature is hopeless without a genetic model to guide the choice of solution. • Gusfield (2002) introduced the PPH problem. • PPH is to find HI solutions that fit into a perfect phylogeny. • Nice results for PPH, including a linear-time algorithm.

  5. The Perfect Phylogeny Model for Haplotypes • Assume at most 1 mutation at each site. • Sites: 12345. • Ancestral sequence: 00000. • Site mutations on edges. • Extant sequences at the leaves. • The tree derives the set M: 10100, 10000, 01011, 01010, 00010. [Figure: tree rooted at 00000 with one edge per site mutation (sites 1 to 5) and the sequences of M at the leaves.]
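Whether a set of 0/1 sequences fits this model with the all-zero ancestral sequence can be checked with a standard pairwise test: for every two sites, the sets of sequences carrying a 1 must be disjoint or nested. The sketch below is an illustration of that test, not the authors' code, and the function name is an assumption. It accepts the set M from the slide and rejects the four sequences 000, 010, 101, 110, which cannot be derived with one mutation per site.

```python
from itertools import combinations
from typing import Sequence


def fits_rooted_perfect_phylogeny(rows: Sequence[str]) -> bool:
    """True if the 0/1 rows fit a perfect phylogeny rooted at the all-zero sequence."""
    if not rows:
        return True
    n_sites = len(rows[0])
    if any(len(r) != n_sites for r in rows):
        raise ValueError("all sequences must have the same length")
    # carriers[j] = indices of the rows that have state 1 at site j
    carriers = [{i for i, r in enumerate(rows) if r[j] == "1"} for j in range(n_sites)]
    for a, b in combinations(carriers, 2):
        common = a & b
        # Overlapping but not nested: the two sites conflict.
        if common and common != a and common != b:
            return False
    return True


M = ["10100", "10000", "01011", "01010", "00010"]
print(fits_rooted_perfect_phylogeny(M))
print(fits_rooted_perfect_phylogeny(["000", "010", "101", "110"]))
```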

  6. PPH Example [Figure panels: Genotypes, Inferred Haplotypes, Perfect Phylogeny.]

  7. Imperfect Phylogeny Haplotyping (IPPH): Extending PPH • Often, real biological data does not have a PPH solution. • Eskin et al. (2003) found that deleting a small part of the data may lead to a PPH solution (heuristic). • Our approach: IPPH with an explicit genetic model that allows a small amount of • Homoplasy, i.e. back or recurrent mutation • Recombination • Goal: extend the usage of PPH • Real data may be a small perturbation of data with a PPH solution • Haplotype block: low recombination or homoplasy

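  8. Back/Recurrent Mutation for Haplotypes • More than one mutation at a site. • Data: 000, 010, 101, 110. [Figure: tree rooted at 000 with edges labeled by the mutating site (2, 1, 1, 3); node sequences shown are 010, 100, 110, 101, 000.]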

  9. Recombinations: Single Crossover • Recombination is one of the principal genetic forces shaping genetic variation. • Two equal-length sequences generate a third sequence of the same length.
     Parent contributing the prefix: 110001111111001
     Parent contributing the suffix: 000110000001111
     Recombinant:                    11000 | 0000001111   (| marks the breakpoint)
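A single crossover is a prefix of one parent joined to the suffix of the other at a breakpoint. The sketch below reproduces the slide's example. The function name and the convention that `cut` counts the sites taken from the first parent are assumptions made for illustration.

```python
def single_crossover(prefix_parent: str, suffix_parent: str, cut: int) -> str:
    """Join the first `cut` sites of one parent to the remaining sites of the other."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= cut <= len(prefix_parent):
        raise ValueError("breakpoint must lie within the sequence")
    # The recombinant has the same length as both parents.
    return prefix_parent[:cut] + suffix_parent[cut:]


child = single_crossover("110001111111001", "000110000001111", cut=5)
print(child)  # prefix 11000 followed by suffix 0000001111
```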

  10. IPPH (Imperfect Phylogeny Haplotyping) Problems • Small deviation from PPH • H1-IPPH problem • Find a tree that allows exactly one site to mutate twice • The remaining sites mutate at most once • Derive haplotypes for the given genotypes • R1-IPPH problem • Find a network that has exactly one recombination event • Each site mutates at most once • Derive haplotypes for the given genotypes

  11. Number of Minimum Recombinations for Haplotypes [Figure: frequency of the minimum number of recombinations for small rho (scaled recombination rate); 20 sequences, 30 sites, 500 simulations.]

  12. Haplotyping with One Homoplasy • More than one mutation at a site. [Figure: homoplasy tree rooted at 000 with edge labels 2, 1, 1, 3, nodes 010 and 100, and leaves a1, a2, b1, b2; panels: Haplotype, Genotype.]

  13. Algorithm for H1-IPPH • For each site s in the input genotype data M • Test whether M-{s} has PPH solutions • If not, move to next site. • Otherwise, check whether 1 homoplasy at site s can lead to HI solutions • If yes, stop and report result • Assume only one PPH solution for M-{s} • But how to find solutions with 1 homoplasy at s efficiently?
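The outer loop on this slide can be sketched as below. This is only a skeleton under stated assumptions: `solve_pph` and `extend_with_homoplasy` are hypothetical callables supplied by the caller, standing in for a PPH solver and for the one-homoplasy step that the following slides develop. Neither is implemented here, and the names are not from the talk.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

Matrix = List[List[int]]  # rows = genotypes (values 0, 1, 2), columns = sites
P = TypeVar("P")          # whatever the PPH solver returns
R = TypeVar("R")          # whatever the homoplasy step returns


def drop_site(m: Matrix, s: int) -> Matrix:
    """Return M - {s}: the genotype matrix with column s removed."""
    return [row[:s] + row[s + 1:] for row in m]


def h1_ipph(
    m: Matrix,
    solve_pph: Callable[[Matrix], Optional[P]],                      # hypothetical
    extend_with_homoplasy: Callable[[Matrix, int, P], Optional[R]],  # hypothetical
) -> Optional[Tuple[int, R]]:
    """Try each site s as the single homoplasy site; return (s, result) or None."""
    n_sites = len(m[0]) if m else 0
    for s in range(n_sites):
        pph = solve_pph(drop_site(m, s))
        if pph is None:
            continue  # M - {s} has no PPH solution: move to the next site
        result = extend_with_homoplasy(m, s, pph)
        if result is not None:
            return s, result  # stop and report
    return None
```

Passing the two steps as parameters keeps the loop independent of how PPH is solved. The slide's assumption of a single PPH solution for M-{s} corresponds to `solve_pph` returning one solution rather than a list.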

  14. Example: Site i3 [Figure: M split into M-{i3} and {i3}.]

  15. Combine Mh-{i3} with h{i3} • Assume Mh-{i3} is fixed. • Haplotypes for the same genotype must pair up. • Two ways to pair: r2, r2’ with s2, s2’. [Figure: PPH on M-{i3} gives Mh-{i3}; {i3} gives h{i3}.]

  16. • 4 ways to try pairing i3. • Exponential number in general, even for one PPH solution. • Need a polynomial-time method to avoid trying all the pairings. [Figure: Mh-{i3} and h{i3} combined into candidates Mh1, Mh2, ?]

  17. Move to Trees • Convert the perfect phylogeny tree from the PPH solution to an unrooted tree. [Figure: Mh-{i3}, h{i3}.]

  18. 1 Homoplasy: from T to Tr, Ts • Recurrent mutation at site s in tree T. • Deleting s induces tree Tr. • s induces a split Ts. [Figure: tree T with two edges labeled s and subtrees O1, L1, L2, O2; tree Tr; split Ts with L1, L2 and O1, O2.]

  19. From Tr, Ts to T • Find two subtrees Ts1, Ts2 in Tr such that Ts1, Ts2 correspond to one side of Ts. [Figure: tree Tr with L1 and L - L1; split Ts with sides L and O; tree T with two edges labeled s and subtrees O1, L1, L - L1, O2.]

  20. 1. Pick one side of the partition from Ts 2. Pick the leaves of Tr corresponding to the chosen side 3. Check whether the selected leaves fit into two subtrees

  21. • s2 can pair with r2’. • May need to refine a non-binary vertex before picking a subtree.

  22. Solution

  23. Algorithms and Results • Efficient graph-coloring based method to select two subtrees (skipped) • Implemented in C++ • Simulations on data generated with the program ms • Compared to PHASE (a haplotyping program) • Accuracy: comparable • Speed: at least 10x faster • 100x100 data: about 3 seconds • Can identify the homoplasy site with high accuracy: >95% in simulation

  24. Algorithm for R1-IPPH • Split M into ML and MR by cutting between two sites.

  25. PPH Solutions • Build a perfect phylogeny for each of the two partitions.

  26. 1-SPR operation • SPR: subtree-prune-regraft operation. • The 1-recombination condition is equivalent to distance-SPR(TL,TR) = 1.

  27. Algorithm for R1-IPPH • The brute-force 1-SPR idea leads to exponential time when TL or TR is not binary. • Trickier than H1-IPPH, but with care, R1-IPPH can be solved in polynomial time. (not in paper)

  28. Conclusions • Contributions • Assuming bounded number of PPH solutions • Polynomial time algorithm for H1-IPPH problem • Polynomial time algorithm for R1-IPPH problem • Possible extension to more than 1 homoplasy event. • Open problems • Haplotyping with more than 1 recombination efficiently. • Remove assumption that number of PPH solutions for M-{s} is bounded.

  29. Thank you • Questions?
