
Algorithmic research in phylogeny reconstruction

Algorithmic research in phylogeny reconstruction. Tandy Warnow The University of Texas at Austin. Phylogeny. From the Tree of the Life Website, University of Arizona. Orangutan. Human. Gorilla. Chimpanzee. Reconstructing the “Tree” of Life. Handling large datasets: millions of species


Presentation Transcript


  1. Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin

  2. Phylogeny [Figure: tree with leaves Orangutan, Human, Gorilla, Chimpanzee. From the Tree of Life Website, University of Arizona.]

  3. Reconstructing the “Tree” of Life. Handling large datasets: millions of species. NSF funds many projects towards this goal under the Assembling the Tree of Life (ATOL) program.

  4. Current projects • Heuristics for NP-hard optimization problems for phylogeny reconstruction • “Phylogenetic” multiple sequence alignment • Detecting and reconstructing horizontal gene transfer and hybridization • Constructing phylogenies on languages. Graph theory, combinatorial optimization, and probabilistic analysis are fundamental to algorithm development in this area, but all methods are also extensively tested in simulation and on real data. Collaborations with biologists or linguists are essential.

  5. DNA Sequence Evolution [Figure: a sequence evolving down a tree over time. -3 mil yrs: AAGACTT. -2 mil yrs: AAGGCCT, TGGACTT. -1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT. Today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.]

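  6. Phylogeny Problem. Input: sequences U = AGGGCAT, V = TAGCCCA, W = TAGACTT, X = TGCACAA, Y = TGCGCTT. Output: a tree with leaves U, V, W, X, Y. [Figure: tree on leaves X, U, Y, V, W.]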
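As a small illustration of what the input to the phylogeny problem looks like in practice, the sketch below computes pairwise Hamming distances between the five sequences on this slide. This is only a starting point and not a method from the talk: distance-based methods usually apply a model-based correction to such counts before building a tree. The function name `hamming` is an illustrative choice.

```python
from itertools import combinations

# The five sequences from the "Phylogeny Problem" slide.
SEQS = {
    "U": "AGGGCAT",
    "V": "TAGCCCA",
    "W": "TAGACTT",
    "X": "TGCACAA",
    "Y": "TGCGCTT",
}

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)

# Print the uncorrected distance for every unordered pair of taxa.
for s, t in combinations(sorted(SEQS), 2):
    print(s, t, hamming(SEQS[s], SEQS[t]))
```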

  7. Solving NP-hard problems exactly is … unlikely • Number of (unrooted) binary trees on n leaves is (2n-5)!! • If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia
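The growth of (2n-5)!! can be checked directly. The sketch below multiplies the odd numbers 3, 5, ..., 2n-5 using exact integer arithmetic; the function name is an illustrative choice. For n = 1000 the count has well over 2,800 decimal digits, which is why exhaustive search is hopeless at any realistic evaluation speed.

```python
def num_unrooted_binary_trees(n: int) -> int:
    """Return (2n-5)!!, the number of unrooted binary trees on n labeled leaves."""
    if n < 3:
        raise ValueError("the formula applies for n >= 3")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

# Show how quickly the search space grows with the number of taxa.
for n in (5, 10, 20, 1000):
    trees = num_unrooted_binary_trees(n)
    print(f"n = {n}: a {len(str(trees))}-digit number of trees")
```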

  8. Approaches for “solving” hard optimization problems (like maximum parsimony) • Hill-climbing heuristics (which can get stuck in local optima) • Randomized algorithms for getting out of local optima • Approximation algorithms (give bounds on what is possible) [Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked.]
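Any hill-climbing heuristic for maximum parsimony has to score candidate trees many times over. A minimal sketch of that scoring step is given below using Fitch's algorithm, which counts the minimum number of state changes on a fixed tree. The talk does not say how its tools score trees, so this is an assumption for illustration; the nested-tuple tree encoding and the two topologies are hypothetical, and the sequences are the ones from the Phylogeny Problem slide.

```python
def fitch_site(tree, states):
    """Return (candidate_states, changes) for one site on a rooted binary tree.

    `tree` is a leaf name (str) or a 2-tuple of subtrees.
    `states` maps each leaf name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint child sets force one extra change on this subtree.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, seqs):
    """Sum the Fitch cost over all sites of equal-length sequences."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )

SEQS = {"U": "AGGGCAT", "V": "TAGCCCA", "W": "TAGACTT",
        "X": "TGCACAA", "Y": "TGCGCTT"}

# Two hypothetical candidate topologies; a local search would keep the cheaper one.
tree_a = (("U", "X"), ("Y", ("V", "W")))
tree_b = (("U", "V"), ("W", ("X", "Y")))
for name, tree in (("tree_a", tree_a), ("tree_b", tree_b)):
    print(name, parsimony_score(tree, SEQS))
```

The root position is arbitrary here: the Fitch score of an unrooted tree does not depend on where the root is placed.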

  9. Problems with current techniques for MP. Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%. [Figure: Performance of TNT with time.]

  10. Performance of NJ, a popular polynomial time method [Nakhleh et al. ISMB 2001]. Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. [Figure: NJ error rate (axis 0 to 0.8) versus number of taxa (axis 0 to 1600).]
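The error measure named on this slide, the proportion of incorrect edges, can be sketched by comparing the leaf bipartitions induced by internal edges. The code below is one plausible reading, not necessarily the exact definition used in the cited study: it reports the fraction of the inferred tree's internal edges that are absent from the true tree. The nested-tuple encoding, function names, and example trees are hypothetical.

```python
def leaf_set(tree):
    """All leaf names in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def bipartitions(tree):
    """Nontrivial leaf bipartitions, each stored as the side without a reference leaf."""
    leaves = leaf_set(tree)
    ref = min(leaves)  # fixed reference leaf makes the representation canonical
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Keep only edges with at least two leaves on each side.
        if 1 < len(below) < len(leaves) - 1:
            splits.add(below if ref not in below else leaves - below)
        return below

    visit(tree)
    return splits

def edge_error_rate(inferred, true):
    """Fraction of the inferred tree's internal edges that are not in the true tree."""
    if leaf_set(inferred) != leaf_set(true):
        raise ValueError("trees must share the same leaf set")
    inferred_splits = bipartitions(inferred)
    if not inferred_splits:
        return 0.0
    wrong = inferred_splits - bipartitions(true)
    return len(wrong) / len(inferred_splits)

true_tree = (("U", "X"), ("Y", ("V", "W")))
inferred_tree = (("U", "X"), ("V", ("Y", "W")))
print(edge_error_rate(inferred_tree, true_tree))
```

For two fully resolved (binary) trees on the same leaves, the number of missing edges equals the number of spurious ones, so this fraction is the same whichever tree is used as the denominator.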

  11. DCMs (Disk-Covering Methods) • DCMs for polynomial time methods improve topological accuracy (empirical observation), and have provable theoretical guarantees under Markov models of evolution • DCMs for hard optimization problems reduce the running time needed to achieve good levels of accuracy (empirical observation)

  12. DCMs: Divide-and-conquer for improving phylogeny reconstruction

  13. “Boosting” phylogeny reconstruction methods • DCMs “boost” the performance of phylogeny reconstruction methods. [Diagram: DCM combined with a base method M yields DCM-M.]

  14. Iterative-DCM3 [Diagram: tree T is passed through DCM3 and the base method to produce tree T’.]

  15. Rec-I-DCM3 significantly improves performance. [Figure: Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset. Curves: current best techniques; DCM boosted version of best techniques.]

  16. DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001] • DCM1-boosting makes distance-based methods more accurate • Theoretical guarantees that DCM1-NJ converges to the true tree from polynomial length sequences [Figure: error rate (axis 0 to 0.8) of NJ and DCM1-NJ versus number of taxa (axis 0 to 1600).]

  17. General comments • Everything in phylogeny (just about) is NP-hard • Graph theory, probability, and optimization are the basic tools for algorithmic advances • Algorithms are tested on both real and simulated data • Collaborations with domain experts (biologists or linguists) are essential to success. (At UT, we have wonderful biologists to work with, and all my students collaborate with them.)

  18. For more information • Send me email to make an appointment • Check my webpage for tutorials on the subject See http://www.phylo.org and http://www.cs.utexas.edu/~tandy for more info
