
CS 394C September 22, 2009

CS 394C, September 22, 2009. Divide-and-conquer methods for phylogeny estimation. Reconstructing the Tree of Life. Challenges: millions of species; lots of missing data. Two possible approaches: Combined Analysis and Supertree Methods.


Presentation Transcript


  1. CS 394C, September 22, 2009: Divide-and-conquer methods for phylogeny estimation

  2. Reconstructing the Tree of Life Challenges: - millions of species - lots of missing data Two possible approaches: - Combined Analysis - Supertree Methods

  3. Two competing approaches: Combined Analysis versus a Supertree Method [Figure: a data matrix of Species by gene 1, gene 2, ..., gene k; the Supertree Method branch is labeled "Analyze separately"]

  4. Many Supertree Methods: MRP, weighted MRP, Min-Cut, Modified Min-Cut, Semi-strict Supertree, MRF, MRD, QILI, SDM, Q-imputation, PhySIC, Majority-Rule Supertrees, Maximum Likelihood Supertrees, and many more. MRP (Matrix Representation with Parsimony) is the most commonly used.
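As a rough illustration of the "matrix representation" half of MRP, the Python sketch below encodes source trees as a character matrix. It covers only the encoding step; the parsimony search that would then be run on the matrix is not shown. The input format (each tree given as its leaf set plus one side of each internal edge) and the name `mrp_matrix` are assumptions made for this example, not taken from the slides.

```python
def mrp_matrix(source_trees):
    """Encode source trees as an MRP-style character matrix.

    source_trees: list of (taxa, clades) pairs (hypothetical input format).
      taxa   - set of leaf names in that source tree
      clades - list of sets, each holding the leaves on one side of an
               internal edge of that tree
    Returns a dict mapping every taxon to a string over '0', '1', '?'.
    """
    all_taxa = sorted(set().union(*(taxa for taxa, _ in source_trees)))
    rows = {t: [] for t in all_taxa}
    for taxa, clades in source_trees:
        for clade in clades:            # one matrix column per internal edge
            for t in all_taxa:
                if t not in taxa:
                    rows[t].append("?")  # taxon absent from this source tree
                elif t in clade:
                    rows[t].append("1")  # taxon on the chosen side of the edge
                else:
                    rows[t].append("0")  # taxon on the other side
    return {t: "".join(r) for t, r in rows.items()}


# Two quartet trees on overlapping taxon sets: ab|cd and cd|ef.
tree1 = ({"a", "b", "c", "d"}, [{"a", "b"}])
tree2 = ({"c", "d", "e", "f"}, [{"c", "d"}])
matrix = mrp_matrix([tree1, tree2])
for taxon, row in matrix.items():
    print(taxon, row)
```

Taxa missing from a source tree receive `?` in that tree's columns, which is how the encoding accommodates source trees on different taxon subsets.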

  5. Superfine: a new supertree method Input: unrooted source trees Output: unrooted tree on the entire set of taxa Not yet submitted for publication

  6. Algorithmic strategy • First, construct a supertree with low false positives using the Strict Consensus Merger (Huson et al., 1999) • Then, refine the tree to reduce false negatives by resolving each polytomy using Quartet Max Cut (Snir and Rao, 2008)


  9. Strict Consensus Merger [Figure: worked example merging source trees on overlapping subsets of the taxa a through j]

  10. Performance of SCM • Low false positive (FP) rate (Estimated supertree has few false edges) • High false negative (FN) rate (Estimated supertree is missing many true edges)
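The FP and FN rates above can be computed by comparing the internal edges (bipartitions) of the estimated tree with those of the true tree. The sketch below is a minimal Python version under stated assumptions: each tree is supplied as a collection of bipartitions, FN is normalized by the number of true internal edges, and FP by the number of estimated internal edges. The helper names `split` and `fp_fn_rates` are hypothetical.

```python
def split(side, taxa):
    """Represent a bipartition as an unordered pair of frozensets."""
    side = frozenset(side)
    return frozenset({side, frozenset(taxa) - side})


def fp_fn_rates(true_splits, est_splits):
    """Return (FP rate, FN rate) for an estimated tree.

    FP: fraction of estimated internal edges that are not in the true tree.
    FN: fraction of true internal edges that are missing from the estimate.
    """
    true_splits, est_splits = set(true_splits), set(est_splits)
    fp = len(est_splits - true_splits) / len(est_splits) if est_splits else 0.0
    fn = len(true_splits - est_splits) / len(true_splits) if true_splits else 0.0
    return fp, fn


taxa = set("abcdef")
# True tree: a fully resolved tree with three internal edges.
true_tree = [split("ab", taxa), split("abc", taxa), split("abcd", taxa)]
# Estimated tree: poorly resolved, with only one internal edge.
estimate = [split("ab", taxa)]

fp, fn = fp_fn_rates(true_tree, estimate)
print(f"FP rate = {fp:.2f}, FN rate = {fn:.2f}")
```

In this toy case the unresolved estimate has no false edges but misses two of the three true edges, the same pattern the slide reports for SCM.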

  11. [Plot: False Negative Rate against Scaffold Density (%)]

  12. Uses of Supertree Methods What are these good for? • Combining trees on subsets of taxa • What about using them to do divide-and-conquer analyses? (Improving the accuracy or scalability of phylogeny estimation methods?)

  13. Phylogenetic reconstruction methods 1. Polynomial time distance-based methods: Neighbor Joining, FastME, Weighbor, etc. 2. Hill-climbing heuristics for NP-hard optimization criteria (Maximum Parsimony and Maximum Likelihood) 3. Bayesian methods [Figure: Cost over the space of phylogenetic trees, showing a local optimum and the global optimum]
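For Maximum Parsimony, the cost that a hill-climbing heuristic tries to minimize can be evaluated on any single candidate tree in linear time with Fitch's algorithm; the hard part is the search over trees. The Python sketch below scores a fixed tree only. The nested-tuple tree format and the function names are assumptions for illustration, and the tree is rooted arbitrarily, which does not change the parsimony score.

```python
def fitch_site_score(tree, states):
    """Minimum number of changes for one site on a rooted binary tree.

    tree:   a leaf name (str) or a pair (left, right) of subtrees
    states: dict mapping each leaf name to its character at this site
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:                       # children agree: no extra change
            return common, left_cost + right_cost
        # children disagree: one change is needed on this node's edges
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def parsimony_score(tree, sequences):
    """Sum the Fitch score over all sites of an alignment (dict leaf -> str)."""
    length = len(next(iter(sequences.values())))
    return sum(
        fitch_site_score(tree, {leaf: seq[i] for leaf, seq in sequences.items()})
        for i in range(length)
    )


seqs = {"a": "AC", "b": "AC", "c": "GC", "d": "GT"}
tree_ab_cd = (("a", "b"), ("c", "d"))
tree_ac_bd = (("a", "c"), ("b", "d"))
print(parsimony_score(tree_ab_cd, seqs), parsimony_score(tree_ac_bd, seqs))
```

A hill-climbing search would repeatedly modify the tree, rescore it this way, and keep improvements, which is why it can stall at a local optimum.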

  14. “Boosting” MP heuristics • We use “Disk-covering methods” (DCMs) to improve heuristic searches for MP and ML [Diagram: base method M combined with a DCM gives DCM-M]

  15. Rec-I-DCM3 significantly improves performance (Roshan et al.) [Plot: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset; curves labeled "Current best techniques" and "DCM boosted version of best techniques"]

  16. DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001] • Theorem: DCM1-NJ converges to the true tree from polynomial length sequences [Plot: Error Rate (0 to 0.8) against No. Taxa (0 to 1600) for NJ and DCM1-NJ]

  17. How do these methods work? • Basic technique: • Divide into overlapping subsets (typically using a chordal graph) • Construct trees on each subset (using base method) • Combine trees into a supertree (using the Strict Consensus Merger)
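The three steps above form a pipeline with interchangeable parts, which the Python sketch below tries to make explicit. It is only a skeleton under loud assumptions: `dcm_boost`, `overlapping_windows`, and the parameter names are hypothetical, and the sliding-window decomposition is a simple stand-in, not the chordal-graph decomposition that DCMs use. The base method and the merger are passed in as functions and are not implemented here.

```python
def overlapping_windows(taxa, size, overlap):
    """Stand-in decomposition: consecutive windows that share `overlap` taxa.

    This is NOT the chordal-graph decomposition used by DCMs; it only
    produces overlapping subsets so that the pipeline can be exercised.
    """
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    step = size - overlap
    subsets, start = [], 0
    while True:
        subsets.append(taxa[start:start + size])
        if start + size >= len(taxa):
            break
        start += step
    return subsets


def dcm_boost(taxa, decompose, base_method, merge):
    """Divide, solve each subproblem with the base method, then merge."""
    subsets = decompose(taxa)                      # step 1: overlapping subsets
    subtrees = [base_method(s) for s in subsets]   # step 2: tree per subset
    return merge(subtrees)                         # step 3: supertree


taxa = list("abcdefgh")
subsets = overlapping_windows(taxa, size=4, overlap=2)
print(subsets)
```

Because the merger is a parameter, swapping one supertree method for another changes a single argument of `dcm_boost`.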


  19. Possible Research Projects • What about fooling around with this design? • Instead of the Strict Consensus Merger, shall we try Superfine? • To start with, let’s examine the first DCM methods (DCM-NJ and DCM-MP/ML)

  20. Other problems • Simultaneous estimation of multiple sequence alignments and trees • Phylogeny estimation from whole genomes (specifically, gene order and content phylogeny estimation) • Metagenomic phylogenetic placement


  22. Indels and substitutions at the DNA level [Figure: the sequence …ACGGTGCAGTTACCA… with a deletion and a mutation marked]


  24. Indels and substitutions at the DNA level: a deletion and a mutation turn …ACGGTGCAGTTACCA… into …ACCAGTCACCA…

  25. After the deletion and the mutation, …ACGGTGCAGTTACCA… has become …ACCAGTCACCA…. The true pairwise alignment is: …ACGGTGCAGTTACCA… over …AC----CAGTCACCA… The true multiple alignment on a set of homologous sequences is obtained by tracing their evolutionary history, and extending the pairwise alignments on the edges to a multiple alignment on the leaf sequences.
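The pairwise alignment on this slide can be read column by column: a gap marks the deletion, and a column with two different letters marks the mutation. The short Python sketch below classifies the columns of the slide's alignment; the function name `classify_columns` is hypothetical.

```python
def classify_columns(row1, row2, gap="-"):
    """Count match, substitution, and indel columns of a pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "substitution": 0, "indel": 0}
    for x, y in zip(row1, row2):
        if x == gap or y == gap:
            counts["indel"] += 1          # a gap in either row
        elif x == y:
            counts["match"] += 1
        else:
            counts["substitution"] += 1   # two different letters
    return counts


top = "ACGGTGCAGTTACCA"
bottom = "AC----CAGTCACCA"
counts = classify_columns(top, bottom)
print(counts)
# Removing the gaps recovers the unaligned descendant sequence.
print(bottom.replace("-", ""))
```

On the slide's example this gives four indel columns (the deleted GGTG) and one substitution column (T changed to C).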

  26. [Figure: a tree with leaves U, V, W, X, Y and the sequences AGTGGAT, TATGCCCA, TATGACTT, AGCCCTA, AGCCCGCTT]

  27. SATé Algorithm (Liu et al., Warnow-Linder lab; Science 2009). SATé = Simultaneous Alignment and Tree Estimation. 1. Obtain an initial alignment and an estimated ML tree T. 2. Use the current tree T to compute a new alignment A. 3. Estimate an ML tree T on the new alignment A, then return to step 2.
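The alternation between alignment and tree estimation can be written as a short loop. The Python sketch below shows only that control flow and is not the SATé implementation: the aligner, the ML tree estimator, and the scoring function are passed in as hypothetical callables, and both the fixed number of rounds and the rule of keeping the best-scoring pair are assumptions of this sketch.

```python
def iterate_alignment_and_tree(sequences, initial_align, estimate_tree,
                               realign, score, rounds=3):
    """Alternate between re-aligning on the current tree and re-estimating it.

    All four callables are placeholders supplied by the caller:
      initial_align(sequences)      -> alignment
      estimate_tree(alignment)      -> tree
      realign(sequences, tree)      -> alignment guided by the tree
      score(alignment, tree)        -> number, higher is better
    Returns the best (score, alignment, tree) seen.
    """
    alignment = initial_align(sequences)
    tree = estimate_tree(alignment)
    best = (score(alignment, tree), alignment, tree)
    for _ in range(rounds):
        alignment = realign(sequences, tree)   # new alignment from the tree
        tree = estimate_tree(alignment)        # new tree from the alignment
        current = score(alignment, tree)
        if current > best[0]:
            best = (current, alignment, tree)
    return best
```

Keeping the best pair rather than the last one means that a round which makes things worse does not degrade the returned result.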

  28. SATé Results on 1000 taxon datasets • 24 hour SATé analysis • Other simultaneous estimation methods cannot run on large datasets

