
SuperFine, a new supertree method

SuperFine, a new supertree method. Shel Swenson, September 17th 2009. Reconstructing the Tree of Life. Tree of Life challenges: millions of species; lots of missing data. Two possible approaches: Combined Analysis; Supertree Methods. Two competing approaches.


Presentation Transcript


  1. SuperFine, a new supertree method. Shel Swenson, September 17th 2009

  2. Reconstructing the Tree of Life. Tree of Life challenges: • millions of species • lots of missing data. Two possible approaches: • Combined Analysis • Supertree Methods

  3. Two competing approaches (figure: a species-by-gene data matrix with columns gene 1, gene 2, ..., gene k, labeled Combined Analysis)

  4. Combined Analysis Methods. gene 1: S1 TCTAATGGAA, S2 GCTAAGGGAA, S3 TCTAAGGGAA, S4 TCTAACGGAA, S7 TCTAATGGAC, S8 TATAACGGAA. gene 2: S4 GGTAACCCTC, S5 GCTAAACCTC, S6 GGTGACCATC, S7 GCTAAACCTC. gene 3: S1 TATTGATACA, S3 TCTTGATACC, S4 TAGTGATGCA, S7 TAGTGATGCA, S8 CATTCATACC

  5. Combined Analysis (columns: gene 1, gene 2, gene 3; ? marks missing data). S1: TCTAATGGAA ?????????? TATTGATACA. S2: GCTAAGGGAA ?????????? ??????????. S3: TCTAAGGGAA ?????????? TCTTGATACC. S4: TCTAACGGAA GGTAACCCTC TAGTGATGCA. S5: ?????????? GCTAAACCTC ??????????. S6: ?????????? GGTGACCATC ??????????. S7: TCTAATGGAC GCTAAACCTC TAGTGATGCA. S8: TATAACGGAA ?????????? CATTCATACC
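The concatenation shown on this slide can be sketched in a few lines. This is only an illustration, assuming each gene is a dictionary from taxon name to an aligned sequence; the names `genes` and `concatenate` are hypothetical, and the sequences are the ones read from the example slides.

```python
# Per-gene alignments as read from the example slides (taxon -> sequence).
genes = {
    "gene1": {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
              "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"},
    "gene2": {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC", "S6": "GGTGACCATC",
              "S7": "GCTAAACCTC"},
    "gene3": {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
              "S7": "TAGTGATGCA", "S8": "CATTCATACC"},
}


def concatenate(genes):
    """Build a combined matrix, padding absent genes with '?' characters."""
    taxa = sorted({taxon for aln in genes.values() for taxon in aln})
    matrix = {}
    for taxon in taxa:
        parts = []
        for aln in genes.values():
            # All sequences in one gene alignment share the same length.
            length = len(next(iter(aln.values())))
            parts.append(aln.get(taxon, "?" * length))
        matrix[taxon] = "".join(parts)
    return matrix


supermatrix = concatenate(genes)
for taxon, row in supermatrix.items():
    print(taxon, row)
```

Only S4 and S7 have data for all three genes, which is the missing-data problem the slide illustrates.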

  6. Two competing approaches (figure: the species-by-gene matrix with columns gene 1, gene 2, ..., gene k; either analyze each gene separately and apply a Supertree Method, or use Combined Analysis)

  7. Why use supertree methods? • Missing data • Large dataset sizes • Incompatible data types (e.g., morphological features, biomolecular sequences, gene orders, even distances based upon biochemistry) • Unavailable sequence data (only trees)

  8. Many Supertree Methods: MRP, weighted MRP, Min-Cut, Modified Min-Cut, Semi-strict Supertree, MRF, MRD, QILI, SDM, Q-imputation, PhySIC, Majority-Rule Supertrees, Maximum Likelihood Supertrees, and many more ... MRP = Matrix Representation with Parsimony (most commonly used and most accurate)

  9. Today’s Outline • Supertree and combined analysis methods • Why we need better supertree methods • SuperFine: a new supertree method that is fast and more accurate than other supertree methods • Strict Consensus Merger (SCM) • Resolving polytomies • Performance of SuperFine (compared to MRP and combined analyses) • Applications and future work

  10. Previous Simulation Studies (figure: a taxa-by-gene matrix with columns gene 1, gene 2, ..., gene k) 1. Generate Model Tree 2. Generate sequence data 3. Select Subsets 4. Construct Source Trees 5. Apply Supertree Method 6. Compare to Model Tree

  11. What leads to missing data? • Evolution (gain and loss of genes) • Dataset selection • Limited resources (time, money, etc.)

  12. My Simulation Study • Generate model trees (100-1000 taxa) • Simulate gene gain and loss and generate sequences • Simulate techniques for gene and taxon selection • Clade-based datasets • Scaffold dataset • Generate source trees and a combined dataset • Apply supertree and combined analysis methods • Compare each estimated tree to the model tree, and record topological error

  13. Experimental Parameters • Number of taxa in model tree: 100, 500, and 1000 • Generate 5, 15 and 25 clade-based datasets, respectively • Scaffold density: 20%, 50%, 75%, and 100% • Six super-methods: • Combined analysis using ML and MP • MRP on ML and MP source trees • Weighted MRP on ML and MP source trees (MRP = Matrix Representation with Parsimony)
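For readers unfamiliar with MRP, the standard matrix encoding can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: each source tree is assumed to be given as a pair (leaf set, list of clusters), where a cluster is the set of taxa on one side of an internal edge, and the names `mrp_matrix`, `tree1`, and `tree2` are hypothetical. Each internal edge contributes one column: 1 for taxa in the cluster, 0 for other taxa in that tree, and ? for taxa absent from that tree. The resulting matrix would then be handed to a parsimony program, which is not shown.

```python
def mrp_matrix(source_trees):
    """Encode source trees as an MRP character matrix (taxon -> string)."""
    all_taxa = sorted(set().union(*(leaves for leaves, _ in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for leaves, clusters in source_trees:
        for cluster in clusters:  # one matrix column per internal edge
            for taxon in all_taxa:
                if taxon not in leaves:
                    rows[taxon].append("?")  # taxon missing from this tree
                elif taxon in cluster:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Two hypothetical four-taxon source trees: ab|cd and bc|de.
tree1 = ({"a", "b", "c", "d"}, [{"a", "b"}])
tree2 = ({"b", "c", "d", "e"}, [{"b", "c"}])

matrix = mrp_matrix([tree1, tree2])
for taxon, row in matrix.items():
    print(taxon, row)
```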

  14. Quantifying Topological Error (figure: a True Tree and an Estimated Tree on leaves A, B, C, D, E, F) • False negative (FN): An edge in the true tree missing from the estimated tree • False positive (FP): An edge in the estimated tree not in the true tree
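The two error counts can be computed by comparing the bipartitions (internal edges) of the two trees. The sketch below assumes unrooted trees written as nested tuples of leaf names; the example trees are hypothetical, not the ones drawn on the slide, and normalizing FN by the true tree's internal edges and FP by the estimated tree's internal edges is one common convention, assumed here.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Return the nontrivial bipartitions (internal edges) of the tree."""
    all_leaves = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        if len(side) >= 2 and len(other) >= 2:
            splits.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return splits


def error_rates(true_tree, est_tree):
    """Return (FN count, FP count, FN rate, FP rate)."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(est_tree)
    fn = len(true_splits - est_splits)  # true edges missing from estimate
    fp = len(est_splits - true_splits)  # estimated edges not in true tree
    fn_rate = fn / len(true_splits) if true_splits else 0.0
    fp_rate = fp / len(est_splits) if est_splits else 0.0
    return fn, fp, fn_rate, fp_rate


true_tree = (("A", "B"), ("C", "D"), ("E", "F"))  # hypothetical
est_tree = (("A", "B"), ("C", "E"), ("D", "F"))   # hypothetical
print(error_rates(true_tree, est_tree))
```

An unresolved (star) estimate has no false positives but misses every true edge, which is the trade-off discussed later for SCM.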

  15. Comparison of MRP-ML and CA-ML (False Negative Rate) (plot; x-axis: Scaffold Density (%))

  16. We still need supertree methods! Combined analysis cannot be used for: • Datasets that are very large • Incompatible data types • Unavailable sequence data

  17. Outline • Supertree and combined analysis methods • Why we need better supertree methods • SuperFine: a new supertree method that is fast and more accurate than other supertree methods • Strict Consensus Merger (SCM) • Resolving polytomies • Performance of SuperFine (compared to MRP and combined analyses) • Applications and future work

  18. Methods that Led to SuperFine • The Strict Consensus Merger (SCM) (Huson et al. 1999) • Quartet MaxCut (QMC) (Snir and Rao 2008)

  19. Strict Consensus Merger (SCM) (figure: example trees on leaves a through j)

  20. Theorem: Let S be a collection of source trees and T be an SCM tree on S. Then for every s in S, ∑(T|L(s)) ⊆ ∑(s), where T|L(s) is the induced subtree of T on the leaf set of s.
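The property in the theorem can be checked mechanically on small examples. The sketch below rests on two assumptions that the slide does not spell out: that ∑(t) denotes the set of bipartitions of tree t, and that the relation in the theorem is containment (⊆). Trees are given as a leaf set plus one side of each internal edge; the function names and the example trees are hypothetical.

```python
def canonical(side, leafset):
    """Represent a bipartition as an unordered pair of leaf sets."""
    return frozenset([frozenset(side), frozenset(leafset - side)])


def restrict(sides, leafset, subset):
    """Bipartitions induced on `subset` by a tree given as edge sides."""
    induced = set()
    for side in sides:
        a = side & subset
        b = (leafset - side) & subset
        if len(a) >= 2 and len(b) >= 2:  # keep nontrivial splits only
            induced.add(frozenset([a, b]))
    return induced


def satisfies_theorem(t_sides, t_leaves, sources):
    """Check that every induced bipartition of T appears in its source tree."""
    for s_leaves, s_sides in sources:
        s_splits = {canonical(side, s_leaves) for side in s_sides}
        if not restrict(t_sides, t_leaves, s_leaves) <= s_splits:
            return False
    return True


# Hypothetical supertree ((a,b),c,d,(e,f)) with a polytomy in the middle.
t_leaves = frozenset("abcdef")
t_sides = [frozenset("ab"), frozenset("ef")]
# Hypothetical source trees ab|cd and cd|ef.
sources = [(frozenset("abcd"), [frozenset("ab")]),
           (frozenset("cdef"), [frozenset("cd")])]

print(satisfies_theorem(t_sides, t_leaves, sources))
```

A supertree containing an edge such as ac|bdef would fail the check against the first source tree, because the induced split ac|bd is not an edge of ab|cd.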

  21. Intuition for the Theorem (figure: example trees on leaves a through j)

  22. Performance of SCM • Low false positive (FP) rate (Estimated supertree has few false edges) • High false negative (FN) rate (Estimated supertree is missing many true edges)

  23. Methods that Led to SuperFine • The Strict Consensus Merger (SCM) (Huson et al. 1999) • Quartet MaxCut (QMC) (Snir and Rao 2008)

  24. Quartet MaxCut (QMC) (figure: example quartet trees on leaves 1 through 7). QMC is a heuristic for the following optimization problem: Given a collection Q of quartet trees, find a supertree T, with leaf set L(T) = ∪_{q∈Q} L(q), that displays the maximum number of quartet trees in Q.

  25. Maximizing # of Quartet Trees Displayed (figure: quartet trees and a supertree on leaves 1 through 7) • 12|34, 23|45, 34|56, 45|67 are compatible quartet trees, all displayed by the supertree shown • Adding the quartet 17|23 creates an incompatible set of quartet trees. An “optimal” supertree would be the same as above, because it agrees with 4 out of 5 quartet trees.
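The count of 4 out of 5 can be reproduced with a small check. The sketch assumes the supertree on the slide is the caterpillar tree with leaves in the order 1, 2, ..., 7 (the figure itself is not recoverable from the transcript), and uses the fact that a tree displays the quartet ab|cd exactly when some edge separates {a, b} from {c, d}. The function and variable names are hypothetical.

```python
def displays(sides, leafset, quartet):
    """True if some edge of the tree separates the two pairs of the quartet."""
    (a, b), (c, d) = quartet
    for side in sides:
        other = leafset - side
        if {a, b} <= side and {c, d} <= other:
            return True
        if {a, b} <= other and {c, d} <= side:
            return True
    return False


leafset = frozenset(range(1, 8))
# Assumed supertree: caterpillar 1-2-3-4-5-6-7, one side per internal edge.
caterpillar = [frozenset({1, 2}), frozenset({1, 2, 3}),
               frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 4, 5})]

quartets = [((1, 2), (3, 4)), ((2, 3), (4, 5)), ((3, 4), (5, 6)),
            ((4, 5), (6, 7)), ((1, 7), (2, 3))]

score = sum(displays(caterpillar, leafset, q) for q in quartets)
print(score, "of", len(quartets), "quartets displayed")
```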

  26. QMC as a Supertree Method • Step 1: Encode source trees as a set of quartets • Step 2: Apply QMC
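Step 1 can be illustrated by enumerating the quartets that a source tree induces: every internal edge A|B yields the quartet ab|cd for each pair a, b in A and each pair c, d in B. This is a simple sketch of one possible encoding, not necessarily the one used in the talk; the tree is given as a leaf set plus one side of each internal edge, and `quartets_of` is a hypothetical name. The number of quartets grows roughly with the fourth power of the number of leaves, so this brute-force form only suits small trees.

```python
from itertools import combinations


def quartets_of(sides, leafset):
    """Return all resolved quartets ab|cd induced by the tree's edges."""
    quartets = set()
    for side in sides:
        other = leafset - side
        for pair1 in combinations(sorted(side), 2):
            for pair2 in combinations(sorted(other), 2):
                quartets.add(frozenset([frozenset(pair1), frozenset(pair2)]))
    return quartets


# Hypothetical source tree ((a,b),c,(d,e)) with internal edges ab|cde and de|abc.
leafset = frozenset("abcde")
sides = [frozenset("ab"), frozenset("de")]

encoded = quartets_of(sides, leafset)
for quartet in sorted(sorted("".join(sorted(p)) for p in q) for q in encoded):
    print("|".join(quartet))
```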

  27. Idea behind SuperFine • First, construct a supertree with low false positives using SCM (the Strict Consensus Merger) • Then, refine the tree to reduce false negatives by resolving each polytomy using QMC (Quartet MaxCut)

  28. Resolving a single polytomy, v • Step 1: Encode each source tree as a collection of quartet trees on {1,2,...,d}, where d=degree(v) • Step 2: Apply Quartet MaxCut (Snir and Rao) to the collection of quartet trees, to produce a tree t on leafset {1,2,...,d} • Step 3: Replace the star tree at v by tree t Why?
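The relabeling behind Step 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact encoding from the talk: the supertree is an adjacency dictionary, each leaf is labeled by the index (1 to d) of the edge at v through which it is reached, and a source-tree quartet is kept only if its four leaves receive four distinct labels. All names and the example tree are hypothetical.

```python
def component_labels(adj, v):
    """Map each leaf to the index (1..d) of the edge at v leading to it."""
    labels = {}
    for index, start in enumerate(adj[v], start=1):
        stack, seen = [start], {v, start}
        while stack:
            node = stack.pop()
            if len(adj[node]) == 1:  # a leaf has exactly one neighbor
                labels[node] = index
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return labels


def relabel_quartet(quartet, labels):
    """Translate a quartet on leaves to one on {1..d}, or None if degenerate."""
    (a, b), (c, d) = quartet
    new = [labels[x] for x in (a, b, c, d)]
    if len(set(new)) < 4:
        return None  # two leaves lie behind the same edge at v
    return ((new[0], new[1]), (new[2], new[3]))


# Hypothetical supertree: polytomy v of degree 4 with leaves a, b and
# two cherries (c,d) and (e,f) hanging off internal nodes u and w.
adj = {"v": ["a", "b", "u", "w"], "a": ["v"], "b": ["v"],
       "u": ["v", "c", "d"], "c": ["u"], "d": ["u"],
       "w": ["v", "e", "f"], "e": ["w"], "f": ["w"]}

labels = component_labels(adj, "v")
print(relabel_quartet((("a", "c"), ("b", "e")), labels))
print(relabel_quartet((("c", "d"), ("e", "f")), labels))
```

The relabeled quartets would then be passed to a QMC implementation (not shown) to obtain the tree t on {1, 2, ..., d} used in Step 3.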

  29. Back to Our Example (figure: the example trees on leaves a through j, with leaves relabeled 1 through 6)

  30. Where We Use the Theorem (figure: the example trees on leaves a through j, with labels 1 through 6). For every s in S, ∑(T|L(s)) ⊆ ∑(s)

  31. Step 1: Encode each source tree as a collection of quartet trees on {1,2,...,d} (figure: the example source trees relabeled with 1 through 6)

  32. Step 2: Apply Quartet MaxCut (QMC) to the collection of quartet trees (figure: quartet trees on labels 1 through 6 passed to QMC, which returns a tree on leaves 1 through 6)

  33. Replace polytomy using tree from QMC (figure: the example supertree on leaves a through j with the polytomy replaced by the QMC tree on labels 1 through 6)

  34. False Negative Rate (plot; x-axis: Scaffold Density (%))

  35. False Negative Rate (plot; x-axis: Scaffold Density (%))

  36. False Positive Rate (plot; x-axis: Scaffold Density (%))

  37. Running Time: SuperFine vs. MRP. MRP: 8-12 sec. SuperFine: 2-3 sec. (plots; x-axis: Scaffold Density (%))

  38. Observations • SuperFine is much more accurate than MRP, with comparable performance only when the scaffold density is 100% • SuperFine is almost as accurate as CA-ML • SuperFine is extremely fast

  39. Future Work • Exploring the algorithm design space for SuperFine: different quartet encodings; not using SCM in Step 1; parallel version; post-processing step to minimize Sum-of-FN to source trees • Using SuperFine to enable phylogeny estimation: without an alignment; on many-marker combined datasets • Using SuperFine in conjunction with divide-and-conquer methods to create more accurate phylogenetic methods • Exploration of the impact of source tree collections (in particular the scaffold) on supertree analyses • Revisiting specific biological supertrees
