
SuperFine, Enabling Large-Scale Phylogenetic Estimation

SuperFine, Enabling Large-Scale Phylogenetic Estimation. Shel Swenson, University of Southern California and Georgia Institute of Technology. Phylogeny (evolutionary tree). Orangutan. Human. Gorilla. Chimpanzee.


Presentation Transcript


  1. SuperFine, Enabling Large-Scale Phylogenetic Estimation. Shel Swenson, University of Southern California and Georgia Institute of Technology.

  2. Phylogeny (evolutionary tree): Orangutan, Human, Gorilla, Chimpanzee. “Nothing in Biology Makes Sense Except in the Light of Evolution” – Dobzhansky. Images (1-3) from the Tree of Life Website, University of Arizona.

  3. Tree of Life, Importance to Biology: biomedical applications, mechanisms of evolution, tracking ancient migrations, protein structure and function, drug design. [Figure: tree of life marked “We are here”.] Image sources: 1) Nature Reviews (Genetics), 2) Howard Hughes Medical Institute (BioInteractive), 3) 1000 Genomes Project.

  4. DNA sequence evolution (idealized)
     -3 million yrs: AAGACTT
     -2 million yrs: AAGGCCT, TGGACTT
     -1 million yrs: AGGGCAT, TAGCCCT, AGCACTT
     today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT

  5. Phylogeny Problem: given sequences for taxa U, V, W, X, Y (AGACTA, TGGACA, TGCGACT, AGGTCA, AGATTA), estimate the tree on the leaves U, V, W, X, Y.

  6. Two basic approaches for tree estimation on multi-gene datasets
     • Apply phylogeny estimation methods to concatenated (“combined”) sequence alignments for different genes
     • Compute trees on individual genes and apply a supertree method
     This talk: SuperFine boosts supertree methods, enabling faster, more accurate estimation for large-scale problems.

  7. Using multiple genes
     gene 1: S1 TCTAATGGAA, S2 GCTAAGGGAA, S3 TCTAAGGGAA, S4 TCTAACGGAA, S7 TCTAATGGAC, S8 TATAACGGAA
     gene 2: S4 GGTAACCCTC, S5 GCTAAACCTC, S6 GGTGACCATC, S7 GCTAAACCTC
     gene 3: S1 TATTGATACA, S3 TCTTGATACC, S4 TAGTGATGCA, S7 TAGTGATGCA, S8 CATTCATACC

  8. Concatenation ("?" marks missing data)
          gene 1      gene 2      gene 3
     S1   TCTAATGGAA  ??????????  TATTGATACA
     S2   GCTAAGGGAA  ??????????  ??????????
     S3   TCTAAGGGAA  ??????????  TCTTGATACC
     S4   TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
     S5   ??????????  GCTAAACCTC  ??????????
     S6   ??????????  GGTGACCATC  ??????????
     S7   TCTAATGGAC  GCTAAACCTC  TAGTGATGCA
     S8   TATAACGGAA  ??????????  CATTCATACC
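The concatenation step on slide 8 can be sketched in a few lines of Python. This is only an illustration: the assignment of sequences to genes is read off the slide as reconstructed here, and the function name `concatenate` is hypothetical, not part of any tool mentioned in the talk.

```python
# Per-gene alignments keyed by species name. The gene-to-sequence assignment
# is an assumption reconstructed from the slide layout.
gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
         "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"}
gene2 = {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC", "S6": "GGTGACCATC",
         "S7": "GCTAAACCTC"}
gene3 = {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
         "S7": "TAGTGATGCA", "S8": "CATTCATACC"}


def concatenate(genes, missing="?"):
    """Join per-gene alignments into one row per species.

    A species absent from a gene gets a run of `missing` characters
    as long as that gene's alignment.
    """
    species = sorted(set().union(*genes))
    rows = {}
    for name in species:
        parts = []
        for gene in genes:
            length = len(next(iter(gene.values())))
            parts.append(gene.get(name, missing * length))
        rows[name] = "".join(parts)
    return rows


supermatrix = concatenate([gene1, gene2, gene3])
for name, row in supermatrix.items():
    print(name, row)
```

In this toy matrix, species S2, S5 and S6 each have data for only one of the three genes, which is the missing-data problem the talk raises as a motivation for supertree methods.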

  9. Two competing approaches: for species sampled across gene 1, gene 2, . . ., gene k, either analyze each gene separately and combine the resulting trees with a supertree method, or analyze the concatenation of all genes.

  10. Why use supertree methods?
      • Missing data
      • Large dataset sizes
      • Incompatible data types (e.g., morphological features, biomolecular sequences, gene orders, even distances based upon biochemistry)
      • Unavailable sequence data (only trees)

  11. Many Supertree Methods: MRP, weighted MRP, Min-Cut, Modified Min-Cut, Semi-strict Supertree, MRF, MRD, QILI, SDM, Q-imputation, PhySIC, Majority-Rule Supertrees, Maximum Likelihood Supertrees, and many more. MRP stands for Matrix Representation with Parsimony (most commonly used and among the most accurate).

  12. Quantifying Error. FN: false negative (missing edge). FP: false positive (incorrect edge). [Figure: example trees with an FN edge and an FP edge marked; 50% error rate.]
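As a rough illustration of the FN and FP rates on slide 12, the sketch below compares two trees through their sets of non-trivial bipartitions. The toy trees, the helper names `split` and `error_rates`, and the choice to normalize by the number of internal edges in the true tree (FN) and in the estimated tree (FP) are assumptions made for this example.

```python
def split(side, leaves):
    """Return the bipartition side | (leaves - side) as an unordered pair."""
    side = frozenset(side)
    return frozenset({side, frozenset(leaves) - side})


def error_rates(true_splits, est_splits):
    """FN rate: share of true edges missing from the estimate.
    FP rate: share of estimated edges absent from the true tree."""
    fn = len(true_splits - est_splits)
    fp = len(est_splits - true_splits)
    fn_rate = fn / len(true_splits) if true_splits else 0.0
    fp_rate = fp / len(est_splits) if est_splits else 0.0
    return fn_rate, fp_rate


leaves = "abcde"
# Hypothetical true tree ((a,b),c,(d,e)) and estimated tree ((a,b),d,(c,e)).
true_tree = {split("ab", leaves), split("abc", leaves)}
estimated = {split("ab", leaves), split("abd", leaves)}

fn_rate, fp_rate = error_rates(true_tree, estimated)
print(fn_rate, fp_rate)  # 0.5 0.5
```

Here one of the two true internal edges is missing and one of the two estimated edges is wrong, so both rates come out at 50%.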

  13. FN rate: MRP vs. Concatenation. [Plot: FN Rate (%) against Scaffold Density (%) for MRP and Concatenation.] Concatenation is not always an option. We need better supertree methods.

  14. FN Rate: SuperFine vs. MRP and Concatenation. [Plot: FN Rate (%) against Scaffold Density (%) for MRP, SuperFine, and Concatenation.]

  15. Running Time: SuperFine vs. MRP (Concatenation is much slower). [Plots: running time in minutes against Scaffold Density (%) for MRP and SuperFine; annotation: MRP 8-12 sec., SuperFine 2-3 sec.]

  16. Idea behind SuperFine
      • Construct a supertree with a low false positive rate
      • Reduce false negatives by resolving areas of uncertainty using a supertree method (e.g., Quartet Max Cut)
      (Swenson et al., Systematic Biology, 2011)

  17. Bipartitions and refinement. Let B(T) denote the set of (non-trivial) bipartitions induced by the edges of T. T refines T’ (T’ ≤ T) if B(T) ⊇ B(T’). Example: B(T) = {ab|cdef, abc|def, abcd|ef} and B(T’) = {ab|cdef, abc|def}, so T refines T’. [Figure: T’ contains a polytomy; T is a refinement of T’.]
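The definition on slide 17 can be checked mechanically. The sketch below is a minimal illustration, assuming trees are written as nested Python tuples with strings as leaves; the two tuples are one way to draw trees whose bipartition sets match the slide's B(T) and B(T’), and the function names are hypothetical.

```python
def leaf_set(tree):
    """Collect the leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions induced by the edges of `tree`."""
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        other = all_leaves - clade
        # Keep only splits with at least two leaves on each side.
        if len(clade) > 1 and len(other) > 1:
            found.add(frozenset({clade, other}))
        for child in node:
            visit(child)

    visit(tree)
    return found


def refines(fine, coarse):
    """T refines T' when B(T) contains every bipartition of B(T')."""
    return bipartitions(fine) >= bipartitions(coarse)


# T has bipartitions ab|cdef, abc|def, abcd|ef; T_prime lacks abcd|ef,
# so d, e and f hang off a polytomy.
T = ((("a", "b"), "c"), "d", ("e", "f"))
T_prime = ((("a", "b"), "c"), "d", "e", "f")

print(refines(T, T_prime))  # True
print(refines(T_prime, T))  # False
```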

  18. Idea behind SuperFine
      • Construct a supertree with low FP using the Strict Consensus Merger (SCM) (Huson et al. 1999)
      • Reduce FN by resolving each polytomy using a supertree method (e.g., Quartet Max Cut)

  19. Strict Consensus Merger (SCM). [Figure: two source trees, one on leaves a, b, c, d, e, f, g and one on leaves a, b, c, d, h, i, j, merged into a single tree on leaves a through j.]

  20. Property of SCM: bipartitions in the SCM tree correspond to bipartitions in the source trees (Swenson, Ph.D. Thesis, 2009). [Figure: the SCM tree on leaves a through j shown with the two source trees.]

  21. Performance of SCM
      • Low false positive (FP) rate (estimated supertree has few false edges)
      • High false negative (FN) rate (estimated supertree is missing many true edges)
      • Runs in polynomial time (in the number of source trees and total number of species)

  22. Idea behind SuperFine
      • Construct a supertree with low FP using SCM
      • Refine the tree to reduce FN by resolving each polytomy using a supertree method (e.g., MRP or Quartet Max Cut)

  23. Resolving a single polytomy, v
      • Step 1: Reduce each source tree to a tree on {1,2,...,d}, where d = degree(v)
      • Step 2: Apply MRP to the collection of reduced trees, to produce a tree t on leaf set {1,2,...,d}
      • Step 3: Replace the star tree at v by tree t
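The relabeling behind Step 1 can be sketched as follows: removing the polytomy vertex v splits the tree into d components, and every leaf takes the number of the component it falls in. The tree topology below is hypothetical, chosen only so that the labels agree with those shown in the talk's example (a, b, c, e → 1; h → 2; i, j → 3; d → 4; g → 5; f → 6). The sketch covers the relabeling only, not the collapsing of each source tree into its reduced topology.

```python
# Hypothetical tree with a polytomy at "v" of degree 6; "u1", "u3" and "w"
# are made-up internal nodes. Listing the edges at v first fixes the order
# (and therefore the numbering) of the components.
edges = [("v", "u1"), ("v", "h"), ("v", "u3"), ("v", "d"), ("v", "g"), ("v", "f"),
         ("u1", "a"), ("u1", "b"), ("u1", "w"), ("w", "c"), ("w", "e"),
         ("u3", "i"), ("u3", "j")]

adjacency = {}
for x, y in edges:
    adjacency.setdefault(x, []).append(y)
    adjacency.setdefault(y, []).append(x)


def component_labels(adjacency, v):
    """Label each leaf 1..d by the component of (tree minus v) containing it."""
    labels = {}
    for label, start in enumerate(adjacency[v], start=1):
        stack, seen = [start], {v, start}
        while stack:
            node = stack.pop()
            if len(adjacency[node]) == 1:  # degree-1 nodes are leaves
                labels[node] = label
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return labels


labels = component_labels(adjacency, "v")

# Leaf sets of the two source trees from the talk's example.
source_leaf_sets = [set("abcdefg"), set("abcdhij")]
reduced_label_sets = [sorted({labels[x] for x in leaves})
                      for leaves in source_leaf_sets]
print(reduced_label_sets)  # [[1, 4, 5, 6], [1, 2, 3, 4]]
```

Each source tree is then re-encoded on at most d labels instead of its full leaf set, which is where the final slide locates the speed-up.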

  24. Back to Our Example. [Figure: the leaves of the SCM tree and of both source trees relabeled by their component around the polytomy: a, b, c, e → 1; h → 2; i, j → 3; d → 4; g → 5; f → 6.]

  25. Where We Use the Property. [Figure: the SCM tree and the two source trees with the component labels 1 through 6.]

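  26. Step 1: Reduce each source tree to a tree on the set {1,2,...,d}. [Figure: the source tree on leaves a through g reduces to a tree on labels 1, 4, 5, 6; the source tree on leaves a, b, c, d, h, i, j reduces to a tree on labels 1, 2, 3, 4.]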

  27. Step 2: Apply MRP to the collection of reduced trees. [Figure: MRP combines the reduced trees on labels {1, 4, 5, 6} and {1, 2, 3, 4} into a tree on labels 1 through 6.]
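For readers unfamiliar with MRP, the matrix it analyzes can be sketched as below. In the standard encoding, every internal edge of every source tree becomes one binary character: taxa on one side of the edge get 1, the other taxa of that source tree get 0, and taxa absent from that source tree get "?". The two reduced trees and their splits here are hypothetical, since the topologies cannot be recovered from the transcript, and the parsimony search that MRP then runs on this matrix is not shown.

```python
def mrp_matrix(source_trees, taxa):
    """Build the MRP character matrix.

    `source_trees` is a list of (leaf_set, splits) pairs, where each split is
    given by the set of taxa on one side of an internal edge.
    """
    rows = {taxon: [] for taxon in taxa}
    for leaves, splits in source_trees:
        for side in splits:
            for taxon in taxa:
                if taxon not in leaves:
                    rows[taxon].append("?")  # taxon missing from this tree
                elif taxon in side:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


taxa = [1, 2, 3, 4, 5, 6]
# Hypothetical reduced trees: one on {1,4,5,6} with split 1 6 | 4 5,
# one on {1,2,3,4} with split 1 2 | 3 4.
source_trees = [({1, 4, 5, 6}, [{1, 6}]),
                ({1, 2, 3, 4}, [{1, 2}])]

matrix = mrp_matrix(source_trees, taxa)
for taxon in taxa:
    print(taxon, matrix[taxon])
```

Because the reduced trees have at most d leaves each, the matrix handed to the parsimony search is much smaller than one built from the original source trees.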

  28. Replace polytomy using tree from MRP. [Figure: the MRP tree on labels 1 through 6 substituted for the polytomy, giving a more resolved supertree on leaves a through j.]

  29. FN Rate: SuperFine vs. MRP and Concatenation. [Plot: FN Rate (%) against Scaffold Density (%) for MRP, SuperFine, and Concatenation.]

  30. Running Time: SuperFine vs. MRP (Concatenation is much slower). [Plots: running time in minutes against Scaffold Density (%) for MRP and SuperFine; annotation: MRP 8-12 sec., SuperFine 2-3 sec.]

  31. SuperFine: Boosting supertree methods
      • SuperFine+MRP vs. MRP (Swenson et al. 2011)
        • SuperFine combines the features of the SCM method (polynomial time, low false positive rates) with the lower false negative rate of MRP, to achieve greater accuracy in less time.
        • Speed-up results from the re-encoding of source trees as smaller trees.
      • SuperFine+QMC vs. QMC (quartet-based)
        • QMC (Snir 2008): polynomial time, but infeasible for 500+ taxa
        • SuperFine+QMC runs where QMC cannot (Swenson et al. 2010)
      • SuperFine+MRL vs. MRL (likelihood) (Nguyen et al. 2012)
        • SuperFine+MRL: faster and more accurate, with similar likelihood scores
      DACTAL (Nelesen et al. 2012): boosting concatenation methods; uses SuperFine in its divide-and-conquer strategy
