New methods for estimating species trees from genome-scale data

New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois

Phylogeny (evolutionary tree) [Figure: tree with leaves Orangutan, Gorilla, Chimpanzee, Human.] From the Tree of Life Website, University of Arizona

Phylogenomics = Species trees from whole genomes “Nothing in biology makes sense except in the light of evolution” - Dobzhansky

The Tree of Life: Multiple Challenges • Scientific challenges: • Ultra-large multiple-sequence alignment • Alignment-free phylogeny estimation • Supertree estimation • Estimating species trees from many gene trees • Genome rearrangement phylogeny • Reticulate evolution • Visualization of large trees and alignments • Data mining techniques to explore multiple optima • Theoretical guarantees under Markov models of evolution • Applications: • metagenomics • protein structure and function prediction • trait evolution • detection of co-evolution • systems biology • Techniques: • Graph theory (especially chordal graphs) • Probability theory and statistics • Hidden Markov models • Combinatorial optimization • Heuristics • Supercomputing

Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity

Gene trees inside the species tree (Coalescent Process) [Figure, courtesy James Degnan: gene tree drawn inside the species tree, time axis running from Past to Present.] Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

Incomplete Lineage Sorting (ILS) • Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc. • There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS, focused around statistical consistency guarantees (theory) and performance on data.

Avian Phylogenomics Project MTP Gilbert, Copenhagen; G Zhang, BGI; E Jarvis, HHMI; T. Warnow, UT-Austin; S. Mirarab, UT-Austin; Md. S. Bayzid, UT-Austin; plus many, many other people… • Approx. 50 species, whole genomes, 14,000 loci • Jarvis, Mirarab, et al., Science 2014 • Major challenges: • Concatenation analysis took > 250 CPU years, and suggested a rapid radiation • We observed massive gene tree heterogeneity consistent with incomplete lineage sorting • Very poor resolution in the 14,000 gene trees (average bootstrap support 25%) • Standard coalescent-based species tree estimation methods contradicted concatenation analysis and prior studies

1KP: Thousand Transcriptome Project T. Warnow, UT-Austin; S. Mirarab, UT-Austin; N. Nguyen, UT-Austin; G. Ka-Shu Wong, U Alberta; N. Wickett, Northwestern; J. Leebens-Mack, U Georgia; N. Matasci, iPlant • 103 plant transcriptomes, 400-800 single copy “genes” • Next phase will be much bigger • Wickett, Mirarab et al., PNAS 2014 • Challenges: • Massive gene tree heterogeneity consistent with ILS • Could not use existing coalescent methods due to missing data (many gene trees could not be rooted) and large number of species

This talk • Gene tree heterogeneity due to incomplete lineage sorting, modelled by the multi-species coalescent (MSC) • Statistically consistent estimation of species trees under the MSC, and the impact of gene tree estimation error • New methods in phylogenomics: • Statistical binning (Science 2014) and Weighted Statistical Binning (PLOS One 2015): improving gene trees • ASTRAL (Bioinformatics 2014, 2015): quartet-based estimation • Open questions

Sampling multiple genes from multiple species [Figure: tree with leaves Orangutan, Gorilla, Chimpanzee, Human.] From the Tree of Life Website, University of Arizona

A species tree defines a probability distribution on gene trees under the Multi-Species Coalescent (MSC) Model [Figure, courtesy James Degnan: gene tree drawn inside the species tree, time axis running from Past to Present.] Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.
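To make "a species tree defines a probability distribution on gene trees" concrete, here is a minimal sketch for the smallest case: a rooted three-species tree ((A,B),C) with one sampled lineage per species. It uses the standard MSC result that the gene tree matches the species tree with probability 1 - (2/3)e^(-t), where t is the internal branch length in coalescent units. The function name and the parameter name t are illustrative assumptions, not taken from the slides.

```python
import math


def triplet_gene_tree_probs(t):
    """Gene tree topology probabilities under the MSC for species tree ((A,B),C).

    t is the length of the internal branch (ancestor of A and B) in
    coalescent units; the name is a hypothetical choice for this sketch.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # The A and B lineages fail to coalesce on the internal branch with
    # probability exp(-t); the three topologies are then equally likely.
    p_discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * p_discordant,  # matches the species tree
        "((A,C),B)": p_discordant,
        "((B,C),A)": p_discordant,
    }


if __name__ == "__main__":
    for t in (0.1, 1.0, 3.0):
        print(t, triplet_gene_tree_probs(t))
```

Shorter internal branches push the three probabilities toward 1/3 each, which is the high-ILS regime discussed in the rest of the talk.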

Statistical Consistency [Plot: error (vertical axis) versus amount of data (horizontal axis).]

Main competing approaches [Diagram: loci gene 1, gene 2, …, gene k sampled across the species. Either analyze each gene separately and combine the gene trees with a summary method, or concatenate the loci.]

Statistically consistent under MSC? • CA-ML (Concatenation using unpartitioned maximum likelihood) - NO • Most frequent gene tree – NO • Minimize Deep Coalescences (MDC) – NO • Greedy Consensus (GC) – NO • Matrix Representation with Parsimony (MRP, supertree method) – NO Hence, none of these standard approaches are proven to converge to the true species tree as the number of loci increases. Many of them are positively misleading (will converge to the wrong tree)!

Anomaly zone • The most probable gene tree on a set S of species may not be the species tree on S (anomaly zone, ask James Degnan and Noah Rosenberg), except for: • rooted three-species trees • unrooted four-species trees

Summary Methods . . .

Summary Methods . . . • Computing rooted species tree from rooted gene trees: • For every three species {a,b,c}, • record most frequent rooted gene tree on {a,b,c} • Combine rooted three-leaf gene trees into rooted tree if they are compatible • Theorem: This algorithm is statistically consistent under the MSC and runs in polynomial time.
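The rooted-triplet procedure above can be sketched as follows. This is an illustration under stated assumptions, not the implementation behind the theorem: trees are nested tuples of species-name strings, every gene tree contains every species, ties in triplet counts are broken arbitrarily, and the "combine" step uses the classical BUILD algorithm of Aho et al. as one way to assemble compatible rooted triplets. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def leaves(tree):
    """Leaf set of a rooted tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Leaf sets below every internal node of a nested-tuple tree."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clades(child))
    return found


def triplet_in_tree(tree_clades, a, b, c):
    """Return (sibling pair, outgroup) induced on {a, b, c}, or None if unresolved."""
    for x, y, z in ((a, b, c), (a, c, b), (b, c, a)):
        if any(x in cl and y in cl and z not in cl for cl in tree_clades):
            return (frozenset([x, y]), z)
    return None


def dominant_triplets(gene_trees, species):
    """Most frequent rooted triplet for every three species."""
    all_clades = [clades(t) for t in gene_trees]
    result = []
    for a, b, c in combinations(sorted(species), 3):
        counts = Counter()
        for cl in all_clades:
            trip = triplet_in_tree(cl, a, b, c)
            if trip is not None:
                counts[trip] += 1
        if counts:
            result.append(counts.most_common(1)[0][0])
    return result


def build(species, triplets):
    """BUILD: a rooted tree displaying all triplets, or None if they are incompatible."""
    species = set(species)
    if len(species) == 1:
        return next(iter(species))
    if len(species) == 2:
        return tuple(sorted(species))
    parent = {s: s for s in species}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Keep only triplets inside the current species set; join each sibling pair.
    relevant = [(pair, z) for pair, z in triplets if pair <= species and z in species]
    for pair, _ in relevant:
        x, y = pair
        parent[find(x)] = find(y)
    components = {}
    for s in species:
        components.setdefault(find(s), set()).add(s)
    if len(components) == 1:
        return None  # no valid split at the root: triplets conflict
    subtrees = []
    for comp in components.values():
        sub = build(comp, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)
```

For example, with two gene trees (((A,B),C),D) and one gene tree (((A,C),B),D), `build("ABCD", dominant_triplets(gene_trees, "ABCD"))` recovers a tree with clades {A,B} and {A,B,C}; the child order inside each tuple is not fixed.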

Summary Methods . . . • Computing unrooted species tree from unrooted gene trees: • For every four species {a,b,c,d}, • record most frequent unrooted gene tree on {a,b,c,d} • Combine unrooted four-leaf gene trees into unrooted tree if they are compatible (recursive algorithm based on finding sibling pairs and removing one sibling) • Theorem: This algorithm is statistically consistent under the MSC and runs in polynomial time.
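The unrooted quartet procedure above can be sketched in the same style. Assumptions for this illustration: each unrooted gene tree is written as nested tuples with an arbitrary root (only its bipartitions are used), every gene tree contains every species, and ties are broken arbitrarily. The combine step follows the slide's description: find a pair that is a sibling pair in every dominant quartet containing it, remove one sibling, recurse, then reattach it. Function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def leaves(tree):
    """Leaf set of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Leaf sets below every internal node (one side of each bipartition)."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clades(child))
    return found


def quartet_in_tree(tree_clades, four):
    """Induced quartet split {{a,b},{c,d}} on the four species, or None if unresolved."""
    four = frozenset(four)
    for cl in tree_clades:
        side = cl & four
        if len(side) == 2:  # this edge separates two of the four from the other two
            return frozenset([side, four - side])
    return None


def dominant_quartets(gene_trees, species):
    """Most frequent unrooted quartet tree for every four species."""
    all_clades = [clades(t) for t in gene_trees]
    dominant = {}
    for four in combinations(sorted(species), 4):
        counts = Counter()
        for cl in all_clades:
            q = quartet_in_tree(cl, four)
            if q is not None:
                counts[q] += 1
        if counts:
            dominant[frozenset(four)] = counts.most_common(1)[0][0]
    return dominant


def graft(tree, leaf, cherry):
    """Replace the given leaf by a two-leaf subtree (cherry)."""
    if isinstance(tree, str):
        return cherry if tree == leaf else tree
    return tuple(graft(child, leaf, cherry) for child in tree)


def combine(species, dominant):
    """Assemble dominant quartets into an unrooted tree, or None if they conflict."""
    species = sorted(species)
    if len(species) <= 3:
        return tuple(species)  # star tree: the only unrooted tree on three or fewer leaves
    for a, b in combinations(species, 2):
        others = [s for s in species if s not in (a, b)]
        is_sibling_pair = all(
            dominant.get(frozenset([a, b, c, d]))
            == frozenset([frozenset([a, b]), frozenset([c, d])])
            for c, d in combinations(others, 2)
        )
        if is_sibling_pair:
            rest = combine([s for s in species if s != b], dominant)
            return None if rest is None else graft(rest, a, (a, b))
    return None  # no sibling pair found: the quartets are not compatible
```

With three gene trees ((A,B),C,(D,E)) and one gene tree ((A,C),B,(D,E)), `combine("ABCDE", dominant_quartets(gene_trees, "ABCDE"))` returns (((A,B),C),D,E), which has the same bipartitions as the majority tree.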

Statistically consistent under ILS? • Coalescent-based summary methods: • MP-EST (Liu et al. 2010): maximum pseudo-likelihood estimation of rooted species tree based on rooted triplet tree distribution – YES • BUCKy-pop (Ané and Larget 2010): quartet-based Bayesian species tree estimation – YES • And many others (ASTRAL, ASTRID, NJst, GLASS, etc.) - YES • Co-estimation methods: *BEAST (Heled and Drummond 2009): Bayesian co-estimation of gene trees and species trees – YES Co-estimation methods are too slow to use on most datasets… hence the debate is largely between concatenation (traditional approach) and summary methods. • Single-site methods (SMRT, SVDquartets, METAL, SNAPP, and others) - YES • CA-ML (Concatenation using unpartitioned maximum likelihood) - NO • MDC – NO • GC (Greedy Consensus) – NO • MRP (supertree method) – NO

Results on 11-taxon datasets with weak ILS • *BEAST more accurate than summary methods (MP-EST, BUCKy, etc.) • CA-ML (concatenated analysis) most accurate • Datasets from Chung and Ané, 2011 • Bayzid & Warnow, Bioinformatics 2013

Results on 11-taxon datasets with weak ILS • *BEAST is MORE ACCURATE than summary methods, because *BEAST gets more accurate gene trees! • *BEAST more accurate than summary methods (MP-EST, BUCKy, etc.) • CA-ML (concatenated analysis) most accurate • Datasets from Chung and Ané, 2011 • Bayzid & Warnow, Bioinformatics 2013

Results on 11-taxon datasets with weak ILS • Summary methods (BUCKy-pop, MP-EST) are both statistically consistent under the MSC but are impacted by gene tree estimation error • *BEAST more accurate than summary methods (MP-EST, BUCKy, etc.) • CA-ML (concatenated analysis) most accurate • Datasets from Chung and Ané, 2011 • Bayzid & Warnow, Bioinformatics 2013

Results on 11-taxon datasets with weak ILS • Concatenation (RAxML) best of all methods on these data! (However, for high enough ILS, concatenation is not as accurate as the best summary methods.) • *BEAST more accurate than summary methods (MP-EST, BUCKy, etc.) • CA-ML (concatenated analysis) most accurate • Datasets from Chung and Ané, 2011 • Bayzid & Warnow, Bioinformatics 2013

Impact of Gene Tree Estimation Error on MP-EST • MP-EST has no error on true gene trees, but MP-EST has 9% error on estimated gene trees • Datasets: 11-taxon strong ILS conditions with 50 genes • Similar results for other summary methods (MDC, Greedy, etc.)

TYPICAL PHYLOGENOMICS PROBLEM: many poor gene trees • Summary methods combine estimated gene trees, not true gene trees. • Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error. • Genome-scale data includes a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal. • Some researchers also argue that “gene trees” should be based on very short alignments, to avoid intra-locus recombination.

Gene tree estimation error: key issue in the debate • Summary methods combine estimated gene trees, not true gene trees. • Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error. • Genome-scale data includes a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal. • Some researchers also argue that “gene trees” should be based on very short alignments, to avoid intra-locus recombination.

What is the impact of gene tree estimation error on species tree estimation? • Question: Do any summary methods converge to the species tree as the number of loci increases, but where each locus has only a constant number of sites? • Answers: Roch & Warnow, Syst Biol, March 2015: • Strict molecular clock: Yes for some new methods, even for a single site per locus • No clock: Unknown for all methods, including MP-EST, ASTRAL, etc. S. Roch and T. Warnow. "On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods", Systematic Biology, 64(4):663-676, 2015.

Avian Phylogenomics Project MTP Gilbert, Copenhagen; G Zhang, BGI; E Jarvis, HHMI; T. Warnow, UT-Austin; S. Mirarab, UT-Austin; Md. S. Bayzid, UT-Austin; plus many, many other people… • Approx. 50 species, whole genomes, 14,000 loci • Solution: Statistical Binning • Improves coalescent-based species tree estimation by improving gene trees (Mirarab, Bayzid, Boussau, and Warnow, Science 2014), see also weighted statistical binning (Bayzid et al., PLOS One 2015) • Avian species tree estimated using Statistical Binning with MP-EST (Jarvis, Mirarab, et al., Science 2014)

Ideas behind statistical binning • “Gene tree” error tends to decrease with the number of sites in the alignment • Concatenation (even if not statistically consistent) tends to be reasonably accurate when there is not too much gene tree heterogeneity [Plot: horizontal axis is the number of sites in an alignment.]

Note: Supergene trees are computed using fully partitioned maximum likelihood. Vertex coloring of a graph with balanced color classes is NP-hard; we used a heuristic.
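Because balanced vertex coloring is NP-hard, some heuristic is needed to form the bins. The sketch below is a simple greedy stand-in, not the heuristic used in the statistical binning papers. It assumes, hypothetically, that vertices are genes and that an edge joins two genes that must not share a bin; each color class is then one bin. The function name and the input format (a dict of adjacency sets) are illustrative choices.

```python
def balanced_greedy_coloring(conflicts):
    """Greedy proper coloring that favors small color classes.

    conflicts maps each vertex to the set of vertices it must not share a
    bin with (undirected graph, so the relation is symmetric).
    Returns a list of bins; each bin is a list of mutually non-adjacent vertices.
    """
    bins = []
    # Place high-degree vertices first, a common ordering for greedy coloring.
    order = sorted(conflicts, key=lambda v: (-len(conflicts[v]), v))
    for v in order:
        # Bins that contain no neighbor of v are admissible for v.
        admissible = [b for b in bins if not any(u in conflicts[v] for u in b)]
        if admissible:
            # Balance heuristic: put v into the smallest admissible bin.
            min(admissible, key=len).append(v)
        else:
            bins.append([v])
    return bins


if __name__ == "__main__":
    graph = {
        "g1": {"g2"},
        "g2": {"g1"},
        "g3": set(),
        "g4": set(),
    }
    print(balanced_greedy_coloring(graph))
```

On the small graph in the usage block, g1 and g2 conflict and therefore open two bins, and the unconstrained genes g3 and g4 are spread across them so that both bins end up with two genes.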

Theorem 3 (PLOS One, Bayzid et al. 2015): Unweighted statistical binning pipelines are not statistically consistent under GTR+MSC. As the number of sites per locus increases: • All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014) • For each bin, with probability converging to 1, the genes in the bin have the same tree topology (but can have different numeric parameters), and there is only one bin for any given tree topology • For each bin, a fully partitioned maximum likelihood (ML) analysis of its supergene alignment converges to a tree with the common gene tree topology. As the number of loci increases: • every gene tree topology appears with probability converging to 1. Hence as both the number of loci and number of sites per locus increase, with probability converging to 1, every gene tree topology appears exactly once in the set of supergene trees. It is impossible to infer the species tree from the flat distribution of gene trees!

Theorem 2 (PLOS One, Bayzid et al. 2015): Weighted statistical binning (WSB) pipelines are statistically consistent under GTR+MSC. Easy proof: As the number of sites per locus increases: • All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014) • For every bin, with probability converging to 1, the genes in the bin have the same tree topology • Fully partitioned GTR ML analysis of each bin converges to a tree with the common topology of the genes in the bin. Hence as the number of sites per locus and number of loci both increase, WSB followed by a statistically consistent summary method will converge in probability to the true species tree. Q.E.D.

Weighted Statistical Binning: empirical WSB generally benign to highly beneficial: • Improves accuracy of gene tree topology • Improves accuracy of species tree topology • Improves accuracy of species tree branch length • Reduces incidence of highly supported false positive branches

Statistical binning vs. unbinned • Datasets: 11-taxon strong ILS datasets with 50 genes from Chung and Ané, Systematic Biology • Binning produces bins with approximately 5 to 7 genes each

Statistical binning vs. Unbinned and Concatenation Species tree estimation error for MP-EST and ASTRAL, and also concatenation using ML, on avian simulated datasets: 48 taxa, moderately high ILS (AD=47%), 1000 genes, and varying gene sequence length. Bayzid et al., (2015). PLoS ONE 10(6): e0129183

Comparing Binned and Unbinned MP-EST on the Avian Dataset • Unbinned MP-EST strongly rejects Columbea, a major finding by Jarvis, Mirarab, et al. • Binned MP-EST is largely consistent with the ML concatenation analysis. • The trees presented in Science 2014 were the ML concatenation and Binned MP-EST trees.

Running Time Comparison • Concatenation analysis of the Avian dataset: • ~250 CPU years and 1 TB of memory • Statistical binning analysis: • ~5 CPU years, almost all of which was spent computing maximum likelihood gene trees, with much less memory usage. Species tree estimation using traditional approaches is more computationally expensive, and not as accurate as coalescent-based methods!

Summary (so far) • Statistical binning (weighted or unweighted): improves gene trees, and leads to improved species trees in the presence of ILS compared to unbinned analyses. • Statistical binning pipelines are also more accurate than concatenation under high ILS. • Pipelines using the weighted version are statistically consistent under the multi-species coalescent model. • Statistical binning pipelines are much faster than concatenation analyses (e.g. ~5 CPU years vs. ~250 CPU years for the avian dataset).

1KP: Thousand Transcriptome Project T. Warnow, UT-Austin; S. Mirarab, UT-Austin; N. Nguyen, UT-Austin; G. Ka-Shu Wong, U Alberta; N. Wickett, Northwestern; J. Leebens-Mack, U Georgia; N. Matasci, iPlant • 103 plant transcriptomes, 400-800 single copy “genes” • Wickett, Mirarab et al., PNAS 2014 • Next phase will be much bigger (~1000 species and ~1000 genes) • Challenges: • Massive gene tree heterogeneity consistent with ILS • Could not use existing coalescent methods due to missing data (many gene trees could not be rooted) and large number of species

1KP: Thousand Transcriptome Project T. Warnow, UT-Austin; S. Mirarab, UT-Austin; N. Nguyen, UT-Austin; G. Ka-Shu Wong, U Alberta; N. Wickett, Northwestern; J. Leebens-Mack, U Georgia; N. Matasci, iPlant • 103 plant transcriptomes, 400-800 single copy “genes” • Wickett, Mirarab et al., PNAS 2014 • Next phase will be much bigger (~1000 species and ~1000 genes) • Solution: • New coalescent-based method ASTRAL (Mirarab et al., ECCB/Bioinformatics 2014, Mirarab et al., ISMB/Bioinformatics 2015) • ASTRAL is statistically consistent, runs in polynomial time, and uses unrooted gene trees.

Constrained Maximum Quartet Support Tree • Input: Set 𝒯 = {t1, t2, …, tk} of unrooted gene trees, each on the set S of n species, and a set X of allowed bipartitions • Output: Unrooted tree T on leaf set S that maximizes the total quartet tree similarity to 𝒯, subject to T drawing its bipartitions from X. Theorems (Mirarab et al., 2014): • If X contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the MSC. • The constrained MQST problem can be solved in O(|X|^2 nk) time. (We use dynamic programming, and build the unrooted tree from the bottom up, based on “allowed clades”: halves of the allowed bipartitions.) Conjecture: MQST is NP-hard
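To illustrate the objective only, the sketch below scores a candidate tree by counting, over all gene trees and all four-species subsets, how often the gene tree induces the same resolved quartet as the candidate. It then picks the best tree from an explicit candidate list by brute force. This is not the dynamic programming over allowed bipartitions described above and does not use the set X; trees are nested tuples with an arbitrary root, all gene trees are assumed to contain every species, and all names are hypothetical.

```python
from itertools import combinations


def leaves(tree):
    """Leaf set of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Leaf sets below every internal node (one side of each bipartition)."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clades(child))
    return found


def quartet_in_tree(tree_clades, four):
    """Induced quartet split on the four species, or None if unresolved."""
    four = frozenset(four)
    for cl in tree_clades:
        side = cl & four
        if len(side) == 2:
            return frozenset([side, four - side])
    return None


def quartet_support(candidate, gene_trees):
    """Total number of quartet trees shared between the candidate and the gene trees."""
    cand_clades = clades(candidate)
    gene_clades = [clades(g) for g in gene_trees]
    score = 0
    for four in combinations(sorted(leaves(candidate)), 4):
        q = quartet_in_tree(cand_clades, four)
        if q is None:
            continue  # the candidate leaves this quartet unresolved
        score += sum(1 for gc in gene_clades if quartet_in_tree(gc, four) == q)
    return score


def best_candidate(candidates, gene_trees):
    """Brute-force argmax of quartet support over an explicit candidate list."""
    return max(candidates, key=lambda t: quartet_support(t, gene_trees))


if __name__ == "__main__":
    gene_trees = [
        (("A", "B"), ("C", "D")),
        (("A", "B"), ("C", "D")),
        (("A", "C"), ("B", "D")),
    ]
    candidates = [
        (("A", "B"), ("C", "D")),
        (("A", "C"), ("B", "D")),
        (("A", "D"), ("B", "C")),
    ]
    print(best_candidate(candidates, gene_trees))
```

Enumerating candidates is only feasible for very small n, which is why restricting the search to bipartitions in X and using dynamic programming matters in practice.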
