Recovering evolutionary history Björn Nystedt Bioinformatics scientist, SciLifeLab

Recovering evolutionary history Björn Nystedt Bioinformatics scientist, SciLifeLab bjorn.nystedt@scilifelab.se

SciLifeLab
Strategic funding from the government. Nodes in Stockholm (KI, SU, KTH) and Uppsala (UU).
“THE VISION is to become one of the leading centers in the world for high-throughput bioscience with focus on genomes, protein profiling and bioinformatics with relevance for human diseases”
• 10 platforms
• Genomics
• Bioinformatics
• Clinical diagnostics
• …
Capacity
• Approaching 300 human genomes/week (30X)
• Approaching 10,000 single-cell transcriptomes/day

Fig 1. Growth of DNA sequencing. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical? PLoS Biol 13(7): e1002195. doi:10.1371/journal.pbio.1002195

Ernst Haeckel (1834-1919)

Woese and 16S rRNA
Woese CR, Kandler O, Wheelis ML. (1990) Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A 87:4576-9

Re-writing the tree of life Eukaryotes nested within archaea? Spang et al. (2015) Nature 521:173–179

Woese revisited
[Figure label: Eukaryota]

The Eukaryotic tree of life The Eukaryotic Tree of Life from a Global Phylogenomic Perspective. Cold Spring Harb Perspect Biol. 2014. 6:a016147

Understanding change
Amniote phylogeny based on protein synonymous sites showing major features of amniote evolution. J Alföldi et al. Nature (2011)
“Nothing in biology makes sense except in the light of evolution” Dobzhansky 1973

What is a phylogeny
A phylogeny is a pattern of event histories shared between biological replicators. In practice, by far the most commonly used replicators are genes (or species, in some sense). A phylogeny is typically modeled as a tree, representing the historical events linking the replicators together.

What are we trying to do?
True history
[Figure: tree along a time axis ending in Species A, Species B and Species C. Sequences shown along the tree: ATCGTGT, ATCCTGT, ATAGTGT, ATCGTGT, ATAGTGT, ATCGTGA, ATCGTGT, ATAGTGT]

What are we trying to do?
Observed data
[Figure: only the present-day sequences remain on the time axis: ATCGTGA, ATCGTGT, ATAGTGT for Species A, Species B and Species C]

What are we trying to do?
Inferred history (“find most probable history given the observations”)
Note! Normally we do not explicitly infer the ancestral states.
[Figure: inferred tree along a relative time axis, with the observed sequences ATCGTGA, ATCGTGT, ATAGTGT for Species A, Species B and Species C]

Terminology

Topology and branch lengths
• A tree has information in its
  • topology (the order of branch splits)
  • branch lengths
Typically, the topology reflects relatedness (more on this later), and the branch lengths represent the amount of change.
[Figure: phylogram of taxa A, B, C, D, annotated “Little change per time unit / slow evolution” and “A lot of change per time unit / fast evolution”]

Cladogram
Phylogram - A phylogenetic tree with branch lengths relative to the amount of change
Cladogram - A phylogenetic tree with uninformative branch lengths
[Figure: a phylogram and a cladogram of the same tree, taxa A, B, C, D]

Collect and organize your data

Collect your data (homologs)
Phylogenetic analysis can only be performed on characters that are related by descent, i.e. they share a common history (within a reasonable timeframe). Sequences related by descent are called ‘homologs’.

Organize your data (alignment)

Alignment consequences
Multiple alignment => much stronger assumption about homology
Not just the genes are assumed to be homologous (in some fluffy sense), but each column in the alignment is assumed to contain only homologous characters. In effect, we assume each column to carry a signal of the same underlying gene tree.
[Figure label: Ancestral state]
ATCCCTTCTATTTGA
ATCCGCTCTATATGG
ATCCGCTCTATATGA
ATGCCTA-TCCTAGA
ATGTCTA-TCCTTGA
Look at your alignment! If the alignment does not make sense, your tree won’t make sense either.

Evolutionary models

Naïve distances (p-distance)
[Figure: plot of observed differences (p-distance) against actual changes]
[Figure: one site changing repeatedly along a time axis: ATCGTGTG, ATCCTGTG, ATCATGTG, ATCGTGTG, ATCATGTG]
Actual change: 3
Observed diff: 1
This effect is due to multiple substitutions at the same site!
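The gap between actual and observed change can be illustrated with a small sketch. The substitution history below is a hypothetical toy example (it is not the history drawn on the slide), and the helper names are made up for illustration: three substitutions hit the same site, yet only one difference is visible when the start and end sequences are compared.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def apply_substitutions(seq, substitutions):
    """Apply (position, new_base) substitutions in order; positions are 0-based."""
    bases = list(seq)
    for position, new_base in substitutions:
        bases[position] = new_base
    return "".join(bases)


# Hypothetical history: the same site changes three times (G -> C -> A -> T).
ancestor = "ATCGTGTG"
history = [(3, "C"), (3, "A"), (3, "T")]
descendant = apply_substitutions(ancestor, history)

actual_changes = len(history)
observed_differences = sum(1 for a, b in zip(ancestor, descendant) if a != b)
naive_distance = p_distance(ancestor, descendant)
```

Here `actual_changes` is 3 while `observed_differences` is 1, which is why the p-distance underestimates the true amount of change.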

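Example: Jukes-Cantor model
[Figure: plot of observed differences (p-distance) against actual changes, and a substitution diagram between the nucleotides A, G, C, T]
$$P(t) = e^{Qt}$$
$$P_{ij}(t) = \begin{cases} \frac{1}{4} - \frac{1}{4}e^{-4\alpha t}, & \text{if } i \neq j \\ \frac{1}{4} + \frac{3}{4}e^{-4\alpha t}, & \text{if } i = j \end{cases}$$
Here α is the rate of each specific substitution (e.g. A to G) and t is time.
• Jukes-Cantor is the simplest model in a class of models called time-reversible models for DNA (1 parameter)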
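As a minimal sketch of the Jukes-Cantor model, assuming the parametrization in which α is the rate of each specific substitution (so that the expected number of substitutions per site is 3αt), the transition probabilities and the resulting distance correction could be written as follows. The function names are illustrative, not part of any library.

```python
import math


def jc_transition_probability(same_base, alpha_t):
    """Jukes-Cantor probability of ending in the same or in one specific different base."""
    decay = math.exp(-4.0 * alpha_t)
    if same_base:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def jc_expected_p_distance(alpha_t):
    """Expected proportion of differing sites: three possible 'different' bases."""
    return 3.0 * jc_transition_probability(False, alpha_t)


def jc_corrected_distance(p):
    """Expected substitutions per site (3*alpha*t) implied by an observed p-distance."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p-distance must be in [0, 0.75) for the correction to exist")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For small αt the corrected distance is close to the p-distance; as αt grows, the expected p-distance saturates at 0.75 while the corrected distance keeps increasing.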

Transitions and transversions
So, equal substitution rates for different nucleotide substitution types seem to be a bad assumption!
‘GTR’ is the most complex time-reversible model (6 parameters)
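To make the two substitution types concrete, here is a small sketch that classifies the differences between two aligned sequences as transitions (purine to purine, or pyrimidine to pyrimidine) or transversions (purine to pyrimidine or the reverse). The function name is hypothetical; the two sequences are the gorilla and orangutan fragments used in the one-branch example of this lecture.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_transitions_transversions(seq_a, seq_b):
    """Return (transitions, transversions) between two aligned DNA sequences."""
    transitions = 0
    transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical site, gap, or ambiguous character
        both_purines = a in PURINES and b in PURINES
        both_pyrimidines = a in PYRIMIDINES and b in PYRIMIDINES
        if both_purines or both_pyrimidines:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


gorilla = "GAAGTCCTTGAGAAATAAACTGCACACTGG"
orangutan = "GGACTCCTTGAGAAATAAACTGCACACTGG"
ts_tv = count_transitions_transversions(gorilla, orangutan)
```

With only two differences this fragment says nothing about relative rates; on real data sets the ratio of the two counts is one quick indication of whether a single-rate model is adequate.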

More complex models
# parameters                                nt     aa
Base/aa frequencies                         3      19
Substitution rates                          6      210*
Rate heterogeneity among sites (gamma)      1      1
Fraction of invariant sites                 (1)    (1)
Tree topology                               (1)    (1)
* Empirical values, not estimated for each dataset
Special cases
• Non-reversible substitution rates
• Rate heterogeneity among branches
• Time-constrained trees (fossil data)
• Clock-like trees

Methods (using one of the previous models)

Algorithmic: Neighbor-Joining (NJ)
Use pairwise distances (based on your model of choice).
Use an iterative approach to build one tree (no such thing as a “second-best tree”).
+ Very fast
+ Surprisingly accurate
- Fails in complicated cases
- No possibility to compare to alternative trees
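One iteration of Neighbor-Joining can be sketched as follows. This shows only the pair-selection step using the standard Q criterion, not the full algorithm (which would also compute branch lengths, merge the pair and repeat). The five-taxon distance matrix is a made-up toy example and the function names are illustrative.

```python
from itertools import combinations


def nj_q_values(names, dist):
    """Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k) for every pair."""
    n = len(names)
    totals = {
        a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names
    }
    return {
        (a, b): (n - 2) * dist[frozenset((a, b))] - totals[a] - totals[b]
        for a, b in combinations(names, 2)
    }


def nj_pick_pair(names, dist):
    """Pair of taxa that Neighbor-Joining would join first (lowest Q value)."""
    q = nj_q_values(names, dist)
    return min(q, key=q.get)


# Toy pairwise distances (assumed values, for illustration only).
names = ["a", "b", "c", "d", "e"]
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {frozenset(pair): value for pair, value in raw.items()}

q_values = nj_q_values(names, dist)
first_pair = nj_pick_pair(names, dist)
closest_pair = min(raw, key=raw.get)
```

In this toy matrix the pair with the smallest raw distance is (d, e), but the Q criterion selects (a, b) first, because it also accounts for how far each taxon is from all the others.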

Methods based on criteria Calculate a score for all (or at least many) possible trees, and pick the one with the best score.

Maximum likelihood criterion
• Given two trees, the one with the higher likelihood, i.e. the one with the higher conditional probability of observing the data, is the better
• Site likelihood is the conditional probability of the data at one site (one character) given the assumed model of evolution and parameters of the model
• Data set likelihood is the product of the site likelihoods (character independence)
• Likelihood values under different models are comparable, thus giving us a way to test the adequacy of the model
• The model consists of
  • A substitution model, e.g. Jukes-Cantor
  • A tree with branch lengths

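Likelihood of a one-branch tree
Taxon1 AC
Taxon2 CC
[Figure: the two taxa joined by a single branch of length αt]
For Jukes-Cantor!
$$L_1 = \frac{1}{4}\left(\frac{1}{4} - \frac{1}{4}e^{-4\alpha t}\right) \quad \text{(site 1: A vs C)}$$
$$L_2 = \frac{1}{4}\left(\frac{1}{4} + \frac{3}{4}e^{-4\alpha t}\right) \quad \text{(site 2: C vs C)}$$
$$L_{tot} = L_1 \cdot L_2, \quad \text{or} \quad \log L_{tot} = \log L_1 + \log L_2$$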

A one-branch tree
30 nucleotides from ψη-globin genes of two primates on a one-edge tree
           * *
Gorilla    GAAGTCCTTGAGAAATAAACTGCACACTGG
Orangutan  GGACTCCTTGAGAAATAAACTGCACACTGG
There are two differences and 28 similarities.
[Figure: lnL plotted against αt, with the maximum at αt = 0.02327, lnL = -51.133956]
Possible (and quick) to optimize parameters for a given tree.
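The numbers on this slide can be reproduced with a short sketch. It assumes the Jukes-Cantor model with equal base frequencies (1/4) and α as the rate of each specific substitution; the function names are illustrative. For two sequences the maximum-likelihood estimate of αt has a closed form, so no numerical optimizer is needed here.

```python
import math

GORILLA = "GAAGTCCTTGAGAAATAAACTGCACACTGG"
ORANGUTAN = "GGACTCCTTGAGAAATAAACTGCACACTGG"


def jc_log_likelihood(seq_a, seq_b, alpha_t):
    """Log likelihood of two aligned sequences on a one-branch tree under Jukes-Cantor."""
    decay = math.exp(-4.0 * alpha_t)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    log_likelihood = 0.0
    for a, b in zip(seq_a, seq_b):
        # 1/4 for the base at one end, times the transition probability along the branch
        log_likelihood += math.log(0.25 * (p_same if a == b else p_diff))
    return log_likelihood


def jc_ml_alpha_t(seq_a, seq_b):
    """Closed-form maximum-likelihood estimate of alpha*t for a pair of sequences."""
    p = sum(1 for a, b in zip(seq_a, seq_b) if a != b) / len(seq_a)
    return -0.25 * math.log(1.0 - 4.0 * p / 3.0)


alpha_t_hat = jc_ml_alpha_t(GORILLA, ORANGUTAN)
lnl_hat = jc_log_likelihood(GORILLA, ORANGUTAN, alpha_t_hat)
```

Evaluating `jc_log_likelihood` over a range of αt values traces out the curve in the figure, with its peak at `alpha_t_hat`.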

Likelihood of a 4-taxon tree
• Bases at internal nodes are unknown (so sum over all possible states!)
[Figure: 4-taxon tree with tips A, C, A, T, internal nodes u and v, and edges e1 to e5]
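Summing over the unknown internal states can be sketched for a single site as below. The sketch assumes the Jukes-Cantor model and a particular reading of the figure: node u carries the tips on edges e1 and e2, node v carries the tips on edges e3 and e4, and e5 joins u and v. That edge assignment, the branch lengths and the function names are assumptions made for illustration.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(start, end, length):
    """Jukes-Cantor transition probability along a branch of length alpha*t."""
    decay = math.exp(-4.0 * length)
    return 0.25 + 0.75 * decay if start == end else 0.25 - 0.25 * decay


def site_likelihood(tips, edges):
    """Likelihood of one site on a 4-taxon tree, summing over internal states u and v."""
    tip1, tip2, tip3, tip4 = tips
    e1, e2, e3, e4, e5 = edges
    total = 0.0
    for u, v in product(BASES, repeat=2):
        total += (
            0.25                      # prior probability of state u
            * jc_prob(u, tip1, e1)
            * jc_prob(u, tip2, e2)
            * jc_prob(u, v, e5)       # internal edge between u and v
            * jc_prob(v, tip3, e3)
            * jc_prob(v, tip4, e4)
        )
    return total


# Assumed branch lengths; tips (A, A) attach to u and (C, T) attach to v.
example_edges = (0.1, 0.1, 0.1, 0.1, 0.1)
example_likelihood = site_likelihood(("A", "A", "C", "T"), example_edges)
```

With 4 possible states at each of the 2 internal nodes there are only 16 terms here; for larger trees the same sum is normally organized recursively from the tips inward rather than enumerated directly.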

Comparing trees
Calculating the likelihood for a given tree is (pretty) fast. So, all we need to do now is to compute the likelihood for all possible trees, and pick the best one! Easy, but...

...it’s a forest out there!
• Number of (rooted) trees for n terminals is (2n-3)·(2n-5)·(2n-7)…3·1
• 3 taxa -> 3 trees
• 4 taxa -> 15 trees
• 10 taxa -> 34 459 425 trees
• 25 taxa -> 1.19·10^30 trees
• 52 taxa -> 2.75·10^80 trees
• Finding the optimal tree is an NP-complete or NP-hard problem
Search strategies
• Exact
  • Will find the best tree (according to the criterion)
  • Exhaustive: up to ca 10 taxa
• Heuristic
  • Limits the search to a “reasonable” set of trees. May not find the optimal tree
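The growth in the number of rooted trees follows directly from the product formula above. A minimal sketch (the function name is illustrative):

```python
def count_rooted_trees(n_taxa):
    """Number of rooted, bifurcating, labeled trees: (2n-3)*(2n-5)*...*3*1."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    count = 1
    for factor in range(3, 2 * n_taxa - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= factor
    return count


tree_counts = {n: count_rooted_trees(n) for n in (3, 4, 10, 25, 52)}
```

Python integers have arbitrary precision, so the exact counts are available even for the largest cases; formatting them with `f"{tree_counts[25]:.2e}"` gives the scientific notation used on the slide.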

Star decomposition
[Figure: star decomposition for five taxa (A to E), starting from an unresolved star tree and joining taxa step by step]

Stepwise addition
[Figure: stepwise addition of taxa A to E; the candidate trees at each step are shown with their scores (831, 837, 783, 914, 921, 915, 916, 905)]

Branch swapping
• SPR (subtree pruning and regrafting)
• TBR (tree bisection and reconnection)
[Figure: example rearrangements of a tree with taxa A to I]

Trapped in local optimum?

Reliability

4. Evaluate reliability (bootstrap)
Idea: A reliable phylogeny can also be inferred from a subset of the data.
Therefore: Try estimating the phylogeny from parts of the data. Which subtrees are persistent?
Definition: A replicate of a multiple alignment is obtained by sampling columns with replacement.
Original alignment   1st pseudo-replicate dataset
ACGTACG              GCAGATA
AC--ACC              -CA-A-A
ACG-AGG              GGAGA-A
GTGTAAG              GAAGGTA
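Column sampling with replacement can be sketched in a few lines. The function names are illustrative. The column indices in `slide_columns` (0-based) were worked out by comparing the pseudo-replicate on the slide with the original alignment; the random replicate uses an arbitrary seed.

```python
import random


def resample_columns(alignment, column_indices):
    """Build a new alignment from the given columns (0-based, repeats allowed)."""
    return ["".join(row[i] for i in column_indices) for row in alignment]


def bootstrap_replicate(alignment, rng):
    """Draw as many columns as the alignment has, randomly and with replacement."""
    n_columns = len(alignment[0])
    indices = [rng.randrange(n_columns) for _ in range(n_columns)]
    return resample_columns(alignment, indices)


alignment = ["ACGTACG", "AC--ACC", "ACG-AGG", "GTGTAAG"]

# Columns that reproduce the pseudo-replicate shown on the slide.
slide_columns = [2, 5, 4, 2, 0, 3, 4]
slide_replicate = resample_columns(alignment, slide_columns)

# A fresh random replicate (seed chosen arbitrarily).
random_replicate = bootstrap_replicate(alignment, random.Random(1))
```

Because sampling is with replacement, some columns appear several times in a replicate (columns 2 and 4 above) while others are left out entirely (columns 1 and 6).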

Bootstrap
• Original data set with n characters. Original analysis, e.g. MP, ML, NJ.
• Draw n characters randomly with replacement. Repeat m times, giving m pseudo-replicates, each with n characters.
• Repeat the original analysis on each of the pseudo-replicate data sets.
• Evaluate the results from the m analyses.
[Figure: trees of the taxa Aus, Beus, Ceus and Deus inferred from the original data set and from the pseudo-replicates; the summary tree carries a support value of 75%]
• Bootstrap proportions between 0.5 and 1 can be interpreted as a measure of confidence or support
• Rule of thumb: 80% support is “good”
• Bootstrap support values are not probabilities!
• Values below 0.5 are nonsense
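The final evaluation step amounts to counting how often each group of taxa appears among the replicate trees. The sketch below represents every tree simply as a set of clades; the four replicate trees are hypothetical data chosen so that one clade reaches 75%, and the function name is illustrative.

```python
from collections import Counter


def clade_support(replicate_trees):
    """Fraction of replicate trees in which each clade (set of taxa) occurs."""
    counts = Counter()
    for clades in replicate_trees:
        counts.update(set(clades))  # count each clade at most once per tree
    n_trees = len(replicate_trees)
    return {clade: count / n_trees for clade, count in counts.items()}


def clade(*taxa):
    return frozenset(taxa)


# Hypothetical results from m = 4 pseudo-replicates, each tree given as its clades.
replicate_trees = [
    {clade("Aus", "Ceus"), clade("Aus", "Beus", "Ceus")},
    {clade("Aus", "Ceus"), clade("Beus", "Deus")},
    {clade("Aus", "Ceus"), clade("Aus", "Beus", "Ceus")},
    {clade("Aus", "Beus"), clade("Aus", "Beus", "Ceus")},
]

support = clade_support(replicate_trees)
```

In practice m is usually in the hundreds or more, and the support values are drawn on the branches of the tree inferred from the original data.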

Precision and accuracy
• Low precision, high accuracy => low bootstrap value
• High precision, low accuracy => high bootstrap value
[Figure: illustration of precision versus accuracy]
