Fast Algorithms for Minimum Evolution

1 / 39

Fast Algorithms for Minimum Evolution - PowerPoint PPT Presentation

Fast Algorithms for Minimum Evolution. Richard Desper, NCBI Olivier Gascuel, LIRMM. Overview. Statement of phylogeny reconstruction problem and various approaches to solving it. Tree length formula as a function of average distances. Greedy algorithms for tree building and tree swapping.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

PowerPoint Slideshow about 'Fast Algorithms for Minimum Evolution' - maribeth

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Fast Algorithms for Minimum Evolution

Richard Desper, NCBI

Olivier Gascuel, LIRMM

Overview
• Statement of phylogeny reconstruction problem and various approaches to solving it.
• Tree length formula as a function of average distances.
• Greedy algorithms for tree building and tree swapping.
• Simulation results.
• A few extras regarding consistency and branch lengths.
Phylogeny Reconstruction
• General problem: reconstruct the evolutionary history for a set L of extant species.
• Input: multiple sequence alignment for L or matrix of estimates of pairwise evolutionary distances.
• Output: weighted phylogeny representing history of L and common ancestors.
Methods
• Likelihood methods: model-based likelihood maximization.
• Parsimony methods: minimize total number of mutations in tree.
• Distance methods: fit tree structure to inferred evolutionary distances. Leading methods include Felsenstein-Fitch-Margoliash weighted least-squares and Neighbor-Joining and its variants.
Felsenstein-Fitch-Margoliash Least-squares Method
• FITCH searches the space of topologies by iteratively adding leaves and by tree swapping.
• Edge weights and topology are chosen to minimize the sum of squares (D is the input metric, DT is the induced tree metric):

If sij= 1 for alliandj, this is called the ordinary least-squaresmethod.

Minimum Evolution
• Developed by Rzhetsky and Nei (1992) as a modification of the OLS method
• For each topology T,
• Define function lassigning OLS lengths to edges of T
• Define size of tree
• Choose Tminimizing l(T )

For

Recursive Definition of DA|B

If A = {a}, B = {b}, DA|B = Dab,

All average distances for all pairs of non-intersecting subtrees of a

given topology can be calculated inO(n2)time.

A

e

i

B

External OLS Edge Length Function

If e is the edge connecting the leaf i to the

subtrees A and B,

A

C

e

D

B

Internal OLS Edge Length Function

The length of the edge eis (Vach, 1988)

where

B

A

e

D

C

Tree length formula

Lemma: with T as to the right,

let denote the root of subtree X,

and the edge to X for

Then,

With T as in prior slide, Tree Length Formula

Using lemma and branch length formula for l(e),

General approach
• To search the space of topologies, we’ll keep in memory two data structures:
• Sizes of each subtree of given topology
• Matrix of average distances DX|Y for X,Y disjoint subtrees in given topology
• As we move from one topology to another, we’ll update the matrix, but only as much as needed, in an efficient manner.

B

A

C

A

e

e

D

B

D

C

Tree Swapping by NNI

NNI swapping is a basic step in topology building and searching

With T as in prior slide, Tree Length Formula

Using lemma and branch length formula for l(e),

Tree Length after NNI

Given TgT’ the tree swap in prior slide, lthe edge length function:

(1)

wherel and l’ are constants depending on

the topologies.

OLS: FASTNNI
• Pre-compute average distances between non-intersecting sub-trees. (O(n2)computations)
• Loop over all internal edges, select the best swap using Equation (1). (O(n))
• If no swap improves length of the tree, stop and return the tree, else perform the best swap and update the matrix of average distances and repeat Step 2. (O(n)per swap; there is only one new split.)

Thus, if we require p swaps, the total complexity of

FASTNNI is O(n2 + pn).

Balanced Minimum Evolution
• Gascuel (2000) observed that the OLS/ME method was weaker than NJ in approximating the correct topology.
• Pauplin (2000) to simplify tree length computation proposed to use a “balanced” version of Minimum Evolution, weighting each sub-tree equally when calculating averages: if A and B are sub-trees of T, with
BNNI
• Calculate balanced averages of all pairs of sub-trees. (O(n2))
• Calculate improvement for each swap using

(2)

• If no tree swap improves length of the tree, stop and return tree, else update matrix of average distances and repeat Step 2. (O(ndiam(T)) per swap)

The average complexity, when performing p swaps, is

O(n2 + pn diam(T)).

y

If we perform the B-C tree swap, then we must recalculate

Typical values for diam(T):

Yule-Harding distribution:

Uniform distribution:

Updating Subtree Averages

T

x

X

A

C

e

Y

B

D

Q: How many recalculations?

(Hint: you can count (x,y) pairs).

A: O(n diam(T))

Building trees from scratch

We have NNI algorithms for OLS and balanced branch lengths. But what if we have no initial topology for NNIs?

OLS: Greedy Minimum Evolution
• For k=4 to n,
• Calculate Dk|A for each subtree A in Tk-1
• Express cost of inserting k along edgeeas f(e).

(Use Equation (3) on the next slide.)

• Choose e minimizing f. Insert k along eto form Tk.
• Update matrix of average distances between every pair of 2-distant subtrees.

GME runs in O(n2)running time

T’

C

C

T

k

k

A

A

B

B

Greedy Minimum Evolution

We use a variant of Equation (1), where D = {k}.Let L = l(T).

Then

Balanced Minimum Evolution

Same as GME,except:

• (modifications)
• Calculate balanced average distances instead of ordinary average distances
• Use l = ½ to find weights for insertion points
• Must keep average distances for all pairs of sub-trees.

BME runs in O(n2 diam(T)) running time.

Simulations
• Created 24- and 96-taxon trees, 2000 per each size, Yule-Harding process (g molecular clock).
• Edge lengths multiplied by (1.0 + mX), where X is exponentially distributed.
• Generated trees with three rates of evolution
• SeqGen used to generate sequences for each tree and rate (12,000 in all)
• DNADIST used to calculate distance matrices
Results: topological distances

BNNI improved

all input trees

Results: topological distances

This improvement

is large with fast rates

and high numbers of taxa

Results: topological distances

NNI trees are close to the

best possible for BME

Results: topological distances

The quality of the

NNI tree is (mostly)

independent

of starting point

Computational Times

in (MM:SS)

Computations done on Sun Enterprise E4500/E5500 running Solaris 8

on 10 400-Mhz processors with 7 Gb memory.

Average number of NNIs

We see that the average number of NNIs is

considerably lower than the number of taxa.

BME = WLS

Why does the balanced approach work so well?

• Pauplin’s formula for the length of a tree is
• BME is a weighted least squares approach with

Where pT(i,j) is the length of the (i,j) path in T.

Distantly related taxa see their importance

decrease exponentially.

Bonus features
• BME is a consistent method. As observed distances converge to true distances, the true topology becomes the minimum evolution tree.
• The BNNI tree has no negative branch lengths. A negative value to the branch length function implies a NNI leading to a smaller tree.
Consistency of Balanced ME
• Theorem: SupposeS is a weighted tree, andTis a treetopologyincompatible withS. Let T be the tree of topology T with weights determined by the balanced scheme. Then

l(T) > l(S).

• Lemma: it suffices to prove the case when S is a split metric.
Balanced ME consistency
• Basic idea: let l be the tree length function on the space of topologies. We find a sequence of topologies, T=T0, T1, ... Tk=S such that
• Each Ti+1 can be reached from Ti via one of two simple topological transformations
• l(Ti) > l(Ti+1) for all i.

Proof structure modeled after OLS/ME proof (Rzhetsky and Nei, 1993).

D

D

A

A

C

B

B

C

Type I transformation

Color the leaves black or white according to the split metric S. A

Type I transformation uses a NNI to form a larger monochromatic

cluster

This transformation reduces the size of the tree under l

A1

A1

B1

A2

C

C

A2

B1

B2

B2

Type II transformation

A Type II transformation uses two NNIs to form two monochromatic

subtrees

This transformation also reduces the value of the size of the tree

under l

Positive Branch Lengths after BNNI

Recall that the length of an edge is described by

B

D

A

e

C

We do not perform the switch because

i.e.

Thus

Similarly,

Conclusions
• BME + BNNI runs in O((n2 + pn)diam(T)), outputs trees comparable to (better than) FITCH, Weighbor, BioNJ, or NJ.
• FastME is faster than NJ or its variants.
• BNNI consistently improved output trees in all settings, even when WLS/Fitch trees were input.
• BNNI outputs tree without negative branch lengths.
• FASTME software available at http://www.ncbi.nlm.nih.gov/CBBResearch/Desper/FastME.html or http://www.lirmm.fr/~w3ifa/MAAS/.