Class 9: Phylogenetic Trees - PowerPoint PPT Presentation



PowerPoint Slideshow about 'Class 9: Phylogenetic Trees' - Donna


Presentation Transcript

The Tree of Life

D’après Ernst Haeckel, 1891


Evolution

  • Many theories of evolution

  • Basic idea:

    • speciation events lead to creation of different species

    • Speciation caused by physical separation into groups where different genetic variants become dominant

  • Any two species share a (possibly distant) common ancestor


Phylogenies

  • A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species

  • Leaves - current-day species

  • Nodes - hypothetical most recent common ancestors

  • Edge lengths - “time” from one speciation to the next

[Figure: example phylogeny with leaves Aardvark, Bison, Chimp, Dog, Elephant]


Phylogenetic Tree

[Figure: tree with a branch, an internal node, and a leaf labeled]

  • Topology: bifurcating

    • Leaves - 1…N

    • Internal nodes N+1…2N-2


Example: Primate evolution

[Figure: primate phylogeny with divergence times of 20-25 mya, 35-37 mya, and 40-45 mya]


How to construct a Phylogeny?

  • Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria)

  • Since then, focus on objective criteria for constructing phylogenetic trees

    • Thousands of articles in the last decades

  • Important for many aspects of biology

    • Classification (systematics)

    • Understanding biological mechanisms


Morphological vs. Molecular

  • Classical phylogenetic analysis: morphological features

    • number of legs, lengths of legs, etc.

  • Modern biological methods allow the use of molecular features

    • Gene sequences

    • Protein sequences

  • Analysis based on homologous sequences (e.g., globins) in different species


Dangers in Molecular Phylogenies

  • We have to remember that gene/protein sequences can be homologous for different reasons:

  • Orthologs -- sequences diverged after a speciation event

  • Paralogs -- sequences diverged after a duplication event

  • Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)


Dangers of Paralogues

[Figure: gene tree with one gene duplication and several speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B]


Dangers of Paralogs

  • If we only consider 1A, 2B, and 3A...

[Figure: gene tree with one gene duplication and several speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B]


Types of Trees

  • A natural model to consider is that of rooted trees

[Figure: rooted tree with the root labeled “Common Ancestor”]


Types of Trees

  • Depending on the model, data from current-day species do not distinguish between different placements of the root

[Figure: two rooted trees compared side by side]


Types of trees

  • An unrooted tree represents the same phylogeny without the root node


Positioning Roots in Unrooted Trees

  • We can estimate the position of the root by introducing an outgroup:

    • a set of species that are definitely distant from all the species of interest

[Figure: tree with leaves Falcon, Aardvark, Bison, Chimp, Dog, Elephant, with the proposed root marked]


Types of Data

  • Distance-based

    • Input is a matrix of distances between species

    • A distance can be the fraction of residues on which two sequences disagree, or the negative of their alignment score, or …

  • Character-based

    • Examine each character (e.g., residue) separately
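As a small illustration of the distance-based input, the sketch below computes the "fraction of residues they disagree on" for a pair of aligned sequences and assembles a distance matrix. The function names `p_distance` and `distance_matrix` are hypothetical, and the sequences are borrowed from the parsimony example later in the deck purely as sample data.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return mismatches / len(seq_a)


def distance_matrix(seqs):
    """Map every ordered pair of names to the p-distance of their sequences."""
    names = sorted(seqs)
    return {(a, b): p_distance(seqs[a], seqs[b]) for a in names for b in names}


# Sample data only: three of the aligned sequences used in the parsimony slides.
SEQS = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA"}
MATRIX = distance_matrix(SEQS)
print(MATRIX[("A", "B")], MATRIX[("A", "C")])
```

A character-based method would instead keep the columns of the alignment and examine them one at a time.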


Simple Distance-Based Method

Input: distance matrix between species

Outline:

  • Cluster species together

  • Initially clusters are singletons

  • At each iteration combine two “closest” clusters to get a new one


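UPGMA Clustering

  • Let Ci and Cj be clusters, and define the distance between them to be the average distance between their members:

    $d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$

  • When combining two clusters, Ci and Cj, to form a new cluster Ck, the distance from Ck to any other cluster Cl is

    $d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$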
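The clustering loop can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes distances are supplied as a dict keyed by `frozenset({a, b})`, represents the tree as nested tuples, and places each merged node at half the distance between the two clusters it joins. The name `upgma` and the data layout are choices made for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves by average linkage.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns (tree, height): a nested tuple of labels and the height of its root.
    """
    d = dict(dist)
    # Each cluster label maps to (number of leaves, height of its node).
    clusters = {name: (1, 0.0) for name in names}
    while len(clusters) > 1:
        # Pick the two closest clusters.
        ci, cj = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        ni, _ = clusters.pop(ci)
        nj, _ = clusters.pop(cj)
        ck = (ci, cj)
        height = d[frozenset((ci, cj))] / 2
        # Size-weighted average gives the distance from the new cluster to the rest.
        for cl in clusters:
            d[frozenset((ck, cl))] = (
                ni * d[frozenset((ci, cl))] + nj * d[frozenset((cj, cl))]
            ) / (ni + nj)
        clusters[ck] = (ni + nj, height)
    (tree, (_, height)), = clusters.items()
    return tree, height


DIST = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
}
TREE, HEIGHT = upgma(["A", "B", "C"], DIST)
print(TREE, HEIGHT)
```

In this toy input A and B are merged first, and C joins them at the root.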


Molecular Clock

  • UPGMA implicitly assumes that all distances measure time in the same way

[Figure: two trees over leaves 1, 2, 3, 4]


Additivity

  • A weaker requirement is additivity

    • In the “real” tree, the distance between two species is the sum of the edge lengths along the path between them

[Figure: leaves i, j, k joined at a single internal node by edges of lengths a, b, c respectively]


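Consequences of Additivity

  • Suppose input distances are additive

  • For any three leaves i, j, k that meet at an internal node m

    $d(i,j) = a + b, \quad d(i,k) = a + c, \quad d(j,k) = b + c$

  • Thus

    $d(k,m) = c = \tfrac{1}{2}\,\big(d(i,k) + d(j,k) - d(i,j)\big)$

[Figure: leaves i, j, k joined at internal node m by edges of lengths a, b, c respectively]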


Neighbor Joining

  • Can we use this fact to construct trees?

  • Let

    $D(i,j) = d(i,j) - (r_i + r_j)$

    where L is the set of leaves and

    $r_i = \frac{1}{|L| - 2} \sum_{k \in L} d(i,k)$

    Theorem: if D(i,j) is minimal (among all pairs of leaves), then i and j are neighbors in the tree


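Neighbor Joining

  • Set L to contain all leaves

    Iteration:

  • Choose i,j such that D(i,j) is minimal

  • Create new node k, and set

    $d(k,m) = \tfrac{1}{2}\,\big(d(i,m) + d(j,m) - d(i,j)\big)$ for every other node m in L

    $d(i,k) = \tfrac{1}{2}\,\big(d(i,j) + r_i - r_j\big), \quad d(j,k) = d(i,j) - d(i,k)$, with r_i as in the definition of D(i,j)

  • Remove i,j from L, and add k

    Terminate: when |L| = 2, connect the two remaining nodes

[Figure: nodes i and j joined to the new node k, with another node m elsewhere in the tree]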
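The iteration above can be sketched in a few lines. This assumes the standard neighbor-joining quantities D(i,j) = d(i,j) - (r_i + r_j) with r_i the sum of distances from i divided by |L| - 2; the function name, the `frozenset`-keyed distance dict, and the generated internal labels `node1`, `node2`, … are illustrative choices, and the labels are assumed not to clash with leaf names.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return a list of (node, node, branch length) edges.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[i]: summed distance from i to all other active nodes, scaled by 1/(|L|-2).
        r = {i: sum(d[frozenset((i, m))] for m in nodes if m != i) / (n - 2)
             for i in nodes}
        # Choose the pair minimising D(i,j) = d(i,j) - (r_i + r_j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        counter += 1
        k = f"node{counter}"  # hypothetical label for the new internal node
        d_ij = d[frozenset((i, j))]
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        nodes.remove(i)
        nodes.remove(j)
        for m in nodes:
            d[frozenset((k, m))] = 0.5 * (
                d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij
            )
        nodes.append(k)
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Additive toy distances from a tree with branches A:1, B:2, C:3, D:4 and an
# internal edge of length 1 separating {A, B} from {C, D}.
DIST = {frozenset(p): v for p, v in {
    ("A", "B"): 3.0, ("A", "C"): 5.0, ("A", "D"): 6.0,
    ("B", "C"): 6.0, ("B", "D"): 7.0, ("C", "D"): 7.0,
}.items()}
EDGES = neighbor_joining(["A", "B", "C", "D"], DIST)
for edge in EDGES:
    print(edge)
```

Because the toy distances are additive, the branch lengths of the generating tree are recovered exactly, even though the tree is not clock-like.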


Distance Based Methods

  • If we make strong assumptions on distances, we can reconstruct trees

  • In real life, distances are not additive

  • Sometimes they are close to additive


Character Based Methods

  • We start with a multiple alignment

  • Assumptions:

    • All sequences are homologous

    • Each position in alignment is homologous

    • Positions evolve independently

    • No gaps

  • We seek to explain the evolution of each position in the alignment


Parsimony

  • Character-based method

  • A way to score trees (but not to build trees!)

    Assumptions:

  • Independence of characters (no interactions)

  • Best tree is one where minimal changes take place


A Simple Example

  • What is the parsimony score of

A: CAGGTA

B: CAGACA

C: CGGGTA

D: TGCACT

E: TGCGTA

[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]


A Simple Example

A: CAGGTA

B: CAGACA

C: CGGGTA

D: TGCACT

E: TGCGTA

  • Each column is scored separately.

  • Let’s look at the first column:

  • Minimal tree has one evolutionary change:

[Figure: tree labeled with C and T at its leaves and internal nodes, with a single change between T and C]


Evaluating Parsimony Scores

  • How do we compute the Parsimony score for a given tree?

  • Traditional Parsimony

    • Each base change has a cost of 1

  • Weighted Parsimony

    • Each change is weighted by the score c(a,b)


Traditional Parsimony

  • Solved independently for each position

  • Linear time solution

[Figure: tree with leaves a, g, a; the internal nodes carry the candidate state sets {a,g} and {a}]
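The state sets in the figure suggest the set-based procedure usually attributed to Fitch: at each internal node take the intersection of the children's sets if it is non-empty, otherwise take the union and count one change. A minimal sketch for a single position follows; the nested-tuple tree encoding and the name `fitch` are assumptions made for this example.

```python
def fitch(tree, leaf_state):
    """Return (candidate state set, number of changes) for one position.

    tree:       a leaf name, or a 2-tuple of subtrees.
    leaf_state: dict mapping each leaf name to its observed character.
    """
    if not isinstance(tree, tuple):
        return {leaf_state[tree]}, 0
    left_set, left_cost = fitch(tree[0], leaf_state)
    right_set, right_cost = fitch(tree[1], leaf_state)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: keep all candidates and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Leaves a, g, a as in the figure: the lower node gets {a, g}, the root {a}.
STATES = {"x": "a", "y": "g", "z": "a"}
print(fitch(("x", "y"), STATES))
print(fitch((("x", "y"), "z"), STATES))
```

Each node is visited once per position, which is where the linear running time comes from.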


Evaluating Weighted Parsimony

Dynamic programming on the tree

S(i,a) = cost of tree rooted at i if i is labeled by a

Initialization:

  • For each leaf i set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞

    Iteration:

  • if k is a node with children i and j, then $S(k,a) = \min_b \big(S(i,b) + c(a,b)\big) + \min_b \big(S(j,b) + c(a,b)\big)$

    Termination:

  • cost of tree is $\min_a S(r,a)$, where r is the root
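The recurrence translates almost directly into a recursive function. The sketch below is one way to write it under assumed conventions: trees are nested 2-tuples of leaf names, `cost` is any function c(a,b), and the example cost (1 for a transition, 2 for a transversion) is an illustrative choice rather than something taken from the slides.

```python
import math

ALPHABET = "acgt"
PURINES = {"a", "g"}


def ts_tv_cost(a, b):
    """Illustrative c(a,b): 0 if equal, 1 for a transition, 2 for a transversion."""
    if a == b:
        return 0
    return 1 if (a in PURINES) == (b in PURINES) else 2


def sankoff(tree, leaf_state, cost):
    """Return {a: S(k,a)} for the node k at the top of `tree`."""
    if not isinstance(tree, tuple):
        # Initialization: 0 for the observed label, infinity otherwise.
        return {a: 0 if a == leaf_state[tree] else math.inf for a in ALPHABET}
    s_i = sankoff(tree[0], leaf_state, cost)
    s_j = sankoff(tree[1], leaf_state, cost)
    # Iteration: best label for each child, chosen independently.
    return {
        a: min(s_i[b] + cost(a, b) for b in ALPHABET)
        + min(s_j[b] + cost(a, b) for b in ALPHABET)
        for a in ALPHABET
    }


STATES = {"x": "a", "y": "g", "z": "c"}
ROOT_TABLE = sankoff((("x", "y"), "z"), STATES, ts_tv_cost)
print(ROOT_TABLE, min(ROOT_TABLE.values()))  # termination: min over root labels
```

With a unit cost (1 for any change) the same function returns the traditional parsimony score.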


Cost of Evaluating Parsimony

  • Score is evaluated on each position independently. Scores are then summed over all positions.

  • If there are n nodes, m characters, and k possible values for each character, then complexity is O(nmk) for traditional parsimony; the weighted dynamic program takes a minimum over k values for each of k labels at every node, giving O(nmk²)

  • By keeping traceback information, we can reconstruct most parsimonious values at each ancestor node


Maximum Parsimony

1 2 3 4 5 6 7 8 9 10

Species 1 - A G G G T A A C T G

Species 2 - A C G A T T A T T A

Species 3 - A T A A T T G T C T

Species 4 - A A T G T T G T C G

How many possible unrooted trees?
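For N leaves, the number of unrooted bifurcating topologies is the product of the odd numbers 3 · 5 · … · (2N - 5), which equals 3 for the four species here. A short sketch (the function name is illustrative) shows how quickly the count grows:

```python
def count_unrooted_trees(n_leaves):
    """Number of unrooted bifurcating tree topologies on n_leaves labeled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Product of the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

This growth is the reason exhaustive search over topologies is feasible only for small numbers of species.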


Maximum Parsimony

How many substitutions does each candidate tree require? The maximum parsimony (MP) tree is the one that requires the fewest.

Maximum Parsimony

Each column of the alignment is scored on each of the three candidate trees:

  • Column 1 (A, A, A, A): 0 substitutions on every tree

  • Column 2 (1 - G, 2 - C, 3 - T, 4 - A): 3 substitutions on every tree

  • Column 3 (G, G, A, T): 2 substitutions on every tree

  • Column 4 (1 - G, 2 - A, 3 - A, 4 - G): 2 substitutions on two of the trees, 1 on the third

Maximum Parsimony

Scores for columns 1 to 10, followed by the total, with one row per candidate tree:

0 3 2 2 0 1 1 1 1 3   14

0 3 2 2 0 1 2 1 2 3   16

0 3 2 1 0 1 2 1 2 3   15

Maximum Parsimony

The tree in the first row, with a total of 14 substitutions, is the most parsimonious.
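The per-column scores can be checked mechanically. The sketch below applies the set-based (Fitch-style) count to the four-species alignment on each of the three unrooted topologies; the topology labels and helper names are choices made for this example. For columns 1 to 9 it reproduces the three rows of scores in the order (1,2),(3,4), then (1,3),(2,4), then (1,4),(2,3). For column 10 as transcribed (G, A, T, G) it counts 2 changes on every tree rather than the listed 3, since only three distinct states occur, so either that column or its score may have been altered in transcription. The ranking of the three trees is unaffected.

```python
def fitch_changes(tree, column):
    """Return (state set, number of changes) for one alignment column.

    tree:   a 0-based species index, or a 2-tuple of subtrees.
    column: string holding one character per species.
    """
    if not isinstance(tree, tuple):
        return {column[tree]}, 0
    left_set, left_cost = fitch_changes(tree[0], column)
    right_set, right_cost = fitch_changes(tree[1], column)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


ALIGNMENT = ["AGGGTAACTG", "ACGATTATTA", "ATAATTGTCT", "AATGTTGTCG"]

# The three unrooted topologies on four species, rooted on the internal edge
# (the unweighted parsimony score does not depend on the root position).
TOPOLOGIES = {
    "(1,2),(3,4)": ((0, 1), (2, 3)),
    "(1,3),(2,4)": ((0, 2), (1, 3)),
    "(1,4),(2,3)": ((0, 3), (1, 2)),
}


def column_scores(tree, alignment):
    """Score every column of the alignment on one topology."""
    n_cols = len(alignment[0])
    columns = ["".join(seq[i] for seq in alignment) for i in range(n_cols)]
    return [fitch_changes(tree, col)[1] for col in columns]


SCORES = {name: column_scores(tree, ALIGNMENT) for name, tree in TOPOLOGIES.items()}
for name, scores in SCORES.items():
    print(name, scores, sum(scores))
```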



Searching for the Optimal Tree

  • Exhaustive Search

    • Very intensive

  • Branch and Bound

    • A compromise

  • Heuristic

    • Fast

    • Usually starts with NJ


Phylogenetic Tree Assumptions

  • Topology: bifurcating

    • Leaves - 1…N

    • Internal nodes N+1…2N-2

  • Lengths t = {t_i} for each branch

  • Phylogenetic tree = (Topology, Lengths) = (T,t)

[Figure: tree with a branch, an internal node, and a leaf labeled]


Probabilistic Methods

  • The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.

  • Background probabilities: q(a)

  • Mutation probabilities: P(a|b,t)

  • Models for evolutionary mutations

    • Jukes Cantor

    • Kimura 2-parameter model

  • Such models are used to derive the probabilities


Jukes Cantor model

  • A model for mutation rates

  • Mutation occurs at a constant rate

  • Each nucleotide is equally likely to mutate into any other nucleotide with rate α.


Kimura 2-parameter model

  • Allows a different rate for transitions and transversions.


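Mutation Probabilities

  • The rate matrix R is used to derive the mutation probability matrix S; over a short time ε,

    $S(\varepsilon) \approx I + R\,\varepsilon$

  • S is obtained by integration. For Jukes Cantor, with every off-diagonal entry of R equal to α:

    $P(a \mid a, t) = \tfrac{1}{4}\,(1 + 3e^{-4\alpha t}), \qquad P(b \mid a, t) = \tfrac{1}{4}\,(1 - e^{-4\alpha t}) \quad (b \neq a)$

  • q can be obtained by setting t to infinity, which gives q(a) = 1/4 for every nucleotide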
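A quick numerical check of the Jukes-Cantor probabilities, assuming the common parametrisation in which every off-diagonal entry of the rate matrix equals α, so that P(a|a,t) = (1 + 3e^(-4αt))/4 and P(b|a,t) = (1 - e^(-4αt))/4 for b ≠ a. The function name is illustrative.

```python
import math


def jukes_cantor(alpha, t):
    """Return (p_same, p_diff): P(a|a,t) and P(b|a,t) for any b != a."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay), 0.25 * (1.0 - decay)


for t in (0.0, 0.1, 1.0, 50.0):
    p_same, p_diff = jukes_cantor(1.0, t)
    # Each row of the probability matrix has one "same" and three "diff" entries.
    print(t, round(p_same, 4), round(p_diff, 4), p_same + 3 * p_diff)
```

At t = 0 nothing has changed, and as t grows all four outcomes approach 1/4, the background probability q.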


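Mutation Probabilities

  • Both models satisfy the following properties:

  • Lack of memory:

    $P(a \mid c, t + t') = \sum_b P(a \mid b, t)\, P(b \mid c, t')$

  • Reversibility:

    • There exist stationary probabilities {P_a} s.t.

      $P_a\, P(b \mid a, t) = P_b\, P(a \mid b, t)$

[Figure: diagram over the nucleotides A, C, G, T]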


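Probabilistic Approach

  • Given P,q, the tree topology and branch lengths, we can compute:

    $P(x_1, \dots, x_5 \mid T, t) = q(x_5)\, P(x_4 \mid x_5, t_4)\, P(x_3 \mid x_5, t_3)\, P(x_1 \mid x_4, t_1)\, P(x_2 \mid x_4, t_2)$

[Figure: tree with root x5; x5 has children x4 (branch t4) and x3 (branch t3); x4 has children x1 (branch t1) and x2 (branch t2)]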


Computing the Tree Likelihood

  • We are interested in the probability of observed data given tree and branch “lengths”; with leaves x1, x2, x3 and internal nodes x4, x5:

    $P(x_1, x_2, x_3 \mid T, t) = \sum_{x_4, x_5} P(x_1, \dots, x_5 \mid T, t)$

  • Computed by summing over internal nodes

  • This can be done efficiently using an upward traversal of the tree.


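Tree Likelihood Computation

  • Define P(Lk|a) = prob. of leaves below node k given that xk = a

  • Init: for leaves: P(Lk|a) = 1 if xk = a; 0 otherwise

  • Iteration: if k is a node with children i and j, then

    $P(L_k \mid a) = \sum_{b} P(b \mid a, t_i)\, P(L_i \mid b) \cdot \sum_{c} P(c \mid a, t_j)\, P(L_j \mid c)$

  • Termination: Likelihood is

    $P(x_1, \dots, x_N \mid T, t) = \sum_a q(a)\, P(L_{\mathrm{root}} \mid a)$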
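The upward pass can be sketched for a single alignment position as follows. The example assumes Jukes-Cantor mutation probabilities with α = 1 and uniform background q(a) = 1/4; the tree encoding (a leaf name, or a pair of `(subtree, branch_length)` entries) and all function names are choices made for this illustration.

```python
import math

ALPHABET = "ACGT"


def jc_prob(b, a, t, alpha=1.0):
    """P(b | a, t) under Jukes-Cantor (assumed model for this sketch)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay) if a == b else 0.25 * (1.0 - decay)


def conditional(tree, leaf_state):
    """Return {a: P(L_k | a)} for the node k at the top of `tree`.

    tree is a leaf name (str) or ((left_subtree, t_left), (right_subtree, t_right)).
    """
    if not isinstance(tree, tuple):
        return {a: 1.0 if a == leaf_state[tree] else 0.0 for a in ALPHABET}
    (left, t_i), (right, t_j) = tree
    p_i = conditional(left, leaf_state)
    p_j = conditional(right, leaf_state)
    return {
        a: sum(jc_prob(b, a, t_i) * p_i[b] for b in ALPHABET)
        * sum(jc_prob(c, a, t_j) * p_j[c] for c in ALPHABET)
        for a in ALPHABET
    }


def likelihood(tree, leaf_state, q=0.25):
    """Sum over root labels, weighted by the background probability q."""
    return sum(q * p for p in conditional(tree, leaf_state).values())


# Three leaves: x1 and x2 share a parent, which joins x3 at the root.
TREE = (((("x1", 0.1), ("x2", 0.2)), 0.4), ("x3", 0.3))
print(likelihood(TREE, {"x1": "A", "x2": "A", "x3": "G"}))
```

Summing this likelihood over all 4³ possible leaf assignments gives 1, which is a convenient sanity check on an implementation.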


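Maximum Likelihood (ML)

  • Score each tree by the log-likelihood of the alignment, summed over its M positions:

    $\log P(X \mid T, t) = \sum_{m=1}^{M} \log P(x^m_1, \dots, x^m_N \mid T, t)$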

    • Assumption of independent positions

  • Branch lengths t can be optimized

    • Gradient ascent

    • EM

  • We look for the highest scoring tree

    • Exhaustive search

    • Sampling methods (Metropolis)


Optimal Tree Search

  • Perform search over possible topologies

[Figure: candidate topologies T1, T2, T3, T4, …, Tn; within the parameter space of a topology, parametric optimization (EM) leads to local maxima]


Computational Problem

  • Such procedures are computationally expensive!

  • Computation of optimal parameters, per candidate, requires a non-trivial optimization step.

  • We spend non-negligible computation on a candidate, even if it is a low-scoring one.

  • In practice, such learning procedures can only consider small sets of candidate structures


Structural EM

Idea: Use parameters found for current topology to help evaluate new topologies.

Outline:

  • Perform search in (T, t) space.

  • Use EM-like iterations:

    • E-step: use current solution to compute expected sufficient statistics for all topologies

    • M-step: select new topology based on these expected sufficient statistics


The Complete-Data Scenario

Suppose we observe H, the ancestral sequences.

S_{i,j} is a matrix of the number of co-occurrences of each pair (a,b) in the taxa i,j

Define:

Find: topology T that maximizes F

F is a linear function of S_{i,j}


Expected Likelihood

  • Start with a tree (T0,t0)

  • Compute

    Formal justification:

  • Define:

    Theorem:

    Consequence: improvement in expected score ⇒ improvement in likelihood


Algorithm Outline

[Flowchart boxes: Original Tree (T0,t0); Compute; Weights]

Unlike standard EM for trees, we compute all possible pairwise statistics

Time: O(N²M)


Algorithm Outline

[Flowchart boxes: Compute; Weights; Find]

Pairwise weights

This stage also computes the branch length for each pair (i,j)


Algorithm Outline

[Flowchart boxes: Compute; Weights; Find; Construct bifurcation T1]

Max. Spanning Tree

Fast greedy procedure to find tree

By construction:

Q(T’,t’) ≥ Q(T0,t0)

Thus,

l(T’,t’) ≥ l(T0,t0)


Algorithm Outline

[Flowchart boxes: Compute; Weights; Find; Construct bifurcation T1]

Fix Tree

Remove redundant nodes

Add nodes to break large degree

This operation preserves likelihood

l(T1,t’) = l(T’,t’) ≥ l(T0,t0)


Assessing trees: the Bootstrap

  • Often we don’t trust the tree found as the “correct” one.

  • Bootstrapping:

    • Sample (with replacement) n positions from the alignment

    • Learn the best tree for each sample

    • Look for tree features which are frequent in all trees.

  • For some models this procedure approximates the tree posterior P(T| X1,…,Xn)
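The resampling step can be sketched as below. The tree-building method is deliberately left as a caller-supplied function, here called `build_tree` (a hypothetical parameter that must return something hashable, such as a canonical string for the topology); the sketch only shows how columns are resampled with replacement and how often each resulting tree appears.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping species name to an aligned sequence (equal lengths).
    rng:       a random.Random instance, passed in for reproducibility.
    """
    n = len(next(iter(alignment.values())))
    cols = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def bootstrap_support(alignment, build_tree, replicates=100, seed=0):
    """Fraction of bootstrap replicates in which each distinct tree is found."""
    rng = random.Random(seed)
    counts = Counter(
        build_tree(bootstrap_alignment(alignment, rng)) for _ in range(replicates)
    )
    return {tree: c / replicates for tree, c in counts.items()}


ALIGNMENT = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA"}
SAMPLE = bootstrap_alignment(ALIGNMENT, random.Random(1))
print(SAMPLE)
```

In practice one usually counts the frequency of individual tree features, such as splits, rather than of whole trees; that requires a tree representation beyond this sketch.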


Algorithm Outline

[Flowchart boxes: Compute; Weights; Find; Construct bifurcation T1; New Tree]

Thm: l(T1,t1) ≥ l(T0,t0)

These steps are then repeated until convergence