Class 9: Phylogenetic Trees

PowerPoint Slideshow about 'Class 9: Phylogenetic Trees' - Donna


Presentation Transcript
The Tree of Life

D’après Ernst Haeckel, 1891

Evolution
  • Many theories of evolution
  • Basic idea:
    • Speciation events lead to the creation of different species
    • Speciation is caused by physical separation into groups in which different genetic variants become dominant
  • Any two species share a (possibly distant) common ancestor
Phylogenies
  • A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species
  • Leaves - current-day species
  • Nodes - hypothetical most recent common ancestors
  • Edge lengths - “time” from one speciation to the next

[Figure: example phylogeny with leaves Aardvark, Bison, Chimp, Dog, Elephant]

Phylogenetic Tree
  • Topology: bifurcating
    • Leaves - 1…N
    • Internal nodes N+1…2N-2

[Figure: tree with a branch, an internal node, and a leaf labeled]
Example: Primate evolution

[Figure: primate phylogeny with divergence times marked at 20-25, 35-37, and 40-45 mya (million years ago)]

How to construct a Phylogeny?
  • Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria)
  • Since then, focus on objective criteria for constructing phylogenetic trees
    • Thousands of articles in the last decades
  • Important for many aspects of biology
    • Classification (systematics)
    • Understanding biological mechanisms
Morphological vs. Molecular
  • Classical phylogenetic analysis: morphological features
    • number of legs, lengths of legs, etc.
  • Modern biological methods allow the use of molecular features
    • Gene sequences
    • Protein sequences
  • Analysis based on homologous sequences (e.g., globins) in different species
Dangers in Molecular Phylogenies
  • We have to remember that gene/protein sequences can be homologous for different reasons:
  • Orthologs -- sequences diverged after a speciation event
  • Paralogs -- sequences diverged after a duplication event
  • Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)
Dangers of Paralogs

[Figure: gene tree marked with a gene duplication and speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B]

Dangers of Paralogs
  • If we only consider 1A, 2B, and 3A...

[Figure: the same gene tree (gene duplication, speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B), considering only 1A, 2B, and 3A]

Types of Trees
  • A natural model to consider is that of rooted trees

[Figure: rooted tree with the common ancestor at the root]

Types of Trees
  • Depending on the model, data from current day species does not distinguish between different placements of the root

[Figure: two trees that differ only in the placement of the root, shown side by side]

Types of Trees
  • An unrooted tree represents the same phylogeny without the root node
Positioning Roots in Unrooted Trees
  • We can estimate the position of the root by introducing an outgroup:
    • a set of species that are definitely distant from all the species of interest

[Figure: tree over Aardvark, Bison, Chimp, Dog, Elephant with Falcon added as the outgroup; the proposed root is marked]

Types of Data
  • Distance-based
    • Input is a matrix of distances between species
    • Can be the fraction of residues on which they disagree, or the negated alignment score between them, or …
  • Character-based
    • Examine each character (e.g., residue) separately
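
As an illustration of the distance-based input, here is a minimal sketch that builds a distance matrix from the fraction of mismatching residues. The function names `p_distance` and `distance_matrix` are made up for this example, and the sequences are assumed to be already aligned to equal length.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return mismatches / len(seq_a)


def distance_matrix(seqs: dict) -> dict:
    """Symmetric distance table keyed by (name, name) pairs."""
    dist = {(name, name): 0.0 for name in seqs}
    for a, b in combinations(seqs, 2):
        dist[a, b] = dist[b, a] = p_distance(seqs[a], seqs[b])
    return dist


# Three short aligned sequences used only to exercise the functions.
toy = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA"}
d = distance_matrix(toy)
print(d["A", "B"], d["A", "C"], d["B", "C"])
```

A character-based method would instead keep the columns of `toy` and examine each one separately.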
Simple Distance-Based Method

Input: distance matrix between species

Outline:

  • Cluster species together
  • Initially clusters are singletons
  • At each iteration combine two “closest” clusters to get a new one
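UPGMA Clustering
  • Let Ci and Cj be clusters, define distance between them to be $d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p,q)$
  • When combining two clusters, Ci and Cj, to form a new cluster Ck, then for every other cluster Cl, $d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$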
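
The clustering loop can be sketched as follows. This is a minimal version that assumes the usual UPGMA conventions: the distance between clusters is the average pairwise leaf distance, maintained with a size-weighted update, and each merged node is placed at height equal to half the distance between the clusters it joins. The function name `upgma` and the nested-tuple tree representation are choices made for this example.

```python
from itertools import combinations


def upgma(leaf_dist):
    """leaf_dist: dict {(a, b): distance} over unordered pairs of leaf names.

    Returns (tree, height): tree is a nested tuple of leaf names and
    height maps every leaf and merged cluster to its node height.
    """
    d = {frozenset(pair): v for pair, v in leaf_dist.items()}
    size = {leaf: 1 for pair in leaf_dist for leaf in pair}
    height = {leaf: 0.0 for leaf in size}
    while len(size) > 1:
        # Combine the two closest clusters.
        ci, cj = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        ck = (ci, cj)
        height[ck] = d[frozenset((ci, cj))] / 2
        # Size-weighted average keeps d equal to the mean leaf-to-leaf distance.
        for cl in size:
            if cl not in (ci, cj):
                d[frozenset((ck, cl))] = (
                    size[ci] * d[frozenset((ci, cl))]
                    + size[cj] * d[frozenset((cj, cl))]
                ) / (size[ci] + size[cj])
        size[ck] = size[ci] + size[cj]
        del size[ci], size[cj]
    (root,) = size
    return root, height


# Toy clock-like distances (made up for illustration).
toy = {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
       ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0}
tree, height = upgma(toy)
print(tree, height[tree])
```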
Molecular Clock
  • UPGMA implicitly assumes that all distances measure time in the same way

[Figure: two trees over leaves 1, 2, 3, 4]

Additivity
  • A weaker requirement is additivity
    • In “real” tree, distances between species are the sum of distances between intermediate nodes

[Figure: leaves i, j, k joined by branches labeled a, b, c]

Consequences of Additivity
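  • Suppose input distances are additive
  • For any three leaves i, j, k that meet at an internal node m: $d(i,j) = a + b$, $d(i,k) = a + c$, $d(j,k) = b + c$
  • Thus $c = d(k,m) = \tfrac{1}{2}\,(d(i,k) + d(j,k) - d(i,j))$

[Figure: leaves i, j, k joined at internal node m by branches of length a, b, c respectively]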

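Neighbor Joining
  • Can we use this fact to construct trees?
  • Let $D(i,j) = d(i,j) - (r_i + r_j)$, where $r_i = \frac{1}{|L|-2} \sum_{k \in L} d(i,k)$ and L is the current set of leaves

Theorem: if D(i,j) is minimal (among all pairs of leaves), then i and j are neighbors in the tree

Neighbor Joining
  • Set L to contain all leaves

Iteration:

  • Choose i,j such that D(i,j) is minimal
  • Create new node k, and set $d(k,m) = \tfrac{1}{2}\,(d(i,m) + d(j,m) - d(i,j))$ for every other m in L, with branch lengths $d(i,k) = \tfrac{1}{2}\,(d(i,j) + r_i - r_j)$ and $d(j,k) = d(i,j) - d(i,k)$
  • Remove i,j from L, and add k

Terminate: when |L| = 2, connect the two remaining nodes

[Figure: leaves i and j joined at the new node k, with m another node of the tree]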
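
A minimal sketch of the iteration, assuming the standard neighbor-joining quantities: r(i) is the sum of distances from i to the other active nodes divided by |L| - 2, the pair minimizing d(i,j) - r(i) - r(j) is joined, and distances to the new node are set by the three-point formula. The function name, the edge-list output, and the `node1`, `node2`, … labels for internal nodes are choices made for this example.

```python
from itertools import combinations


def neighbor_joining(leaf_dist):
    """leaf_dist: dict {(a, b): distance} over unordered pairs of leaf names.

    Returns the unrooted tree as a list of edges (node, node, length).
    """
    d = {frozenset(p): v for p, v in leaf_dist.items()}
    nodes = sorted({x for p in leaf_dist for x in p})
    edges, counter = [], 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[i]: total distance from i to the other active nodes, over n - 2.
        r = {i: sum(d[frozenset((i, m))] for m in nodes if m != i) / (n - 2)
             for i in nodes}
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        counter += 1
        k = f"node{counter}"  # label for the new internal node
        dij = d[frozenset((i, j))]
        d_ik = 0.5 * (dij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, dij - d_ik))
        # Distances from the new node to every remaining node.
        for m in nodes:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - dij)
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Additive toy distances from a tree with pendant branches A=2, B=3, C=1, D=5
# and a middle branch of length 4 (made up for illustration).
toy = {("A", "B"): 5.0, ("A", "C"): 7.0, ("A", "D"): 11.0,
       ("B", "C"): 8.0, ("B", "D"): 12.0, ("C", "D"): 6.0}
edges = neighbor_joining(toy)
for edge in edges:
    print(edge)
```

Because the toy distances are additive and not clock-like, the sketch recovers the generating branch lengths exactly.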

Distance Based Methods
  • If we make strong assumptions on distances, we can reconstruct trees
  • In real life, distances are usually not exactly additive
  • Sometimes they are close to additive
Character Based Methods
  • We start with a multiple alignment
  • Assumptions:
    • All sequences are homologous
    • Each position in alignment is homologous
    • Positions evolve independently
    • No gaps
  • We seek to explain the evolution of each position in the alignment
Parsimony
  • Character-based method
  • A way to score trees (but not to build trees!)

Assumptions:

  • Independence of characters (no interactions)
  • The best tree is the one that requires the fewest changes
A Simple Example
  • What is the parsimony score of

[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]

A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA

A Simple Example

A: CAGGTA

B: CAGACA

C: CGGGTA

D: TGCACT

E: TGCGTA

  • Each column is scored separately.
  • Let’s look at the first column:
  • Minimal tree has one evolutionary change:

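[Figure: tree for the first column with leaf states C, C, C, T, T (for A, B, C, D, E) and a single T → C change marked on one branch]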

Evaluating Parsimony Scores
  • How do we compute the Parsimony score for a given tree?
  • Traditional Parsimony
    • Each base change has a cost of 1
  • Weighted Parsimony
    • Each change is weighted by the score c(a,b)
Traditional Parsimony
  • Solved independently for each position
  • Linear time solution

[Figure: small tree with leaves labeled a and g and internal nodes annotated with candidate state sets such as {a} and {a,g}]
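
The slide does not name the linear-time method. The sketch below assumes it is Fitch's algorithm, the usual choice for unit-cost parsimony: each node gets the intersection of its children's state sets if that is non-empty, otherwise their union plus one change. The tree topology used in the example is an assumption, since the slide figure is not available.

```python
def fitch(tree, states):
    """Return (candidate_state_set, change_count) for one character.

    tree:   a leaf name (str) or a pair (left_subtree, right_subtree)
    states: dict mapping each leaf name to its observed state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change at this node.
        return common, left_cost + right_cost
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# First column of the five-sequence example (A, B, C carry C; D, E carry T).
# The topology is assumed for illustration.
tree = ((("A", "C"), "B"), ("D", "E"))
column = {"A": "C", "B": "C", "C": "C", "D": "T", "E": "T"}
root_set, score = fitch(tree, column)
print(root_set, score)
```

Summing `score` over all columns gives the parsimony score of the tree.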

Evaluating Weighted Parsimony

Dynamic programming on the tree

S(i,a) = cost of tree rooted at i if i is labeled by a

Initialization:

  • For each leaf i set $S(i,a) = 0$ if i is labeled by a, otherwise $S(i,a) = \infty$

Iteration:

  • if k is a node with children i and j, then $S(k,a) = \min_b \big(S(i,b) + c(a,b)\big) + \min_b \big(S(j,b) + c(a,b)\big)$

Termination:

  • cost of tree is $\min_a S(r,a)$ where r is the root
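
The recursion can be sketched directly. The function names and the transition/transversion cost function `ts_tv_cost` are assumptions made for this example; any cost c(a,b) can be passed in.

```python
import math

ALPHABET = "ACGT"


def unit_cost(a, b):
    """Traditional parsimony: every change costs 1."""
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Example weighting (assumed): transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    purines = {"A", "G"}
    return 1 if (a in purines) == (b in purines) else 2


def cost_table(tree, leaf_state, cost):
    """Return {a: S(node, a)} for the root of `tree`.

    tree is a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        # Initialization: 0 for the observed label, infinity otherwise.
        return {a: 0.0 if a == leaf_state[tree] else math.inf for a in ALPHABET}
    s_i = cost_table(tree[0], leaf_state, cost)
    s_j = cost_table(tree[1], leaf_state, cost)
    return {a: min(s_i[b] + cost(a, b) for b in ALPHABET)
               + min(s_j[b] + cost(a, b) for b in ALPHABET)
            for a in ALPHABET}


def weighted_parsimony(tree, leaf_state, cost):
    """Termination: minimum over root labels."""
    return min(cost_table(tree, leaf_state, cost).values())


tree = (("1", "2"), ("3", "4"))
print(weighted_parsimony(tree, {"1": "G", "2": "A", "3": "A", "4": "G"}, unit_cost))
print(weighted_parsimony(tree, {"1": "G", "2": "G", "3": "A", "4": "T"}, ts_tv_cost))
```

With `unit_cost` the result coincides with the traditional parsimony score of the same tree.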
Cost of Evaluating Parsimony
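  • Score is evaluated on each position independently. Scores are then summed over all positions.
  • If there are n nodes, m characters, and k possible values for each character, then complexity is O(nmk) for traditional parsimony; the weighted dynamic program costs O(nmk²), since each table entry takes a minimum over the k possible child values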
  • By keeping traceback information, we can reconstruct most parsimonious values at each ancestor node
Maximum Parsimony

Site       1 2 3 4 5 6 7 8 9 10
Species 1  A G G G T A A C T G
Species 2  A C G A T T A T T A
Species 3  A T A A T T G T C T
Species 4  A A T G T T G T C G

How many possible unrooted trees?
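
The count can be computed with the standard formula for unrooted bifurcating trees on n labeled leaves, (2n-5)!! = 3 · 5 · … · (2n-5). The formula is not stated on the slide and the function name is made up for this sketch.

```python
def num_unrooted_trees(n_leaves: int) -> int:
    """Number of unrooted bifurcating topologies on n labeled leaves (n >= 3)."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Product of the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


for n in (4, 5, 10):
    print(n, num_unrooted_trees(n))
```

For the four species above this gives three candidate trees, and the count grows very quickly with the number of species.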


Maximum Parsimony

How many substitutions?

Maximum Parsimony

Substitutions required on each of the three trees, sites 1-2:

Tree 1: 0 3
Tree 2: 0 3
Tree 3: 0 3

Maximum Parsimony

Site 2: 1 - G, 2 - C, 3 - T, 4 - A

[Figure: the three trees labeled with the site-2 states; each requires 3 substitutions]

Maximum Parsimony

Substitutions required on each of the three trees, sites 1-4:

Tree 1: 0 3 2 2
Tree 2: 0 3 2 2
Tree 3: 0 3 2 1

Maximum Parsimony

Site 4: 1 - G, 2 - A, 3 - A, 4 - G

[Figure: the three trees labeled with the site-4 states; they require 2, 2, and 1 substitutions]
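
For four species the per-site count can be checked by brute force: an unrooted quartet tree has two internal nodes, so it is enough to try every pair of internal states and count the mismatching branches. This is a minimal sketch; the function name and the representation of a tree as two pairs of species are assumptions made for the example.

```python
from itertools import product


def quartet_changes(pair1, pair2, states, alphabet="ACGT"):
    """Minimum substitutions on the unrooted tree that groups pair1 against pair2."""
    return min(
        sum(states[x] != u for x in pair1)    # branches from the first internal node
        + sum(states[x] != v for x in pair2)  # branches from the second internal node
        + (u != v)                            # the middle branch
        for u, v in product(alphabet, repeat=2)
    )


site4 = {1: "G", 2: "A", 3: "A", 4: "G"}
splits = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
scores = [quartet_changes(p, q, site4) for p, q in splits]
print(scores)
```

Only the tree that groups species 1 with species 4 explains site 4 with a single substitution.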

Maximum Parsimony

Substitutions per site for each of the three trees:

Site     1 2 3 4 5 6 7 8 9 10  Total
Tree 1   0 3 2 2 0 1 1 1 1 3   14
Tree 2   0 3 2 2 0 1 2 1 2 3   16
Tree 3   0 3 2 1 0 1 2 1 2 3   15
Maximum Parsimony

Site       1 2 3 4 5 6 7 8 9 10
Species 1  A G G G T A A C T G
Species 2  A C G A T T A T T A
Species 3  A T A A T T G T C T
Species 4  A A T G T T G T C G

Substitutions on the most parsimonious tree: 0 3 2 2 0 1 1 1 1 3, total 14

Searching for the Optimal Tree
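  • Exhaustive Search
    • Very intensive: the number of topologies grows super-exponentially with the number of leaves
  • Branch and Bound
    • A compromise: still exact, but prunes partial trees whose score already exceeds the best found so far
  • Heuristic
    • Fast, but not guaranteed to find the optimal tree
    • Usually starts with a neighbor-joining (NJ) tree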
Phylogenetic Tree Assumptions
  • Topology: bifurcating
    • Leaves - 1…N
    • Internal nodes N+1…2N-2
  • Lengths t = {t_i} for each branch
  • Phylogenetic tree = (Topology, Lengths) = (T,t)

[Figure: tree with a branch, an internal node, and a leaf labeled]
Probabilistic Methods
  • The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.
  • Background probabilities: q(a)
  • Mutation probabilities: P(a|b,t)
  • Models for evolutionary mutations
    • Jukes Cantor
    • Kimura 2-parameter model
  • Such models are used to derive the probabilities
Jukes Cantor model
  • A model for mutation rates
  • Mutation occurs at a constant rate
  • Each nucleotide is equally likely to mutate into any other nucleotide, with rate α.
Kimura 2-parameter model
  • Allows different rates for transitions (A↔G, C↔T) and transversions (purine↔pyrimidine).
Mutation Probabilities
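  • The rate matrix R is used to derive the mutation probability matrix S: over a short time $\epsilon$, $S(\epsilon) \approx I + R\,\epsilon$
  • S is obtained by integration. For Jukes Cantor with rate $\alpha$: $P(a \mid a, t) = \tfrac{1}{4}\,(1 + 3e^{-4\alpha t})$ and $P(a \mid b, t) = \tfrac{1}{4}\,(1 - e^{-4\alpha t})$ for $a \neq b$
  • q can be obtained by setting t to infinity; for Jukes Cantor this gives $q(a) = \tfrac{1}{4}$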
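
A small numerical sketch of the Jukes-Cantor case, assuming the standard closed form in which a nucleotide is unchanged after time t with probability (1 + 3e^(-4αt))/4 and becomes each other nucleotide with probability (1 - e^(-4αt))/4. The function name `jc_prob` and the default rate are choices made for this example.

```python
import math

NUCLEOTIDES = "ACGT"


def jc_prob(a: str, b: str, t: float, alpha: float = 1.0) -> float:
    """P(a | b, t): probability that b has become a after time t."""
    decay = math.exp(-4.0 * alpha * t)
    if a == b:
        return 0.25 * (1.0 + 3.0 * decay)
    return 0.25 * (1.0 - decay)


# Each row is a probability distribution, and long branches forget the start state.
for t in (0.0, 0.1, 1.0, 1000.0):
    row = [jc_prob(a, "A", t) for a in NUCLEOTIDES]
    print(t, [round(p, 4) for p in row], round(sum(row), 6))
```

At t = 0 the row is the identity, and for large t every entry approaches 1/4, the background probability.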
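Mutation Probabilities
  • Both models satisfy the following properties:
  • Lack of memory: $P(a \mid b, t + t') = \sum_c P(a \mid c, t')\,P(c \mid b, t)$
  • Reversibility:
    • There exist stationary probabilities {P_a} s.t. $P_a\,P(b \mid a, t) = P_b\,P(a \mid b, t)$

[Figure: diagram over the four nucleotides A, C, G, T]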
Probabilistic Approach
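  • Given P,q, the tree topology and branch lengths, we can compute: $P(x_1,\dots,x_5 \mid T,t) = q(x_5)\,P(x_4 \mid x_5, t_4)\,P(x_3 \mid x_5, t_3)\,P(x_1 \mid x_4, t_1)\,P(x_2 \mid x_4, t_2)$

[Figure: tree with root x5; x5 has children x4 (branch t4) and x3 (branch t3); x4 has children x1 (branch t1) and x2 (branch t2)]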

Computing the Tree Likelihood
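  • We are interested in the probability of observed data given tree and branch “lengths”: $P(x_{\text{leaves}} \mid T, t)$
  • Computed by summing over internal nodes: $P(x_{\text{leaves}} \mid T, t) = \sum_{x_{\text{internal}}} P(x_{\text{leaves}}, x_{\text{internal}} \mid T, t)$
  • This can be done efficiently using a single upward (leaves-to-root) traversal of the tree.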
Tree Likelihood Computation
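  • Define $P(L_k \mid a)$ = prob. of leaves below node k given that $x_k = a$
  • Init: for leaves: $P(L_k \mid a) = 1$ if $x_k = a$; 0 otherwise
  • Iteration: if k is node with children i and j, then $P(L_k \mid a) = \sum_{b} P(b \mid a, t_i)\,P(L_i \mid b) \cdot \sum_{c} P(c \mid a, t_j)\,P(L_j \mid c)$
  • Termination: Likelihood is $\sum_a q(a)\,P(L_r \mid a)$, where r is the root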
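
The upward pass can be sketched for a single alignment column. This minimal version assumes the Jukes-Cantor model with uniform background probabilities q(a) = 1/4; the function names and the nested-tuple tree encoding (each child paired with its branch length) are choices made for the example.

```python
import math

NUCS = "ACGT"


def jc(a, b, t, alpha=1.0):
    """P(a | b, t) under Jukes-Cantor (assumed model for this sketch)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay) if a == b else 0.25 * (1.0 - decay)


def conditional(tree, leaf_state):
    """Return {a: P(L_k | a)} for the root k of `tree`.

    tree is a leaf name (str) or ((left_subtree, t_left), (right_subtree, t_right)).
    """
    if isinstance(tree, str):
        return {a: 1.0 if a == leaf_state[tree] else 0.0 for a in NUCS}
    (left, t_i), (right, t_j) = tree
    p_i = conditional(left, leaf_state)
    p_j = conditional(right, leaf_state)
    return {a: sum(jc(b, a, t_i) * p_i[b] for b in NUCS)
               * sum(jc(c, a, t_j) * p_j[c] for c in NUCS)
            for a in NUCS}


def likelihood(tree, leaf_state):
    """Sum over root states, weighted by the background probability 1/4."""
    return sum(0.25 * p for p in conditional(tree, leaf_state).values())


# Three-leaf tree: x1 and x2 share a parent, which is a sibling of x3.
# Branch lengths are arbitrary illustrative values.
tree = (((("x1", 0.1), ("x2", 0.2)), 0.3), ("x3", 0.4))
print(likelihood(tree, {"x1": "A", "x2": "A", "x3": "G"}))
```

Each node is visited once, so the cost per column is linear in the number of nodes (times the square of the alphabet size).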
Maximum Likelihood (ML)
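  • Score each tree by $\log P(X_1,\dots,X_n \mid T, t) = \sum_{m=1}^{n} \log P(X_m \mid T, t)$, where $X_m$ is position m of the alignment
    • Assumption of independent positions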
  • Branch lengths t can be optimized
    • Gradient ascent
    • EM
  • We look for the highest scoring tree
    • Exhaustive search
    • Sampling methods (Metropolis)
Optimal Tree Search
  • Perform search over possible topologies

[Figure: candidate topologies T1, T2, T3, T4, …, Tn, each with its own parameter space; parametric optimization (EM) within a topology can end in local maxima]

Computational Problem
  • Such procedures are computationally expensive!
  • Computation of optimal parameters, per candidate, requires non-trivial optimization step.
  • Spend non-negligible computation on a candidate, even if it is a low scoring one.
  • In practice, such learning procedures can only consider small sets of candidate structures
Structural EM

Idea: Use parameters found for current topology to help evaluate new topologies.

Outline:

  • Perform search in (T, t) space.
  • Use EM-like iterations:
    • E-step: use current solution to compute expected sufficient statistics for all topologies
    • M-step: select new topology based on these expected sufficient statistics
The Complete-Data Scenario

Suppose we observe H, the ancestral sequences.

S_{i,j} is a matrix of the number of co-occurrences of each pair (a,b) in the taxa i,j

Define:

Find: topology T that maximizes

F is a linear function of S_{i,j}

Expected Likelihood
  • Start with a tree (T0,t0)
  • Compute

Formal justification:

  • Define:

Theorem:

Consequence: improvement in expected score ⇒ improvement in likelihood

Algorithm Outline

Original Tree (T0,t0)

Compute: unlike standard EM for trees, we compute all possible pairwise statistics

Time: O(N²M)

[Diagram stages shown: Original Tree (T0,t0), Compute, Weights]

Algorithm Outline

Pairwise weights

This stage also computes the branch length for each pair (i,j)

[Diagram stages shown: Compute, Weights, Find]

Algorithm Outline

Max. Spanning Tree

Fast greedy procedure to find tree

By construction:

Q(T’,t’) ≥ Q(T0,t0)

Thus,

l(T’,t’) ≥ l(T0,t0)

[Diagram stages shown: Compute, Weights, Find, Construct bifurcation T1]
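
The slide does not say which greedy procedure is used. One common choice, assumed here, is Kruskal's algorithm run on edges sorted by decreasing weight. The function name, the union-find helper, and the toy weights are all made up for this sketch and do not come from the deck.

```python
def max_spanning_tree(nodes, weights):
    """weights: dict {(u, v): w} over unordered node pairs.

    Returns the edges (u, v, w) of a maximum-weight spanning tree.
    """
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    # Greedy step: take the heaviest edge that does not close a cycle.
    for (u, v), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree


# Toy pairwise weights (hypothetical values).
toy_weights = {("a", "b"): 5, ("b", "c"): 4, ("c", "d"): 3,
               ("a", "d"): 2, ("a", "c"): 1, ("b", "d"): 0}
mst = max_spanning_tree("abcd", toy_weights)
print(mst, sum(w for _, _, w in mst))
```

The resulting tree may contain nodes of any degree, so it generally needs a separate step to turn it into a bifurcating topology.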

Algorithm Outline

Fix Tree

  • Remove redundant nodes
  • Add nodes to break large degree

This operation preserves likelihood

l(T1,t’) = l(T’,t’) ≥ l(T0,t0)

[Diagram stages shown: Compute, Weights, Find, Construct bifurcation T1]

Assessing trees: the Bootstrap
  • Often we don’t trust the tree found as the “correct” one.
  • Bootstrapping:
    • Sample (with replacement) n positions from the alignment
    • Learn the best tree for each sample
    • Look for tree features which are frequent in all trees.
  • For some models this procedure approximates the tree posterior P(T| X1,…,Xn)
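
The resampling step can be sketched as follows. The `tree_features` argument is a hypothetical callable standing in for "learn the best tree and return its features" (for example, the set of splits of the tree); it is not defined in the deck and must be supplied by the caller.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Sample n columns with replacement from an alignment with n columns.

    alignment: dict mapping species name to its aligned sequence.
    """
    n = len(next(iter(alignment.values())))
    cols = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def bootstrap_support(alignment, tree_features, replicates=100, seed=0):
    """Fraction of bootstrap replicates in which each tree feature appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        sample = bootstrap_alignment(alignment, rng)
        counts.update(tree_features(sample))  # one vote per feature per replicate
    return {feature: c / replicates for feature, c in counts.items()}


alignment = {"1": "AGGGTAACTG", "2": "ACGATTATTA",
             "3": "ATAATTGTCT", "4": "AATGTTGTCG"}
sample = bootstrap_alignment(alignment, random.Random(1))
print(sample)
```

Features with support close to 1 are those that appear in nearly every resampled tree.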
Algorithm Outline

New Tree

Thm: l(T1,t1) ≥ l(T0,t0)

These steps are then repeated until convergence

[Diagram stages shown: Compute, Weights, Find, Construct bifurcation T1, New Tree]
