## PowerPoint Slideshow about 'Class 9: Phylogenetic Trees' - Donna


The Tree of Life

After Ernst Haeckel, 1891

Evolution

- Many theories of evolution
- Basic idea:
- speciation events lead to creation of different species
- Speciation caused by physical separation into groups where different genetic variants become dominant
- Any two species share a (possibly distant) common ancestor

Phylogenies

- A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current day species
- Leaves - current day species
- Nodes - hypothetical most recent common ancestors
- Edge lengths - “time” from one speciation to the next

[Figure: example phylogeny with leaves Aardvark, Bison, Chimp, Dog, Elephant]

How to construct a Phylogeny?

- Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria)
- Since then, focus on objective criteria for constructing phylogenetic trees
- Thousands of articles in the last decades
- Important for many aspects of biology
- Classification (systematics)
- Understanding biological mechanisms

Morphological vs. Molecular

- Classical phylogenetic analysis: morphological features
- number of legs, lengths of legs, etc.
- Modern biological methods allow the use of molecular features
- Gene sequences
- Protein sequences
- Analysis based on homologous sequences (e.g., globins) in different species

Dangers in Molecular Phylogenies

- We have to remember that gene/protein sequences can be homologous for different reasons:
- Orthologs -- sequences diverged after a speciation event
- Paralogs -- sequences diverged after a duplication event
- Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)

Dangers of Paralogs

- If we only consider 1A, 2B, and 3A...

[Figure: gene tree showing a gene duplication event and speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B]

Types of Trees

- Depending on the model, data from current day species does not distinguish between different placements of the root

[Figure: the same tree drawn with two different placements of the root]

Types of trees

- An unrooted tree represents the same phylogeny without the root node

Positioning Roots in Unrooted Trees

- We can estimate the position of the root by introducing an outgroup:
- a set of species that are definitely distant from all the species of interest

[Figure: tree over Aardvark, Bison, Chimp, Dog, Elephant with Falcon as the outgroup and the proposed root marked]

Types of Data

- Distance-based
- Input is a matrix of distances between species
- Can be the fraction of residues on which they disagree, or the negated alignment score between them, or …
- Character-based
- Examine each character (e.g., residue) separately

Simple Distance-Based Method

Input: distance matrix between species

Outline:

- Cluster species together
- Initially clusters are singletons
- At each iteration combine two “closest” clusters to get a new one

UPGMA Clustering

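- Let Ci and Cj be clusters, define distance between them to be the average distance between their members:

$$d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$$

- When combining two clusters, Ci and Cj, to form a new cluster Ck, then for every other cluster Cl

$$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$$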
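The clustering loop and the size-weighted update can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `upgma`, the `frozenset`-keyed distance dictionary, and the convention of placing each merge at half the cluster distance are assumptions made for the example.

```python
from itertools import combinations

def upgma(labels, dist):
    """Cluster leaves by average linkage.

    labels: list of leaf names.
    dist:   dict mapping frozenset({x, y}) -> distance between leaves x and y.
    Returns (root, height): root is a nested tuple of merges, height maps
    every cluster to the height at which it was formed (leaves are at 0).
    """
    clusters = {name: 1 for name in labels}      # cluster -> number of leaves
    d = dict(dist)                               # working copy of distances
    height = {name: 0.0 for name in labels}
    while len(clusters) > 1:
        # Pick the two closest clusters.
        ci, cj = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        ck = (ci, cj)
        ni, nj = clusters.pop(ci), clusters.pop(cj)
        # Size-weighted average gives the distance from the new cluster.
        for cl in clusters:
            d[frozenset((ck, cl))] = (
                ni * d[frozenset((ci, cl))] + nj * d[frozenset((cj, cl))]
            ) / (ni + nj)
        height[ck] = d[frozenset((ci, cj))] / 2  # assumed molecular clock
        clusters[ck] = ni + nj
    (root,) = clusters
    return root, height
```

With distances that fit a molecular clock (for example A and B at distance 2, C at distance 6 from both, D at distance 10 from all three), the merges happen at heights 1, 3 and 5.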

Additivity

- A weaker requirement than the molecular clock (ultrametric distances) that UPGMA relies on is additivity
- In the “real” tree, distances between species are the sum of distances between intermediate nodes

[Figure: leaves i, j, k attached to a common internal node by branches of length a, b, c respectively]
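As a small worked case of additivity, assume three leaves i, j, k that hang off one internal node by branches of length a, b and c (this layout and the numbers below are assumptions chosen for illustration). The pairwise distances are sums of branch lengths, and the branch lengths can be recovered from the distances:

$$
\begin{aligned}
d(i,j) &= a + b, \qquad d(i,k) = a + c, \qquad d(j,k) = b + c \\
a &= \tfrac{1}{2}\big(d(i,j) + d(i,k) - d(j,k)\big) \\
b &= \tfrac{1}{2}\big(d(i,j) + d(j,k) - d(i,k)\big) \\
c &= \tfrac{1}{2}\big(d(i,k) + d(j,k) - d(i,j)\big)
\end{aligned}
$$

For example, d(i,j) = 3, d(i,k) = 8 and d(j,k) = 9 give a = 1, b = 2 and c = 7.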

Neighbor Joining

- Can we use this fact to construct trees?
- Let

where

Theorem: if D(i,j) is minimal (among all pairs of leaves), then i and j are neighbors in the tree

[Figure: tree with nodes i, j, k, m]

Neighbor Joining

- Set L to contain all leaves

Iteration:

- Choose i,j such that D(i,j) is minimal
- Create new node k, and set $d(k,m) = \tfrac{1}{2}\big(d(i,m) + d(j,m) - d(i,j)\big)$ for all m in L
- Remove i,j from L, and add k

Terminate: when |L| = 2, connect the two remaining nodes
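The iteration above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name, the nested-dictionary distance format, the generated internal node names (`N0`, `N1`, ...) and the branch-length rule d(i,k) = ½(d(i,j) + r_i − r_j) are choices made for the example, following the usual textbook formulation rather than anything given on the slide.

```python
from itertools import combinations

def neighbor_joining(labels, dist):
    """Build an unrooted tree from a distance table.

    labels: list of leaf names (must not clash with generated names N0, N1, ...).
    dist:   nested dict, dist[x][y] = distance between leaves x and y.
    Returns a list of edges (node, node, branch_length).
    """
    d = {x: dict(dist[x]) for x in labels}       # working copy
    nodes = list(labels)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[x]: average distance from x to the other active nodes.
        r = {x: sum(d[x][y] for y in nodes if y != x) / (n - 2) for x in nodes}
        # Pick the pair minimising D(i,j) = d(i,j) - (r_i + r_j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        k = f"N{next_id}"
        next_id += 1
        d_ik = 0.5 * (d[i][j] + r[i] - r[j])     # assumed branch-length rule
        edges.append((i, k, d_ik))
        edges.append((j, k, d[i][j] - d_ik))
        # Distances from the new node to every remaining node.
        d[k] = {}
        for m in nodes:
            if m not in (i, j):
                d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    i, j = nodes
    edges.append((i, j, d[i][j]))
    return edges
```

On additive input, such as distances read off a tree with branches A:1, B:2, C:4, D:5 and an internal branch of length 3, the returned edge lengths reproduce those five values.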

Distance Based Methods

- If we make strong assumptions on distances, we can reconstruct trees
- In real life, distances are not additive
- Sometimes they are close to additive

Character Based Methods

- We start with a multiple alignment
- Assumptions:
- All sequences are homologous
- Each position in alignment is homologous
- Positions evolve independently
- No gaps
- We seek to explain the evolution of each position in the alignment

Parsimony

- Character-based method
- A way to score trees (but not to build trees!)

Assumptions:

- Independence of characters (no interactions)
- Best tree is one where minimal changes take place

[Figure: example tree over Aardvark, Bison, Chimp, Dog, Elephant]

A Simple Example

- What is the parsimony score of

A: CAGGTA

B: CAGACA

C: CGGGTA

D: TGCACT

E: TGCGTA

- Each column is scored separately.
- Let’s look at the first column:
- Minimal tree has one evolutionary change:

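[Figure: the tree labeled with the first column (A, B, C carry C; D, E carry T) and ancestral labels that require a single change]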

Evaluating Parsimony Scores

- How do we compute the Parsimony score for a given tree?
- Traditional Parsimony
- Each base change has a cost of 1
- Weighted Parsimony
- Each change is weighted by the score c(a,b)

Evaluating Weighted Parsimony

Dynamic programming on the tree

S(i,a) = cost of tree rooted at i if i is labeled by a

Initialization:

- For each leaf i set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞

Iteration:

- if k is a node with children i and j, then

$$S(k,a) = \min_b \big(S(i,b) + c(a,b)\big) + \min_b \big(S(j,b) + c(a,b)\big)$$

Termination:

- cost of tree is $\min_a S(r,a)$ where r is the root
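The recursion can be sketched directly in Python. The tree encoding (a leaf name or a pair of subtrees), the function names, the unit cost function and the topology `(((A, B), C), (D, E))` used in the usage note are all assumptions made for illustration; the slides do not state which tree the five example sequences are scored on.

```python
import math

ALPHABET = "ACGT"

def sankoff(tree, leaf_state, cost):
    """Return {a: S(node, a)} for one alignment column.

    tree:       a leaf name (str) or a pair (left_subtree, right_subtree).
    leaf_state: dict leaf name -> observed character in this column.
    cost:       function c(a, b) giving the cost of changing a into b.
    """
    if isinstance(tree, str):
        # Initialization: 0 for the observed label, infinity otherwise.
        return {a: 0 if a == leaf_state[tree] else math.inf for a in ALPHABET}
    left, right = (sankoff(sub, leaf_state, cost) for sub in tree)
    # Iteration: best child label for each child, chosen independently.
    return {
        a: min(left[b] + cost(a, b) for b in ALPHABET)
           + min(right[b] + cost(a, b) for b in ALPHABET)
        for a in ALPHABET
    }

def parsimony_score(tree, seqs, cost):
    """Sum the per-column minimum root cost over all alignment columns."""
    length = len(next(iter(seqs.values())))
    total = 0
    for col in range(length):
        column = {name: s[col] for name, s in seqs.items()}
        total += min(sankoff(tree, column, cost).values())  # Termination
    return total

def unit_cost(a, b):
    """Traditional parsimony: every change costs 1."""
    return 0 if a == b else 1
```

With the sequences A: CAGGTA, B: CAGACA, C: CGGGTA, D: TGCACT, E: TGCGTA and the assumed topology, `parsimony_score` with `unit_cost` gives 8, and the first column alone contributes 1.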

Cost of Evaluating Parsimony

- Score is evaluated on each position independently. Scores are then summed over all positions.
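- If there are n nodes, m characters, and k possible values for each character, then the weighted recursion costs O(nmk²), because each of the k labels at a node requires a minimum over the k labels of each child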
- By keeping traceback information, we can reconstruct most parsimonious values at each ancestor node

Maximum Parsimony

| Site | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Species 1 | A | G | G | G | T | A | A | C | T | G |
| Species 2 | A | C | G | A | T | T | A | T | T | A |
| Species 3 | A | T | A | A | T | T | G | T | C | T |
| Species 4 | A | A | T | G | T | T | G | T | C | G |

How many possible unrooted trees?

Each candidate tree is scored one site at a time. After the first four sites, the three rows of per-site scores shown are:

- 0 3 2 2
- 0 3 2 2
- 0 3 2 1

The completed row of per-site scores shown is 0 3 2 2 0 1 1 1 1 3, for a total of 14.

Searching for the Optimal Tree

- Exhaustive Search
- Very intensive
- Branch and Bound
- A compromise
- Heuristic
- Fast
- Usually starts with NJ
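To see why exhaustive search is so intensive, it helps to count candidates. The number of unrooted bifurcating trees on n labeled leaves is (2n − 5)!! = 1 · 3 · 5 ⋯ (2n − 5), a standard combinatorial result that is not stated on the slides. The helper name below is made up for this sketch.

```python
def num_unrooted_trees(n_leaves):
    """Number of unrooted bifurcating trees on n labeled leaves: (2n - 5)!!"""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count

if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, num_unrooted_trees(n))
```

Four species give 3 trees, five give 15, and ten already give 2,027,025, which is why branch and bound or heuristic search takes over quickly.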

Phylogenetic Tree Assumptions

- Topology: bifurcating
- Leaves - 1…N
- Internal nodes N+1…2N-2
- Lengths t = {ti} for each branch
- Phylogenetic tree = (Topology, Lengths) = (T,t)

[Figure: tree with a branch, an internal node, and a leaf labeled]

Probabilistic Methods

- The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.
- Background probabilities: q(a)
- Mutation probabilities: P(a|b,t)
- Models for evolutionary mutations
- Jukes Cantor
- Kimura 2-parameter model
- Such models are used to derive the probabilities

Jukes Cantor model

- A model for mutation rates

- Mutation occurs at a constant rate
- Each nucleotide is equally likely to mutate into any other nucleotide with rate a.

Kimura 2-parameter model

- Allows a different rate for transitions and transversions.

Mutation Probabilities

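- The rate matrix R is used to derive the mutation probability matrix S; over a short time ε,

$$S(\varepsilon) \approx I + R\,\varepsilon$$

- S is obtained by integration. For Jukes Cantor, with rows and columns ordered A, C, G, T and α the substitution rate of the model:

$$
R = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix},
\qquad
S(t) = \begin{pmatrix}
r_t & s_t & s_t & s_t \\
s_t & r_t & s_t & s_t \\
s_t & s_t & r_t & s_t \\
s_t & s_t & s_t & r_t
\end{pmatrix}
$$

$$r_t = \tfrac{1}{4}\big(1 + 3e^{-4\alpha t}\big), \qquad s_t = \tfrac{1}{4}\big(1 - e^{-4\alpha t}\big)$$

- q can be obtained by setting t to infinity: $q(a) = \tfrac{1}{4}$ for every nucleotide a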
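A short Python sketch makes the Jukes Cantor probabilities concrete. It uses the standard closed form for this model (probability ¼(1 + 3e^(−4αt)) of observing the same nucleotide after time t, and ¼(1 − e^(−4αt)) for each different one); the function name and the nested-dictionary layout are assumptions made for the example.

```python
import math

BASES = "ACGT"

def jukes_cantor_matrix(t, alpha):
    """Return S(t) as a nested dict: S[a][b] = P(b after time t | a now)."""
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 * (1.0 + 3.0 * decay)    # diagonal entries
    diff = 0.25 * (1.0 - decay)          # off-diagonal entries
    return {a: {b: (same if a == b else diff) for b in BASES} for a in BASES}

if __name__ == "__main__":
    for t in (0.0, 0.1, 1.0, 100.0):
        row = jukes_cantor_matrix(t, alpha=1.0)["A"]
        print(t, [round(row[b], 4) for b in BASES])
```

At t = 0 the matrix is the identity, every row sums to 1 for any t, and for large t every entry approaches the background probability 0.25.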

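Mutation Probabilities

- Both models satisfy the following properties:
- Lack of memory: $P(a \mid b, t + t') = \sum_c P(a \mid c, t)\, P(c \mid b, t')$
- Reversibility: there exist stationary probabilities {Pa} s.t. $P_a\, P(b \mid a, t) = P_b\, P(a \mid b, t)$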

Probabilistic Approach

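- Given P,q, the tree topology and branch lengths, we can compute the joint probability of all node labels. For the example tree (root x5 with children x4 and x3; x4 with children x1 and x2):

$$P(x_1,\dots,x_5 \mid T,t) = q(x_5)\,P(x_4 \mid x_5,t_4)\,P(x_3 \mid x_5,t_3)\,P(x_1 \mid x_4,t_1)\,P(x_2 \mid x_4,t_2)$$

[Figure: tree with nodes x1 to x5 and branch lengths t1 to t4]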

Computing the Tree Likelihood

- We are interested in the probability of observed data given tree and branch “lengths”: $P(x_1,\dots,x_N \mid T,t)$
- Computed by summing over the unobserved labels of the internal nodes
- This can be done efficiently using a tree upward traversal pass.

Tree Likelihood Computation

- Define P(Lk|a)= prob. of leaves below node k given that xk=a
- Init: for leaves: P(Lk|a)=1 if xk=a ; 0 otherwise
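- Iteration: if k is a node with children i and j, then

$$P(L_k \mid a) = \Big(\sum_b P(b \mid a,t_i)\,P(L_i \mid b)\Big)\Big(\sum_c P(c \mid a,t_j)\,P(L_j \mid c)\Big)$$

- Termination: Likelihood is $\sum_a q(a)\,P(L_r \mid a)$, where r is the root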
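The upward pass can be sketched in Python for a single alignment column. The sketch assumes the Jukes Cantor model with α = 1 and uniform background probabilities q(a) = ¼; the tree encoding (a leaf name, or a pair of `(child, branch_length)` tuples) and the function names are hypothetical choices for the example.

```python
import math

BASES = "ACGT"

def jc_prob(b, a, t, alpha=1.0):
    """P(b | a, t) under Jukes Cantor (assumed model for this sketch)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay) if a == b else 0.25 * (1.0 - decay)

def partial_likelihood(node, column):
    """Return {a: P(L_k | a)} for the subtree rooted at `node`.

    node:   a leaf name (str) or ((left, t_left), (right, t_right)).
    column: dict leaf name -> observed nucleotide.
    """
    if isinstance(node, str):
        # Init: 1 for the observed nucleotide, 0 otherwise.
        return {a: 1.0 if column[node] == a else 0.0 for a in BASES}
    (left, t_i), (right, t_j) = node
    p_i = partial_likelihood(left, column)
    p_j = partial_likelihood(right, column)
    # Iteration: sum out the label of each child independently.
    return {
        a: sum(jc_prob(b, a, t_i) * p_i[b] for b in BASES)
           * sum(jc_prob(c, a, t_j) * p_j[c] for c in BASES)
        for a in BASES
    }

def column_likelihood(root, column):
    """Termination: weight the root partial likelihoods by q(a) = 1/4."""
    p_root = partial_likelihood(root, column)
    return sum(0.25 * p_root[a] for a in BASES)
```

A quick sanity check is a two-leaf tree: because the model is reversible, the likelihood of observing A at both leaves with branches 0.1 and 0.3 equals ¼ · P(A | A, 0.4).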

Maximum Likelihood (ML)

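- Score each tree by the likelihood of the whole alignment; with independent positions the log-likelihood is a sum over alignment columns m: $l(T,t) = \sum_m \log P(x[m] \mid T,t)$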
- Assumption of independent positions
- Branch lengths t can be optimized
- Gradient ascent
- EM
- We look for the highest scoring tree
- Exhaustive search
- Sampling methods (Metropolis)

Optimal Tree Search

- Perform search over possible topologies

[Figure: candidate topologies T1, T2, T3, T4, …, Tn, each with its own parameter space; parametric optimization (EM) within a topology climbs to local maxima]

Computational Problem

- Such procedures are computationally expensive!
- Computation of optimal parameters, per candidate, requires non-trivial optimization step.
- Spend non-negligible computation on a candidate, even if it is a low scoring one.
- In practice, such learning procedures can only consider small sets of candidate structures

Structural EM

Idea: Use parameters found for current topology to help evaluate new topologies.

Outline:

- Perform search in (T, t) space.
- Use EM-like iterations:
- E-step: use current solution to compute expected sufficient statistics for all topologies
- M-step: select new topology based on these expected sufficient statistics

The Complete-Data Scenario

Suppose we observe H, the ancestral sequences.

Define: Si,j, a matrix of # of co-occurrences for each pair (a,b) in the taxa i,j

Find: topology T that maximizes F

F is a linear function of Si,j
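The co-occurrence matrix for one pair of taxa is simply a count of aligned character pairs. The sketch below computes it for two fully observed sequences; the function name and the use of `collections.Counter` as the matrix representation are assumptions for illustration, and in the incomplete-data case these counts would be replaced by expected counts.

```python
from collections import Counter

def cooccurrence_counts(seq_i, seq_j):
    """Count how often each pair (a, b) occurs at the same alignment position.

    seq_i, seq_j: aligned sequences of equal length for taxa i and j.
    Returns a Counter mapping (a, b) -> number of positions where taxon i
    has a and taxon j has b.
    """
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    return Counter(zip(seq_i, seq_j))

if __name__ == "__main__":
    counts = cooccurrence_counts("AGGGTAACTG", "ACGATTATTA")
    for pair, n in sorted(counts.items()):
        print(pair, n)
```

Because the counts are plain sums over positions, any score that is linear in them can be evaluated for every candidate pair (i,j) without revisiting the alignment.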

Expected Likelihood

- Start with a tree (T0,t0)
- Compute

Formal justification:

- Define:

Theorem:

Consequence: improvement in expected score ⇒ improvement in likelihood

Algorithm Outline

Compute:

- Start from the original tree (T0,t0)
- Unlike standard EM for trees, we compute all possible pairwise statistics
- Time: O(N²M)

Weights (pairwise weights):

- This stage also computes the branch length for each pair (i,j)

Find (max. spanning tree):

- Fast greedy procedure to find tree
- By construction: Q(T’,t’) ≥ Q(T0,t0)
- Thus, l(T’,t’) ≥ l(T0,t0)

Construct bifurcation T1 (fix tree):

- Remove redundant nodes
- Add nodes to break large degree
- This operation preserves likelihood: l(T1,t’) = l(T’,t’) ≥ l(T0,t0)
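The "fast greedy procedure" for the spanning-tree stage is not spelled out on the slides. One common choice is Kruskal's algorithm run on edges sorted by decreasing weight, sketched below; the function name, the edge-dictionary format and the toy weights are assumptions for illustration.

```python
def maximum_spanning_tree(nodes, weights):
    """Greedy (Kruskal-style) maximum spanning tree.

    nodes:   iterable of node names.
    weights: dict mapping (u, v) -> weight of the undirected edge u-v.
    Returns a list of chosen edges (u, v, weight).
    """
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent pointers to the component representative,
        # halving the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    # Consider the heaviest edges first.
    for (u, v), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # edge joins two components: keep it
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree

if __name__ == "__main__":
    w = {("a", "b"): 3.0, ("b", "c"): 2.0, ("a", "c"): 1.0}
    print(maximum_spanning_tree("abc", w))
```

The result generally contains nodes of arbitrary degree, which is why a separate stage that turns it into a bifurcating tree follows.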

Assessing trees: the Bootstrap

- Often we don’t trust the tree found as the “correct” one.
- Bootstrapping:
- Sample (with replacement) n positions from the alignment
- Learn the best tree for each sample
- Look for tree features which are frequent in all trees.
- For some models this procedure approximates the tree posterior P(T| X1,…,Xn)
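The resampling step can be sketched in Python. The function name and the dictionary-of-strings alignment format are assumptions for this example; learning a tree from each resampled alignment is left to whichever method is being assessed.

```python
import random

def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement.

    seqs: dict taxon name -> aligned sequence (all of the same length n).
    rng:  a random.Random instance, passed in so runs are reproducible.
    Returns a new alignment with n columns drawn from the original ones.
    """
    n = len(next(iter(seqs.values())))
    columns = [rng.randrange(n) for _ in range(n)]   # sampled column indices
    return {name: "".join(s[c] for c in columns) for name, s in seqs.items()}

if __name__ == "__main__":
    alignment = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA"}
    rng = random.Random(0)
    for _ in range(3):
        print(bootstrap_alignment(alignment, rng))
```

Every taxon is resampled with the same column indices, so each column of the new alignment is an intact column of the original one.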

Algorithm Outline: New Tree

- Thm: l(T1,t1) ≥ l(T0,t0)
- These steps are then repeated until convergence
