
Class 9: Phylogenetic Trees


The Tree of Life

[Figure: the tree of life, after Ernst Haeckel, 1891]

Evolution

- Many theories of evolution
- Basic idea:
- Speciation events lead to the creation of different species
- Speciation is caused by physical separation into groups in which different genetic variants become dominant

- Any two species share a (possibly distant) common ancestor

- A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of present-day species
- Leaves - present-day species
- Nodes - hypothetical most recent common ancestors
- Edge lengths - “time” from one speciation to the next

[Figure: example tree over Aardvark, Bison, Chimp, Dog, and Elephant, with a branch, an internal node, and a leaf labeled]

- Topology: bifurcating
- Leaves - 1…N
- Internal nodes N+1…2N-2

[Figure: divergence times of 20-25, 35-37, and 40-45 mya (million years ago)]

- Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria)
- Since then, focus on objective criteria for constructing phylogenetic trees
- Thousands of articles in the last decades

- Important for many aspects of biology
- Classification (systematics)
- Understanding biological mechanisms

- Classical phylogenetic analysis: morphological features
- number of legs, lengths of legs, etc.

- Modern biological methods make it possible to use molecular features
- Gene sequences
- Protein sequences

- Analysis based on homologous sequences (e.g., globins) in different species

- We have to remember that gene/protein sequences can be homologous for different reasons:
- Orthologs -- sequences diverged after a speciation event
- Paralogs -- sequences diverged after a duplication event
- Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)

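[Figure: gene tree in which a gene duplication precedes the speciation events, giving copies A and B in each of species 1, 2, and 3 (leaves 1A, 2A, 3A, 1B, 2B, 3B)]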

- If we only consider 1A, 2B, and 3A...

[Figure: the same gene tree, with only 1A, 2B, and 3A considered]

- A natural model to consider is that of rooted trees

[Figure: rooted tree with the common ancestor at the root]

- Depending on the model, data from present-day species does not distinguish between different placements of the root

[Figure: two placements of the root, side by side]

- An unrooted tree represents the same phylogeny without the root node

- We can estimate the position of the root by introducing an outgroup:
- a set of species that are definitely distant from all the species of interest

[Figure: tree over Aardvark, Bison, Chimp, Dog, and Elephant with Falcon as the outgroup and the proposed root marked]

- Distance-based
- Input is a matrix of distances between species
- Can be the fraction of residues on which they disagree, or the negated alignment score between them, or …

- Character-based
- Examine each character (e.g., residue) separately
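
The first distance mentioned above, the fraction of residues on which two aligned sequences disagree, can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names `p_distance` and `distance_matrix` and the dictionary layout are assumptions made here, and the five sequences are the ones used in the parsimony example of this class.

```python
from itertools import combinations


def p_distance(s1, s2):
    """Fraction of aligned positions at which two sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def distance_matrix(seqs):
    """Map every unordered pair of names (in sorted order) to its distance."""
    return {
        (m, n): p_distance(seqs[m], seqs[n])
        for m, n in combinations(sorted(seqs), 2)
    }


seqs = {
    "A": "CAGGTA",
    "B": "CAGACA",
    "C": "CGGGTA",
    "D": "TGCACT",
    "E": "TGCGTA",
}
d = distance_matrix(seqs)
for pair in sorted(d):
    print(pair, round(d[pair], 3))
```

A matrix of this form is the input that the distance-based methods expect; a character-based method would instead look at each column of `seqs` on its own.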

Input: distance matrix between species

Outline:

- Cluster species together
- Initially clusters are singletons
- At each iteration combine two “closest” clusters to get a new one

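- Let $C_i$ and $C_j$ be clusters; define the distance between them to be the average distance between their members: $d(i,j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p,q)$
- When combining two clusters, $C_i$ and $C_j$, to form a new cluster $C_k$, then for every other cluster $C_l$: $d(k,l) = \frac{|C_i|\, d(i,l) + |C_j|\, d(j,l)}{|C_i| + |C_j|}$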

- UPGMA implicitly assumes that all distances measure time in the same way, i.e., that change accumulates at the same rate on every branch (a molecular clock)
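
The clustering outline above can be sketched in a few lines. This is a hedged illustration that assumes average-linkage distances between clusters; the function name `upgma`, the nested-tuple representation of clusters, and the toy distances are all choices made here for the example. Each new node is placed at height d(i,j)/2, which is where the equal-rate assumption enters.

```python
from itertools import combinations


def upgma(dist):
    """Cluster leaves by repeatedly merging the two closest clusters.

    dist maps pairs of leaf names to distances. Returns (tree, height),
    where tree is a nested tuple and height maps each cluster to its height.
    """
    d = {frozenset(pair): v for pair, v in dist.items()}
    leaves = sorted({x for pair in dist for x in pair})
    size = {x: 1 for x in leaves}
    height = {x: 0.0 for x in leaves}
    active = list(leaves)
    while len(active) > 1:
        # Pick the two closest clusters.
        i, j = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        k = (i, j)
        size[k] = size[i] + size[j]
        height[k] = d[frozenset((i, j))] / 2
        active = [c for c in active if c not in (i, j)]
        # Size-weighted average keeps d(k, l) equal to the mean leaf distance.
        for l in active:
            d[frozenset((k, l))] = (
                size[i] * d[frozenset((i, l))] + size[j] * d[frozenset((j, l))]
            ) / size[k]
        active.append(k)
    return active[0], height


# Toy distances that are consistent with a molecular clock (hypothetical).
dist = {
    ("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
}
tree, height = upgma(dist)
print(tree, height[tree])
```

On distances that do not satisfy the equal-rate assumption, the same code still returns a tree, but it need not be the correct one.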

[Figure: two trees over leaves 1, 2, 3, and 4]

- A weaker requirement is additivity
- In a “real” tree, the distance between two species is the sum of the edge lengths on the path between them

[Figure: leaves i, j, and k joined by edges of lengths a, b, and c]

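- Suppose input distances are additive
- For any three leaves i, j, k, let m be the node where their paths meet, and let a, b, c be the path lengths from m to i, j, k: $d(i,j) = a + b$, $d(i,k) = a + c$, $d(j,k) = b + c$
- Thus $d(k,m) = c = \frac{1}{2}\left(d(i,k) + d(j,k) - d(i,j)\right)$

[Figure: leaves i, j, and k meeting at internal node m, with edge lengths a, b, and c]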

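- Can we use this fact to construct trees?
- Let $D(i,j) = d(i,j) - (r_i + r_j)$, where $r_i = \frac{1}{|L| - 2} \sum_{k \in L} d(i,k)$ and L is the current set of leaves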

Theorem: if D(i,j) is minimal (among all pairs of leaves), then i and j are neighbors in the tree

[Figure: neighboring leaves i and j joined at a new node k, with another leaf m]

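Initialization:

- Set L to contain all leaves

Iteration:

- Choose i,j such that D(i,j) is minimal
- Create new node k, and set $d(k,m) = \frac{1}{2}\left(d(i,m) + d(j,m) - d(i,j)\right)$ for every other m in L, with edge lengths $d(i,k) = \frac{1}{2}\left(d(i,j) + r_i - r_j\right)$ and $d(j,k) = d(i,j) - d(i,k)$, where $r_i$ and $r_j$ are the averaged distances used in D(i,j)
- Remove i,j from L, and add k

Terminate: when |L| = 2, connect the two remaining nodes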
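
The iteration above can be sketched as follows. This is a minimal neighbor-joining illustration under the assumption that D(i,j) = d(i,j) - (r_i + r_j) with r_i the sum of distances from i divided by |L| - 2; the function name, the internal node labels `n0`, `n1`, … and the toy tree are assumptions of this example (leaf names must not clash with those labels).

```python
from itertools import combinations


def neighbor_joining(dist):
    """Return a list of edges (node, node, length) built by neighbor joining."""
    d = {frozenset(pair): v for pair, v in dist.items()}
    L = sorted({x for pair in dist for x in pair})
    edges = []
    next_id = 0
    while len(L) > 2:
        n = len(L)
        r = {i: sum(d[frozenset((i, m))] for m in L if m != i) / (n - 2) for i in L}
        # Choose the pair minimizing the corrected distance D(i, j).
        i, j = min(
            combinations(L, 2),
            key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        k = f"n{next_id}"
        next_id += 1
        d_ij = d[frozenset((i, j))]
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        L = [m for m in L if m not in (i, j)]
        for m in L:
            d[frozenset((k, m))] = 0.5 * (
                d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij
            )
        L.append(k)
    i, j = L
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


# Additive distances from a hypothetical tree with unequal branch lengths:
# A-x: 1, B-x: 4, x-y: 1, C-y: 2, D-y: 3.
dist = {
    ("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 5,
    ("B", "C"): 7, ("B", "D"): 8, ("C", "D"): 5,
}
edges = neighbor_joining(dist)
for edge in edges:
    print(edge)
```

Because the toy distances are additive, the recovered edge lengths match the tree they were generated from, even though the branches leading to A and B have very different lengths.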

- If we make strong assumptions on distances, we can reconstruct trees
- In real life, distances are not additive
- Sometimes they are close to additive

- We start with a multiple alignment
- Assumptions:
- All sequences are homologous
- Each position in alignment is homologous
- Positions evolve independently
- No gaps

- We seek to explain the evolution of each position in the alignment

- Character-based method
- A way to score trees (but not to build trees!)
Assumptions:

- Independence of characters (no interactions)
- Best tree is one where minimal changes take place

[Figure: tree over Aardvark, Bison, Chimp, Dog, and Elephant]

- What is the parsimony score of this tree for the following sequences?

A: CAGGTA

B: CAGACA

C: CGGGTA

D: TGCACT

E: TGCGTA

- Each column is scored separately.
- Let’s look at the first column:
- Minimal tree has one evolutionary change:

[Figure: the first column (C, C, C, T, T) on the tree, explained by a single change between T and C]

- How do we compute the Parsimony score for a given tree?
- Traditional Parsimony
- Each base change has a cost of 1

- Weighted Parsimony
- Each change is weighted by the score c(a,b)

- Solved independently for each position
- Linear time solution

[Figure: tree with leaves labeled a, g, a, a and internal nodes labeled with the candidate sets {a} and {a,g}]

Dynamic programming on the tree

$S(i,a)$ = cost of the tree rooted at i if i is labeled by a

Initialization:

- For each leaf i set $S(i,a) = 0$ if i is labeled by a, otherwise $S(i,a) = \infty$

Iteration:

- If k is a node with children i and j, then $S(k,a) = \min_b \left(S(i,b) + c(a,b)\right) + \min_b \left(S(j,b) + c(a,b)\right)$

Termination:

- Cost of the tree is $\min_a S(r,a)$ where r is the root

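- The score is evaluated on each position independently. Scores are then summed over all positions.
- If there are n nodes, m characters, and k possible values for each character, then the recursion above takes $O(nmk^2)$ time, since each of the k labels at a node minimizes over k labels at each child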
- By keeping traceback information, we can reconstruct most parsimonious values at each ancestor node
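
The dynamic program above can be sketched as a short recursion. This is an illustration under stated assumptions: trees are nested tuples of leaf names, the default cost c(a,b) is 1 for any change (traditional parsimony), and the topology (((A,B),C),(D,E)) is a hypothetical choice, since the tree from the slide figure is not available in the text. Traceback is omitted.

```python
import math


def sankoff(tree, leaf_state, alphabet, cost):
    """Return {a: S(node, a)} for the root of `tree` at a single position."""
    if isinstance(tree, str):  # leaf: cost 0 for its own label, else infinity
        return {a: 0 if leaf_state[tree] == a else math.inf for a in alphabet}
    left, right = tree
    s_i = sankoff(left, leaf_state, alphabet, cost)
    s_j = sankoff(right, leaf_state, alphabet, cost)
    return {
        a: min(s_i[b] + cost(a, b) for b in alphabet)
        + min(s_j[b] + cost(a, b) for b in alphabet)
        for a in alphabet
    }


def parsimony_score(tree, seqs, alphabet="ACGT", cost=lambda a, b: int(a != b)):
    """Sum the minimal root cost over all alignment positions."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for pos in range(n_sites):
        states = {name: s[pos] for name, s in seqs.items()}
        total += min(sankoff(tree, states, alphabet, cost).values())
    return total


seqs = {
    "A": "CAGGTA",
    "B": "CAGACA",
    "C": "CGGGTA",
    "D": "TGCACT",
    "E": "TGCGTA",
}
tree = ((("A", "B"), "C"), ("D", "E"))  # hypothetical topology
print(parsimony_score(tree, seqs))
```

Passing a different `cost` function, for example one that charges more for some changes than for others, gives weighted parsimony with the same recursion.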

Example: four species, ten sites

| Site | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Species 1 | A | G | G | G | T | A | A | C | T | G |
| Species 2 | A | C | G | A | T | T | A | T | T | A |
| Species 3 | A | T | A | A | T | T | G | T | C | T |
| Species 4 | A | A | T | G | T | T | G | T | C | G |

How many possible unrooted trees? For four species there are three: ((1,2),(3,4)), ((1,3),(2,4)), and ((1,4),(2,3)).

How many substitutions does each tree need (MP)? Each site is scored separately on each tree:

- Site 1 (A, A, A, A): no change is needed on any tree
- Site 2 (G, C, T, A): all four species differ, so every tree needs 3 changes
- Site 3 (G, G, A, T): every tree needs 2 changes
- Site 4 (G, A, A, G): ((1,4),(2,3)) needs 1 change; the other two trees need 2
- Site 10 (G, A, T, G): three distinct states, so every tree needs 2 changes

| Tree | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ((1,2),(3,4)) | 0 | 3 | 2 | 2 | 0 | 1 | 1 | 1 | 1 | 2 | 13 |
| ((1,3),(2,4)) | 0 | 3 | 2 | 2 | 0 | 1 | 2 | 1 | 2 | 2 | 15 |
| ((1,4),(2,3)) | 0 | 3 | 2 | 1 | 0 | 1 | 2 | 1 | 2 | 2 | 14 |

The most parsimonious tree is ((1,2),(3,4)).
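
The per-site counts in this example can be checked with a short sketch of traditional (unit-cost) parsimony using set operations at each internal node. The function names and the nested-tuple tree encoding are assumptions of this illustration; each unrooted four-species tree is rooted on its middle edge, which does not change the count.

```python
def fitch(tree, states):
    """Return (candidate state set, number of changes) for a rooted binary tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    (left_set, left_cost), (right_set, right_cost) = (
        fitch(tree[0], states),
        fitch(tree[1], states),
    )
    common = left_set & right_set
    if common:  # children can agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


seqs = {
    "1": "AGGGTAACTG",
    "2": "ACGATTATTA",
    "3": "ATAATTGTCT",
    "4": "AATGTTGTCG",
}
trees = {
    "((1,2),(3,4))": (("1", "2"), ("3", "4")),
    "((1,3),(2,4))": (("1", "3"), ("2", "4")),
    "((1,4),(2,3))": (("1", "4"), ("2", "3")),
}


def site_scores(tree):
    """Number of changes needed at each of the ten sites."""
    n_sites = len(seqs["1"])
    return [
        fitch(tree, {name: s[pos] for name, s in seqs.items()})[1]
        for pos in range(n_sites)
    ]


scores = {name: site_scores(tree) for name, tree in trees.items()}
for name, per_site in scores.items():
    print(name, per_site, sum(per_site))
best = min(scores, key=lambda name: sum(scores[name]))
print("most parsimonious:", best)
```

With only three candidate topologies the search is trivial here; the number of candidates grows quickly with the number of species.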

- Exhaustive Search
- Very intensive

- Branch and Bound
- A compromise

- Heuristic
- Fast
- Usually starts with NJ
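
To see why exhaustive search is very intensive, it helps to count the candidates. The number of unrooted bifurcating trees on N labeled leaves is the product of the odd numbers 3 · 5 · … · (2N-5). The sketch below computes it; the function name is an assumption of this example.

```python
def num_unrooted_trees(n):
    """Number of unrooted bifurcating trees with n labeled leaves (n >= 3)."""
    if n < 3:
        raise ValueError("need at least three leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd numbers 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Four species give 3 trees and five give 15, but ten species already give more than two million, which is why branch and bound or heuristic search is used in practice.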


- Topology: bifurcating
- Leaves - 1…N
- Internal nodes N+1…2N-2

- Lengths $t = \{t_i\}$ for each branch
- Phylogenetic tree = (Topology, Lengths) = $(T,t)$

- The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.
- Background probabilities: q(a)
- Mutation probabilities: P(a|b,t)
- Models for evolutionary mutations
- Jukes Cantor
- Kimura 2-parameter model

- Such models are used to derive the probabilities

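- A model for mutation rates

Jukes-Cantor:

- Mutation occurs at a constant rate
- Each nucleotide is equally likely to mutate into any other nucleotide with rate $\alpha$

Kimura 2-parameter:

- Allows a different rate for transitions (A ↔ G, C ↔ T) and transversions

- The rate matrix R is used to derive the mutation probability matrix S: $S(t + \epsilon) = S(t)\,(I + R\epsilon)$ for small $\epsilon$
- S is obtained by integration. For Jukes-Cantor, $S(t) = e^{Rt}$ has diagonal entries $r_t = \frac{1}{4}\left(1 + 3e^{-4\alpha t}\right)$ and off-diagonal entries $s_t = \frac{1}{4}\left(1 - e^{-4\alpha t}\right)$
- q can be obtained by setting t to infinity: for Jukes-Cantor, $q(a) = \frac{1}{4}$ for every nucleotide

[Figure: the four nucleotides A, C, G, T with the mutation rates between them]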

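- Both models satisfy the following properties:
- Lack of memory: $P(a \mid b, t + t') = \sum_c P(a \mid c, t)\, P(c \mid b, t')$
- Reversibility: there exist stationary probabilities $\{P_a\}$ s.t. $P_b\, P(a \mid b, t) = P_a\, P(b \mid a, t)$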
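
For the Jukes-Cantor model these properties can be checked numerically. The sketch below assumes the standard closed form of the Jukes-Cantor substitution matrix, with probability (1 + 3e^{-4αt})/4 of keeping a nucleotide and (1 - e^{-4αt})/4 of changing to each other one; the function names and the chosen times are assumptions of this example.

```python
import math

NUCLEOTIDES = "ACGT"


def jukes_cantor(t, alpha=1.0):
    """Substitution probabilities S[a][b] after time t under Jukes-Cantor."""
    same = 0.25 * (1 + 3 * math.exp(-4 * alpha * t))
    diff = 0.25 * (1 - math.exp(-4 * alpha * t))
    return {a: {b: same if a == b else diff for b in NUCLEOTIDES} for a in NUCLEOTIDES}


def compose(s1, s2):
    """Matrix product: evolve for one interval, then for another."""
    return {
        a: {b: sum(s1[a][c] * s2[c][b] for c in NUCLEOTIDES) for b in NUCLEOTIDES}
        for a in NUCLEOTIDES
    }


s_short, s_long = jukes_cantor(0.3), jukes_cantor(0.5)
s_total = jukes_cantor(0.8)
s_product = compose(s_short, s_long)

# Lack of memory: evolving for 0.3 and then 0.5 equals evolving for 0.8.
max_gap = max(
    abs(s_total[a][b] - s_product[a][b]) for a in NUCLEOTIDES for b in NUCLEOTIDES
)
print("largest difference:", max_gap)

# As t grows, every entry approaches the stationary probability 1/4.
print("S(100)[A][C] =", jukes_cantor(100.0)["A"]["C"])
```

Reversibility holds trivially here because the matrix is symmetric and the stationary distribution is uniform.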

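- Given P, q, the tree topology and branch lengths, we can compute: $P(x_1,\ldots,x_5 \mid T,t) = q(x_5)\, P(x_4 \mid x_5, t_4)\, P(x_3 \mid x_5, t_3)\, P(x_1 \mid x_4, t_1)\, P(x_2 \mid x_4, t_2)$

[Figure: tree with root $x_5$, whose children are $x_4$ (branch $t_4$) and $x_3$ (branch $t_3$); $x_4$ has children $x_1$ (branch $t_1$) and $x_2$ (branch $t_2$)]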

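- We are interested in the probability of observed data given tree and branch “lengths”: $P(x_1, x_2, x_3 \mid T, t) = \sum_{x_4, x_5} P(x_1, \ldots, x_5 \mid T, t)$
- Computed by summing over internal nodes
- This can be done efficiently using an upward traversal pass over the tree.

- Define $P(L_k \mid a)$ = prob. of leaves below node k given that $x_k = a$
- Init: for leaves: $P(L_k \mid a) = 1$ if $x_k = a$; 0 otherwise
- Iteration: if k is a node with children i and j, then $P(L_k \mid a) = \sum_{b} P(b \mid a, t_i)\, P(L_i \mid b)\; \sum_{c} P(c \mid a, t_j)\, P(L_j \mid c)$
- Termination: likelihood is $P(x \mid T, t) = \sum_a q(a)\, P(L_{root} \mid a)$

- Score each tree by $\log P(X \mid T, t) = \sum_u \log P(x[u] \mid T, t)$, where $x[u]$ is column u of the alignment
- Assumption of independent positions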
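
The upward pass above can be sketched as a recursion over the tree. This illustration assumes Jukes-Cantor substitution probabilities with a uniform background q(a) = 1/4; the tree encoding `(left, t_left, right, t_right)`, the function names, the branch lengths, and the three short sequences are hypothetical choices made for the example.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGT"


def jc_prob(b, a, t, alpha=1.0):
    """P(b | a, t) under Jukes-Cantor (an assumed substitution model)."""
    e = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * e) if a == b else 0.25 * (1 - e)


def conditional(tree, column):
    """Return {a: P(leaves below the node | node state a)} for one column.

    A tree is a leaf name or a tuple (left, t_left, right, t_right).
    """
    if isinstance(tree, str):
        return {a: 1.0 if column[tree] == a else 0.0 for a in NUCLEOTIDES}
    left, t_left, right, t_right = tree
    p_left = conditional(left, column)
    p_right = conditional(right, column)
    return {
        a: sum(jc_prob(b, a, t_left) * p_left[b] for b in NUCLEOTIDES)
        * sum(jc_prob(c, a, t_right) * p_right[c] for c in NUCLEOTIDES)
        for a in NUCLEOTIDES
    }


def column_likelihood(tree, column, q=0.25):
    """Sum over root states, weighted by the background probability."""
    return sum(q * p for p in conditional(tree, column).values())


def log_likelihood(tree, seqs):
    """Sum of per-column log likelihoods (independent positions)."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        math.log(column_likelihood(tree, {name: s[pos] for name, s in seqs.items()}))
        for pos in range(n_sites)
    )


# Root x5 with children x4 (t4 = 0.1) and x3 (t3 = 0.3);
# x4 has children x1 (t1 = 0.2) and x2 (t2 = 0.2).
tree = (("x1", 0.2, "x2", 0.2), 0.1, "x3", 0.3)
seqs = {"x1": "ACGT", "x2": "ACGA", "x3": "ACTT"}
print(log_likelihood(tree, seqs))

# Sanity check: the likelihoods of all 64 possible columns sum to one.
total = sum(
    column_likelihood(tree, dict(zip(("x1", "x2", "x3"), states)))
    for states in product(NUCLEOTIDES, repeat=3)
)
print(total)
```

Each node is visited once per column, so the cost is linear in the number of nodes rather than exponential in the number of internal nodes.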

- Branch lengths t can be optimized
- Gradient ascent
- EM

- We look for the highest scoring tree
- Exhaustive search
- Sampling methods (Metropolis)

- Perform search over possible topologies

[Figure: candidate topologies T1, T2, T3, T4, …, Tn; for each one, parametric optimization (EM) in parameter space, which can end in local maxima]

- Such procedures are computationally expensive!
- Computation of optimal parameters, per candidate, requires non-trivial optimization step.
- Spend non-negligible computation on a candidate, even if it is a low scoring one.
- In practice, such learning procedures can only consider small sets of candidate structures

Idea: Use parameters found for current topology to help evaluate new topologies.

Outline:

- Perform search in (T, t) space.
- Use EM-like iterations:
- E-step: use current solution to compute expected sufficient statistics for all topologies
- M-step: select new topology based on these expected sufficient statistics

$S_{i,j}$ is a matrix of the number of co-occurrences of each pair (a,b) in the taxa i, j

Define:

Find: topology T that maximizes

F is a linear function of $S_{i,j}$

Suppose we observe H, the ancestral sequences.

- Start with a tree (T0,t0)
- Compute
Formal justification:

- Define:
Theorem:

Consequence: improvement in expected score ⇒ improvement in likelihood

The steps of one iteration:

- Original tree (T0,t0)
- Compute: unlike standard EM for trees, we compute all possible pairwise statistics. Time: $O(N^2 M)$
- Weights: pairwise weights. This stage also computes the branch length for each pair (i,j)
- Find: maximum spanning tree, using a fast greedy procedure to find the tree T’. By construction, Q(T’,t’) ≥ Q(T0,t0); thus, l(T’,t’) ≥ l(T0,t0)
- Construct bifurcation T1: fix the tree by removing redundant nodes and adding nodes to break large degree. This operation preserves likelihood: l(T1,t’) = l(T’,t’) ≥ l(T0,t0)
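
The "fast greedy procedure" for the maximum spanning tree step can be illustrated with a Kruskal-style sketch. This shows only the generic graph step: the pairwise weights below are made-up numbers, not values derived from expected statistics, and the function name and data layout are assumptions of this example.

```python
def max_spanning_tree(nodes, weights):
    """Greedily pick the heaviest edges that do not close a cycle.

    weights maps pairs (u, v) to a weight. Returns the list of chosen edges.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the representative of x's component.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for (u, v), _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components
            parent[root_u] = root_v
            chosen.append((u, v))
    return chosen


# Hypothetical pairwise weights (for example, log-likelihood-like scores).
weights = {
    ("a", "b"): -1.0, ("a", "c"): -4.0, ("a", "d"): -5.0,
    ("b", "c"): -2.0, ("b", "d"): -6.0, ("c", "d"): -3.0,
}
edges = max_spanning_tree("abcd", weights)
print(edges, sum(weights[e] for e in edges))
```

The spanning tree found this way can contain nodes of any degree, which is why a separate step is needed to turn it into a bifurcating topology.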

- Often we don’t trust the tree found as the “correct” one.
- Bootstrapping:
- Sample (with replacement) n positions from the alignment
- Learn the best tree for each sample
- Look for tree features that are frequent across the trees.

- For some models this procedure approximates the tree posterior P(T| X1,…,Xn)
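
The bootstrapping steps above can be sketched as follows. The resampling and counting are generic; the tree builder is a deliberately small stand-in that, for four species, returns the pairing with the fewest parsimony changes. All function names, the seed, and the number of samples are assumptions of this illustration, and any tree-learning method could be passed in instead.

```python
import random
from collections import Counter

PAIRINGS = [
    (("1", "2"), ("3", "4")),
    (("1", "3"), ("2", "4")),
    (("1", "4"), ("2", "3")),
]


def bootstrap_columns(seqs, rng):
    """Sample n alignment positions with replacement (n = alignment length)."""
    n = len(next(iter(seqs.values())))
    cols = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(s[c] for c in cols) for name, s in seqs.items()}


def quartet_changes(a, b, c, d):
    """Parsimony changes at one site for the four-leaf tree ((a,b),(c,d))."""
    left = {a} & {b} or {a, b}
    right = {c} & {d} or {c, d}
    return (a != b) + (c != d) + (not (left & right))


def best_quartet_split(seqs):
    """Toy tree builder: the pairing with the lowest parsimony score."""
    def score(pairing):
        (a, b), (c, d) = pairing
        return sum(
            quartet_changes(w, x, y, z)
            for w, x, y, z in zip(seqs[a], seqs[b], seqs[c], seqs[d])
        )
    return [min(PAIRINGS, key=score)]


def bootstrap_support(seqs, build_tree, n_samples=200, seed=0):
    """Fraction of bootstrap samples in which each tree feature appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        counts.update(build_tree(bootstrap_columns(seqs, rng)))
    return {feature: c / n_samples for feature, c in counts.items()}


seqs = {
    "1": "AGGGTAACTG",
    "2": "ACGATTATTA",
    "3": "ATAATTGTCT",
    "4": "AATGTTGTCG",
}
support = bootstrap_support(seqs, best_quartet_split)
for pairing, fraction in support.items():
    print(pairing, fraction)
```

A feature that appears in most resampled trees is better supported by the data than one that appears only occasionally.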

New tree (T1,t1)

Thm: l(T1,t1) ≥ l(T0,t0)

These steps are then repeated until convergence