
Phylogeny II : Parsimony, ML, SEMPHY


[Figure: a phylogenetic tree with its parts labeled: branch, internal node, leaf]

- Topology: bifurcating
- Leaves - 1…N
- Internal nodes N+1…2N-2

- We start with a multiple alignment
- Assumptions:
- All sequences are homologous
- Each position in alignment is homologous
- Positions evolve independently
- No gaps

- Seek to explain the evolution of each position in the alignment

- Character-based method
- Assumptions:
- Independence of characters (no interactions)
- Best tree is one where minimal changes take place

- Suppose we have five species, such that three have ‘C’ and two ‘T’ at a specified position
- Minimal tree has one evolutionary change:

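[Figure: trees over the five species Aardvark, Bison, Chimp, Dog and Elephant, with leaves labeled C or T and a single change between T and C marked on one branch]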

- What is the parsimony score of a tree relating the following sequences?

A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA

- How do we compute the Parsimony score for a given tree?
- Traditional Parsimony
- Each base change has a cost of 1

- Weighted Parsimony
- Each change is weighted by the score c(a,b)

- Solved independently for each position
- Linear time solution

[Figure: a tree with leaves labeled a, g, a, a; internal nodes carry candidate state sets such as {a} and {a,g}]

Dynamic programming on the tree

Initialization:

- For each leaf $i$, set $S(i,a) = 0$ if $i$ is labeled by $a$; otherwise $S(i,a) = \infty$

Iteration:

- If $k$ is a node with children $i$ and $j$, then $S(k,a) = \min_b \big(S(i,b) + c(a,b)\big) + \min_b \big(S(j,b) + c(a,b)\big)$

Termination:

- The cost of the tree is $\min_a S(r,a)$, where $r$ is the root

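- The score is evaluated on each position independently. Scores are then summed over all positions.
- If there are n nodes, m characters, and k possible values for each character, then the complexity of the recursion above is O(nmk²), since each entry S(k,a) takes a minimum over k values of b; with unit costs, set-based counting brings this down to O(nmk)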
- By keeping traceback information, we can reconstruct most parsimonious values at each ancestor node
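The recursion above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the nested-tuple tree encoding, the function names and the topology `TREE` are assumptions made for the example; only the five sequences A to E come from the question earlier in these slides.

```python
import math

ALPHABET = "ACGT"

def unit_cost(a, b):
    """Traditional parsimony: every base change costs 1."""
    return 0 if a == b else 1

def sankoff_column(tree, leaf_state, cost=unit_cost):
    """Minimal cost of one alignment column on a rooted binary tree.

    A tree is either a leaf name (str) or a pair (left, right) of subtrees.
    """
    def S(node):
        if isinstance(node, str):  # initialization at the leaves
            return {a: 0 if leaf_state[node] == a else math.inf for a in ALPHABET}
        left, right = (S(child) for child in node)
        # iteration: S(k,a) = min_b(S(i,b)+c(a,b)) + min_b(S(j,b)+c(a,b))
        return {a: min(left[b] + cost(a, b) for b in ALPHABET)
                   + min(right[b] + cost(a, b) for b in ALPHABET)
                for a in ALPHABET}
    return min(S(tree).values())  # termination: min_a S(r,a)

def parsimony_score(tree, seqs, cost=unit_cost):
    """Sum the per-column costs over all positions."""
    length = len(next(iter(seqs.values())))
    return sum(sankoff_column(tree, {name: s[p] for name, s in seqs.items()}, cost)
               for p in range(length))

SEQS = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA", "D": "TGCACT", "E": "TGCGTA"}
TREE = (("A", "B"), ("C", ("D", "E")))  # assumed topology, for illustration only

print(parsimony_score(TREE, SEQS))
```

Passing a different `cost` function (for example one that charges more for some substitutions than others) turns the same code into weighted parsimony.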

Site        1 2 3 4 5 6 7 8 9 10

Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T A
Species 3 - A T A A T T G T C T
Species 4 - A A T G T T G T C G

How many possible unrooted trees?

- Four species give three possible unrooted bifurcating trees: ((1,2),(3,4)), ((1,3),(2,4)) and ((1,4),(2,3))

How many substitutions?

- MP: for each site, count the minimum number of substitutions each tree requires
- Site 1 (A, A, A, A): 0 on all three trees
- Site 2 (G, C, T, A): 3 on all three trees
- Site 3 (G, G, A, T): 2 on all three trees
- Site 4 (G, A, A, G): 2, 2 and 1; only ((1,4),(2,3)) groups the two G's together and the two A's together

Substitutions per site for each tree:

Site             1 2 3 4 5 6 7 8 9 10  Total
((1,2),(3,4))    0 3 2 2 0 1 1 1 1 2   13
((1,3),(2,4))    0 3 2 2 0 1 2 1 2 2   15
((1,4),(2,3))    0 3 2 1 0 1 2 1 2 2   14

- The most parsimonious tree is ((1,2),(3,4)), with 13 substitutions
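The per-site counting for the four-species example can be reproduced with a short script. The sketch below uses Fitch-style set counting with unit costs; the nested-tuple trees (each quartet rooted on its internal branch, which does not change the count) and all function names are assumptions made for illustration.

```python
ALIGNMENT = {
    1: "AGGGTAACTG",
    2: "ACGATTATTA",
    3: "ATAATTGTCT",
    4: "AATGTTGTCG",
}

# The three unrooted topologies for four species.
TREES = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]

def fitch_site(tree, leaf_state):
    """Minimum number of substitutions for one site (unit costs)."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, int):          # leaf: its observed base
            return {leaf_state[node]}
        left, right = visit(node[0]), visit(node[1])
        if left & right:                   # children agree on some state
            return left & right
        changes += 1                       # otherwise one substitution is needed
        return left | right

    visit(tree)
    return changes

def site_scores(tree, alignment):
    """Per-site substitution counts for one tree."""
    length = len(next(iter(alignment.values())))
    return [fitch_site(tree, {sp: seq[p] for sp, seq in alignment.items()})
            for p in range(length)]

for tree in TREES:
    scores = site_scores(tree, ALIGNMENT)
    print(tree, scores, sum(scores))

best = min(TREES, key=lambda tree: sum(site_scores(tree, ALIGNMENT)))
print("most parsimonious:", best)
```

Running it ranks ((1,2),(3,4)) as the most parsimonious of the three trees.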

- Exhaustive Search
- Very intensive

- Branch and Bound
- A compromise

- Heuristic
- Fast
- Usually starts with NJ
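Exhaustive search is very intensive because the number of unrooted bifurcating topologies on N labeled leaves is (2N-5)!! = 3 · 5 · … · (2N-5), which grows super-exponentially. A small sketch (the function name is hypothetical):

```python
def num_unrooted_trees(n_leaves):
    """Number of unrooted bifurcating topologies on n_leaves labeled leaves."""
    if n_leaves < 3:
        return 1
    count = 1
    for odd in range(3, 2 * n_leaves - 4, 2):  # 3, 5, ..., 2N-5
        count *= odd
    return count

for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
# 4 species give 3 trees, 5 give 15, and 10 already give 2027025
```

This is why branch and bound or heuristic searches are used for anything beyond a handful of species.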


- Topology: bifurcating
- Leaves - 1…N
- Internal nodes N+1…2N-2

- Lengths t = {ti} for each branch
- Phylogenetic tree = (Topology, Lengths) = (T,t)

- The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.
- Background probabilities: q(a)
- Mutation probabilities: P(a|b, t)
- Models for evolutionary mutations
- Jukes Cantor
- Kimura 2-parameter model

- Such models are used to derive the probabilities

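- A model for mutation rates: the rate matrix $R$

Jukes Cantor:

- Mutation occurs at a constant rate
- Each nucleotide is equally likely to mutate into any other nucleotide with rate $\alpha$

Kimura 2-parameter model:

- Allows a different rate for transitions and transversions

- The rate matrix $R$ is used to derive the mutation probability matrix $S(t)$, whose entries are $P(a \mid b, t)$: for small $\epsilon$, $S(\epsilon) \approx I + R\epsilon$
- $S$ is obtained by integration. For Jukes Cantor: $P(a \mid a, t) = \tfrac{1}{4}\,(1 + 3e^{-4\alpha t})$ and $P(a \mid b, t) = \tfrac{1}{4}\,(1 - e^{-4\alpha t})$ for $a \neq b$
- $q$ can be obtained by setting $t$ to infinity; for Jukes Cantor this gives $q(a) = 1/4$ for every nucleotide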


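- Both models satisfy the following properties:
- Lack of memory: $P(a \mid b, t + t') = \sum_c P(a \mid c, t)\, P(c \mid b, t')$
- Reversibility: there exist stationary probabilities $\{P_a\}$ s.t. $P_a\, P(b \mid a, t) = P_b\, P(a \mid b, t)$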
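These properties can be checked numerically for Jukes Cantor. The sketch below assumes the standard closed form for the Jukes Cantor mutation probabilities, with `alpha` as the per-nucleotide substitution rate; the function names are hypothetical.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t, alpha=1.0):
    """P(a | b, t) under Jukes Cantor: probability of seeing a after time t, starting from b."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * e) if a == b else 0.25 * (1.0 - e)

def lack_of_memory_gap(t1, t2):
    """Largest violation of P(a|b,t1+t2) = sum_c P(a|c,t1) P(c|b,t2) over all a, b."""
    return max(abs(sum(jc_prob(a, c, t1) * jc_prob(c, b, t2) for c in BASES)
                   - jc_prob(a, b, t1 + t2))
               for a in BASES for b in BASES)

def reversibility_gap(t, q=0.25):
    """Largest violation of q(a) P(b|a,t) = q(b) P(a|b,t) with uniform q."""
    return max(abs(q * jc_prob(b, a, t) - q * jc_prob(a, b, t))
               for a in BASES for b in BASES)

print(lack_of_memory_gap(0.2, 0.5))   # numerically zero
print(reversibility_gap(0.3))         # exactly zero: the matrix is symmetric
print(jc_prob("A", "C", 50.0))        # approaches the stationary value 1/4
```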

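- Given P, q, the tree topology and branch lengths, we can compute the joint probability of all nodes. For the tree in the figure: $P(x_1,\dots,x_5 \mid T, t) = q(x_5)\, P(x_4 \mid x_5, t_4)\, P(x_3 \mid x_5, t_3)\, P(x_1 \mid x_4, t_1)\, P(x_2 \mid x_4, t_2)$

[Figure: rooted tree with root $x_5$; $x_5$ has children $x_4$ (branch $t_4$) and $x_3$ (branch $t_3$); $x_4$ has children $x_1$ (branch $t_1$) and $x_2$ (branch $t_2$)]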

- We are interested in the probability of the observed data given the tree and branch “lengths”: $P(x_1, x_2, x_3 \mid T, t) = \sum_{x_4, x_5} P(x_1,\dots,x_5 \mid T, t)$
- Computed by summing over internal nodes
- This can be done efficiently using a single upward traversal of the tree.

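- Define $P(L_k \mid a)$ = prob. of leaves below node $k$ given that $x_k = a$
- Init: for leaves, $P(L_k \mid a) = 1$ if $x_k = a$; 0 otherwise
- Iteration: if $k$ is a node with children $i$ and $j$, then $P(L_k \mid a) = \sum_b P(b \mid a, t_i)\, P(L_i \mid b) \cdot \sum_c P(c \mid a, t_j)\, P(L_j \mid c)$
- Termination: the likelihood is $\sum_a q(a)\, P(L_r \mid a)$, where $r$ is the root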
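The upward pass can be sketched as follows. This is a minimal illustration under stated assumptions: Jukes Cantor mutation probabilities with uniform q, a tree encoded as nested pairs of (child, branch length), and made-up branch lengths for a three-leaf tree with leaves x1, x2, x3. None of these names or numbers come from the slides.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t, alpha=1.0):
    """P(a | b, t) under Jukes Cantor (assumed model for this sketch)."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * e) if a == b else 0.25 * (1.0 - e)

def conditional(node, column):
    """Return {a: P(L_k | a)} for the subtree rooted at node.

    A node is a leaf name (str) or a pair ((left, t_left), (right, t_right)).
    """
    if isinstance(node, str):  # init: indicator of the observed base
        return {a: 1.0 if column[node] == a else 0.0 for a in BASES}
    (left, t_left), (right, t_right) = node
    L, R = conditional(left, column), conditional(right, column)
    return {a: sum(jc_prob(b, a, t_left) * L[b] for b in BASES)
               * sum(jc_prob(c, a, t_right) * R[c] for c in BASES)
            for a in BASES}

def site_likelihood(root, column, q=0.25):
    """Termination: sum_a q(a) P(L_root | a)."""
    return sum(q * p for p in conditional(root, column).values())

def log_likelihood(root, seqs):
    """Sum of per-position log-likelihoods (positions assumed independent)."""
    length = len(next(iter(seqs.values())))
    return sum(math.log(site_likelihood(root, {n: s[p] for n, s in seqs.items()}))
               for p in range(length))

# x4 is the parent of x1 and x2; the root x5 is the parent of x4 and x3.
X4 = (("x1", 0.1), ("x2", 0.2))       # illustrative branch lengths t1, t2
TREE = ((X4, 0.4), ("x3", 0.3))       # illustrative branch lengths t4, t3

print(site_likelihood(TREE, {"x1": "A", "x2": "C", "x3": "G"}))
print(log_likelihood(TREE, {"x1": "ACGT", "x2": "ACGA", "x3": "ACTT"}))
```

Each node is visited once per position, so the cost is linear in the number of nodes, compared with the exponential cost of summing over every assignment to the internal nodes.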

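- Score each tree by its log-likelihood, $l(T, t) = \sum_m \log P(x^m_1,\dots,x^m_N \mid T, t)$, summing over alignment positions $m$
- This relies on the assumption of independent positions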

- Branch lengths t can be optimized
- Gradient ascent
- EM

- We look for the highest scoring tree
- Exhaustive
- Sampling methods (Metropolis)

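- Perform search over possible topologies $T_1, T_2, \dots, T_n$

[Figure: candidate topologies $T_1, \dots, T_n$, each with its own parameter space; parametric optimization (EM) within each space can end in local maxima]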

- Such procedures are computationally expensive!
- Computation of optimal parameters, per candidate, requires non-trivial optimization step.
- Spend non-negligible computation on a candidate, even if it is a low scoring one.
- In practice, such learning procedures can only consider small sets of candidate structures

Idea: Use parameters found for current topology to help evaluate new topologies.

Outline:

- Perform search in (T, t) space.
- Use EM-like iterations:
- E-step: use current solution to compute expected sufficient statistics for all topologies
- M-step: select new topology based on these expected sufficient statistics

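Suppose we observe H, the ancestral sequences.

$S_{i,j}$ is a matrix of the number of co-occurrences of each pair $(a,b)$ in the taxa $i,j$: $S_{i,j}(a,b) = \sum_m 1\{x_i^m = a,\ x_j^m = b\}$

Define: $w_{i,j} = \max_t F(S_{i,j}, t)$, where $F(S_{i,j}, t) = \sum_{a,b} S_{i,j}(a,b) \log \frac{P(a \mid b, t)}{q(a)}$

Find: topology $T$ that maximizes $\sum_{(i,j) \in T} w_{i,j}$

$F$ is a linear function of $S_{i,j}$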
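To make the pairwise statistics concrete, the sketch below counts co-occurrences for two fully observed sequences and scores a candidate branch length. It rests on assumptions that are not spelled out in the slides: the local score is taken to be F(S,t) = Σ S(a,b) · log(P(a|b,t)/q(a)), the model is Jukes Cantor with uniform q, and the function names are hypothetical. Under these assumptions the maximizing branch length has a closed form.

```python
import math
from collections import Counter

def cooccurrence(seq_i, seq_j):
    """S_ij: how often each pair (a, b) appears at the same position in taxa i and j."""
    return Counter(zip(seq_i, seq_j))

def local_score(S, t, alpha=1.0):
    """Assumed local score F(S, t) under Jukes Cantor; linear in the counts S.

    Requires t > 0 whenever S contains mismatching pairs.
    """
    e = math.exp(-4.0 * alpha * t)
    ratio = {True: 1.0 + 3.0 * e, False: 1.0 - e}  # P(a|b,t) / q(a) for a == b, a != b
    return sum(n * math.log(ratio[a == b]) for (a, b), n in S.items())

def best_length(S, alpha=1.0):
    """Branch length maximizing local_score (closed form for Jukes Cantor)."""
    total = sum(S.values())
    p = sum(n for (a, b), n in S.items() if a != b) / total  # fraction of mismatches
    if p == 0.0:
        return 0.0
    if p >= 0.75:
        return math.inf
    return -math.log(1.0 - 4.0 * p / 3.0) / (4.0 * alpha)

def weight(S, alpha=1.0):
    """w_ij = max_t F(S_ij, t)."""
    return local_score(S, best_length(S, alpha), alpha)

S = cooccurrence("AGGGTAACTG", "ACGATTATTA")
print(best_length(S), weight(S))
```

Because the score is linear in the counts, the same functions can be applied unchanged to expected counts when the ancestral sequences are not observed.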

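- Start with a tree $(T_0, t_0)$
- Compute the expected statistics $E[S_{i,j} \mid D, T_0, t_0]$, where $D$ denotes the observed sequences

Formal justification:

- Define: $Q(T, t) = E[\,l(T, t : D, H) \mid D, T_0, t_0\,]$, the expected complete-data score
- Theorem: $l(T, t) - l(T_0, t_0) \ge Q(T, t) - Q(T_0, t_0)$
- Consequence: improvement in expected score ⇒ improvement in likelihood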

Compute: expected pairwise statistics under the original tree $(T_0, t_0)$

- Unlike standard EM for trees, we compute all possible pairwise statistics
- Time: $O(N^2 M)$

Weights: pairwise weights for every pair $(i,j)$, computed from the expected statistics

- This stage also computes the branch length for each pair $(i,j)$

Find: max. spanning tree $T'$ over the pairwise weights

- Fast greedy procedure to find the tree
- By construction: $Q(T', t') \ge Q(T_0, t_0)$
- Thus, $l(T', t') \ge l(T_0, t_0)$
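One common greedy procedure for a maximum spanning tree is Kruskal's algorithm: take edges in decreasing weight order and keep each edge that joins two separate components. The sketch below is a generic illustration with made-up weights; the slides do not say which greedy procedure is used, and the function name is hypothetical.

```python
def max_spanning_tree(n_nodes, weights):
    """Kruskal's algorithm on a complete graph.

    weights maps an edge (i, j) with 0 <= i, j < n_nodes to its weight.
    Returns the list of chosen edges.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:          # edge joins two components: keep it
            parent[root_i] = root_j
            tree.append((i, j))
    return tree

# Illustrative pairwise weights on four nodes (not taken from the slides).
WEIGHTS = {(0, 1): 5.0, (0, 2): 1.0, (0, 3): 2.0,
           (1, 2): 4.0, (1, 3): 3.0, (2, 3): 6.0}

edges = max_spanning_tree(4, WEIGHTS)
print(edges, sum(WEIGHTS[e] for e in edges))
```

The resulting tree need not be bifurcating, and sequences that should be leaves may end up as internal nodes, so the topology still has to be repaired afterwards.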

Construct bifurcation $T_1$: fix the tree

- Remove redundant nodes
- Add nodes to break large degree
- This operation preserves likelihood: $l(T_1, t') = l(T', t') \ge l(T_0, t_0)$

- Often we don’t trust the tree found as the “correct” one.
- Bootstrapping:
- Sample (with replacement) n positions from the alignment
- Learn the best tree for each sample
- Look for tree features which are frequent in all trees.

- For some models this procedure approximates the tree posterior P(T| X1,…,Xn)
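The resampling step can be sketched as below. The tree-building step is left as a user-supplied function `build_tree` (a hypothetical placeholder) that returns the features of the best tree for a sample, for example its splits; everything else uses only the standard library.

```python
import random
from collections import Counter

def bootstrap_sample(seqs, rng):
    """Resample alignment columns with replacement, keeping the alignment length."""
    n = len(next(iter(seqs.values())))
    columns = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[c] for c in columns) for name, seq in seqs.items()}

def bootstrap_support(seqs, build_tree, replicates=100, seed=0):
    """Fraction of bootstrap trees in which each feature appears.

    build_tree(sample) must return an iterable of hashable tree features.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        counts.update(build_tree(bootstrap_sample(seqs, rng)))
    return {feature: c / replicates for feature, c in counts.items()}

SEQS = {
    "Species 1": "AGGGTAACTG",
    "Species 2": "ACGATTATTA",
    "Species 3": "ATAATTGTCT",
    "Species 4": "AATGTTGTCG",
}

sample = bootstrap_sample(SEQS, random.Random(0))
for name, seq in sample.items():
    print(name, seq)
```

Whole columns are resampled, never individual characters, so every column of a bootstrap sample is a column of the original alignment.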

New tree $(T_1, t_1)$

- Thm: $l(T_1, t_1) \ge l(T_0, t_0)$
- These steps (compute, weights, find, construct bifurcation) are then repeated until convergence