

Phylogeny II : Parsimony, ML, SEMPHY


Phylogenetic Tree

[Figure: tree diagram labeling a branch, an internal node, and a leaf]

  • Topology: bifurcating

    • Leaves - 1…N

    • Internal nodes N+1…2N-2


Character Based Methods

  • We start with a multiple alignment

  • Assumptions:

    • All sequences are homologous

    • Each position in alignment is homologous

    • Positions evolve independently

    • No gaps

  • Seek to explain the evolution of each position in the alignment


Parsimony

  • Character-based method

    Assumptions:

  • Independence of characters (no interactions)

  • Best tree is one where minimal changes take place


Simple Example

  • Suppose we have five species, such that three have ‘C’ and two ‘T’ at a specified position

  • Minimal tree has one evolutionary change:

[Figure: the minimal tree over the five species (C, T, C, T, C); the two T species are grouped together, so a single T↔C change on their shared branch explains the data]


Another Example

[Figure: tree over the five species Aardvark, Bison, Chimp, Dog, Elephant]

  • What is the parsimony score of:

A: CAGGTA

B: CAGACA

C: CGGGTA

D: TGCACT

E: TGCGTA


Evaluating Parsimony Scores

  • How do we compute the Parsimony score for a given tree?

  • Traditional Parsimony

    • Each base change has a cost of 1

  • Weighted Parsimony

    • Each change is weighted by the score c(a,b)


Traditional Parsimony

[Figure: the set-based procedure on a small tree with leaves labeled a, g, a; each leaf gets a singleton set such as {a}, and a parent of leaves a and g gets the union {a,g} since the intersection is empty]

  • Solved independently for each position

  • Linear time solution
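This set-intersection procedure is Fitch's algorithm; it can be sketched in a few lines of Python (a minimal illustration — the tree encoding and function name are mine, not from the slides):

```python
# Fitch's algorithm (traditional parsimony) for one alignment column.
# Trees are nested 2-tuples; leaves are single characters.

def fitch(tree):
    """Return (candidate_state_set, substitution_count) for one position."""
    if isinstance(tree, str):            # leaf: its own state, zero changes
        return {tree}, 0
    left, right = tree
    s1, c1 = fitch(left)
    s2, c2 = fitch(right)
    if s1 & s2:                          # intersection non-empty: no change needed
        return s1 & s2, c1 + c2
    return s1 | s2, c1 + c2 + 1          # else take the union: one extra substitution

# The five-species example from the slides: three C's and two T's.
# Grouping the two T's yields a single change.
tree = ((('T', 'T'), 'C'), ('C', 'C'))
states, cost = fitch(tree)
print(cost)   # 1
```

Each internal node is visited once and set operations over {A,C,G,T} are constant-time, which gives the linear-time bound stated above.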


Evaluating Weighted Parsimony

Dynamic programming on the tree

Initialization:

  • For each leaf i set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞

    Iteration:

  • if k is a node with children i and j, then S(k,a) = min_b [S(i,b) + c(a,b)] + min_b [S(j,b) + c(a,b)]

    Termination:

  • cost of the tree is min_a S(r,a), where r is the root
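The DP above (Sankoff's algorithm) can be sketched as follows, assuming the unit cost c(a,b) = 0 if a = b and 1 otherwise; the tree encoding and function name are illustrative:

```python
# Weighted parsimony (Sankoff DP) for one alignment column.
INF = float('inf')
ALPHABET = 'ACGT'

def cost(a, b):
    return 0 if a == b else 1

def sankoff(tree):
    """Return {state a: S(root, a)} for one alignment column."""
    if isinstance(tree, str):                       # leaf initialization
        return {a: (0 if a == tree else INF) for a in ALPHABET}
    left, right = tree
    Si, Sj = sankoff(left), sankoff(right)
    # S(k,a) = min_b [S(i,b)+c(a,b)] + min_b [S(j,b)+c(a,b)]
    return {a: min(Si[b] + cost(a, b) for b in ALPHABET)
             + min(Sj[b] + cost(a, b) for b in ALPHABET)
            for a in ALPHABET}

S = sankoff((('A', 'G'), 'A'))
print(min(S.values()))   # 1: one substitution explains leaves A, G, A
```

With an arbitrary cost table c(a,b), the same recursion applies unchanged; only `cost` would differ.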


Cost of Evaluating Parsimony

  • The score is evaluated on each position independently; the per-position scores are then summed over all positions.

  • If there are n nodes, m characters, and k possible values per character, the complexity is O(nmk) for traditional parsimony (weighted parsimony costs O(nmk²), since each state minimizes over k child states)

  • By keeping traceback information, we can reconstruct most parsimonious values at each ancestor node


Maximum Parsimony

1 2 3 4 5 6 7 8 9 10

Species 1 - A G G G T A A C T G

Species 2 - A C G A T T A T T A

Species 3 - A T A A T T G T C T

Species 4 - A A T G T T G T C G

How many possible unrooted trees?
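The number of unrooted bifurcating trees on n labeled leaves is (2n−5)!! = 3·5·…·(2n−5); a quick sanity check (the function name is mine):

```python
# Count unrooted bifurcating trees on n labeled leaves: (2n-5)!!,
# the product of the odd numbers 3, 5, ..., 2n-5.

def num_unrooted_trees(n):
    count = 1
    for k in range(3, 2 * n - 4, 2):   # 3, 5, ..., 2n-5
        count *= k
    return count

print(num_unrooted_trees(4))    # 3
print(num_unrooted_trees(10))   # 2027025
```

For the four species above there are thus only 3 topologies to score, but the count grows super-exponentially with n.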




Maximum Parsimony

How many substitutions does the most parsimonious (MP) tree require?

Maximum Parsimony

[Slide: position 1 of the alignment (A, A, A, A) requires 0 substitutions on each of the three candidate topologies; running scores 0 / 0 / 0]


Maximum Parsimony

[Slide: position 2 (G, C, T, A) requires 3 substitutions on every topology; running scores 0 3 / 0 3 / 0 3]


Maximum Parsimony

[Slide: the three candidate topologies drawn for a position with leaf states 1-G, 2-C, 3-T, 4-A; each topology requires 3 substitutions]


Maximum Parsimony

[Slide: position 3 (G, G, A, T) adds 2 substitutions on each topology; running scores 0 3 2 / 0 3 2 / 0 3 2]


Maximum Parsimony

[Slide: position 4 (G, A, A, G) costs 2 substitutions on the first two topologies but only 1 on the third; running scores 0 3 2 2 / 0 3 2 2 / 0 3 2 1]


Maximum Parsimony

[Slide: the three topologies drawn for position 4 (1-G, 2-A, 3-A, 4-G); they require 2, 2, and 1 substitutions respectively, the last by pairing the two G's and the two A's]


Maximum Parsimony

Per-position scores and totals for the three topologies:

Topology 1: 0 3 2 2 0 1 1 1 1 3 → 14
Topology 2: 0 3 2 2 0 1 2 1 2 3 → 16
Topology 3: 0 3 2 1 0 1 2 1 2 3 → 15


Maximum Parsimony

1 2 3 4 5 6 7 8 9 10

1 - A G G G T A A C T G

2 - A C G A T T A T T A

3 - A T A A T T G T C T

4 - A A T G T T G T C G

0 3 2 2 0 1 1 1 1 3 → total 14, the most parsimonious of the three topologies


Searching for Trees


Searching for the Optimal Tree

  • Exhaustive Search

    • Very intensive: the number of topologies grows super-exponentially

  • Branch and Bound

    • A compromise

  • Heuristic

    • Fast

    • Usually starts with NJ


Phylogenetic Tree Assumptions

[Figure: tree diagram labeling a branch, an internal node, and a leaf]

  • Topology: bifurcating

    • Leaves - 1…N

    • Internal nodes N+1…2N-2

  • Lengths t = {t_i} for each branch

  • Phylogenetic tree = (Topology, Lengths) = (T,t)


Probabilistic Methods

  • The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.

  • Background probabilities: q(a)

  • Mutation probabilities: P(a|b, t)

  • Models for evolutionary mutations

    • Jukes Cantor

    • Kimura 2-parameter model

  • Such models are used to derive the probabilities


Jukes Cantor Model

  • A model for mutation rates

  • Mutation occurs at a constant rate

  • Each nucleotide is equally likely to mutate into any other nucleotide, with rate α.
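Under Jukes Cantor the mutation probabilities have the standard closed form P(same) = 1/4 + 3/4·e^(−4αt) and P(different) = 1/4 − 1/4·e^(−4αt); a minimal sketch (function name and parameter values are illustrative):

```python
import math

# Jukes-Cantor transition probability after branch length t,
# with per-nucleotide mutation rate a.

def jc_prob(same, a, t):
    e = math.exp(-4 * a * t)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

a, t = 1.0, 0.1
row = [jc_prob(True, a, t)] + [jc_prob(False, a, t)] * 3
print(round(sum(row), 12))   # 1.0: each row of S(t) is a distribution
```

As t → 0 the matrix tends to the identity, and as t → ∞ every entry tends to 1/4, matching the stationary background q.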


Kimura 2-Parameter Model

  • Allows a different rate for transitions and transversions.
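The standard closed form of the K2P transition probabilities, with transition rate a and transversion rate b (parameter names and values here are illustrative), can be sketched as:

```python
import math

# Kimura 2-parameter probabilities after branch length t:
# transitions (A<->G, C<->T) occur at rate a, transversions at rate b.

def k2p_probs(a, b, t):
    e1 = math.exp(-4 * b * t)
    e2 = math.exp(-2 * (a + b) * t)
    same = 0.25 + 0.25 * e1 + 0.5 * e2
    transition = 0.25 + 0.25 * e1 - 0.5 * e2      # e.g. A -> G
    transversion = 0.25 - 0.25 * e1               # e.g. A -> C
    return same, transition, transversion

same, ts, tv = k2p_probs(a=2.0, b=1.0, t=0.1)
# Each row has one "same" entry, one transition, two transversions:
print(abs(same + ts + 2 * tv - 1.0) < 1e-12)   # True
```

Setting a = b recovers the Jukes Cantor probabilities, so JC is the special case of K2P with equal rates.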


Mutation Probabilities

  • The rate matrix R is used to derive the mutation probability matrix S: S(t) = e^(Rt)

  • S is obtained by integration. For Jukes Cantor: S_aa(t) = 1/4 + 3/4·e^(−4αt), and S_ab(t) = 1/4 − 1/4·e^(−4αt) for a ≠ b

  • q can be obtained by setting t to infinity: every entry tends to 1/4, so q(a) = 1/4


Mutation Probabilities

[Figure: 4×4 mutation matrix over the states A, C, G, T]

  • Both models satisfy the following properties:

    • Lack of memory: P(a|b, t1+t2) = Σ_c P(a|c, t2)·P(c|b, t1)

    • Reversibility: there exist stationary probabilities {P_a} s.t. P_a·P(b|a, t) = P_b·P(a|b, t)


Probabilistic Approach

  • Given P, q, the tree topology and branch lengths, we can compute the joint probability of an assignment to all nodes; for the tree below: P(x1,…,x5) = q(x5)·P(x4|x5, t4)·P(x1|x4, t1)·P(x2|x4, t2)·P(x3|x5, t3)

[Figure: tree with leaves x1, x2, x3, internal nodes x4, x5, and branch lengths t1, t2, t3, t4]


Computing the Tree Likelihood

  • We are interested in the probability of the observed data given the tree and branch “lengths”: P(x1,…,xN | T, t)

  • Computed by summing over the unobserved internal nodes: P(x1,…,xN | T, t) = Σ over x_{N+1},…,x_{2N−2} of P(x1,…,x_{2N−2} | T, t)

  • This can be done efficiently using an upward traversal pass over the tree.


Tree Likelihood Computation

  • Define P(Lk|a) = prob. of the leaves below node k, given that xk = a

  • Init: for leaves, P(Lk|a) = 1 if xk = a; 0 otherwise

  • Iteration: if k is a node with children i and j, then P(Lk|a) = [Σ_b P(b|a, ti)·P(Li|b)] · [Σ_b P(b|a, tj)·P(Lj|b)]

  • Termination: the likelihood is Σ_a q(a)·P(Lr|a), where r is the root
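The upward (pruning) pass above can be sketched as follows, assuming Jukes Cantor mutation probabilities and a uniform background q(a) = 1/4; the tree encoding and names are mine:

```python
import math

# Likelihood of one alignment column via the upward traversal.
# Nodes are ('leaf', char) or (left, t_left, right, t_right).
ALPHABET = 'ACGT'

def jc(b, a, t, rate=1.0):
    """Jukes-Cantor P(b | a, t)."""
    e = math.exp(-4 * rate * t)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def prune(node):
    """Return {a: P(L_k | a)} for the subtree rooted at this node."""
    if node[0] == 'leaf':                           # init: indicator on the leaf state
        return {a: 1.0 if a == node[1] else 0.0 for a in ALPHABET}
    left, tl, right, tr = node
    Pi, Pj = prune(left), prune(right)
    # P(Lk|a) = [sum_b P(b|a,ti) P(Li|b)] * [sum_b P(b|a,tj) P(Lj|b)]
    return {a: sum(jc(b, a, tl) * Pi[b] for b in ALPHABET)
             * sum(jc(b, a, tr) * Pj[b] for b in ALPHABET)
            for a in ALPHABET}

root = (('leaf', 'A'), 0.1, ('leaf', 'A'), 0.1)
P = prune(root)
likelihood = sum(0.25 * P[a] for a in ALPHABET)     # termination: sum_a q(a) P(Lr|a)
print(0.0 < likelihood < 1.0)   # True
```

Over multiple positions, the per-column likelihoods multiply (positions are assumed independent).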


Maximum Likelihood (ML)

  • Score each tree by its log-likelihood: l(T, t) = Σ_m log P(x1[m],…,xN[m] | T, t)

    • The sum over positions m follows from the assumption of independent positions

  • Branch lengths t can be optimized

    • Gradient ascent

    • EM

  • We look for the highest scoring tree

    • Exhaustive

    • Sampling methods (Metropolis)


Optimal Tree Search

  • Perform a search over possible topologies T1, T2, …, Tn

  • For each candidate topology, run parametric optimization (EM) over the parameter space of branch lengths; the optimization can get stuck in local maxima

[Figure: candidate topologies T1…Tn, each with its own parameter space and local maxima]


Computational Problem

  • Such procedures are computationally expensive!

  • Computing optimal parameters for each candidate requires a non-trivial optimization step.

  • We therefore spend non-negligible computation on every candidate, even low-scoring ones.

  • In practice, such learning procedures can only consider small sets of candidate structures


Structural EM

Idea: Use parameters found for the current topology to help evaluate new topologies.

Outline:

  • Perform search in (T, t) space.

  • Use EM-like iterations:

    • E-step: use current solution to compute expected sufficient statistics for all topologies

    • M-step: select new topology based on these expected sufficient statistics


The Complete-Data Scenario

Suppose we observe H, the ancestral sequences.

Define: S_{i,j}, a matrix of the # of co-occurrences of each pair (a,b) in the taxa i, j

Find: the topology T that maximizes the complete-data score F

F is a linear function of the S_{i,j}


Expected Likelihood

  • Start with a tree (T0, t0)

  • Compute the expected complete-data log-likelihood Q(T, t) = E[log P(X, H | T, t) | X, T0, t0]

    Formal justification:

  • Define Q(T, t) as above

    Theorem: l(T, t) − l(T0, t0) ≥ Q(T, t) − Q(T0, t0)

    Consequence: improvement in expected score ⇒ improvement in likelihood


Algorithm Outline

  • Original tree: (T0, t0)

  • Compute: the expected pairwise statistics (weights)

  • Unlike standard EM for trees, we compute all possible pairwise statistics

  • Time: O(N²M)


Algorithm Outline

  • Compute: the pairwise weights from the expected statistics

  • This stage also computes the branch length for each pair (i, j)


Algorithm Outline

  • Find: a maximum spanning tree over the pairwise weights (a fast greedy procedure)

  • Construct bifurcation T1

  • By construction: Q(T', t') ≥ Q(T0, t0)

  • Thus, l(T', t') ≥ l(T0, t0)
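The maximum-spanning-tree step can be sketched with Kruskal's greedy procedure over the pairwise weights (the weights and function name here are illustrative, not SEMPHY's actual expected statistics):

```python
# Maximum spanning tree by Kruskal's algorithm with a union-find:
# repeatedly take the heaviest edge that joins two components.

def max_spanning_tree(n, weights):
    """weights: dict {(i, j): w} over node pairs. Returns chosen edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    edges = []
    for (i, j), w in sorted(weights.items(), key=lambda e: -e[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                        # keep the edge only if it joins two components
            parent[ri] = rj
            edges.append((i, j))
    return edges

w = {(0, 1): 5.0, (0, 2): 1.0, (1, 2): 4.0, (2, 3): 3.0, (1, 3): 2.0}
print(sorted(max_spanning_tree(4, w)))   # [(0, 1), (1, 2), (2, 3)]
```

Since the complete-data score is a linear function of the pairwise statistics, maximizing it over trees reduces to exactly this kind of spanning-tree problem.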


Algorithm Outline

  • Fix the tree: remove redundant nodes; add nodes to break nodes of large degree

  • This operation preserves likelihood: l(T1, t') = l(T', t') ≥ l(T0, t0)


Assessing Trees: the Bootstrap

  • Often we don’t trust the tree found as the “correct” one.

  • Bootstrapping:

    • Sample (with replacement) n positions from the alignment

    • Learn the best tree for each sample

    • Look for tree features which are frequent across the sampled trees.

  • For some models this procedure approximates the tree posterior P(T| X1,…,Xn)
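Sampling alignment columns with replacement can be sketched as follows (the data layout and function name are illustrative):

```python
import random

# Bootstrap one replicate alignment: sample n column positions with
# replacement and rebuild every sequence from those columns.

def bootstrap_sample(alignment, rng=random):
    """alignment: dict name -> sequence string, all of equal length."""
    n = len(next(iter(alignment.values())))
    cols = [rng.randrange(n) for _ in range(n)]        # positions, with replacement
    return {name: ''.join(seq[c] for c in cols)
            for name, seq in alignment.items()}

aln = {'1': 'AGGGTAACTG', '2': 'ACGATTATTA',
       '3': 'ATAATTGTCT', '4': 'AATGTTGTCG'}
rep = bootstrap_sample(aln, random.Random(0))
print(all(len(s) == 10 for s in rep.values()))   # True

# One would then re-estimate the tree on each replicate and count how
# often each clade (tree feature) appears across the replicates.
```

Because whole columns are resampled, each replicate respects the per-position independence assumption used throughout.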


Algorithm Outline

  • Construct bifurcation T1 (the new tree)

  • Compute: new branch lengths t1

  • Thm: l(T1, t1) ≥ l(T0, t0)

  • These steps are then repeated until convergence

