Generalized Tree Alignment: The Deferred Path Heuristic


Stinus Lindgreen


Overview:

What is a phylogeny?

The Generalized Tree Alignment problem

Sequence Graphs and their algorithms

The Deferred Path Heuristic


Phylogeny:

Describes an evolutionary model

  • Common ancestor
  • Mutations happen all the time
    • Insertions, deletions, substitutions, translocations, inversions, duplications …

Most mutations happen in DNA replication

  • Corrected by cell mechanisms

Mutations accumulate → new species diverge

Only mutations in germline (sex) cells are inherited


Phylogenetic inference:

Given n sequences, build a phylogenetic tree T

Most methods base T on a multiple alignment

Conversely: multiple alignments are often based on guide trees

Can we solve both problems at the same time?


Describes the evolutionary relationship between species

Note the root


... or within a single taxon (here, human enterovirus 71)

The Problem:

Given n sequences s1,…,sn …

Multiple Alignment:

Build an arrangement A of the sequences by inserting gaps so that homologous bases are placed in the same column

Phylogenetic Inference:

Build a (binary) tree T with s1,…,sn at the leaves and hypothetical ancestors sn+1,…,sn+k at the internal nodes, describing their evolutionary relationships

Generalized Tree Alignment:

Combines the two. The problem we want to solve is:

Given: A set of n sequences s1,…,sn from n different species (could be DNA, RNA or protein – for simplicity we focus on DNA)

Problem: Generate an unrooted phylogenetic tree T with sequences s1,…,sn in the leaves and a multiple alignment A of these sequences

Placing the root is not trivial and is best left to biologists.

The problem is proven to be MAX SNP-hard (Wang and Jiang, 1994)

→ No polynomial-time approximation scheme exists unless P = NP

No polynomial-time exact algorithm is known for NP-hard problems

→ In practice, the best we can hope for is a heuristic

The given algorithm runs in time O(n2.ln)

  • n: The number of sequences
  • l: Their maximum length.
Sequence graphs (Hein, 1989):

Recall pairwise alignment.

Traceback “spells” possible optimal alignments:

Sequence graphs:

Make graph with alignment columns as edge labels

→ represents all optimal alignments
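
As a reminder of how the traceback spells out optimal alignments, here is a minimal sketch. The unit costs (mismatch 1, gap 1) and the function name `all_optimal_alignments` are assumptions made for illustration; the slides do not fix a scoring scheme.

```python
def all_optimal_alignments(p, q, mismatch=1, gap=1):
    """Return (cost, alignments); each alignment is a list of columns (a, b)."""
    n, m = len(p), len(q)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if p[i - 1] == q[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub,   # align p[i-1] with q[j-1]
                          D[i - 1][j] + gap,       # p[i-1] against a gap
                          D[i][j - 1] + gap)       # q[j-1] against a gap

    def trace(i, j):
        # Follow every predecessor cell that explains the optimal score.
        if i == 0 and j == 0:
            yield []
            return
        if i > 0 and j > 0:
            sub = 0 if p[i - 1] == q[j - 1] else mismatch
            if D[i][j] == D[i - 1][j - 1] + sub:
                for cols in trace(i - 1, j - 1):
                    yield cols + [(p[i - 1], q[j - 1])]
        if i > 0 and D[i][j] == D[i - 1][j] + gap:
            for cols in trace(i - 1, j):
                yield cols + [(p[i - 1], "-")]
        if j > 0 and D[i][j] == D[i][j - 1] + gap:
            for cols in trace(i, j - 1):
                yield cols + [("-", q[j - 1])]

    return D[n][m], list(trace(n, m))


cost, alignments = all_optimal_alignments("AT", "TA")
print(cost, len(alignments))  # 2 3
```

Listing every optimal alignment explicitly can blow up quickly, which is one motivation for storing them as a graph with columns on the edges.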

We will get back to that shortly …

Right now, we want to represent sequences

Let us introduce sequence graphs.

For instance, s = ACTGTA is represented by:

Sequence graphs:

More formally:

  • Directed, acyclic graph.
  • Edge labels l from alphabet Σ. Here, Σ = {A,C,G,T,-}
  • Source s: The unique node with no incoming edges
  • Sink t: The unique node with no outgoing edges.
  • Each path from s to t spells a sequence.
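
The definition above can be sketched as a small data structure. The class name `SequenceGraph`, the helper `linear_graph`, and the choice that a '-' edge contributes no character to the spelled sequence are all assumptions for this illustration.

```python
from collections import defaultdict


class SequenceGraph:
    """Directed acyclic graph whose edges carry labels from {A, C, G, T, -}."""

    def __init__(self, source, sink):
        self.source = source
        self.sink = sink
        self.out = defaultdict(list)   # node -> [(label, next_node), ...]

    def add_edge(self, u, label, v):
        self.out[u].append((label, v))

    def sequences(self):
        """Set of sequences spelled by all source-to-sink paths."""
        spelled = set()
        stack = [(self.source, "")]
        while stack:
            node, prefix = stack.pop()
            if node == self.sink:
                spelled.add(prefix)
                continue
            for label, nxt in self.out[node]:
                # Assumption: a '-' edge contributes no character.
                stack.append((nxt, prefix if label == "-" else prefix + label))
        return spelled


def linear_graph(seq):
    """One path s -> ... -> t with one edge per letter."""
    g = SequenceGraph(0, len(seq))
    for i, ch in enumerate(seq):
        g.add_edge(i, ch, i + 1)
    return g


g = linear_graph("ACTGTA")
print(g.sequences())  # {'ACTGTA'}
```
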
Sequence graphs:

A sequence graph represents the set of sequences spelled by all paths from s to t:

Sequence graphs:

Any single sequence can be represented by a linear sequence graph

Any set of k sequences can be represented by making k paths from s to t

A given sequence s’ can be represented by more than one path
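
A rough sketch of the "k paths" construction, assuming non-empty, gap-free input sequences. The edge-list representation and the names `graph_from_sequences` and `spelled` are hypothetical. Feeding in the same sequence twice shows that one sequence can be spelled by more than one path.

```python
def graph_from_sequences(seqs):
    """Return (edges, source, sink); each sequence gets its own s-to-t path."""
    source, sink = "s", "t"
    edges = []                                  # (from_node, label, to_node)
    for k, seq in enumerate(seqs):
        prev = source
        for pos, ch in enumerate(seq):
            last = pos == len(seq) - 1
            nxt = sink if last else (k, pos)    # inner nodes private to path k
            edges.append((prev, ch, nxt))
            prev = nxt
    return edges, source, sink


def spelled(edges, source, sink):
    """Sorted list of sequences, one entry per source-to-sink path."""
    out = {}
    for u, label, v in edges:
        out.setdefault(u, []).append((label, v))
    found, stack = [], [(source, "")]
    while stack:
        node, prefix = stack.pop()
        if node == sink:
            found.append(prefix)
        for label, nxt in out.get(node, []):
            stack.append((nxt, prefix + label))
    return sorted(found)


paths = spelled(*graph_from_sequences(["ACG", "ATG", "ACG"]))
print(paths)  # ['ACG', 'ACG', 'ATG']
```

Three paths, but only two distinct sequences. A more compact graph would share nodes between paths; this sketch does not attempt that.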

We can now represent sequences – but can we align them?

Aligning sequence graphs:

Dynamic programming algorithm inspired by basic pairwise alignment.

Pairwise Alignment:

  • Given two sequences p and q
  • Advance one letter in p, then sweep through q, finding the optimal “partial alignments”

Sequence Graphs:

  • Given two sequence graphs G1 and G2
  • Each node may have several edges to choose from
Aligning sequence graphs:

Fill in a |V1| × |V2| score matrix

For each pair of nodes i from G1 and j from G2:

Should we:

  • Align the two characters we got by following e1 into i and e2 into j?
  • Stay in G1 and only move in G2?
  • Stay in G2 and only move in G1?
  • Or have we already found a better path into i and j?
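
The four questions above map onto one recurrence per matrix cell. The following is a sketch under assumptions not stated in the slides: unit costs, nodes numbered 0..n-1 in topological order with source 0 and sink n-1, and graphs given as edge lists `(u, label, v)`. The names `cost`, `incoming`, `align_graphs` and `linear` are illustrative.

```python
INF = float("inf")


def cost(a, b, mismatch=1, gap=1):
    """Score one alignment column [a, b]; '-' marks a gap."""
    if a == b:
        return 0
    return gap if "-" in (a, b) else mismatch


def incoming(n, edges):
    """For each node, the list of (predecessor, label) pairs entering it."""
    inc = [[] for _ in range(n)]
    for u, label, v in edges:
        inc[v].append((u, label))
    return inc


def align_graphs(n1, edges1, n2, edges2):
    in1, in2 = incoming(n1, edges1), incoming(n2, edges2)
    D = [[INF] * n2 for _ in range(n1)]
    D[0][0] = 0
    for i in range(n1):
        for j in range(n2):
            best = D[i][j]                      # best path found so far
            for u, a in in1[i]:
                best = min(best, D[u][j] + cost(a, "-"))      # move in G1 only
                for w, b in in2[j]:
                    best = min(best, D[u][w] + cost(a, b))    # align e1 with e2
            for w, b in in2[j]:
                best = min(best, D[i][w] + cost("-", b))      # move in G2 only
            D[i][j] = best
    return D


def linear(seq):
    """Linear sequence graph as (node_count, edge_list)."""
    return len(seq) + 1, [(i, ch, i + 1) for i, ch in enumerate(seq)]


D = align_graphs(*linear("ACTGTA"), *linear("ACGTA"))
print(D[-1][-1])  # 1
```

For two linear graphs this reduces to ordinary pairwise alignment; the extra loops only matter when a node has more than one incoming edge.
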
Optimal Alignment Graphs:

Now we need a way to remember the optimal alignments

Recall graphs from before:

  • Directed, acyclic graphs
  • Nodes s and t defined as before
  • Edge labels of the form [la,lb] where la,lb∊Σ

Backtrack through the matrix and consider each possible combination of edges.
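
One way the backtracking step could look, as a self-contained sketch. It assumes unit costs, topologically numbered nodes (source 0, sink n-1), and edge lists `(u, label, v)`; OAG nodes are pairs (i, j) and the function name `optimal_alignment_graph` is made up for this example.

```python
INF = float("inf")


def cost(a, b, mismatch=1, gap=1):
    if a == b:
        return 0
    return gap if "-" in (a, b) else mismatch


def incoming(n, edges):
    inc = [[] for _ in range(n)]
    for u, label, v in edges:
        inc[v].append((u, label))
    return inc


def optimal_alignment_graph(n1, edges1, n2, edges2):
    """Return (optimal cost, set of OAG edges ((pi, pj), (la, lb), (i, j)))."""
    in1, in2 = incoming(n1, edges1), incoming(n2, edges2)

    def moves(i, j):
        # Every way to step into cell (i, j): (prev_i, prev_j, la, lb).
        for u, a in in1[i]:
            yield u, j, a, "-"
            for w, b in in2[j]:
                yield u, w, a, b
        for w, b in in2[j]:
            yield i, w, "-", b

    D = [[INF] * n2 for _ in range(n1)]
    D[0][0] = 0
    for i in range(n1):
        for j in range(n2):
            if (i, j) != (0, 0):
                D[i][j] = min((D[pi][pj] + cost(a, b)
                               for pi, pj, a, b in moves(i, j)), default=INF)

    # Backtrack from the sink pair, keeping every edge combination that
    # achieves the optimal score of the cell it enters.
    oag = set()
    end = (n1 - 1, n2 - 1)
    stack, seen = [end], {end}
    while stack:
        i, j = stack.pop()
        for pi, pj, a, b in moves(i, j):
            if D[pi][pj] + cost(a, b) == D[i][j]:
                oag.add(((pi, pj), (a, b), (i, j)))
                if (pi, pj) not in seen:
                    seen.add((pi, pj))
                    stack.append((pi, pj))
    return D[n1 - 1][n2 - 1], oag


def linear(seq):
    return len(seq) + 1, [(i, ch, i + 1) for i, ch in enumerate(seq)]


score, oag = optimal_alignment_graph(*linear("AT"), *linear("TA"))
print(score, len(oag))  # 2 8
```

The eight edges encode the three optimal alignments of AT and TA without listing them one by one.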

Optimal Alignment Graphs:

An example of an OAG:

This one represents the alignments:

We denote such a graph A*

We have to convert the OAGs back to SGs

Optimal Alignment Graphs:

This is done easily by considering the edge labels:

If la = lb: Make a single edge in the SG with label la

If la ≠ lb: Make two edges in the SG: One with label la and one with label lb
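
The two rules translate almost directly into code. In this sketch an OAG is assumed to be a list of edges `(u, (la, lb), v)`; the function name `oag_to_sg` is illustrative.

```python
def oag_to_sg(oag_edges):
    """Convert OAG edges (u, (la, lb), v) into SG edges (u, label, v)."""
    sg_edges = []
    for u, (la, lb), v in oag_edges:
        sg_edges.append((u, la, v))        # always keep la
        if lb != la:
            sg_edges.append((u, lb, v))    # add lb only if it differs
    return sg_edges


oag = [(0, ("A", "A"), 1), (1, ("C", "G"), 2), (2, ("-", "T"), 3)]
print(oag_to_sg(oag))
# [(0, 'A', 1), (1, 'C', 2), (1, 'G', 2), (2, '-', 3), (2, 'T', 3)]
```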

The graph from before turns into the SG:

Summing up Sequence Graphs:

The final graph represents all sequences that give an optimal alignment between G1 and G2

We can:

  • Represent a set of sequences by a sequence graph
  • Align two such graphs producing a new SG

We can now get on with the main algorithm

The basic idea:
  • Start by comparing all sequences
    • Find a closest pair.
  • Represent all sequences giving the optimal solution
    • Defer the choice of a single sequence
  • Repeat, but this time include the set of sequences
  • In the end: Choose a single sequence and backtrack

This shows a need for:

  • A compact representation of many sequences
  • An algorithm for aligning sets of sequences
The Deferred Path Heuristic:

Similar to Kruskal’s algorithm for finding MSTs:

From sequences s1,…,sn, initialize n SGs G1,…,Gn.

Until only two SGs remain:

  • Align all pairs and choose a closest pair Gi and Gj
  • Create A*(Gi,Gj) and convert A* into an SG Gk.
  • Replace Gi and Gj with Gk

Note that we remember all candidate sequences
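
The merge loop on its own can be sketched as follows. To keep the example short, the graph machinery is swapped out for toy stand-ins: each "SG" is a frozenset of candidate strings, the distance is the smallest edit distance between candidates, and merging is a plain set union. None of this is the real sequence-graph alignment; only the Kruskal-like control flow is the point. All names are hypothetical.

```python
from itertools import combinations


def edit_distance(p, q):
    """Unit-cost edit distance, one row at a time."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(q, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def toy_distance(g1, g2):
    # Stand-in for aligning two sequence graphs.
    return min(edit_distance(a, b) for a in g1 for b in g2)


def toy_merge(g1, g2):
    # Stand-in for building A*(Gi, Gj) and converting it to an SG.
    return g1 | g2


def merge_until_two(graphs, distance, merge):
    """Repeatedly join a closest pair; return the last two (subtree, graph)."""
    clusters = [(i, g) for i, g in enumerate(graphs)]
    while len(clusters) > 2:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: distance(clusters[ab[0]][1], clusters[ab[1]][1]))
        (ta, ga), (tb, gb) = clusters[a], clusters[b]
        merged = ((ta, tb), merge(ga, gb))       # all candidates are kept
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters


seqs = ["AAAA", "AAAT", "CCCC", "CCCG"]
left, right = merge_until_two([frozenset([s]) for s in seqs],
                              toy_distance, toy_merge)
print(left[0], right[0])  # (0, 1) (2, 3)
```

Passing `distance` and `merge` as parameters is a design choice of this sketch, so that the graph-based versions could be dropped in without touching the loop.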

The Deferred Path Heuristic:

When only two SGs Gi and Gj remain:

  • Align them and connect them in T
  • Choose some optimal alignment
    • This gives si and sj at the roots of the two subtrees.
  • Backtrack through the subtrees
    • At each step: Align sk to the underlying SGs.
    • Choose some optimal alignment
The Deferred Path Heuristic:

We defer our choice of actual sequences until the last moment, thereby enlarging our solution space: