Generalized Tree Alignment: The Deferred Path Heuristic




### Generalized Tree Alignment:The Deferred Path Heuristic

Stinus Lindgreen


Overview:

What is a phylogeny?

The Generalized Tree Alignment problem

Sequence Graphs and their algorithms

The Deferred Path Heuristic

Phylogeny:

Describes evolutionary model

• Common ancestor
• Mutations happen all the time
• Insertions, deletions, substitutions, translocations, inversions, duplications …

Most mutations happen during DNA replication

• Most are corrected by cellular repair mechanisms

Mutations accumulate → new species diverge

Only mutations in sex cells (the germ line) are inherited

Phylogeny:

Phylogenetic inference:

Given n sequences, build a phylogenetic tree T

Most methods base T on a multiple alignment

Likewise: multiple alignments are often based on guide trees

Can we solve both problems at the same time?

Phylogeny:

Describes the evolutionary relationship between species

Notice root

Phylogeny:

... or within a single taxon (here, human enterovirus 71)

The Problem:

Given n sequences s1,…,sn …

Multiple Alignment:

Make an ordering A of the sequences by inserting gaps such that homologous bases are put in the same column

Phylogenetic Inference:

Build a (binary) tree T with s1,…,sn in the leaves and possible ancestors sn+1,…,sn+k in internal nodes describing their evolutionary connection

Generalized Tree Alignment:

Combines the two. The problem we want to solve is:

Given: A set of n sequences s1,…,sn from n different species (could be DNA, RNA or protein – for simplicity we focus on DNA)

Problem: Generate an unrooted phylogenetic tree T with sequences s1,…,sn in the leaves and a multiple alignment A of these sequences

Placing the root is not trivial and is best left to biologists.

The problem is MAX SNP-hard

→ No polynomial-time approximation scheme exists unless P = NP

Exact solutions to NP-hard problems are intractable

→ The best we can hope for is a heuristic

The given algorithm runs in time O(n2.ln)

• n: The number of sequences
• l: Their maximum length.

Sequence graphs (Hein, 1989):

Recall pairwise alignment.

Traceback “spells” possible optimal alignments:
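As a reminder of what the traceback does, here is a minimal sketch that fills the usual pairwise distance matrix and then enumerates every optimal alignment by following all tied traceback choices. The unit mismatch and gap costs and the function name `all_optimal_alignments` are assumptions made for illustration; the slides do not state a scoring scheme.

```python
def all_optimal_alignments(p, q, mismatch=1, gap=1):
    """Return (distance, sorted list of all optimal alignments of p and q).

    Assumed scoring: match 0, mismatch 1, gap 1 (not taken from the slides).
    """
    n, m = len(p), len(q)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if p[i - 1] == q[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub,
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)

    def back(i, j):
        # Follow every predecessor cell that explains the optimal score.
        if i == 0 and j == 0:
            yield ("", "")
            return
        if i > 0 and j > 0:
            sub = 0 if p[i - 1] == q[j - 1] else mismatch
            if D[i][j] == D[i - 1][j - 1] + sub:
                for a, b in back(i - 1, j - 1):
                    yield (a + p[i - 1], b + q[j - 1])
        if i > 0 and D[i][j] == D[i - 1][j] + gap:
            for a, b in back(i - 1, j):
                yield (a + p[i - 1], b + "-")
        if j > 0 and D[i][j] == D[i][j - 1] + gap:
            for a, b in back(i, j - 1):
                yield (a + "-", b + q[j - 1])

    return D[n][m], sorted(back(n, m))
```

For example, `all_optimal_alignments("AA", "A")` finds two alignments of cost 1, one with the gap in the first column and one with the gap in the second. Listing them explicitly can blow up quickly, which is the motivation for storing them in a graph instead.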

Sequence graphs:

Make graph with alignment columns as edge labels

→ represents all optimal alignments

We will get back to that shortly …

Right now, we want to represent sequences

Let us introduce sequence graphs.

For instance, s = ACTGTA is represented by:

Sequence graphs:

More formally:

• Directed, acyclic graph.
• Edge labels l from alphabet Σ. Here, Σ={A,C,G,T,-}
• Source s: The unique node with no incoming edges
• Sink t: The unique node with no outgoing edges.
• Each path from s to t spells a sequence.
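
The definition above can be sketched as a small data structure. The class name `SequenceGraph`, the helper `linear_graph`, and the choice to drop `-` labels when spelling a path are assumptions for illustration, not part of the original presentation.

```python
from collections import defaultdict


class SequenceGraph:
    """Hypothetical minimal sequence graph: a DAG with labelled edges."""

    def __init__(self, source, sink):
        self.source = source
        self.sink = sink
        self.edges = defaultdict(list)  # node -> list of (label, next node)

    def add_edge(self, u, label, v):
        self.edges[u].append((label, v))

    def spelled(self):
        """Set of sequences spelled by all paths from source to sink.

        Assumption: a '-' label contributes no character to the sequence.
        """
        def walk(node):
            if node == self.sink:
                yield ""
                return
            for label, nxt in self.edges.get(node, []):
                piece = "" if label == "-" else label
                for rest in walk(nxt):
                    yield piece + rest

        return set(walk(self.source))


def linear_graph(seq):
    """Linear sequence graph representing the single sequence `seq`."""
    g = SequenceGraph(0, len(seq))
    for i, ch in enumerate(seq):
        g.add_edge(i, ch, i + 1)
    return g
```

With this sketch, `linear_graph("ACTGTA").spelled()` contains only `"ACTGTA"`, while adding a second edge between two nodes makes the graph represent more than one sequence.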
Sequence graphs:

Represents a set of sequences given by all paths from s to t:

Sequence graphs:

Any single sequence can be represented by a linear sequence graph

Any set of k sequences can be represented by making k paths from s to t

A given sequence s’ can be represented by more than one path

We can now represent sequences – but can we align them?

Aligning sequence graphs:

Dynamic programming algorithm inspired by basic pairwise alignment.

Pairwise Alignment:

• Given two sequences p and q
• Move one letter in p and move through q finding the optimal “partial alignments”

Sequence Graphs:

• Given two sequence graphs G1 and G2
• We can have many outgoing edges to choose from

Aligning sequence graphs:

Fill in a |V1| × |V2| score matrix

For each pair of nodes i from G1 and j from G2:

Should we:

• Align the two characters we got by following e1 into i and e2 into j?
• Stay in G1 and only move in G2?
• Stay in G2 and only move in G1?
• Or have we already found a better path into i and j?
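
The four questions above correspond to the cases of a dynamic programming recurrence over node pairs. The following is a minimal sketch under several assumptions that are not in the slides: nodes are integers numbered so that every edge goes from a lower to a higher number, node 0 is the source and the highest node is the sink, costs are unit mismatch and gap penalties, and the graphs contain no `-` labelled edges. The names `align_graphs` and `linear_edges` are hypothetical.

```python
def linear_edges(seq):
    """(edges, node count) of the linear graph spelling `seq`."""
    return [(i, ch, i + 1) for i, ch in enumerate(seq)], len(seq) + 1


def align_graphs(g1, g2, mismatch=1, gap=1):
    """Cost of the best alignment between any path of g1 and any path of g2.

    Each graph is (edges, n) with edges given as (u, label, v) and u < v.
    """
    edges1, n1 = g1
    edges2, n2 = g2
    into1 = [[] for _ in range(n1)]  # incoming edges per node of G1
    for u, a, v in edges1:
        into1[v].append((u, a))
    into2 = [[] for _ in range(n2)]  # incoming edges per node of G2
    for u, b, v in edges2:
        into2[v].append((u, b))

    INF = float("inf")
    D = [[INF] * n2 for _ in range(n1)]
    D[0][0] = 0
    for i in range(n1):
        for j in range(n2):
            if i == 0 and j == 0:
                continue
            best = INF
            for pi, a in into1[i]:
                # Move in G1 only: the character on e1 is aligned to a gap.
                best = min(best, D[pi][j] + gap)
                for pj, b in into2[j]:
                    # Align the characters on e1 (into i) and e2 (into j).
                    best = min(best, D[pi][pj] + (0 if a == b else mismatch))
            for pj, b in into2[j]:
                # Move in G2 only: the character on e2 is aligned to a gap.
                best = min(best, D[i][pj] + gap)
            D[i][j] = best
    return D[n1 - 1][n2 - 1]
```

Taking the minimum over all incoming edge combinations is what answers the last question: a cell keeps only the best path found into the node pair (i, j). On two linear graphs this reduces to ordinary pairwise alignment.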
Optimal Alignment Graphs:

Now we need a way to remember the optimal alignments

Recall graphs from before:

• Directed, acyclic graphs
• Nodes s and t defined as before
• Edge labels of the form [la,lb] where la,lb∊Σ

Backtrack through the matrix and consider each possible combination of edges.

Optimal Alignment Graphs:

An example of an OAG:

This one represents the alignments:

We denote such a graph A*

We have to convert the OAGs back to SGs

Optimal Alignment Graphs:

This is done easily by considering the edge labels:

If la = lb: Make a single edge in the SG with label la

If la ≠ lb: Make two edges in the SG: One with label la and one with label lb
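
The conversion rule can be sketched in a few lines. The edge encoding `(u, (la, lb), v)` and the function name `oag_to_sg` are assumptions chosen for this illustration.

```python
def oag_to_sg(oag_edges):
    """Convert OAG edges (u, (la, lb), v) into SG edges (u, label, v)."""
    sg = set()
    for u, (la, lb), v in oag_edges:
        sg.add((u, la, v))
        # When la == lb this adds the same edge again; the set keeps one copy.
        sg.add((u, lb, v))
    return sorted(sg)
```

For instance, an OAG edge labelled `[A,A]` becomes a single `A` edge, while an edge labelled `[C,T]` becomes two parallel edges labelled `C` and `T`.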

The graph from before turns into the SG:

Summing up Sequence Graphs:

Final graph represents all sequences giving an optimal alignment between G1 and G2

We can:

• Represent a set of sequences by a sequence graph
• Align two such graphs producing a new SG

We can now get on with the main algorithm

The basic idea:

• Start by comparing all sequences
• Find a closest pair.
• Represent all sequences giving the optimal solution
• Defer the choice of a single sequence
• Repeat, but this time include the set of sequences
• In the end: Choose a single sequence and backtrack

This shows a need for:

• A compact representation of many sequences
• An algorithm for aligning sets of sequences

The Deferred Path Heuristic:

Similar to Kruskal’s algorithm for finding MSTs:

From sequences s1,…,sn, initialize n SGs G1,…,Gn.

Until only two SGs remain:

• Align all pairs and choose a closest pair Gi and Gj
• Create A*(Gi,Gj) and convert A* into an SG Gk.
• Replace Gi and Gj with Gk

Note that we remember all candidate sequences
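
The merge loop can be sketched as follows. To keep the example short and runnable, each SG is replaced by a plain set of candidate strings, the graph alignment is replaced by the minimum edit distance between candidates, and the construction of A* is replaced by keeping the candidates that take part in a closest pair. These stand-ins, and the names `set_distance`, `merge` and `deferred_path_loop`, are assumptions for illustration only and are much cruder than real sequence graphs.

```python
from itertools import combinations


def edit_distance(p, q):
    """Unit-cost edit distance between two strings."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(q, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def set_distance(A, B):
    """Stand-in for aligning two SGs: best distance over all candidate pairs."""
    return min(edit_distance(a, b) for a in A for b in B)


def merge(A, B):
    """Stand-in for A*(Gi, Gj): keep candidates that occur in a closest pair."""
    d = set_distance(A, B)
    return frozenset(s for a in A for b in B
                     if edit_distance(a, b) == d for s in (a, b))


def deferred_path_loop(seqs):
    """Merge closest clusters until two remain; returns (candidates, subtree) pairs."""
    clusters = [(frozenset([s]), s) for s in seqs]
    while len(clusters) > 2:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: set_distance(clusters[ij[0]][0],
                                               clusters[ij[1]][0]))
        (A, tree_a), (B, tree_b) = clusters[i], clusters[j]
        merged = (merge(A, B), (tree_a, tree_b))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

The point of the sketch is the control flow: every merged cluster keeps a whole set of candidates rather than committing to one sequence, and the nested tuples record the tree topology built so far.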

The Deferred Path Heuristic:

When only two SGs Gi and Gj remain:

• Align them and connect them in T
• Choose some optimal alignment
• This gives si and sj in the root of the two subtrees.
• Backtrack through the subtrees
• At each step: Align sk to the underlying SGs.
• Choose some optimal alignment

The Deferred Path Heuristic:

We defer our choice of actual sequences until the last moment, thereby enlarging our solution space: