Generalized Tree Alignment: The Deferred Path Heuristic

By jude


Presentation Transcript

Overview:

What is a phylogeny?

The Generalized Tree Alignment problem

Sequence Graphs and their algorithms

The Deferred Path Heuristic

Phylogeny:

Describes evolutionary model

- Common ancestor
- Mutations happen all the time
- Insertions, deletions, substitutions, translocations, inversions, duplications …

Most mutations happen during DNA replication

- Corrected by cell mechanisms

Mutations accumulate → new species diverge

Only mutations in sex cells are inherited (obviously)

Phylogeny:

Phylogenetic inference:

Given n sequences, build a phylogenetic tree T

Most methods base T on a multiple alignment

Likewise: Multiple alignments are often based on guide trees

Can we solve both problems at the same time?

Phylogeny:

... or among a single taxon (here, human enterovirus 71)

The Problem:

Given n sequences s1,…,sn …

Multiple Alignment:

Build an arrangement A of the sequences by inserting gaps so that homologous bases are placed in the same column

Phylogenetic Inference:

Build a (binary) tree T with s1,…,sn in the leaves and possible ancestors s(n+1),…,s(n+k) in internal nodes describing their evolutionary connection

Generalized Tree Alignment:

Combines the two. The problem we want to solve is:

Given: A set of n sequences s1,…,sn from n different species (could be DNA, RNA or protein – for simplicity we focus on DNA)

Problem: Generate an unrooted phylogenetic tree T with sequences s1,…,sn in the leaves and a multiple alignment A of these sequences

Placing the root is not trivial and is best left to biologists.

The given problem is proven to be MAXSNP-hard (Wang and Jiang, 1994)

→ No polynomial-time approximation scheme (PTAS) exists unless P = NP.

Exact solutions to NP-hard problems are intractable

→ The best we can hope for is a heuristic

The given algorithm runs in time O(n^2 · l^n)

- n: The number of sequences
- l: Their maximum length.

Sequence graphs (Hein, 1989):

Recall pairwise alignment.

Traceback “spells” possible optimal alignments:

Sequence graphs:

Make graph with alignment columns as edge labels

→ represents all optimal alignments

We will get back to that shortly …

Right now, we want to represent sequences

Let us introduce sequence graphs.

For instance, s = ACTGTA is represented by:

Sequence graphs:

More formally:

- Directed, acyclic graph.
- Edge labels l from alphabet Σ. Here, Σ = {A,C,G,T,-}
- Source s: The unique node with no incoming edges
- Sink t: The unique node with no outgoing edges.
- Each path from s to t spells a sequence.
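The definition above can be made concrete with a minimal sketch. It assumes a sequence graph is stored as a list of `(u, v, label)` edges with integer nodes; the function names `linear_graph`, `source_and_sink` and `spelled_sequences` are illustrative and not from the slides.

```python
from collections import defaultdict


def linear_graph(seq):
    """Edges (u, v, label) of a linear sequence graph for a non-empty seq.

    Nodes are 0..len(seq); node 0 is the source, node len(seq) the sink.
    """
    return [(i, i + 1, ch) for i, ch in enumerate(seq)]


def source_and_sink(edges):
    """Return the unique node with no incoming and the unique node with no outgoing edges."""
    nodes = {u for u, _, _ in edges} | {v for _, v, _ in edges}
    sources = nodes - {v for _, v, _ in edges}
    sinks = nodes - {u for u, _, _ in edges}
    if len(sources) != 1 or len(sinks) != 1:
        raise ValueError("a sequence graph needs exactly one source and one sink")
    return sources.pop(), sinks.pop()


def spelled_sequences(edges):
    """Set of sequences spelled by all source-to-sink paths (graph assumed acyclic)."""
    out = defaultdict(list)
    for u, v, label in edges:
        out[u].append((v, label))
    source, sink = source_and_sink(edges)

    def walk(node):
        if node == sink:
            yield ""
            return
        for nxt, label in out[node]:
            for rest in walk(nxt):
                yield label + rest

    return set(walk(source))


print(spelled_sequences(linear_graph("ACTGTA")))  # {'ACTGTA'}
```

Enumerating every path is exponential in general and is only meant to illustrate what the graph represents.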

Sequence graphs:

Represents a set of sequences given by all paths from s to t:

Sequence graphs:

Any single sequence can be represented by a linear sequence graph

Any set of k sequences can be represented by making k paths from s to t

A given sequence s’ can be represented by more than one path
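As a rough illustration of the last two points, the sketch below builds one source-to-sink path per input sequence and then counts paths against distinct spelled sequences. The node names `"s"` and `"t"`, the edge-tuple representation and both function names are assumptions made for this example.

```python
def graph_from_sequences(seqs):
    """Edges (u, v, label) with one source-to-sink path per non-empty sequence."""
    source, sink = "s", "t"
    edges = []
    for k, seq in enumerate(seqs):
        prev = source
        for i, ch in enumerate(seq):
            # The last letter of each sequence leads into the shared sink.
            nxt = sink if i == len(seq) - 1 else (k, i)
            edges.append((prev, nxt, ch))
            prev = nxt
    return edges


def count_paths_and_sequences(edges, source="s", sink="t"):
    """Return (number of source-to-sink paths, set of distinct spelled sequences)."""
    out = {}
    for u, v, label in edges:
        out.setdefault(u, []).append((v, label))

    def walk(node):
        if node == sink:
            yield ""
            return
        for nxt, label in out.get(node, []):
            for rest in walk(nxt):
                yield label + rest

    spelled = list(walk(source))
    return len(spelled), set(spelled)


n_paths, seqs = count_paths_and_sequences(
    graph_from_sequences(["ACT", "AGT", "ACT"])
)
print(n_paths, len(seqs))  # 3 2
```

Here three paths spell only two distinct sequences, because ACT is represented twice.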

We can now represent sequences – but can we align them?

Aligning sequence graphs:

Dynamic programming algorithm inspired by basic pairwise alignment.

Pairwise Alignment:

- Given two sequences p and q
- Advance one letter at a time in p and, for each letter, sweep through q, computing the optimal “partial alignments”
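For reference, the row-by-row pairwise scheme can be sketched as follows. The slides do not state a cost model, so unit mismatch and gap costs are assumed here, and the function name `pairwise_alignment_cost` is hypothetical.

```python
def pairwise_alignment_cost(p, q, mismatch=1, gap=1):
    """Minimum alignment cost of p and q under assumed unit costs."""
    # prev[j] = cost of aligning the prefix of p processed so far with q[:j]
    prev = [j * gap for j in range(len(q) + 1)]
    for i in range(1, len(p) + 1):        # move one letter in p ...
        cur = [i * gap]
        for j in range(1, len(q) + 1):    # ... and sweep through q
            substitute = prev[j - 1] + (0 if p[i - 1] == q[j - 1] else mismatch)
            delete = prev[j] + gap        # letter of p against a gap
            insert = cur[j - 1] + gap     # letter of q against a gap
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[-1]


print(pairwise_alignment_cost("ACTGTA", "ACGTA"))  # 1
```

Only the previous row is kept, which is enough for the cost; recovering the alignments themselves would need the full matrix for the traceback.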

Sequence Graphs:

- Given two sequence graphs G1 and G2
- We can have many outgoing edges to choose from

Aligning sequence graphs:

Fill in a |V1| × |V2| score matrix

For each pair of nodes i from G1 and j from G2:

Should we:

- Align the two characters we got by following e1 into i and e2 into j?
- Stay in G1 and only move in G2?
- Stay in G2 and only move in G1?
- Or have we already found a better path into i and j?
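The four questions above might translate into a matrix fill like the one sketched here. Several things are assumed rather than taken from the slides: unit mismatch and gap costs, edges stored as `(u, v, label)`, and nodes numbered 0..n-1 in topological order with source 0 and sink n-1. The names `align_graphs` and `linear_graph` are illustrative.

```python
def linear_graph(seq):
    """Linear sequence graph: nodes 0..len(seq), edges (i, i+1, letter)."""
    return [(i, i + 1, ch) for i, ch in enumerate(seq)]


def align_graphs(edges1, n1, edges2, n2, mismatch=1, gap=1):
    """Fill the n1 x n2 score matrix D for two sequence graphs.

    D[i][j] is the cheapest alignment of some path source1 -> i with some
    path source2 -> j. Assumes every edge (u, v, label) has u < v.
    """
    inf = float("inf")

    def cost(a, b):
        if a == b:
            return 0
        return gap if "-" in (a, b) else mismatch

    into1 = [[] for _ in range(n1)]
    for u, v, a in edges1:
        into1[v].append((u, a))
    into2 = [[] for _ in range(n2)]
    for u, v, b in edges2:
        into2[v].append((u, b))

    D = [[inf] * n2 for _ in range(n1)]
    D[0][0] = 0
    for i in range(n1):
        for j in range(n2):
            if i == 0 and j == 0:
                continue
            best = inf  # taking the minimum keeps any better path found so far
            for u, a in into1[i]:
                best = min(best, D[u][j] + cost(a, "-"))      # move in G1 only
                for v, b in into2[j]:
                    best = min(best, D[u][v] + cost(a, b))    # align e1 with e2
            for v, b in into2[j]:
                best = min(best, D[i][v] + cost("-", b))      # move in G2 only
            D[i][j] = best
    return D


D = align_graphs(linear_graph("ACTGTA"), 7, linear_graph("ACGTA"), 6)
print(D[6][5])  # 1
```

For two linear graphs this reduces to ordinary pairwise alignment; the inner loops over incoming edges are what handle branching.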

Optimal Alignment Graphs:

Now we need a way to remember the optimal alignments

Recall graphs from before:

- Directed, acyclic graphs
- Nodes s and t defined as before
- Edge labels of the form [la,lb] where la,lb∊Σ

Backtrack through the matrix and consider each possible combination of edges.

Optimal Alignment Graphs:

An example of an OAG:

This one represents the alignments:

We denote such a graph A*

We have to convert the OAGs back to SGs

Optimal Alignment Graphs:

This is done easily by considering the edge labels:

If la = lb: Make a single edge in the SG with label la

If la ≠ lb: Make two edges in the SG: One with label la and one with label lb
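The conversion rule can be sketched in a few lines. The sketch assumes OAG edges are stored as `(u, v, (la, lb))` tuples; dropping duplicate parallel edges is an extra step added here, not something the slides mention, and `oag_to_sg` is a hypothetical name.

```python
def oag_to_sg(oag_edges):
    """Convert optimal-alignment-graph edges into sequence-graph edges."""
    sg_edges = []
    seen = set()
    for u, v, (la, lb) in oag_edges:
        # One edge if both labels agree, otherwise one edge per label.
        labels = (la,) if la == lb else (la, lb)
        for label in labels:
            if (u, v, label) not in seen:  # assumption: skip exact duplicates
                seen.add((u, v, label))
                sg_edges.append((u, v, label))
    return sg_edges


oag = [(0, 1, ("A", "A")), (1, 2, ("C", "-")), (2, 3, ("T", "T"))]
print(oag_to_sg(oag))
# [(0, 1, 'A'), (1, 2, 'C'), (1, 2, '-'), (2, 3, 'T')]
```

In this small example the resulting SG spells both ACT and A-T, that is, both candidate sequences at the aligned position.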

The graph from before turns into the SG:

Summing up Sequence Graphs:

Final graph represents all sequences giving an optimal alignment between G1 and G2

We can:

- Represent a set of sequences by a sequence graph
- Align two such graphs producing a new SG

We can now get on with the main algorithm

The basic idea:

- Start by comparing all sequences
- Find a closest pair.
- Represent all sequences giving the optimal solution
- Defer the choice of a single sequence
- Repeat, but this time include the set of sequences
- In the end: Choose a single sequence and backtrack

This shows a need for:

- A compact representation of many sequences
- An algorithm for aligning sets of sequences

The Deferred Path Heuristic:

Similar to Kruskal’s algorithm for finding MSTs:

From sequences s1,…,sn, initialize n SGs G1,…,Gn.

Until only two SGs remain:

- Align all pairs and choose a closest pair Gi and Gj
- Create A*(Gi,Gj) and convert A* into an SG Gk.
- Replace Gi and Gj with Gk

Note that we remember all candidate sequences
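The control flow of this merge loop can be sketched independently of the graph machinery. In the sketch below, `distance` and `merge` are caller-supplied placeholders for "align two SGs" and "build A* and convert it to an SG". The demo uses crude stand-ins (sets of strings, minimum edit distance, union of the closest members) that are not the sequence-graph operations from the slides; they only make the loop runnable.

```python
from itertools import combinations


def deferred_path_merge(graphs, distance, merge):
    """Greedy loop: replace the closest pair by its merge until two remain.

    Returns a list of two (tree, graph) pairs; tree is a leaf index or a
    nested tuple recording the order of merges.
    """
    items = list(enumerate(graphs))
    while len(items) > 2:
        a, b = min(
            combinations(range(len(items)), 2),
            key=lambda ab: distance(items[ab[0]][1], items[ab[1]][1]),
        )
        (tree_a, g_a), (tree_b, g_b) = items[a], items[b]
        items = [it for k, it in enumerate(items) if k not in (a, b)]
        items.append(((tree_a, tree_b), merge(g_a, g_b)))
    return items


# --- stand-ins for the demo only (assumptions, not from the slides) ---
def edit_distance(p, q):
    prev = list(range(len(q) + 1))
    for i in range(1, len(p) + 1):
        cur = [i]
        for j in range(1, len(q) + 1):
            cur.append(min(prev[j - 1] + (p[i - 1] != q[j - 1]),
                           prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def set_distance(A, B):
    return min(edit_distance(p, q) for p in A for q in B)


def set_merge(A, B):
    d = set_distance(A, B)
    # Keep every member that takes part in a closest pair.
    return frozenset(s for p in A for q in B
                     if edit_distance(p, q) == d for s in (p, q))


leaves = [frozenset([s]) for s in ["ACGT", "ACGA", "TTTT", "TTTA"]]
result = deferred_path_merge(leaves, set_distance, set_merge)
print([tree for tree, _ in result])  # [(0, 1), (2, 3)]
```

Recomputing all pairwise distances in every round mirrors the "align all pairs" step; caching distances between unchanged graphs would be a natural optimization.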

The Deferred Path Heuristic:

When only two SGs Gi and Gj remain:

- Align them and connect them in T
- Choose some optimal alignment
- This gives si and sj at the roots of the two subtrees.
- Backtrack through the subtrees
- At each step: Align the chosen sequence sk to the underlying SGs.
- Choose some optimal alignment

The Deferred Path Heuristic:

We defer our choice of actual sequences until the last moment, thereby enlarging our solution space:
