Generalized Tree Alignment: The Deferred Path Heuristic
Stinus Lindgreen [email protected]
Overview:
What is a phylogeny?
The Generalized Tree Alignment problem
Sequence Graphs and their algorithms
The Deferred Path Heuristic
Phylogeny:
Describes evolutionary model
Common ancestor
Most mutations happen in DNA replication
Mutations accumulate → new species diverge
Only mutations in sex cells are inherited (obviously)
Given n sequences build a phylogenetic tree
Most methods base T on a multiple alignment
Likewise, multiple alignments are often based on guide trees
Can we solve both problems at the same time?
Describes the evolutionary relationship between species
... or within a single taxon (here, human enterovirus 71)
Given n sequences s1,…,sn …
Make an alignment A of the sequences by inserting gaps such that homologous bases are placed in the same column
Build a (binary) tree T with s1,…,sn in the leaves and possible ancestors sn+1,…,sn+k in internal nodes describing their evolutionary connection
Combines the two. The problem we want to solve is:
Given: A set of n sequences s1,…,sn from n different species (could be DNA, RNA or protein – for simplicity we focus on DNA)
Problem: Generate an unrooted phylogenetic tree T with sequences s1,…,sn in the leaves and a multiple alignment A of these sequences
Placing the root is not trivial and is best left to biologists.
→ Not possible to find a polynomial-time approximation scheme (the problem is MAX SNP-hard), unless P = NP.
Exact solutions to NP-hard problems are intractable
→ The best we can hope for is a heuristic
The given algorithm runs in time O(n2.ln)
Recall pairwise alignment.
Traceback "spells" possible optimal alignments:
Make graph with alignment columns as edge labels
→ represents all optimal alignments
We will get back to that shortly …
Right now, we want to represent sequences
Let us introduce sequence graphs.
For instance, s = ACTGTA is represented by:
Represents a set of sequences given by all paths from s to t:
Any single sequence can be represented by a linear sequence graph
Any set of k sequences can be represented by making k paths from s to t
A given sequence s’ can be represented by more than one path
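The properties above can be illustrated with a minimal sketch. The class name `SequenceGraph`, its methods, and the choice of integer node ids are assumptions made for illustration and are not taken from the slides; the sketch only assumes a directed acyclic graph with one base per edge, a source s and a sink t.

```python
from collections import defaultdict


class SequenceGraph:
    """DAG with one label per edge; source-to-sink paths spell sequences."""

    def __init__(self, source, sink):
        self.source = source
        self.sink = sink
        self.edges = defaultdict(list)  # node -> list of (label, next_node)

    def add_edge(self, u, v, label):
        self.edges[u].append((label, v))

    @classmethod
    def linear(cls, seq):
        """Linear graph for one sequence: nodes 0..len(seq), one edge per base."""
        g = cls(0, len(seq))
        for i, base in enumerate(seq):
            g.add_edge(i, i + 1, base)
        return g

    def sequences(self):
        """Return the set of sequences spelled by all source-to-sink paths."""
        found = set()
        stack = [(self.source, "")]
        while stack:
            node, prefix = stack.pop()
            if node == self.sink:
                found.add(prefix)
                continue
            for label, nxt in self.edges[node]:
                stack.append((nxt, prefix + label))
        return found


g = SequenceGraph.linear("ACTGTA")   # represents exactly one sequence
g.add_edge(2, 3, "G")                # a second path: now two sequences
```

After the extra edge is added, `g.sequences()` contains both ACTGTA and ACGGTA. Adding a parallel edge with a label that already exists would create a second path for the same sequence, which is the situation described in the last bullet.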
We can now represent sequences – but can we align them?
Dynamic programming algorithm inspired by basic pairwise alignment
Fill in a |V1|*|V2| score matrix
For each pair of nodes i from G1 and j from G2:
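The recurrence itself is not reproduced in this text, so the following is only a sketch of what such a node-pair dynamic program could look like. It assumes unit costs (mismatch 1, gap 1, match 0), cost minimization, and nodes numbered in topological order with source 0 and sink n-1; the names `align_graphs` and `linear_edges` are hypothetical.

```python
def linear_edges(seq):
    """Edges (u, v, label) of the linear sequence graph for seq."""
    return [(i, i + 1, base) for i, base in enumerate(seq)]


def align_graphs(n1, edges1, n2, edges2, mismatch=1, gap=1):
    """Fill a |V1| x |V2| matrix D, where D[i][j] is the minimum cost of
    aligning some path from the source of G1 to i with some path from the
    source of G2 to j.

    Assumption: nodes are integers 0..n-1 in topological order (u < v for
    every edge), source 0, sink n-1. Costs are illustrative unit costs.
    """
    inf = float("inf")
    # Incoming edges per node, so each cell can look back along edges.
    pred1 = [[] for _ in range(n1)]
    for u, v, label in edges1:
        pred1[v].append((u, label))
    pred2 = [[] for _ in range(n2)]
    for u, v, label in edges2:
        pred2[v].append((u, label))

    D = [[inf] * n2 for _ in range(n1)]
    D[0][0] = 0
    for i in range(n1):
        for j in range(n2):
            if i == 0 and j == 0:
                continue
            best = inf
            for pi, a in pred1[i]:
                best = min(best, D[pi][j] + gap)        # edge of G1 against a gap
                for pj, b in pred2[j]:                  # edge of G1 against edge of G2
                    best = min(best, D[pi][pj] + (0 if a == b else mismatch))
            for pj, b in pred2[j]:
                best = min(best, D[i][pj] + gap)        # edge of G2 against a gap
            D[i][j] = best
    return D


# G1 represents {ACT, AGT}; G2 represents {AGT}.
D = align_graphs(4, linear_edges("ACT") + [(1, 2, "G")], 4, linear_edges("AGT"))
```

For two linear graphs this reduces to ordinary pairwise alignment; with branching graphs, each cell considers every combination of incoming edges, so the sink-to-sink entry is the best score over all pairs of represented sequences.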
Now we need a way to remember the optimal alignments
Recall graphs from before:
Backtrack through the matrix and consider each possible combination of edges.
An example of an OAG:
This one represents the alignments:
We denote such a graph A*
We have to convert the OAGs back to SGs
This is done easily by considering the edge labels:
If la = lb: Make a single edge in the SG with label la
If la ≠ lb: Make two edges in the SG: one with label la and one with label lb
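The two rules can be written down directly. In this sketch an OAG edge is assumed to be a tuple `(u, v, la, lb)` and an SG edge a tuple `(u, v, label)`; both formats and the name `oag_to_sg` are illustrative assumptions. Labels are treated as opaque symbols, so how a gap label should be read in the resulting SG is left open here.

```python
def oag_to_sg(oag_edges):
    """Convert OAG edges (u, v, la, lb) into SG edges (u, v, label).

    Equal labels give one SG edge; differing labels give two parallel
    SG edges. Duplicate SG edges are emitted only once.
    """
    sg_edges = []
    seen = set()
    for u, v, la, lb in oag_edges:
        labels = [la] if la == lb else [la, lb]
        for label in labels:
            edge = (u, v, label)
            if edge not in seen:
                seen.add(edge)
                sg_edges.append(edge)
    return sg_edges


sg = oag_to_sg([(0, 1, "A", "A"), (1, 2, "C", "G")])
```

Here the first OAG edge collapses to a single edge labelled A, while the second becomes two parallel edges labelled C and G.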
The graph from before turns into the SG:
Final graph represents all sequences giving an optimal alignment between G1 and G2
We can now get on with the main algorithm
This shows a need for:
Similar to Kruskal’s algorithm for finding MSTs:
From sequences s1,…,sn, initialize n SGs G1,…,Gn.
Until only two SGs remain:
Note that we remember all candidate sequences
When only two SGs Gi and Gj remain:
We defer our choice of actual sequences until the last moment, thereby enlarging our solution space:
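The control flow of such a Kruskal-like loop can be sketched as follows. The individual steps inside the loop are not listed in this text, so the sketch assumes that each round picks the closest pair of SGs and replaces it with a merged SG. The functions `distance` and `merge` are placeholders for the SG alignment and the OAG-to-SG conversion; the toy versions below (sets of equal-length strings compared by Hamming distance) are hypothetical stand-ins used only to make the example runnable.

```python
from itertools import combinations


def deferred_merge(graphs, distance, merge):
    """Repeatedly merge the closest pair of graphs until two remain.

    Returns the remaining graphs (by id) and the list of joins
    (left id, right id, new id), which records the tree topology.
    """
    active = dict(enumerate(graphs))
    joins = []
    next_id = len(graphs)
    while len(active) > 2:
        i, j = min(combinations(sorted(active), 2),
                   key=lambda p: distance(active[p[0]], active[p[1]]))
        active[next_id] = merge(active.pop(i), active.pop(j))
        joins.append((i, j, next_id))
        next_id += 1
    return active, joins


# Toy stand-ins (assumptions): a "graph" is a set of equal-length sequences.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def toy_distance(g1, g2):
    return min(hamming(a, b) for a in g1 for b in g2)


def toy_merge(g1, g2):
    # Keep every candidate that takes part in a closest pair,
    # instead of committing to a single ancestor sequence.
    d = toy_distance(g1, g2)
    return {s for a in g1 for b in g2 if hamming(a, b) == d for s in (a, b)}


remaining, joins = deferred_merge(
    [{"AAAA"}, {"AAAT"}, {"GGGG"}, {"GGGC"}], toy_distance, toy_merge)
```

The point of the sketch is that merged nodes keep a set of candidates rather than one chosen sequence, which is the deferral described above.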