CSE 746 – Introduction to Bioinformatics Research Project

Two methods of DNA Sequencing – Comparing and Intertwining Suffix Trees and De Bruijn Graphs for Sequence Assembly
Dicle Öztürk
540110004

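Suffix trees - Definition
• Definition (Gusfield): a suffix tree for an m-character string S is a rooted directed tree with exactly m leaves, numbered 1 to m. Every internal node other than the root has at least two children, every edge is labelled with a non-empty substring of S, and no two edges out of a node have labels that begin with the same character. For each leaf i, concatenating the edge labels on the path from the root to leaf i spells out the suffix S[i...m].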

Suffix trees – uses and complexity
• Useful in string search and text-processing tasks
• Act as a bridge between exact and inexact pattern matching problems
• Storing a suffix tree requires more space than storing the string itself
• Ukkonen was the first to provide a linear-time online construction of suffix trees
• Can be used for the sorting stage of the Burrows-Wheeler transform (BWT)

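Suffix trees – Naïve algorithm
Assuming a bounded alphabet, this algorithm runs in O(m^2) time.
• N_1: the tree with a single edge from the root to leaf 1, labelled S[1...m]$
• N_i: the intermediate tree that holds suffixes 1 through i (the inductive assumption)
• N_{i+1}: constructed from N_i by inserting the suffix S[i+1...m]$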

Suffix Trees – Naïve algorithm
find the longest path from the root whose label matches a prefix of S[i+1...m]$
  (the matching path is unique, because no two edges out of a node can have labels that begin with the same char)
when no further match is possible:
  if the match ends in the middle of an edge (u,v)
    split the edge into two: insert a new node w just after the last char on the edge that matched a char in S[i+1...m]$ (before the char that mismatched)
    label the edges (u,w) and (w,v) accordingly
  else
    let w be the node at which the match ended
  endif
  create a new edge (w,i+1), thus creating a new leaf i+1
  label (w,i+1) with the unmatched part of the suffix S[i+1...m]$
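The insertion step can be sketched in Python as follows. This is a minimal illustration of the quadratic-time naïve construction, not an optimized implementation: the names `Node`, `insert_suffix`, `naive_suffix_tree` and `suffixes_in_tree` are hypothetical, edge labels are stored as explicit strings for readability, and leaves are numbered from 1 to match the slides.

```python
class Node:
    def __init__(self):
        # first char of an edge label -> (edge label, child node)
        self.children = {}
        # 1-based start position of the suffix, set only on leaves
        self.leaf = None


def insert_suffix(root, text, start):
    """Insert text[start:] by walking from the root, splitting an edge if needed."""
    node, pos = root, start
    while True:
        ch = text[pos]
        if ch not in node.children:
            # The match ended at a node: hang a new leaf edge off it.
            leaf = Node()
            leaf.leaf = start + 1
            node.children[ch] = (text[pos:], leaf)
            return
        label, child = node.children[ch]
        k = 0
        while k < len(label) and text[pos + k] == label[k]:
            k += 1
        if k == len(label):
            # Whole edge matched: continue from the child node.
            node, pos = child, pos + k
            continue
        # Mismatch in the middle of edge (u, v): split it at a new node w.
        w = Node()
        w.children[label[k]] = (label[k:], child)
        node.children[ch] = (label[:k], w)
        leaf = Node()
        leaf.leaf = start + 1
        w.children[text[pos + k]] = (text[pos + k:], leaf)
        return


def naive_suffix_tree(s):
    """Build the tree for s + '$' by inserting the suffixes of s one by one."""
    text = s + "$"  # assumes '$' does not occur in s
    root = Node()
    for start in range(len(s)):
        insert_suffix(root, text, start)
    return root


def suffixes_in_tree(node, prefix=""):
    """Map each leaf number to the string spelled on its root-to-leaf path."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    found = {}
    for label, child in node.children.values():
        found.update(suffixes_in_tree(child, prefix + label))
    return found
```

Calling `suffixes_in_tree(naive_suffix_tree("xabxa"))` reads every suffix back from the tree, which is a simple way to check that each leaf i spells S[i...m]$.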

Suffix trees – Ukkonen's algorithm
• Ukkonen builds on McCreight's work, keeping the construction linear in time and space while giving more comprehensible definitions.
• The tree is constructed online and in a left-to-right fashion, as opposed to Weiner's method.

Suffix trees – Ukkonen's algorithm
• In Ukkonen's algorithm, substrings are represented by their indices.
• The trick is that the end index of an edge leading to a leaf is left undefined (open), so leaf edges grow automatically as the string is extended.
• If w is a substring of the string s, then w = (i,j) stands for w = s[i]...s[j].
• Each edge label therefore takes constant space, and the suffix tree for s has at most |s| leaves, which guarantees linear space complexity.
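The index-pair representation with an open leaf end can be sketched as below. This illustrates only the edge-label bookkeeping, not Ukkonen's full algorithm; the classes `End` and `Edge` are hypothetical names, and indices are 0-based and inclusive so that `(i, j)` denotes `s[i]...s[j]`.

```python
class End:
    """A shared, mutable end position used by every leaf edge (the open end)."""

    def __init__(self, value):
        self.value = value


class Edge:
    """An edge label stored as an index pair (start, end) into the string."""

    def __init__(self, start, end):
        self.start = start
        self.end = end  # an int for internal edges, an End for leaf edges

    def label(self, s):
        end = self.end.value if isinstance(self.end, End) else self.end
        return s[self.start:end + 1]


s = "abcab"
leaf_end = End(2)              # only s[0..2] has been read so far
leaf_edge = Edge(0, leaf_end)  # leaf edge with an open end
inner_edge = Edge(0, 1)        # internal edge with a fixed end
before = leaf_edge.label(s)
leaf_end.value = 4             # reading two more characters extends all leaf edges
after = leaf_edge.label(s)
```

Because every leaf edge points at the same `End` object, extending the string by one character updates all leaf labels in a single assignment.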

Suffix trees - Applications
The notes of (Lewis, usask.ca) mention some general applications of suffix trees in computational biology:
• Genome alignment
• Signature selection
• Finding a short sequence that is specific to individual genes
• Searches for non-repeating segments
• Finding and representing all tandem repeats

Suffix trees - Applications
• (Riedl, 1994) gives a more detailed list of applications
• Suffix trees are useful in search, single-sequence analysis and multiple-sequence analysis
• With the method they use, called Gestalt tree matching, homology-search applications are believed to outperform FASTP and FASTA

Suffix trees - Applications
• Detecting the occurrences of any number of short subsequences can be useful in determining enzyme cut sites
• A generalised suffix tree of a set of sequences allows all of the sequences to be analysed simultaneously
• Detecting common subsequences within a set of sequences can be applied to contig reassembly (Riedl, 1994)

Suffix trees - Applications
• Finding the best match between the suffix of one read and the prefix of another can also be a fruitful task
• Suffix-prefix overlaps help in finding the shortest common superstring of a set of reads, especially in genome assembly
• Suffix trees can be used to remove redundancies in string containment problems
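To make the suffix-prefix overlap idea concrete, here is a small Python sketch. It uses a direct quadratic comparison rather than a suffix tree, purely for illustration; the function names `overlap` and `merge_reads` and the example reads are assumptions, not taken from the cited sources.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge_reads(a, b):
    """Join two reads on their longest suffix-prefix overlap."""
    return a + b[overlap(a, b):]


merged = merge_reads("ACGTAC", "TACGGA")
```

A generalised suffix tree can find the same overlaps for all pairs of reads far more efficiently; the brute-force version above only shows what is being computed.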

De Bruijn Graphs
• There are strings called De Bruijn strings, which may have inspired the development of De Bruijn graphs and vice versa.
• A De Bruijn string of order k is a non-empty string x over an alphabet A (x ∈ A+) such that each string over A of length k occurs once and only once in x. For example, x = 11001 is a De Bruijn string of order k = 2 over A = {0,1}: its substrings of length 2 are 11, 10, 00 and 01.
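The definition can be checked mechanically. The following Python sketch counts every length-k window of x and verifies that each word over the alphabet appears exactly once; the function name `is_de_bruijn_string` is a hypothetical choice.

```python
from collections import Counter
from itertools import product


def is_de_bruijn_string(x, alphabet, k):
    """True if every length-k string over `alphabet` occurs exactly once in x."""
    counts = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    return all(counts["".join(word)] == 1 for word in product(alphabet, repeat=k))


example_ok = is_de_bruijn_string("11001", "01", 2)
```

For the slide's example, `example_ok` is true because 11, 10, 00 and 01 each occur once in 11001.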

De Bruijn Graphs
• De Bruijn graphs model these kinds of strings: the nodes hold the substrings of length k-1 and each edge carries one character (the character left over from the k-length substring). The graph is directed: if two nodes are connected by an edge, the substring of the target node immediately follows the substring of the source node in the sequence.
• Building De Bruijn graphs is not trivial, but they have many applications in genome assembly
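A minimal Python sketch of this construction is given below, assuming a single input sequence and ignoring practical concerns such as reverse complements and sequencing errors. The function name `de_bruijn_graph` and the adjacency-list layout are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(seq, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge labelled with its last char."""
    graph = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        source, target = kmer[:-1], kmer[1:]
        graph[source].append((kmer[-1], target))
    return dict(graph)


graph = de_bruijn_graph("ACGTA", 3)
```

A k-mer that occurs several times adds a repeated edge in this sketch, so edge multiplicity reflects how often the k-mer was seen.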

De Bruijn Graphs
De Bruijn graphs are useful in
• Handling sequence variants like duplications, inversions and transpositions
• Combining sequences of different lengths
• Effective data compression even when the data has many redundant parts
• Detecting and analysing structural variants from unassembled data

De Bruijn Graphs and Affix Trees
Conceptually, the De Bruijn graph of a sequence can be considered as a simplification of that sequence's affix tree. In this affix tree:
• Each non-empty substring of a given sequence is mapped onto a separate node
• Each node is connected by an edge to its longest proper prefix and by a suffix link to its longest proper suffix
• Nodes corresponding to sequences of length 1 are directly connected to the root node, which corresponds to the empty string.

De Bruijn Graphs and Affix Trees
• The root represents the empty string. Its children are the sequences of length 1.
• The analogy is built upon the atomic tree representation of (Giegerich, 1997), and the idea is mostly from (Maaß, 2003).
• It has been pointed out in (Zerbino, 2009) that traversing the De Bruijn graph is equivalent to traversing the affix tree across its breadth

De Bruijn Graphs and Affix Trees
Furthermore, (Zerbino, 2009) states:
• If we rank the nodes by distance from the root, the k-mer nodes of the De Bruijn graph correspond to the nodes of rank k in the affix tree
• It is easy to demonstrate that two k-mers are connected in the De Bruijn graph iff the corresponding nodes in the affix tree are connected by a path composed of an edge and a suffix link, going through a node of rank k+1
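The correspondence can be illustrated with a small Python sketch. It assumes the atomic view described on these slides, in which every non-empty substring is a node, the tree edge leads to the longest proper prefix and the suffix link to the longest proper suffix. Note that here the De Bruijn nodes are k-mers, as in the quoted statement, rather than (k-1)-mers. The function names are hypothetical.

```python
def substrings(seq):
    """All non-empty substrings of seq: the node set of the atomic affix tree."""
    return {seq[i:j] for i in range(len(seq)) for j in range(i + 1, len(seq) + 1)}


def de_bruijn_edges(seq, k):
    """Pairs of consecutive k-mers, read directly off the sequence."""
    return {(seq[i:i + k], seq[i + 1:i + k + 1]) for i in range(len(seq) - k)}


def affix_tree_paths(seq, k):
    """Pairs of rank-k nodes joined through a rank-(k+1) node w:
    one tree edge (w to its longest proper prefix) and
    one suffix link (w to its longest proper suffix)."""
    nodes = substrings(seq)
    parent = {w: w[:-1] for w in nodes}
    suffix_link = {w: w[1:] for w in nodes}
    return {(parent[w], suffix_link[w]) for w in nodes if len(w) == k + 1}


same = de_bruijn_edges("ACGTACG", 2) == affix_tree_paths("ACGTACG", 2)
```

Both functions produce the same set of pairs, because each (k+1)-mer of the sequence is exactly one rank-(k+1) node linking its length-k prefix to its length-k suffix.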

De Bruijn Graphs and Affix Trees
• (Giegerich, 1997) gives the definitions of suffix and prefix trees together with their special relationship.
• Active suffixes and prefixes for the string t (a suffix or prefix is nested if it occurs more than once in t; t^-1 denotes the reverse of t):
• The active suffix of t ← its longest nested suffix, denoted α(t)
• The active prefix of t ← its longest nested prefix, denoted α^-1(t)
• Then, α(t^-1) = (α^-1(t))^-1
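The identity can be checked on small strings with the brute-force Python sketch below. It assumes that "nested" means "occurs more than once in t" and treats the empty string as nested; the function names are hypothetical and the approach is quadratic or worse, intended only as a check.

```python
def occurrences(t, w):
    """Number of (possibly overlapping) positions at which w occurs in t."""
    return sum(1 for i in range(len(t) - len(w) + 1) if t.startswith(w, i))


def active_suffix(t):
    """Longest suffix of t that occurs more than once in t (may be empty)."""
    for i in range(len(t) + 1):
        suffix = t[i:]
        if suffix == "" or occurrences(t, suffix) > 1:
            return suffix


def active_prefix(t):
    """Longest prefix of t that occurs more than once in t (may be empty)."""
    for j in range(len(t), -1, -1):
        prefix = t[:j]
        if prefix == "" or occurrences(t, prefix) > 1:
            return prefix


def identity_holds(t):
    """Check alpha(t^-1) == (alpha^-1(t))^-1 for one string t."""
    return active_suffix(t[::-1]) == active_prefix(t)[::-1]
```

For t = banana the active suffix is ana, while the active prefix is empty because no non-empty prefix of banana occurs twice.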

De Bruijn Graphs and Affix Trees
• The tree is atomic if each of its edges is marked by a single char
• So every node is explicit
• The tree is actually a trie

De Bruijn Graphs and Affix Trees
• (Maaß, 2003) takes this analogy-based idea of affix trees further and gives more insight into the issue
• It states that:
• A suffix link is an auxiliary edge from node n to node m, where m is the node such that path(m) is the longest proper suffix of path(n) represented by a node in the tree.
• Suffix links are used to move from one node to another so that the represented string is shortened at the front.
• It also mentions that, in essence, it was (Blummer, 1998) who observed the dual structure of suffix trees.

Redundancies in the sequences
• Finally, the redundancy in the suffix tree of a string should be as low as possible, so that a De Bruijn graph can be built efficiently out of that tree.
• To reduce redundancy, compression methods can be applied, provided they are lossless and reversible. The algorithm of Lempel and Ziv is suggested as an efficient tool for this task, running in O(n) time with suffix trees.
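As a rough illustration of lossless, reversible redundancy removal, the Python sketch below performs a simple Lempel-Ziv style factorization. It searches for earlier matches by brute force, which takes quadratic time or worse, instead of using a suffix tree to reach O(n). The function names and the factor format (a literal character, or a `(position, length)` pair) are assumptions made for this example.

```python
def lz_factorize(s):
    """Split s into literals and (position, length) references to earlier text."""
    factors = []
    i = 0
    while i < len(s):
        best_len, best_pos = 0, -1
        for j in range(i):
            length = 0
            # The match may run past position i (a self-overlapping copy).
            while i + length < len(s) and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_pos = length, j
        if best_len == 0:
            factors.append(s[i])  # first occurrence of this character
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def lz_decode(factors):
    """Rebuild the original string, copying referenced text char by char."""
    out = []
    for factor in factors:
        if isinstance(factor, str):
            out.append(factor)
        else:
            pos, length = factor
            for offset in range(length):
                out.append(out[pos + offset])
    return "".join(out)


factors = lz_factorize("ACGTACGTAC")
```

Decoding the factors with `lz_decode` returns the original sequence, which is the reversibility requirement stated on the slide.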

References
[1] – Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Dan Gusfield, Cambridge University Press, Jan 15, 1997.
[2] – Ukkonen E., On-line Construction of Suffix-Trees, Algorithmica vol. 14(3), 1995.
[3] – Algorithms on Strings, Maxime Crochemore and Christophe Hancart, Cambridge University Press, June 2007.
[4] – Genome assembly and comparison using de Bruijn graphs, D.R. Zerbino, PhD Thesis, European Bioinformatics Institute, Darwin College, September 2009.
[5] – Giegerich R., and Kurtz S., From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction, Algorithmica 19:331–353, 1997.
[6] – Maaß, M. G., Linear bidirectional on-line construction of affix trees, Algorithmica vol. 37(1), 2003.
[7] – Bieganski, P., Riedl, J., Carlis, J.V., Retzel, E.F., Generalized Suffix Trees for Biological Sequence Data: Implementations and Applications, HICSS (5), 1994.
