Genome-scale disk-based suffix tree indexing

Benjarath Phoophakdee, Mohammed J. Zaki • Compiled by: Amit Mahajan, Chaitra Venus

Introduction… • Growth in biological sequences database • Need for effective and efficient structure • Suffix Tree • Exact/approx. matching • Database querying • Longest common substrings etc.

Introduction… • In-memory construction algorithms • O(n^2) • Can achieve linear time and space • suffix links • edge encoding • skip and count • Problem: do not scale for large input sequences

Disk based Suffix trees • “A Database Index to Large Biological Sequences” • Abandon suffix links (for better locality of reference) • Partition input based on fixed length prefixes • Faces problem in partition size because of data skew • Use of bin packing for partitions: expensive to count frequency for long length prefixes • “Practical Suffix Tree Construction” • TDD: Similar to above… drops suffix links • Reported to scale to human genome level • Random I/Os when input string size > memory

Disk based Suffix trees • ST-Merge (Improvement to TDD) • Input string is split into smaller contiguous substrings • Apply TDD on each substring and then merge all trees • Does not have suffix links • TOP-Q and DynaCluster • Only known algorithms that maintain suffix links and do not have the data skew problem • Experiments show that they do not scale to human genome level

Issue • Problems with disk based algorithms • Data skew • No suffix links • No scalability • The authors propose a novel disk based suffix tree algorithm called TRELLIS

TRELLIS • O(n^2) time, O(n) space • Idea: • construct by partitioning and merging • use variable length prefixes • Recover suffix links in a separate post-construction phase • Effectively scales up to human genome level • Can index the entire human genome using 2 GB of memory in 4 hours, and recover suffix links in 2 hours

TRELLIS • Has 4 different phases • Prefix Creation • Partitioning • Merging • Suffix Link Recovery

Prefix Creation Phase • Problems with fixed-length prefixes • Cannot handle data skew • No well-defined way to compute an appropriate length • TRELLIS uses variable length prefixes: P = {P_0, P_1, P_2, …, P_{m-1}} • Use a threshold t to determine P such that freq(P_i) ≤ t

Prefix Creation Phase • Multi-scan approach to compute P • i-th scan • Process prefixes up to a certain length L_i (See formula below to calculate L_i) • EP_i = set of prefixes that need further extension in the next scan (as their frequency > t) • Add to P only the shortest prefixes that meet the frequency threshold t, and reject their extensions

Prefix Creation Phase • Ex: With t = 10^6, only two stages were required for the human genome, with L_1 = 8 and L_2 = 16 • Resulting set P contained about 6400 prefixes of lengths in the range 4 to 16
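
The prefix creation idea above can be illustrated with a minimal Python sketch. It is a simplification, not the authors' code: it extends over-threshold prefixes by one character per scan instead of jumping to a length L_i, and it assumes the input ends with a unique terminal symbol (here "$") so that every suffix is covered by exactly one prefix. The function name variable_length_prefixes is hypothetical.

```python
from collections import Counter

def variable_length_prefixes(s, t):
    """Return {prefix: frequency} such that every frequency is <= t.

    Simplified multi-scan: each scan looks at prefixes one character
    longer than the previous scan (an assumption of this sketch; the
    slides describe scans that process prefixes up to a length L_i).
    """
    n = len(s)
    accepted = {}
    to_extend = {""}   # prefixes whose frequency still exceeds t
    length = 0
    while to_extend:
        length += 1
        counts = Counter()
        # One scan over the input: count only extensions of over-threshold prefixes.
        for i in range(n - length + 1):
            cand = s[i:i + length]
            if cand[:-1] in to_extend:
                counts[cand] += 1
        to_extend = set()
        for cand, freq in counts.items():
            if freq <= t:
                accepted[cand] = freq   # shortest prefix meeting the threshold
            else:
                to_extend.add(cand)     # needs further extension in the next scan
    return accepted

if __name__ == "__main__":
    print(variable_length_prefixes("banana$", 2))
```

Because a prefix is accepted as soon as its frequency drops to t or below, none of its extensions is ever counted, which matches the "reject their extensions" rule on the slide.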

Partitioning Phase • Divide input string into r consecutive partitions where r = (n+1) / t • Suffix Subtree T_{R_i} • Contains suffixes that start in partition R_i • Use Ukkonen’s algorithm* to build it • Prefixed Suffix Subtree T_{R_i,P_k} • Split T_{R_i} into subtrees that contain only suffixes that have prefix P_k • At most m such subtrees • Store these prefixed suffix subtrees on disk • * Proposed in the paper “Online construction of suffix trees” by E. Ukkonen

Partitioning Phase • The T_{R_i} obtained are implicit suffix trees (i.e. some suffixes are part of internal edges) • To guarantee that T_{R_i} explicitly contains all suffixes from the i-th partition • Continue to read some characters from the next partition R_{i+1} until t leaves are obtained in T_{R_i} • Cannot append a special character, as it would incur additional overhead during the merging phase

Merging Phase • For each prefix P_k in the set P • Merge all Prefixed Suffix Subtrees T_{R_i,P_k} to get the Prefixed Suffix Tree T_{P_k} • We get m Prefixed Suffix Trees • Store the resulting trees back to disk
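
The data flow of the partitioning and merging phases can be sketched as follows. This is only an illustration under loud assumptions: sorted lists of suffix start positions stand in for the suffix subtrees (no Ukkonen construction, no disk storage), the prefix set is assumed to be prefix-free, the string s is assumed to include its terminal symbol, and the partition count is rounded up. The function names are hypothetical.

```python
import math
from heapq import merge

def partition_suffixes(s, t, prefixes):
    """For each partition R_i, bucket its suffix start positions by prefix P_k.

    Each bucket is a stand-in for a prefixed suffix subtree T_{R_i,P_k}.
    """
    n = len(s)                 # s is assumed to include the terminal symbol
    r = math.ceil(n / t)       # rounding up is an assumption of this sketch
    parts = []
    for i in range(r):
        buckets = {}
        for pos in range(i * t, min((i + 1) * t, n)):
            for p in prefixes:
                if s.startswith(p, pos):
                    buckets.setdefault(p, []).append(pos)
                    break
        for positions in buckets.values():
            positions.sort(key=lambda pos: s[pos:])   # order within one partition
        parts.append(buckets)
    return parts

def merge_by_prefix(s, parts, prefixes):
    """For each prefix P_k, merge its buckets from all partitions.

    Each merged list is a stand-in for a prefixed suffix tree T_{P_k}.
    """
    merged = {}
    for p in prefixes:
        runs = [buckets[p] for buckets in parts if p in buckets]
        merged[p] = list(merge(*runs, key=lambda pos: s[pos:]))
    return merged

if __name__ == "__main__":
    s = "banana$"
    prefixes = ["b", "n", "$", "an", "a$"]
    parts = partition_suffixes(s, 3, prefixes)
    print(merge_by_prefix(s, parts, prefixes))
```

Each partition only ever holds at most t suffixes in memory, and each merge touches only the buckets for a single prefix, which is the property that lets the real algorithm work with structures that fit in main memory.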

Suffix Link Recovery Phase • Why? • Suffix links are crucial for efficiency in many fast string processing algorithms • Why in a separate phase? • TRELLIS may discard all suffix link information during the merge phase, as new internal nodes are created and some old ones are deleted • It is useful to discard suffix link information after partitioning, as it reduces the amount of data per node • Recovering links from scratch takes the same time as keeping the original link information

Suffix Link Recovery Phase • TRELLIS recovers the suffix links of one Prefixed Suffix Tree at a time • Start with the children of the root • Proceeding in a depth-first fashion, do the following for each internal node x • Locate p(x), the parent of x, and sl(p(x)), the suffix link of that parent • Count from sl(p(x)) to locate sl(x); when found, add the link • Do this recursively for all children of x
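
The recovery procedure above can be sketched on a single in-memory tree. This is a minimal illustration, not TRELLIS itself: the tree is built by naive suffix insertion rather than Ukkonen's algorithm, it works on one whole tree rather than one prefixed suffix tree at a time, and the input is assumed to end with a unique terminal symbol. Node, build_naive_suffix_tree, recover_suffix_links and path_label are hypothetical names.

```python
class Node:
    def __init__(self, start=0, end=0, parent=None):
        self.start, self.end = start, end   # incoming edge label is s[start:end]
        self.parent = parent
        self.children = {}                  # first edge character -> child node
        self.slink = None                   # suffix link, filled in afterwards

def build_naive_suffix_tree(s):
    """Insert suffixes one by one (quadratic stand-in; s must end with a unique symbol)."""
    root, n = Node(), len(s)
    for i in range(n):
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:
                node.children[s[j]] = Node(j, n, node)
                break
            k = child.start
            while k < child.end and s[k] == s[j]:
                k += 1
                j += 1
            if k == child.end:              # whole edge matched, keep descending
                node = child
                continue
            mid = Node(child.start, k, node)  # mismatch inside the edge: split it
            node.children[s[child.start]] = mid
            child.start, child.parent = k, mid
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, n, mid)
            break
    return root

def recover_suffix_links(s, root):
    """Depth-first recovery: reach sl(x) from sl(p(x)) by skipping whole edges."""
    root.slink = root
    stack = [c for c in root.children.values() if c.children]
    while stack:
        x = stack.pop()
        start, length = x.start, x.end - x.start
        if x.parent is root:
            start, length = start + 1, length - 1   # drop the first character
        v = x.parent.slink
        while length > 0:
            v = v.children[s[start]]        # choose the edge by its first character
            edge = v.end - v.start
            start, length = start + edge, length - edge   # skip/count over the edge
        x.slink = v
        stack.extend(c for c in x.children.values() if c.children)
    return root

def path_label(s, node):
    """String spelled out on the path from the root to node."""
    parts = []
    while node.parent is not None:
        parts.append(s[node.start:node.end])
        node = node.parent
    return "".join(reversed(parts))
```

Because a parent is always processed before its children are pushed on the stack, sl(p(x)) is available whenever x is visited, and the walk never inspects more than the first character of each edge it skips.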

Choosing t • Note: t is also the threshold for partition size • M ≥ n/4 + ((0.7 × 40) + 16)t + (0.7 × 40)t • M = available main memory • n/4 = memory for the input (in compressed form) • # internal nodes = 0.7 × (# external nodes) • 40 and 16 are the sizes of internal and external nodes, respectively
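
As a worked illustration of the memory bound above, the following sketch solves the inequality for the largest admissible t. It assumes M and the node sizes are in bytes and n is the input length in characters; the function name max_threshold is hypothetical, and integer arithmetic is used to avoid floating-point rounding of the 0.7 factor.

```python
INTERNAL_NODE_SIZE = 40    # size of an internal node, from the slide
EXTERNAL_NODE_SIZE = 16    # size of an external node, from the slide

def max_threshold(memory, n):
    """Largest integer t with memory >= n/4 + ((0.7*40)+16)*t + (0.7*40)*t."""
    internal_per_t = 7 * INTERNAL_NODE_SIZE // 10              # 0.7 * 40 = 28
    per_t = (internal_per_t + EXTERNAL_NODE_SIZE) + internal_per_t   # 72
    input_cost = (n + 3) // 4                                  # n/4, rounded up
    available = memory - input_cost
    if available < per_t:
        raise ValueError("not enough memory for even t = 1")
    return available // per_t

if __name__ == "__main__":
    # Toy numbers only; not a measurement from the experiments.
    print(max_threshold(1000, 400))
```

With the sizes given on the slide, the two t terms add up to 72t, so the bound reduces to t ≤ (M − n/4) / 72.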

Computational Complexity • Prefix Creation Phase • O(nL) time, where L = longest prefix length • O(n + |Σ^(L+1)|) space • Partitioning Phase • Input is broken into r partitions and each partition is of size t • O(t) time/space for each => r × O(t) = O(n) • Disk I/Os: O(r × m), since at most m prefixed suffix subtrees can be created for each partition

Computational Complexity • Merging Phase • Each merge operation can be O(p) where p = | longest common prefix | • Across all prefixes, merging = O(p × n), since the number of tree nodes in a suffix tree is bounded by n • In the worst case p can be O(n), therefore merge = O(n^2) • Disk I/Os: O(r × m)

Computational Complexity • Suffix Link Recovery Phase • Internal nodes in the final suffix trees are O(n) • Constant set of operations for each suffix link recovery • Putting it all together… • O(n^2) time, since the most expensive phase is the merge phase • O(n) space

Experimental Setup • Compared to • TOP-Q and DynaCluster (maintain suffix links) • TDD (no suffix links) • Performed on Linux with • 2 GB RAM for human genome and 512 MB for others • 288 GB disk space • TRELLIS written in C++ and compiled with g++ • Other algorithms obtained from their authors

Experimental Results TRELLIS vs. TOP-Q and DynaCluster For 200 Mbp, DynaCluster did not terminate even after 8 hours, TRELLIS took 13 min

Experimental Results TRELLIS vs. TDD • TDD uses four different buffers (string, suffix, temp and tree) • 200 Mbp requires only last 2 buffers • Saves additional I/O incurred in other cases

Experimental Results TRELLIS vs. TDD • TDD is built using memory optimized suffix-tree method • Difference is not significant for human genome as TDD needs to be run in 64 bit mode

Experimental Results TRELLIS vs. TDD – Query time • TDD does not store edge lengths; they must be determined by examining the children • An internal node has a pointer to only one child, so all children are scanned linearly for every query

Conclusions • TRELLIS • Solves the data skew problem: variable length prefixes • Scales gracefully for very large sequences • No disk I/O overhead, as it works with suffix trees that are guaranteed to fit in memory • It exhibits faster construction and query times when compared to other disk based algorithms

Future Work • Plan to make TRELLIS applicable to a wider range of alphabets (Ex: the English alphabet) • No buffering strategy is required for the human genome, but start building one for use with a generalized suffix tree composed of many large genomes • Parallelize TRELLIS, since its partitioning and merging steps seem ideally suited
