Genome-scale disk-based suffix tree indexing

Benjarath Phoophakdee, Mohammed J. Zaki. Compiled by: Amit Mahajan, Chaitra Venus.

Presentation Transcript


  1. Genome-scale disk-based suffix tree indexing • Benjarath Phoophakdee, Mohammed J. Zaki • Compiled by: Amit Mahajan, Chaitra Venus

  2. Introduction… • Growth in biological sequence databases • Need for an effective and efficient index structure • Suffix tree supports: • Exact/approximate matching • Database querying • Longest common substrings, etc.

  3. Introduction… • In-memory construction algorithms • Naive construction takes O(n²) time • Linear time and space can be achieved using: • suffix links • edge encoding • skip and count • Problem: they do not scale to large input sequences

  4. Disk-based suffix trees • “A Database Index to Large Biological Sequences” • Abandons suffix links (for better locality of reference) • Partitions the input based on fixed-length prefixes • Partition sizes become uneven because of data skew • Bin packing could balance the partitions, but counting frequencies for long prefixes is expensive • “Practical Suffix Tree Construction” • TDD: similar to the above; also drops suffix links • Reported to scale to the human genome level • Incurs random I/Os when the input string is larger than memory

  5. Disk-based suffix trees • ST-Merge (improvement to TDD) • Splits the input string into smaller contiguous substrings • Applies TDD to each substring and then merges all the trees • Does not have suffix links • TOP-Q and DynaCluster • Only known algorithms that maintain suffix links and do not have the data skew problem • Experiments show that they do not scale to the human genome level

  6. Issue • Problems with disk-based algorithms • Data skew • No suffix links • No scalability • The authors propose a novel disk-based suffix tree algorithm called TRELLIS

  7. TRELLIS • O(n²) time, O(n) space • Idea: • Construct by partitioning and merging • Use variable-length prefixes • Recover suffix links in a separate post-construction phase • Effectively scales up to the human genome level • Can index the entire human genome using 2 GB of memory in 4 hours, and recover suffix links in 2 more hours

  8. TRELLIS • Has 4 phases • Prefix Creation • Partitioning • Merging • Suffix Link Recovery

  9. Prefix Creation Phase • Problems with fixed-length prefixes • Cannot handle data skew • No defined way to compute an appropriate length • TRELLIS uses variable-length prefixes: P = {P_0, P_1, P_2, …, P_(m-1)} • Use a threshold t to determine P such that freq(P_i) ≤ t

  10. Prefix Creation Phase • Multi-scan approach to compute P • In the i-th scan: • Process prefixes up to a certain length L_i (see formula below to calculate L_i) • EP_i = set of prefixes that need further extension in the next scan (as their frequency > t) • Add to P only the shortest prefixes that meet the frequency threshold t, and reject their extensions

  11. Prefix Creation Phase • Ex: With t = 10^6, only two scans were required for the human genome, with L_1 = 8 and L_2 = 16 • The resulting set P contained about 6400 prefixes with lengths in the range 4 to 16
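
The prefix selection on slides 9 to 11 can be illustrated with a small sketch. This is not the authors' multi-scan procedure: it extends prefixes one character per pass instead of up to a length L_i per scan, and it ignores suffixes that are shorter than the prefix that would cover them. The function name `variable_length_prefixes` is hypothetical.

```python
from collections import Counter

def variable_length_prefixes(s, t):
    """Return {prefix: frequency} where every frequency is <= t (assumes t >= 1).

    Simplified illustration: one extra character per pass, rather than
    the multi-scan scheme with lengths L_i described on the slides.
    """
    accepted = {}
    pending = {""}          # prefixes whose frequency still exceeds t
    length = 0
    while pending:
        length += 1
        # Count only the windows that extend a prefix still pending.
        counts = Counter(
            s[i:i + length]
            for i in range(len(s) - length + 1)
            if s[i:i + length - 1] in pending
        )
        pending = set()
        for prefix, freq in counts.items():
            if freq <= t:
                accepted[prefix] = freq   # shortest prefix meeting t
            else:
                pending.add(prefix)       # needs further extension
    return accepted

prefixes = variable_length_prefixes("AAAACGT", 2)
```

On a skewed input such as `"AAAACGT"`, the frequent letter `A` is extended to longer prefixes while the rare letters stay at length 1, which is the behaviour that fixed-length prefixes cannot provide.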

  12. Partitioning Phase • Divide the input string into r consecutive partitions, where r = (n+1) / t • Suffix Subtree T_Ri • Contains the suffixes that start in partition R_i • Built with Ukkonen’s algorithm* • Prefixed Suffix Subtree T_Ri,Pk • Split T_Ri into subtrees that contain only suffixes with prefix P_k • At most m such subtrees • Store these prefixed suffix subtrees on disk • * Proposed in the paper “Online construction of suffix trees” by E. Ukkonen

  13. Partitioning Phase • The T_Ri obtained are implicit suffix trees (i.e. some suffixes end inside internal edges) • To guarantee that T_Ri explicitly contains all suffixes from the i-th partition: • Continue reading characters from the next partition R_(i+1) until t leaves are obtained in T_Ri • Appending a special terminal character is not an option, as it would incur additional overhead during the merging phase
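
The grouping performed in the partitioning phase can be sketched as follows. This is only an illustration of which suffixes end up in which prefixed suffix subtree: it records suffix start positions in plain lists instead of building trees with Ukkonen's algorithm, it assumes the input string already includes any terminal character (so `len(s)` plays the role of n+1), and it rounds r up. The name `partition_suffixes` is hypothetical.

```python
import math

def partition_suffixes(s, prefixes, t):
    """Map (partition index, prefix) -> start positions of matching suffixes.

    `prefixes` is assumed to be prefix-free, so at most one prefix
    matches at any position. Lists stand in for the on-disk subtrees.
    """
    r = math.ceil(len(s) / t)           # number of partitions of size t
    groups = {}
    for i in range(r):
        for start in range(i * t, min((i + 1) * t, len(s))):
            for p in prefixes:
                if s.startswith(p, start):
                    groups.setdefault((i, p), []).append(start)
                    break
    return r, groups

r, groups = partition_suffixes(
    "AAAACGT", ["AAA", "AAC", "AC", "C", "G", "T"], 3
)
```

Each key `(i, p)` corresponds to one prefixed suffix subtree, so a partition produces at most m groups, one per prefix.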

  14. Merging Phase • For each prefix P_k in the set P: • Merge all Prefixed Suffix Subtrees T_Ri,Pk to get the Prefixed Suffix Tree T_Pk • This yields m Prefixed Suffix Trees • Store the resulting trees back on disk
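
As a rough analogy for the merge step, each prefixed suffix subtree can be stood in for by a list of suffix start positions sorted by suffix, and merging the per-partition lists for one prefix then gives the sorted suffix list for that prefix. This swaps the tree merge described on the slide for a sorted-list merge, so it shows the data flow only, not the authors' algorithm. The name `merge_prefixed_subtrees` is hypothetical.

```python
import heapq

def merge_prefixed_subtrees(s, subtrees):
    """Merge per-partition suffix lists that share one prefix.

    Each element of `subtrees` is a list of start positions already
    sorted by the suffix s[start:]; the result is sorted the same way.
    """
    return list(heapq.merge(*subtrees, key=lambda start: s[start:]))

s = "banana$"
# Suffixes with prefix "a", split over two partitions of size 3:
# partition 0 holds position 1, partition 1 holds positions 5 and 3.
merged_a = merge_prefixed_subtrees(s, [[1], [5, 3]])
```

Because every prefix is merged independently, only the data for one prefix has to be in memory at a time, which is the property the slides rely on.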

  15. Suffix Link Recovery Phase • Why? • Suffix links are crucial for the efficiency of many fast string processing algorithms • Why in a separate phase? • TRELLIS may discard all suffix link information during the merge phase, as new internal nodes are created and some old ones are deleted • It is useful to discard suffix link information after partitioning, as this reduces the amount of data per node • Recovering links from scratch takes the same time as keeping the original link information

  16. Suffix Link Recovery Phase • TRELLIS recovers the suffix links of one Prefixed Suffix Tree at a time • Start with the children of the root • Proceeding depth-first, do the following for each internal node x: • Locate its parent p(x) and the parent's suffix link sl(p(x)) • Skip/count down from sl(p(x)) to locate sl(x); when found, add the link • Repeat recursively for all children of x
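
The recovery procedure above can be sketched on a small in-memory tree. The sketch assumes a single suffix tree built naively (quadratic time) for a string ending in a unique terminator, rather than TRELLIS's per-prefix trees on disk; `Node`, `build_suffix_tree`, `recover_suffix_links` and `path_label` are hypothetical names.

```python
class Node:
    def __init__(self, parent=None, edge=""):
        self.parent = parent
        self.edge = edge       # label on the edge from parent to this node
        self.children = {}     # first character of child edge -> child
        self.link = None       # suffix link, filled in by recovery

def build_suffix_tree(s):
    """Naive construction; s must end with a unique terminator."""
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while rest:
            child = node.children.get(rest[0])
            if child is None:
                node.children[rest[0]] = Node(node, rest)   # new leaf
                break
            k = 0
            while k < len(child.edge) and rest[k] == child.edge[k]:
                k += 1
            if k < len(child.edge):                          # split the edge
                mid = Node(node, child.edge[:k])
                node.children[rest[0]] = mid
                child.edge, child.parent = child.edge[k:], mid
                mid.children[child.edge[0]] = child
                child = mid
            node, rest = child, rest[k:]
    return root

def recover_suffix_links(root):
    """Depth-first recovery: parents are linked before their children."""
    root.link = root
    stack = [c for c in root.children.values() if c.children]
    while stack:
        x = stack.pop()
        # Below the root, the first character of the path label is dropped.
        beta = x.edge[1:] if x.parent is root else x.edge
        node = x.parent.link                 # start from sl(p(x))
        while beta:                          # skip/count: jump whole edges
            node = node.children[beta[0]]
            beta = beta[len(node.edge):]
        x.link = node
        stack.extend(c for c in x.children.values() if c.children)

def path_label(node):
    parts = []
    while node.parent is not None:
        parts.append(node.edge)
        node = node.parent
    return "".join(reversed(parts))

def internal_nodes(root):
    found, stack = [], [root]
    while stack:
        n = stack.pop()
        for c in n.children.values():
            if c.children:
                found.append(c)
                stack.append(c)
    return found

tree = build_suffix_tree("banana$")
recover_suffix_links(tree)
links = {path_label(x): path_label(x.link) for x in internal_nodes(tree)}
```

The skip/count loop never compares edge characters beyond the first: because sl(x) is known to exist, it only needs edge lengths to find it.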

  17. Choosing t • Note: t is also the threshold for partition size • M ≥ n/4 + ((0.7 × 40) + 16)·t + (0.7 × 40)·t • M = available main memory • n/4 = memory for the input (in compressed form) • # internal nodes = 0.7 × (# external nodes) • 40 and 16 are the sizes of internal and external nodes
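
The memory constraint can be solved for the largest admissible t. The sketch below assumes that M and the node sizes are in bytes and that the human genome has roughly 3 × 10^9 bases; neither assumption is stated on the slide. The function name `max_threshold` is hypothetical.

```python
def max_threshold(memory, n, internal_size=40, leaf_size=16):
    """Largest t satisfying
    M >= n/4 + (0.7*internal + leaf)*t + (0.7*internal)*t.

    The 0.7 ratio is kept as 7/10 so the arithmetic stays in integers.
    """
    input_size = -(-n // 4)                  # ceil(n / 4), compressed input
    budget = memory - input_size
    if budget <= 0:
        raise ValueError("input alone does not fit in memory")
    # Ten times the memory needed per unit of t (720 with the defaults).
    per_t_tenths = 7 * internal_size + 10 * leaf_size + 7 * internal_size
    return budget * 10 // per_t_tenths

# Assumed figures: 2 GB of memory, about 3 * 10**9 bases.
t_max = max_threshold(2 * 2**30, 3 * 10**9)
```

With the default node sizes the formula reduces to M ≥ n/4 + 72·t, so the bound is simply (M − n/4) / 72 rounded down.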

  18. Computational Complexity • Prefix Creation Phase • O(nL) time, where L = longest prefix length • O(n + |Σ^(L+1)|) space • Partitioning Phase • Input is broken into r partitions, each of size t • O(t) time/space for each partition, so r × O(t) = O(n) in total • Disk I/Os: O(r × m), since at most m prefixed suffix subtrees can be created for each partition

  19. Computational Complexity • Merging Phase • Each merge operation can be O(p), where p = length of the longest common prefix • Across all prefixes, merging is O(p × n), since the number of nodes in a suffix tree is bounded by n • In the worst case p can be O(n), therefore merging is O(n²) • Disk I/Os: O(r × m)

  20. Computational Complexity • Suffix Link Recovery Phase • The number of internal nodes in the final suffix trees is O(n) • Constant number of operations for each suffix link recovery • Putting it all together… • O(n²) time, since the merge phase is the most expensive • O(n) space

  21. Experimental Setup • Compared to • TOP-Q and DynaCluster (maintain suffix links) • TDD (no suffix links) • Performed on Linux with • 2 GB RAM for human genome and 512 MB for others • 288 GB disk space • TRELLIS written in C++ and compiled with g++ • Other algorithms obtained from their authors

  22. Experimental Results TRELLIS vs. TOP-Q and DynaCluster For 200 Mbp, DynaCluster did not terminate even after 8 hours, TRELLIS took 13 min

  23. Experimental Results TRELLIS vs. TDD • TDD uses four different buffers (string, suffix, temp and tree) • 200 Mbp requires only last 2 buffers • Saves additional I/O incurred in other cases

  24. Experimental Results TRELLIS vs. TDD • TDD is built using memory optimized suffix-tree method • Difference is not significant for human genome as TDD needs to be run in 64 bit mode

  25. Experimental Results TRELLIS vs. TDD – Query time • TDD does not store edge lengths; they are determined by examining the children • An internal node has a pointer to only one child, so all children must be scanned linearly for every query

  26. Conclusions • TRELLIS • Solves the data skew problem with variable-length prefixes • Scales gracefully to very large sequences • No disk I/O overhead, as it works with suffix trees that are guaranteed to fit in memory • Exhibits faster construction and query times than other disk-based algorithms

  27. Future Work • Plan to make TRELLIS applicable to a wider range of alphabets (e.g. the English alphabet) • No buffering strategy was required for the human genome, but one is planned for building a generalized suffix tree composed of many large genomes • Parallelize TRELLIS, since its partitioning and merging steps seem ideally suited to it
