Genome-scale Disk-based Suffix Tree Indexing


  1. Genome-scale Disk-based Suffix Tree Indexing Phoophakdee and Zaki

  2. Outline • Suffix Tree introduction • Application in Bioinformatics • Trellis • Trellis performance • Conclusion

  3. Example Suffix Tree • Sequence • ACGACG$ • What are Suffix Links
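As a rough illustration of the example above: in a suffix tree, an internal node whose path label is xα (x a single character) has a suffix link to the node whose path label is α. The sketch below finds the branching nodes of the tree for ACGACG$ by brute force and derives their links. The function names are hypothetical, and the quadratic enumeration is for illustration only; it is not how a linear-time builder works.

```python
from collections import defaultdict

def internal_nodes(text):
    """Path labels of the branching (internal) nodes of the suffix tree of text.

    Brute force: a substring is a branching node when it is followed by
    at least two distinct characters somewhere in the text.
    """
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i, n):
            followers[text[i:j]].add(text[j])
    return {label for label, nxt in followers.items() if len(nxt) >= 2}

def suffix_links(nodes):
    """Map each non-root internal node to the node reached by dropping its first character."""
    return {label: label[1:] for label in nodes if label}

nodes = internal_nodes("ACGACG$")
links = suffix_links(nodes)
print(sorted(nodes))  # ['', 'ACG', 'CG', 'G']  (the empty string is the root)
print(links)
```

For this sequence the links form a chain ACG → CG → G → root, which is what lets a linear-time builder jump from one suffix to the next without restarting from the root.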

  4. Suffix tree runtime • Time complexity • Construction of suffix tree: • O(n) time and space, where n is the length of the text being indexed • Substring search: • O(m) time, where m is the length of the search pattern • Compare with Knuth-Morris-Pratt and Boyer-Moore, which preprocess the pattern rather than the text and must scan the text on every query
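A minimal sketch of why a query costs O(m): once the text is indexed, a search only walks down from the root, one step per pattern character. For brevity this uses an uncompressed suffix trie built naively in O(n^2) time and space, not a linear-time suffix tree; `build_suffix_trie` and `contains` are hypothetical names.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (naive, O(n^2))."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Return True if pattern occurs in the indexed text; takes O(len(pattern)) steps."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("ACGACG$")
print(contains(trie, "GAC"))  # True
print(contains(trie, "GG"))   # False
```

The query time does not depend on the text length, which is the contrast with pattern-preprocessing methods that rescan the text for each new pattern.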

  5. Application in Bioinformatics • Database search • Exact matching • Approximate matching* • Longest common substring • Genome alignment* • Structural motifs* • Tandem repeats* • Sequence comparison

  6. Problems with Genome-scale suffix trees • Efficient O(n) suffix tree construction algorithms exist (e.g. Ukkonen’s algorithm) • But the tree must fit entirely in main memory • Genomes are very large • Human genome is 3 Gbp (0.75 GB at 2 bits per base) • The suffix tree no longer fits in memory
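The 0.75 GB figure follows from packing each of the four DNA bases into 2 bits. The sketch below reproduces that arithmetic and then estimates the tree size under an assumed overhead of 10 bytes per indexed base. That overhead is an illustrative assumption, not a number from the slides; real implementations vary.

```python
def packed_size_gb(num_bases, bits_per_base=2):
    """Size of the raw sequence in GB when each base is packed into bits_per_base bits."""
    return num_bases * bits_per_base / 8 / 1e9

def tree_size_gb(num_bases, bytes_per_base):
    """Rough suffix tree size in GB for a given (assumed) per-base overhead."""
    return num_bases * bytes_per_base / 1e9

human_genome_bases = 3_000_000_000  # 3 Gbp, as stated on the slide

print(packed_size_gb(human_genome_bases))    # 0.75
print(tree_size_gb(human_genome_bases, 10))  # 30.0 (with the assumed 10 bytes per base)
```

Even with a modest per-base overhead, the index is tens of times larger than the packed sequence, which is why a purely in-memory construction stops being practical at this scale.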

  7. What Trellis solves • Prevents data skew in prefix partitioning • Bad data skew leads to prefix partitions that may not fit into memory • Skew comes from the non-uniform distribution of the DNA alphabet • Efficient disk-based implementation • Functions under low memory constraints • Efficient disk I/O usage • Able to recover suffix links
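To make the skew problem concrete, the sketch below groups suffixes by a fixed-length prefix and counts how many land in each group. The input string and the function name `prefix_partition_sizes` are made up for illustration.

```python
from collections import Counter

def prefix_partition_sizes(text, k):
    """Count how many suffixes of text start with each length-k prefix."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

# A deliberately skewed toy sequence: mostly A's.
sizes = prefix_partition_sizes("AAAAAAAACG", 2)
print(sizes)  # Counter({'AA': 7, 'AC': 1, 'CG': 1})
```

One partition holds most of the suffixes while the others are nearly empty, so the size of the largest partition, not the average, decides whether the work fits in memory.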

  8. Trellis Steps • Prefix Creation Phase • Partitioning Phase • Merging Phase • Suffix Link Recovery Phase (Optional)

  9. Trellis Overview

  10. Merging Phase

  11. Threshold (t) • Determines how the sequence is partitioned • Each partition's suffix subtree fits into memory during the partitioning phase • Determines the cutoff for prefix set inclusion • Each recombined per-prefix suffix subtree fits entirely into memory during the merging phase • Allows the input string and two sets of internal nodes to fit entirely into memory during the suffix link recovery phase
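One way to picture the prefix cutoff is the sketch below: start with length-1 prefixes, accept any prefix that occurs at most t times, and extend the rest by one character until every prefix is under the threshold. This is a simplified illustration under stated assumptions (the text ends in a unique terminator such as $, and t ≥ 1); it is not the authors' implementation, and `variable_length_prefixes` is a hypothetical name.

```python
from collections import Counter

def variable_length_prefixes(text, t):
    """Pick variable-length prefixes so that each occurs at most t times in text.

    Assumes text ends with a unique terminator and t >= 1, so every suffix
    ends up under exactly one accepted prefix.
    """
    assert t >= 1
    accepted = {}
    pending = {""}  # prefixes that are still too frequent and must be extended
    length = 0
    while pending:
        length += 1
        # Count only the extensions of prefixes that were over the threshold.
        counts = Counter(
            text[i:i + length]
            for i in range(len(text) - length + 1)
            if text[i:i + length - 1] in pending
        )
        pending = set()
        for prefix, count in counts.items():
            if count <= t:
                accepted[prefix] = count
            else:
                pending.add(prefix)
    return accepted

prefixes = variable_length_prefixes("AAAAAAAACG$", t=3)
print(prefixes)
```

Frequent regions get longer prefixes and rare ones get shorter prefixes, so no single partition exceeds the threshold even when the sequence is heavily skewed.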

  12. Trellis Overview

  13. Performance • O(n^2) time and O(n) space (where n is sequence length) • Comparison to TDD • Currently the only other algorithm that scales up to genome level • Same time complexity • TDD does not compute suffix links

  14. Suffix Tree Construction

  15. Query Times

  16. Query Times

  17. Conclusion • Efficient disk-based suffix tree generation that works well with limited memory • Suffix links are recoverable • Future work • Extend to larger alphabets • Buffer input sequence • Parallelize partitioning and merging
