
Fine Tuning the Enhanced Suffix Arrays

Ayat A. Dawood, CIS, Nile University. Joint work with: Mohamed AbouelHoda. Table of contents: suffix array; the enhanced suffix array; our accomplishment (minimal perfect hashing function, the exact pattern matching problem).

Presentation Transcript


  1. Fine Tuning the Enhanced Suffix Arrays • Ayat A. Dawood, CIS, Nile University • Joint work with: Mohamed AbouelHoda

  2. Table of Contents • Suffix array • The enhanced suffix array • Our accomplishment: • Minimal perfect hashing function • The exact pattern matching problem • Improving the bucket table representation

  3. Suffix array • An array of integers in the range 0 to n that lists the starting positions of the (n+1) suffixes of S$ in lexicographic order, where n is the length of S. • e.g., S$ = acaaacatat$ (n = 10)
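The definition on this slide can be sketched in a few lines of Python. This is a naive construction for illustration only, not the construction used by the authors. It assumes that the sentinel `$` sorts after every other character, which is consistent with the interval boundaries shown on the later slides; the function name `suffix_array` is hypothetical.

```python
def suffix_array(text):
    """Return suftab: starting positions of all suffixes in sorted order.

    Assumption: `text` ends with the sentinel '$', and '$' is treated as
    larger than every other character.
    """
    def key(start):
        # Map '$' above the largest Unicode code point so it sorts last.
        return [0x110000 if ch == "$" else ord(ch) for ch in text[start:]]

    return sorted(range(len(text)), key=key)


suftab = suffix_array("acaaacatat$")
print(suftab)  # [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
```

Sorting whole suffixes costs far more than the specialised construction algorithms, so this is only suitable for small examples like the one on the slide.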

  5. Enhanced suffix array • It is the suffix array (suftab) enhanced with a set of additional tables. • With these tables, algorithms that normally need a suffix tree can run on the array with the same time complexity. • lcptab[i] stores the length of the longest common prefix of the suffixes starting at suftab[i-1] and suftab[i].
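As a small illustration of the lcptab definition, the sketch below computes the table by direct character comparison for the running example. The helper name `lcp_table` is hypothetical, `lcptab[0]` is set to 0 by convention, and the `suftab` values assume that `$` sorts last.

```python
def lcp_table(text, suftab):
    """lcptab[i] = length of the longest common prefix of the suffixes
    starting at suftab[i-1] and suftab[i]; lcptab[0] is 0 by convention."""
    def lcp(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k

    return [0] + [lcp(suftab[i - 1], suftab[i]) for i in range(1, len(suftab))]


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]  # assumes '$' sorts last
lcptab = lcp_table(text, suftab)
print(lcptab)  # [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
```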

  6. Enhanced suffix array: l-interval • l-interval (lcp-interval): an interval [i..j] of the suffix array, written l-[i..j], whose suffixes share a longest common prefix of length l. • Figure: the interval 1-[0..5].

  7. Enhanced suffix array: l-interval • Figure: the interval 1-[0..5] with its child 2-[0..1], reached by the character a.

  8. Enhanced suffix array: l-interval • Figure: the full lcp-interval tree. The root 0-[0..10] has children 1-[0..5] (a), 2-[6..7] (c), and 1-[8..9] (t). The interval 1-[0..5] has children 2-[0..1] (a), 3-[2..3] (c), and 2-[4..5] (t).
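The intervals in the figure can be recovered from lcptab with a single left-to-right scan and a stack. The sketch below is one common way to do this, written for illustration; the function name `lcp_intervals` is hypothetical, and the input is the lcp table of the running example under the assumption that `$` sorts last.

```python
def lcp_intervals(lcptab):
    """Enumerate all lcp-intervals as (l, lb, rb) triples, children first."""
    n = len(lcptab)
    found = []
    stack = [(0, 0)]  # (lcp value, left boundary) of currently open intervals
    for i in range(1, n):
        lb = i - 1
        # Close every open interval whose lcp value exceeds lcptab[i].
        while lcptab[i] < stack[-1][0]:
            l, lb = stack.pop()
            found.append((l, lb, i - 1))
        # Open a new interval if the lcp value increased.
        if lcptab[i] > stack[-1][0]:
            stack.append((lcptab[i], lb))
    # Whatever is still open extends to the end of the array.
    while stack:
        l, lb = stack.pop()
        found.append((l, lb, n - 1))
    return found


lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
for l, lb, rb in lcp_intervals(lcptab):
    print(f"{l}-[{lb}..{rb}]")
```

For the example this yields exactly the seven intervals of the figure: 2-[0..1], 3-[2..3], 2-[4..5], 1-[0..5], 2-[6..7], 1-[8..9], and 0-[0..10].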

  9. Our accomplishment • Improvement (fine tuning): • Alphabet-independent exact pattern matching. • Improving the bucket table representation. • Improving access to the lcp-table. • The improvements are achieved using minimal perfect hashing techniques.

  10. Minimal perfect hashing (MPHF) • A minimal perfect hash function stores n static keys from a universe U in O(n) space with O(1) access time [Botelho et al.]. • A plain lookup table needs O(|U|) space to achieve constant access time.
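To make the idea concrete, here is a toy minimal perfect hash built by brute-force seed search. It is not the construction of Botelho et al. and only works for very small key sets; it merely shows the defining property that n keys map to the n slots 0..n-1 without collisions. The names `slot_of` and `build_mphf` are hypothetical, and the keys are arbitrary short strings chosen for illustration.

```python
import hashlib


def slot_of(key, seed, n):
    """Hash `key` with the given seed into one of n slots."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little") % n


def build_mphf(keys, max_seed=100_000):
    """Try seeds until every key lands in its own slot (toy approach)."""
    n = len(keys)
    for seed in range(max_seed):
        if len({slot_of(k, seed, n) for k in keys}) == n:
            return seed
    raise ValueError("no collision-free seed found")


keys = ["aa", "ac", "at", "ca", "ta"]
seed = build_mphf(keys)
slots = {k: slot_of(k, seed, len(keys)) for k in keys}
print(sorted(slots.values()))  # [0, 1, 2, 3, 4]
```

Only the seed has to be stored, plus an array of n values indexed by slot. A key outside the original set still hashes to some slot, so a lookup has to verify the stored entry when such keys can occur.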

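  11. Exact pattern matching problem • e.g., pattern = aca • Figure: the lcp-interval tree. The root 0-[0..10] has children 1-[0..5] (a), 2-[6..7] (c), and 1-[8..9] (t). The interval 1-[0..5] has children 2-[0..1] (a), 3-[2..3] (c), and 2-[4..5] (t).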

  15. Exact pattern matching problem • The naive method takes O(nm) time, for a text of length n and a pattern of length m. • Using the enhanced suffix array, it can be solved in O(|Σ|m) time [AbouElHoda et al.]. • Other modifications to the enhanced suffix array allow it to be solved in O(m log |Σ|) time [Kim et al.], [Fischer et al.].

  16. Exact pattern matching problem • Our work: • Using minimal perfect hashing, it can be solved in O(m) time, removing the alphabet factor. • Figure: the lcp-interval tree annotated with two "MPHF table" labels. The root 0-[0..10] has children 1-[0..5] (a), 2-[6..7] (c), and 1-[8..9] (t). The interval 1-[0..5] has children 2-[0..1] (a), 3-[2..3] (c), and 2-[4..5] (t).
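The following sketch illustrates why a constant-time child lookup removes the alphabet factor: each step of the top-down search asks a hash table for the child interval that starts with the next pattern character. A Python `dict` stands in for the MPHF tables here, and the child table is built by a simple recursive grouping rather than by the authors' method. The names `build_child_table` and `find` are hypothetical, and the arrays assume that `$` sorts last.

```python
from itertools import groupby

TEXT = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def build_child_table(text, suftab, lcptab):
    """Map (lb, rb, char) -> (child_lb, child_rb, child_depth)."""
    table = {}

    def depth_of(lb, rb):
        # lcp value of an interval; a single suffix counts with its full length.
        return min(lcptab[lb + 1:rb + 1]) if lb < rb else len(text) - suftab[lb]

    def visit(lb, rb):
        if lb == rb:
            return
        l = depth_of(lb, rb)
        # Suffixes are sorted, so equal characters at offset l are contiguous.
        for ch, group in groupby(range(lb, rb + 1), key=lambda k: text[suftab[k] + l]):
            ranks = list(group)
            table[(lb, rb, ch)] = (ranks[0], ranks[-1], depth_of(ranks[0], ranks[-1]))
            visit(ranks[0], ranks[-1])

    visit(0, len(suftab) - 1)
    return table


def find(pattern, text, suftab, table):
    """Return the suffix-array interval (lb, rb) of all matches, or None."""
    lb, rb, matched = 0, len(suftab) - 1, 0
    while matched < len(pattern):
        child = table.get((lb, rb, pattern[matched]))  # one lookup per step
        if child is None:
            return None
        lb, rb, depth = child
        end = min(depth, len(pattern))
        start = suftab[lb]
        if text[start + matched:start + end] != pattern[matched:end]:
            return None
        matched = end
    return lb, rb


table = build_child_table(TEXT, SUFTAB, LCPTAB)
lb, rb = find("aca", TEXT, SUFTAB, table)
print(lb, rb, SUFTAB[lb:rb + 1])  # 2 3 [0, 4]
```

For the pattern aca the search follows a to 1-[0..5] and then c to 3-[2..3], so the pattern occurs at text positions 0 and 4. Every pattern character is compared once and every step needs one table lookup, which gives O(m) when the lookup takes constant time.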

  19. Improving the bucket table representation • Bucket table: a lookup table, indexed by the first d characters of a pattern, that is used to jump directly to the corresponding lcp-interval in the suffix array.

  21. Improving the bucket table representation (cont.) • Problem: • The space consumption of the lookup table is prohibitive for large d and Σ, because it has |Σ|^d entries. • Solution: • Use minimal perfect hashing techniques to store the lookup table.
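A minimal sketch of the space argument, assuming a bucket is identified by the first d characters of a suffix: instead of reserving one slot for each of the |Σ|^d possible prefixes, only the prefixes that actually occur in the text are stored. A Python `dict` stands in for the MPHF, the function name `bucket_table` is hypothetical, and the example reuses the suffix array of the running example with `$` sorting last.

```python
def bucket_table(text, suftab, d):
    """Map each d-character prefix that occurs in the text to its
    suffix-array interval (lb, rb). Prefixes touching '$' are skipped."""
    buckets = {}
    for rank, start in enumerate(suftab):
        prefix = text[start:start + d]
        if len(prefix) < d or "$" in prefix:
            continue
        lb, _ = buckets.get(prefix, (rank, rank))
        buckets[prefix] = (lb, rank)  # suffixes are sorted, so ranks are contiguous
    return buckets


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
buckets = bucket_table(text, suftab, 2)
print(buckets)
# {'aa': (0, 1), 'ac': (2, 3), 'at': (4, 5), 'ca': (6, 7), 'ta': (8, 8)}

alphabet_size = len(set(text) - {"$"})
print(len(buckets), "stored entries versus", alphabet_size ** 2, "dense slots")
```

In the toy example 5 of the 9 possible two-character prefixes occur. The gap grows quickly with d: for a four-letter DNA alphabet and d = 12, a dense table would need 4^12 = 16,777,216 slots regardless of how many prefixes are present.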

  22. Improving the bucket table representation (cont.) • Results: • For the E. coli bacterial genome (size = 5400 bp) and d = 12. [Results table not preserved in the transcript.] • *N stands for an undefined nucleotide or dummy character.

  23. Conclusion • Minimal perfect hashing improves the enhanced suffix array in three places: • Alphabet-independent exact pattern matching. • The bucket table representation. • Access to the lcp-table.

  24. Questions?

  25. Improving access to the lcp-table • To reduce space, each lcp-table entry is stored in 1 byte. • If a common prefix is longer than 255, its length is stored in a separate extra table. • This extra table is accessed either sequentially or by binary search. • Our enhancement: • Use an MPHF to store the extra table so that it can be accessed in constant time. • Figure: example lcp values 0, 2, 257, 279, 3, 300, 2, 260, 0, split between the one-byte lcp-table and the extra lcp-table.
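The layout on this slide can be sketched as follows. The code assumes that the byte value 255 is reserved as an overflow marker, so every value of 255 or more goes to the extra table; the slide does not spell out this detail. A Python `dict` keyed by position stands in for the MPHF-backed extra table, the names `pack_lcp` and `get_lcp` are hypothetical, and the values are the ones that appear in the slide's figure.

```python
OVERFLOW = 255  # assumed marker: the true value lives in the extra table


def pack_lcp(lcp_values):
    """Split lcp values into a one-byte table and an overflow table."""
    small = bytearray()
    extra = {}  # position -> true lcp value (stand-in for an MPHF table)
    for i, value in enumerate(lcp_values):
        if value >= OVERFLOW:
            small.append(OVERFLOW)
            extra[i] = value
        else:
            small.append(value)
    return small, extra


def get_lcp(small, extra, i):
    """Constant-time access: one byte read, plus one hash lookup on overflow."""
    value = small[i]
    return extra[i] if value == OVERFLOW else value


values = [0, 2, 257, 279, 3, 300, 2, 260, 0]
small, extra = pack_lcp(values)
print(list(small))  # [0, 2, 255, 255, 3, 255, 2, 255, 0]
print(extra)        # {2: 257, 3: 279, 5: 300, 7: 260}
```

With a sorted list of (position, value) pairs the overflow lookup would need a binary search, whereas a hash keyed by position answers it in constant time.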
