
Linear-time construction of CSA using o(n log n)-bit working space for large alphabets

Joong Chae Na, School of Computer Sci. & Eng., Seoul National University, Korea.



  1. Linear-time construction of CSA using o(n log n)-bit working space for large alphabets • Joong Chae Na, School of Computer Sci. & Eng., Seoul National University, Korea

  2. Overview • Background • Suffix arrays (SA) • Compressed suffix arrays (CSA) • Problem definition • Previous works • Our contributions • Description of our algorithm • Conclusions

  3. Background (1) Given a string T of length n over an alphabet Σ, • Suffix array (SA) of T [Manber & Myers ’93] • Lexicographically sorted list of the suffixes of T • Space requirement of O(n log n) bits • Example: T = b a b a a b b a $
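The suffix array of the example string can be reproduced with a few lines of Python. This is only a naive illustration (sort all suffixes directly), not any of the construction algorithms discussed in the talk; the function name `suffix_array` is made up for this sketch, and it assumes the terminator `$` sorts before every other character, which holds for ASCII letters.

```python
def suffix_array(text):
    """Return the 1-based suffix array of `text` by sorting all suffixes.

    Naive construction for illustration only: it is neither linear-time
    nor space-efficient.
    """
    n = len(text)
    # Sort the starting positions 1..n by the suffix that begins there.
    return sorted(range(1, n + 1), key=lambda k: text[k - 1:])


T = "babaabba$"      # '$' is smaller than 'a' and 'b' in ASCII
SA = suffix_array(T)
print(SA)            # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```

Each entry is a starting position in T, so the array holds n numbers of about log n bits each, which is where the O(n log n)-bit figure comes from.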

  4. Background (2) • Compressed suffix array (CSA) [Grossi & Vitter ’00] • Compressed version of SA • Space requirement of O(n log|Σ|) bits • FM-index [Ferragina & Manzini 2000] • Space requirement of O(n log|Σ|) bits • Example: T = b a b a a b b a $

  5. Problem definition • Constructing SA, CSA and FM-index using • o(n log n) time and • o(n log n)-bit working space • Working space • Temporary space required for executing an algorithm • Not including the space for the input and output

  6. Related works • Constructing SA and CSA ※ O(n log n)-bit working space • Manber & Myers [1993]: O(n log n) time • Kim et al. [2003]: O(n) time • Kärkkäinen & Sanders [2003]: O(n) time • Ko & Aluru [2003]: O(n) time ※ O(n log|Σ|)-bit working space • Lam et al. [COCOON 2002]: O(|Σ| n log n) time • Hon et al. [ISAAC 2003]: O(n log n) time • None of these algorithms satisfies both the time and the space requirement of our problem.

  7. Previous results • Hon et al. [FOCS 2003] • An algorithm using O(n log log|Σ|) time and O(n log|Σ|)-bit working space • The first algorithm using o(n log n) time and o(n log n)-bit working space • Following ½-recursion (the odd-even scheme)

  8. Our contributions • Another algorithm using o(n log n) time and o(n log n)-bit working space • O(n) time and O(n log|Σ| · (log_|Σ| n)^α)-bit working space • α = log_3 2 ≈ 0.63 • The first alphabet-independent linear-time algorithm for constructing SA, CSA, and FM-index using o(n log n)-bit working space • Following ⅔-recursion (the skew scheme)

  9. Hon et al. vs. our results • Note: the encoding step is the most complex and time-consuming step in ⅔-recursion. However, neither algorithm needs the encoding step.

  10. Description of our algorithm

  11. Overview • Preliminaries • Basic definitions and notations • Main technique • Outline of our algorithm

  12. Preliminaries - Ψ function • Let T[k..n] be the lexicographically ith smallest suffix of T • SA[i] = k • Ψ[i] = the position in SA where T[k+1..n] is stored • Example: T = b a b a a b b a $ (positions 1 to 9)
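As a rough illustration of this definition, the sketch below computes Ψ for the example string by first building SA naively. The names `suffix_array` and `psi_function` are made up here. The slide does not say what Ψ should be for the last suffix T[n..n]; the code assumes the common convention of wrapping around to T[1..n], and that choice is an assumption of this sketch.

```python
def suffix_array(text):
    """1-based suffix array by naive sorting (illustration only)."""
    return sorted(range(1, len(text) + 1), key=lambda k: text[k - 1:])


def psi_function(text):
    """Return Psi as a 1-based list: psi[i-1] is the SA position of T[SA[i]+1..n].

    Assumption (not stated on the slide): for SA[i] = n the "next suffix"
    wraps around to T[1..n].
    """
    sa = suffix_array(text)
    n = len(sa)
    # rank[k] = position in SA (1-based) where suffix T[k..n] is stored.
    rank = {k: i for i, k in enumerate(sa, start=1)}
    # k % n + 1 is k+1 for k < n and 1 for k = n (the assumed wrap-around).
    return [rank[k % n + 1] for k in sa]


print(psi_function("babaabba$"))   # [8, 1, 5, 7, 9, 2, 3, 4, 6]
```

Within the block of suffixes that start with the same character, the Ψ values are increasing (1, 5, 7, 9 for 'a' and 2, 3, 4, 6 for 'b' in this example).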

  13. Preliminaries - Lemmas, Hon et al. [FOCS 2003] • Text, Ψ → SA, CSA • O(n) time, O(n log|Σ|)-bit working space • Text, Ψ → C array (BWT) → FM-index • O(n) time, O(n log|Σ|)-bit working space • Note: the remaining goal is Text → Ψ
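To see why Ψ is enough to recover SA, here is a minimal sketch that walks Ψ once through the text order. It is not the space-efficient procedure of Hon et al. (it stores SA in plain form), and it relies on two assumptions: the terminator is unique and smallest, so suffix T[n..n] sits at SA position 1, and Ψ wraps around from T[n..n] to T[1..n]. The function name `sa_from_psi` is hypothetical.

```python
def sa_from_psi(psi):
    """Rebuild the 1-based suffix array from a 1-based Psi list.

    Assumes SA[1] = n (unique, smallest terminator) and that Psi of that
    entry points to the position of T[1..n] (wrap-around convention).
    """
    n = len(psi)
    sa = [0] * n
    i = psi[0]                # SA position of T[1..n]
    for k in range(1, n + 1):
        sa[i - 1] = k         # suffix T[k..n] is stored at position i
        i = psi[i - 1]        # move to the position of T[k+1..n]
    return sa


psi = [8, 1, 5, 7, 9, 2, 3, 4, 6]   # Psi of T = babaabba$ under the assumptions above
print(sa_from_psi(psi))             # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```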

  14. Basic def. and not. (1) • Residue-1 suffixes of T • T[3i-2..n] for 1 ≤ i ≤ n/3 • T[1..n], T[4..n], T[7..n],… • Residue-2 suffixes of T • T[3i-1..n] for 1 ≤ i ≤ n/3 • T[2..n], T[5..n], T[8..n],… • Residue-3 suffixes of T • T[3i..n] for 1 ≤ i ≤ n/3 • T[3..n], T[6..n], T[9..n],…
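The three residue classes can be listed directly from the definitions above. The helper name `residue_suffix_starts` is made up for this sketch; positions are 1-based as on the slides.

```python
def residue_suffix_starts(n):
    """Map each residue class r in {1, 2, 3} to the start positions r, r+3, r+6, ... <= n."""
    return {r: list(range(r, n + 1, 3)) for r in (1, 2, 3)}


print(residue_suffix_starts(9))
# {1: [1, 4, 7], 2: [2, 5, 8], 3: [3, 6, 9]}
```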

  15. Basic def. and not. (2) • T is a string over the alphabet Σ • T12[1..⅔n] = T[1..n] T[2..n]T[1] • length: ⅔n, alphabet: Σ^3 • SA12: suffix array of T12 • T3[1..⅓n] = T[3..n]T[1]T[2] • length: ⅓n, alphabet: Σ^3 • SA3: suffix array of T3
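A small sketch of how T12 and T3 could be formed for the running example, reading each symbol of the new strings as a triple of characters of T. The helper names `triples` and `build_t12_t3` are made up, and the code assumes that n is a multiple of 3 (the slides do not say how other lengths are padded).

```python
def triples(s):
    """Split s into consecutive blocks of three characters."""
    return [s[i:i + 3] for i in range(0, len(s), 3)]


def build_t12_t3(text):
    """Return (T12, T3) as lists of character triples.

    Assumption of this sketch: len(text) is a multiple of 3.
    """
    assert len(text) % 3 == 0, "sketch assumes n is a multiple of 3"
    # T12 = T[1..n] followed by T[2..n]T[1], both read three characters at a time.
    t12 = triples(text) + triples(text[1:] + text[0])
    # T3 = T[3..n]T[1]T[2], read three characters at a time.
    t3 = triples(text[2:] + text[:2])
    return t12, t3


t12, t3 = build_t12_t3("babaabba$")
print(t12)   # ['bab', 'aab', 'ba$', 'aba', 'abb', 'a$b']
print(t3)    # ['baa', 'bba', '$ba']
```

For n = 9 this gives 6 = ⅔n symbols in T12 and 3 = ⅓n symbols in T3, matching the lengths on the slide.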

  16. Main technique - Ψ’ function • Ψ’ is just like Ψ, but Ψ’ is defined on SA12 and SA3 • Ψ’ points to the position in SA12 or SA3 where T[k+1..n] (the next suffix after the current suffix T[k..n]) is stored. ※ Note that Ψ’ is not the Ψ function of T12 and T3. • The Ψ’ function consists of Ψ’T12 and Ψ’T3

  17. Ψ’ function (residue-1) • Ψ’T12 (residue-1 suffixes of T) • Let T[3k-2..n] be the suffix stored in SA12[i]. • Then Ψ’T12[i] is the position in SA12 where the next suffix T[3k-1..n] is stored. • Ψ’T12 (residue-2 suffixes of T) • Let T[3k-1..n] be the suffix stored in SA12[i]. • Then Ψ’T12[i] is the position in SA3 where the next suffix T[3k..n] is stored. • Ψ’T3 (residue-3 suffixes of T) • Let T[3k..n] be the suffix stored in SA3[i]. • Then Ψ’T3[i] is the position in SA12 where the next suffix T[3k+1..n] is stored.
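The three cases can be checked on the running example with the sketch below. It takes several shortcuts that are assumptions of the sketch, not of the talk: SA12 and SA3 are obtained by sorting the suffixes of T directly (valid here because `$` is unique and smallest), their entries are written as start positions in T for readability, n is a multiple of 3, and the successor of T[n..n] wraps around to T[1..n]. The name `psi_prime` is made up.

```python
def psi_prime(text):
    """Return (SA12, SA3, Psi'_T12, Psi'_T3), all 1-based, by naive sorting."""
    n = len(text)
    assert n % 3 == 0, "sketch assumes n is a multiple of 3"
    order = sorted(range(1, n + 1), key=lambda k: text[k - 1:])
    sa12 = [k for k in order if k % 3 != 0]   # residue-1 and residue-2 suffixes
    sa3 = [k for k in order if k % 3 == 0]    # residue-3 suffixes
    pos12 = {k: i for i, k in enumerate(sa12, start=1)}
    pos3 = {k: i for i, k in enumerate(sa3, start=1)}

    def nxt(k):
        return k % n + 1                      # k+1, wrapping n -> 1 (assumed)

    # Residue-1 entries point into SA12, residue-2 entries point into SA3.
    psi12 = [pos12[nxt(k)] if k % 3 == 1 else pos3[nxt(k)] for k in sa12]
    # Residue-3 entries point into SA12.
    psi3 = [pos12[nxt(k)] for k in sa3]
    return sa12, sa3, psi12, psi3


sa12, sa3, psi12, psi3 = psi_prime("babaabba$")
print(sa12, sa3)     # [8, 4, 2, 5, 7, 1] [9, 3, 6]
print(psi12, psi3)   # [1, 4, 2, 3, 1, 3] [6, 2, 5]
```

For instance, SA12[2] holds the residue-1 suffix T[4..n], and Ψ’T12[2] = 4 is the position of T[5..n] in SA12.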

  18. Ψ’ function (residue-1)

  19. Ψ’ function (residue-2) • Ψ’T12 (residue-2 suffixes) • Let T[3k-1..n] be the suffix stored in SA12[i]. • Then Ψ’T12[i] is the position in SA3 where the next suffix T[3k..n] is stored.

  20. Ψ’ function (residue-2)

  21. Ψ’ function (residue-3) • Ψ’T3 (residue-3 suffixes) • Let T[3k..n] be the suffix stored in SA3[i]. • Then Ψ’T3[i] is the position in SA12 where the next suffix T[3k+1..n] is stored.

  22. Ψ’ function (residue-3)

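  23. Framework - outline • How to construct the Ψ function of T • Bottom-up approach over steps 0, 1, …, h, where h = log_3 log_|Σ| n • Step 0: T and ΨT • Step 1: T12 and ΨT12 • … • Step i • Step h: Ψ is built with any linear-time construction algorithm • [Figure: length and alphabet of the string at each step]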
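The depth h = log_3 log_|Σ| n and the exponent α = log_3 2 fit together arithmetically, which the sketch below checks numerically. It assumes, following the T12 definition, that each step keeps ⅔ of the length and groups three symbols into one, and it measures the cost of the base case as (length × bits per symbol). Whether this is exactly how the talk derives its space bound is an assumption of the sketch; the function names are made up.

```python
import math


def recursion_depth(n, sigma):
    """h = log_3(log_sigma(n)), as a real number."""
    return math.log(math.log(n, sigma), 3)


def level_size(n, sigma, i):
    """Length and bits per symbol of the string at step i.

    Assumes each step keeps 2/3 of the positions and groups three symbols
    into one, so the alphabet at step i has size sigma ** (3 ** i).
    """
    length = n * (2 / 3) ** i
    bits_per_symbol = (3 ** i) * math.log2(sigma)
    return length, bits_per_symbol


n, sigma = 4 ** 9, 4                      # log_sigma(n) = 9, so h = 2
h = recursion_depth(n, sigma)
length, bits = level_size(n, sigma, round(h))
alpha = math.log(2, 3)                    # about 0.63
bound = n * math.log2(sigma) * math.log(n, sigma) ** alpha
# Both ratios come out at about 8 bits per input character.
print(round(h), length * bits / n, bound / n)
```

The identity behind the match is (⅔)^h = 2^h / 3^h with 3^h = log_|Σ| n and 2^h = (log_|Σ| n)^α.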

  24. Step i - outline • Given the string S, with S12 and S3 derived from it, and ΨS12 (from step i+1) • Obtain Ψ’S12 and Ψ’S3 • Merge Ψ’S12 and Ψ’S3 → ΨS

  25. Merging step • Compare entries of SA12 with entries of SA3 in order • Two suffixes are compared by following the Ψ’ function at most twice
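The "at most twice" claim can be illustrated with a simplified merge. After one Ψ’ step from a residue-1 suffix and a residue-3 suffix, both successors lie in SA12 and their positions decide the order; from a residue-2 suffix a second step is needed. The sketch below is a simplification, not the algorithm of the talk: it keeps SA12 and SA3 explicitly as start positions in T, reads characters straight from T, and assumes n is a multiple of 3 with a unique smallest terminator. All function names are made up.

```python
def build(text):
    """Naively build SA12, SA3, Psi'_T12, Psi'_T3 (1-based); n must be a multiple of 3."""
    n = len(text)
    assert n % 3 == 0, "sketch assumes n is a multiple of 3"
    order = sorted(range(1, n + 1), key=lambda k: text[k - 1:])
    sa12 = [k for k in order if k % 3 != 0]
    sa3 = [k for k in order if k % 3 == 0]
    pos12 = {k: i for i, k in enumerate(sa12, start=1)}
    pos3 = {k: i for i, k in enumerate(sa3, start=1)}
    nxt = lambda k: k % n + 1                 # wrap n -> 1 (assumed convention)
    psi12 = [pos12[nxt(k)] if k % 3 == 1 else pos3[nxt(k)] for k in sa12]
    psi3 = [pos12[nxt(k)] for k in sa3]
    return sa12, sa3, psi12, psi3


def merge(text, sa12, sa3, psi12, psi3):
    """Merge SA12 and SA3 into the suffix array of text."""
    ch = lambda k: text[k - 1]                # character T[k], 1-based

    def less(i, j):
        """True if the suffix at SA12[i] precedes the suffix at SA3[j]."""
        a, b = sa12[i - 1], sa3[j - 1]
        if ch(a) != ch(b):
            return ch(a) < ch(b)
        ia, jb = psi12[i - 1], psi3[j - 1]    # first Psi' step
        if a % 3 == 1:                        # residue-1: both successors in SA12
            return ia < jb
        # residue-2: successor of a is SA3[ia], successor of b is SA12[jb]
        a2, b2 = sa3[ia - 1], sa12[jb - 1]
        if ch(a2) != ch(b2):
            return ch(a2) < ch(b2)
        return psi3[ia - 1] < psi12[jb - 1]   # second Psi' step, both in SA12

    out, i, j = [], 1, 1
    while i <= len(sa12) and j <= len(sa3):
        if less(i, j):
            out.append(sa12[i - 1]); i += 1
        else:
            out.append(sa3[j - 1]); j += 1
    return out + sa12[i - 1:] + sa3[j - 1:]


T = "babaabba$"
print(merge(T, *build(T)))   # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```

Each comparison inspects at most two characters and follows Ψ’ at most twice per suffix, so the merge takes time linear in the total number of entries.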

  26. Conclusions & future works • We presented an alphabet-independent linear-time algorithm to construct SA, CSA, and FM-index using o(n log n)-bit working space • Future works • To construct SA, CSA, and FM-index optimally, i.e., using O(n) time and O(n log|Σ|)-bit working space
