
Suffix Array: Data structures and applications

Suffix Array: Data structures and applications. Zhifeng Liu, School of Computer Science, Univ. of Waterloo, Dec. 06, 2004.



Presentation Transcript


  1. Suffix Array: Data structures and applications • Zhifeng Liu, School of Computer Science, Univ. of Waterloo • Dec. 06, 2004

  2. Outline • Introduction • Suffix array and enhanced suffix array • An example - Is P a substring of S? • Conclusions • References

  3. Introduction: Why suffix array? • Suffix tree’s drawbacks: • Space consumption: about 20n bytes (n=|S|, the string length) [Kur99] • Memory locality: poor locality of reference costs efficiency • Suffix array (PAT array): • Manber & Myers [Man93] • also Gonnet & Baeza-Yates [Gon92]

  4. Suffix array: Definition & an example • Informal definition: • It holds essentially the same information as a suffix tree, in a more compact form. • It lists the suffixes of the string in alphabetical order. • Example: the suffix array for banana# is: #, a#, ana#, anana#, banana#, na#, nana# (from Prof. Brown’s Assignment 2 handout)
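As a minimal sketch of the definition above, the suffix array can be built naively by sorting the start positions of all suffixes. The function name `build_suffix_array` is an illustrative assumption, and the sort relies on `#` comparing below the letters, as it does in ASCII. This is not the O(n log n) construction of [Man93]; it only shows what the array contains.

```python
def build_suffix_array(s: str) -> list[int]:
    """Return the start positions of the suffixes of s in sorted order.

    Naive construction: every comparison may look at a whole suffix,
    so this is for illustration only, not for long texts.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


text = "banana#"
suffix_array = build_suffix_array(text)

# Print each entry next to the suffix it points to.
for pos in suffix_array:
    print(pos, text[pos:])
```

The array stores only the integer positions; the suffixes themselves are read from the original string when needed.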

  5. Suffix array isn’t perfect either • Less space: 4n bytes, but • Direct construction time: O(n log n) • Linear-time construction is possible via a suffix tree, but that sacrifices space • Binary search for a substring P takes O(m log n) (m=|P|) • Hence the enhanced suffix array!
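The O(m log n) bound for binary search can be illustrated with a short sketch: each of the O(log n) probes compares at most m characters of P against a suffix. The helper names `build_suffix_array` and `contains` are assumptions made for this example, and the suffix array is again built naively.

```python
def build_suffix_array(s: str) -> list[int]:
    # Naive construction, for illustration only.
    return sorted(range(len(s)), key=lambda i: s[i:])


def contains(s: str, sa: list[int], p: str) -> bool:
    """Decide whether p occurs in s by binary search over the suffix array."""
    m = len(p)
    lo, hi = 0, len(sa)
    # Find the leftmost suffix whose first m characters are >= p.
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:  # one O(m) comparison per probe
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and s[sa[lo]:sa[lo] + m] == p


text = "banana#"
sa = build_suffix_array(text)
print(contains(text, sa, "nan"), contains(text, sa, "nab"))
```

The search works because truncating every sorted suffix to its first m characters keeps the sequence in non-decreasing order.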

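  6. Enhanced Suffix Array = suffix array + additional tables • [Fig. 1: The enhanced suffix array for S=acaaacatat$ and its lcp-interval tree, adapted from [Abo04]. Tree nodes, written as lcp value-[interval]: root 0-[0..10]; its children 1-[0..5], 2-[6..7], 1-[8..9]; children of 1-[0..5]: 2-[0..1], 3-[2..3], 2-[4..5]]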
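The lcp-intervals listed for S=acaaacatat$ can be reproduced with a small sketch. It assumes that `$` sorts after every letter (the character `~` is used as a stand-in in the sort key because it compares above `a`..`z`), and that entry i of the lcp table holds the length of the longest common prefix of the suffixes at ranks i-1 and i. All function names are illustrative; the stack-based scan is a simplified version of the bottom-up traversal idea, not a verbatim copy of the algorithm in [Abo04].

```python
def suffix_array_dollar_last(s: str) -> list[int]:
    # Assumption: '$' sorts after all letters; '~' stands in for it here.
    return sorted(range(len(s)), key=lambda i: s[i:].replace("$", "~"))


def lcp_table(s: str, suftab: list[int]) -> list[int]:
    """lcptab[i] = longest common prefix of the suffixes ranked i-1 and i."""
    def lcp(a: int, b: int) -> int:
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        return k
    return [0] + [lcp(suftab[i - 1], suftab[i]) for i in range(1, len(suftab))]


def lcp_intervals(lcptab: list[int]) -> list[tuple[int, int, int]]:
    """Return every lcp-interval as (lcp value, left bound, right bound)."""
    n = len(lcptab)
    stack = [(0, 0)]  # (lcp value, left bound); the root is never popped early
    found = []
    for i in range(1, n):
        lb = i - 1
        while lcptab[i] < stack[-1][0]:
            lcp_value, lb = stack.pop()  # interval closes at rank i-1
            found.append((lcp_value, lb, i - 1))
        if lcptab[i] > stack[-1][0]:
            stack.append((lcptab[i], lb))  # a deeper interval opens
    while stack:  # close whatever is still open, including the root
        lcp_value, lb = stack.pop()
        found.append((lcp_value, lb, n - 1))
    return found


S = "acaaacatat$"
suftab = suffix_array_dollar_last(S)
lcptab = lcp_table(S, suftab)
for lcp_value, lb, rb in lcp_intervals(lcptab):
    print(f"{lcp_value}-[{lb}..{rb}]")
```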

  7. Enhanced Suffix Array (2) • [Fig. 2: The lcp-interval tree vs. the suffix tree for S=acaaacatat$]

  8. Enhanced Suffix Array (3): the more tables, the more like a suffix tree? • ChildTab: up, down, and next fields record the parent-child and sibling relationships. • The lcp-interval tree is like a suffix tree. It is only virtual, yet it can simulate suffix tree traversal efficiently. • [Fig. 3: ChildTab records the links in the lcp-interval tree with nodes 0-[0..10]; 1-[0..5], 2-[6..7], 1-[8..9]; 2-[0..1], 3-[2..3], 2-[4..5]]

  9. Enhanced suffix array replaces suffix tree • Every algorithm that uses a suffix tree can be systematically replaced by one that uses an (enhanced) suffix array, with the same time complexity • Bottom-up traversal of the suffix tree -> suffix array with lcptab and the lcp-interval tree • Top-down traversal of the suffix tree -> suffix array with childtab • Application: answering decision queries

  10. Answer Decision Queries • Algorithm: answering decision queries

      c := 0
      queryFound := true
      (i, j) := getInterval(0, n, P[c])
      while (i, j) <> ⊥ and c < m and queryFound = true
          if i <> j then
              l := getlcp(i, j)
              min := min{l, m}
              queryFound := S[suftab[i]+c .. suftab[i]+min−1] = P[c .. min−1]
              c := min
              if c < m then (i, j) := getInterval(i, j, P[c])
          else
              queryFound := S[suftab[i]+c .. suftab[i]+m−1] = P[c .. m−1]
              c := m
      if (i, j) <> ⊥ and queryFound then
          report [i, j] as the interval of occurrences of P
      else
          print(P is not found in S)
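The decision-query algorithm can be sketched in runnable form as follows. Several things here are assumptions for illustration: the text ends with a unique `$` that sorts last (with `~` as a stand-in in the sort key), the tables are built naively, and `get_interval` finds child intervals by scanning lcptab for its minimum entries instead of using childtab. The scan keeps the sketch short, but it costs time linear in the interval size per step, so it does not reach the bounds that childtab gives.

```python
def build_tables(s: str) -> tuple[list[int], list[int]]:
    # Assumption: s ends with a unique '$' that sorts after all letters.
    suftab = sorted(range(len(s)), key=lambda i: s[i:].replace("$", "~"))
    lcptab = [0] * len(s)
    for r in range(1, len(s)):
        a, b = suftab[r - 1], suftab[r]
        # The unique terminator guarantees a mismatch before either suffix ends.
        while s[a + lcptab[r]] == s[b + lcptab[r]]:
            lcptab[r] += 1
    return suftab, lcptab


def get_interval(s, suftab, lcptab, i, j, ch):
    """Child interval of the non-singleton [i..j] that continues with ch, or None."""
    l = min(lcptab[i + 1:j + 1])  # lcp value of [i..j]
    # Children start at i and at every rank whose lcp entry equals l.
    bounds = [i] + [k for k in range(i + 1, j + 1) if lcptab[k] == l] + [j + 1]
    for lb, nxt in zip(bounds, bounds[1:]):
        if s[suftab[lb] + l] == ch:
            return lb, nxt - 1
    return None


def find_pattern(s, suftab, lcptab, p):
    """Return the interval [i..j] of suffixes starting with p, or None."""
    m, c = len(p), 0
    if m == 0:
        return 0, len(s) - 1
    interval = get_interval(s, suftab, lcptab, 0, len(s) - 1, p[0])
    while interval is not None and c < m:
        i, j = interval
        # In a singleton interval the rest of p is compared in one step.
        l = m if i == j else min(min(lcptab[i + 1:j + 1]), m)
        if s[suftab[i] + c:suftab[i] + l] != p[c:l]:
            return None
        c = l
        if c < m:
            interval = get_interval(s, suftab, lcptab, i, j, p[c])
    return interval


S = "acaaacatat$"
suftab, lcptab = build_tables(S)
print(find_pattern(S, suftab, lcptab, "caaa"))
print(find_pattern(S, suftab, lcptab, "cb"))
```

When a pattern is found, the positions `suftab[i]` through `suftab[j]` are the start positions of all its occurrences in S.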

  11. Answer Decision Queries (cont’d) • [Figure: worked examples for P=cb and P=caaa, with the longest common string marked]

  12. Additional tables eat too much space? There are tricks to reduce the space requirements. • If the string length n=|S| < 2^32, each integer index needs 4 bytes. • suftab needs 4n bytes; does lcptab also need 4n? • No! Usually only a few entries in lcptab exceed 255, so: • Store each lcptab entry in 1 byte and allocate another table for the long lcp values • Space is saved and practical time efficiency is preserved, though the worst-case time complexity may be affected
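The one-byte trick can be sketched as below. The class name `CompactLcpTable` and the use of 255 as the marker for "look in the overflow table" are assumptions of this example, so values of 255 and above go to the second table. Looking up a long value needs a binary search in that table, which is one way the worst-case time can be affected.

```python
from bisect import bisect_left


class CompactLcpTable:
    """lcp table with one byte per entry plus an overflow table for long values."""

    def __init__(self, lcp_values: list[int]) -> None:
        # One byte per entry; 255 marks "the real value is in the overflow table".
        self.small = bytes(min(v, 255) for v in lcp_values)
        # Overflow entries, kept sorted by index so they can be binary searched.
        self.big_index = [i for i, v in enumerate(lcp_values) if v >= 255]
        self.big_value = [v for v in lcp_values if v >= 255]

    def __getitem__(self, i: int) -> int:
        v = self.small[i]
        if v < 255:
            return v  # common case: a single byte lookup
        return self.big_value[bisect_left(self.big_index, i)]


values = [0, 2, 300, 255, 254, 1000]
table = CompactLcpTable(values)
print([table[i] for i in range(len(values))])
```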

  13. Conclusions • Suffix array: there is always a tension between space and speed, and research tries to relieve that tension; • The suffix array can replace the suffix tree; • The suffix array is practical: faster and easier to implement

  14. References • [Abo04] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, Replacing suffix trees with enhanced suffix arrays, Journal of Discrete Algorithms, Volume 2, Issue 1 (March 2004), p.54-86 • [Abo02A] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, The Enhanced Suffix Array and Its Applications to Genome Analysis, Proceedings of the Second International Workshop on Algorithms in Bioinformatics, September 17-21, 2002, p.449-463 • [Abo02B] Mohamed Ibrahim Abouelhoda, Enno Ohlebusch, Stefan Kurtz, Optimal Exact String Matching Based on Suffix Arrays, Proceedings of the 9th International Symposium on String Processing and Information Retrieval, September 11-13, 2002, p.31-43 • [Gon92] Gaston H. Gonnet, Ricardo A. Baeza-Yates, Tim Snider, New indices for text: PAT Trees and PAT arrays, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992 • [Kur99] S. Kurtz, Reducing the space requirement of suffix trees, Software: Practice and Experience, v.29 n.13, p.1149-1171, 1999 • [Man93] Udi Manber, Gene Myers, Suffix arrays: a new method for on-line string searches, SIAM Journal on Computing, v.22 n.5, p.935-948, Oct. 1993
