A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes

### A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes

Meng He, J. Ian Munro, and S. Srinivasa Rao

University of Waterloo

The Problem
• Initial Problem
• Text searching: Finding occurrences of a pattern string in a large (static) document
• Solution
• Text indexing: Trading space for time
• New Problem
• Succinct Text indexes: Reducing the space cost
Pattern Searching
• Given a text string T of length n and a pattern string P of length m, we look for the occurrences of P in T.
• Three types of Queries
• Existential queries: Does P occur in T?
• Cardinality queries: How many times does P occur in T?
• Listing queries: Where does P occur in T?
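The three query types can be illustrated with a minimal Python sketch. It assumes a naively built suffix array and plain binary search over it; the function names (`build_suffix_array`, `pattern_range`, `exists`, `count`, `locate`) are illustrative and do not come from the presentation, and positions are 0-based here.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes; fine for a small illustration.
    return sorted(range(len(text)), key=lambda i: text[i:])

def pattern_range(text, sa, pattern):
    """Half-open range [start, end) of sa whose suffixes begin with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo

def exists(text, sa, pattern):           # existential query
    start, end = pattern_range(text, sa, pattern)
    return start < end

def count(text, sa, pattern):            # cardinality query
    start, end = pattern_range(text, sa, pattern)
    return end - start

def locate(text, sa, pattern):           # listing query
    start, end = pattern_range(text, sa, pattern)
    return sorted(sa[start:end])

text = "abbaaba"
sa = build_suffix_array(text)
print(exists(text, sa, "ba"), count(text, sa, "ba"), locate(text, sa, "ba"))
```

Existential and cardinality queries only need the range boundaries, while listing queries must also read the suffix array entries inside the range.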
Text Indexing
• Inverted files
• Word index
• Need to store the text as well as the index
• Suffix trees
• Efficient full-text index
• 4n lg n to 6n lg n bits!
• Suffix arrays
• n lg n bits in basic form, but
• 3n lg n bits (with LCP data)
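To make these bounds concrete, the following sketch evaluates them for a hypothetical text length. The function name `index_sizes_in_bits` and the choice n = 2^30 are assumptions made for illustration only.

```python
import math

def index_sizes_in_bits(n):
    """Index sizes in bits for a text of n characters, using the bounds above."""
    lg_n = math.ceil(math.log2(n))  # bits needed to store one text position
    return {
        "suffix_tree_low": 4 * n * lg_n,
        "suffix_tree_high": 6 * n * lg_n,
        "suffix_array_basic": n * lg_n,
        "suffix_array_with_lcp": 3 * n * lg_n,
    }

n = 2 ** 30  # hypothetical text of about one billion characters
for name, bits in index_sizes_in_bits(n).items():
    print(f"{name}: {bits / 8 / 2**30:.2f} GiB")
```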
Applications
• Text databases
• electronic encyclopedias, dictionaries, books, etc.
• Web search engines
• Bioinformatics
• gene databases
• More…
Related Work
• Compressed Suffix Arrays
• Grossi & Vitter 2000
• Grossi, Gupta & Vitter 2003
• FM-index
• Ferragina & Manzini 2000 & 2001
Assumptions & Notation
• Alphabet: Σ = {a, b}
• Text: T[1..n]
• T[n] = #, where a < # < b
• Pattern: P[1..m]
Permutations and Suffix Arrays
• An observation
• Permutations: n!
• Suffix arrays: 2^(n-1)
• Not all permutations are suffix arrays
• An example
• A suffix array: 4, 7, 5, 1, 8, 3, 6, 2
• Text: abbaaba#
• A permutation: 4, 7, 1, 5, 8, 2, 3, 6
• Not a suffix array of any binary text
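The counts and the example can be checked by brute force for small n. The sketch below assumes the ordering a < # < b from the notation slide and enumerates every text of length 8 that ends with #; the helper names `suffix_array` and `all_suffix_arrays` are illustrative, and the naive sort is only meant for tiny inputs.

```python
from itertools import product

ORDER = {"a": 0, "#": 1, "b": 2}  # the ordering a < # < b

def suffix_array(text):
    """1-based suffix array of text under the order a < # < b (naive sort)."""
    key = lambda i: [ORDER[c] for c in text[i:]]
    return tuple(i + 1 for i in sorted(range(len(text)), key=key))

def all_suffix_arrays(n):
    """Suffix arrays of every text of length n over {a, b} that ends with '#'."""
    return {suffix_array("".join(body) + "#") for body in product("ab", repeat=n - 1)}

arrays = all_suffix_arrays(8)
print(len(arrays))                          # number of distinct suffix arrays
print(suffix_array("abbaaba#"))             # the example text
print((4, 7, 1, 5, 8, 2, 3, 6) in arrays)   # the permutation that is not a suffix array
```

The position of n in a suffix array fixes which suffixes start with a and which start with b, so distinct texts always yield distinct suffix arrays; this is why the count equals the number of texts.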
Two Features of Suffix Arrays
• Ascending-to-max
• Non-nesting
• Suffix array: 4 7 5 1 8 3 6 2
• Another permutation: 4 7 1 5 8 2 3 6

A Categorization Theorem
• A permutation is a suffix array iff it is:
• Ascending-to-max
• Non-nesting
• An immediate application:
• Checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words in memory.
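A linear-time check can be sketched in Python as follows. This is not the algorithm from the presentation, which relies on the ascending-to-max and non-nesting properties and uses only n + O(1) extra words. Instead it swaps in a simpler idea: the position of n in the permutation determines the only candidate text, and the permutation is then verified against that text with the standard successor-rank test. The names `candidate_text` and `is_suffix_array` are hypothetical, and the sketch uses two auxiliary arrays rather than one.

```python
def candidate_text(perm):
    """Return (text, rank) for a 1-based permutation, or (None, None) if invalid."""
    n = len(perm)
    if n == 0:
        return None, None
    rank = [0] * (n + 1)              # rank[p] = 1-based position of p in perm
    for r, p in enumerate(perm, start=1):
        if not (isinstance(p, int) and 1 <= p <= n) or rank[p]:
            return None, None         # not a permutation of 1..n
        rank[p] = r
    # Suffix n is "#": suffixes ranked before it start with 'a', after it with 'b'.
    text = [""] * (n + 1)             # 1-based; text[0] is unused
    for p in range(1, n):
        text[p] = "a" if rank[p] < rank[n] else "b"
    text[n] = "#"
    return text, rank

def is_suffix_array(perm):
    """True iff perm is the suffix array of some text over {a, b} ending in '#'."""
    text, rank = candidate_text(perm)
    if text is None:
        return False
    for r in range(len(perm) - 1):
        x, y = perm[r], perm[r + 1]
        # Equal first characters: the remaining suffixes must appear in the same order.
        # Neither x nor y can be n here, because '#' occurs only once.
        if text[x] == text[y] and rank[x + 1] > rank[y + 1]:
            return False
    return True

print(is_suffix_array([4, 7, 5, 1, 8, 3, 6, 2]))
print(is_suffix_array([4, 7, 1, 5, 8, 2, 3, 6]))
```

The first-character order never needs an explicit test because the candidate text is built so that all a-suffixes precede # and all b-suffixes follow it.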

Application: Space Efficient Suffix Array
• Text: abaaabbaaabaabb#
• SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14
• Ba, Bb: 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 (only one of the two bit vectors is preserved)

Basic Idea: backward search

• Start from the end of the pattern P
• For i = m, m-1, …, 1, compute the interval [s, e] of SA whose corresponding suffixes are prefixed with P[i..m]
• Example: P = aba
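The backward-search loop can be sketched in Python on the example text. This is a generic FM-index-style formulation with naive counting over the characters that precede each suffix. It is not the space-efficient structure from the presentation, and the names `suffix_array`, `backward_search`, `smaller` and `occ` are assumptions made for illustration. Indices are 0-based internally and converted to 1-based positions only for printing.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # the ordering a < # < b

def suffix_array(text):
    """0-based suffix array under a < # < b (naive sort, for illustration)."""
    return sorted(range(len(text)), key=lambda i: [ORDER[c] for c in text[i:]])

def backward_search(text, sa, pattern):
    """Half-open SA interval [s, e) of suffixes prefixed by pattern (over {a, b})."""
    n = len(text)
    # bwt[i] is the character preceding suffix sa[i]; suffix 0 wraps around to '#'.
    bwt = [text[p - 1] for p in sa]
    # smaller[c] = number of characters in text that sort before c.
    smaller = {c: sum(ORDER[d] < ORDER[c] for d in text) for c in ORDER}

    def occ(c, i):
        # Occurrences of c in bwt[:i]; a real index would use a rank structure here.
        return bwt[:i].count(c)

    s, e = 0, n                      # interval for the empty pattern
    for c in reversed(pattern):      # i = m, m-1, ..., 1
        s = smaller[c] + occ(c, s)
        e = smaller[c] + occ(c, e)
        if s >= e:
            return s, s              # empty interval: no occurrence
    return s, e

text = "abaaabbaaabaabb#"
sa = suffix_array(text)
s, e = backward_search(text, sa, "aba")
print([p + 1 for p in sa[s:e]])      # 1-based starting positions of "aba"
```

Each step extends the matched suffix of the pattern by one character to the left, so the loop runs m times regardless of the text length.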

• Speeding up the reporting of occurrences of long patterns
• Self-indexing
Putting it all together

Three index structures:

Conclusion
• Summary
• A theorem that characterizes a permutation as the suffix array of a binary string
• An efficient algorithm checking whether a permutation is a suffix array
• Three space efficient text indexing methods
Conclusions (Continued)
• Related subsequent work
• Generalization to larger alphabets
• Open problem
• An O(n)-bit text index supporting searching in O(m+occ) time.