
# A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes



Meng He, J. Ian Munro, and S. Srinivasa Rao

University of Waterloo

The Problem

• Initial Problem

• Text searching: Finding occurrences of a pattern string in a large (static) document

• Solution

• Text indexing: Trading space for time

• New Problem

• Succinct Text indexes: Reducing the space cost

Pattern Searching

• Given a text string T of length n and a pattern string P of length m, we look for the occurrences of P in T.

• Three types of Queries

• Existential queries: Does P occur in T?

• Cardinality queries: How many times does P occur in T?

• Listing queries: Where does P occur in T?
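The three query types can be made concrete with a naive scan that uses no index at all. This is only a minimal sketch for reference, not a method from the talk; the function names are illustrative, and positions are reported 1-based to match the T[1..n] notation used later.

```python
def occurrences(text, pattern):
    """Listing query: 1-based start positions of pattern in text (naive scan)."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def count(text, pattern):
    """Cardinality query: how many times pattern occurs in text."""
    return len(occurrences(text, pattern))


def exists(text, pattern):
    """Existential query: does pattern occur in text at all?"""
    return count(text, pattern) > 0


text = "abaaabbaaabaabb#"  # the example text that appears later in the slides
print(exists(text, "aba"))       # True
print(count(text, "aba"))        # 2
print(occurrences(text, "aba"))  # [1, 10]
```

The scan takes time proportional to n·m in the worst case for every query, which is the cost a text index is meant to avoid.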

Text Indexing

• Inverted files

• Word index

• Need to store the text as well as the index

• Suffix trees

• Efficient full-text index

• 4n lg n to 6n lg n bits!

• Suffix arrays

• n lg n bits in basic form, but

• 3n lg n bits (with LCP data)

Applications

• Text databases

• electronic encyclopedias, dictionaries, books, etc.

• Web search engines

• Bioinformatics

• gene databases

• More…

Related Work

• Compressed Suffix Arrays

• Grossi & Vitter 2000

• Grossi, Gupta & Vitter 2003

• FM-index

• Ferragina & Manzini 2000 & 2001

Assumptions & Notation

• Alphabet: Σ = {a, b}

• Text: T[1..n]

• T[n] = #, where a < # < b

• Pattern: P[1..m]

Permutations and Suffix Arrays

• An observation

• Permutations: n!

• Suffix arrays: 2^(n-1)

• Not all permutations are suffix arrays

• An example

• A suffix array: 4, 7, 5, 1, 8, 3, 6, 2

• Text: abbaaba#

• A permutation: 4, 7, 1, 5, 8, 2, 3, 6

• Not a suffix array of any binary text
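The example can be checked by brute force for n = 8. The sketch below builds suffix arrays by naive sorting under the order a < # < b and enumerates every binary text of length 8 that ends in #. The function names are illustrative, and the naive construction is for illustration only.

```python
from itertools import product

ORDER = {"a": 0, "#": 1, "b": 2}  # the slides assume a < # < b


def suffix_array(text):
    """Naive suffix array with 1-based positions, sorting suffixes under ORDER."""
    def key(i):
        return [ORDER[c] for c in text[i:]]
    return tuple(i + 1 for i in sorted(range(len(text)), key=key))


def all_suffix_arrays(n):
    """Suffix arrays of every text over {a, b} of length n - 1 followed by '#'."""
    return {suffix_array("".join(body) + "#") for body in product("ab", repeat=n - 1)}


arrays = all_suffix_arrays(8)
print(suffix_array("abbaaba#"))            # (4, 7, 5, 1, 8, 3, 6, 2)
print(len(arrays))                         # 128, i.e. 2^(8-1)
print((4, 7, 1, 5, 8, 2, 3, 6) in arrays)  # False
```

Only 128 of the 8! = 40320 permutations of 1..8 show up, which is the gap between n! and 2^(n-1) noted above.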

Two Features of Suffix Arrays

• Ascending-to-max

• Non-nesting

Suffix Array: 4 7 5 1 8 3 6 2

Another Permutation: 4 7 1 5 8 2 3 6

A Categorization Theorem

• A permutation is a suffix array iff it is:

• Ascending-to-max

• Non-nesting

• An immediate application:

• Checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words in memory.
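The transcript does not define the two properties, so the sketch below does not implement the theorem-based check. It uses a different and simpler linear-time test, stated here as an assumption-laden illustration: the position of n in the permutation fixes the only candidate text (entries before it must start with a, entries after it with b), and the permutation is then verified against that text by comparing the ranks of the following suffixes for adjacent entries. The function name and the `rank` array are illustrative.

```python
def is_suffix_array(perm):
    """True if perm (1-based values) is the suffix array of some text over
    {a, b} terminated by '#', under the order a < # < b.

    Not the ascending-to-max / non-nesting test from the talk: a plain
    "infer the text, then verify" check using one array of n + 2 integers.
    """
    n = len(perm)
    if n == 0:
        return False
    rank = [0] * (n + 2)  # rank[p] = 1-based index of position p in perm
    for idx, p in enumerate(perm, start=1):
        if not (1 <= p <= n) or rank[p] != 0:
            return False  # not a permutation of 1..n
        rank[p] = idx
    pivot = rank[n]  # index of the suffix "#"

    def letter(p):
        # Entries before the pivot must start with 'a', entries after with 'b'.
        r = rank[p]
        return "a" if r < pivot else ("#" if r == pivot else "b")

    for i in range(n - 1):
        p, q = perm[i], perm[i + 1]
        # Same first letter: the order must be decided by the suffixes that follow.
        if letter(p) == letter(q) and rank[p + 1] > rank[q + 1]:
            return False
    return True


print(is_suffix_array([4, 7, 5, 1, 8, 3, 6, 2]))  # True
print(is_suffix_array([4, 7, 1, 5, 8, 2, 3, 6]))  # False
```

For the second permutation the check fails at the adjacent entries 2 and 3: both would have to start with b, yet the suffix at position 3 is ranked after the suffix at position 4.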

Application: Space Efficient Suffix Array

Text: abaaabbaaabaabb#

SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14

Ba: 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1

Bb: 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0
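The slide gives the bit vectors Ba and Bb without a definition. The values are consistent with the following reading, which should be taken as an inference from the example rather than a statement from the talk: bit i of Bc is 1 exactly when the character just before the suffix starting at SA[i] is c. A small sketch that recomputes both vectors (the helper name is illustrative):

```python
TEXT = "abaaabbaaabaabb#"
SA = [8, 3, 9, 4, 12, 1, 10, 5, 13, 16, 7, 2, 11, 15, 6, 14]


def preceding_char_bits(text, sa, c):
    """Bit i is 1 when the character just before suffix sa[i] equals c.

    Positions in sa are 1-based; the suffix starting at position 1 has no
    preceding character, so its bit is 0 in every vector.
    """
    return [1 if p > 1 and text[p - 2] == c else 0 for p in sa]


Ba = preceding_char_bits(TEXT, SA, "a")
Bb = preceding_char_bits(TEXT, SA, "b")
print("".join(map(str, Ba)))  # 0011001110011011
print("".join(map(str, Bb)))  # 1100100001100100
```

Under this reading the two vectors together take 2n bits and record, for each suffix in sorted order, which character precedes it.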

Basic Idea: backward search

SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14

• Start from the end of the pattern P

• For i = m, m-1, …, 1, compute the interval [s, e] of SA whose suffixes have P[i..m] as a prefix

P = aba
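The backward-search loop can be traced on the example with P = aba. The sketch below is a simplified illustration, not the talk's data structure. It assumes that the bit vector for a character c marks the suffixes preceded by c, and it answers rank queries with plain prefix-sum arrays where a space-efficient index would use a succinct rank structure. The names `rank` and `first` are illustrative, and the pattern is assumed to contain only a and b.

```python
from itertools import accumulate

TEXT = "abaaabbaaabaabb#"
SA = [8, 3, 9, 4, 12, 1, 10, 5, 13, 16, 7, 2, 11, 15, 6, 14]


def backward_search(text, sa, pattern):
    """Return the 1-based inclusive interval (s, e) of sa whose suffixes start
    with pattern, or None if pattern does not occur. Assumes a < # < b."""
    n = len(sa)
    # rank[c][i] = how many of the suffixes sa[1..i] are preceded by character c.
    rank = {
        c: [0] + list(accumulate(1 if p > 1 and text[p - 2] == c else 0 for p in sa))
        for c in "ab"
    }
    # first[c] = number of suffixes starting with a character smaller than c.
    first = {"a": 0, "b": text.count("a") + 1}
    s, e = 1, n
    for c in reversed(pattern):  # i = m, m-1, ..., 1
        s = first[c] + rank[c][s - 1] + 1
        e = first[c] + rank[c][e]
        if s > e:
            return None
    return s, e


interval = backward_search(TEXT, SA, "aba")
print(interval)             # (6, 7)
s, e = interval
print(sorted(SA[s - 1:e]))  # [1, 10]
```

On this input the interval goes from [1, 16] to [1, 9] for "a", then [11, 13] for "ba", then [6, 7] for "aba". Its width answers the cardinality query, and the SA entries inside it answer the listing query.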

More Algorithms and Tradeoffs

• Speeding up the reporting of occurrences of long patterns

• Self-indexing

Putting it all together

Three index structures:

Conclusion

• Summary

• A theorem that characterizes a permutation as the suffix array of a binary string

• An efficient algorithm checking whether a permutation is a suffix array

• Three space efficient text indexing methods

Conclusions (Continued)

• Related subsequent work

• Generalization to larger alphabets

• Open problem

• An O(n)-bit text index supporting searching in O(m + occ) time.

Thank You.