# A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes

## A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes

Meng He, J. Ian Munro, and S. Srinivasa Rao

University of Waterloo

### The Problem

• Initial Problem

• Text searching: Finding occurrences of a pattern string in a large (static) document

• Solution

• Text indexing: Trading space for time

• New Problem

• Succinct Text indexes: Reducing the space cost

### Pattern Searching

• Given a text string T of length n and a pattern string P of length m, we look for the occurrences of P in T.

• Three types of Queries

• Existential queries: Does P occur in T?

• Cardinality queries: How many times does P occur in T?

• Listing queries: Where does P occur in T?
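As a baseline for the three query types, here is a minimal Python sketch that scans the text directly, with no index. The function names are illustrative and do not come from the slides; positions are 1-based.

```python
def listing(text, pattern):
    """Listing query: 1-based start positions of every occurrence of pattern."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def cardinality(text, pattern):
    """Cardinality query: number of (possibly overlapping) occurrences."""
    return len(listing(text, pattern))


def existential(text, pattern):
    """Existential query: does pattern occur at all?"""
    return cardinality(text, pattern) > 0


text = "abbaaba#"
print(existential(text, "ba"))  # True
print(cardinality(text, "ba"))  # 2
print(listing(text, "ba"))      # [3, 6]
```

Every query here rescans the whole text, which is the cost a text index is meant to avoid.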

### Text Indexing

• Inverted files

• Word index

• Need to store the text as well as the index

• Suffix trees

• Efficient full-text index

• 4n lg n to 6n lg n bits!

• Suffix arrays

• n lg n bits in basic form, but

• 3n lg n bits (with LCP data)
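To put these space bounds in concrete terms, the following sketch evaluates them for a given n. The constants are the ones quoted on this slide; rounding lg n up to a whole number of bits and reporting sizes in MiB are assumptions made only for the illustration.

```python
import math


def index_sizes_bits(n):
    """Sizes in bits of the index structures listed above, for a text of length n."""
    lg_n = math.ceil(math.log2(n))  # assumption: whole bits per stored position
    return {
        "suffix array": n * lg_n,
        "suffix array + LCP": 3 * n * lg_n,
        "suffix tree (low)": 4 * n * lg_n,
        "suffix tree (high)": 6 * n * lg_n,
    }


n = 2 ** 20
for name, bits in index_sizes_bits(n).items():
    print(f"{name}: {bits / 8 / 2 ** 20:.1f} MiB")
```

For n = 2^20 the basic suffix array already takes 20 bits per text character, which is the overhead succinct indexes try to reduce.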

### Applications

• Text databases

• electronic encyclopedias, dictionaries, books, etc.

• Web search engines

• Google, Altavista, etc.

• Bioinformatics

• gene databases

• More…

### Related Work

• Compressed Suffix Arrays

• Grossi & Vitter 2000

• Sadakane 2000

• Grossi, Gupta & Vitter 2003

• FM-index

• Ferragina & Manzini 2000 & 2001

### Assumptions & Notation

• Alphabet: Σ = {a, b}

• Text: T[1..n]

• T[n] = #, where a < # < b

• Pattern: P[1..m]

### Permutations and Suffix Arrays

• An observation

• Permutations: n!

• Suffix arrays: 2^(n-1)

• Not all permutations are suffix arrays

• An example

• A suffix array: 4, 7, 5, 1, 8, 3, 6, 2

• Text: abbaaba#

• A permutation: 4, 7, 1, 5, 8, 2, 3, 6

• Not a suffix array of any binary text
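The example can be checked by brute force. The sketch below builds suffix arrays by direct sorting under the order a < # < b, then enumerates every binary text of length 8 that ends in #. The helper names are illustrative, and the sorting approach is chosen for clarity, not efficiency.

```python
from itertools import product

ORDER = {"a": 0, "#": 1, "b": 2}  # a < # < b, as in the notation slide


def suffix_array(text):
    """1-based suffix array by direct sorting of all suffixes (not linear time)."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])


def all_suffix_arrays(n):
    """Suffix arrays of every text over {a, b} of length n whose last symbol is '#'."""
    return {tuple(suffix_array("".join(body) + "#"))
            for body in product("ab", repeat=n - 1)}


print(suffix_array("abbaaba#"))            # [4, 7, 5, 1, 8, 3, 6, 2]
arrays = all_suffix_arrays(8)
print(len(arrays))                         # 128
print((4, 7, 1, 5, 8, 2, 3, 6) in arrays)  # False
```

For n = 8 there are 128 = 2^7 distinct suffix arrays among 8! = 40320 permutations, and the second permutation from the slide is not one of them.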

### Two Features of Suffix Arrays

• Ascending-to-max

• Non-nesting

• Suffix array: 4 7 5 1 8 3 6 2

• Another permutation: 4 7 1 5 8 2 3 6

### A Categorization Theorem

• A permutation is a suffix array iff it is:

• Ascending-to-max

• Non-nesting

• An immediate application:

• Checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words in memory.
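The surviving slide text does not spell out the two properties, so the sketch below is not the authors' O(n) test. It substitutes a simpler check that relies only on the order a < # < b: entries ranked before n must be suffixes starting with a, and entries ranked after n must start with b, so a permutation determines at most one candidate text. The check rebuilds that text and compares its suffix array with the input. It sorts suffixes directly, so it is far slower than O(n); the function names are illustrative.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # a < # < b


def suffix_array(text):
    """1-based suffix array by direct sorting of all suffixes (not linear time)."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])


def is_suffix_array(perm):
    """True iff perm (1-based) is the suffix array of some binary text ending in '#'."""
    n = len(perm)
    if n == 0 or sorted(perm) != list(range(1, n + 1)):
        return False
    pivot = perm.index(n)  # rank of the suffix "#"
    text = [""] * n
    for rank, pos in enumerate(perm):
        # Suffixes ranked before "#" start with 'a', those after it with 'b'.
        text[pos - 1] = "a" if rank < pivot else "#" if rank == pivot else "b"
    return suffix_array("".join(text)) == list(perm)


print(is_suffix_array([4, 7, 5, 1, 8, 3, 6, 2]))  # True
print(is_suffix_array([4, 7, 1, 5, 8, 2, 3, 6]))  # False
```

Both permutations map to the same candidate text, abbaaba#, which is why only the first one passes.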

### Application: Space Efficient Suffix Array

Text: abaaabbaaabaabb#

SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14

Ba: 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1

Bb: 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0
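The slide text does not define the bit vectors Ba and Bb. The sketch below assumes one reading that is consistent with the values shown: Ba[i] = 1 exactly when the character preceding suffix SA[i] in the text is a, and likewise Bb for b. The function name is illustrative.

```python
def preceding_char_bits(text, sa, symbol):
    """bits[i] = 1 iff the character just before suffix sa[i] (1-based) is `symbol`."""
    return [1 if p > 1 and text[p - 2] == symbol else 0 for p in sa]


text = "abaaabbaaabaabb#"
sa = [8, 3, 9, 4, 12, 1, 10, 5, 13, 16, 7, 2, 11, 15, 6, 14]

ba = preceding_char_bits(text, sa, "a")
bb = preceding_char_bits(text, sa, "b")
print(ba)  # [0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
print(bb)  # [1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0]
```

Under this reading the two vectors are complementary everywhere except at the entry for suffix 1, which has no preceding character and is 0 in both.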

### Basic Searching Algorithm: Answering Cardinality Queries

Basic Idea: backward search

• Start from the end of the pattern P

• For i = m, m-1, …, 1, compute the interval [s, e] of SA whose corresponding suffixes have P[i..m] as a prefix

P = aba
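A minimal sketch of backward search for cardinality queries follows. It is a generic version of the idea, not necessarily the authors' exact procedure. It assumes the order a < # < b, builds the suffix array by direct sorting, and uses plain prefix-count arrays over the preceding characters where a succinct index would use compact rank structures. The code uses 0-based half-open row intervals [s, e) in place of the closed 1-based [s, e] on the slide; all names are illustrative.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # a < # < b


def suffix_array(text):
    """1-based suffix array by direct sorting of all suffixes (not linear time)."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])


def backward_count(text, pattern):
    """Number of occurrences of pattern (over {a, b}) in text, via backward search."""
    n = len(text)
    sa = suffix_array(text)
    # Character preceding each suffix in suffix-array order (None for suffix 1).
    prev = [text[p - 2] if p > 1 else None for p in sa]
    # occ[c][i] = number of rows among the first i whose preceding character is c.
    occ = {c: [0] * (n + 1) for c in "ab"}
    for i, ch in enumerate(prev):
        for c in "ab":
            occ[c][i + 1] = occ[c][i] + (ch == c)
    # Number of suffixes that sort before the block of suffixes starting with c.
    first = {"a": 0, "b": text.count("a") + 1}
    s, e = 0, n  # interval for the empty suffix of the pattern: all rows
    for c in reversed(pattern):
        if c not in first:
            return 0
        s = first[c] + occ[c][s]
        e = first[c] + occ[c][e]
        if s >= e:
            return 0
    return e - s


print(backward_count("abaaabbaaabaabb#", "aba"))  # 2
```

Each step prepends one pattern character and narrows the interval with two count lookups, so the loop runs m times regardless of how many occurrences there are.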

### More Algorithms and Tradeoffs

• Answering listing queries

• Speeding up the reporting of occurrences of long patterns

• Self-indexing

• Time-space tradeoff: multi-level structure

### Putting it all together

Three index structures:

### Conclusion

• Summary

• A theorem that characterizes a permutation as the suffix array of a binary string

• An efficient algorithm checking whether a permutation is a suffix array

• Three space efficient text indexing methods

### Conclusions (Continued)

• Related subsequent work

• Generalization to larger alphabets

• Open problem

• An O(n)-bit text index supporting searching in O(m + occ) time.

Thank You.