
### A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes

Meng He, J. Ian Munro, and S. Srinivasa Rao

University of Waterloo

The Problem

- Initial Problem
- Text searching: Finding occurrences of a pattern string in a large (static) document

- Solution
- Text indexing: Trading space for time

- New Problem
- Succinct Text indexes: Reducing the space cost

Pattern Searching

- Given a text string T of length n and a pattern string P of length m, we look for the occurrences of P in T.
- Three types of Queries
- Existential queries: Does P occur in T?
- Cardinality queries: How many times does P occur in T?
- Listing queries: Where does P occur in T?
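The three query types can be illustrated with a naive scan that uses no index at all. This is only a baseline sketch for comparison; the function names are hypothetical, and the sample text is borrowed from a later slide in this deck.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Listing query: 1-based start positions of pattern in text."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def occurs(text: str, pattern: str) -> bool:
    """Existential query: does pattern occur in text?"""
    return len(occurrences(text, pattern)) > 0


def count(text: str, pattern: str) -> int:
    """Cardinality query: how many times does pattern occur in text?"""
    return len(occurrences(text, pattern))


T = "abbaaba#"
P = "ba"
print(occurs(T, P))       # True
print(count(T, P))        # 2
print(occurrences(T, P))  # [3, 6]
```

Each query here costs time proportional to n times m, which is the cost a text index is meant to avoid.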

Text Indexing

- Inverted files
- Word index
- Need to store the text as well as the index

- Suffix trees
- Efficient full-text index
- 4n lg n to 6n lg n bits!

- Suffix arrays
- n lg n bits in basic form, but
- 3n lg n bits (with LCP data)
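To make these space bounds concrete, the short calculation below plugs in a hypothetical text length of n = 2^30 characters (an assumption chosen for illustration, not a figure from the slides) and converts each bound to GiB.

```python
from math import log2


def gib(bits: float) -> float:
    """Convert a size in bits to GiB."""
    return bits / 8 / 2**30


n = 2**30        # hypothetical text length, chosen only for illustration
lg_n = log2(n)   # 30.0

sizes = {
    "suffix array (basic form)": n * lg_n,
    "suffix array (with LCP data)": 3 * n * lg_n,
    "suffix tree (low end)": 4 * n * lg_n,
    "suffix tree (high end)": 6 * n * lg_n,
}

for name, bits in sizes.items():
    print(f"{name}: {gib(bits):.2f} GiB")
```

For comparison, a binary text of this length needs only n bits, or 0.125 GiB, which is why index sizes measured in multiples of n lg n bits are considered expensive.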

Applications

- Text databases
- electronic encyclopedias, dictionaries, books, etc.

- Web search engines
- Google, Altavista, etc.

- Bioinformatics
- gene databases

- More…

Related Work

- Compressed Suffix Arrays
- Grossi & Vitter 2000
- Sadakane 2000
- Grossi, Gupta & Vitter 2003

- FM-index
- Ferragina & Manzini 2000 & 2001

Assumptions & Notation

- Alphabet: Σ = {a, b}
- Text: T[1..n]
- T[n] = #, where a < # < b

- Pattern: P[1..m]

Permutations and Suffix Arrays

- An observation
- Permutations: n!
- Suffix arrays: 2^(n-1)
- Not all permutations are suffix arrays

- An example
- A suffix array: 4, 7, 5, 1, 8, 3, 6, 2
- Text: abbaaba#

- A permutation: 4, 7, 1, 5, 8, 2, 3, 6
- Not a suffix array of any binary text
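The example can be checked by brute force. The sketch below builds suffix arrays under the ordering a < # < b by sorting suffixes directly (a simple stand-in, not an efficient construction), then enumerates every binary text of length 8 ending in #. The helper name `suffix_array` is hypothetical.

```python
from itertools import product

RANK = {"a": 0, "#": 1, "b": 2}  # the ordering a < # < b from the notation slide


def suffix_array(text: str) -> list[int]:
    """1-based suffix array, built by naively sorting all suffixes."""
    n = len(text)
    return sorted(range(1, n + 1), key=lambda i: [RANK[c] for c in text[i - 1:]])


example = suffix_array("abbaaba#")
print(example)  # [4, 7, 5, 1, 8, 3, 6, 2]

# Enumerate all binary texts of length n whose last character is '#'.
n = 8
arrays = {
    tuple(suffix_array("".join(body) + "#"))
    for body in product("ab", repeat=n - 1)
}
print(len(arrays))  # 128, i.e. 2^(n-1), far fewer than 8! = 40320

candidate = (4, 7, 1, 5, 8, 2, 3, 6)
print(candidate in arrays)  # False
```

The enumeration is exponential in n, so it only serves to confirm the counts on this small example.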


Two Features of Suffix Arrays

Ascending-to-max

Non-nesting

Suffix Array: 4 7 5 1 8 3 6 2

Another Permutation: 4 7 1 5 8 2 3 6

A Categorization Theorem

- A permutation is a suffix array iff it is:
- Ascending-to-max
- Non-nesting

- An immediate application:
- Checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words in memory.
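This transcript does not spell out the formal definitions of the two properties or the linear-time checker, so the sketch below does not implement them. Instead it is a naive cross-check based on a simpler observation: because a < # < b, every suffix listed before position n in a suffix array must start with a and every suffix listed after it must start with b, so a permutation determines at most one candidate text. Rebuilding that text and sorting its suffixes settles the question, though much more slowly than O(n). The function names are hypothetical.

```python
RANK = {"a": 0, "#": 1, "b": 2}  # the ordering a < # < b


def suffix_array(text: str) -> list[int]:
    """1-based suffix array, built by naively sorting all suffixes."""
    n = len(text)
    return sorted(range(1, n + 1), key=lambda i: [RANK[c] for c in text[i - 1:]])


def candidate_text(perm: list[int]) -> str:
    """The only text that could have perm as its suffix array."""
    n = len(perm)
    k = perm.index(n)  # where the suffix '#' sits in the permutation
    chars = [""] * n
    for idx, pos in enumerate(perm):
        chars[pos - 1] = "a" if idx < k else "#" if idx == k else "b"
    return "".join(chars)


def is_suffix_array(perm: list[int]) -> bool:
    """Naive check: rebuild the candidate text and compare suffix arrays."""
    n = len(perm)
    if n == 0 or sorted(perm) != list(range(1, n + 1)):
        return False  # not a permutation of 1..n
    return suffix_array(candidate_text(perm)) == list(perm)


print(is_suffix_array([4, 7, 5, 1, 8, 3, 6, 2]))  # True
print(is_suffix_array([4, 7, 1, 5, 8, 2, 3, 6]))  # False
```

Both permutations map to the same candidate text, abbaaba#, which is why only the first can be its suffix array.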

Application: Space Efficient Suffix Array

Text: abaaabbaaabaabb#

SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14

Ba: 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1

Bb: 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0

Basic Searching Algorithm: Answering Cardinality Queries

Basic Idea: backward search

- Start from the end of the pattern P
- For i = m, m-1, …, 1, compute the interval [s, e] of SA whose corresponding suffixes have P[i..m] as a prefix

P = aba
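A minimal sketch of backward search on the slide's example text follows. It assumes that Ba[i] (respectively Bb[i]) is 1 exactly when the character preceding suffix SA[i] is a (respectively b); this definition is inferred from the bit vectors shown on the slide, which it reproduces, and is not stated in the transcript. Rank queries use plain prefix sums here as a stand-in for a succinct rank structure, and all function names are hypothetical.

```python
RANK = {"a": 0, "#": 1, "b": 2}  # the ordering a < # < b


def suffix_array(text: str) -> list[int]:
    """1-based suffix array, built by naively sorting all suffixes."""
    n = len(text)
    return sorted(range(1, n + 1), key=lambda i: [RANK[c] for c in text[i - 1:]])


def bit_vector(text: str, sa: list[int], c: str) -> list[int]:
    """Assumed definition: bit i is 1 iff the character before suffix SA[i] is c."""
    return [1 if p > 1 and text[p - 2] == c else 0 for p in sa]


def prefix_sums(bits: list[int]) -> list[int]:
    """sums[i] = number of 1s among the first i bits (a naive rank table)."""
    sums = [0]
    for b in bits:
        sums.append(sums[-1] + b)
    return sums


def backward_search(pattern: str, ba: list[int], bb: list[int]):
    """Return the 1-based SA interval (s, e) of suffixes prefixed by pattern, or None."""
    n = len(ba)
    rank = {"a": prefix_sums(ba), "b": prefix_sums(bb)}
    num_a = rank["a"][n]                # number of suffixes starting with 'a'
    first = {"a": 0, "b": num_a + 1}    # suffixes sorted before each block ('#' sits between)
    s, e = 1, n
    for c in reversed(pattern):         # process P[m], P[m-1], ..., P[1]
        s = first[c] + rank[c][s - 1] + 1
        e = first[c] + rank[c][e]
        if s > e:
            return None                 # pattern does not occur
    return s, e


T = "abaaabbaaabaabb#"
SA = suffix_array(T)
Ba = bit_vector(T, SA, "a")
Bb = bit_vector(T, SA, "b")

interval = backward_search("aba", Ba, Bb)
print(interval)                        # (6, 7)
print(interval[1] - interval[0] + 1)   # 2 occurrences
```

For P = aba the interval evolves from [1, 9] (suffixes starting with a) to [11, 13] (starting with ba) to [6, 7] (starting with aba). Only Ba and Bb are consulted during the search; SA is built here solely to derive them.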

More Algorithms and Tradeoffs

- Answering listing queries
- Speeding up the reporting of occurrences of long patterns
- Self-indexing
- Time-space tradeoff: multi-level structure

Putting it all together

Three index structures:

Conclusion

- Summary
- A theorem that characterizes a permutation as the suffix array of a binary string
- An efficient algorithm checking whether a permutation is a suffix array
- Three space efficient text indexing methods

Conclusions (Continued)

- Related subsequent work
- Generalization to larger alphabets

- Open problem
- An O(n)-bit text index supporting searching in O(m + occ) time.

Thank You.
