A categorization theorem on suffix arrays with applications to space efficient text indexes

Meng He, J. Ian Munro, and S. Srinivasa Rao

University of Waterloo


The Problem

  • Initial Problem

    • Text searching: Finding occurrences of a pattern string in a large (static) document

  • Solution

    • Text indexing: Trading space for time

  • New Problem

    • Succinct Text indexes: Reducing the space cost


Pattern Searching

  • Given a text string T of length n and a pattern string P of length m, we look for the occurrences of P in T.

  • Three types of queries

    • Existential queries: Does P occur in T?

    • Cardinality queries: How many times does P occur in T?

    • Listing queries: Where does P occur in T?
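
The three query types can be illustrated with a naive scan that uses no index at all. This is only a baseline sketch to make the definitions concrete; the function name `occurrences` is hypothetical, and the sample text is borrowed from a later slide.

```python
def occurrences(text, pattern):
    """Return the 1-based starting positions of pattern in text (naive scan)."""
    last_start = len(text) - len(pattern)
    return [i + 1 for i in range(last_start + 1) if text.startswith(pattern, i)]


text = "abbaaba#"
hits = occurrences(text, "ab")

exists = bool(hits)   # existential query: does P occur in T?
how_many = len(hits)  # cardinality query: how many times does P occur?
where = hits          # listing query: where does P occur?
print(exists, how_many, where)
```

The scan costs O(nm) time per query in the worst case, which is the cost a text index is meant to avoid.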


Text Indexing

  • Inverted files

    • Word index

    • Need to store the text as well as the index

  • Suffix trees

    • Efficient full-text index

    • 4n lg n to 6n lg n bits!

  • Suffix arrays

    • n lg n bits in basic form, but

    • 3n lg n bits (with LCP data)
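
To get a feel for these bounds, the short calculation below converts them to GiB for one text length. The value n = 2^30 is an arbitrary assumption chosen for illustration, and `index_size_gib` is a hypothetical helper name.

```python
import math


def index_size_gib(n, factor):
    """Size in GiB of a structure that occupies factor * n * lg n bits."""
    bits = factor * n * math.log2(n)
    return bits / 8 / 2**30


n = 2**30  # hypothetical text length, chosen only for illustration
for label, factor in [
    ("suffix array (basic)", 1),
    ("suffix array with LCP data", 3),
    ("suffix tree, lower figure", 4),
    ("suffix tree, upper figure", 6),
]:
    print(f"{label}: {index_size_gib(n, factor):.2f} GiB")
```

At one byte per character, a text of this length would itself occupy 1 GiB, so each of these indexes is several times larger than the text it indexes.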


Applications

  • Text databases

    • electronic encyclopedias, dictionaries, books, etc.

  • Web search engines

    • Google, Altavista, etc.

  • Bioinformatics

    • gene databases

  • More…


Related Work

  • Compressed Suffix Arrays

    • Grossi & Vitter 2000

    • Sadakane 2000

    • Grossi, Gupta & Vitter 2003

  • FM-index

    • Ferragina & Manzini 2000 & 2001


Assumptions & Notation

  • Alphabet: Σ = {a, b}

  • Text: T[1..n]

    • T[n] = #, where a < # < b

  • Pattern: P[1..m]


Permutations and Suffix Arrays

  • An observation

    • Permutations: n!

    • Suffix arrays: 2^(n-1)

    • Not all permutations are suffix arrays

  • An example

    • A suffix array: 4, 7, 5, 1, 8, 3, 6, 2

      • Text: abbaaba#

    • A permutation: 4, 7, 1, 5, 8, 2, 3, 6

      • Not a suffix array of any binary text
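
The example can be checked by brute force. Because a < # < b, every suffix that starts with a sorts before the suffix "#" and every suffix that starts with b sorts after it, so a permutation admits exactly one candidate text. The sketch below rebuilds that candidate and compares its suffix array with the permutation. It is a plain verification aid with hypothetical function names, not the linear-time check mentioned in the talk, and it assumes the input really is a permutation of 1..n.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # character order assumed on the slides: a < # < b


def suffix_array(text):
    """1-based suffix array of text, built by naively sorting all suffixes."""
    def key(start):
        return [ORDER[c] for c in text[start - 1:]]
    return sorted(range(1, len(text) + 1), key=key)


def candidate_text(perm):
    """The only text whose suffix array could be perm (a permutation of 1..n)."""
    n = len(perm)
    pos_of_n = perm.index(n)
    chars = [""] * n
    for pos, start in enumerate(perm):
        if pos < pos_of_n:
            chars[start - 1] = "a"  # listed before suffix n, so it starts with a
        elif pos == pos_of_n:
            chars[start - 1] = "#"  # suffix n is the terminator itself
        else:
            chars[start - 1] = "b"  # listed after suffix n, so it starts with b
    return "".join(chars)


def is_suffix_array(perm):
    return suffix_array(candidate_text(perm)) == list(perm)


print(candidate_text([4, 7, 5, 1, 8, 3, 6, 2]), is_suffix_array([4, 7, 5, 1, 8, 3, 6, 2]))
print(candidate_text([4, 7, 1, 5, 8, 2, 3, 6]), is_suffix_array([4, 7, 1, 5, 8, 2, 3, 6]))
```

Both permutations lead to the same candidate text, abbaaba#, and only the first one equals that text's suffix array.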


Two Features of Suffix Arrays

  • Ascending-to-max

  • Non-nesting

  • Suffix array: 4 7 5 1 8 3 6 2

  • Another permutation: 4 7 1 5 8 2 3 6


A Categorization Theorem

  • A permutation is a suffix array iff it is:

    • Ascending-to-max

    • Non-nesting

  • An immediate application:

    • Checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words in memory.
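
The transcript names the two properties but does not define them, so the sketch below rests on one possible reading, stated here as an assumption. Writing rank[v] for the position of value v in the permutation, ascending-to-max is taken to mean that consecutive values i and i+1 lying on the same side of n appear in increasing order of position before n and in decreasing order after n. Non-nesting is taken to mean that if i precedes j, and i and j each step in the same direction to reach i+1 and j+1, then i+1 precedes j+1. The non-nesting test here is a simple quadratic loop, not the O(n) procedure the slide refers to, and all function names are hypothetical.

```python
def ranks_of(perm):
    """rank[v] = 1-based position of value v in perm (index 0 is unused)."""
    rank = [0] * (len(perm) + 1)
    for pos, value in enumerate(perm, start=1):
        rank[value] = pos
    return rank


def is_ascending_to_max(perm):
    n, rank = len(perm), ranks_of(perm)
    for i in range(1, n - 1):  # i = 1 .. n-2, so that i+1 <= n-1
        both_before = rank[i] < rank[n] and rank[i + 1] < rank[n]
        both_after = rank[i] > rank[n] and rank[i + 1] > rank[n]
        if both_before and not rank[i] < rank[i + 1]:
            return False
        if both_after and not rank[i] > rank[i + 1]:
            return False
    return True


def is_non_nesting(perm):
    n, rank = len(perm), ranks_of(perm)
    for i in range(1, n):
        for j in range(1, n):
            if rank[i] < rank[j]:
                same_direction = (rank[i] < rank[i + 1]) == (rank[j] < rank[j + 1])
                if same_direction and not rank[i + 1] < rank[j + 1]:
                    return False
    return True


def has_both_properties(perm):
    return is_ascending_to_max(perm) and is_non_nesting(perm)


print(has_both_properties([4, 7, 5, 1, 8, 3, 6, 2]))  # the suffix array from the example
print(has_both_properties([4, 7, 1, 5, 8, 2, 3, 6]))  # the other permutation
```

Under this reading the second permutation already fails ascending-to-max: values 2 and 3 both lie after 8, yet 2 appears before 3.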


Application: Space Efficient Suffix Array

Text: abaaabbaaabaabb#

SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14

Ba: 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1

Bb: 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0
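
The transcript does not say how the bit vectors labelled Ba and Bb are defined. The values on this slide are consistent with the following reading, which is an inference from the example rather than a quoted definition: bit i of Bc is 1 exactly when the character immediately before suffix SA[i] is c. The sketch below recomputes both vectors under that assumption; `preceding_char_bits` is a hypothetical name.

```python
def preceding_char_bits(text, sa, c):
    """Bit i is 1 iff the character just before suffix sa[i] (1-based) equals c."""
    # Suffix 1 has no preceding character, so its bit is 0 in every vector.
    return [1 if start > 1 and text[start - 2] == c else 0 for start in sa]


text = "abaaabbaaabaabb#"
sa = [8, 3, 9, 4, 12, 1, 10, 5, 13, 16, 7, 2, 11, 15, 6, 14]

ba = preceding_char_bits(text, sa, "a")
bb = preceding_char_bits(text, sa, "b")
print("Ba:", *ba)
print("Bb:", *bb)
```

Together the two vectors take 2n bits, compared with n lg n bits for the suffix array itself.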


Basic Searching Algorithm: Answering Cardinality Queries

SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14

Basic Idea: backward search

  • Start from the end of the pattern P

  • For i = m, m-1, …, 1, compute the interval [s, e] of SA whose corresponding suffixes are prefixed with P[i..m]

P = aba
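
A minimal sketch of backward search on the example text follows. It makes several assumptions that are not in the slides: the two bit vectors mark which character precedes each suffix, rank queries are answered from plain prefix-sum arrays rather than a succinct rank structure, intervals are 0-based and half-open, and the pattern contains only a and b. The names `build_index` and `count_occurrences` are hypothetical.

```python
from itertools import accumulate

ORDER = {"a": 0, "#": 1, "b": 2}  # a < # < b


def build_index(text):
    n = len(text)
    # 0-based suffix array, built naively for illustration only.
    sa = sorted(range(n), key=lambda i: [ORDER[c] for c in text[i:]])
    ranks = {}
    for c in "ab":
        bits = [1 if start > 0 and text[start - 1] == c else 0 for start in sa]
        ranks[c] = [0] + list(accumulate(bits))  # ranks[c][k] = number of 1s in bits[:k]
    # First SA index of each character block: a-suffixes, then "#", then b-suffixes.
    first = {"a": 0, "b": text.count("a") + 1}
    return ranks, first, n


def count_occurrences(index, pattern):
    ranks, first, n = index
    lo, hi = 0, n  # interval of all suffixes (empty pattern)
    for c in reversed(pattern):  # process P[m], P[m-1], ..., P[1]
        lo = first[c] + ranks[c][lo]
        hi = first[c] + ranks[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_index("abaaabbaaabaabb#")
print(count_occurrences(index, "aba"))
```

For P = aba the interval narrows from the block of suffixes starting with a, to those starting with ba, to those starting with aba. The final interval covers the SA entries 1 and 10, the two places where aba occurs in the text.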


More Algorithms and Tradeoffs

  • Answering listing queries

  • Speeding up the reporting of occurrences of long patterns

  • Self-indexing

  • Time-space tradeoff: multi-level structure


Putting It All Together

Three index structures:


Conclusion

  • Summary

    • A theorem that characterizes a permutation as the suffix array of a binary string

    • An efficient algorithm checking whether a permutation is a suffix array

    • Three space efficient text indexing methods


Conclusions (Continued)

  • Related subsequent work

    • Generalization to larger alphabets

  • Open problem

    • O(n)-bits text index supporting searching in O(m+occ) time.


Thank You.

