A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes

Meng He, J. Ian Munro, and S. Srinivasa Rao

University of Waterloo

The Problem
  • Initial Problem
    • Text searching: Finding occurrences of a pattern string in a large (static) document
  • Solution
    • Text indexing: Trading space for time
  • New Problem
    • Succinct Text indexes: Reducing the space cost
Pattern Searching
  • Given a text string T of length n and a pattern string P of length m, find the occurrences of P in T.
  • Three types of Queries
    • Existential queries: Does P occur in T?
    • Cardinality queries: How many times does P occur in T?
    • Listing queries: Where does P occur in T?
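
As a point of reference, the three query types can be answered without any index by a direct scan. The sketch below is only that baseline, taking O(nm) time per query; the function names are hypothetical and do not come from the slides. Positions are 1-based, as in the rest of the deck.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Listing query: 1-based start positions of pattern in text (naive scan)."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def exists(text: str, pattern: str) -> bool:
    """Existential query: does pattern occur in text?"""
    return bool(occurrences(text, pattern))


def cardinality(text: str, pattern: str) -> int:
    """Cardinality query: how many times does pattern occur in text?"""
    return len(occurrences(text, pattern))


if __name__ == "__main__":
    t = "abaaabbaaabaabb#"
    print(exists(t, "aba"), cardinality(t, "aba"), occurrences(t, "aba"))
```

A text index replaces the scan with a precomputed structure so that query time depends mainly on m and on the number of occurrences rather than on n.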
Text Indexing
  • Inverted files
    • Word index
    • Need to store the text as well as the index
  • Suffix trees
    • Efficient full-text index
    • 4n lg n to 6n lg n bits!
  • Suffix arrays
    • n lg n bits in basic form, but
    • 3n lg n bits (with LCP data)
Applications
  • Text databases
    • electronic encyclopedias, dictionaries, books, etc.
  • Web search engines
    • Google, Altavista, etc.
  • Bioinformatics
    • gene databases
  • More…
Related Work
  • Compressed Suffix Arrays
    • Grossi & Vitter 2000
    • Sadakane 2000
    • Grossi, Gupta & Vitter 2003
  • FM-index
    • Ferragina & Manzini 2000 & 2001
Assumptions & Notation
  • Alphabet: Σ = {a, b}
  • Text: T[1..n]
    • T[n] = #, where a < # < b
  • Pattern: P[1..m]
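
To make the notation concrete, here is a minimal sketch of building a suffix array under the order a < # < b. It sorts the suffixes directly, which is simple but not efficient. The names `ORDER` and `suffix_array` are illustrative assumptions, not taken from the slides.

```python
# Character order assumed by the slides: a < # < b.
ORDER = {"a": 0, "#": 1, "b": 2}


def suffix_array(text: str) -> list[int]:
    """Return the 1-based suffix array of text under a < # < b.

    Plain sort of all suffixes: fine for small examples, not for large texts.
    """
    def key(start: int) -> list[int]:
        # Map the suffix T[start..n] to a list of ranks so that list
        # comparison follows the a < # < b order.
        return [ORDER[c] for c in text[start - 1:]]

    return sorted(range(1, len(text) + 1), key=key)


if __name__ == "__main__":
    print(suffix_array("abbaaba#"))
```

Because # occurs only at T[n], no suffix is a prefix of another, so the order of the suffixes is always well defined.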
Permutations and Suffix Arrays
  • An observation
    • Permutations: n!
    • Suffix arrays: 2^(n-1)
    • Not all permutations are suffix arrays
  • An example
    • A suffix array: 4, 7, 5, 1, 8, 3, 6, 2
      • Text: abbaaba#
    • A permutation: 4, 7, 1, 5, 8, 2, 3, 6
      • Not a suffix array of any binary text
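
The counts in this observation can be checked by brute force for small n. The sketch below enumerates every binary text of length n that ends in #, builds each suffix array by direct sorting, and collects the distinct results. The helper names are hypothetical, and the approach is exponential in n, so it is only meant for small sanity checks.

```python
from itertools import product
from math import factorial

ORDER = {"a": 0, "#": 1, "b": 2}  # a < # < b


def suffix_array(text: str) -> tuple[int, ...]:
    """1-based suffix array of text under a < # < b (direct sort)."""
    return tuple(sorted(range(1, len(text) + 1),
                        key=lambda i: [ORDER[c] for c in text[i - 1:]]))


def distinct_suffix_arrays(n: int) -> set[tuple[int, ...]]:
    """Suffix arrays of all texts over {a, b} of length n with T[n] = #."""
    texts = ("".join(body) + "#" for body in product("ab", repeat=n - 1))
    return {suffix_array(t) for t in texts}


if __name__ == "__main__":
    arrays = distinct_suffix_arrays(8)
    print(len(arrays), "suffix arrays out of", factorial(8), "permutations")
    print((4, 7, 5, 1, 8, 3, 6, 2) in arrays)  # the suffix array of abbaaba#
    print((4, 7, 1, 5, 8, 2, 3, 6) in arrays)  # the other permutation
```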
Two Features of Suffix Arrays
  • Ascending-to-max
  • Non-nesting
  • Suffix array: 4 7 5 1 8 3 6 2
  • Another permutation: 4 7 1 5 8 2 3 6

A Categorization Theorem
  • A permutation is a suffix array iff it is:
    • Ascending-to-max
    • Non-nesting
  • An immediate application:
    • Checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words in memory.
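
The transcript does not preserve the definitions of the two properties, so the sketch below does not implement the O(n) check from the theorem. It is a simpler alternative for experimenting, and it relies on an observation that follows from a < # < b: every suffix ranked before the suffix "#" must start with a, and every suffix ranked after it must start with b. A permutation therefore fixes a single candidate text, and the permutation is a suffix array exactly when that text's suffix array equals it. All names are illustrative, and the sort-based comparison is slower than O(n).

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # a < # < b


def suffix_array(text: str) -> list[int]:
    """1-based suffix array of text under a < # < b (direct sort)."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])


def text_from_permutation(perm: list[int]) -> str:
    """The only binary text (ending in #) that perm could be the suffix array of."""
    n = len(perm)
    pivot = perm.index(n)  # rank of the one-character suffix "#"
    text = [""] * n
    for rank, pos in enumerate(perm):
        text[pos - 1] = "a" if rank < pivot else "#" if rank == pivot else "b"
    return "".join(text)


def is_suffix_array(perm: list[int]) -> bool:
    """Check by reconstruction; NOT the O(n) algorithm mentioned on the slide."""
    if not perm or sorted(perm) != list(range(1, len(perm) + 1)):
        return False  # not a permutation of 1..n
    return suffix_array(text_from_permutation(perm)) == list(perm)


if __name__ == "__main__":
    print(is_suffix_array([4, 7, 5, 1, 8, 3, 6, 2]))
    print(is_suffix_array([4, 7, 1, 5, 8, 2, 3, 6]))
```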
Application: Space Efficient Suffix Array
  • Text: abaaabbaaabaabb#
  • SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14
  • Ba: 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1
  • Bb: 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0
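
The transcript does not state how the two bit vectors are defined. The values on this slide are consistent with one reading, which the sketch below assumes: entry i of Ba (respectively Bb) is 1 exactly when the character immediately before suffix SA[i] is a (respectively b), and both entries are 0 when SA[i] = 1. The function name is hypothetical.

```python
def bit_vectors(text: str, sa: list[int]) -> tuple[list[int], list[int]]:
    """Return (Ba, Bb) under the ASSUMED definition:
    B_c[i] = 1 iff T[SA[i] - 1] = c, and 0 when SA[i] = 1 (no preceding character).
    """
    # sa holds 1-based positions; the character before position p is text[p - 2].
    prev = [text[p - 2] if p > 1 else None for p in sa]
    ba = [int(c == "a") for c in prev]
    bb = [int(c == "b") for c in prev]
    return ba, bb


if __name__ == "__main__":
    text = "abaaabbaaabaabb#"
    sa = [8, 3, 9, 4, 12, 1, 10, 5, 13, 16, 7, 2, 11, 15, 6, 14]
    ba, bb = bit_vectors(text, sa)
    print("Ba:", *ba)
    print("Bb:", *bb)
```

Under this reading the two vectors take 2n bits in total, compared with n lg n bits for the suffix array itself.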

Basic Searching Algorithm: Answering Cardinality Queries
  • Basic idea: backward search
    • Start from the end of the pattern P
    • For i = m, m-1, …, 1, compute the interval [s,e] of SA whose corresponding suffixes are prefixed with P[i..m]
  • P = aba
  • SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14
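
A minimal sketch of backward search for cardinality queries follows. It assumes the two bit vectors from the preceding slide mark, for each SA entry, whether the preceding text character is a or b, and it uses the standard interval update with rank counts over those vectors. Whether this matches the authors' exact procedure is an assumption. Rank is precomputed as plain prefix sums here, so the sketch is not space efficient, and all names are illustrative.

```python
from itertools import accumulate

# Bit vectors for the text abaaabbaaabaabb#, as shown on the preceding slide.
BA = [0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
BB = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0]


def sa_interval(pattern: str, ba: list[int], bb: list[int]):
    """Return the 1-based SA interval (s, e) of suffixes prefixed by pattern, or None."""
    n = len(ba)
    # rank[c][i] = number of 1s among the first i entries of B_c.
    rank = {"a": [0, *accumulate(ba)], "b": [0, *accumulate(bb)]}
    n_a = rank["a"][n]  # number of a's in the text
    # Number of suffixes that sort before every suffix starting with c:
    # none for a; all a-suffixes plus the suffix "#" for b.
    before = {"a": 0, "b": n_a + 1}
    s, e = 1, n  # interval for the empty pattern
    for c in reversed(pattern):  # i = m, m-1, ..., 1
        s = before[c] + rank[c][s - 1] + 1
        e = before[c] + rank[c][e]
        if s > e:
            return None  # P[i..m] does not occur
    return s, e


def cardinality(pattern: str, ba: list[int], bb: list[int]) -> int:
    """Number of occurrences of pattern, read off the interval length."""
    interval = sa_interval(pattern, ba, bb)
    return 0 if interval is None else interval[1] - interval[0] + 1


if __name__ == "__main__":
    print(sa_interval("aba", BA, BB), cardinality("aba", BA, BB))
```

For P = aba the intervals narrow from [1,9] (suffixes starting with a) to [11,13] (ba) to [6,7] (aba). SA[6] = 1 and SA[7] = 10 are the two occurrences.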

More Algorithms and Tradeoffs
  • Answering listing queries
  • Speeding up the reporting of Occurrences of Long Patterns
  • Self-indexing
  • Time-space tradeoff: multi-level structure
Putting it all together

Three index structures:

Conclusion
  • Summary
    • A theorem that characterizes a permutation as the suffix array of a binary string
    • An efficient algorithm checking whether a permutation is a suffix array
    • Three space efficient text indexing methods
Conclusions (Continued)
  • Related subsequent work
    • Generalization to larger alphabets
  • Open problem
    • O(n)-bits text index supporting searching in O(m+occ) time.