A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes

Meng He, J. Ian Munro, and S. Srinivasa Rao

University of Waterloo


The Problem

  • Initial Problem

    • Text searching: Finding occurrences of a pattern string in a large (static) document

  • Solution

    • Text indexing: Trading space for time

  • New Problem

    • Succinct Text indexes: Reducing the space cost


Pattern Searching

  • Given a text string T of length n and a pattern string P of length m, we look for the occurrences of P in T.

  • Three types of Queries

    • Existential queries: Does P occur in T?

    • Cardinality queries: How many times does P occur in T?

    • Listing queries: Where does P occur in T?
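The three query types can be illustrated with a naive scan of the text. This is only a sketch to pin down what each query returns; the function names are hypothetical, positions are 1-based to match the T[1..n] notation, and no index is used.

```python
def listing(text, pattern):
    """Listing query: 1-based start positions of every occurrence of pattern."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def cardinality(text, pattern):
    """Cardinality query: how many times pattern occurs (overlaps counted)."""
    return len(listing(text, pattern))


def existential(text, pattern):
    """Existential query: does pattern occur at all?"""
    return pattern in text


TEXT = "abaaabbaaabaabb#"
print(existential(TEXT, "aba"), cardinality(TEXT, "aba"), listing(TEXT, "aba"))
```

A text index exists to answer the same three questions without rescanning T for every pattern.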


Text Indexing

  • Inverted files

    • Word index

    • Need to store the text as well as the index

  • Suffix trees

    • Efficient full-text index

    • 4n lg n to 6n lg n bits!

  • Suffix arrays

    • n lg n bits in basic form, but

    • 3n lg n bits (with LCP data)
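To get a feel for these space bounds, the sketch below evaluates them for a concrete n. It assumes lg n is rounded up to a whole number of bits per entry; the function name and dictionary keys are illustrative only.

```python
def index_bits(n):
    """Bit counts from the space bounds above, with lg n taken as ceil(log2 n)."""
    lg = (n - 1).bit_length()  # equals ceil(log2 n) for n >= 2
    return {
        "suffix array (basic)": n * lg,
        "suffix array + LCP data": 3 * n * lg,
        "suffix tree (low)": 4 * n * lg,
        "suffix tree (high)": 6 * n * lg,
    }


# A text of 2**20 characters: each index entry needs 20 bits.
for name, bits in index_bits(2 ** 20).items():
    print(f"{name}: {bits / 8 / 2 ** 20:.1f} MiB")
```

Over a binary alphabet the text itself needs only about n bits, which is why an index of n lg n bits or more is considered expensive.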


Applications

  • Text databases

    • electronic encyclopedias, dictionaries, books, etc.

  • Web search engines

    • Google, Altavista, etc.

  • Bioinformatics

    • gene databases

  • More…


Related Work

  • Compressed Suffix Arrays

    • Grossi & Vitter 2000

    • Sadakane 2000

    • Grossi, Gupta & Vitter 2003

  • FM-index

    • Ferragina & Manzini 2000 & 2001


Assumptions & Notation

  • Alphabet: Σ = {a, b}

  • Text: T[1..n]

    • T[n] = #, where a < # < b

  • Pattern: P[1..m]


Permutations and Suffix Arrays

  • An observation

    • Permutations: n!

    • Suffix arrays: 2^(n-1)

    • Not all permutations are suffix arrays

  • An example

    • A suffix array: 4, 7, 5, 1, 8, 3, 6, 2

      • Text: abbaaba#

    • A permutation: 4, 7, 1, 5, 8, 2, 3, 6

      • Not a suffix array of any binary text
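The example and the count can be checked by brute force for small n. The sketch below uses a naive sort-based construction (not a space-efficient one) under the ordering a < # < b; the helper names are hypothetical.

```python
from itertools import product

ORDER = {"a": 0, "#": 1, "b": 2}  # the ordering a < # < b from the notation slide


def suffix_array(text):
    """Naive construction: sort 1-based suffix start positions by suffix."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])


def distinct_suffix_arrays(n):
    """Suffix arrays of all texts made of n-1 characters from {a, b} plus '#'."""
    return {tuple(suffix_array("".join(t) + "#"))
            for t in product("ab", repeat=n - 1)}


print(suffix_array("abbaaba#"))
print(len(distinct_suffix_arrays(8)), "suffix arrays among 8! = 40320 permutations")
```

Every one of the 2^(n-1) texts yields a different suffix array, because the entries to the left of n in the array mark exactly the positions holding a.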


Two Features of Suffix Arrays

Ascending-to-max

Non-nesting

Suffix Array: 4 7 5 1 8 3 6 2

Another Permutation: 4 7 1 5 8 2 3 6


A Categorization Theorem

  • A permutation is a suffix array iff it is:

    • Ascending-to-max

    • Non-nesting

  • An immediate application:

    • Checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words in memory.
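For comparison, here is a much simpler check that does not use the theorem and is not the O(n) algorithm mentioned above. It relies on one observation: with a < # < b, the entries to the left of n in a suffix array must be positions holding a, and the entries to its right positions holding b, so a permutation has exactly one candidate text. The sketch rebuilds that text, sorts its suffixes naively, and compares; all names are hypothetical.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # the ordering a < # < b


def suffix_array(text):
    """Naive construction: sort 1-based suffix start positions by suffix."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])


def candidate_text(perm):
    """The only binary text whose suffix array could equal perm."""
    n = len(perm)
    pivot = perm.index(n)  # where the suffix '#' sits in the array
    text = [""] * n
    for rank, pos in enumerate(perm):
        text[pos - 1] = "a" if rank < pivot else "#" if rank == pivot else "b"
    return "".join(text)


def is_suffix_array(perm):
    """True iff perm is the suffix array of some text over {a, b} ending in '#'."""
    n = len(perm)
    if n == 0 or sorted(perm) != list(range(1, n + 1)):
        return False
    return suffix_array(candidate_text(perm)) == list(perm)


print(is_suffix_array([4, 7, 5, 1, 8, 3, 6, 2]))
print(is_suffix_array([4, 7, 1, 5, 8, 2, 3, 6]))
```

Both example permutations map to the same candidate text, abbaaba#, so at most one of them can be a suffix array.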


Application: Space Efficient Suffix Array

Text: abaaabbaaabaabb#

SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14

Ba: 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1

Bb: 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0
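The slide does not spell out how Ba and Bb are defined. One reading that reproduces both bit vectors exactly, offered here as an assumption, is that Ba[i] = 1 when the character just before suffix SA[i] is a, and Bb[i] = 1 when it is b; the suffix starting at position 1 has no preceding character and gets 0 in both. The function name is hypothetical.

```python
def bit_vectors(text, sa):
    """Ba[i] / Bb[i] flag whether the character preceding suffix sa[i] is a / b."""
    ba, bb = [], []
    for pos in sa:  # pos is a 1-based start position
        prev = text[pos - 2] if pos > 1 else None  # T[pos-1] in 1-based terms
        ba.append(1 if prev == "a" else 0)
        bb.append(1 if prev == "b" else 0)
    return ba, bb


TEXT = "abaaabbaaabaabb#"
SA = [8, 3, 9, 4, 12, 1, 10, 5, 13, 16, 7, 2, 11, 15, 6, 14]
BA, BB = bit_vectors(TEXT, SA)
print("Ba:", *BA)
print("Bb:", *BB)
```

Under this reading the two vectors together take 2n bits and record, for each suffix in sorted order, which character precedes it.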


Basic Searching Algorithm: Answering Cardinality Queries

SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14

Basic Idea: backward search

  • Start from the end of the pattern P

  • For i = m, m-1, …, 1, compute the interval [s,e] of SA whose corresponding suffixes are prefixed with P[i..m]

P = aba
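A minimal sketch of backward search for this example follows. It assumes Ba[i] and Bb[i] flag whether the character preceding suffix SA[i] is a or b, and it answers rank queries from plain prefix-sum lists rather than a succinct rank structure, so it shows the interval arithmetic only, not the space bounds. The function names are hypothetical, and the pattern must consist of a and b only.

```python
BA = [0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
BB = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0]


def prefix_ranks(bits):
    """ranks[j] = number of 1s among the first j bits."""
    ranks = [0]
    for bit in bits:
        ranks.append(ranks[-1] + bit)
    return ranks


def backward_search(pattern, ba, bb):
    """Return the 1-based interval (s, e) of SA for pattern, or None if absent."""
    n = len(ba)
    rank = {"a": prefix_ranks(ba), "b": prefix_ranks(bb)}
    # Suffixes starting with a come first, then '#', then those starting with b.
    offset = {"a": 0, "b": rank["a"][n] + 1}
    s, e = 1, n
    for c in reversed(pattern):  # i = m, m-1, ..., 1
        s = offset[c] + rank[c][s - 1] + 1
        e = offset[c] + rank[c][e]
        if s > e:
            return None
    return s, e


def count(pattern, ba=BA, bb=BB):
    """Cardinality query: size of the final interval."""
    interval = backward_search(pattern, ba, bb)
    return 0 if interval is None else interval[1] - interval[0] + 1


print(backward_search("aba", BA, BB), count("aba"))
```

For P = aba the interval goes from [1,9] (suffixes prefixed with a) to [11,13] (ba) to [6,7] (aba), and SA[6..7] = 1, 10 are the two occurrences in the text.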


More Algorithms and Tradeoffs

  • Answering listing queries

  • Speeding up the reporting of occurrences of long patterns

  • Self-indexing

  • Time-space tradeoff: multi-level structure


Putting it all together

Three index structures:


Conclusion

  • Summary

    • A theorem that characterizes a permutation as the suffix array of a binary string

    • An efficient algorithm checking whether a permutation is a suffix array

    • Three space efficient text indexing methods


Conclusions (Continued)

  • Related subsequent work

    • Generalization to larger alphabets

  • Open problem

    • An O(n)-bit text index supporting searching in O(m+occ) time.


Thank You.

