The SBC-Tree: An Index for Run-Length Compressed Sequences

The SBC-Tree: An Index for Run-Length Compressed Sequences Mohamed El-tabakh1, Wing-Kia Hon2 Rahul Shah3, Walid Aref1, Jeffrey Vitter1 1 Department of Computer Science, Purdue University 2 Department of Computer Science, National Tsing Hua University 3 Department of Computer Science, Louisiana State University

Outline • Introduction • Related Work • SBC-Tree Structure • SBC-Tree Operations • Theoretical and Experimental Analysis • Summary

Introduction: Why Compression? • We deal with massive amount of data, scientific databases, … • Text and sequence formats are very common • Compression techniques gain significant importance because they achieve: • Significant storage reduction • Reducing buffer requirements • Reducing number of I/Os >>> Enhance the overall system performance

Introduction: Objective • Current databases do not support data compression • Operate over the raw data compress Store, Index, and Search the compressed Sequences Store, Index, and Search the decompressed sequences • The main challenge is how to operate on the compressed data without decompressing it • More challenging for external memory processing

Processing Compressed Sequences: Related Work(1) Processing compressed data in main memory • A. Amir and G. Benson. Efficient two-dimensional compressed matching. In DCC, 1992. • A. Amir, G. Benson, and M. Farach. Let sleeping files lie: pattern matching in z-compressed files. In SODA, 1994. • A. Apostolico, G. M. Landau, and S. Skiena. Matching for run-length encoded strings. Journal of Complexity,1999. • T. Bell, M. Powell, A. Mukherjee, and D. Adjeroh. Searching BWT compressed text with the boyer-moore algorithm and binary search. In DCC, 2002. • V. Freschi and A. Bogliolo. Longest common subsequence between run-length-encoded strings: a new algorithm with improved parallelism. Information Processing Letters, 2004. • Searching compressed data is addressed in main memory • Substring matching, longest common subsequence, edit distance

20 Run-Length encoding 100 times (20, 100) 20 Processing Compressed Sequences: Related Work(2) Processing compressed data in DBMSs • M. Stonebraker, D. Abadi, A. Batkin, X. Chen,M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik, C-store: A column oriented dbms, In VLDB, 2005. • D. Abadi, S. Madden. Compression in Column Oriented Databases. In SIGMOD, 2006. Operations such as SUM can be applied directly over the compressed data • More complex operations have not been addressed yet • Indexing RLE-compressed sequences • Substring searching Column in a database table

What is SBC-Tree? • SBC-Tree (String B-tree for Compressed sequences) • An index for Run-Length Encoding (RLE) compressed sequences • Supports prefix, range, and substring matching • Optimal theoretical bounds for: • External memory space complexity • Search I/O requirements >> Relative to the size of the compressed sequences

RLE-char SBC-Tree: An Index for RLE Compressed Sequences S = LLLLLLLLLLEEEEEELLLLEEEHHHHHHHHHHHHHHHHHH RLE(S) = L10 E6 L4 E3 H18 >> S has 41 suffixes >> RLE(S) has 5 RLE-suffixes • Run-Length Encoding (RLE) • Replace tandem repeated characters with their frequency • Effective with small alphabets RLE-suffixes L10 E6 L4 E3 H18 E6 L4 E3 H18 L4 E3 H18 E3 H18 H18 • Store the compressed sequences • Index the RLE-suffixes • Perform efficiently substring operations

Preceding character Two-dimensional Index (e.g., R-tree) Tags String B-tree Numeric tag assigned to each suffix root SBC-tree Structure • Two-level index structure • String B-tree: Indexes the RLE-suffixes • Two-dimensional index: built on top of the leaves of the string B-tree

String B-tree Overview[P. Ferragina and R. Grossi., Journal of ACM,1999] • S = LLLLLLLLLLEEEEEELLLLEEEHHHHHHHHHHHHHHHHHH 11 21 • Generate all suffixes of S • Insert the suffixes into the String B-tree (ordered alphabetically) • Store the logical keys instead of the key sequence • Several optimizations to achieve optimal theoretical bounds for: •  External memory space complexity •  Search I/O requirements (S,11) (S,12) (S,13) (S,21) Store logical pointers instead of the keys • >> Relative to the size of the raw (decompressed) sequences

4 6 8 10 5 1 4 2 3 (S,1) (S,4) (S,8) (S,10) (S,6) Implicit in L10 E6 L4 E3 H18 Implicit in E6 L4 E3 H18 String B-tree over RLE-suffixes • String B-tree CANNOT be used directly to index RLE-suffixes • RLE-suffixes are subset of the total suffixes L10 E6 L4 E3 H18 Order L10 E6 L4 E3 H18 E6 L4 E3 H18 L4 E3 H18 E3 H18 H18 • We indexed only subset of the suffixes (RLE-suffixes) • Searching for “L10 E6 L3” Found • Searching for “L5 E6 L3”  Not Found • Searching for “E3 L4”  Not Found

RLE-suffixes L10 E6 L4 E3 H18 E6 L4 E3 H18 L4 E3 H18 E3 H18 H18 L5 E6 L3 L5 H2 Not part of the query answer L5 K10 L3 L6 E6 L3 SBC-Tree over RLE-suffixes • Query Pattern Mapping Rule: • Substring query pattern P = x1f1 x2f2 … xnfn is mapped into P’ = x1f1+ x2f2 … xnfn • Searching for “L5 E6 L3”  (L5+ E6 L3)  Found • Searching for “E3 L4” (E3+ L4)  Found Challenge: The answer set is no longer consecutive in the index tree  Unbounded number of I/Os to answer a query L5+ E6 L3

SBC-tree: Insertion Procedure • Given an RLE sequence S = Ω1x1f1 x2f2 … xnfn • Insert S as the first suffix into the SBC-tree first level • 1 ≤ i ≤ n, insert RLE-suffix xi1 xi+1fi+1 … xnfninto the SBC-tree first level • Assign it a position tag T (Tag assignment problem) • Insert into the SBC-tree second level point = (T, fi)

The answer set Preceding RLE-char f1 Suffix tag Max_tag Min_tag SBC-tree: Substring Searching • Given a query Q = y1f1 y2f2 … ymfm • Map Q into Q’ = y1f1+ y2f2 … ymfm • Search the String B-tree for Q’’ = y11 y2f2 … ymfm • Returns (min_tag, max_tag) as a contiguous range • Search the SBC-tree second level for suffixes with frequency >= f1 Two-dimensional index String B-tree

SBC-Tree: Example P = A5 E3 B4 P’ = A5+ E3 B4 P’’ = A1 E3 B4

SBC-Tree Variants • 3-sided structure[L. Arge, V. Samoladas, J. Vitter, PODS99] • External memory structure based on priority search tree and B-tree • Answers 3-sided range queries in 2D space • Provides optimal worst-case theoretical bounds for: • External memory space complexity • Insertion and deletion • 3-sided range query • R-tree • Available in all DBMSs • Provides good performance in practice • Does not have worst-case theoretical bounds for searching • One-Level SBC-tree • Remove the second level structure • Disadvantage: In queries scan many tuples outside the answer set

SBC-tree Theoretical Bounds • Optimal external-memory space complexity O(N/B) • Optimal substring, prefix, and range searching in O(LogBN + (|p| +T)/B) I/O operations • Insertion and deletion in (m LogB(N+m)) amortized I/O operations

SBC-tree Implementation • SBC-tree (R-tree variant) is implemented inside PostgreSQL • Query operators: • ^^ (substring search) • @@ (Prefix search) • ==<< ==>> (Range search) CREATE TABLE sequences (id INT, RLE_seq VARCHAR); CREATE INDEX ON sequences USING sbctree (seq); SELECT id FROM sequences WHERE RLE_seq ^^ ‘A5H7N2’; Substring searching operator

SBC-tree Performance Analysis: Storage Requirements Comparing SBC-tree performance relative to String B-tree over uncompressed sequences Datasets SwissProt (Protein secondary structure) alphabet size = 3 WalMart (Sales profile time series) alphabet size = 5 Temperature (Time series of sensor readings) alphabet size = 52 Up to an order of magnitude saving in storage

SBC-tree Performance Analysis:Insertion Comparing SBC-tree performance relative to String B-tree over uncompressed sequences Datasets SwissProt (Protein secondary structure) alphabet size = 3 WalMart (Sales profile time series) alphabet size = 5 Temperature (Time series of sensor readings) alphabet size = 52 Around 30% saving in Insertion

SBC-tree Performance Analysis:Searching Comparing SBC-tree performance relative to String B-tree over uncompressed sequences Datasets SwissProt (Protein secondary structure) alphabet size = 3 WalMart (Sales profile time series) alphabet size = 5 Temperature (Time series of sensor readings) alphabet size = 52 • Retain the optimal search performance (only the query answer is retrieved) • Some additional overhead because of the two-level structure

Summary • Addressing the challenge of storing and operating on compressed data inside DBMSs without decompression • Introduced the SBC-tree as an index for Run-Length Encoded (RLE) compressed sequences • SBC-Tree has optimal theoretical bounds for: • External memory space complexity • Search I/O requirements • Implementation inside PostgreSQL 22

Thank you Mohamed Eltabakh (meltabak@cs.purdue.edu)

The SBC-Tree: An Index for Run-Length Compressed Sequences

The SBC-Tree: An Index for Run-Length Compressed Sequences

Presentation Transcript

The SBC Transformation

Run Length Encoder/Decoder

Compressed Suffix Arrays based on Run-Length Encoding

B-Tree Index

Database Index to Large Biological Sequences

Run-Length Encoding for Texture Classification

Index-based search of single sequences

Compressed Index for Dictionary Matching

P2PR-tree: An R-tree-based Spatial Index for P2P Environments

B + -Tree Index Files

The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Approximate Matching of Run-Length Compressed Strings

An environment for the long run!

Compressed Index for a Dynamic Collection of Texts

Get Longest Run Index (FR)

B+-Tree Index

“An Arms Length?”

Data Coding Run Length Coding

B + -Tree Index Files