1 / 23

The SBC-Tree: An Index for Run-Length Compressed Sequences

The SBC-Tree: An Index for Run-Length Compressed Sequences. Mohamed El-tabakh 1 , Wing-Kia Hon 2 Rahul Shah 3 , Walid Aref 1 , Jeffrey Vitter 1 1 Department of Computer Science, Purdue University 2 Department of Computer Science, National Tsing Hua University

Download Presentation

The SBC-Tree: An Index for Run-Length Compressed Sequences

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The SBC-Tree: An Index for Run-Length Compressed Sequences Mohamed El-tabakh1, Wing-Kia Hon2 Rahul Shah3, Walid Aref1, Jeffrey Vitter1 1 Department of Computer Science, Purdue University 2 Department of Computer Science, National Tsing Hua University 3 Department of Computer Science, Louisiana State University

  2. Outline • Introduction • Related Work • SBC-Tree Structure • SBC-Tree Operations • Theoretical and Experimental Analysis • Summary

  3. Introduction: Why Compression? • We deal with massive amount of data, scientific databases, … • Text and sequence formats are very common • Compression techniques gain significant importance because they achieve: • Significant storage reduction • Reducing buffer requirements • Reducing number of I/Os >>> Enhance the overall system performance

  4. Introduction: Objective • Current databases do not support data compression • Operate over the raw data compress Store, Index, and Search the compressed Sequences Store, Index, and Search the decompressed sequences • The main challenge is how to operate on the compressed data without decompressing it • More challenging for external memory processing

  5. Processing Compressed Sequences: Related Work(1) Processing compressed data in main memory • A. Amir and G. Benson. Efficient two-dimensional compressed matching. In DCC, 1992. • A. Amir, G. Benson, and M. Farach. Let sleeping files lie: pattern matching in z-compressed files. In SODA, 1994. • A. Apostolico, G. M. Landau, and S. Skiena. Matching for run-length encoded strings. Journal of Complexity,1999. • T. Bell, M. Powell, A. Mukherjee, and D. Adjeroh. Searching BWT compressed text with the boyer-moore algorithm and binary search. In DCC, 2002. • V. Freschi and A. Bogliolo. Longest common subsequence between run-length-encoded strings: a new algorithm with improved parallelism. Information Processing Letters, 2004. • Searching compressed data is addressed in main memory • Substring matching, longest common subsequence, edit distance

  6. 20 Run-Length encoding 100 times (20, 100) 20 Processing Compressed Sequences: Related Work(2) Processing compressed data in DBMSs • M. Stonebraker, D. Abadi, A. Batkin, X. Chen,M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik, C-store: A column oriented dbms, In VLDB, 2005. • D. Abadi, S. Madden. Compression in Column Oriented Databases. In SIGMOD, 2006. Operations such as SUM can be applied directly over the compressed data • More complex operations have not been addressed yet • Indexing RLE-compressed sequences • Substring searching Column in a database table

  7. What is SBC-Tree? • SBC-Tree (String B-tree for Compressed sequences) • An index for Run-Length Encoding (RLE) compressed sequences • Supports prefix, range, and substring matching • Optimal theoretical bounds for: • External memory space complexity • Search I/O requirements >> Relative to the size of the compressed sequences

  8. RLE-char SBC-Tree: An Index for RLE Compressed Sequences S = LLLLLLLLLLEEEEEELLLLEEEHHHHHHHHHHHHHHHHHH RLE(S) = L10 E6 L4 E3 H18 >> S has 41 suffixes >> RLE(S) has 5 RLE-suffixes • Run-Length Encoding (RLE) • Replace tandem repeated characters with their frequency • Effective with small alphabets RLE-suffixes L10 E6 L4 E3 H18 E6 L4 E3 H18 L4 E3 H18 E3 H18 H18 • Store the compressed sequences • Index the RLE-suffixes • Perform efficiently substring operations

  9. Preceding character Two-dimensional Index (e.g., R-tree) Tags String B-tree Numeric tag assigned to each suffix root SBC-tree Structure • Two-level index structure • String B-tree: Indexes the RLE-suffixes • Two-dimensional index: built on top of the leaves of the string B-tree

  10. String B-tree Overview[P. Ferragina and R. Grossi., Journal of ACM,1999] • S = LLLLLLLLLLEEEEEELLLLEEEHHHHHHHHHHHHHHHHHH 11 21 • Generate all suffixes of S • Insert the suffixes into the String B-tree (ordered alphabetically) • Store the logical keys instead of the key sequence • Several optimizations to achieve optimal theoretical bounds for: •  External memory space complexity •  Search I/O requirements (S,11) (S,12) (S,13) (S,21) Store logical pointers instead of the keys • >> Relative to the size of the raw (decompressed) sequences

  11. 4 6 8 10 5 1 4 2 3 (S,1) (S,4) (S,8) (S,10) (S,6) Implicit in L10 E6 L4 E3 H18 Implicit in E6 L4 E3 H18 String B-tree over RLE-suffixes • String B-tree CANNOT be used directly to index RLE-suffixes • RLE-suffixes are subset of the total suffixes L10 E6 L4 E3 H18 Order L10 E6 L4 E3 H18 E6 L4 E3 H18 L4 E3 H18 E3 H18 H18 • We indexed only subset of the suffixes (RLE-suffixes) • Searching for “L10 E6 L3” Found • Searching for “L5 E6 L3”  Not Found • Searching for “E3 L4”  Not Found

  12. RLE-suffixes L10 E6 L4 E3 H18 E6 L4 E3 H18 L4 E3 H18 E3 H18 H18 L5 E6 L3 L5 H2 Not part of the query answer L5 K10 L3 L6 E6 L3 SBC-Tree over RLE-suffixes • Query Pattern Mapping Rule: • Substring query pattern P = x1f1 x2f2 … xnfn is mapped into P’ = x1f1+ x2f2 … xnfn • Searching for “L5 E6 L3”  (L5+ E6 L3)  Found • Searching for “E3 L4” (E3+ L4)  Found Challenge: The answer set is no longer consecutive in the index tree  Unbounded number of I/Os to answer a query L5+ E6 L3

  13. SBC-tree: Insertion Procedure • Given an RLE sequence S = Ω1x1f1 x2f2 … xnfn • Insert S as the first suffix into the SBC-tree first level • 1 ≤ i ≤ n, insert RLE-suffix xi1 xi+1fi+1 … xnfninto the SBC-tree first level • Assign it a position tag T (Tag assignment problem) • Insert into the SBC-tree second level point = (T, fi)

  14. The answer set Preceding RLE-char f1 Suffix tag Max_tag Min_tag SBC-tree: Substring Searching • Given a query Q = y1f1 y2f2 … ymfm • Map Q into Q’ = y1f1+ y2f2 … ymfm • Search the String B-tree for Q’’ = y11 y2f2 … ymfm • Returns (min_tag, max_tag) as a contiguous range • Search the SBC-tree second level for suffixes with frequency >= f1 Two-dimensional index String B-tree

  15. SBC-Tree: Example P = A5 E3 B4 P’ = A5+ E3 B4 P’’ = A1 E3 B4

  16. SBC-Tree Variants • 3-sided structure[L. Arge, V. Samoladas, J. Vitter, PODS99] • External memory structure based on priority search tree and B-tree • Answers 3-sided range queries in 2D space • Provides optimal worst-case theoretical bounds for: • External memory space complexity • Insertion and deletion • 3-sided range query • R-tree • Available in all DBMSs • Provides good performance in practice • Does not have worst-case theoretical bounds for searching • One-Level SBC-tree • Remove the second level structure • Disadvantage: In queries scan many tuples outside the answer set

  17. SBC-tree Theoretical Bounds • Optimal external-memory space complexity O(N/B) • Optimal substring, prefix, and range searching in O(LogBN + (|p| +T)/B) I/O operations • Insertion and deletion in (m LogB(N+m)) amortized I/O operations

  18. SBC-tree Implementation • SBC-tree (R-tree variant) is implemented inside PostgreSQL • Query operators: • ^^ (substring search) • @@ (Prefix search) • ==<< ==>> (Range search) CREATE TABLE sequences (id INT, RLE_seq VARCHAR); CREATE INDEX ON sequences USING sbctree (seq); SELECT id FROM sequences WHERE RLE_seq ^^ ‘A5H7N2’; Substring searching operator

  19. SBC-tree Performance Analysis: Storage Requirements Comparing SBC-tree performance relative to String B-tree over uncompressed sequences Datasets SwissProt (Protein secondary structure) alphabet size = 3 WalMart (Sales profile time series) alphabet size = 5 Temperature (Time series of sensor readings) alphabet size = 52 Up to an order of magnitude saving in storage

  20. SBC-tree Performance Analysis:Insertion Comparing SBC-tree performance relative to String B-tree over uncompressed sequences Datasets SwissProt (Protein secondary structure) alphabet size = 3 WalMart (Sales profile time series) alphabet size = 5 Temperature (Time series of sensor readings) alphabet size = 52 Around 30% saving in Insertion

  21. SBC-tree Performance Analysis:Searching Comparing SBC-tree performance relative to String B-tree over uncompressed sequences Datasets SwissProt (Protein secondary structure) alphabet size = 3 WalMart (Sales profile time series) alphabet size = 5 Temperature (Time series of sensor readings) alphabet size = 52 • Retain the optimal search performance (only the query answer is retrieved) • Some additional overhead because of the two-level structure

  22. Summary • Addressing the challenge of storing and operating on compressed data inside DBMSs without decompression • Introduced the SBC-tree as an index for Run-Length Encoded (RLE) compressed sequences • SBC-Tree has optimal theoretical bounds for: • External memory space complexity • Search I/O requirements • Implementation inside PostgreSQL 22

  23. Thank you Mohamed Eltabakh (meltabak@cs.purdue.edu)

More Related