Strategies for processing ad hoc queries on large data warehouses
Download
1 / 24

strategies for processing ad hoc queries on large data warehouses - PowerPoint PPT Presentation


  • 267 Views
  • Uploaded on

Strategies for Processing Ad Hoc Queries on Large Data Warehouses. Kurt Stockinger CERN John Wu & Arie Shoshani Lawrence Berkeley National Lab. Outline. Motivation for designing our own software Many large scientific data warehouses need to process ad hoc queries Lack of efficient indices

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'strategies for processing ad hoc queries on large data warehouses' - Leo


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Strategies for processing ad hoc queries on large data warehouses l.jpg

Strategies for ProcessingAd Hoc Querieson Large Data Warehouses

Kurt Stockinger

CERN

John Wu & Arie Shoshani

Lawrence Berkeley National Lab


Outline l.jpg
Outline

  • Motivation for designing our own software

    • Many large scientific data warehouses need to process ad hoc queries

    • Lack of efficient indices

  • Issues to discuss

    • Vertical partitioning

    • Bitmap index

      • Compression – how to store the bitmaps

      • Persistent storage – where to store the bitmaps

strategies for processing ad hoc queries


Example high energy physics experiment star l.jpg
Example: High-Energy Physics Experiment STAR

  • Current data size

    • 20 million collision events

    • each event ~10 KB in size

  • Production data rate

    • 100 million records / year

    • ~ 1 TB per year

  • Scientists may query any of the 500 or so attributes

  • Each query may involve conditions on 5 ~ 8 attributes

    • Energy > 100 & Particles > 500 & …

  • Near real-time evaluation desired

strategies for processing ad hoc queries


Many scientific applications involve large datasets l.jpg
Many Scientific Applications Involve Large Datasets

  • Sloan Digital Sky Survey: http://www.sdss.org

  • Earth Observing System: http://eos.nasa.gov

  • Large Hadron Collider: http://lhc.web.cern.ch

  • Genomes to life: http://doegenomestolife.org

  • Combustion: http://scidac.psc.edu

  • PCMDI: http://www-pcmdi.llnl.gov

strategies for processing ad hoc queries


Searching and indexing requirements l.jpg
Searching and Indexing Requirements

  • Some common features of the large scientific datasets

    • Read-mostly: data warehouses

    • Large high-dimensional data: millions or billions of records, each record with tens or hundreds of attributes

    • Many queries are high-dimensional partial range queries

    • Most users desire to modify queries interactively

  • Existing database software not specialized for these tasks: slow

  • Need new special purpose software

    • BMI: bitmap index, CERN

    • IBIS: independent bitmap index and search, LBNL

strategies for processing ad hoc queries


Issues to be discussed l.jpg
Issues to Be Discussed

  • Organization of the primary data, i.e., the user data

    • Viewing the primary data as a 2-D table

      • Horizontal partition: used in transactional systems

      • Vertical partition: good for partial range queries

  • Indexing strategies:

    • Tree based schemes: not effective for dimensions > 10

    • Bitmap index: well suited for partial range queries

  • Storage scheme for the index data

    • BMI: Store bitmaps as objects in an object-oriented database (ODBMS)

    • IBIS: Store bitmaps as simple files

strategies for processing ad hoc queries


Horizontal vs vertical partitioning l.jpg

Horizontal partitioning

Data elements of a record are stored consecutively

Good for accessing one record at a time

Used in relational DBMS systems where records are frequently updated

Typically 60~70% of bytes of each page is used

Vertical partitioning

All records of an attribute are stored consecutively

Good for accessing multiple records by attribute selection

Suitable for data warehousing systems where records are rarely modified

May use 100% of bytes of each page

Horizontal vs. Vertical Partitioning

strategies for processing ad hoc queries


Performance advantage of vertical partitioning l.jpg

Experiment with 2.2 million records of STAR data (10 attributes only)

The figure on the right shows the time to search without an index

Query box size is the relative volume of the hypercube formed by range conditions

The disk system supports about 20 MB/s sustained reading

For answering a query like “A > 5”, the time used by a relational DBMS is proportional to number of attributes in the table

500 attributes, 500 times slower

Performance Advantage of Vertical Partitioning

Vertical partitioning is effective

for partial range queries

strategies for processing ad hoc queries


Brief overview of index data structures l.jpg
Brief Overview of Index Data Structures attributes only)

  • One dimensional index data structures:

    • Total order for one-dimension

    • Hash-based: Optimized for exact match queries, e.g. E = 106

    • Tree-based: Optimized for range queries, e.g. E < 106

      • Most widely used: B+-tree (1972):

  • Multidimensional index data structures

    • No total order for all dimensions

    • Hash-based: Grid-File, Bang-File, …

    • Tree based: R-Trees, Pyramid-Tree, …

    • Bitmap Indices: Effective for data warehousing environments

    • Linearize to introduce total order, then use one-dimensional indices

strategies for processing ad hoc queries


Basic bitmap index l.jpg
Basic Bitmap Index attributes only)

a) List of attributes b) Bitmap Index (equality encoding)

Bit Slice E2 encodesattributes with value 2

a) List of 12 attributes with 10 distinct attribute values, i.e attribute cardinality = 10

b) For each distinct attribute value, one bit slice is created, i.e bitmap index consists of 10 bitmaps (E0 to E9)

strategies for processing ad hoc queries


Pros and cons of bitmap indices l.jpg
Pros and Cons of Bitmap Indices attributes only)

  • Pros:

    • Easy to build and to maintain

    • Easy to identify records that satisfy a complexmulti-attribute predicate (multi-dimensional ad-hoc queries)

    • Very space efficient for attributes with low cardinality (number of distinct attribute values, e.g. “Yes”, “No”)

  • Cons:

    • Space inefficient for attributes with high cardinality

    • An effective strategy: Bitmap Compression

    • Other strategies: binning, encoding

strategies for processing ad hoc queries


Bitmap compression l.jpg
Bitmap Compression attributes only)

  • Advantages:

    • Less disk space for storing indices

    • Indices can be read from disk faster

    • More indices can be cached in memory

  • Possible problems:

    • Increases the complexity of the software

    • If bitmaps must be decompressed before performing Boolean operations, the decompression overhead might outweigh the advantages of compression

      • Use compression schemes that work directly on compressed data

strategies for processing ad hoc queries


Various bitmap compression algorithms l.jpg
Various Bitmap Compression Algorithms attributes only)

  • Run Length Encoding (RLE):

    • one-sided (asymmetric) vs. two-sided (symmetric)

  • Gzip (Lempel-Ziv, LZ):

    • verbatim (uncompressed) bitmap is compressed via zlib

  • ExpGol:

    • Variablebit length encoding (RLE-bitmap is compressed)

  • Byte-Aligned Bitmap Compression (BBC):

    • Variablebyte length encoding (Oracle patent)

    • One-sided vs. two-sided (BBC1 vs. BBC2)

  • Word-Aligned Hybrid (WAH):

    • Fixed word based encoding

strategies for processing ad hoc queries


Relative strength of different compression schemes l.jpg

speed attributes only)

uncompressed

WAH

better

BBC

gzip

PacBits

ExpGol

space

Relative Strength of Different Compression Schemes

strategies for processing ad hoc queries


Wah compression bitmap index implementations l.jpg
WAH Compression & Bitmap Index Implementations attributes only)

  • Compression Schemes

    • Designed for reducing the CPU-complexity of logical operations when compared to BBC, 10 X speedup

    • However, lower compression factor, i.e. the sizes of the WAH-compressed bitmaps are some 40-60% larger than BBC-compressed bitmaps

  • Storage scheme

    • BMI: Bitmap Index implementation on top of ODBMS (CERN)

    • IBIS: Bitmap Index implementation based on plain files (LBL)

strategies for processing ad hoc queries


Test setup l.jpg
Test Setup attributes only)

  • Real application data (STAR) : 2.2 million records

  • Synthetic dataset I: 100 million records

  • Synthetic dataset II: 5 million records

  • Only the performance of the bitwise logical operation “AND” is reported

  • Other logical operations such as OR, XOR, etc. show similar relative differences

  • Most of the benchmarks were executed on three different machines with various CPU and I/O subsystems

strategies for processing ad hoc queries


In memory logical operation and l.jpg
In Memory Logical Operation attributes only)“AND”

On dms, 300MHz PII

On tin, 400MHz P3

On dm, 450MHz

UltraSPARC

WAH is always the fastest, 2X – 20X

strategies for processing ad hoc queries


Search time including file io l.jpg
Search Time (Including File IO) attributes only)

On dm, 20MB/s IO

On tin, 2MB/s IO

To answer the queries: read two bitmaps from files, perform one logical “AND”

Unless using a very slow disk, it is worth-while to use WAH compression

strategies for processing ad hoc queries


With bbc searching operation spends little time in io l.jpg
With BBC, Searching Operation Spends Little Time in IO attributes only)

On dm, 20MB/s IO

On tin, 2MB/s IO

  • The percentage of time spent in IO on different bitmaps

  • This percentage is expected to be high, but it is actually low with BBC

  • WAH reduce CPU time, and searching is again IO bound

strategies for processing ad hoc queries


Sizes of compressed bitmaps l.jpg
Sizes of Compressed Bitmaps attributes only)

BBC-s: simplified (LBL)BBC-f: full (AT&T + CERN)

The total size of a bitmap index compressed with WAH is typically 40-60% larger than that compressed with BBC

strategies for processing ad hoc queries


Sizes of compressed bitmaps21 l.jpg

The figure on the right plot the maximum size of the bitmap index against the attribute cardinality of an attribute with 100 million (108) records

In the worst case, the size of the compressed bitmap index is about 400 million words, 4 times the size of the primary data

For most high-cardinality attributes, the compressed bitmap index size is smaller than that of a typical B-tree index(~ 3X primary data)

Sizes of Compressed Bitmaps

B-tree

The compressed bitmap index sizes

are usually smaller than B-tree

strategies for processing ad hoc queries


Query performance ibis vs rdbms l.jpg
Query Performance index against the attribute cardinality of an attribute with 100 million (10IBIS vs. RDBMS

  • Accessing bitmaps in files (IBIS) has about the same efficiency as accessing bitmaps within an RDBMS

  • The DBMS tested uses a BBC compressed bitmap index similar to our BBC compressed index

  • Used real application data

WAH compressed index is 4X more efficient than BBC compressed index

strategies for processing ad hoc queries


Query performance file ibis vs odbms bmi l.jpg
Query Performance index against the attribute cardinality of an attribute with 100 million (10File (IBIS) vs. ODBMS (BMI)

  • Figures on the left time needed to process 5-dimensional queries on tin

  • Queries on synthetic data

  • IBIS with WAH uses the least amount of time

  • ODBMS overhead 4X

  • Due to file system caching, IBIS is ~10X faster on files that have been accessed before (“warm” files)

a) “cold” files

b) “warm” files

strategies for processing ad hoc queries


Conclusions l.jpg
Conclusions index against the attribute cardinality of an attribute with 100 million (10

  • We have shown that BBC is CPU-bound rather than I/O-bound as assumed in the past

  • WAH is much more (10X) CPU-efficient than BBC

  • Building bitmap indices on top of ODBMS introduces about 4X overhead when compared to using plain files

  • Building bitmap indices inside DBMS (as in many commercial systems) shows higher efficiency

  • Processing multi-dimensional range queries is efficient with WAH compressed bitmap indices

  • Read-only data should be vertically partitioned

strategies for processing ad hoc queries


ad