Efficient Bitmap Indexing Techniques for Very Large Datasets

Presentation Transcript


  1. Efficient Bitmap Indexing Techniques for Very Large Datasets • Kesheng John Wu, Ekow Otoo, Arie Shoshani

  2. Problem Statement • Main objective: map logical requests to qualified objects • A logical request: 20001015 <= eventTime & 200 < energy < 300 … • Objects: set of object ids; set of files containing the objects; offsets within the files, …

  3. Application: STAR • A portion of the STAR tag dataset: 3 events with 12 attributes, out of millions of events with 502 attributes.

  4. Application: Combustion • Direct numerical simulation of the auto-ignition process (solution of complex partial differential equations) • A dozen or more variables are computed at each time step and each grid point • Number of grid points: 2D 600 X 600 >>> 3D 1000 X 1000 X 1000 • Time steps: 100 >>> 1000s • Data size: 1 GB >>> 10 TB • Task: identify features and track them across time steps • E.g., find the flame front across time: find “600<temp<700” for 1 billion points per time step, and discover the overlap between time steps • Use compressed bitmaps to accelerate both feature extraction and feature tracking
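The feature-tracking task on this slide can be sketched with plain Python integers standing in for bitmaps. This is only an illustration under stated assumptions: the helper `region_bitmap`, the six-point grids, and the temperature values are made up, and the bitmaps are uncompressed.

```python
def region_bitmap(temps, lo, hi):
    """Return an int whose bit i is set iff lo < temps[i] < hi."""
    bits = 0
    for i, t in enumerate(temps):
        if lo < t < hi:
            bits |= 1 << i
    return bits

# Hypothetical temperatures at the same six grid points in two time steps.
step0 = [550, 620, 680, 710, 650, 590]
step1 = [560, 590, 660, 690, 640, 610]

front0 = region_bitmap(step0, 600, 700)   # feature extraction, step 0
front1 = region_bitmap(step1, 600, 700)   # feature extraction, step 1

overlap = front0 & front1        # points in the feature in both steps
moved_in = front1 & ~front0      # points that entered the feature
overlap_count = bin(overlap).count("1")
```

Once each time step has its bitmap, tracking reduces to one bitwise operation per pair of steps, which is the reason the slide proposes bitmaps for both extraction and tracking.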

  5. Building a Bitmap Index • [Figure: bit matrix with one group of columns per property: property 1, property 2, …, property n] • Partition each property into bins (binning) • e.g. for 0<NLb<4000, 20 equal size bins: [0, 200)[200,400)… • Generate a bit vector for each bin (encoding) • Bit i of bit vector j is 1 iff NLb[i] is in bin j • Compress each bit vector
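The binning and encoding steps on this slide can be sketched as follows. The function name `build_bitmap_index` and the sample `nlb` values are hypothetical, each bit vector is held as a Python integer, and the compression step is left out.

```python
def build_bitmap_index(values, lo, hi, num_bins):
    """Equal-width binning over [lo, hi); bit i of bitmaps[j] is 1 iff values[i] is in bin j."""
    width = (hi - lo) / num_bins
    bitmaps = [0] * num_bins
    for i, v in enumerate(values):
        if not lo <= v < hi:
            continue  # this sketch simply leaves out-of-range values unindexed
        j = min(int((v - lo) / width), num_bins - 1)  # guard against float round-up
        bitmaps[j] |= 1 << i
    return bitmaps

# Hypothetical NLb values; 20 equal-size bins: [0, 200), [200, 400), ...
nlb = [150, 950, 3999, 200, 0]
index = build_bitmap_index(nlb, 0, 4000, 20)
```

Every in-range object sets exactly one bit across the 20 bit vectors, so the vectors are sparse, which is what makes the final compression step worthwhile.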

  6. Advantages of Bitmap Index • Bitmap index: a specialized index that takes advantage of: • Read-mostly data: data produced from scientific experiments can be appended in large groups • Fast operations • “Predicate queries” can be performed with bitwise logical operations • Predicate ops: =, <, >, <=, >=, range • Logical ops: AND, OR, XOR, NOT • They are well supported by hardware • Easy to compress, potentially small index size • Each individual bitmap is small, and frequently used ones can be cached in memory
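A minimal sketch of how a predicate query turns into bitwise logical operations, assuming one uncompressed bitmap per bin. The helpers `bitmap_of` and `ids`, the bin boundaries, and the sample data are all illustrative; only the attribute names come from the example request earlier in the slides.

```python
def bitmap_of(values, predicate):
    """Bit i of the result is 1 iff predicate(values[i]) holds."""
    bits = 0
    for i, v in enumerate(values):
        if predicate(v):
            bits |= 1 << i
    return bits

def ids(bitmap):
    """List the object ids (bit positions) set in a bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical data for two attributes of six objects.
energy = [150, 250, 350, 450, 220, 290]
event_time = [20001014, 20001015, 20001016, 20001010, 20001020, 20001001]

# One bitmap per bin: energy bins [100,200), [200,300), [300,400), [400,500).
energy_bins = [bitmap_of(energy, lambda v, lo=lo: lo <= v < lo + 100)
               for lo in range(100, 500, 100)]
# Two eventTime bins: before 20001015, and 20001015 or later.
time_bins = [bitmap_of(event_time, lambda v: v < 20001015),
             bitmap_of(event_time, lambda v: v >= 20001015)]

# Range predicate "200 <= energy < 400": OR of the bins it covers.
energy_hit = energy_bins[1] | energy_bins[2]
# Conjunction with "20001015 <= eventTime": AND across attributes.
hits = energy_hit & time_bins[1]
# NOT needs a mask so that only the six real objects are considered.
all_rows = (1 << len(energy)) - 1
not_energy = ~energy_hit & all_rows
```

Here `ids(hits)` lists the qualifying object ids; the query itself never touches the raw attribute values.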

  7. Operation-efficient Compression Methods • Best known: byte-aligned bitmap code (BBC) • Uses run-length encoding (next slide) • Byte alignment, optimized for space efficiency • Decoding at the bit level, not optimal for operations • Used in Oracle • We developed a new word-aligned scheme: WAH • Uses run-length encoding • Word alignment • Designed for minimal decoding to gain speed
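The slides do not spell out a word layout, so the sketch below assumes one: 32-bit words, where a literal word has its top bit clear and carries 31 bitmap bits, and a fill word has its top bit set, the next bit giving the fill value and the low 30 bits counting how many 31-bit groups the run spans. Treat the layout, the names `wah_encode` and `wah_decode`, and the requirement that the input length be a multiple of 31 as assumptions for illustration, not as the authors' implementation.

```python
GROUP = 31  # bitmap bits carried per 32-bit word (assumed layout)

def wah_encode(bits):
    """Encode a '0'/'1' string (length a multiple of 31) into a list of 32-bit words."""
    assert len(bits) % GROUP == 0
    words = []
    for k in range(0, len(bits), GROUP):
        group = bits[k:k + GROUP]
        if group in ("0" * GROUP, "1" * GROUP):
            header = 0x80000000 | (int(group[0]) << 30)
            same_fill = bool(words) and (words[-1] & 0xC0000000) == header
            if same_fill and (words[-1] & 0x3FFFFFFF) < 0x3FFFFFFF:
                words[-1] += 1              # extend the previous fill word
            else:
                words.append(header | 1)    # start a new fill of one group
        else:
            words.append(int(group, 2))     # literal word, top bit stays 0
    return words

def wah_decode(words):
    """Expand the words back into the original '0'/'1' string."""
    out = []
    for w in words:
        if w & 0x80000000:                  # fill word
            fill = "1" if w & 0x40000000 else "0"
            out.append(fill * (GROUP * (w & 0x3FFFFFFF)))
        else:                               # literal word
            out.append(format(w, "031b"))
    return "".join(out)

sample = "0" * 62 + "1" + "0" * 30 + "1" * 93
encoded = wah_encode(sample)
```

Because every word is either a whole literal group or a whole run of groups, a logical operation can proceed word by word without bit-level decoding, which is the property the slide attributes to word alignment.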

  8. Operation-efficient Compression Methods • Based on variations of run-length compression • Uncompressed: 0000000000001111000000000 ......0000001000000001111111100000000 .... 000000 • Compressed: 12, 4, 1000,1,8,1000 • Store very short sequences as-is • Advantage: AND, OR, and COUNT operations can be performed on compressed data
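To illustrate the claim that AND and COUNT can run on compressed data, here is a minimal sketch over plain run-length lists. The representation (run lengths alternating between 0-runs and 1-runs, always starting with a 0-run) and the names `rle_encode`, `rle_and`, and `rle_count` are assumptions made for this example; neither BBC nor WAH stores runs exactly this way.

```python
def rle_encode(bits):
    """Run lengths of a '0'/'1' string, alternating 0-runs and 1-runs, starting with a 0-run."""
    runs, current, count = [], "0", 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs

def rle_and(a, b):
    """AND two run lists of equal total length without expanding them to bits."""
    out = [0]

    def emit(bit, n):
        if n == 0:
            return
        if (len(out) - 1) % 2 == bit:   # same value as the last run: extend it
            out[-1] += n
        else:                            # value flips: start a new run
            out.append(n)

    i = j = 0
    rem_a, rem_b = a[0], b[0]
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)
        emit((i % 2) & (j % 2), step)    # odd run index means a run of 1s
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j] if j < len(b) else 0
    return out

def rle_count(runs):
    """COUNT of set bits: the sum of the 1-runs."""
    return sum(runs[1::2])

x = "0" * 12 + "1" * 4 + "0" * 1000 + "1"
y = "0" * 14 + "1" * 1003
x_runs, y_runs = rle_encode(x), rle_encode(y)
both = rle_and(x_runs, y_runs)
```

The loop does work proportional to the number of runs rather than the number of bits, so the 1000-bit gap costs a single step.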

  9. Trade-off of Compression Schemes • [Figure: schemes plotted by space and speed, with the “better” direction marked: uncompressed, WAH, BBC, gzip, PacBits, ExpGol]

  10. Information About the Test Machines • Hardware and system • Sun Enterprise 450 (UltraSPARC II 400 MHz) • 4 GB RAM • VERITAS Volume Manager (striped disk) • Real application data from STAR • More than 2 million objects, 12 attributes • Synthetic data • 100 million objects, 10 attributes • Terms • Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size • Times reported are wall-clock times in seconds

  11. Logical Operation Time (Synthetic Data) • 10X improvement

  12. Logical Operation Time (STAR Data) • Also 10X improvement

  13. Encoding Schemes – Main Idea • [Figure: equality encoding, range encoding, and interval encoding illustrated over 12 bins, numbered 1 to 12] • Interval and range encoding: operate on 2 bins only!

  14. Total Effect of Compression and Encoding Schemes • Bottom line on queries • Compression scheme determines efficiency of logical operations • Encoding scheme determines number of operations • Range & interval – only one logical operation over 2 bitmaps • Equality – many operations depending on number of bins • But, space may be a consideration • What is the trade-off?
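The difference in operation counts can be sketched by comparing equality encoding with range encoding; interval encoding is not shown. The sketch assumes uncompressed bitmaps held as Python integers, 0-based bin numbers, and hypothetical helper names and bin assignments.

```python
def equality_bitmaps(bin_ids, num_bins):
    """E[j] has bit i set iff object i falls in bin j."""
    eq = [0] * num_bins
    for i, j in enumerate(bin_ids):
        eq[j] |= 1 << i
    return eq

def range_bitmaps(eq):
    """R[j] has bit i set iff object i falls in some bin <= j."""
    out, acc = [], 0
    for e in eq:
        acc |= e
        out.append(acc)
    return out

def query_equality(eq, a, b):
    """Bins a..b inclusive: one OR per covered bin."""
    result = 0
    for e in eq[a:b + 1]:
        result |= e
    return result

def query_range(rng, a, b):
    """Bins a..b inclusive: one logical operation over at most two bitmaps."""
    return rng[b] if a == 0 else rng[b] & ~rng[a - 1]

# Hypothetical bin assignment for eight objects over 12 bins.
bin_ids = [0, 5, 11, 3, 7, 5, 9, 2]
eq = equality_bitmaps(bin_ids, 12)
rng = range_bitmaps(eq)
wide_eq = query_equality(eq, 3, 7)     # reads 5 bitmaps
wide_rng = query_range(rng, 3, 7)      # reads 2 bitmaps
```

The trade-off raised on the slide shows up here too: each range-encoded bitmap accumulates the bits of all lower bins, so it is denser and likely compresses less well than its equality-encoded counterpart.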

  15. Interval Encoding Is Better Overall (WAH Compression) • Points on the graphs represent: 10, 20, 30, 50, 100 bins • Average time for random range queries

  16. Timing Results

  17. Summary • Compressed bitmap indices are effective for range queries • Better compression scheme • 50% more space, but 12 times faster!!! • Among the different encoding schemes • The interval encoding is the overall winner

  18. Future Work • Support NULL values and categorical values • On-line update: add new data and update the index without interrupting request processing • Recovery mechanism for robustness • Potential new applications: climate, astrophysics, biology (microarrays) • Study non-uniform binning strategies • Study more encoding schemes • Integrate with a conventional database system: to better handle metadata, to provide a more versatile front-end

  19. How Many Bins for Continuous Domains? • [Figure: 2-D grid over Range(x) and Range(y), with the edge bins marked] • More bins: fewer objects in edge bins • Searching edge bins: skip-scan over the “attribute vertical partition”
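The edge-bin idea can be sketched as follows: bins that lie entirely inside the query range are answered from their bitmaps alone, while objects in bins that only partly overlap the range must be checked against the stored attribute values. The function names, bin edges, and data are hypothetical, and a simple loop over candidate ids stands in for the skip-scan over the vertical partition.

```python
import bisect

def build_bins(values, edges):
    """bitmaps[j] marks the objects whose value lies in [edges[j], edges[j+1])."""
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        bitmaps[bisect.bisect_right(edges, v) - 1] |= 1 << i
    return bitmaps

def range_query(values, bitmaps, edges, lo, hi):
    """Return (bitmap of objects with lo <= value < hi, number of candidates checked)."""
    sure = maybe = 0
    for j, bm in enumerate(bitmaps):
        b_lo, b_hi = edges[j], edges[j + 1]
        if lo <= b_lo and b_hi <= hi:
            sure |= bm            # bin fully inside the range: no check needed
        elif b_lo < hi and lo < b_hi:
            maybe |= bm           # edge bin: partly inside the range
    hits, i = sure, 0
    while maybe >> i:
        if (maybe >> i) & 1 and lo <= values[i] < hi:
            hits |= 1 << i        # candidate confirmed against the raw value
        i += 1
    return hits, bin(maybe).count("1")

values = [50, 120, 180, 250, 310, 390]
coarse = [0, 100, 200, 300, 400]
fine = list(range(0, 401, 50))
coarse_hits, coarse_checked = range_query(values, build_bins(values, coarse), coarse, 150, 320)
fine_hits, fine_checked = range_query(values, build_bins(values, fine), fine, 150, 320)
```

With this data both binnings return the same answer, but the finer one checks fewer candidates, which matches the slide's point that more bins leave fewer objects in edge bins.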