bitmap indices for data warehouse
Download
Skip this Video
Download Presentation
Bitmap Indices for Data Warehouse

Loading in 2 Seconds...

play fullscreen
1 / 22

Bitmap Indices for Data Warehouse - PowerPoint PPT Presentation


  • 58 Views
  • Uploaded on

Bitmap Indices for Data Warehouse. Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Feb 27, 2009. Star Schema Vs. Multi-dimensional Range Queries. SUM (qty * amt) WHERE ProdId in [p1.. p10] AND custId < 200.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Bitmap Indices for Data Warehouse' - maisie-humphrey


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
bitmap indices for data warehouse

Bitmap Indices for Data Warehouse

Jianlin Feng

School of Software

SUN YAT-SEN UNIVERSITY

Feb 27, 2009

star schema vs multi dimensional range queries
Star Schema Vs. Multi-dimensional Range Queries

SUM (qty * amt)

WHERE ProdId in [p1.. p10] AND custId < 200

characteristics of multi dimensional range queries in data warehouse
Characteristics of Multi-Dimensional Range Queries in Data Warehouse
  • Ad-Hoc
    • Give N dimensions (attributes), every combination is possible: 2Ncombinations
    • A Data Cube equals to 2N GROUP-Bys
  • High Dimensions ( > 20)
  • Large Number of Records
multi dimensional index fails
Multi-Dimensional Index Fails!
  • R-Trees or KD-Trees
    • Effective only for moderate number of dimensions
    • Efficient only for queries involving all indexed dimensions.
  • For Ad-hoc Rang Queries, Projection Index is usually better, and Bitmap Index is even better.
projection index
Projection Index
  • Fix the order of the records in the base table
    • Store
  • Project records along some dimension
    • i.e, A single Column
    • Keeping the record order
    • Keeping the duplicates
  • Like “array” in C language

base table

Projection Index

multi dimensional range queries a general idea
Multi-dimensional Range Queries : A General Idea
  • Build an index for each dimension (attribute);
    • A Projection Index
    • A B-Tree
      • 1 Primary B-Tree, N -1 Secondary B-Trees
  • For each involved dimension, use the index on that dimension to select records;
  • “AND” the records to get the final answer set.
how to make the and operation fast
How to make the “AND” operation fast?
  • Projection Index (B-Tree is similar)
    • Scan each involved dimension,
    • And return a set of RIDs.
    • Intersection the RID sets
      • Sets have different lengths
      • We can use Sort and Merge to do the Intersection
    • Life is easier
      • when all the sets have the same length and in the same order
      • Use 1/0 to record the membership of each record
general ideas of bitmap index
General Ideas of Bitmap Index
  • Fix the order of records in the base table
  • Suppose the base table has m records
  • For each dimension
    • For each distinct dimension value (as the KEY)
    • Build a bitmap with m bits (as the POSITIONS)
    • A bitmap is like an Inverted Index
  • “AND”, “OR” operations
    • realized by bitwise logical operations
    • Well supported by hardware
size of bitmap indices
Size of Bitmap Indices
  • Number of Bitmap (Indices)
    • How to build bitmap indices for dimensions with large distinct values
    • Temperature dimension
  • Size (i.e., Length) of a Single Bitmap
three solutions
Three Solutions
  • Encoding
    • Reduce the Number of Bitmaps
  • Binning
    • Reduce the Number of Bitmaps
  • Compression
    • Reduce the Size of a Single Bitmap
encoding strategies
Encoding Strategies
  • Equality-encoded
    • Good for equality queries,such as “temperature == 100”
    • Basic Bitmap Index
  • Bit-sliced index
    • Assume dimension A has c distinct values, use log2c bitmap indices to represent each record (its value)
  • Range-encoded
    • Good for one-sided range queries, such as “Pressure < 56.7”
  • Interval-encoded
    • Good for two-sided range queries, such as“35.8 < Pressure < 56.7”
binning
Binning
  • Encoding mainly considers discrete dimension values
    • Usually integers
  • Basic Ideas of Binning
    • Build a bitmap index for a bin instead of for a distinct value
    • The Number of Bitmaps has nothing to do with the number of distinct values in a dimension.
  • Pros and Cons
    • Pros:control the number of bitmap via controling the number of bins.
    • Cons:need to check original dimension values to decide if the records really satisfy query conditions.
compression strategies
Compression Strategies
  • General-purpose compression methods
    • Software packages are widely available
    • Tradeoff between query processing and compression ration
      • De-compress data first
  • Specific methods
    • BBC (Byte-aligned Bitmap Code ), Antoshenkov,1994,1996.
      • Adopted since Oracle 7.3
    • WAH(Word-aligned Hybrid Bitmap code ), Wu et al 2004, 2006.
      • Used in Lawrence Berkeley Lab for high-energy physics
wah word aligned hybrid bitmap code
WAH(Word-aligned Hybrid Bitmap code )
  • Based on run-length encoding
    • For consecutive 0s or 1s in a bit sequence (part of a bitmap)
  • Use machine WORD as the unit for compression
    • Instead of BYTE in BBC
  • Design Goal:
    • reduce the overhead of de-compression, in order to speed-up query response.
run length encoding
Run-length encoding
  • Bit sequence B: 11111111110001110000111111110001001
  • fill:a set of consecutive identical bits (all 0s or all 1s)
    • The first 10 bits in B
    • fill = count“+”bit value
    • 1111111111=10 “+” 1
  • tail: a set of mizxed 0s and 1s
    • The last 8 bits in B
  • Run:
    • Run = fill + tail
  • Basic Ideas of WAH
    • Define fill and tail appropriately so that they can be stored in WORDs.
wah vs b tree vs bbc
WAH vs. B-tree vs. BBC
  • On one dimensional range queries
    • The query response time grows linearly in the number of hits.
  • B-tree has the same time complexity
    • Records selected by each single dimension can not be easily combined.
  • Query response time
    • I/O + CPU
    • I/O: WAH > BBC
    • CPU: BBC > WAH
    • In Total:using WAH is 10 times faster than using BBC
characteristics of industrial products
Characteristics of Industrial Products
  • Model 204. (Pat O’Neil,1987)
    • The first that adopted bitmap index
    • Basic Bitmap Index, No binning, No compression
    • Now owned by Computer Corporation of America
  • Oracle (1995)
    • Adopted compressed bitmap index since 7.3
    • Probably use BBC for compression, Equality-encoded, No binning.
  • Sybase IQ
    • bit-sliced index(Pat O’Neil et al,1997)
    • No binning, No compression
    • For dimension with small number of distinct values, use Basic Bitmap Index.
references
References
  • Kurt Stockinger, Kesheng Wu, Bitmap Indices for Data Warehouses, In Wrembel R., Koncilia Ch.: Data Warehouses and OLAP: Concepts, Architectures and Solutions. Idea Group, Inc. 2006.
ad