
Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps

Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps. Tan Apaydin – The Ohio State University Guadalupe Canahuate – The Ohio State University Hakan Ferhatosmanoglu – The Ohio State University Ali Saman Tosun – University of Texas at San Antonio.


Presentation Transcript


  1. Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps • Tan Apaydin – The Ohio State University • Guadalupe Canahuate – The Ohio State University • Hakan Ferhatosmanoglu – The Ohio State University • Ali Saman Tosun – University of Texas at San Antonio

  2. Presentation Outline • Motivation • Goal • Approximate Bitmaps (AB) encoding • AB example • Theoretical analysis • Experiments and Results • Conclusion

  3. Motivation • Bitmap indices • Data warehouses • Scientific data • Visualization applications • Bitwise operations • Bitmap Compression • Run-length encoders • Word Aligned Hybrid (WAH) • Byte-aligned Bitmap Code (BBC)

  4. Motivation • After compression, row numbers no longer correspond to bit positions in the bitmap • Queries over a few particular rows • As expensive as queries asking for all the rows • Commonly, users are only interested in a small subset of the dataset at a time. • For example: • A query over the transactions of the last 7 days • Spatial queries over objects in a specific geographical area

  5. Motivation • Visualization applications • Millions of different readings ordered by their geographic location • Users ask range queries over some of the readings for a given area • The answers are highlighted on the screen • Several degrees of resolution make approximate answers acceptable

  6. Our Goal • Enable direct access over any subset of the bitmap • Achieve effective compression • Maintain bitwise operations for query execution • Trade-off efficiency vs. accuracy • No false negatives

  7. The approach • Our solution is inspired by Bloom Filters • A 2^m-bit array indexed using k independent hash functions • A data object is inserted by setting the k positions in the array corresponding to the hash values of the object • False positives can happen, but false negatives cannot
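
The Bloom filter behaviour described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `BloomFilter`, the method names, and the choice of deriving the k hash values by salting SHA-1 with an index are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over a 2**m-bit array with k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.size = 1 << m                  # 2**m positions
        self.k = k
        self.bits = bytearray(self.size)    # one byte per bit, for readability

    def _positions(self, item: str):
        # Assumption: k hash values are obtained by salting SHA-1 with i.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item: str) -> None:
        # Insertion sets all k positions for the item.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=10, k=3)
for word in ("alpha", "beta", "gamma"):
    bf.add(word)
```

Every inserted item is always reported as present, which is the "no false negatives" property the AB encoding relies on.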

  8. Approximate Bitmaps (AB) • A Bloom filter-like structure • Only the set bits are inserted into the AB • Three levels of encoding: • Per table, per attribute, per bitmap column • Parameters: • The hash string mapping function, F • The k hash functions, {H1(x),…,Hk(x)} • The size of the AB, n = αs = 2^m • Precision in terms of α and k, ~(1 - (1 - e^{-k/α})^k)

  9. AB Example • A bitmap table for a dataset with 8 rows and 3 attributes. Each attribute is divided into 3 categories. • Bitmap Table Size: 72 bits • Number of set bits = 24. • F(i,j) = concatenate(i,j) = x • H1(x) = x mod 32 • m = 5 • AB Size: 2^5 = 32 bits

  10. AB Example - Insertion • Initially all bits in the AB are zero • To insert set bit in (1,1)

  11. AB Example - Insertion • To insert set bit in (1,1) • x = 11 • H(11) = 11 mod 32 = 11 • AB(11) = 1

  12. AB Example - Insertion • To insert set bit in (5,4) • x = 54 • H(54) = 54 mod 32 = 22 • AB(22) = 1

  13. AB Example - Insertion • After all insertions
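
The insertion steps from the example can be reproduced with a short sketch. Only the two cells shown on the slides, (1,1) and (5,4), are inserted here because the full bitmap table is not part of the transcript. The function names `F`, `H1`, and `insert` mirror the slide notation, and reading "concatenate" as decimal concatenation is an assumption that matches the values x = 11 and x = 54.

```python
AB_BITS = 32  # 2**5 positions, as in the example


def F(i: int, j: int) -> int:
    # Decimal concatenation of row and column: (5, 4) -> 54.
    return int(f"{i}{j}")


def H1(x: int) -> int:
    # The single hash function of the example.
    return x % AB_BITS


ab = [0] * AB_BITS  # initially all bits in the AB are zero


def insert(i: int, j: int) -> None:
    # Set the AB position that the cell (i, j) hashes to.
    ab[H1(F(i, j))] = 1


insert(1, 1)  # x = 11, sets AB(11)
insert(5, 4)  # x = 54, sets AB(22)
```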

  14. AB Example - Analysis • Estimated Precision: • α = ABSize/Set Bits • α = 32/24 = 1.33 • k = 1 • FP = (1 - e^{-k/α})^k • P = 1 - FP • P = 1 - (1 - e^{-1/1.33}) • P = 47% • The underlined positions are false positives • Only 8 out of the 48 zeros are set in the AB
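
The estimate on this slide can be checked numerically. The sketch below assumes the precision formula from slide 8, P = 1 - (1 - e^{-k/α})^k; the function name `estimated_precision` is introduced only for illustration.

```python
import math


def estimated_precision(alpha: float, k: int) -> float:
    """Estimated precision 1 - (1 - e^(-k/alpha))^k of an AB."""
    false_positive_rate = (1.0 - math.exp(-k / alpha)) ** k
    return 1.0 - false_positive_rate


# Example values from the slide: 32 AB bits, 24 set bits, one hash function.
p = estimated_precision(alpha=32 / 24, k=1)
print(f"{p:.2f}")  # about 0.47
```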

  15. AB Example - Retrieval • Consider this query, asking for 4 rows • This is a range query over 4 rows, where the third attribute falls into C1 or C2 • Row 4: • (4,7): H(47) = 15 • AB(15)=0 • (4,8): H(48) = 16 • AB(16)=1 • Row 5: • (5,7): H(57) = 25 • AB(25)=1 • Stop

  16. AB Example - Retrieval • Consider this query, asking for 4 rows • Row 6: • (6,7): H(67) = 3 • AB(3)=1 • Stop • Approx Query Answer: • {1,1,1,0} • Exact Answer: • {0,1,1,0}
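
The retrieval walk-through can be sketched in the same way. Several things here are assumptions: only the AB positions quoted on the slides (3, 16, and 25) are set and every other position is taken to be zero, the queried rows are taken to be 4 through 7, and columns 7 and 8 stand for categories C1 and C2 of the third attribute. The helper names `cell_hash` and `query_row` are hypothetical.

```python
AB_BITS = 32

# Partial AB state: only the positions quoted on the slides are set.
known_set_positions = {3, 16, 25}
ab = [1 if p in known_set_positions else 0 for p in range(AB_BITS)]


def cell_hash(i: int, j: int) -> int:
    # F(i, j) is the decimal concatenation, H(x) = x mod 32.
    return int(f"{i}{j}") % AB_BITS


def query_row(i: int, columns) -> int:
    # OR over the queried columns; any() stops at the first set bit,
    # which corresponds to the "Stop" steps on the slides.
    return int(any(ab[cell_hash(i, j)] for j in columns))


approx_answer = [query_row(i, (7, 8)) for i in range(4, 8)]
print(approx_answer)  # [1, 1, 1, 0]
```

Compared with the exact answer {0,1,1,0}, row 4 is a false positive, while no row that truly matches is missed.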

  17. Approximate Bitmaps (AB) – Mapping Function F • F maps each cell in the bitmap table to a unique string (the hashing string) • For one AB per table and one AB per attribute, the bit in row i column j is identified by • F(i,j) = i << w || j, where w is large enough to accommodate all j • For one AB per column, the bit in row i is identified by • F(i,j) = i
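
A minimal sketch of the shift-based mapping F(i,j) = i << w || j follows. It assumes that columns are numbered from 1 and that w is the bit length of the largest column number; the factory name `make_mapping` is hypothetical.

```python
def make_mapping(num_columns: int):
    """Build F(i, j) = (i << w) | j with w wide enough for every column j."""
    # Assumption: j ranges over 1..num_columns, so w bits must hold num_columns.
    w = max(1, num_columns.bit_length())

    def F(i: int, j: int) -> int:
        # The row number occupies the high bits, the column the low w bits.
        return (i << w) | j

    return F, w


# The example table has 8 rows and 9 bitmap columns (3 attributes x 3 categories).
F, w = make_mapping(num_columns=9)
codes = {F(i, j) for i in range(1, 9) for j in range(1, 10)}
```

Because j always fits in the low w bits, distinct cells never share a hashing string, which is the uniqueness the slide asks for.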

  18. Approximate Bitmaps (AB) – Hash Functions • Single Hash Function • Called once and the result is divided into pieces. • Each piece is treated as the value of a different hash function. • Secure Hash Algorithm (SHA), developed by the National Institute of Standards and Technology (NIST) • Multiple Hash Functions • Independent hash functions • For a large number of hash functions, performance is similar

      Hash Function   Bits       SHA Output
      H0              159..144   0100100010001010
      H1              143..128   1000010100100001
      H2              127..112   0111100011100010
      ...             ...        ...
      H9              15..0      0000010101110011
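
The single-hash-function approach can be illustrated by slicing one 160-bit SHA-1 digest into k pieces. This is a sketch under assumptions: the slides do not say how the hashing string is serialized, so encoding x as a decimal string is a choice made here, and the function name `sha1_pieces` is hypothetical.

```python
import hashlib


def sha1_pieces(x: int, k: int, m: int) -> list:
    """Split one SHA-1 digest of x into k hash values of m bits each."""
    if k * m > 160:
        raise ValueError("SHA-1 yields only 160 bits")
    # Assumption: the hashing string is the decimal text of x.
    digest = int.from_bytes(hashlib.sha1(str(x).encode()).digest(), "big")
    mask = (1 << m) - 1
    # H0 takes the most significant m bits (159..144 when m = 16),
    # the last piece takes the least significant bits.
    return [(digest >> (160 - (i + 1) * m)) & mask for i in range(k)]


pieces = sha1_pieces(54, k=10, m=16)
```

With m = 16 and k = 10 the pieces cover the digest exactly, matching the bit ranges 159..144 down to 15..0 in the table.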

  19. Approximate Bitmaps (AB) – FP Rate • FP Rate: Probability that all k bits are set by other data objects • FP = (1 - (1 - 1/n)^{ks})^k ≈ (1 - e^{-k/α})^k • n is the size of the AB • s is the number of set bits • n = αs, α = n/s

  20. Approximate Bitmaps (AB) – Size • In terms of α: • n = αs • m = ceil(log2(αs)) • One AB per dataset: • s = |A|*N • One AB per attribute: • s = N • One AB per column: • s depends on the data distribution
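
The sizing rule m = ceil(log2(αs)) can be sketched as below. The function name `ab_size` is hypothetical, and the reading of |A| as the number of attributes and N as the number of rows is an assumption consistent with the earlier example (3 attributes, 8 rows, 24 set bits).

```python
import math


def ab_size(alpha: float, s: int):
    """Return (m, n) where n = 2**m is the smallest power of two >= alpha * s."""
    m = math.ceil(math.log2(alpha * s))
    return m, 1 << m


# One AB per dataset for the earlier example: s = |A| * N = 3 * 8 = 24 set bits.
s = 3 * 8
m, n = ab_size(alpha=1.33, s=s)
print(m, n)  # 5 32
```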

  21. Experimental Setup • Three datasets: • Query by sampling (randomly selecting the columns queried) • Varying the number of rows queried from 100 to 10K

  22. Experimental Results - Size • Always use the largest α that produces an AB smaller than or comparable in size to the WAH-compressed bitmap

  23. Experimental Results - Precision • As α increases, the precision increases steadily and is very close to 1 for larger α • Precision increases as k increases up to the optimum point • Beyond that point precision drops, because a large number of hash functions produces more collisions
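
The effect of k on precision can be explored with the estimate from slide 8. The sketch below scans k for a fixed α and picks the best integer value. The remark that the optimum lies near k = α ln 2 comes from the standard Bloom filter analysis, not from the slides, and α = 8 is an arbitrary illustrative value rather than one of the experimental settings.

```python
import math


def estimated_precision(alpha: float, k: int) -> float:
    # Precision estimate 1 - (1 - e^(-k/alpha))^k from slide 8.
    return 1.0 - (1.0 - math.exp(-k / alpha)) ** k


alpha = 8.0  # illustrative value only
precision_by_k = {k: estimated_precision(alpha, k) for k in range(1, 13)}
best_k = max(precision_by_k, key=precision_by_k.get)

# Standard Bloom filter result: the optimum is close to alpha * ln 2.
theoretical_optimum = alpha * math.log(2)
```

Under these assumptions precision rises with k up to the optimum and then falls again, which is the shape the slide describes.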

  24. Experimental Results – Exec Time • Execution time of the AB depends on the number of rows queried, not on the number of rows in the dataset • For queries over fewer than 10%~15% of the rows, AB execution is up to 3 orders of magnitude faster than WAH

  25. Conclusion • AB encoding approximates the bitmaps using multiple hashing of the set bits • Allows efficient retrieval of any subset of rows and columns • Trade-off between bitmap size and precision • Three levels of encoding • Approximate query answers are given without database access

  26. Questions and Comments • Thank you! Email: canahuat@cse.ohio-state.edu
