High-dimensional indexing techniques

Presentation Transcript


  1. High-dimensional indexing techniques
     Kesheng John Wu, Ekow Otoo, Arie Shoshani

  2. The big picture
     [Diagram; labels: Large, Distributed, Data mining, dataset, Request Interpreter, file, MPI-IO, storage, grid]

  3. The big picture
     [Diagram; labels in order: Logical request; Request interpreter; LBNL; Qualified objects; Request planning/execution; PPDG; Sub-task schedule; Execution services; MPI-IO, …; grid]

  4. Problem statement
     • Main objective: map a logical request to qualified objects
     • A logical request:
         • 20001015 <= eventTime & 200 < energy < 300 …
     • Objects:
         • set of object IDs;
         • set of files containing the objects;
         • offsets within the files, …
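
To make the mapping concrete, here is a minimal sketch of what "logical request to qualified objects" means when evaluated by a plain scan, before any index is involved. The column contents and the function name `qualified_objects` are hypothetical; only the predicate comes from the slide.

```python
from typing import Dict, List

# Hypothetical toy data: one list per attribute, indexed by object ID.
columns: Dict[str, List[float]] = {
    "eventTime": [20001014, 20001015, 20001016, 20001020],
    "energy":    [250.0,    150.0,    299.5,    300.0],
}

def qualified_objects(cols: Dict[str, List[float]]) -> List[int]:
    """Return IDs of objects with 20001015 <= eventTime and 200 < energy < 300."""
    n = len(cols["eventTime"])
    return [
        oid
        for oid in range(n)
        if 20001015 <= cols["eventTime"][oid] and 200 < cols["energy"][oid] < 300
    ]

result = qualified_objects(columns)
print(result)
```

Only the two attributes named in the request are touched; the index discussed in the later slides aims to return the same set of IDs while reading far less data.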

  5. Requirements & Status
     • General requirements
         • Users request data in terms of their scientific domain, not file names or offsets in files
         • Each object may be described by hundreds of attributes
         • Each request is in terms of range predicates on a handful of attributes (partial range query)
     • Status
         • Initially motivated by a HENP experiment: STAR
         • Software originally developed under GC and is currently in use at BNL

  6. Large high-dimensional datasets
     • Number of attributes / columns: 200 – 500
     • Number of objects / events: 10^8 – 10^9
     • File containing one attribute: 400MB – 4GB
     • Total size over all attributes: 80GB – 2TB
     [Figure: table with columns Object ID, A1, A2, A3, A4, … and rows 0, 1, 2, … up to 10^8 – 10^9; label: Curse of dimensionality]
     • Goal: develop an index that:
         • Reads as little as possible from disk
         • Minimizes computation in memory
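
The size figures on this slide are mutually consistent if the object counts are read as 10^8 to 10^9 and each attribute value occupies 4 bytes. The 4-byte width is an assumption (the slide does not state it), and the function names are illustrative only.

```python
BYTES_PER_VALUE = 4  # assumption: 4-byte values; not stated on the slide

def column_file_bytes(num_objects: int) -> int:
    """Size of a file holding one attribute for every object."""
    return num_objects * BYTES_PER_VALUE

def total_bytes(num_objects: int, num_attributes: int) -> int:
    """Size of all attribute files together."""
    return column_file_bytes(num_objects) * num_attributes

small_file = column_file_bytes(10**8)    # 400 MB (decimal units)
large_file = column_file_bytes(10**9)    # 4 GB
small_total = total_bytes(10**8, 200)    # 80 GB
large_total = total_bytes(10**9, 500)    # 2 TB
```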

  7. Well known indexing methods
     • B-tree based indices
         • One or a small number of attributes
         • Index size may be up to 3 times the data size
     • R-tree based indices
         • Small number of attributes, say, < 10
     • UB-tree
         • Uses space-filling curves to map high-dimensional data to one dimension
         • One range query is mapped into many queries on the B-tree based index
     • Even sequential scan
         • Better than B-tree and R-tree if dimension > 10
         • Simply reading all the data and comparing takes too long
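
The UB-tree point can be illustrated with the Z-order (Morton) curve, the space-filling curve usually associated with that structure. The sketch below is not taken from the slides; `z_order_key` and `key_intervals` are hypothetical helper names. It interleaves coordinate bits into a single key and then shows how one small two-dimensional range breaks into several separate key intervals.

```python
from itertools import product

def z_order_key(coords, bits_per_dim):
    """Interleave the bits of non-negative integer coordinates into one key."""
    key = 0
    ndim = len(coords)
    for bit in range(bits_per_dim):
        for d, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * ndim + d)
    return key

def key_intervals(ranges, bits_per_dim):
    """Contiguous key intervals covering a box given as inclusive (lo, hi) per dimension."""
    keys = sorted(
        z_order_key(point, bits_per_dim)
        for point in product(*(range(lo, hi + 1) for lo, hi in ranges))
    )
    intervals = []
    for k in keys:
        if intervals and k == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], k)   # extend the current run
        else:
            intervals.append((k, k))                # start a new run
    return intervals

# The box 1 <= x <= 2, 0 <= y <= 1 on a 4 x 4 grid.
intervals = key_intervals([(1, 2), (0, 1)], bits_per_dim=2)
print(intervals)
```

Even this four-cell box needs three one-dimensional lookups, and the fragmentation generally grows with the number of dimensions and the size of the box.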

  8. Another class of indexes: Bitmap index
     • Example queries on the attribute, say, A
         • One-sided range query: A < 2
             • b0 OR b1
         • Two-sided range query: 2 < A < 5
             • b3 OR b4
     • Basic steps of building a bitmap index
         • Binning
         • Encoding
         • Compressing
     [Figure: a column of data values next to six bitmaps b0 … b5, where bitmap bk marks the rows whose value equals k (=0, =1, =2, =3, =4, =5)]
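
A minimal sketch of the idea on this slide, using Python integers as bitmaps (bit i is set when row i holds the value). The data column below is hypothetical, and the helper names are illustrative rather than part of the authors' software.

```python
def build_equality_index(values, cardinality):
    """One bitmap per distinct value: bitmaps[k] marks rows where value == k."""
    bitmaps = [0] * cardinality
    for row, v in enumerate(values):
        bitmaps[v] |= 1 << row
    return bitmaps

def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

values = [5, 3, 1, 2, 0, 4, 1]          # hypothetical column A, values in 0..5
b = build_equality_index(values, cardinality=6)

one_sided = rows_of(b[0] | b[1])        # A < 2      ->  b0 OR b1
two_sided = rows_of(b[3] | b[4])        # 2 < A < 5  ->  b3 OR b4
print(one_sided, two_sided)
```

Both queries are answered with bitwise ORs alone, without touching the raw column.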

  9. How many bins?
     • More bins: fewer objects in edge bins
     [Figure: two-dimensional grid of bins over Range(x) and Range(y), with the edge bins marked]
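
The trade-off can be sketched in one dimension (the slide's figure is two-dimensional). Bins lying entirely inside the query range qualify without further work, while objects in edge bins, those only partly covered by the range, must be checked against the raw values. The data, bin boundaries, and function names below are hypothetical.

```python
import bisect

def bin_of(value, edges):
    """Bin i covers edges[i] <= value < edges[i + 1]."""
    return bisect.bisect_right(edges, value) - 1

def range_query(values, edges, lo, hi):
    """Rows with lo <= value < hi, plus the number of raw values that had to be checked."""
    nbins = len(edges) - 1
    bins = [[] for _ in range(nbins)]
    for row, v in enumerate(values):
        bins[bin_of(v, edges)].append(row)

    hits, checked = [], 0
    for i in range(nbins):
        b_lo, b_hi = edges[i], edges[i + 1]
        if lo <= b_lo and b_hi <= hi:        # bin fully inside the range
            hits.extend(bins[i])
        elif b_lo < hi and lo < b_hi:        # edge bin: partial overlap
            for row in bins[i]:
                checked += 1
                if lo <= values[row] < hi:
                    hits.append(row)
    return sorted(hits), checked

values = [1, 12, 25, 37, 44, 58, 63, 79]     # hypothetical data in [0, 80)
coarse = [0, 40, 80]                         # 2 bins
fine = list(range(0, 81, 10))                # 8 bins

coarse_result = range_query(values, coarse, 15, 60)
fine_result = range_query(values, fine, 15, 60)
print(coarse_result, fine_result)
```

With two bins every object falls in an edge bin and all eight raw values are checked; with eight bins only one is. The cost of the finer binning is more bitmaps to store.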

  10. How to encode
      [Figure: bitmaps over bins 0 – 6 compared under three schemes: equality encoding, range encoding, interval encoding]
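
The three schemes can be sketched as follows, again with Python integers as bitmaps. Equality encoding keeps one bitmap per bin; range encoding keeps cumulative bitmaps (bin number <= i). The interval variant here is a simplification that assumes fixed-width windows of consecutive bins; published interval encoding chooses the window width and the number of bitmaps more carefully, so treat `interval_bitmaps` as illustrative only. The bin assignments are hypothetical.

```python
def equality_bitmaps(bins, nbins):
    """E[i] marks rows whose bin number is exactly i."""
    out = [0] * nbins
    for row, b in enumerate(bins):
        out[b] |= 1 << row
    return out

def range_bitmaps(bins, nbins):
    """R[i] marks rows whose bin number is <= i."""
    out, acc = [], 0
    for e in equality_bitmaps(bins, nbins):
        acc |= e
        out.append(acc)
    return out

def interval_bitmaps(bins, nbins, width):
    """I[i] marks rows whose bin number lies in [i, i + width - 1] (simplified windows)."""
    eq = equality_bitmaps(bins, nbins)
    out = []
    for start in range(nbins - width + 1):
        acc = 0
        for e in eq[start:start + width]:
            acc |= e
        out.append(acc)
    return out

bins = [0, 6, 3, 2, 5, 3, 1]                 # hypothetical bin number per row, bins 0..6
E = equality_bitmaps(bins, 7)
R = range_bitmaps(bins, 7)
I = interval_bitmaps(bins, 7, width=3)

# Query: 2 <= bin <= 4
via_equality = E[2] | E[3] | E[4]            # three bitmaps
via_range = R[4] & ~R[1]                     # two bitmaps
via_interval = I[2]                          # one bitmap when the window matches
```

All three produce the same answer; they differ in how many bitmaps a query must read and in how large the stored bitmaps are.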

  11. Advantages of bitmap indices
      • Fast operations
          • The most common operations are the bitwise logical operations
          • They are well supported by hardware
      • Easy to compress, potentially small index size
      • Each individual bitmap is small and frequently used ones can be cached in memory
      • Efficient for read-mostly data: data produced from scientific experiments can be appended in large groups
      • Available in most major commercial DBMS

  12. Why our own bitmap index
      • Early tests showed that we can do an order of magnitude better than ORACLE (using equality encoding)
      • Vertical partition: allows one to read only the data of the attributes involved in a query
      • New compression method
          • Best known: Byte-aligned Bitmap Code (BBC)
          • Developed 2 word-aligned schemes: WAH, WBC
      • Different encoding schemes under compression
          • Equality encoding – used in ORACLE and others
          • Range encoding – one-sided range queries
          • Interval encoding – two-sided range queries
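
To give a feel for word-aligned compression, here is a simplified sketch in the spirit of WAH as it is commonly described: the bitmap is cut into 31-bit groups, a mixed group is stored as a literal 32-bit word, and a run of all-0 or all-1 groups is stored as one fill word holding the run length. This is not the authors' implementation: it pads the final partial group with zeros and works on Python lists, purely for illustration.

```python
GROUP = 31                       # payload bits per 32-bit word
ALL_ONES = (1 << GROUP) - 1
FILL_FLAG = 1 << 31              # top bit set: fill word
MAX_COUNT = (1 << 30) - 1        # low 30 bits of a fill word: number of groups

def wah_encode(bits):
    """Encode a list of 0/1 values into literal and fill words."""
    words = []
    for start in range(0, len(bits), GROUP):
        chunk = bits[start:start + GROUP]
        chunk = chunk + [0] * (GROUP - len(chunk))   # simplification: zero-pad the tail
        value = 0
        for b in chunk:
            value = (value << 1) | b
        if value in (0, ALL_ONES):
            fill_bit = 1 if value else 0
            last = words[-1] if words else 0
            same_fill = (last & FILL_FLAG) and ((last >> 30) & 1) == fill_bit
            if same_fill and (last & MAX_COUNT) < MAX_COUNT:
                words[-1] = last + 1                 # extend the current run
            else:
                words.append(FILL_FLAG | (fill_bit << 30) | 1)
        else:
            words.append(value)                      # literal word, top bit clear
    return words

def wah_decode(words, nbits):
    """Expand the words back into the first nbits bits."""
    bits = []
    for w in words:
        if w & FILL_FLAG:
            bits.extend([(w >> 30) & 1] * (GROUP * (w & MAX_COUNT)))
        else:
            bits.extend((w >> (GROUP - 1 - i)) & 1 for i in range(GROUP))
    return bits[:nbits]

bits = [1] + [0] * 30 + [0] * 310 + [1] * 62 + [0, 1]   # 405 bits, 14 groups
words = wah_encode(bits)                                 # 4 words
```

Because every word is either a whole literal group or a whole run, logical operations can proceed a word at a time without bit shifting across byte boundaries, which is the usual explanation for trading some space for speed relative to byte-aligned schemes.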

  13. Information about the test machines
      • Hardware and system
          • Sun Enterprise 450 (UltraSPARC II 400MHz)
          • 4GB RAM
          • VERITAS volume manager (striped disk)
      • Real application data from STAR
          • More than 2 million objects
          • Picked 12 attributes with varying distributions
      • Measures:
          • Logical operation time without IO
          • Logical operation time with IO
          • Query processing time

  14. Logical operation time (no IO)

  15. Logical operation time (including IO)

  16. New compression schemes
      • Overall, use about 50% more space than BBC
      • On average, 12 times faster than BBC
      • Faster than the uncompressed in more cases:
          • New schemes are faster than the uncompressed scheme when the compression ratios are less than 0.3
          • BBC is faster than the uncompressed when the compression ratios are less than 0.03

  17. Sizes of bitmap indices
      • Conclusion:
          • Equality encoding is most space efficient
          • Compression gain is at least a factor of 2.5

  18. Average query processing time
      • Conclusion:
          • Interval and range encoding are the best
          • For these cases, there is practically no penalty to compression

  19. Interval encoding is better overall
      • Sequential scan time: 0.557 sec

  20. Summary
      • Better compression scheme
          • 50% more space, but 10-12 times faster
      • Among the different encoding schemes
          • Interval encoding is better than both equality encoding and range encoding
      • Selecting the number of bins determines bitmap index size and operation efficiency. For example:
          • 10% of data size => 3 x speed of sequential scan
          • 20% of data size => 6 x speed of sequential scan
      • Equality encoding is currently used in the STAR experiment. The next version will include interval encoding.

  21. Future work
      • Support NULL values and categorical values
      • On-line update: add new data and update the index without interrupting request processing
      • Recovery mechanism for robustness
      • Potential new applications: climate, astrophysics, biology
      • Study different non-uniform binning strategies
      • Integrate with a conventional database system: to better handle metadata, to provide a more versatile front-end
