Projection Indexes in HDF5

Rishi Rakesh Sinha, The HDF Group


  1. Projection Indexes in HDF5
     Rishi Rakesh Sinha, The HDF Group

  2. Science Produces Large Datasets
     • Observation/experiment driven
     • Simulation driven
     • Information driven
     [Figure labels: 144 MB/hr, 200 GB/run, > 7 GB/expt]

  3. Why Not Commercial DBMSs?
     • Proprietary format
     • Lack of portability
     • Low scalability
     • Lack of desirable access modes
     • Presence of expensive concurrency control and logging mechanisms
     • Expensive parallel versions

  4. State of the Art Not Enough
     • Scientific file formats and associated I/O APIs
     • Concentrating on HDF5
     • Data recovery is navigational
     • Subsetting only on a small set of attributes

  5. Why Indexes?
     [Figure labels: "Easy", "Not So Easy"]

  6. Previous Indexing Efforts
     • Implicit indexing in HDF5
     • JPL use of HDF Vdatas
     • HDF-EOS point data
     • PyTables
     • HDF5 internal B-tree structures

  7. Why a Standard Indexing API?
     • Avoid duplication of effort
     • PyTables
     • Standardize indexing in HDF5
     • Standard API can be differently implemented
     • Make indexes portable
     • Store indexes in HDF5 files

  8. H5IN API
     • Create_index
       • Parameters: location of index, location of data, binning information, memory limits
       • Returns: location of the index
     • Query
       • Parameters: dataset to query, query string
       • Returns: selection representing the subset of the data corresponding to the query

  9. Design Decisions
     • Limited scope of the prototype
     • Index stored in a separate dataset
     • Returns a selection
     • Projection index
     • Support for simple Boolean queries

  10. Limited Scope
      • First indexing prototype in HDF5
      • Presence of implicit indexing
      • Index on single datasets
      • Query over single datasets
      • Conditions should be over a single dataset
      • Result could be mapped to a separate dataset

  11. Index Storage
      [Diagram labels: Root Group "/"; groups DAY1, DAY2, DAY3, DAY4; datasets Location Data (F1, F2, F3), Pressure, Temperature]

  12. Index Storage
      [Diagram labels: Root Group "/"; group DAY3; dataset Location Data (F1, F2, F3); index LD_INDEX (F1, F2)]

  13. Index Storage
      [Diagram labels: Root Group "/"; group DAY3; datasets Temperature, Pressure; indexes T_IN, P_IN]

  14. Returns a Selection
      FIND PRESSURE WHERE TEMP IN [100, 200]
      • Concise storage
      • Efficient Boolean operations
      [Figure labels: Pressure, Temperature]
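One way to picture why a selection is both concise and cheap to combine is to store it as runs of consecutive element positions. The sketch below is only an illustration under that assumption: the `Run` type, `to_runs`, `intersect`, and the sample values are hypothetical stand-ins, not part of HDF5 or the H5IN prototype, and the linear scan merely takes the place of an index lookup.

```python
from typing import List, Tuple

Run = Tuple[int, int]  # half-open [start, stop) range of element positions


def to_runs(positions: List[int]) -> List[Run]:
    """Collapse sorted element positions into half-open runs."""
    runs: List[Run] = []
    for p in positions:
        if runs and runs[-1][1] == p:
            runs[-1] = (runs[-1][0], p + 1)  # extend the current run
        else:
            runs.append((p, p + 1))          # start a new run
    return runs


def intersect(a: List[Run], b: List[Run]) -> List[Run]:
    """AND of two selections, computed on runs without expanding them."""
    out: List[Run] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        stop = min(a[i][1], b[j][1])
        if start < stop:
            out.append((start, stop))
        # Advance whichever run ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Made-up sample data for two datasets of equal length.
temperature = [90, 120, 150, 180, 250, 130, 110, 300]
pressure = [29.0, 29.5, 30.0, 30.5, 31.0, 29.8, 30.2, 31.5]

# FIND PRESSURE WHERE TEMP IN [100, 200]
hits = [i for i, t in enumerate(temperature) if 100 <= t <= 200]
selection = to_runs(hits)

# The selection computed on Temperature is applied to Pressure.
selected_pressure = [pressure[i] for start, stop in selection
                     for i in range(start, stop)]
```

Five matching elements collapse into two runs here, and `intersect` combines two such selections in a single pass over the runs rather than over individual elements.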

  15. Projection Index

  16. Binning

  17. Projection Index
      [Figure labels: Temp 40, 50, 60; Pressure 29, 30, 31; markers A, B, C]
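A binned projection index can be sketched as one bin number per element, kept in the same order as the data. The code below is a minimal illustration under assumptions: the function names, the sample temperatures, and the choice of 40, 50, 60 as bin edges (borrowed from the figure labels) are hypothetical and do not describe the prototype's actual implementation.

```python
from bisect import bisect_right
from typing import List, Sequence


def build_projection_index(values: Sequence[float],
                           edges: Sequence[float]) -> List[int]:
    """Store one bin number per element, in the same order as the data.

    Bin k holds values v with edges[k-1] <= v < edges[k]; bin 0 is below
    edges[0] and bin len(edges) is at or above edges[-1].
    """
    return [bisect_right(edges, v) for v in values]


def range_query(values: Sequence[float], index: Sequence[int],
                edges: Sequence[float], lo: float, hi: float) -> List[int]:
    """Return positions i with lo <= values[i] <= hi."""
    first = bisect_right(edges, lo)  # bin that contains lo
    last = bisect_right(edges, hi)   # bin that contains hi
    hits: List[int] = []
    for i, b in enumerate(index):
        if first < b < last:
            hits.append(i)           # bin lies entirely inside [lo, hi]
        elif b == first or b == last:
            # Edge bin: only here is the raw value consulted.
            if lo <= values[i] <= hi:
                hits.append(i)
    return hits


# Made-up temperatures; bin edges assumed from the slide's 40/50/60 labels.
temps = [38, 42, 47, 51, 55, 58, 63, 49]
edges = [40, 50, 60]

index = build_projection_index(temps, edges)
matches = range_query(temps, index, edges, lo=45, hi=62)
```

Only elements in the two edge bins need their raw values checked; elements in bins strictly between them are accepted from the index alone, and all other bins are skipped.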

  18. Why Projection Index?
      • Data is read-only
      • A dataset is mostly not changed once written
      • The index does not need to be updated
      • Projection indexes are well suited
      • The number of disk accesses is the same as with a B-tree
      • Multidimensional queries are not being considered

  19. Only Simple Boolean Queries
      • Query format:
        SELECT SELECTION WHERE c11 < Attribute1 < c12 AND c21 < Attribute2 < c22 …
      • Because results are selections, Boolean operations can be done inside the library
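The query format above can be read as a conjunction of range conditions, each yielding a selection, with AND becoming an intersection. The following sketch assumes numeric constants, strict `<` comparisons, and attributes held as equal-length named columns. `parse_where`, `evaluate`, and the sample data are hypothetical, and the per-column scan stands in for an index lookup.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

# One condition of the form "c1 < Name < c2" with numeric constants.
CONDITION = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*<\s*(\w+)\s*<\s*(-?\d+(?:\.\d+)?)\s*$"
)


def parse_where(query: str) -> List[Tuple[float, str, float]]:
    """Split the WHERE clause on AND and parse each 'c1 < Name < c2' term."""
    _, _, where = query.partition(" WHERE ")
    if not where:
        raise ValueError("query has no WHERE clause")
    terms = []
    for part in where.split(" AND "):
        m = CONDITION.match(part)
        if m is None:
            raise ValueError(f"unsupported condition: {part!r}")
        terms.append((float(m.group(1)), m.group(2), float(m.group(3))))
    return terms


def evaluate(query: str, columns: Dict[str, List[float]]) -> List[int]:
    """Return sorted positions that satisfy every condition in the query."""
    result: Optional[Set[int]] = None
    for lo, name, hi in parse_where(query):
        hits = {i for i, v in enumerate(columns[name]) if lo < v < hi}
        # AND is the intersection of the per-condition selections.
        result = hits if result is None else result & hits
    return sorted(result or ())


# Made-up sample columns of equal length.
columns = {
    "Temperature": [90, 120, 150, 180, 250, 130],
    "Pressure": [29.5, 28.0, 30.0, 31.5, 30.2, 30.9],
}
rows = evaluate(
    "SELECT SELECTION WHERE 100 < Temperature < 200 AND 29 < Pressure < 31",
    columns,
)
```

Because each condition produces a selection of the same kind, adding further AND terms only adds further intersections and needs no access to the underlying data beyond the per-condition lookups.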

  20. Conclusion
      • Developing a standard indexing API in HDF5
      • Creating a proof-of-concept prototype using projection indexes
      • Taking a first step towards developing a query language for HDF5

  21. Future Work
      • Multi-dimensionality
      • Multiple datasets in same file
      • Multiple datasets across files
      • Indexes on attributes
      • Allow user to index subset of datasets
