Enhanced HDF5 Indexing API for Large Scientific Datasets

Projection Indexes in HDF5
Rishi Rakesh Sinha
The HDF Group

Science Produces Large Datasets
• Observation/experiment driven
• Simulation driven
• Information driven
Data rates shown on the slide: 144 MB/hr, 200 GB/run, > 7 GB/expt

Why Not Commercial DBMSs?
• Proprietary formats
• Lack of portability
• Low scalability
• Lack of desirable access modes
• Expensive concurrency control and logging mechanisms
• Expensive parallel versions

State of the Art Not Enough
• Scientific file formats and associated I/O APIs
• Concentrating on HDF5
• Data recovery is navigational
• Subsetting only on a small set of attributes

Why Indexes?
[Figure panels: "Easy", "Not So Easy"]

Previous Indexing Efforts
• Implicit indexing in HDF5
• JPL use of HDF Vdatas
• HDF-EOS point data
• PyTables
• HDF5 internal B-Tree structures

Why a Standard Indexing API?
• Avoid duplication of effort
• PyTables
• Standardize indexing in HDF5
• Standard API can be differently implemented
• Make indexes portable
• Store indexes in HDF5 files

H5IN API
• Create_index
  • Parameters: location of index, location of data, binning information, memory limits
  • Returns: location of the index
• Query
  • Parameters: dataset to query, query string
  • Returns: selection representing the subset of the data corresponding to the query
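The two calls above can be sketched in Python to show the shape of the interface. This is a minimal illustration, not the H5IN implementation: the names `create_index` and `query`, the dictionary standing in for an HDF5 file, the `"low < name < high"` query syntax, and the sorted-positions layout are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right

def create_index(store, index_path, data_path, num_bins=4, memory_limit=None):
    """Build an index over store[data_path], save it at index_path, return index_path.

    `store` is a hypothetical stand-in for an HDF5 file (path -> Python list).
    num_bins and memory_limit mirror the "binning information" and
    "memory limits" parameters; this sketch records them but does not use them.
    """
    values = store[data_path]
    # Sort element positions by value so range lookups can use binary search.
    order = sorted(range(len(values)), key=values.__getitem__)
    store[index_path] = {
        "data_path": data_path,
        "keys": [values[i] for i in order],
        "positions": order,
        "num_bins": num_bins,
        "memory_limit": memory_limit,
    }
    return index_path

def query(store, dataset_path, query_string):
    """Return sorted element positions satisfying 'low < name < high'."""
    parts = [p.strip() for p in query_string.split("<")]
    if len(parts) != 3:
        raise ValueError(f"unsupported query: {query_string!r}")
    low, high = float(parts[0]), float(parts[2])
    # Look for an index that was built over the queried dataset.
    index = next(
        (obj for obj in store.values()
         if isinstance(obj, dict) and obj.get("data_path") == dataset_path),
        None,
    )
    if index is None:
        # No index available: fall back to scanning the dataset.
        return [i for i, v in enumerate(store[dataset_path]) if low < v < high]
    first = bisect_right(index["keys"], low)   # first key strictly above low
    last = bisect_left(index["keys"], high)    # first key at or above high
    return sorted(index["positions"][first:last])

store = {"/DAY3/Temperature": [150, 90, 210, 120, 100]}
index_location = create_index(store, "/DAY3/T_IN", "/DAY3/Temperature")
selection = query(store, "/DAY3/Temperature", "100 < T < 200")
```

The index is stored as a separate object next to the data, and the query returns positions rather than values, so the caller decides which dataset to read with the result.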

Design Decisions
• Limited scope of the prototype
• Index stored in a separate dataset
• Returns a selection
• Projection index
• Support for simple Boolean queries

Limited Scope
• 1st indexing prototype in HDF5
• Presence of implicit indexing
• Index on single datasets
• Query over single datasets
• Conditions should be over a single dataset
• Result could be mapped to a separate dataset

Index Storage
[Diagram: root group /; groups DAY1, DAY2, DAY3, DAY4; datasets Location Data (fields F1, F2, F3), Pressure, Temperature]

Index Storage
[Diagram: root group /; group DAY3; dataset Location Data (fields F1, F2, F3); index LD_INDEX (F1, F2)]

Index Storage
[Diagram: root group /; group DAY3; datasets Temperature and Pressure; indexes T_IN and P_IN]

Returns a Selection
FIND PRESSURE WHERE TEMP IN [100, 200]
• Concise storage
• Efficient Boolean operations
[Diagram labels: Pressure, Temperature]
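A short Python sketch may help show why returning a selection gives concise storage and cheap Boolean operations. It assumes a selection is kept as a sorted list of half-open `(start, stop)` runs of element positions; that representation and the helper names `to_runs`, `intersect`, and `read_selected` are illustrative choices, not part of HDF5.

```python
def to_runs(positions):
    """Compress element positions into sorted half-open (start, stop) runs."""
    runs = []
    for p in sorted(set(positions)):
        if runs and runs[-1][1] == p:
            runs[-1] = (runs[-1][0], p + 1)   # extend the current run
        else:
            runs.append((p, p + 1))           # start a new run
    return runs

def intersect(a, b):
    """Boolean AND of two run lists in a single merge pass."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        stop = min(a[i][1], b[j][1])
        if start < stop:
            out.append((start, stop))
        # Advance whichever run ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def read_selected(dataset, runs):
    """Apply a selection to any dataset with the same element layout."""
    return [x for start, stop in runs for x in dataset[start:stop]]

temperature = [90, 120, 150, 180, 250, 110]
pressure = [29, 30, 31, 30, 29, 31]

# FIND PRESSURE WHERE TEMP IN [100, 200]
temp_sel = to_runs(i for i, t in enumerate(temperature) if 100 <= t <= 200)
selected_pressure = read_selected(pressure, temp_sel)

# Combine with a second condition without touching the data again.
pressure_sel = to_runs(i for i, p in enumerate(pressure) if p == 30)
both = intersect(temp_sel, pressure_sel)
```

Here the selection computed from Temperature is applied to Pressure, which is the "result mapped to a separate dataset" case, and the AND of two conditions works on the run lists alone.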

Projection Index

Binning

Projection Index
[Figure: Temp axis with ticks 40, 50, 60; Pressure axis with ticks 29, 30, 31; regions labeled A, B, C]

Why Projection Index?
• Data is read-only
• Datasets are mostly not changed once written
• Index does not need to be updated
• Projection indexes are well suited
• Number of disk accesses is the same as for a B-Tree
• Multidimensional queries are not being considered
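The slides do not spell out the index layout, so the following Python sketch rests on an assumption: each bin of a single attribute stores the `(value, position)` pairs that fall inside it, and a range query checks individual values only in bins that straddle a query bound. The function names and the bin edges are hypothetical and chosen only to echo the Temp axis on the earlier slide.

```python
from bisect import bisect_right

def build_binned_index(values, boundaries):
    """Group (value, position) pairs into bins.

    Bin k covers the half-open interval [boundaries[k], boundaries[k + 1]).
    """
    bins = [[] for _ in range(len(boundaries) - 1)]
    for pos, v in enumerate(values):
        k = bisect_right(boundaries, v) - 1
        if not 0 <= k < len(bins):
            raise ValueError(f"value {v!r} lies outside the bin boundaries")
        bins[k].append((v, pos))
    return bins

def range_query(bins, boundaries, low, high):
    """Return sorted positions whose value satisfies low <= value < high."""
    hits = []
    for k, entries in enumerate(bins):
        b_lo, b_hi = boundaries[k], boundaries[k + 1]
        if b_hi <= low or b_lo >= high:
            continue                                   # bin entirely outside
        if low <= b_lo and b_hi <= high:
            hits.extend(pos for _, pos in entries)     # bin fully covered
        else:
            # Edge bin: only here are individual values compared.
            hits.extend(pos for v, pos in entries if low <= v < high)
    return sorted(hits)

temperature = [42, 55, 48, 59, 51, 45]
boundaries = [40, 50, 60]
bins = build_binned_index(temperature, boundaries)
positions = range_query(bins, boundaries, low=45, high=60)
```

Because the data is written once and then only read, an index like this can be built in one pass and never needs in-place updates.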

Only Simple Boolean Queries
• Query format:
  SELECT SELECTION WHERE c11 < Attribute1 < c12 AND c21 < Attribute2 < c22 …
• Because results are selections, Boolean operations can be done inside the library
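As a rough illustration of this query format, the Python sketch below parses the WHERE clause into range conditions and ANDs the per-condition results. The helper names `parse_where` and `evaluate` are hypothetical, bounds are assumed to be numeric literals, and each condition is evaluated by a plain scan here rather than through an index.

```python
import re

# One range condition of the form: <low> < <attribute> < <high>
_TERM = re.compile(r"^\s*(\S+)\s*<\s*(\w+)\s*<\s*(\S+)\s*$")

def parse_where(where):
    """Split a WHERE clause into (low, attribute, high) triples."""
    terms = []
    for part in re.split(r"\s+AND\s+", where.strip()):
        m = _TERM.match(part)
        if m is None:
            raise ValueError(f"unsupported condition: {part!r}")
        terms.append((float(m.group(1)), m.group(2), float(m.group(3))))
    return terms

def evaluate(columns, where):
    """Return sorted positions that satisfy every range condition."""
    selection = None
    for low, name, high in parse_where(where):
        hits = {i for i, v in enumerate(columns[name]) if low < v < high}
        # AND is a set intersection of the per-condition selections.
        selection = hits if selection is None else selection & hits
    return sorted(selection)

columns = {
    "Temp": [42, 55, 48, 59, 51, 45],
    "Pressure": [29, 30, 31, 30, 29, 31],
}
result = evaluate(columns, "40 < Temp < 50 AND 29 < Pressure < 32")
```

Each condition yields its own selection, so combining conditions never requires rereading the underlying data.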

Conclusion
• Developing a standard indexing API in HDF5
• Creating a proof-of-concept prototype using projection indexes
• Taking a first step towards developing a query language for HDF5

Future Work
• Multi-dimensionality
• Multiple datasets in same file
• Multiple datasets across files
• Indexes on attributes
• Allow user to index subset of datasets
