
HDF5 Datasets and I/O




  1. HDF5 Datasets and I/O
     Dataset storage and its effect on performance
     HDF5 Workshop at PSI

  2. Outline
     • Dataset metadata and array data storage layouts
     • Types of dataset storage layouts
     • Factors affecting I/O performance
     • I/O with compact datasets
     • I/O with contiguous datasets
     • I/O with chunked datasets
     • Variable length data and I/O

  3. HDF5 Layers
     [Figure: the layers that data passes through on a write.]
     • HDF5 application: application buffer
     • HDF5 Object Layer (API): H5Dwrite is called
     • HDF5 internals: data is prepared for I/O
     • VFD Layer: SEC2 driver performs I/O
     • HDF5 file

  4. Goal of this talk
     • Present what happens to data inside the HDF5 library
     • Show how an application can control the HDF5 library's behavior
     • Specifically:
       • Describe some basic operations and data structures, and explain how they affect performance and storage size
       • Give some “recipes” for improving performance

  5. HDF5 dataset metadata

  6. HDF5 Dataset
     • Data array
       • Also called raw data
     • Metadata
       • Dataspace
         • Rank, dimensions of dataset array
       • Datatype
         • Information on how to interpret data
       • Storage properties
         • How array is organized on disk
       • Attributes
         • User-defined metadata (optional)

  7. HDF5 dataset components
     [Figure: a dataset consists of a dataset header (metadata) and a dataset data array (raw data).]
     • Dataspace: Rank = 3; Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
     • Datatype: IEEE 32-bit float
     • Attributes: Time = 32.4, Pressure = 987, Temp = 56
     • Storage info: Chunked, Compressed
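The header values in this figure are enough to work out how much raw data the dataset holds before any compression. A minimal sketch in plain Python follows; the function name `raw_data_bytes` is hypothetical and is not part of any HDF5 API.

```python
from math import prod

def raw_data_bytes(dims, element_size):
    """Uncompressed size in bytes of an array with the given dimensions."""
    return prod(dims) * element_size

# Rank 3, dimensions 4 x 5 x 7, IEEE 32-bit float (4 bytes per element).
print(raw_data_bytes((4, 5, 7), 4))  # 560
```

At this size the metadata is not small relative to the raw data, which is the situation the compact layout later in the talk is meant for.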

  8. HDF5 metadata
     • Information about HDF5 objects used by the HDF5 library
     • Examples: object headers, B-tree nodes for groups, B-tree nodes for chunks, heaps, superblock, etc.
     • Usually small compared to raw data sizes (KB vs. MB-GB)

  9. HDF5 metadata cache
     [Figure: application memory holds the dataset array data and the metadata cache (MDC) with the dataset header; the HDF5 file holds HDF5 metadata and dataset array data.]
     • The dataset header resides in the MDC. The MDC is handled by the HDF5 library.
     • Metadata is mixed with raw data in the HDF5 file.

  10. HDF5 metadata cache
      • Metadata cache
        • Space allocated to handle pieces of the HDF5 metadata
        • Allocated by the HDF5 library in the application’s memory space
        • Allocated per file; released when the file is closed
      • Metadata cache behavior affects overall performance
      • The metadata cache implementation prior to HDF5 1.6.5 could cause performance degradation for some applications

  11. HDF5 dataset storage layouts

  12. HDF5 dataset storage layouts
      • Contiguous
      • External
      • Chunked
      • Compact

  13. Contiguous storage layout
      • Contiguous is the default storage layout for an HDF5 dataset
      • Dataset raw data is stored in one contiguous block in the HDF5 file

  14. Contiguous storage layout
      [Figure: application memory holds the dataset array data and the metadata cache (MDC) with the dataset header; the HDF5 file holds the dataset header and the dataset array data.]
      • Raw data is stored in one contiguous block in the HDF5 file

  15. External storage layout
      • Dataset raw data is stored in one or more external files, which should be kept together with the HDF5 file
      • Layout in the external file is specified by the application
      • An easy way to make legacy data available to the HDF5 library

  16. External storage layout
      [Figure: application memory holds the dataset array data and the metadata cache (MDC) with the dataset header; the HDF5 file holds the dataset header; a separate Unix/Windows file holds the raw data.]
      • Metadata is stored in the HDF5 file. Raw data is stored in a separate file as specified by the application.

  17. Chunked storage layout
      • Chunking: a storage layout in which a dataset is partitioned into fixed-size multi-dimensional tiles, or chunks
      • Each chunk is stored as a contiguous block
      • The HDF5 library treats each chunk as an atomic object for I/O
      • Greatly affects performance and file sizes
      • Use for extendible datasets and datasets with filters applied (checksum, compression)
      • Use for sub-setting of big datasets
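Partitioning into fixed-size tiles is plain integer arithmetic. The sketch below computes how many chunks a dataset is split into; the function names and the example shapes are assumptions chosen for illustration, not values from the talk.

```python
from math import prod

def chunk_grid(dataset_dims, chunk_dims):
    """Number of chunks along each dimension (ceiling division)."""
    return tuple(-(-d // c) for d, c in zip(dataset_dims, chunk_dims))

def chunk_count(dataset_dims, chunk_dims):
    """Total number of chunks needed to cover the dataset."""
    return prod(chunk_grid(dataset_dims, chunk_dims))

print(chunk_grid((1000, 1000), (100, 200)))   # (10, 5)
print(chunk_count((1000, 1000), (100, 200)))  # 50
# Dimensions that are not a multiple of the chunk size still need a
# chunk for the leftover edge: 10 x 10 with 4 x 4 chunks gives 3 x 3.
print(chunk_count((10, 10), (4, 4)))          # 9
```

Every chunk needs an entry in the chunk index, so the count directly affects how much metadata the library has to manage.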

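  18. Chunked storage layout
      [Figure: application memory holds the dataset array data, divided into chunks A, B, C, D, and the metadata cache (MDC) with the dataset header and chunk index; the HDF5 file holds the dataset header, the chunk index, and the chunks, shown in the order C, D, B, A.]
      • Raw data is stored in separate chunks in the HDF5 file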

  19. Compact storage layout
      • Raw data is stored in the dataset object header
      • Raw data is read/written together with the header
      • Use for small (a few KB) datasets to minimize small I/O operations

  20. Compact storage layout
      [Figure: application memory holds the dataset array data and the metadata cache (MDC) with the dataset header; in the HDF5 file the dataset array data sits inside the dataset header.]
      • Raw data is stored in the dataset object header

  21. Factors affecting I/O performance

  22. HDF5 data structures
      • Data structures used by the HDF5 library
        • B-trees (groups, dataset chunks)
        • Hash tables
        • Local and global heaps (variable length data: link names, strings, etc.)
      • Other concepts
        • HDF5 metadata cache
        • HDF5 chunk cache
        • Free space management data structure
        • Etc.

  23. Operations on data inside HDF5 library
      • Copying to/from internal buffers
      • Datatype conversion, e.g.,
        • Float to integer
        • Little-endian to big-endian
        • 64-bit integer to 16-bit integer
      • Variable-length data conversion from memory to file
      • Scattering and gathering
        • Data is scattered/gathered from/to application buffers into internal buffers for datatype conversion and partial I/O
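To give a feel for why datatype conversion costs time, the sketch below performs two of the listed conversions with Python's `struct` module. It only models the byte-level work (every element is unpacked and repacked); it is not HDF5's conversion code, and both function names are hypothetical.

```python
import struct

def swap_endianness_int32(buf):
    """Reinterpret little-endian 32-bit integers as big-endian bytes."""
    n = len(buf) // 4
    return struct.pack(f">{n}i", *struct.unpack(f"<{n}i", buf))

def narrow_int64_to_int16(buf):
    """Convert little-endian 64-bit integers to 16-bit integers.

    struct raises struct.error for values that do not fit in 16 bits;
    a real conversion layer has to choose an overflow policy instead.
    """
    n = len(buf) // 8
    return struct.pack(f"<{n}h", *struct.unpack(f"<{n}q", buf))

little = struct.pack("<2i", 1, 258)
print(swap_endianness_int32(little).hex())   # 0000000100000102

wide = struct.pack("<3q", 1, 2, 3)
print(len(wide), len(narrow_int64_to_int16(wide)))  # 24 6
```

Both conversions touch every element, which is why matching the memory datatype to the file datatype avoids the extra pass entirely.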

  24. Operations on data inside HDF5 library
      • Data transformation (filters, compression)
        • Checksum on raw data and metadata
        • Algebraic transform
        • GZIP and SZIP compression
        • HDF5 and user-defined data transformations

  25. I/O performance
      • I/O performance depends on many factors
        • Storage layouts
        • Dataset storage properties
        • Chunking strategy
        • Metadata cache performance
        • Datatype conversion performance
        • Other filters, such as compression
        • Access patterns

  26. I/O with different storage layouts

  27. Writing compact dataset

  28. Writing compact dataset
      [Figure: application memory holds the dataset array data and the metadata cache (MDC) with the dataset header; the dataset header is written to the HDF5 file.]
      • Raw data is written when the object header is written

  29. Writing contiguous dataset

  30. Writing contiguous dataset
      [Figure: application memory holds the dataset array data and the metadata cache (MDC) with the dataset header; both are written to the HDF5 file.]
      • Raw data is written first
      • The header is written when flushed to file (H5Dclose, H5Fflush, or MDC flush done by the HDF5 library)

  31. Writing contiguous dataset with conversion
      [Figure: the dataset array data in application memory passes through a 1MB conversion buffer on its way to the HDF5 file; the dataset header sits in the metadata cache (MDC).]
      • Raw data goes through the conversion buffer
      • The header is written when flushed to file (H5Dclose, H5Fflush, or MDC flush done by the HDF5 library)

  32. Partial I/O for contiguous dataset

  33. Sub-setting of contiguous dataset: series of adjacent rows
      [Figure: M rows of N elements in application memory map to M rows of N elements in the HDF5 file.]
      • One I/O operation
      • Subset is contiguous in file

  34. Sub-setting of contiguous dataset: adjacent, partial rows
      [Figure: M partial rows in application memory map to M partial rows of a dataset that is N elements wide in the HDF5 file.]
      • Several I/O operations
      • Subset is in M contiguous blocks in file

  35. Sub-setting of contiguous dataset: extreme case, writing a column
      [Figure: a column of M rows in application memory maps to single elements spread through the HDF5 file.]
      • Several small I/O operations, 1 element each
      • Subset data is scattered in the file in M different locations
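The three sub-setting cases differ only in how many contiguous byte runs the selection forms in the file. The sketch below counts those runs for a rectangular selection of a two-dimensional dataset stored row by row; the function name and the 100 x 50 example shape are assumptions for illustration.

```python
def contiguous_runs(dataset_shape, start, count):
    """Number of contiguous runs a rectangular selection occupies
    in a 2-D dataset stored row by row (row-major order)."""
    rows, cols = dataset_shape
    (row0, col0), (nrows, ncols) = start, count
    if row0 + nrows > rows or col0 + ncols > cols:
        raise ValueError("selection extends past the dataset")
    if nrows == 0 or ncols == 0:
        return 0
    if ncols == cols:
        return 1          # full rows are adjacent in the file
    return nrows          # one run per partial row

shape = (100, 50)
print(contiguous_runs(shape, (10, 0), (20, 50)))  # 1   adjacent full rows
print(contiguous_runs(shape, (10, 5), (20, 30)))  # 20  partial rows
print(contiguous_runs(shape, (0, 7), (100, 1)))   # 100 a single column
```

Without any buffering, each run corresponds to a separate I/O request, so the column case issues one request per element.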

  36. Sub-setting of contiguous dataset: data sieve buffer
      [Figure: M single elements in application memory are gathered by memcopy into a sieve buffer, which is written to the HDF5 file.]
      • Data is copied to a sieve buffer in memory (64K)
      • One write operation
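The effect of the sieve buffer can be approximated by counting how many buffer-sized windows are needed to cover the scattered elements. This is a simplified model, assuming the buffer is filled starting at the first element it does not yet cover; it ignores the read-before-write a real sieve buffer may need, and `io_requests` is a hypothetical name.

```python
def io_requests(offsets, element_size, sieve_size=0):
    """Count file requests needed to touch elements at the given byte offsets.

    With no usable sieve buffer, each element is its own request.
    Otherwise each request covers a window of sieve_size bytes.
    """
    if sieve_size < element_size:
        return len(offsets)
    requests = 0
    window_end = -1
    for offset in sorted(offsets):
        if offset + element_size > window_end:
            requests += 1                      # start a new buffer window here
            window_end = offset + sieve_size
    return requests

def column_offsets(rows, cols, col, element_size):
    """Byte offsets of one column in a row-major 2-D dataset."""
    return [(r * cols + col) * element_size for r in range(rows)]

small = column_offsets(100, 100, 3, 4)      # column spans about 40 KB
large = column_offsets(1000, 100, 3, 4)     # column spans about 400 KB
print(io_requests(small, 4))                # 100
print(io_requests(small, 4, 64 * 1024))     # 1
print(io_requests(large, 4, 64 * 1024))     # 7
```

When the whole column fits inside one 64K window the model reproduces the single write shown on the slide; larger datasets need one request per window rather than one per element.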

  37. Performance tuning for contiguous dataset
      • Datatype conversion
        • Avoid for better performance
        • Use the H5Pset_buffer function to customize the conversion buffer size
      • Partial I/O
        • Write/read in big contiguous blocks
        • Use H5Pset_sieve_buf_size to improve performance for complex sub-setting
      • Caution:
        • The sieve buffer is allocated when the first write occurs and is released when the dataset is closed.
        • Memory will grow if there are a lot of open datasets.

  38. I/O for chunked dataset

  39. Recall: Chunked storage layout
      [Figure: application memory holds the dataset array data, divided into chunks A, B, C, D, and the metadata cache (MDC) with the dataset header and chunk index; the HDF5 file holds the dataset header, the chunk index, and the chunks, shown in the order C, D, B, A.]
      • Raw data is stored in separate chunks in the HDF5 file

  40. HDF5 chunking
      • The HDF5 library treats each chunk as an atomic object
        • Compression is applied to each chunk
        • Datatype conversion and other filters are applied per chunk
      • Chunk size greatly affects performance
        • Chunk overhead adds to file size
        • Chunk processing involves many steps
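Per-chunk compression means each chunk is an independent compressed stream, so one chunk can be read back without touching the others. The sketch below illustrates this with Python's `zlib` (the same deflate algorithm GZIP uses) on a flat byte buffer; it is a stand-in for the idea, not the HDF5 filter pipeline, and `compress_chunks` is a hypothetical helper.

```python
import zlib

def compress_chunks(raw, chunk_bytes, level=6):
    """Compress each fixed-size piece of raw independently."""
    return [
        zlib.compress(raw[i:i + chunk_bytes], level)
        for i in range(0, len(raw), chunk_bytes)
    ]

raw = bytes(4096)                       # 4 KB of zeros, highly compressible
chunks = compress_chunks(raw, 1024)
print(len(chunks))                      # 4

# Any single chunk can be decompressed on its own.
third = zlib.decompress(chunks[2])
print(len(third))                       # 1024
```

The flip side is that changing one element forces the whole chunk to be decompressed, modified, and compressed again, which is one reason chunk size matters so much.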

  41. HDF5 chunk cache
      • Chunk cache (general points, details later)
        • Caches chunks for better performance; remains allocated across multiple calls
        • Created for each chunked dataset
        • Size of the chunk cache is set per file (default size 1MB)
        • Each chunked dataset has its own chunk cache
        • A chunk may be too big to fit into the cache
        • Memory may grow if the application keeps opening datasets
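Why cache capacity matters can be seen with a toy model that counts how often a chunk has to be loaded from the file. The sketch uses a plain least-recently-used cache, which is a simplifying assumption: the real chunk cache uses hashing and its own eviction policy. `count_chunk_loads` is a hypothetical name.

```python
from collections import OrderedDict

def count_chunk_loads(accesses, cache_slots):
    """Count chunk loads for a sequence of chunk ids under an LRU cache."""
    cache = OrderedDict()
    loads = 0
    for chunk in accesses:
        if chunk in cache:
            cache.move_to_end(chunk)        # mark as most recently used
            continue
        loads += 1                          # chunk must be read from the file
        cache[chunk] = True
        if len(cache) > cache_slots:
            cache.popitem(last=False)       # evict the least recently used
    return loads

# Reading 4 rows of a dataset whose rows each cross chunks 0, 1 and 2.
row_by_row = [0, 1, 2] * 4
print(count_chunk_loads(row_by_row, 3))  # 3   every chunk is loaded once
print(count_chunk_loads(row_by_row, 2))  # 12  every access reloads a chunk
```

In this model a cache that is one chunk too small is no better than having no cache at all for this access pattern.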

  42. HDF5 chunk cache
      [Figure: application memory contains the metadata cache (MDC), holding the dataset header and chunking B-tree nodes, and the chunk caches (one per dataset). Label in figure: "Default size is 1MB".]

  43. Writing chunked dataset
      [Figure: in application memory space, chunks A, B, C of the chunked dataset pass through the conversion buffer into the chunk cache, then through the filter pipeline into the HDF5 file.]
      • Datatype conversion is performed before a chunk is placed in the cache
      • A chunk is written when it is evicted from the cache
      • Compression and other filters are applied on eviction

  44. Partial I/O for chunked dataset

  45. Partial I/O for chunked dataset
      • Example: write the green subset of the dataset, converting the data
      • Dataset is stored as six chunks in the file.
      • The subset spans four chunks, numbered 1-4 in the figure.
      • Hence four chunks must be written to the file.
      • But first, the four chunks must be read from the file, to preserve those parts of each chunk that are not to be overwritten.
      [Figure: a dataset of six chunks; the green subset overlaps the chunks numbered 1 to 4.]

  46. Partial I/O for chunked dataset
      • For each of the four chunks:
        • Read the chunk from file into the chunk cache, unless it’s already there.
        • Determine which part of the chunk will be replaced by the selection.
        • Move those elements to the conversion buffer and perform conversion
        • Move data elements to write from the application buffer to the conversion buffer
        • Move those elements back from the conversion buffer to the chunk cache.
        • Apply filters (compression) when the chunk is flushed from the chunk cache
      • For each element, 3 memcopies are performed
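Which chunks a selection touches, and whether each one has to be read before it is rewritten, follows from integer division of the selection bounds by the chunk size. In the sketch below the 8 x 12 dataset, the 4 x 4 chunks (six in total) and the selection are assumed values chosen to reproduce the "four of six chunks" situation; the talk gives no actual dimensions, and both function names are hypothetical.

```python
from itertools import product

def chunks_touched(start, count, chunk_dims):
    """Chunk grid indices overlapped by a rectangular selection (count > 0)."""
    ranges = [
        range(s // c, (s + n - 1) // c + 1)
        for s, n, c in zip(start, count, chunk_dims)
    ]
    return list(product(*ranges))

def is_fully_covered(chunk_index, start, count, chunk_dims):
    """True if the selection overwrites every element of the chunk."""
    return all(
        s <= i * c and (i + 1) * c <= s + n
        for i, s, n, c in zip(chunk_index, start, count, chunk_dims)
    )

chunk_dims = (4, 4)                  # 8 x 12 dataset -> 2 x 3 = 6 chunks
start, count = (2, 2), (4, 5)        # rows 2..5, columns 2..6

touched = chunks_touched(start, count, chunk_dims)
print(touched)   # [(0, 0), (0, 1), (1, 0), (1, 1)]

# A partially covered chunk must be read first so the untouched
# elements survive the rewrite.
print([is_fully_covered(t, start, count, chunk_dims) for t in touched])
# [False, False, False, False]
```

Aligning selections with chunk boundaries makes `is_fully_covered` true for more chunks, which is the case where no existing chunk data needs to be preserved.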

  47. Partial I/O for chunked dataset
      [Figure: data moves by memcopy between application memory, the conversion buffer, and the chunk in the chunk cache (3 memcopies in total); the chunk is then compressed and written to the HDF5 file.]

  48. I/O for variable-length dataset

  49. Examples of variable length data
      • String
        A[0]   “the first string we want to write”
        …
        A[N-1] “the N-th string we want to write”
      • Each element is a record of variable length
        A[0] (1,1,0,0,0,5,6,7,8,9) [length = 10]
        A[1] (0,0,110,2005) [length = 4]
        …
        A[N] (1,2,3,4,5,6,7,8,9,10,11,12,…,M) [length = M]

  50. Variable length data in HDF5
      • Variable length description in HDF5 application
        typedef struct { size_t len; void *p; } hvl_t;
      • Base type can be any HDF5 type
        H5Tvlen_create(base_type)
      • ~ 20 bytes overhead for each element
      • Data cannot be compressed
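The length-plus-pointer description can be mirrored with Python's `ctypes` to show what an application-side array of variable-length records looks like in memory. This is a sketch of the memory layout only and does not call HDF5; the length field is named `len` in the HDF5 C headers, and the helper names `make_vl_array` and `read_record` are hypothetical.

```python
import ctypes

class hvl_t(ctypes.Structure):
    """Mirror of the C struct: element count plus pointer to the data."""
    _fields_ = [("len", ctypes.c_size_t), ("p", ctypes.c_void_p)]

def make_vl_array(records):
    """Build an array of hvl_t describing lists of C ints.

    The per-record buffers are returned too, because the hvl_t entries
    only point at them and do not keep them alive.
    """
    buffers = [(ctypes.c_int * len(r))(*r) for r in records]
    elements = (hvl_t * len(records))()
    for elem, buf in zip(elements, buffers):
        elem.len = len(buf)
        elem.p = ctypes.addressof(buf)
    return elements, buffers

def read_record(elem):
    """Follow the pointer and return the record as a Python list."""
    ptr = ctypes.cast(elem.p, ctypes.POINTER(ctypes.c_int))
    return ptr[:elem.len]

elements, buffers = make_vl_array([[1, 1, 0, 0, 0, 5, 6, 7, 8, 9],
                                   [0, 0, 110, 2005]])
print(elements[0].len, elements[1].len)  # 10 4
print(read_record(elements[1]))          # [0, 0, 110, 2005]
```

Each record lives in its own buffer, so writing such an array means following every pointer, which is part of why variable-length I/O is handled differently from fixed-size data.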
