File Formats, Conventions, and Data Level Interoperability

File Formats, Conventions, and Data Level Interoperability. ESDSWG New Orleans, Oct 20, 2010 Joe Glassy, Chris Lynnes ESDSWG Tech Infusion. Introduction & overview. Outline of objectives: Discuss role of standard, self-describing “File formats” in data level interoperability


Presentation Transcript


  1. File Formats, Conventions, and Data Level Interoperability. ESDSWG, New Orleans, Oct 20, 2010. Joe Glassy, Chris Lynnes, ESDSWG Tech Infusion

  2. Introduction & overview • Outline of objectives: • Discuss the role of standard, self-describing “file formats” in data level interoperability • Summarize common file formats in use, their properties, & benefits: “data life-cycle economics” • Discuss criteria for choosing a file format and matching it to the needs of consumers and producers • Discuss the critical role of conventions – any file format needs good recipes to make it interoperable! • Examples: NASA Measures F/T, SMAP, AIRS, Aura

  3. Role(s) Of File Formats in Interoperability • File formats represent versatile “packages” for multi-dimensional science data and metadata. • Offer self-describing “well-known structures” to codify desired, common conventions and practices. • Offer well-documented reference cases to encapsulate specific data models. • Standard file formats dock with format-aware tools to offer users a seamless end-to-end experience and platform portability • Enhance Mission-to-Mission continuity

  4. …investment → life-cycle economics…

  5. Why (and how) are file formats important? • Standard formats • Come with thorough documentation • Provide good reference implementations • Common formats • More datasets in a format → more tools that read that format • Canonical structures and names → general purpose handlers for coordinates, etc. → smarter tools

  6. A generic workflow… • Consider user community needs and culture, fit within architecture, institutional policies & preferences • Choose a standard file format (or sub-variant) • Design a convention-enabled, specific internal layout with metadata interfaces • Prototype: implement a prototype and evaluate it • Implement in production context • Integrate within discovery and catalog environments (catalog interoperability…)

  7. Examples of standard file formats • HDF5 – a file format on its own, as well as a broad foundation for others • netCDF v4 (stable at v4.1.1, newest: v4.1.2-beta1) • v4 Classic (widespread adoption, some limitations…) • v4 Enhanced (supports groups, user-defined types, variable-length types, and more) • netCDF v3 Classic (legacy+, tools+, but limited) • HDFEOS2, HDFEOS5 – EOS Terra, Aqua, Aura… • HDF4 – legacy, extensive use by MODIS Terra, Aqua • Many other domain-specific, less generic formats abound… (need transform tools to/from HDF?)

  8. Some selection criteria… • Do the file format’s capabilities support the required functionality? • What is the breadth of acceptance and adoption within the larger community? (and/or, does institutional policy dictate a specific format?) • Presence and quality of documentation (reference, examples and especially tutorials), API software, and community support? • Contribution to investment, data life-cycle economics? • What is the level of standardization? • Adaptability of the format to widely used conventions like CF 1.x, or other accepted convention(s)?

  9. Internal Layout / Design (once format is chosen & adopted…) • Define & refine high-level organization / structure • /DATA • /METADATA • Distinguish ‘data’ from ‘metadata’, core structure vs. ‘attributes’ • Dimensions, coordinate variables, projection attributes • Missing_data, _FillValue vs. internal fill value • Units, gain, offset, min, max, range, etc. • Prototype it! • Leverage script environments (Python h5py, PyTables, etc.) • Panoply, HDFView also quick, useful for prototyping, feedback

  10. Using “Groups” • HDF5 (and NetCDF v4 Enhanced) support full use of groups, e.g. /DATA vs. /METADATA, etc. • Groups are useful for partitioning out functionally related sets of data or attributes; the hierarchical view mimics a file system • Facilitates appropriate information hiding: highlights needed info, shields the rest (principle of least privilege…) • Well supported by modern tools (Panoply, HDFView, PyTables, h5py) and low-level APIs.
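The /DATA vs. /METADATA layout described in slides 9 and 10 can be prototyped in a few lines of h5py. This is a minimal sketch, not the layout of any actual product: the grid size, class codes, attribute values, and the file name `prototype.h5` are all illustrative assumptions, and the file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy as np


def build_prototype():
    """Create an in-memory HDF5 file with /DATA and /METADATA groups."""
    # driver="core" with backing_store=False keeps the file in memory only.
    f = h5py.File("prototype.h5", "w", driver="core", backing_store=False)

    # Global attributes; the values are placeholders.
    f.attrs["Conventions"] = "CF-1.4"
    f.attrs["title"] = "Prototype granule (illustrative)"

    data = f.create_group("DATA")
    meta = f.create_group("METADATA")

    # Hypothetical 4x8 grid of class codes; 255 marks cells with no data.
    values = np.full((4, 8), 255, dtype="uint8")
    values[1:3, 2:6] = 1
    ft = data.create_dataset("FT_SSMI", data=values, fillvalue=255)
    ft.attrs["long_name"] = "freeze/thaw state (illustrative)"
    ft.attrs["_FillValue"] = np.uint8(255)

    meta.attrs["institution"] = "placeholder institution"
    return f


f = build_prototype()
names = []
f.visit(names.append)  # collect every group and dataset path
print(sorted(names))
```

Opening the same structure in Panoply or HDFView (after writing it to disk instead of memory) is a quick way to get feedback on the layout before committing to production code.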

  11. Example(s) of File Formats In Action • HDF5 – NASA Measures • NASA Measures Freeze/Thaw (soon available at NSIDC) • http://measures.ntsg.umt.edu/sample_2007_day180.zip • AQUA AIRS Level 2 (from earlier talk): • http://airspar1u.ecs.nasa.gov/opendap/Aqua_AIRS_Level2/AIRX2RET.005/2010/285/AIRS.2010.10.12.090.L2.RetStd.v5.2.2.0.G10286064818.hdf • Aura TES (TES-Aura_L3-CH4_r0000002135_F01_05.he5)

  12. Example: NASA Measures Freeze/Thaw, Daily in HDF5 Metadata Block: Attributes

  13. Example: NASA Measures Daily Freeze/Thaw in HDF5 Data Variable (FT_SSMI) and Attributes

  14. Example: NASA Level 2 AIRS (Swath) in HDF4

  15. Example: NetCDF, (tos) Sea surface temperatures collected by PCMDI for use by the IPCC, illustrating CF v1.0 layout

  16. Example: TES (HDFEOS5) illustrating CF v1.0 layout

  17. CF Conventions & file formats: how they contribute to interoperability • CF v1.4.x: the term “CF” is now broader than just climate and forecasting! • Standard Name Table: a step towards wider adoption of names, controlled vocabularies, units terminology • CF v1.4.x provides tool-makers with helpful “lingua franca” guidance. • Within a file format, adopting conventions like CF promotes common layout, names, and semantics for dataset-to-dataset compatibility, a key to wider data level interoperability.
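To illustrate why shared names and units matter to tool-makers, here is a minimal sketch of a general-purpose coordinate finder. It assumes each dataset is represented as a plain dictionary of variable name to attributes (a stand-in for whatever a real reader library would return) and relies on the CF practice of identifying latitude and longitude through `standard_name` or through units such as `degrees_north` and `degrees_east`. The function name and the two sample datasets are hypothetical.

```python
# Unit spellings that CF accepts for latitude and longitude coordinates.
LAT_UNITS = {"degrees_north", "degree_north", "degree_N",
             "degrees_N", "degreeN", "degreesN"}
LON_UNITS = {"degrees_east", "degree_east", "degree_E",
             "degrees_E", "degreeE", "degreesE"}


def find_coordinate(variables, standard_name, accepted_units):
    """Return the first variable matching by standard_name or by units."""
    for name, attrs in variables.items():
        if attrs.get("standard_name") == standard_name:
            return name
        if attrs.get("units") in accepted_units:
            return name
    return None


# Two hypothetical datasets that name their coordinates differently.
dataset_a = {
    "lat": {"units": "degrees_north"},
    "lon": {"units": "degrees_east"},
    "tos": {"standard_name": "sea_surface_temperature", "units": "K"},
}
dataset_b = {
    "Latitude": {"standard_name": "latitude", "units": "degrees_north"},
    "Longitude": {"standard_name": "longitude", "units": "degrees_east"},
}

for ds in (dataset_a, dataset_b):
    print(find_coordinate(ds, "latitude", LAT_UNITS),
          find_coordinate(ds, "longitude", LON_UNITS))
```

The same handler works on both datasets without knowing their variable names in advance, which is the dataset-to-dataset compatibility the convention is meant to provide.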

  18. Attributes vs. Metadata? One man’s ceiling is another man’s floor… • Collection level vs. data set vs. granule level • Structural vs. science-content • Swath vs. grid vs. point • Commonly used attributes: • Conventions attribute, communicates which convention was used • Basic globals: title, history, institution, source, references • Coordinate variables, axis, formula_terms • units, _FillValue, missing_data, valid_range • short_name, long_name, other provenance • (gain, offset / scale_factor, add_offset), etc.
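As a worked example of how the packing attributes listed above are typically consumed, the sketch below applies the CF-style rule `physical = raw * scale_factor + add_offset` and treats cells equal to `_FillValue` as missing. The function name and the sample numbers are illustrative assumptions; some products that use gain/offset attributes define the conversion differently, so the product documentation should be checked before reusing this.

```python
def unpack(raw_values, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """Convert packed values to physical values; fill cells become None."""
    unpacked = []
    for raw in raw_values:
        if fill_value is not None and raw == fill_value:
            unpacked.append(None)  # missing cell, no physical value
        else:
            unpacked.append(raw * scale_factor + add_offset)
    return unpacked


# Hypothetical attributes as they might be read from a variable.
attrs = {"scale_factor": 0.5, "add_offset": 250.0, "_FillValue": -9999}
raw = [0, 10, -9999, 40]

physical = unpack(raw, attrs["scale_factor"], attrs["add_offset"],
                  attrs["_FillValue"])
print(physical)  # [250.0, 255.0, None, 270.0]
```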

  19. Challenges? (just a few remain…) • Evolution, bifurcation, and asymmetric support can result in occasional user confusion: • HDF5 v1.8.x vs. v1.6.x families? • NetCDF v4 Enhanced vs. NetCDF v4 Classic vs. v3? • HDFEOS5 vs. HDFEOS2? • Both GUI tool and API support tend to vary by platform (Linux, Mac, Win7) and sub-flavor… • Multi-library dependency stacks beg for a fully bundled, version-matched end-to-end install package! • The conventions community (CF v1.4.x) and metadata standards communities are also in motion (but that’s good too…)

  20. Resources : URLs • Climate Forecast (CF) Conventions (now at 1.4.x): • http://cf-pcmdi.llnl.gov/ • http://cf-pcmdi.llnl.gov/documents/cf-conventions • HDF: • http://www.hdfgroup.org/HDF5/doc/index.html • HDFEOS • http://www.hdfgroup.org/hdfeos.html • http://hdfeos.org/software/aug_hdfeos5.php • NetCDF: • http://www.unidata.ucar.edu/software/netcdf/ • http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html • General: • http://www.oceanteacher.org/OTMediawiki/index.php/Self-Describing_Formats • http://en.wikipedia.org/wiki/List_of_file_formats

  21. Resources: File format related Tools • Panoply: http://www.giss.nasa.gov/tools/panoply/ • HDFView: http://www.hdfgroup.org/hdf-java-html/hdfview/ • OpenDAP: http://opendap.org • IDV: http://www.unidata.ucar.edu/software/idv/ • McIDAS: http://www.unidata.ucar.edu/software/mcidas/ • Python: • h5py : http://code.google.com/p/h5py/, http://h5py.alfven.org/, • PyTables: http://www.pytables.org/moin • Perl: PDL-IO-HDF5, and Biohdf? • Many others: HEG, MTD, HDFEOS plug-in for HDFview, HDFLook, (ncdump, h5dump, and cousins), GRADS, Matlab, binary APIs

  22. A provisional DOI, UUID Strategy • What we used for NASA Measures Freeze/Thaw, daily (v2), just delivered: • DOI: assigned to our reference paper by IEEE Transactions on Geoscience and Remote Sensing • UUID recipe: seedString = www.our.url/GranuleName/Datetime8601Stamp; import uuid; granule_uuid = uuid.uuid5(uuid.NAMESPACE_URL, seedString) (uuid5 requires a namespace argument; NAMESPACE_URL is assumed here)
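The UUID recipe on this slide can be sketched as runnable Python. Python's `uuid.uuid5` takes a namespace and a name; the slide does not say which namespace was used, so `uuid.NAMESPACE_URL` is an assumption here, as are the function name and the sample URL, granule name, and timestamp.

```python
import uuid


def granule_uuid(base_url, granule_name, timestamp):
    """Build a name-based (version 5) UUID from URL, granule name, and time."""
    # Seed string follows the slide's pattern: url/GranuleName/ISO-8601 stamp.
    seed = f"{base_url}/{granule_name}/{timestamp}"
    # NAMESPACE_URL is an assumed choice; the slide does not specify one.
    return uuid.uuid5(uuid.NAMESPACE_URL, seed)


# Hypothetical inputs for illustration only.
first = granule_uuid("www.our.url", "sample_2007_day180", "2007-06-29T00:00:00Z")
again = granule_uuid("www.our.url", "sample_2007_day180", "2007-06-29T00:00:00Z")
other = granule_uuid("www.our.url", "sample_2007_day181", "2007-06-30T00:00:00Z")

print(first)
print(first == again)  # same seed string gives the same UUID
```

Because version 5 UUIDs are derived by hashing the seed, regenerating a granule with the same name and timestamp reproduces the same identifier, while any change to the seed yields a different one.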
