SPG Recommended File Formats and Metadata


Presentation Transcript


  1. SPG Recommended File Formats and Metadata Richard Ullman SPG

  2. Describing data is challenging ES-DSWG New Orleans

  3. [Overview slide of format logos: HDF5; netCDF Classic Data Model; ISO 19115 Geographic Information: Metadata]

  4. HDF5 STANDARD

  5. HDF5 STANDARD
     • HDF5 is:
       • A scientific data model
       • An API (C or Fortran; may link to other languages)
       • A file format
     • The HDF5 data model provides structures for almost any kind of scientific data structure or collection of structures:
       • Dataset: a multidimensional array of data of a specified type
       • Group: a hierarchical structure element; groups may contain groups or datasets
       • Attribute: a descriptive element attached to groups or datasets
     • HDF5 is developed, maintained and serviced by The HDF Group as open source and may be used at no cost.
     • NASA is only one of the federal agencies that actively support The HDF Group, a not-for-profit, through contracts or grants.
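The three model elements on this slide (group, dataset, attribute) can be illustrated with a minimal sketch. This is not the HDF5 API: the `Group` and `Dataset` classes, the `lookup` helper, and the sample names and values are all hypothetical, chosen only to show how a hierarchy of groups holds typed arrays with attached attributes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Dataset:
    """A multidimensional array of one element type, stored flat with a shape."""
    shape: tuple
    dtype: str
    values: List[float]
    attrs: Dict[str, str] = field(default_factory=dict)  # descriptive attributes


@dataclass
class Group:
    """A hierarchical container that may hold other groups or datasets."""
    attrs: Dict[str, str] = field(default_factory=dict)
    members: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)

    def lookup(self, path: str) -> Union["Group", Dataset]:
        """Resolve a slash-separated path such as 'swath/temperature'."""
        node: Union[Group, Dataset] = self
        for name in path.strip("/").split("/"):
            if not isinstance(node, Group):
                raise KeyError(f"{name!r} is below a dataset, not a group")
            node = node.members[name]
        return node


# Hypothetical file layout: root group -> 'swath' group -> 'temperature' dataset.
root = Group(attrs={"title": "example file"})
root.members["swath"] = Group()
root.members["swath"].members["temperature"] = Dataset(
    shape=(2, 3),
    dtype="float32",
    values=[271.0, 272.5, 273.1, 270.4, 269.9, 274.2],
    attrs={"units": "K"},
)

temperature = root.lookup("swath/temperature")
print(temperature.shape, temperature.attrs["units"])
```

In a real HDF5 file the same roles are played by objects created through the library; the sketch only mirrors the relationships between them.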

  6. HDF5 Strengths and Applicability STANDARD
     • Strengths:
       • Portable, cross-platform data and library.
       • Supported by many applications.
       • HDF and HDF-EOS data formats, software libraries and APIs have been widely used for NASA’s mission data for many years.
       • HDF5 is the current or planned data format for several NASA missions, totaling many tens of terabytes of data.
       • Simple data model elements (group, dataset, attribute) can represent complex data relationships. Examples: images, tables, multidimensional arrays, composite records, and user-defined data types.
       • Self-describing through the API.
       • Designed for high-performance computing and I/O:
         • Tuning options include compression, chunking, and bit-packing
         • Data users read only the data they need, not the whole file
     • Applicability:
       • Used for production, archive, distribution, and analysis
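The point that users read only the data they need follows from chunked storage: a request for a sub-region touches only the chunks that overlap it. The sketch below shows that index arithmetic under stated assumptions; the function name `chunks_for_selection`, the 1000 x 1000 dataset, and the 100 x 100 chunk shape are hypothetical and are not taken from the HDF5 library.

```python
from itertools import product


def chunks_for_selection(start, stop, chunk_shape):
    """Return the grid indices of chunks overlapping the half-open box [start, stop)."""
    ranges = [
        # First and last chunk index touched along this axis.
        range(lo // c, (hi - 1) // c + 1)
        for lo, hi, c in zip(start, stop, chunk_shape)
    ]
    return list(product(*ranges))


# Hypothetical 1000 x 1000 dataset stored as 100 x 100 chunks (100 chunks in total).
needed = chunks_for_selection(start=(150, 220), stop=(260, 300), chunk_shape=(100, 100))
print(needed)  # -> [(1, 2), (2, 2)]
```

Here the selection touches 2 of the 100 chunks, so only those would need to be read and decompressed.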

  7. HDF5 Weaknesses and Limitations STANDARD
     • Weaknesses:
       • Potentially complex and unfamiliar to general science users
       • However, this is mitigated by:
         • Quality of documentation and help desk support.
         • Third-party tools with HDF5 support that help hide complexity.
         • Growing familiarity among NASA Earth observation data users.
     • Limitations:
       • Loss of backwards compatibility with HDF4 and earlier versions. Though HDF5 was released in 2000, ten years later several NASA missions are still actively producing data using HDF4.

  8. HDF-EOS5 STANDARD

  9. HDF-EOS5 STANDARD
     • HDF for EOS (HDF-EOS) is a software library.
     • It implements a PROFILE layer on top of HDF.
     • HDF-EOS is provided for HDF4 and HDF5, but it is the HDF5 version that was explicitly reviewed.
     • Specific data structures (SWATH, GRID, ZONAL AVERAGE, and POINT) are constructed from standard HDF5 data objects.
     • A key feature is a standard association between geolocation data and science data through internal structural metadata. The relationship between geolocation and science data is transparent to the end user using the API.
     • Instrument- and data-type-independent services, such as subsetting by geolocation, can be applied to files across a wide variety of data products through the same library interface.
     • HDF-EOS is developed and maintained for the ESDIS project as part of the EOSDIS Core System by Raytheon Company.
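Subsetting by geolocation depends on each science value being tied to a latitude and longitude. The sketch below illustrates the idea with plain Python lists; it is not the HDF-EOS API. The function `subset_by_box` and the sample coordinates are hypothetical, and the sketch assumes a bounding box that does not cross the antimeridian.

```python
def subset_by_box(lats, lons, values, south, north, west, east):
    """Keep samples whose geolocation falls inside the bounding box (inclusive)."""
    return [
        (lat, lon, value)
        for lat, lon, value in zip(lats, lons, values)
        if south <= lat <= north and west <= lon <= east
    ]


# Hypothetical swath: one latitude, longitude and science value per sample.
lats = [10.0, 20.0, 30.0, 40.0]
lons = [-100.0, -95.0, -90.0, -85.0]
values = [1.0, 2.0, 3.0, 4.0]

selected = subset_by_box(lats, lons, values, south=15.0, north=35.0, west=-96.0, east=-88.0)
print(selected)  # -> [(20.0, -95.0, 2.0), (30.0, -90.0, 3.0)]
```

Because the association between geolocation and data is standardized in the profile, the same kind of selection can be written once and applied across products, which is the benefit the slide describes.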

  10. HDF-EOS5 Strengths and Applicability STANDARD
      • Strengths:
        • Widespread use of HDF-EOS by NASA: many terabytes and thousands of users.
        • HDF-EOS5 inherits the benefits of HDF5 (see above).
        • Using the HDF-EOS API is much easier than using HDF5 directly.
        • The HDF-EOS library enforces adherence to a specific HDF profile. By using the HDF-EOS library, developers create files which have a specific format.
        • The EOS API hides most of the differences between the lower-level HDF4 and HDF5 implementations, which allows easy user migration.
        • Source code for writing and reading data in the format is publicly available.
        • Because HDF-EOS5 is a profile of HDF5, files are readable by the HDF5 library and by all tools that support HDF5.
      • Applicability:
        • Suitable for archive, distribution or as an analysis format.

  11. HDF-EOS5 Weaknesses and Limitations STANDARD
      • Weaknesses:
        • Potentially complex and unfamiliar to general science users
        • However, this is mitigated by:
          • Quality of documentation and help desk support.
          • Third-party tools with HDF5 support that help hide complexity.
          • Growing familiarity among NASA Earth observation data users.
        • The HDF-EOS5 layer consists of multiple libraries maintained by different organizations.
        • A profile is not a panacea: mission implementers do not always implement data products in the same way.
      • Limitations:
        • HDF-EOS5 necessarily lags behind the parent HDF5 format.
        • HDF-EOS5 is not supported by many third-party applications such as IDL and Matlab. However, HDF-EOS5 data can be read with the HDF5 interfaces that are more frequently supported.
        • HDF-EOS does not allow for parallel I/O.

  12. netCDF Classic

  13. netCDF Classic STANDARD
      • NetCDF:
        • Array-oriented scientific data model
        • A collection of libraries implementing access to the data model
        • API supports creation, access, and sharing of scientific data
        • Self-describing, portable, direct-access, appendable
      • Classic model:
        • Variables hold data values.
        • Dimensions are used to specify variable shapes, common grids, and coordinate systems. A dimension has a name and a length.
        • Attributes contain information about properties of a variable or of an entire data set.
      • netCDF is developed, maintained and serviced by Unidata under NSF funding.
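The classic model's three parts (dimensions, variables, attributes) fit together as in the following minimal sketch. It is not the netCDF API: the `ClassicFile` and `Variable` classes, the `shape_of` helper, and the sample names are hypothetical. It shows only that a variable's shape is derived from the named dimensions it uses, and that attributes attach either to a variable or to the whole data set.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Variable:
    dims: Tuple[str, ...]  # names of the dimensions that define the shape
    dtype: str
    attrs: Dict[str, str] = field(default_factory=dict)  # per-variable attributes


@dataclass
class ClassicFile:
    dimensions: Dict[str, int] = field(default_factory=dict)  # name -> length
    variables: Dict[str, Variable] = field(default_factory=dict)
    attrs: Dict[str, str] = field(default_factory=dict)  # whole-data-set attributes

    def shape_of(self, name: str) -> Tuple[int, ...]:
        """Derive a variable's shape from the lengths of its dimensions."""
        return tuple(self.dimensions[d] for d in self.variables[name].dims)


# Hypothetical data set: a temperature field on a shared time/lat/lon grid.
ds = ClassicFile(
    dimensions={"time": 4, "lat": 3, "lon": 5},
    attrs={"title": "example data set"},
)
ds.variables["lat"] = Variable(dims=("lat",), dtype="float", attrs={"units": "degrees_north"})
ds.variables["temp"] = Variable(dims=("time", "lat", "lon"), dtype="float", attrs={"units": "K"})

print(ds.shape_of("temp"))  # -> (4, 3, 5)
print(ds.shape_of("lat"))   # -> (3,)
```

Sharing the `lat` dimension between the coordinate variable and the data variable is what lets several variables sit on a common grid.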

  14. netCDF Classic Strengths and Applicability STANDARD
      • Strengths:
        • Fosters data interoperability and exchange through its self-describing file format, platform-independent architecture, and robust access methods.
        • The overall file format and metadata attributes are simple enough to be easily understood yet robust enough to describe and store many kinds of Earth science data.
        • Widely used in the NASA stakeholder community, both within NASA and among NSF and NOAA investigators. The netCDF user community is international in scope.
      • Applicability:
        • Science data; in widespread use in the atmospheric, oceanic and climate modeling sciences, with tens of terabytes of data and thousands of users.

  15. netCDF Classic Weaknesses and Limitations STANDARD
      • Weaknesses:
        • The classic model’s simplicity means that some complex datasets may be difficult to represent, though in practice this appears to be rare.
        • No support for internal compression of data variables
        • Limitation on size of arrays (about 2 GB)
        • No support for 64-bit integers
      • Limitations:
        • Limitations on file sizes and absence of internal compression
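The array size limit is easy to check with simple arithmetic before choosing the format. The sketch below assumes the slide's "about 2 GB" can be read as 2 GiB; the constant, the helper names, and the example grid are hypothetical, and the exact limit in a given netCDF version may differ.

```python
import math

# Assumption: the slide's "about 2 GB" is read here as 2 GiB.
CLASSIC_ARRAY_LIMIT_BYTES = 2 * 1024**3

# Bytes per element for a few types; 64-bit integers are not listed
# because the classic model does not support them.
ITEM_SIZES = {"int16": 2, "int32": 4, "float32": 4, "float64": 8}


def array_bytes(shape, dtype):
    """Uncompressed size of an array: number of elements times bytes per element."""
    return math.prod(shape) * ITEM_SIZES[dtype]


def fits_classic_limit(shape, dtype):
    return array_bytes(shape, dtype) < CLASSIC_ARRAY_LIMIT_BYTES


# Hypothetical variable: one year of daily 0.25-degree global grids.
daily_grid = (365, 720, 1440)
print(array_bytes(daily_grid, "float32"), fits_classic_limit(daily_grid, "float32"))
# -> 1513728000 True
print(array_bytes(daily_grid, "float64"), fits_classic_limit(daily_grid, "float64"))
# -> 3027456000 False
```

Since the classic model has no internal compression, the uncompressed size is the size that counts against the limit.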

  16. ICARTT Data Format

  17. ICARTT Data Format STANDARD
      • ICARTT: International Consortium for Atmospheric Research on Transport and Transformation.
      • Developed to fulfill the data management and broad collaborative research needs of the ICARTT campaign in 2004.
      • Text-based file format composed of a header section (metadata) with critical data description information (e.g., data source, uncertainties, contact information, and a brief overview of the measurement technique) and a data section.
      • Originally designed for airborne data, the ICARTT format proved practical for other mobile and ground-based studies and various data types.
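The header-plus-data layout can be sketched as a small parser. The sketch assumes, as in the ICARTT format, that the first line gives the number of header lines followed by a format index, and that the last header line lists the column names; the sample text is a heavily shortened, hypothetical stand-in rather than a conformant ICARTT file, and `split_icartt` is an illustrative name.

```python
def split_icartt(text):
    """Split a text file into (header lines, data rows) using the header-line count."""
    lines = text.splitlines()
    n_header = int(lines[0].split(",")[0])  # first field: number of header lines
    header = lines[:n_header]
    # Assumption: the last header line holds the comma-separated column names.
    columns = [name.strip() for name in header[-1].split(",")]
    rows = [
        dict(zip(columns, (float(x) for x in line.split(","))))
        for line in lines[n_header:]
        if line.strip()
    ]
    return header, rows


# Hypothetical, shortened sample: 4 header lines, then 2 data records.
sample = """\
4, 1001
Hypothetical PI name
Hypothetical instrument description
Time_Start, Altitude, O3
0, 1500.0, 41.2
60, 1525.0, 40.8
"""

header, rows = split_icartt(sample)
print(len(header), rows[0])
# -> 4 {'Time_Start': 0.0, 'Altitude': 1500.0, 'O3': 41.2}
```

Because the whole file is text, a reader must scan it from the top, which is the source of the random-access and load-time limitations noted for this format.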

  18. ICARTT Data Format Strengths and Applicability STANDARD
      • Strengths:
        • Easy-to-use standard approach to sharing airborne instrument data sets.
        • Proven to facilitate broad collaborative scientific research among the airborne measurement, atmospheric chemistry modeling, and satellite communities.
        • Widely accepted and used by NASA, NOAA, and the British, French, and German airborne science programs.
        • Tools are available to verify conformance to this file format and to visualize the data.
        • The format is valuable for assuring interoperability between different user groups without regard to the sensor performing the measurements.
      • Applicability:
        • Airborne field campaigns

  19. ICARTT Data Format Weaknesses and Limitations STANDARD
      • Weaknesses:
        • As an ASCII-based format, it is not as efficient as binary formats.
        • Not suitable for large 3-dimensional data sets.
        • Tools are not as widely available as for the preceding standards.
      • Limitations:
        • Not suitable for large data sets.
        • The file format includes metadata, but the metadata itself does not follow any other standard, which may limit the format's use in large automated systems.
        • Not as quick to load as binary files.
        • ASCII files do not allow random access to data.

  20. Aura File Format

  21. Aura File Conventions NOTE
      • Describes a PROFILE of the HDF-EOS5 PROFILE of HDF5:
        • File naming convention
        • Grouping structure per HDF-EOS5
        • Dataset names, data types and dimension order
        • Attribute names and types at file, group, and dataset levels
      • Adopted by all four instruments on the EOS Aura satellite (HIRDLS, MLS, OMI, and TES); can be used by other atmospheric chemistry instruments.
      • Developed collaboratively by the Aura instrument science teams.
      • The Aura teams have provided these conventions in the hope that future atmospheric chemistry missions can use this technical note to give their users the benefit of a common data and file format. Missions that build upon the guidelines described here will also benefit from being able to read Aura data files easily.

  22. Aura File Conventions Strengths and Applicability NOTE
      • Strengths:
        • Comprehensive documentation of the data file format and organization agreed to and implemented by the Aura instrument teams, including: major HDF-EOS version; organization of geolocation and data files and attributes; data file names, data types and dimension ordering; units for geolocation and data files; attribute names, values, and units.
      • Applicability:
        • Designed for environmental remote sensing satellite atmospheric observation datasets. Could be extended to other datasets using similar principles.

  23. Aura File Conventions Weaknesses and Limitations NOTE
      • Weaknesses:
        • Units specified in the RFC do not conform to SI conventions for representation, and in some cases there are inconsistencies in units among the different data fields.
      • Limitations:
        • The profile, though developed for interoperability, is specific to the Aura observatory. Future missions will need to extend the guidelines for unique datasets.

  24. GCMD DIF

  25. GCMD DIF STANDARD
      • The Global Change Master Directory (GCMD) Directory Interchange Format (DIF).
      • A specific set of metadata attributes describes datasets.
      • The DIF defines dataset entries in the GCMD, one of the largest public metadata inventories in the world.
      • The GCMD’s primary responsibility is to maintain a complete catalog of all NASA’s Earth science data sets and services. The project also serves as one of NASA’s contributions to the international Committee on Earth Observation Satellites (CEOS), as the NASA node of the CEOS International Directory Network (IDN).
      • Using DIF helps to “normalize” the search for data sets through the use of several alternative search engines.

  26. GCMD DIF Strengths and Applicability STANDARD
      • Strengths:
        • Mature and widely used specification. The DIF is actively used and maintained by a large variety of organizations.
        • The DIF metadata model is compatible with the ISO 19115 metadata model.
        • Has been a NASA standard since 1994.
      • Applicability:
        • Describes directory or “collection” level metadata for Earth science data sets. It is not designed for use as “granule” or “inventory” metadata.
        • DIF is primarily a metadata interchange format. Data holders often manage metadata in different internal formats.

  27. GCMD DIF Weaknesses and Limitations STANDARD
      • Weaknesses:
        • While the DIF is compatible with ISO 19115, there is a feeling among some metadata providers that the DIF should be superseded by ISO 19115.
        • Data producers must populate the DIF independently of ECHO or their own data inventory. This extra step is seen by some as a weakness.
      • Limitations:
        • The DIF is a “directory” rather than an “inventory”. ECHO provides both directory and inventory metadata, so ECHO may supersede DIF.

  28. ECHO Data Model

  29. ECHO Metadata Model NOTE
      • The Earth Observing System (EOS) Clearinghouse (ECHO) Metadata Model RFC defines the metadata requirements and recommendations for ingesting Earth science metadata into the ECHO system.
      • Three metadata constructs are used by the ECHO system: collection, granule, and browse.
      • For each metadata type, the minimum metadata fields required to validate against the ECHO Ingest schema are outlined.
      • In addition, a list of recommended metadata fields that should be included in data ingested into ECHO is provided.

  30. ECHO Metadata Model Strengths and Applicability NOTE
      • Strengths:
        • ECHO is a highly successful, operational, enterprise-level metadata repository with over 3000 data collections and about 87 million data granules. It currently contains metadata holdings exported by all NASA DAACs.
        • The ECHO metadata model has heritage from the well-established EOS Data and Information System (DIS) Core System (ECS) data model and is robust and able to represent most of NASA’s Earth science data types.
      • Applicability:
        • The ECHO metadata model has been optimized for NASA remote sensing data, having grown from the EOSDIS Core System (ECS) data model.

  31. ECHO Metadata Model Weaknesses and Limitations NOTE
      • Weaknesses:
        • Interaction with ECHO is somewhat arcane and requires a concerted systems development effort.
      • Limitations:
        • ECHO may be less suited to other NASA data types currently being catalogued, such as in-situ data acquired in field experiments.
        • While ECHO is understood to work, the community of reviewers identified that it is a NASA internal standard and not likely to be adopted outside of NASA. Reviewers suggest that NASA should be adopting the international ISO 19115 series of standards instead. THIS IS WHY ECHO REMAINS AN SPG TECHNICAL NOTE AND NOT AN SPG RECOMMENDED STANDARD.

  32. CF Metadata Conventions

  33. CF Metadata Conventions STANDARD
      • The Climate and Forecast (CF) Metadata Conventions are dataset-level metadata conventions developed to promote interoperability among data providers, data users, and data services.
      • They provide a clear and unambiguous standard for representing the geolocations and times of Earth science data, the physical quantities that the data represent, and other ancillary information useful in interpreting the data or comparing it with data from other sources.
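One concrete instance of an unambiguous time representation is the CF `units` attribute of the form "<unit> since <reference time>", which turns stored numbers into absolute times. The sketch below decodes such values under simplifying assumptions: a standard Gregorian calendar, no time zone, and a reference time written exactly as `YYYY-MM-DD HH:MM:SS`. The function name `decode_cf_time` and the sample values are hypothetical.

```python
from datetime import datetime, timedelta

# Only fixed-length units are handled in this sketch.
SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}


def decode_cf_time(values, units):
    """Convert numeric offsets plus a '<unit> since <reference>' string to datetimes."""
    unit, _, reference = units.partition(" since ")
    epoch = datetime.strptime(reference.strip(), "%Y-%m-%d %H:%M:%S")
    scale = SECONDS_PER_UNIT[unit.strip()]
    return [epoch + timedelta(seconds=value * scale) for value in values]


# Hypothetical time coordinate values and units attribute.
times = decode_cf_time([0, 6, 30], "hours since 2000-01-01 00:00:00")
for t in times:
    print(t.isoformat())
# -> 2000-01-01T00:00:00
# -> 2000-01-01T06:00:00
# -> 2000-01-02T06:00:00
```

Because the reference time and unit travel with the data, two datasets with different time origins can be put on a common time axis without outside information.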

  34. CF Metadata Conventions Strengths and Applicability STANDARD
      • Strengths:
        • CF conventions are used across multiple Earth science disciplines at many data centers, with data volumes in the tens of petabytes.
        • Software libraries and tools exist to support the CF conventions.
        • Many open-source and commercial data visualization clients make use of the CF attributes.
        • Consensus approach bridging several Earth system modeling communities.
      • Applicability:
        • Climate, forecast and Earth observation field data (also known as parameter, dataset, or variable data), especially when stored in netCDF.

  35. CF Metadata Conventions Weaknesses and Limitations STANDARD
      • Weaknesses:
        • CF conventions were originally tied to netCDF; implementation conventions are not yet well established for other formats.
        • CF conventions were originally developed to handle gridded model data; conventions for other types of data, such as point and trajectory data, are less complete.
        • The standard name process is difficult and appears disorganized.
        • The units attribute in the CF conventions makes use of the udunits package, which is not described very well.
        • CF conventions work well for 2-dimensional data but not for 3-dimensional data.
      • Limitations:
        • It is unclear how to implement CF in formats other than netCDF.
        • The geolocation method is not always the most efficient.
        • CF conventions metadata are not intended for catalog or inventory use.

  36. ISO 19115 Metadata (Geographic Information: Metadata)
