HARVESTING METADATA FROM THREDDS CATALOGS
Nebojsa Balic
Fast and unimpeded access to the data and associated metadata produced in the context of the IPCC Fifth Assessment Report (AR5) is among the main objectives of the IS-ENES project. To meet this challenge, a team of staff members from the Max Planck Institute for Meteorology and the German Climate Computing Center in Hamburg is cooperating to develop a data portal featuring tools for efficient metadata retrieval. Adopting and using new technologies for data dissemination plays a crucial role in this endeavor. One such technology is the THREDDS server, designed to greatly simplify the discovery and use of data stored on disparate hosts and accessible through different types of services. THREDDS servers generate catalogs, which are XML documents containing the names of datasets, their access points and metadata, as well as pointers to their respective subsets. The poster illustrates the workflow for harvesting metadata from THREDDS catalogs and integrating them into the IS-ENES portal, which is based on the Plone CMS platform.
The PYTHON SCRIPTS developed for this purpose parse the catalogs, collect the metadata and store them in a database. The metadata can then be browsed, queried and displayed in the data portal.
Metadata are harvested either periodically or after each update of the THREDDS catalogs. Because they are stored in the database, they remain accessible even when the catalogs are not available.
THE PARSING OF THREDDS CATALOGS with the lxml library consists of identifying the links to referenced catalogs, the datasets they contain and their metadata.
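The parsing step described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual script: the sample catalog, the function name parse_catalog and the record layout are assumptions made for the example. The namespace URIs are those of THREDDS InvCatalog 1.0 and XLink. If lxml is not installed, the sketch falls back to the standard library's ElementTree, which offers the same calls used here.

```python
try:
    from lxml import etree  # the library named in the text
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Namespaces used by THREDDS InvCatalog 1.0 documents.
THREDDS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
XLINK = "http://www.w3.org/1999/xlink"

# Made-up sample catalog, for illustration only.
SAMPLE = b"""<catalog
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink" name="example">
  <catalogRef xlink:href="sub/catalog.xml" xlink:title="Sub-catalog"/>
  <dataset name="tas monthly" ID="example.tas" urlPath="example/tas.nc">
    <property name="institute" value="EXAMPLE"/>
  </dataset>
</catalog>"""


def parse_catalog(xml_bytes):
    """Return (links to referenced catalogs, dataset records) for one catalog."""
    root = etree.fromstring(xml_bytes)

    # Links to referenced catalogs are carried in the xlink:href attribute.
    links = [ref.get("{%s}href" % XLINK)
             for ref in root.iter("{%s}catalogRef" % THREDDS)]

    datasets = []
    for ds in root.iter("{%s}dataset" % THREDDS):
        # Only <property> elements that are direct children are collected.
        props = {p.get("name"): p.get("value")
                 for p in ds.findall("{%s}property" % THREDDS)}
        datasets.append({
            "name": ds.get("name"),
            "id": ds.get("ID"),
            "url_path": ds.get("urlPath"),
            "properties": props,
        })
    return links, datasets


links, datasets = parse_catalog(SAMPLE)
```

A full harvester would presumably resolve each link against the URL of the current catalog and repeat the same step for every referenced catalog.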
Metadata stored in the database can be accessed faster and more flexibly than those in the THREDDS catalogs. They can be navigated down through their hierarchy and across the data nodes that provide them. They can also be queried by their various properties, such as temporal or spatial extent. These functionalities, coupled with the services for direct data access, constitute the central part of the IS-ENES DATA PORTAL.
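A query by temporal extent against harvested metadata might look like the sketch below. The poster does not name the database system or its schema, so SQLite, the table layout, the column names and the sample rows are all assumptions chosen for illustration.

```python
import sqlite3

# In-memory database standing in for the portal's metadata store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE dataset (
           id TEXT PRIMARY KEY,
           node TEXT,          -- data node that provides the dataset
           time_start TEXT,    -- ISO 8601 dates compare correctly as text
           time_end TEXT
       )"""
)

# Made-up records, for illustration only.
rows = [
    ("a.tas", "node-1", "1850-01-01", "2005-12-31"),
    ("b.pr", "node-2", "2006-01-01", "2100-12-31"),
    ("c.tos", "node-1", "1990-01-01", "2010-12-31"),
]
conn.executemany("INSERT INTO dataset VALUES (?, ?, ?, ?)", rows)


def overlapping(conn, start, end):
    """Return IDs of datasets whose time range overlaps [start, end]."""
    cur = conn.execute(
        "SELECT id FROM dataset "
        "WHERE time_start <= ? AND time_end >= ? ORDER BY id",
        (end, start),
    )
    return [row[0] for row in cur]


hits = overlapping(conn, "2008-01-01", "2009-12-31")
```

Because the query runs against the local store, it does not depend on the catalogs being reachable at query time.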
The DATA MODEL reconciles the hierarchical structure of the data archives with users' information requirements. It mainly reflects the XML Schema of THREDDS catalogs, however, and can therefore be easily modified to harvest any type of metadata provided in the THREDDS catalogs.
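One way to picture a data model that mirrors the catalog hierarchy is the sketch below. The class and field names are hypothetical and are not taken from the project; they only follow the nesting of catalogs, datasets and properties found in THREDDS catalogs.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class Dataset:
    """A dataset entry; nested datasets mirror the nesting in the XML."""
    name: str
    dataset_id: Optional[str] = None
    url_path: Optional[str] = None
    properties: Dict[str, str] = field(default_factory=dict)
    children: List["Dataset"] = field(default_factory=list)


@dataclass
class Catalog:
    """A catalog with its top-level datasets and links to other catalogs."""
    url: str
    datasets: List[Dataset] = field(default_factory=list)
    references: List[str] = field(default_factory=list)


def walk(dataset: Dataset, depth: int = 0) -> Iterator[Tuple[int, Dataset]]:
    """Yield (depth, dataset) pairs in document order, parents first."""
    yield depth, dataset
    for child in dataset.children:
        yield from walk(child, depth + 1)


# Made-up hierarchy, for illustration only.
leaf = Dataset("tas", dataset_id="example.tas", url_path="example/tas.nc")
top = Dataset("example experiment", children=[leaf])
catalog = Catalog("catalog.xml", datasets=[top], references=["sub/catalog.xml"])
outline = [(depth, ds.name) for depth, ds in walk(top)]
```

Keeping free-form properties in a dictionary is one way to accept new kinds of metadata without changing the classes themselves.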
MPI-M / DKRZ, Bundesstraße 45a, D-20146 Hamburg
Project web site: https://is.enes.org/