Guide to Publishing Metadata in EUDAT's B2FIND Catalogue

B2FIND Integration: How to publish metadata in EUDAT’s B2FIND catalogue • Version 2, December 2015 • This work is licensed under the Creative Commons CC-BY 4.0 licence

EUDAT: A truly pan-European Infrastructure • EUDAT offers common data services to both research communities and individuals through a network of 35 European organisations. • EUDAT enables European researchers from any discipline and any geographic location to preserve, find, access, and process data in a trusted environment. [Figure labels: European infrastructures, Technology Providers, Research Communities]

Community-Driven Solutions • EUDAT services (the so-called B2 Service Suite) are designed, built and implemented based on user community requirements. [Figure labels: Environmental Sciences; Biomedical & Medical Sciences; Materials & Analytical Facilities; MAPPER; Social Sciences & Humanities; Physical Sciences & Engineering]

B2 Service Suite

What is B2FIND? • B2FIND is based on a comprehensive joint metadata catalogue of research data collections stored in EUDAT data centres and other repositories • B2FIND provides a simple and user-friendly discovery service on metadata steadily harvested from a wide range of research communities

Why should you publish your metadata in B2FIND? • Make your research data searchable, visible and accessible to the public, and known across disciplines and internationally • Improve interoperability and re-use of your data • Allow feedback and annotations on your research output • Benefit from validation, quality assurance and added value for your metadata

Data from a huge selection of subjects • B2FIND has a truly cross-community approach • Metadata is mapped and offered covering a wide range of communities • From climate research to social sciences • From biodiversity to linguistics • From archaeology to seismology • Transformation and homogenisation of the catalogue allow the use of a common vocabulary

B2FIND communities • B2FIND initially comprises communities in the EUDAT registered domain of data that provide well-described and stable metadata offerings. • EUDAT is extending the service to other interested and reliable data and metadata providers. • The list of currently integrated communities is available at http://b2find.eudat.eu/group/

What will be covered • How to get your metadata published in B2FIND? • Metadata Generation • Metadata repository and provider • Metadata Harvesting • Metadata Formats (excerpt) • Metadata Mapping • B2FIND MD Schema (excerpt) • Metadata Validation • Support requests • Appendix: OAI-PMH - What it is and how it works

How to get your metadata published in B2FIND? - The Metadata (MD) Ingestion Roadmap • Data Provider on community site: MD Generation, MD Repository and Provider • Service Provider on EUDAT site: MD Harvesting, MD Mapping and Validation, MD Uploading and Indexation

Metadata Generation • must be done close to the data production • should be part of the data management plan • must be checked and possibly enhanced to aim for a comprehensive data description • benefits from quality control at an early stage • should be based on common ontologies and metadata formats

Metadata repository and provider • To be set up on the community site to allow harvesting • OAI-PMH is the preferred protocol (for a detailed description of the protocol and an installation guide for the data provider tool see the Appendix) • Other data transfer techniques are also supported, if necessary • EUDAT offers support for the installation

Metadata Harvesting • B2FIND harvests regularly and incrementally from OAI endpoints • Initially, the B2FIND team performs a first test harvest on a given, accessible OAI endpoint • The harvesting frequency and the sets to be harvested are negotiated with the community
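
As a rough illustration of what "incremental" harvesting means at the protocol level, the sketch below builds an OAI-PMH ListRecords request that asks only for records changed since the last harvest. It is not B2FIND's harvester: the endpoint URL, the set name and the helper name build_harvest_url are assumptions made for this example. Day granularity is used for the from argument because it is the minimum every OAI-PMH repository has to support.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_harvest_url(base_url, metadata_prefix, last_harvest=None, set_spec=None):
    """Return a ListRecords URL for a full (no date) or incremental harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    if last_harvest is not None:
        # OAI-PMH datestamps are UTC; day granularity (YYYY-MM-DD) is always allowed.
        utc = last_harvest.astimezone(timezone.utc)
        params["from"] = utc.strftime("%Y-%m-%d")
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and set name, for illustration only.
url = build_harvest_url(
    "https://oai.example.org/provider",
    "oai_dc",
    last_harvest=datetime(2015, 12, 1, tzinfo=timezone.utc),
    set_spec="climate",
)
print(url)
```

A first test harvest would simply omit last_harvest, so that the repository returns all records of the negotiated set.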

Metadata Formats (excerpt)

Metadata Mapping • The community-specific ‘raw’ metadata are processed and homogenised to the B2FIND schema in the following steps: • Parse harvested XML records and select entries by MD-format-specific XPath rules • Analyse and parse values and map them onto key-value pairs (JSON), checking against given controlled vocabularies • Check and validate the resulting JSON records against the B2FIND schema • Use (community-specific) ontologies and thesauri
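
The first two mapping steps can be sketched as follows. This is a minimal illustration, not the B2FIND mapping code: the XPath rules, the target keys, the tiny language vocabulary and the sample Dublin Core record are all made up for the example.

```python
import json
import xml.etree.ElementTree as ET

NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical mapping rules: target key -> XPath into a Dublin Core record.
XPATH_RULES = {
    "title": ".//dc:title",
    "creator": ".//dc:creator",
    "language": ".//dc:language",
}

# Hypothetical controlled vocabulary for the language field.
LANGUAGES = {"en": "English", "eng": "English", "de": "German", "deu": "German"}

# Made-up sample record standing in for a harvested XML document.
RAW_RECORD = """
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sea surface temperature 1990-2000</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
  <dc:language>eng</dc:language>
</metadata>
"""


def map_record(xml_text):
    """Select values by XPath and map them onto key-value pairs."""
    root = ET.fromstring(xml_text)
    mapped = {}
    for key, xpath in XPATH_RULES.items():
        values = [el.text.strip() for el in root.findall(xpath, NS) if el.text]
        if values:
            mapped[key] = values
    if "language" in mapped:
        # Normalise against the vocabulary; unknown codes are kept unchanged.
        mapped["language"] = [LANGUAGES.get(v.lower(), v) for v in mapped["language"]]
    return mapped


record = map_record(RAW_RECORD)
print(json.dumps(record, indent=2))
```

Keeping the rules in a plain dictionary per metadata format means a new community format only needs a new rule table, not new parsing code.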

B2FIND Metadata Schema (excerpt)

Metadata Validation • Check each field for coverage, consistency and validity • ‘Technical’ checks, e.g.: date-time against UTC format; spatial coverage via geonames.org and consistency of lat/lon coordinates • Semantic mapping: using controlled vocabularies; using ISO standards, e.g. the iso639 library for ‘Language’ • Online checks of links to the data objects (‘Source’, ‘PID’ and ‘DOI’)
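
Two of the ‘technical’ checks named above can be sketched offline as below. The field names timestamp and bbox, and the function names, are assumptions for this example and are not taken from the B2FIND schema. The geonames.org lookup and the online link checks are left out because they need network access.

```python
from datetime import datetime


def is_utc_datetime(value):
    """True if value has the form YYYY-MM-DDThh:mm:ssZ and is a real date-time."""
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        return True
    except (ValueError, TypeError):
        return False


def is_valid_bbox(min_lat, min_lon, max_lat, max_lon):
    """Check coordinate ranges and that the southern edge is not above the northern one."""
    lat_ok = -90 <= min_lat <= max_lat <= 90
    # Longitudes are only range-checked: a box may cross the antimeridian.
    lon_ok = -180 <= min_lon <= 180 and -180 <= max_lon <= 180
    return lat_ok and lon_ok


def validate_record(record):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if not is_utc_datetime(record.get("timestamp")):
        problems.append("timestamp is not in UTC format")
    bbox = record.get("bbox")
    if bbox is not None and not is_valid_bbox(*bbox):
        problems.append("bbox coordinates are inconsistent")
    return problems


good = {"timestamp": "2015-12-01T12:00:00Z", "bbox": (47.0, 5.0, 55.0, 15.0)}
bad = {"timestamp": "2015-13-01", "bbox": (95.0, 5.0, 55.0, 15.0)}
print(validate_record(good))
print(validate_record(bad))
```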

Support requests www.eudat.eu/support-request?service=B2FIND

b2find.eudat.eu • For more info: https://eudat.eu/services/b2find • User documentation: https://www.eudat.eu/services/userdoc/b2find-integration

Appendix OAI-PMH: What it is and how it works • OAI-PMH ( http://www.openarchives.org ) • stands for Open Archives Initiative Protocol for Metadata Harvesting • aims at world-wide consolidation of scholarly archives • enables free access to the archives (at least to the metadata) • is a low-barrier mechanism for repository interoperability • consists of a set of six verbs or services that are invoked within HTTP • provides consistent interfaces for data and service providers • allows effortless implementation • is based on only a few simple standards (HTTP, XML, Dublin Core)

Data/Service Provider setup

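Basic functioning of OAI-PMH • The Service Provider runs a metadata harvester that sends requests (based on HTTP) to the Data Provider • The Data Provider holds the metadata (documents) and returns metadata (encoded in XML) • The harvested metadata goes into local metadata storage and the EUDAT Metadata Catalogue • The catalogue offers „Services“, e.g. Search, Access, Commenting, …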

OAI benefits • Interoperability: it is not domain-specific and is based on common metadata schemas • Widely used: it is a quasi-standard for providing metadata; for registered data providers (more than 2800 repositories worldwide) see e.g. https://www.openarchives.org/Register/BrowseSites • Simple to install: this appendix gives an installation guideline for the jOAI software. See the list of tools implemented by members of the Open Archives Initiative community at https://www.openarchives.org/pmh/tools/tools.php • Simple to use: OAI attached great importance to the simplicity of the protocol

OAI shortcomings • Inefficiency: the XML serialisation and deserialisation takes time. • Reference clash issue: if two records happen to have the same ID value, the envelope is not valid XML. • Persistence of deletion: OAI-PMH allows three levels of persistence, but most providers promise none. • Lack of SSL: by a strict reading, the OAI-PMH standard supports only http:, not https:
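
A harvester can find out which deletion policy a repository promises from the deletedRecord element of the Identify response, whose value is one of no, transient or persistent. The sketch below reads that element. The sample response is a trimmed, made-up document and deletion_policy is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Trimmed, made-up Identify response for illustration.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <deletedRecord>transient</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""


def deletion_policy(identify_xml):
    """Return 'no', 'transient' or 'persistent', or None if the element is missing."""
    root = ET.fromstring(identify_xml)
    element = root.find("oai:Identify/oai:deletedRecord", OAI_NS)
    if element is None or not element.text:
        return None
    return element.text.strip()


policy = deletion_policy(SAMPLE_IDENTIFY)
print(policy)
```

If the policy is no, an incremental harvest cannot learn about deleted records, so an occasional full re-harvest may be needed to keep a catalogue in sync.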

Software for OAI-PMH • jOAI software ( http://www.dlese.org/dds/services/joai_software.jsp ) • is a Java-based data provider and harvester tool • is open source and implements the Open Archives Initiative protocol • runs in a servlet container such as Apache Tomcat • enables existing systems, archives and databases to provide metadata via OAI-PMH and to harvest metadata to the file system.

Installation overview • To install and run the jOAI software you must have the following: • oai.war, the jOAI software • Apache Tomcat v5.5.x or v6.x • Java Standard Edition (SE) (or JDK) version 6 • For details see the OAI-PMH tutorial at http://www.oaiforum.org/tutorial/

Data provider • Configuration and customisation can be done directly in the jOAI data provider site: • Setup and configuration → Data Provider → Setup and status → Repository Information and Administration • Add metadata by adding directories of files → Metadata Files Configuration → Add metadata directory • (Re)index added/changed directories • (optional): Set configuration, Access control, …

OAI-PMH Harvester: Verbs and Parameters • Verbs that specify the service being invoked: • Identify: used to retrieve information about the repository. • ListIdentifiers: used to retrieve record headers from the repository. • ListRecords: used to harvest full records from the repository. • ListSets: used to retrieve the set structure of the repository. • ListMetadataFormats: lists the available metadata formats. • GetRecord: used to retrieve an individual record from the repository. • Selective harvesting by parameters: • identifier: specifies a specific record identifier. • metadataPrefix: specifies the metadata format of the returned records. • set: specifies the set that returned records must belong to. • from/until: returns records created/updated/deleted after/before this date. • resumptionToken: a token to resume a request where it last left off.
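
To show how resumptionToken is used in practice, the sketch below inspects a ListRecords response and builds the follow-up request. The sample responses, the token value, the endpoint URL and the helper name next_request_url are assumptions for illustration. In OAI-PMH, a request that carries a resumptionToken takes no other argument besides the verb, and an empty token marks the last page.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Made-up, trimmed responses: one with a token, one marking the end of the list.
PAGE_WITH_TOKEN = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

LAST_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken/>
  </ListRecords>
</OAI-PMH>"""


def next_request_url(base_url, response_xml, verb="ListRecords"):
    """Return the URL of the next page, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    # resumptionToken is exclusive: only the verb accompanies it.
    params = {"verb": verb, "resumptionToken": token.text.strip()}
    return base_url + "?" + urlencode(params)


BASE = "https://oai.example.org/provider"  # hypothetical endpoint
print(next_request_url(BASE, PAGE_WITH_TOKEN))
print(next_request_url(BASE, LAST_PAGE))
```

A harvester would call this in a loop, fetching each returned URL until the function yields None.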

An example of an OAI Provider and Harvester
