Research Metadata: Challenges and Solutions in Information Management

Metadata in Research Information. Introduction: “Tour de Table”. Ed Simons, President of euroCRIS

Structure of the Presentation • Introduction of the Speaker • “Tour de Table”: scope of this introduction. • Nature and importance of research metadata. • Some challenges to meet regarding the realization of optimal solutions for Research Information Metadata. • Conclusions.

Introduction of the Speaker • Working at Radboud University, NL: Central “Concern Information Management” Department. • Former Head of this Department and for some years now working as “International IT-project Manager” (especially IT projects in developing countries). • Initiator and project leader of the Dutch Research Information System (CRIS) “METIS”. • Development of the Dutch CRIS already started in 1992 with an interuniversity task group defining a data model for research information called “Combi-format”, a kind of CERIF “avant-la-lettre”. • First version of METIS implemented in 1993: one of the first fully-fledged CRIS systems in Europe.

Radboud University, Nijmegen, NL.

Radboud University, Nijmegen, NL • Nijmegen • Oldest city in the Netherlands: celebrated its 2000th anniversary in 2005. • Garrison city of the Roman Empire, located in the east of the country on the banks of the Rhine (which marked the northern border of the Roman Empire), where the river enters the Netherlands from Germany. • Home of the biggest walking event, and since 2012 officially even the biggest sports event, in the world: the “4-days Marches” (held every year in July) with some 45,000 participants from all over the world.

Radboud University, Nijmegen, NL • Founded in 1923 as “Catholic University Nijmegen”, name changed some 10 years ago. • About 19,000 students and 10,000 staff (5,000 f.t.e.), of which about half academic. • Middle-sized university by Dutch standards: 5th or 6th out of 13 universities. • All faculties, including an academic hospital, except engineering. • Strong in research: • Brain research. • Physics: home of a “High Field Magnet”, one of the most powerful in the world, which serves as a national research equipment facility. • (Psycho)-linguistics: the Max Planck Institute for Psycholinguistics, a prominent player in the RDA community, is located on campus.

Radboud University • Andre Geim and Konstantin Novoselov (University of Manchester) both worked at Radboud University for some years, where they started the research that led to the development of the material “graphene”, for which they received the Nobel Prize. Novoselov got his Ph.D. from Radboud and Geim is still a Visiting Professor.

“Tour de Table”: Scope of the introduction • Presenting a “Tour de Table” involves talking about: • What is on the table. • Who is sitting around the table. • Applied to our subject of “Research Metadata”: • What issues are at stake concerning Research Metadata. • Who are the key players involved. • Focus will be on the first aspect; the key players involved will come up “automatically” while dealing with these core issues.

“Tour de Table” • A “Tour de Table” by its nature can only be global or general. To use the food metaphor: • It deals with the dishes on the table, but does not go into a detailed treatment of the ingredients of the various dishes or the cooking technique used to produce the food. • So the presentation will be one from a “bird’s eye” view, and more particularly of course the view of a “euroCRIS bird”. • Focus not so much on technology, but more on non-technical aspects and issues that are of importance when it comes to the realization of optimal solutions concerning Research Metadata. • (Obviously) focus on “domain-related” metadata, i.e. metadata about the subject matter (research information) itself, as distinguished from domain-agnostic, formal metadata (administrative metadata, technical metadata, rights metadata, ...). • Guiding or underlying question: what major challenges exist in the Research Information Domain that should be dealt with in order to be able to create optimal, sustainable solutions regarding research metadata?

Why is Research Metadata important? As subject for this euroCRIS Seminar? Research Metadata is “The Bread we Bake”. The development of a research metadata model (CERIF) is a core activity of euroCRIS and at the heart of our Mission.

Why is Research Metadata important? As such, for (the field of) Research Information? “Carefully crafted metadata results in the best information management —and the best end-user access— in both the short and the long term.” “Quality metadata creation is just as important as the care, preservation, display, and dissemination of collections; adequate planning and resources must be devoted to this ongoing, mission-critical activity.” From: Tony Gill, et al., Introduction to Metadata, Paul Getty Institute: http://www.getty.edu/research/publications/electronic_publications/intrometadata/index.html

What are Metadata? • “Data about data” • In a way an appealing definition: • Short and simple to remember • Basically says “it all” about what metadata is • Can be read in both directions • Complies with the (popular) triple structure. • But it already requires a certain familiarity with the matter to really grasp the full meaning.

What are Metadata? Various definitions of (domain-related) metadata exist, such as: “Metadata is structured data which describes the characteristics of a resource.” (An Introduction to Metadata, by Chris Taylor, University of Queensland) “Metadata are structured, encoded data that describe characteristics of information bearing entities to aid in the identification, discovery, assessment, and management of the described entities.” (Ma, J., Managing metadata for digital projects) “Metadata are data identifying, describing or characterizing information objects or resources, and their relations, aimed at supporting discovery, access to and use of these information objects/resources.” (Ed Simons)

What are Metadata? • From a more pragmatic point of view: • Research metadata: “Data about research”, or more specifically: • “Data about objects or resources, and their relations, in the research information domain” • Research information objects/resources: • Individual actors (persons): researchers, managers, policy makers, auditors, ... • Organizational units: research institutions, funding organisations, publishers, ... • Activities: projects, ... • Input: f.t.e., money • Output: publications, patents, other products • Equipment used • Services used / produced • Datasets used / produced • Metrics and indicators • ... • So research metadata are data about these information objects/resources (data that identify, describe, characterize) and about the relations between the objects/resources.

What are Metadata for? “metadata results in the best information management —and the best end-user access” “… aid in the identification, discovery, assessment, and management…” “... aimed at supporting discovery, access to and use...”

What are Metadata for? • So metadata are helpful or necessary for: • Finding and (being able to) access or obtain research information. This includes the aspects: • Discovery of research information objects/resources. • Provenance, administrative and technical data (e.g. format or type) of the information object. • Conditions of (re)-use • User rights and security. • ... • The management and use of research information (use cases). This regards the aspects: • Research policy (formulation, execution and evaluation, on various levels) • Planning of research • Management of research (steering, monitoring) • Performance measurement • Impact measurement • Presentation and communication of (information on) research (to/by various stakeholders) • ... • “Carefully crafted metadata” are needed for an optimal implementation or execution of these aspects.

Metadata requirements. In order to optimally support the activities mentioned, metadata must be: • Complete • Correct • Up to date • Accurate • Unambiguous • Detailed (enough) • Reliable • Secure • Sustainable

Types (typologies) of Metadata. Various typologies or classifications of metadata exist, e.g. the distinction: • Descriptive or content-related metadata: describing/characterizing the intellectual content. • Administrative, technical metadata: e.g. file formats. • Rights metadata: regulating authorization, permissions. • Provenance metadata: creation, subsequent versioning or treatment. • Structural metadata: e.g. internal structure of items, page order, ... • Context metadata: relating information objects to their “environment” or context, e.g. projects to funders and institutions; publications to authors, publishers, reviewers, etc. • Usage metadata: information about the use of the information object (number of downloads, requests, ...). Another one (by Keith Jeffery): • Schema metadata: controlling the integrity of the described data • Navigational metadata: the access path to the data • Associative metadata: • Descriptive: content-related, context • Restrictive: rights, authorization • Supportive: conditions of use, constraints, ...

Metadata models and formats In order to be put to use in information systems or applications, metadata are described and organized in metadata models. A working definition of a metadata model could be: A structured set of concepts that define the information objects in a given business domain, their identification, properties and relations as well as their meaning within the context of the model itself, including possible constraints that may exist regarding (values, use of) elements of the model. So a metadata model concerns the information objects themselves as well as their metadata as such. E.g. the concept “Person” (information object) is part of a metadata model, but in itself is no metadata. It further defines the information objects in terms of the model, e.g.: • “A Person is an individual actor within the research domain”. • “Person is a first level entity” in the CERIF model.

Metadata models and formats A genuine metadata model is structured, meaning it has a certain architecture involving the relations between the elements of the model (analogous e.g. to a “domain model” in software development). A metadata model may be implemented in a system (e.g. a relational database) and expressed in (one or more) formats (e.g. CERIF-XML). Examples: CERIF, DCMES, MODS, MARC21, VIVO-ontology, etc.

The CERIF Metadata Model

The CERIF Metadata Model • Broad coverage (aspects) of research information: metadata on researchers, projects, organizations, output (publications and other products), input (f.t.e., money), funding, equipment, services, metrics, impact, dataset metadata (as from version 1.6) and the (inter)relations between these elements. • Detailed: highly normalized. • Well thought-out architecture based on an optimal use of the relational model, with the following basic principle: properties and semantics of the information objects and their relations are expressed by means of time-stamped links (linking entities) instead of as attributes of the entities. This makes the model extremely flexible and scalable, since any number of links can exist between information objects in the model.

The CERIF Metadata Model • Some examples: • A researcher can have various roles at the same time in a project, or various affiliations with an organisational unit. Even various roles from various typologies or in various languages. • Any number of classifications can be used for the same publication (various controlled vocabularies). • The same principle of linking entities can be used to map controlled vocabularies to one another. • And: various levels of granularity can be expressed and registered for the same kind of metadata. E.g. the role of a researcher concerning a publication can be expressed both according to the coarse-grained DC (creator, contributor) and by means of a more fine-grained classification (1st author, author, editor, editor-in-chief, reviewer, executor of the experiment, etc.).
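
The linking-entity principle described above can be illustrated with a small sketch. It is not CERIF itself: the class name, field names, identifiers and role vocabularies below are assumptions made up for the illustration. It only shows the idea that roles live on time-stamped links rather than as attributes of a person or publication, so one person can hold several roles, from several vocabularies, towards the same object.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Link:
    """A time-stamped link between two entities; the role carries the semantics.

    All names here are hypothetical and not taken from the CERIF specification.
    """
    source_id: str               # e.g. a person identifier
    target_id: str               # e.g. a publication or project identifier
    role: str                    # term taken from some classification scheme
    scheme: str                  # the vocabulary the role term belongs to
    start: date
    end: Optional[date] = None   # None means the link is still valid

    def active_on(self, day: date) -> bool:
        return self.start <= day and (self.end is None or day <= self.end)


def roles_on(links, source_id, target_id, day):
    """Return sorted (scheme, role) pairs linking source to target on a given day."""
    return sorted(
        (link.scheme, link.role)
        for link in links
        if link.source_id == source_id
        and link.target_id == target_id
        and link.active_on(day)
    )


links = [
    # Same person, same publication: a coarse Dublin Core role and a finer one.
    Link("person-1", "publ-1", "creator", "dc", date(2013, 1, 1)),
    Link("person-1", "publ-1", "first author", "local-roles", date(2013, 1, 1)),
    # Same person, same project: two roles, one of which has ended.
    Link("person-1", "proj-1", "coordinator", "local-roles",
         date(2010, 1, 1), date(2011, 12, 31)),
    Link("person-1", "proj-1", "investigator", "local-roles", date(2010, 1, 1)),
]

print(roles_on(links, "person-1", "publ-1", date(2013, 6, 1)))
print(roles_on(links, "person-1", "proj-1", date(2012, 6, 1)))
```

Adding a further role or a further vocabulary means adding a row to the list of links; the entities themselves do not change, which is the flexibility the slide refers to.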

The CERIF Metadata Model [Figure: example role values carried by links between entities, e.g. role=author, role=author1, role=author1-institute, role=editor, role=reviewer, role=journal article, role=public report, role=deliverable1.2, role=CEO, role=researcher, role=project-manager, role=funder, role=investigator, role=member, role=coordinator, role=manager]

The CERIF Metadata Model The broad coverage and appropriate architecture just mentioned make CERIF-CRIS a powerful interoperability instrument or “engine” and a perfect candidate as “one stop registration store” of research metadata. Theoretically one could say that “all you need” is registration of your research metadata in a CERIF-CRIS. But practice is different: CERIF is not alone in the world and is still unknown to a lot of stakeholders in the international research information domain. And various other valuable applications and developments exist that are “here to stay” and with which the CERIF-CRIS community has to live together. This brings us to the challenges ahead of us on our way to more optimal solutions concerning research information and its metadata.

Challenges to reach metadata “Nirvana” • The “Tower of Babel” syndrome • A multi-cultural world • A tale of two ecosystems • Transatlantic differences • The “Success of the Web Paradox” • The human factor • The imperfect being • It’s all about time and money • The imperfect organization • The (big) data deluge

The Tower of Babel Syndrome The Tower of Babel. • A Biblical story (Genesis) about the survivors of the Great Flood who wanted to build a Tower that would (by means of which one could) reach heaven. • This angered God, who confounded their languages so that they could not understand and communicate with each other anymore. As a result their project collapsed and the tower was left unfinished. Two “all time lessons” to be drawn from this: • It’s difficult to reach optimal solutions if you “speak” in different languages, formats, models, etc… • Communication and cooperation between various parties or stakeholders involved is needed for an optimal result.

The Tower of Babel Syndrome The Tower of Babel metaphor may well be applicable to the field of research information metadata. Strolling around for a while in the research metadata domain, you may encounter the following kind of experience.

The Tower of Babel Syndrome A plethora of metadata models and formats exists within the research information domain, both concerning “generic” aspects (i.e. metadata applicable to all disciplines) as well as “discipline- or subject-specific” metadata (controlled vocabularies that hold content- or aspect-specific classifications related to a given scientific discipline or research subject, e.g. the MeSH classification for Medical Sciences): GBIF, MAGE, NEXUS, CIF, EGMS, MARC21, PREMIS, CKAN, DCAT, DA|RA, VIVO, PRISM, MODS, DDI, TEI, DCMES, CERIF, GILS, LCS, Darwin Core, SOIF, ROADS/IAFA, journalpublishing3, MESH, FGDC, IAD, ISAD(G), SPECTRUM, LCNAF, ITIS, HIVE, TGN, NBII, UBio, INSPIRE

The Tower of Babel Syndrome According to the Tower of Babel Syndrome it seems that we are doomed and will never reach research information “Nirvana”. However, things have changed in the 3,000 years since the Tower of Babel was built and humanity has made some progress. The “Babel Builders” had neither the Web nor automatic translation facilities. But we do today. In other words: we now have tools to realize interoperability between the various models and formats in order to solve our language problem. So there’s a first challenge for euroCRIS: to create crosswalks between CERIF and other metadata models or formats existing within the research information domain.

The Tower of Babel Syndrome • euroCRIS has taken up this challenge: • Agreement on and creation of a CERIF-OpenAIRE interoperability solution, based on CERIF-XML, in cooperation with the OpenAIRE community. • Realization of a CERIF-VIVO mapping, again a joint project between the two organisations. The first version of this mapping is now ready for endorsement by the Boards of both organizations. • Within the EU-financed project “ENGAGE”, aimed at making public governmental data available on the web and in which euroCRIS is a partner, a crosswalk has been created between (metadata elements of) CERIF on the one hand and CKAN and DCAT on the other. • Within the C4D (CERIF for Data) project, a first mapping has been done between CERIF and INSPIRE (EU project: Infrastructure for Spatial Information in the European Community). • A project has been started up in cooperation with Elsevier to “CERIF-y” the Snowball Metrics Metadata (project of UK universities and Elsevier to develop a set of benchmark metrics for institutions’ research performance).
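
To make the notion of a crosswalk concrete, here is a minimal sketch. It does not reproduce any of the mappings listed above: the source field names are invented for the illustration, and only the target names (dc:title, dc:date, dc:creator, dc:identifier) are actual Dublin Core elements. A real crosswalk also has to deal with structure, vocabularies and granularity, which this sketch ignores.

```python
# Hypothetical crosswalk: invented source field names -> Dublin Core elements.
CROSSWALK = {
    "title": "dc:title",
    "publication_date": "dc:date",
    "author_names": "dc:creator",
    "doi": "dc:identifier",
}


def apply_crosswalk(record, crosswalk):
    """Translate a flat record field by field.

    Returns the translated record plus the sorted list of source fields
    for which the crosswalk has no counterpart, so that information loss
    is visible instead of silent.
    """
    mapped = {}
    unmapped = []
    for field, value in record.items():
        target = crosswalk.get(field)
        if target is None:
            unmapped.append(field)
        else:
            mapped[target] = value
    return mapped, sorted(unmapped)


# Made-up example record in the hypothetical source format.
source_record = {
    "title": "An example publication",
    "publication_date": "2013-09-01",
    "author_names": ["A. Author", "B. Author"],
    "funding_programme": "Example Programme",   # no Dublin Core counterpart here
}

mapped, unmapped = apply_crosswalk(source_record, CROSSWALK)
print(mapped)
print(unmapped)
```

Reporting the unmapped fields is a deliberate choice: when the source model is richer than the target, a crosswalk is lossy in that direction, and it is useful to know exactly what was dropped.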

The Tower of Babel Syndrome • The Tower of Babel metaphor not only points to the necessity of creating interoperability solutions (translations of concepts) but also to the more business-related need for international standard definitions of key business objects and aspects in the research information domain. It is not of much use to match terms if the content of the terms does not match. • In this respect euroCRIS is working closely together with CASRAI, the Canada-based “Consortia Advancing Standards in Research Administration Information”, which is (among other things) developing a standard dictionary of research information concepts.

The Tower of Babel Syndrome An aspect that should not go unmentioned here is that the plethora of different formats, often related to disciplinary differences or boundaries between scientific disciplines, not only causes problems for those active in the research information domain, but may well hamper research itself, especially at a time when research is becoming more and more multi-disciplinary: “The proliferation of discipline-specific metadata schemes contributes to artificial barriers that can impede interdisciplinary and transdisciplinary research. … These barriers, frequently associated with metadata semantics and data structures, interfere with scientific progress along multidisciplinary, interdisciplinary, and trans-disciplinary lines. On the whole, the barriers can interfere with progress supporting our contemporary understanding of science.” (Willis, Greenberg, White, Analysis and Synthesis of Metadata Goals for Scientific Data, preprint)

A tale of two ecosystems • Within the research information domain, two major “ecosystems” exist with their own culture, tradition, visions on and approaches towards research information and research information metadata. • These are: • (research) Administrative ecosystem • Library ecosystem. • Both ecosystems up to now often (still) behave like “silos” without much communication, sharing of visions or cooperation and this on all levels, whether local (within an institution), national or international. • The table on the next slide shows a comparison of both ecosystems on some significant aspects:

A tale of two ecosystems

A tale of two ecosystems • Major challenge: to integrate or harmonize the two communities in the search for optimal solutions in research information. • This can be promoted by: • Adopting an open mind towards each other’s expertise, experience and solutions. • Being present at each other’s events and organizing joint events. • Starting up joint projects. • Formalizing the exchange of information (newsletters, announcements). • Organizational integration of both communities (departments) on an institutional (university), national (e.g. SURF in NL) and international level (joint, coordinating structure: to be created).

A tale of two ecosystems • euroCRIS has picked up this challenge, e.g. by: • As early as 2004 (euroCRIS Conference in Antwerp), inviting the library/repository community to euroCRIS events. • Cooperation within the framework of the CERIF-OpenAIRE interoperability format. • Inviting (very recently) a well-known expert from the repository community to join the euroCRIS Board (which, luckily, he has accepted as of this month). • But the progress made in this respect is a two-way street, e.g.: • The CERIF-OpenAIRE project clearly was a joint initiative. • The Italian Consortium of Universities (CINECA), in cooperation with the University of Hong Kong, has worked out CERIF-compatible metadata extensions for the repository software package DSpace. • So it is good to see that the two communities are more and more aware of each other’s existence and value and are growing towards and learning from each other.

A tale of two ecosystems • As said, both communities should learn from each other. A concrete example in this respect could be the following. • Within the CRIS community the word “Profile” (researcher profile, CERIF-profile, application profile, etc.) has recently come up frequently, but without the concept being clearly and unambiguously defined or structured. In this respect, I think, there is something to learn from the library/DC community in their dealing with the concept “DC Application Profile”, and more notably the so-called “Singapore Framework”. This framework defines a structure for a profile and its constituting elements, as follows: • An Application Profile consists of: • Functional requirements (what the application is for or does) • Domain model (which information elements play a role in the application) • Description set profile (describes the metadata records for an object of the domain and its properties) • Usage guidelines • Encoding syntax guidelines • I think it could be a useful inspiration or example for the CRIS community.

A tale of two ecosystems Some challenges that remain: CERIF and the added value it holds could still be better promoted and made known in the library ecosystem. The distant and reluctant attitude that still exists within the research community towards the “administrative ecosystem” in a way still hampers the acceptance and introduction of CERIF outside this ecosystem. (Aside: in this respect, and with a bit of exaggeration, one could say that maybe the best model (CERIF) has, sadly enough, been developed in the wrong ecosystem.) So, some work needs to be done to correct the negative image of CRIS that exists within the research community. For this it is good to have an understanding of what causes the researchers’ discontent.

A tale of two ecosystems • In my view, the following 3 reasons are the main causes of the distance and discontent of researchers towards the administrative ecosystem and its systems: • Administration and its organizational structures are considered to be of a “lower status” and a “necessary evil”. Researchers want to do research and not be bothered by administrators. • Registration of research information (metadata) in an administrative information system (CRIS) is boring and creates a lot of overhead at the expense of valuable research time. • Various organizations ask for the same information in different formats, with different definitions and at different moments. • Challenge: to remove these “problems”. The first one is psychological and will only be removed if (at least) the other two have been solved.

A tale of two ecosystems • Registration of research information (metadata) in an administrative information system (CRIS) is boring and creates a lot of overhead at the expense of valuable research time. • Already dealt with to a large extent by the CRIS developers: most modern CRIS include automated harvesting of metadata that already exist in other resources (local: HRM and project management systems; international: WoS, Scopus, MedLine, etc.). This dramatically reduces the time researchers have to invest. Challenge: to inform researchers as well as possible that these services exist as part of a CRIS. • Various organizations ask for the same information in different (metadata) formats, with different definitions for the same concepts and at different moments. • This requires agreements and streamlining of information request processes between the organizations involved on a supra-local (national) level. Countries differ in their level of achievement here. Challenge for euroCRIS (and others): to advocate the need for coordination and agreement; to knock on the doors of the organizations concerned, to give advice and to show best practices.
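
How harvesting reduces manual registration can be sketched as follows. This is an illustration under stated assumptions, not how any particular CRIS works: records are plain dictionaries keyed by a DOI-like identifier, the identifiers and values are made up, and the merge rule (values entered locally win, harvested values only fill gaps) is just one possible policy.

```python
def merge_harvested(local_records, harvested_records):
    """Merge harvested metadata into locally registered metadata.

    Both arguments map an identifier to a dict of fields. Values already
    entered locally are kept; harvested values only fill empty fields or
    add records that were not registered locally at all.
    """
    merged = {ident: dict(rec) for ident, rec in local_records.items()}
    for ident, rec in harvested_records.items():
        target = merged.setdefault(ident, {})
        for field, value in rec.items():
            if target.get(field) in (None, ""):
                target[field] = value
    return merged


# Made-up data: one publication partly registered by hand, one unknown locally.
local = {
    "10.1234/example-1": {"title": "Locally entered title", "year": None},
}
harvested = {
    "10.1234/example-1": {"title": "Harvested title", "year": 2013},
    "10.1234/example-2": {"title": "Only known to the external source"},
}

merged = merge_harvested(local, harvested)
print(merged["10.1234/example-1"])
print(merged["10.1234/example-2"])
```

In this sketch the researcher only has to supply what no source can provide; whether local or harvested values should take precedence is a policy decision that an institution would have to make for itself.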

Transatlantic differences • A variation on the previous theme. Developments in IT and Web technology are traditionally dominated by US-based organizations and enterprises, as a result of which IT applications in a given domain to a certain extent reflect the US organization and culture of the domain in question. • Applied to research information: US universities and research institutions have different ways of e.g. financing, organizing and evaluating research than their European counterparts, and thus have different views on the applications that support these processes and on the metadata involved. • This may also hamper the implementation of global, standard solutions in the research information domain or lead to the implementation of “domain-incongruous” applications on both sides of the Atlantic. • In this respect, I can mention a concrete example from within our university, in the student information domain.

The “Success of the Web” paradox. • The success of the World Wide Web may in some ways hamper rather than promote the implementation of optimal solutions in the research information domain. • The (no doubt justified) enthusiasm among important stakeholders in the research information domain (e.g. research community, politicians) about the possibilities of the Web and its technologies, as promoted by the Web Science and W3C communities, may hold the danger of an uncritical identification of the “Web technology solution” as “the best possible solution”. • This “overwhelming belief in the Web” may cause a certain blindness to (the integration of) other technologies, and hence may lead to sub-optimal solutions. Significant and illustrative in this respect is the following remark: “However, in the open data process very little attention is paid to metadata.” (Jeffery, Zuiderwijk, Janssen, The potential of metadata for linked open data and its value for users and publishers) • Challenge: continue to advocate and demonstrate the (added) value of RDBMS technology and its implementation in the CERIF model and CERIF-CRIS.

The human factor: the imperfect being • Humans are not perfect: they forget, make errors, lack discipline or even just lie! • This may result in substantial problems regarding metadata, since whenever “bio-curation” is applied (metadata registration by humans - as is still the case in the majority of cases), the above aspects come into play and may lead to incomplete, not in time, not up to date, incorrect or unreliable metadata. • Example from the Dutch situation (but probably a universal phenomenon): the supply of metadata by and within the universities, needed for the yearly research evaluation process by the government is often only done at the very last moment. • Challenge: to develop and perfect automated metadata creation and registration.
