1 / 35

CLARIN Metadata & ISO DCR

CLARIN Metadata & ISO DCR. Daan Broeder. Max-Planck Institute for Psycholinguistics. TKE ES05 Workshop, August 14’th Dublin. CLARIN Project.

Download Presentation

CLARIN Metadata & ISO DCR

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CLARIN Metadata & ISO DCR Daan Broeder. Max-Planck Institute for Psycholinguistics TKE ES05 Workshop, August 14’th Dublin

  2. CLARIN Project • The CLARIN project is a large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily useable for Language & SSH (Social Sciences & Humanities) researchers. • CLARIN EU project and different national CLARIN projects • CLARIN EU WP2 since 2007 investigated and creates (prototypical) solutions for: • Common AAI infrastructure • Single system of persistent identifiers (PIDs) for resources • Common metadata domain • …

  3. Current Metadata Situation Fragmented landscape • Metadata sets, schema & infrastructures in our domain: • IMDI, OLAC/DCMI, TEI • Problems with current solutions: • Inflexible: too many (IMDI) or too few (OLAC) metadata elements • Limited interoperability (both semantic and functional) • Problematic (unfamiliar) terminology for some sub-communities. • Limited support for LT tool & services descriptions

  4. Metadata Components CLARIN chose for a component approach: CMDI • NOT a single new metadata schema • but rather allow coexistence of many (community/researcher) defined schemas • with explicit semantics for interoperability How does this work? • Components are bundles of related metadata elements that describe an aspect of the resource • A complete description of a resource may require several components. • Components may contain other components • Components should be designed for reusability

  5. Metadata Components Lets describe a speech recording Sample frequency Format Size Technical Metadata …

  6. Metadata Components Lets describe a speech recording Language Name Id … Technical Metadata

  7. Metadata Components Lets describe a speech recording Actor Name Age Language Sex Language … Technical Metadata

  8. Metadata Components Lets describe a speech recording Continent Location Country Address Actor … Language Technical Metadata

  9. Metadata Components Project Name Lets describe a speech recording Contact … Location Actor Language Technical Metadata

  10. Metadata Components Project Lets describe a speech recording Location Actor Metadata schema Language Technical Metadata Metadata profile

  11. Metadata Components Project Lets describe a speech recording Location Actor Metadata schema Language Technical Metadata Metadata description Metadata profile

  12. Metadata Components Project Lets describe a speech recording Location Profile definition XML Actor Metadata schema W3C XML Schema Language Technical Metadata Component definition XML Metadata description Metadata profile XML File

  13. Location Text Actor Recording BirthDate Country Language CreationDate Name Dance MotherTongue Title Type Type Coordinates CMDI Component Reuse Component registry User selects appropriate components to create a new metadata profile or an existing profile user Selecting metadata components from the registry

  14. Location Text Actor Recording CreationDate Country BirthDate Language Name Dance Type Type MotherTongue Title Coordinates CMDI Explicit Semantics Component registry User selects appropriate components to create a new metadata profile or an existing profile user Semantic interoperability partly solved via references to ISO DCR or other registry Country dcr:1001 ISOcat concept registry Language dcr:1002 BirthDate dcr:1000 DCMI concept registry Title: dc:title Selecting metadata components from the registry

  15. Text 2 Text 1 Recording CreationDate Language Language Name Dance Title Title Type Type Relation Registry User selects or creates a profile that specifies relations between DCs Component registry User MD search Genre2 Genre1 ISOCat dcr:1020 = dcr:1030 Genre 1 dcr:1020 Language dcr:1002 dcr:1020 ~ dcr:1030 Genre2 dcr:1030 dcr:1020 > dcr:1030 Metadata modelers or terminology expert can also use the RR to specify relations that the ISO DCR can’t store Relation Registry

  16. Search Service ISOcat Concept Registry Semantic Mapping Relation Registry CLARIN Component Registry/Editor DCMI Concept Registry Joint Metadata Repository other Concept Registry Metadata Repository Metadata Repository CMDI Metadata Live-cycle Perform search/browsing on the metadata catalog using the ISO DCR and other concept registries and CLARIN relation registry Create metadata schema from selection of existing components. Allow creation of new components if they have references to ISOcat Metadata harvesting by OAI-PMH protocol Metadata descriptions created Metadata component profile was selected from metadata component registry

  17. CMDI: Browsing the Component Registry

  18. CMDI: Editing a Component

  19. MD Components & Semantic Granularity • Problems with component metadata: too high granularity in the ISOCat • Actor.Name, Actor.Fullname, Actor.Address, Actor.email,… • Creator.Name, …, Creator.email,… • Funder.Name, …,Funder.email • Having a DC for every of these MD elements would explode the ISOcat. Using just generic “Name” loses precision. • Compromise: use fine granularity only for elements that are expected to be often used (CreatorName, ActorName) for searching in metadata. Map the rest to generic “Name” • More fundamental solution: Use container concepts: • create an “Actor” DC, then we can reason with the context. • Actor ~ Participant, Name ~ Fullname -> Actor.Fullname ~ Participant.name

  20. Metadata Thematic Domain • DCs to describe Language Resources & Technology • Chair: Peter Wittenburg, MPI for Psycholinguistics • Started entering data 2009 based on two expert meetings in Athens and guided by: • existing metadata sets: IMDI, OLAC/DC and • inventory: ENABLER • resulted in: 218 DCs • Translation work was initiated via the CLARIN national coordinators. (15 language sections for “audio file format”) • Dutch CLARIN metadata project 2010 added: • 76 new DCs (of which 30 still private)

  21. Some experiences I • The GUI is not too fast • Need a discussion platform to discuss a DC’s attributes. (now solved with the forum function) • UI arrangement. For instance the value domain attributes are not in one panel. (type, data type, value domain • Metadata terms often needed to be linked to DCs that are either too broad or too narrow. (Situation did not merit a new DC). • Search for existing DCs is only effective if you know the terminology.

  22. Some experiences II • Duplicate entries (e.g. “source”). Entry was made before check was in place. • illogical or unsystematic definitions: DC-2512: The name of the person who was participating in the creation project.DC-2454: The name of the person that can be contacted to get access to the resource or to the tool/service.DC-2505: The address of an organization that was/is involved in creating, managing and accessing resource or tool/service.DC-2521: The email address of a person or an organization that is involved in creating, managing or accessing resources or tools/services.DC-2459: The organization that was leading the creation project or that is responsible for accessing the resource and the contact person is affiliated with.DC-2461: The telephone number of a person or an organization that is involved in creating, managing or accessing the resource.

  23. Next Steps for Metadata TD • Expected standardization: October 2010 • Before that we will reexamine all DCs • Build jury -> vote • DCR board -> vote • What happens then? The DCs will get new PIDs • But we have metadata records where the PIDs of “old” DCs are used • Curate. Update the metadata records • Redirect (if owner agrees) to standardized version • Make use of Relation registry: old_DC == new_DC • ?

  24. Thank you for your attention CLARIN has received funding fromthe European Community's Seventh Framework Programmeunder grant agreement n° 212230

  25. WS05: Standardizing Data Categories in ISOcat: Implementing Group Work for Thematic DomainsWS05: Standardizing Data Categories in ISOcat: Implementing Group Work for Thematic Domains

  26. CMDI Architecture I • Division into: • MD Producer components • MD Exploitation or consumer components • OAI-PMH components • Knowledge components: DCR, Relation Registry • The CMDI takes an archivist or “production” first viewpoint • Prioritize that the metadata can be of good quality: consistent, coherent, correctly linked to the concept registries • The consumer side can be more “experimental” and diverse. • Many MD exploitation “stacks” or consumers can work in parallel on the same metadata

  27. Concept registries • Basically a list with concepts and their descriptions where every concept has a unique identifier. • Some have a complicated structure and are associated with elaborate (administrative) processes to determine the status and acceptation of concepts in the registry. e.g. ISO-DCR. • others are static and simple lists of concepts and descriptions e.g. DCTERMS

  28. ISO DCR • ISO-DCR is important for more CLARIN objectives then metadata and is under control of the linguistic community (ISO-TC37) • is an implementation of the model defined in ISO 12620 , offering a GUI and programming APIs • Every DC Is subject to a standardization process and carries information on the status of that process • Metadata is just one of 13 Thematic Domains in the DCR • Can contain no relations between the DCs, only a value domain relation is possible.

  29. CMDI Architecture II ISO TDG MD Catalog Virtual Collection Registry Metadata modeler ISO-Cat DCR user MD Services Semantic mapping Services MD Comp. Editor MD Comp. Registry Relation Registry MD Editor. External agents MD Creator Local MD Repository CLARIN Joint MD Repository OAI-PMH Service Provider OAI-PMH Data provider

  30. Current CMDI status I • ISO-DCR: 218 metadata concepts • CMDI component registry: 135 components, 19 profiles Produced & inspired by: • Deconstructing existing metadata schema IMDI, OLAC, TEI • Considering requirements of other CLARIN activities like profile matching • CLARIN NL metadata project tested the CMDI model and delivered components and profiles for the resources in two major Dutch Language Resource centers

  31. Current CMDI status II Operational or test phase: • ISOCat DCR • Component registry & editor • ARBIL metadata editor Still working on: • Joint Metadata Repository, Metadata Catalog, Semantic Mapping, Relation Registry Expect a usable first version in third quarter 2010

  32. CMDI contributors Collaboration on the CMDI implementation • MPI for Psycholinguistics: metadata modeling and editing facilities • Språkbanken, University of Gothenburg: Joint CLARIN metadata repository • Austrian Academy: Metadata catalog, metadata & semantic mapping services • IDS: Virtual Collection Registry • MPG / CLARIN NL: ISO-DCR • DFKI: Relation Registry

  33. Common metadata domain Why a common metadata domain: • Finding and sharing resources housed at all archives & repositories participating in CLARIN • Specify distributed heterogeneous collections of LRs and processing these collections • In general, a common metadata domain helps bringing along a single domain of LRs

  34. Text 2 Text 1 Recording CreationDate Language Language Name Dance Title Title Type Type Relation Registry User selects or creates a profile that specifies relations between DCs Component registry MD search user Genre2 Genre1 ISOCat dcr1020 = dcr1030 Genre 1 dcr:1020 Language dcr:1002 dcr1020 ~ dcr1030 Genre2 dcr:1030 dcr:1020 > dcr:1030 Relation Registry MD modeler

  35. Text 2 Text 1 Recording CreationDate Language Language Name Dance Type Type Title Title Relation Registry Component registry Genre2 Genre1 ISOCat Genre 1 dcr:1020 Language dcr:1002 Genre2 dcr:1030

More Related