
Menzo Windhouwer The Language Archive – DANS tla.mpi.nl menzo.windhouwer@dans.knaw.nl

Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry


Presentation Transcript


  1. Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry Menzo Windhouwer The Language Archive – DANS tla.mpi.nl menzo.windhouwer@dans.knaw.nl eHg - New Trends in e-Humanities

  2. The Language Archive • Founded in September 2011 • Supported by MPG, BBAW and KNAW (DANS) • Grown out of the Technical Group at the MPI for Psycholinguistics • Since the 1990s: challenge of archiving digital data • 2000 – 2016 Volkswagen Foundation DOBES project on Endangered Languages • Active in many European infrastructure projects: CLARIN, EUDAT, DASISH, …

  3. Language Archiving Technology • Full lifecycle support • Core: resources • Key: metadata • ‘New’: CMDI, ISOcat, AV recognition, … • Archive size: • 70 TB of resources • 22,000 hours of AV recordings • 75,000 sessions (metadata) • 5 million annotated segments • 50 lexica • My focus: Knowledge Systems • LEXUS, an online lexicon tool • ISOcat and companions

  4. Typological Database Nijmegen TOP NOTION tds:Noun GROUPS { NOTION tdn:GrammaticalDistinctions LABEL "Grammatical distinctions for nouns." GROUPS { NOTION tdn:AgentNouns LABEL "Agent nouns." DESCRIPTION "Nouns can function as the agent of a clause." LINK TO CONCEPT agentRole GROUPS { NOTION tdn:v098_plusAffix LABEL "Agent nouns formed by verb stem plus affix." LINK TO CONCEPTS (agentRole, verbalMorphology, boundAffix) DESCRIPTION <p>Agent nouns are formed by a verb stem plus an affix, e.g. English <qv>walk-er</qv>.</p> NOTE AUTHOR IS "TDS" TYPE IS "original TDN label" "AGENT NOUNS ARE VERB STEM PLUS AFFIX" IS FIELD v098; ... Explicit semantics! Notes: TDN is not archived in TLA but curated in TDS, a previous project I worked on, and now archived at DANS; also, this is not a TDN punchcard.

  5. DOBES corpora • Shared semantics! • Explicit semantics!

  6. Oxford English Dictionary • Source: http://www.oxford-royale.co.uk/news/2010/12/04/new-online-edition-of-oxford-english-dictionary.html

  7. Terminology Community of Practice • Community started out on paper (A5 fiches), just like the OED • 1980s to 1990s: projects to standardize data category names, i.e., the names of the ‘fields’ on the fiches, in the files, or in the database records • ISO 12620:1999 Data Categories, a companion standard to ISO 12200 Machine-readable terminology interchange format (MARTIF)

  8. ISO 12620:1999

  9. Towards a Data Category Registry • Problems with ISO 12620:1999, a hardcoded list of data categories • Not easily extensible • Ordering heavily debated • Outdated and limited in range at the moment of release • Developments • In the SALT project an interchange model (TBX) based on MARTIF/data categories was created, which was widely adopted • ISO 11179 Metadata Registries was released, which describes the standardization of data element concepts for metadata • ISO released Annex ST Standards as databases, which describes an ISO procedure to standardize registry entries • In the LIRICS project a pilot Data Category Registry, SYNTAX, was created

  10. ISO 12620:2009 • Terminology and other content and language resources — Specification of data categories and management of a Data Category Registry for language resources • A data model for data category specifications inspired by ISO 11179 • A procedure to standardize data category specifications compliant with Annex ST • Each data category gets a unique Persistent Identifier (PID) • The Max Planck Institute for Psycholinguistics is appointed as the Registration Authority of the ISO/TC 37 DCR • In use by a growing number of ISO TC 37 standards • Lexical Markup Framework (LMF) • Linguistic Annotation Framework (LAF) • Morpho-syntactic Annotation Framework (MAF) • … • could be more, e.g., Feature System Declarations (FSD)

  11. Example Data Category specification • Data category: /Grammatical gender/ • Administrative part: • Identifier: grammaticalGender • PID: http://www.isocat.org/datcat/DC-1297 • Descriptive part: • English definition: Category based on (depending on languages) the natural distinction between sex and formal criteria. • French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels. • Linguistic part: • Morphosyntax conceptual domain: /masculine/, /feminine/, /neuter/ • French conceptual domain: /masculine/, /feminine/
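
The three parts of the specification on slide 11 can be sketched as a small data structure. This is only an illustration: the class name `DataCategory`, its field names, and the domain keys `"morphosyntax"` and `"fr"` are assumptions made for this sketch and are not the ISOcat data model or API. The values themselves are taken from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataCategory:
    """Hypothetical in-memory model of one data category specification."""

    # Administrative part
    identifier: str
    pid: str
    # Descriptive part: language code -> definition text
    definitions: Dict[str, str] = field(default_factory=dict)
    # Linguistic part: conceptual domain name -> allowed values
    conceptual_domains: Dict[str, List[str]] = field(default_factory=dict)

    def allowed_values(self, domain: str) -> List[str]:
        """Return the value domain for a profile or language, or [] if none is defined."""
        return self.conceptual_domains.get(domain, [])


grammatical_gender = DataCategory(
    identifier="grammaticalGender",
    pid="http://www.isocat.org/datcat/DC-1297",
    definitions={
        "en": "Category based on (depending on languages) the natural "
              "distinction between sex and formal criteria.",
        "fr": "Catégorie fondée (selon la langue) sur la distinction naturelle "
              "entre les sexes ou d'autres critères formels.",
    },
    conceptual_domains={
        "morphosyntax": ["masculine", "feminine", "neuter"],  # key name is an assumption
        "fr": ["masculine", "feminine"],                      # key name is an assumption
    },
)
```

Keeping the conceptual domains keyed by profile or language mirrors the slide: the same data category allows /neuter/ in the general morphosyntax domain but not in the French one.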

  12. Standardization procedure (diagram; groups: Submission group, Thematic Domain Group, Data Category Registry Board, Decision Group, Stewardship group; stages: Evaluation, Validation, Publication; both Evaluation and Validation can end in "rejected")

  13. Thematic Domain Groups • TDG 1: Metadata • TDG 2: Morphosyntax • TDG 3: Semantic Content Representation • TDG 4: Syntax • TDG 6: Language Resource Ontology • TDG 7: Lexicography • TDG 8: Language Codes • TDG 9: Terminology • TDG 11: Multilingual Information Management • TDG 12: Lexical Resources • TDG 13: Lexical Semantics • TDGs are the owners and guardians of a coherent subset of the DCR • TDGs own one or more profiles • Each TDG has a chair • A number of members assigned by SC P-members • A number of expert members invited by the chair (up to 50%) • TDGs are constituted at the TC37/SC plenary • New TDGs need to be proposed by an SC • Translation • (Sign language)

  14. ISOcat - the ISO TC 37/DCR • A (coherent) set of Data Categories, in our case for linguistic resources • A system to manage this set: • Create and edit Data Categories • Share Data Categories, e.g., resolve PID references • Standardize Data Categories • An API for tools to access the DCR • Grass roots approach • Anyone can access the DCR and use or create the data categories (s)he needs

  15. Referring to ISOcat data categories • PIDs of data categories can easily be embedded in XML documents <lmf:LexicalEntry> <tei:f name="partOfSpeech" dcr:datcat="http://www.isocat.org/datcat/DC-1345" fVal="commonNoun" dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/> <lmf:Lemma type="Form"><tei:f name="writtenForm" dcr:datcat="http://www.isocat.org/datcat/DC-1836" fVal="clergyman"/> </lmf:Lemma> </lmf:LexicalEntry> • Embedding in other formats is also possible, e.g., via comments • Preferably annotate schemas, so a whole range of resources is annotated in one go
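
A tool consuming such a document could collect the embedded PIDs as follows. This is a minimal sketch using Python's standard `xml.etree.ElementTree`. The slide omits the namespace declarations, so the URIs bound to `lmf`, `tei` and `dcr` below are assumptions (the `lmf` one is a pure placeholder), and `collect_datcat_refs` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the slide does not show the declarations.
NS = {
    "lmf": "urn:example:lmf",               # placeholder, not a real LMF namespace
    "tei": "http://www.tei-c.org/ns/1.0",   # assumed
    "dcr": "http://www.isocat.org/ns/dcr",  # assumed
}

FRAGMENT = """
<lmf:LexicalEntry xmlns:lmf="urn:example:lmf"
                  xmlns:tei="http://www.tei-c.org/ns/1.0"
                  xmlns:dcr="http://www.isocat.org/ns/dcr">
  <tei:f name="partOfSpeech"
         dcr:datcat="http://www.isocat.org/datcat/DC-1345"
         fVal="commonNoun"
         dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
  <lmf:Lemma type="Form">
    <tei:f name="writtenForm"
           dcr:datcat="http://www.isocat.org/datcat/DC-1836"
           fVal="clergyman"/>
  </lmf:Lemma>
</lmf:LexicalEntry>
""".strip()


def collect_datcat_refs(xml_text):
    """Return (feature name, datcat PID, valueDatcat PID or None) per annotated element."""
    root = ET.fromstring(xml_text)
    # ElementTree stores namespaced attributes under '{uri}local' keys.
    datcat = "{%s}datcat" % NS["dcr"]
    value_datcat = "{%s}valueDatcat" % NS["dcr"]
    refs = []
    for elem in root.iter():  # document order, root included
        if datcat in elem.attrib:
            refs.append((elem.get("name"), elem.get(datcat), elem.get(value_datcat)))
    return refs


refs = collect_datcat_refs(FRAGMENT)
```

Because the PIDs are plain attribute values, the same lookup works for any XML vocabulary that carries `dcr:datcat` annotations, which is why annotating the schema once can cover a whole range of resources.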

  16. A glimpse of ISOcat

  17. Collaboration in ISOcat • Registered users can contact each other via mediated email • Ask the owner if a data category can be adapted a little to your needs • Registered users can start up a group and invite other users to join • Work together on a set of data categories • Interact via a public and/or private forum • A group can submit data categories for ISO standardization

  18. Component MetaData Infrastructure • CMDI is developed by CLARIN and on its way to standardization by ISO TC 37 • Limitations of existing metadata schemas: DC/OLAC, IMDI, TEI header • Inflexible: too many (IMDI) or too few (OLAC) metadata elements • Limited interoperability (both semantic and syntactic) • Problematic (unfamiliar) terminology for some sub-communities • Limited support for LT tool & service descriptions • The idea is to address this by: • Explicitly defined schema & semantics • User/project/community defined components

  19. CMDI architecture (diagram; components: ISOcat, Relation Registry, component registry & editor, metadata editor, metadata catalogue, search & semantic mapping, joint metadata repository, local metadata repository, OAI-PMH service provider, OAI-PMH data provider, data; roles: metadata modeler, metadata creator, metadata curator, metadata user)

  20. Athens Core • Bootstrapped the selection of Metadata data categories in ISOcat • Based on existing metadata standards, e.g., DC, OLAC, IMDI, TEI • Many translations in European languages • Users add the data categories they need to the Metadata profile and use them in CMDI

  21. CMDI architecture (diagram; components: ISOcat, Relation Registry, component registry & editor, metadata editor, metadata catalogue, search & semantic mapping, joint metadata repository, local metadata repository, OAI-PMH service provider, OAI-PMH data provider, data; roles: metadata modeler, metadata creator, metadata curator, metadata user)

  22. CMDI architecture (diagram; components: ISOcat, Relation Registry, component registry & editor, metadata editor, metadata catalogues (VLO, MI), search & semantic mapping, joint metadata repository, local metadata repository, OAI-PMH service provider, OAI-PMH data provider, data; roles: metadata modeler, metadata creator, metadata curator, metadata user)

  23. CMDI (intermediate) results • Diverse metadata profiles • Centers or projects create specific ones, but reuse components where possible • Shared and explicit semantics help to overcome • Terminological differences • Differences in structure • Future • Get more context sensitive • e.g. documentation language vs. speaker language • Crosswalks • equivalent metadata data categories are easily introduced due to the open nature of ISOcat • User specific relationships • e.g. theory specific differences can be more important to one user than to another

  24. Metadata TDG • Standardization efforts of the Metadata TDG stalled • Large overlap with the work/people at the Athens Core meetings • Community level agreement is maybe enough • Motivation for activity should not depend on one person only, the TDG chair • The need for explicit and shared semantics is not clear enough yet … more evangelization needed • Unfamiliarity with the work • Terminologists are more used to this kind of review work • Online review vs. old ISO ‘paper’ process • Members have little time, and it is difficult to sync schedules • TDG experts tend to be senior scientists • Continuous process vs. sporadic bursts of activity • Unpaid work • Project funding vs. wide acceptance in the community • However, a project might bootstrap a thematic domain • The same problems hold for other TDGs • Current tendency to tie data category (selection) standardization to a new/revised standard, e.g., MAF and TBX • Redesign of the standardization process is coming up • ISO is not actively supporting Annex ST Standards as Databases anymore

  25. Community efforts • LMF-related: UBY, RELISH/GOLD • Sign Language • CLARIN • CMDI, Athens Core • CLARIN-NL/VL • Call 1 – 4 projects created CMDI and annotated resources/schemas • ISOcat content coordinator: Ineke Schuurman • Tutorials, guidelines (do’s and don’ts) and feedback • Better community support in ISOcat • Views, e.g., CLARIN-NL/VL • Recommended by, e.g., DC-4949 • …

  26. Conclusions and future work • Communities can already create a coherent view on ISOcat • the CMDI use case shows potential • funder support may be needed to bootstrap specific domains • The standardized core will take (a long) time • like all standardization work • Next to metadata, also content • explicit semantics would be profitable even when not shared and/or used for resource discovery • tools that support ISOcat will make it easier to create such resources • Companion registries: • relations between data categories (RELcat) • annotated schemas for language resources (SCHEMAcat) • interaction with the CLARIN vocabulary service (CLAVAS) • Data categories vs. concepts

  27. Detour: ISOcat and LOD/Semantic Web • Archives and infrastructures look at the resources as they are, i.e., in general no conversions to triples • However, ISOcat data categories can easily be used in RDF resources :partOfSpeech dcr:datcat <http://www.isocat.org/datcat/DC-396> ; rdfs:label "part of speech"@en ; rdfs:comment "A category assigned to a word based on its grammatical and semantic properties."@en . • The Relation Registry, which is a triple store, will in general support lightweight, semi-formal ontologies • M. Windhouwer, S.E. Wright. Linking to linguistic data categories in ISOcat. LDL 2012.
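
The RDF snippet on slide 27 can also be generated programmatically. The sketch below uses only the Python standard library and writes the three statements as N-Triples lines. The slide does not show prefix declarations, so the namespace bound to `dcr:` and the base IRI for `:partOfSpeech` are assumptions (the latter is a hypothetical example.org address); `describe_property` is a hypothetical helper, not part of any ISOcat tooling.

```python
DCR = "http://www.isocat.org/ns/dcr.rdf#"       # assumed namespace for dcr:
RDFS = "http://www.w3.org/2000/01/rdf-schema#"  # standard RDFS namespace
EX = "http://example.org/vocab#"                # hypothetical base for ':' names


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return "<%s>" % value


def literal(text, lang):
    """Build a language-tagged literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"@%s' % (escaped, lang)


def describe_property(local_name, pid, label, comment, lang="en"):
    """Return N-Triples lines linking a local property to an ISOcat data category PID."""
    subject = iri(EX + local_name)
    return [
        "%s %s %s ." % (subject, iri(DCR + "datcat"), iri(pid)),
        "%s %s %s ." % (subject, iri(RDFS + "label"), literal(label, lang)),
        "%s %s %s ." % (subject, iri(RDFS + "comment"), literal(comment, lang)),
    ]


triples = describe_property(
    "partOfSpeech",
    "http://www.isocat.org/datcat/DC-396",
    "part of speech",
    "A category assigned to a word based on its grammatical and semantic properties.",
)
```

The point of the sketch is that the data category PID is just another IRI in object position, so no conversion of the underlying resources is needed to link a vocabulary term to ISOcat.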

  28. Thank you for your attention! • Visit www.isocat.org • Questions? www.isocat.org/forum/ isocat@mpi.nl • Acknowledgements: thanks to everyone at TLA, Sue Ellen Wright, Ineke Schuurman, Marc Kemps-Snijders, CLARIN-NL, CLARIN, ISO TC 37

  29. A whole litter of cats! (diagram; registries: Schema Registry - SCHEMAcat, Data Category Registry - ISOcat, Concept Registry, Relation Registry - RELcat; other labels: Linguistic resource (schema), Linguistic knowledge base, Data categories, Containers, Concepts, Relation)

  30. ISO 11179: concepts vs. data elements/categories (diagram; label: ISO 12620 Data Categories)
