
Standards and Tools: DOBES and CLARIN Views - resumé after about 8 years -

Peter Wittenburg, André Moreira. The Language Archive - Max Planck Institute; CLARIN European Research Infrastructure.


Presentation Transcript


  1. Standards and Tools: DOBES and CLARIN Views - resumé after about 8 years - Peter Wittenburg, André Moreira. The Language Archive - Max Planck Institute; CLARIN European Research Infrastructure

  2. Content
     • CLARIN vs. DOBES - differences?
     • Tools vs. Standards - differences?
     • Overall Comparison
     • TLA Team - Landscape and Strategy
     • Technology - Mainstream influences
     • Conclusions

  3. DOBES vs. CLARIN
     • DOBES is about the documentation of endangered languages (like many other comparable initiatives)
     • documentation teams are under time pressure
     • thus efficiency is required (transcription: 1-35, translation: 1-25)
     • can be facilitated by good tools
     • documentation certainly is for this generation of researchers, speech communities, students, public, etc. (primary focus of DOBES and teams)
     • documentation is also for future generations
     • documents are part of our cultural heritage
     • languages encode knowledge about nature and cultures
     • historical material helps us find our identity
     • therefore DOBES has a short-term and a long-term challenge

  4. DOBES vs. CLARIN
     • CLARIN is about an interoperable + persistent infrastructure for LRT
     • landscape is fragmented and nothing fits together
     • thus researchers working on data can't be efficient (knowledge workers spend 40% of their time on finding resources, making things compatible, etc.)
     • can be facilitated by good standards and agreements
     • infrastructure certainly is for this/next generation of researchers, students, "citizen scientists", etc.
     • enables "better" research if it is "data-driven"
     • infrastructure is also for future generations
     • ensuring access to our research records
     • lots of data is highly endangered !!!
     • comparing "old" data with "new" data
     • therefore CLARIN has a short-term and a long-term challenge

  5. DOBES vs. CLARIN: interoperability
     • DOBES
     • community of documenting field linguists
     • is interoperability an issue? well, I still don't know
     • interoperable with whom?
     • cross-corpus work based on data is still to come
     • of course some practical barriers (language)
     • CLARIN
     • infrastructure covering "all" language resources & tools (named entity recognition is relevant for everyone)
     • is interoperability an issue: YES - it's in the focus
     • otherwise there are always barriers to tackling relevant questions
     • otherwise data-driven research is too expensive
     • it seems that here is a clear difference in primary objectives

  6. DOBES and CLARIN

  7. DOBES and CLARIN

  8. DOBES and CLARIN

  9. DOBES and CLARIN

  10. DOBES and CLARIN

  11. Tools vs. Standards
     • who dares to doubt that
     • tools determine our "productivity"
     • tools influence attractiveness of solutions
     • people are used to tools - who wants to learn new stuff?
     • tools need to be egocentrically built
     • development is expensive (UI)
     • fast development cycles are necessary
     • SW management is very expensive and eats up person power
     • ~ 80 % of all software developments fail
     • a lot of the SW developed will die quickly since there is not enough money to maintain it
     • tools have a short lifecycle, on average about 10 years of functionality

  12. Tools vs. Standards
     • who dares to doubt that
     • standards live almost forever
     • de facto lifetime comparatively high
     • standards are in general not attractive for users, except for some XML "fans"
     • standards should be hidden and only experts need to read all documents
     • standards building has some form of altruism (if big industry is not involved)
     • costs a lot of time and effort (ISO TC37/SC4 started 2002 at LREC)
     • risk of being quickly outdated
     • will a standard be accepted?
     • implementing standards in tools can be expensive (moving target, complexity of standard, etc.)

  13. Tools and Standards

  14. Tools and Standards

  15. Tools and Standards

  16. all together
     • for CLARIN no separation - symbiosis between short-term tool support and long-term interoperability facilitation
     • for DOBES there seems to be a difference

  17. Landscape for TLA Team
     • being archivist and providing access to stored material in DOBES (+MPI)
     • being in the core of CLARIN/EUDAT infrastructure development
     • a few major questions:
     • how can we preserve bit streams and interpretability over a long period?
     • how can we give access to heterogeneous resources and also support resource creation and manipulation/enrichment?
     • have about 71 lexica (and many different annotation types)
     • 61 in the archive, 10 active in LEXUS
     • created by different tools
     • using different structures
     • using different categories (lexical attributes)
     • how can we build "generic" tools and frameworks that can cope with heterogeneity - we cannot build/maintain SW that is too specifically targeted
     • how can we build SW in a scenario where there are so many smart developers out there?
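The heterogeneity problem on this slide (lexica created by different tools, using different lexical attributes) can be illustrated with a minimal sketch. It is not the TLA or LEXUS implementation: the source names, field names and shared category names below are hypothetical and chosen only for illustration. The idea is that a "generic" tool keeps one mapping table per source instead of one code path per tool.

```python
from typing import Dict

# Hypothetical mapping tables: tool-specific field name -> shared category name.
# None of these names are taken from a real registry; they are assumptions.
CATEGORY_MAP: Dict[str, Dict[str, str]] = {
    "toolbox_like": {"lx": "lemma", "ps": "partOfSpeech", "ge": "gloss"},
    "spreadsheet_like": {"Headword": "lemma", "POS": "partOfSpeech", "English": "gloss"},
}


def normalize_entry(entry: Dict[str, str], source: str) -> Dict[str, object]:
    """Rename the fields of one lexicon entry to the shared category names.

    Fields without a known mapping are kept under "_unmapped" so that
    no information is silently dropped.
    """
    mapping = CATEGORY_MAP[source]
    normalized: Dict[str, object] = {}
    unmapped: Dict[str, str] = {}
    for field, value in entry.items():
        if field in mapping:
            normalized[mapping[field]] = value
        else:
            unmapped[field] = value
    if unmapped:
        normalized["_unmapped"] = unmapped
    return normalized


if __name__ == "__main__":
    a = normalize_entry({"lx": "kata", "ps": "n", "ge": "word"}, "toolbox_like")
    b = normalize_entry({"Headword": "kata", "POS": "n", "English": "word"}, "spreadsheet_like")
    print(a == b)  # both entries end up with the same shared categories
```

Adding a further source in this sketch means adding one mapping table, which is the kind of maintenance cost a generic framework tries to keep low.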

  18. Strategy for TLA Team
     • Rule 1: have a coherent archive of 34/75 TB
     • i.e. convert "everything" to stable formats with explicit syntax/encoding and check quality
     • otherwise long-term curation and access are too expensive
     • costs for late curation and manual migration are extreme
     • Rule 2: base tool development on open and "generic" formats
     • EAF for annotations turned out to be flexible enough over 10 years
     • LMF is a flexible model for lexicon structures
     • "LEGO" approach makes some people frightened
     • but flexibility not even sufficient for field linguists
     • yet no agreement on an exchange format - a disaster
     • ISOcat for registering semantics (is it generic enough?)
     • Rule 3: provide converters and interfaces for major tools/formats
     • Toolbox, CLAN, Transcriber, PRAAT, other XML
     • time-consuming effort (cyclic flow almost impossible)
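Rule 1 (explicit syntax/encoding plus a quality check) can be sketched as a small ingest check. This is an illustration under stated assumptions, not the archive's actual workflow: it assumes resources arrive as XML that should be UTF-8 encoded, and the function name `check_resource` is hypothetical. Only Python standard library calls are used.

```python
import xml.etree.ElementTree as ET
from typing import List


def check_resource(data: bytes) -> List[str]:
    """Return a list of problems found in one resource (empty list = passed).

    Assumed policy for this sketch: the resource must be valid UTF-8
    and well-formed XML. A real archive would also validate against a schema.
    """
    problems: List[str] = []
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Without a reliable encoding there is no point in parsing further.
        problems.append(f"not valid UTF-8: {exc.reason}")
        return problems
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        problems.append(f"not well-formed XML: {exc}")
    return problems


if __name__ == "__main__":
    print(check_resource(b"<doc><tier/></doc>"))   # no problems expected
    print(check_resource(b"<doc><tier></doc>"))    # mismatched tag is reported
```

Running such a check at ingest time is cheap compared with the late curation and manual migration costs the slide warns about.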

  19. Is our Strategy Successful?
     • very difficult to answer - what are the criteria?
     • strategy allows us to be coherent with both DOBES and CLARIN
     • strategy was broad enough to help establish TLA
     • although
     • LMF turned out to be very expensive for us
     • much time investment to participate in x meetings
     • little understanding from NLP hardcore guys
     • can't even claim to be 100% compliant, or can we?
     • some years of instability of the model, thus changes of code, thus slowing down development
     • invent own interchange format for archiving purposes (RELISH ??)
     • modern lexica are complex objects with inclusions of objects (images, a/v fragments, internal and archived resources, etc.)
     • finally an approach based on flexible standards will pay off
     • but it takes more time

  20. Technology (IT) Issues
     • technology innovation is moving ahead with the web as the driving force
     • designs and tools need to be web-ready
     • visibility from everywhere
     • access from everywhere
     • collaboration support
     • annotation (incl. relation drawing) support (there are so many knowledgeable people around)
     • web technology is subject to a high innovation rate
     • frequent re-design of components
     • what is the stable core to keep costs low and make code maintenance feasible?

  21. Conclusions
     • research communities are naturally more interested in tools
     • research infrastructure work needs to find a balance between short- and long-term aspects
     • however, need to store data following general IT principles
     • explicit syntax, declared semantics, open formats
     • need to build better tools to support standards
     • and/or to convince companies to adopt standards
     • but tool building based on standards can be more expensive and time consuming
     • RELISH is very good for comparing TEI, LMF and LIFT
     • RELISH is very good for comparing ISOcat and GOLD
     • we need a strategy for TLA to support one (or two) exchange formats, and one needs to be based on a standard (data will go into the archive)
