
LRT Repositories/Archives – State and Future




  1.   LRT Repositories/Archives – State and Future. Peter Wittenburg, MPI for Psycholinguistics, CLARIN Research Infrastructure

  2.   State in LRT Domain
       • everywhere the same:
       • extreme increase in the amount and complexity of primary/secondary research data
       • only very little is visible via “reasonable” portals
       • the state of the resources is in general bad
       • UNESCO: 80 % of the recordings about languages and cultures are highly endangered
       • encoding, structure and terminology are not well described/defined
       • only little is stored in suitable repository/archive systems
       • only few institutes have a proper repository/archive
       • only very few offer deposit services

  3.   Types of Resources in LRT Domain
       • large heterogeneity of resource types
       • semi-structured texts (newspapers, books, etc.)
       • transcriptions
       • annotated media recordings (sound, video)
       • (annotated) time series data: eye tracking, motion tracking, data glove, fMRI, etc.
       • lexica (with multimedia extensions)
       • grammar descriptions
       • tree databases (syntax descriptions)
       • concept registries, relation registries, ontologies
       • metadata descriptions
       • schemas, component schemas
       • etc.
       • referenced objects (resources, collections, fragments)

  4.   Some exceptions
       • such as the DOBES/MPI archive and others from DELAMAN
       • few “traditional” centers such as ELDA/LDC/INL/OTA/BAS ...
       • only few have a clear metadata policy

  5.   Repository “Grid”
       • existing repository “grid” (joint metadata, PIDs, distributed AAI, data exchange)
       • planned extensions in 2008
       • additional metadata harvesting from “centers” (OAI-PMH, XML) -> OLAC
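The metadata harvesting mentioned on this slide can be sketched roughly as follows. This is only a minimal illustration, not the setup used by the repository "grid": the base URL `https://example.org/oai`, the sample response and the helper names `list_records_url` and `parse_records` are assumptions made for the example. Only the `ListRecords` verb, the `metadataPrefix` argument, the `oai_dc` prefix and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL (helper name is hypothetical)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

def parse_records(xml_text):
    """Extract identifier and titles from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        titles = [t.text for t in rec.iterfind(".//dc:title", NS)]
        records.append({"identifier": identifier, "titles": titles})
    return records

# Made-up response from a hypothetical center, used instead of a network call.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:rec1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample recording</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(parse_records(SAMPLE))
```

A real harvester would also have to follow resumption tokens and handle protocol errors, which this sketch leaves out.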

  6.   Professional Repository and Archiving
       • at MPI about 33 terabytes
       • > 250,000 resources
       • 60 million annotations
       • long-term preservation strategy
       • synchronized regional archives are essential
       [Diagram labels: University London; AIATSIS Canberra; 2 Computer Centers in Munich (one from MPG); University Lund; IIAP Iquitos; 2 Copies; MPI Nijmegen; MPI Leipzig; CONICET BA; MdI Rio; 2 Computer Centers in Göttingen (one from MPG); Belem, Tbilisi, Timor, Bangkok, Windhoek, Katmandu, Birmingham, Berlin, Halle, ...; University Kiel]

  7.   What is CLARIN? – very short
       • create an integrated and interoperable landscape of LRT
       • all based on strong centers with repository/archive strategy, variety of services and strong national support
       • i.e. extend what has been started on a small scale
       • centers will form a “kind of federation”
       • shared metadata is one of the key pillars

  8.   CLARIN Initiative
       • CLARIN is an ESFRI Roadmap initiative in SSH
       • 90 member institutes from 31 EU countries
       • EC funded RI with 32 partners from 22 EU countries
       • 25 national commitment statements, i.e. many members are directly involved due to national funding schemes
       • preparatory phase 3 years
       • in some countries already a long-term roadmap concept
       • much interest from non-EU countries (Australia, Japan, Korea, US, South Africa, Brazil, Russia, Argentina, Peru, China)
       • have to show what we can achieve

  9.   What else?
       • LRT has been quite an active community for many years
       • close collaboration with ISO TC37/SC4 (in addition to standards from W3C, OASIS, TEI, etc.)
       • standardized concept registry in progress, including multilingual terminology
       • various generic data models in progress (example: Lexical Markup Framework)
       • standard for unique and persistent identifiers (Handle) in progress
       • new standard for language IDs in progress
       • new framework for flexible metadata in progress
       • etc.

  10.   What else?
       • thus: large overlap with DRIVER
       • CLARIN focuses on research data, which is different
       • CLARIN is discipline oriented
       • DRIVER started from the library domain and is discipline crossing
       [Diagram labels: ?, Grids, GEANT]

  11.   Expectations wrt DRIVER
       • CLARIN is very much interested in a collaboration and in seeing where we can benefit from each other
       • heard a lot about OAI-PMH, but this is the simplest aspect in cross-disciplinary approaches ...
       • understand the semantic mapping problems when creating an integrated metadata domain (STITCH: CH sector in the NL; RoR: in Max Planck Society)
       • figure out which vocabulary is offered to which users
       • is metadata used for research questions or for “accidental” discovery?
       • many other open questions
       • CLARIN would be ready for a test
       • many metadata descriptions ready to be harvested
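The semantic mapping problem named on this slide can be illustrated with a small sketch. Everything in it is assumed for the example: the schema names `schema_a` and `schema_b`, their field names, the shared concept names and the helper `to_shared` are all hypothetical and do not come from CLARIN, DRIVER, STITCH or any real metadata standard. It shows only the basic idea of mapping differently named fields onto shared concepts and keeping track of what cannot be mapped.

```python
# Hypothetical mapping from (schema, field) pairs to shared concept names.
CONCEPT_MAP = {
    ("schema_a", "title"): "resource_name",
    ("schema_b", "Name"): "resource_name",
    ("schema_a", "language"): "subject_language",
    ("schema_b", "LangCode"): "subject_language",
}

def to_shared(schema, record):
    """Split a record into fields mapped to shared concepts and leftovers."""
    mapped, unmapped = {}, {}
    for field, value in record.items():
        concept = CONCEPT_MAP.get((schema, field))
        if concept is None:
            # No agreed concept yet: keep the field visible instead of dropping it.
            unmapped[field] = value
        else:
            mapped[concept] = value
    return mapped, unmapped

if __name__ == "__main__":
    rec_a = {"title": "Field recording 12", "language": "nld"}
    rec_b = {"Name": "Field recording 12", "LangCode": "nld", "Genre": "narrative"}
    print(to_shared("schema_a", rec_a))
    print(to_shared("schema_b", rec_b))
```

The unmapped fields are where the hard questions sit: a one-to-one table like this cannot express fields that only partly overlap in meaning across disciplines.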

  12.   Thanks for your attention.
