
DELAMAN / DAM-LR - the vision -

DELAMAN / DAM-LR - the vision -. Digital Endangered Languages and Music Archives Network / Distributed Access Management for Language Resources (EU project, started 1.1.05). Peter Wittenburg, MPI for Psycholinguistics.


Presentation Transcript


  1. DELAMAN / DAM-LR - the vision - Digital Endangered Languages and Music Archives Network / Distributed Access Management for Language Resources (EU project, started 1.1.05). Peter Wittenburg, MPI for Psycholinguistics

  2. When did “we” start?
     • it is only 5 years since our discipline started talking about
       • large digital online collections
       • standardizing the formats
       • open metadata to arrive at browsable and searchable domains
       • using open metadata to create well-organized archives
     • LREC Athens 2000
       • first workshop on these issues
       • start of the ISLE project (linguistic concepts, lexicon, metadata, …)
       • start of the work on the IMDI metadata infrastructure
     • in late 2000 also the first LDC workshop with OLAC as focus
     • this is a very short time when you want to convince a community

  3. What did we achieve?
     • have “large” online digital archives/collections/Digital Libraries
       • MPI: ~40,000 session bundles (> 100,000 objects) / ~11 TB
       • DOBES: ~1,500 session bundles / 1,500 h
       • AILLA archive
       • PARADISEC archive
       • Lund corpus archive
       • also larger data centers in the HLT domain
       • also “traditional” archives (Phonogramm Archiv, NAA, …)
       • etc.
     • the idea of web visibility and online accessibility is spreading
     • the necessity of central data collection and preservation is spreading

  4. What did we achieve?
     • much evangelization and agreement about standards
       • “everyone” agrees on XML, Unicode and linear PCM
       • “everyone” understands the relevance of schemas to make linguistic structure and encoding explicit
       • with JPEG and MPEG we are shooting at a moving target, but we don’t yet have real alternatives

  5. What did we achieve?
     • interoperability is still a dream, however …
       • have metadata gateways in our discipline (OLAC-IMDI)
       • increasingly often tools produce correct XML, Unicode, …
       • have filters for character encodings and formats, although we miss well-designed and comprehensive services
       • have started ontology work to tackle the linguistic aspects
         • GOLD ontology from E-MELD
         • ISO TC37/SC4 Data Category Registry
         • TDS (Dutch Typology Project) meta-language
         • EAGLES/ISLE/TEI specifications
     • we are at the beginning
       • cannot yet speak about fully operational infrastructures
       • but there are island tools like FIELD, LEXUS, ONTO-ELAN, …

  6. Changing role of Language Archives
     [Diagram labels: different groups of people contribute; The Archive; specialists maintain, unify, check quality, etc.; different groups of people use the content]
     • at the MPI it is understood that the archive is the capital to build on
     • in the DOBES programme the point is to make results explicit and accessible
     • this only works if we don’t have an “inert, dusty” archive
     • language archives are dynamic!

  7. DOBES / MPI Archives as Example

  8. Vision for a single archive
     [Diagram legend: done; in progress; to start]
     [Diagram labels: Archive Utility Layer; Ontological Knowledge; User Authentication; Access Rights; Metadata Tools; Data Ingestion & Management; Web-based Archive Exploration; Annotation Exploration; Lexicon Exploration; Text Exploration; (Web-based) Archive Enrichment; Media Annotation; Lexical Encoding; Web Commentary; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]

  9. Content Organization
     [Diagram labels: Archive Contents; IMDI domain; mydomain; yourdomain; info files; IMDI schema; EAF schema; myanno structure; youranno structure; LEXUS schema; t-lex structure; k-lex structure; …; Kilivila; Trumai; Tofa; Tseltal; info files; t-lexicon; k-lexicon; grammar; …; mytext, mysound, myimage, mymovie, myannotations; yourtext, yoursound, yourimage, yourmovie, yourannotations; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]

  10. IMDI Based Virtual Layer (corp man)
      [Diagram labels: IMDI domain; mydomain; yourdomain; info files; Kilivila; Trumai; Tofa; Tseltal; t-lexicon; k-lexicon; grammar; …; mytext, mysound, myimage, mymovie, myannotations; yourtext, yoursound, yourimage, yourmovie, yourannotations]
      [Slide footer: Access Management, Nijmegen, November 2004]
      • researcher is free to define the structure
      • MD descriptions have to be correct (IMDI schema and CV)
      • fully distributed domain
      • sufficient to register the root URL
      • searching requires harvesting
      • HTML browsing requires harvesting
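The point that only the root URL is registered, and that searching therefore requires harvesting, can be sketched as a breadth-first crawl. This is a minimal illustration under stated assumptions, not the IMDI tooling itself: the `example.org` URLs, the `FAKE_WEB` table and the `fetch`, `harvest` and `search` functions are all hypothetical, and a real harvester would download and parse metadata files instead of reading a dictionary.

```python
from collections import deque

# Hypothetical stand-in for metadata files published on distributed servers.
# Each URL maps to (description text, links to child metadata files).
FAKE_WEB = {
    "https://example.org/root.imdi": ("root corpus", [
        "https://example.org/trumai.imdi",
        "https://example.org/kilivila.imdi",
    ]),
    "https://example.org/trumai.imdi": ("Trumai sessions", []),
    "https://example.org/kilivila.imdi": ("Kilivila sessions", [
        "https://example.org/root.imdi",  # a cycle, to show it is handled
    ]),
}


def fetch(url):
    """Return (description, child links) for one metadata file (assumed helper)."""
    return FAKE_WEB[url]


def harvest(root_url, fetch):
    """Follow links from the registered root and collect every description."""
    seen = {root_url}
    queue = deque([root_url])
    records = []
    while queue:
        url = queue.popleft()
        description, links = fetch(url)
        records.append((url, description))
        for link in links:
            if link not in seen:  # visit each metadata file only once
                seen.add(link)
                queue.append(link)
    return records


def search(records, word):
    """Search the harvested copy; the distributed files are not contacted again."""
    word = word.lower()
    return [url for url, description in records if word in description.lower()]


index = harvest("https://example.org/root.imdi", fetch)
hits = search(index, "trumai")
```

The design choice illustrated here is that the search runs against the harvested index, which is why a fully distributed domain still needs a harvesting step before it becomes searchable or browsable as HTML.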

  11. Ingestion & Management
      [Diagram labels: Resource Ingestion; IMDI domain; MPI; DOBES; info files; lexica; grammar; …; Trumai; Tofa; Kilivila; Tseltal; Data Ingestion & Management; text, sound, image, movie, annotations, eye movements; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]
      • upload/define structure
      • upload/define sessions
      • upload resources
      • link resources
      • define access policy
      • system to carry out checks
      • LAMUS Light almost ready

  12. IMDI Metadata Infrastructure
      [Diagram labels: Archive Utility Layer; Metadata Tools; Treebuilder; HTML Browser; XML Browser; Editor; Data Ingestion & Management; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]

  13. Access & User Management
      [Diagram labels: Archive Utility Layer; User Authentication; Access Rights; Metadata Tools; Data Ingestion & Management; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]

  14. Access Management
      [Diagram labels: domain of open metadata descriptions; MPI; CM; domain of control; personX, personY, personZ; delegation; domain of resources to be protected; text, sound, image, movie, annotations, eye movements, info files]
      • current solution is centralized: one database
      • has a delegation mechanism to make administration tractable
      • association of declarations etc. is possible
      • powerful commands from any node to give rights to groups
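The combination of one central database, delegation and rights given to groups from any node can be sketched as follows. This is a hedged illustration, not the MPI system: the `AccessDB` class, its method names, the `corpus_manager` account and the `trumai_team` group are assumptions made for the example; only the personX/personY names and the node names come from the slides.

```python
class AccessDB:
    """One central store of rights over a tree of archive nodes (hypothetical)."""

    def __init__(self, root, root_manager):
        self.parent = {root: None}              # node -> parent node
        self.managers = {root: {root_manager}}  # node -> users who administer it
        self.readers = {root: set()}            # node -> groups with access

    def add_node(self, node, parent):
        self.parent[node] = parent
        self.managers[node] = set()
        self.readers[node] = set()

    def _path_to_root(self, node):
        while node is not None:
            yield node
            node = self.parent[node]

    def can_manage(self, user, node):
        # Managing a node implies managing everything below it.
        return any(user in self.managers[n] for n in self._path_to_root(node))

    def delegate(self, by, to, node):
        """Hand administration of a subtree to another person."""
        if not self.can_manage(by, node):
            raise PermissionError(f"{by} may not administer {node}")
        self.managers[node].add(to)

    def grant(self, by, group, node):
        """Give a group access to a node and everything below it."""
        if not self.can_manage(by, node):
            raise PermissionError(f"{by} may not administer {node}")
        self.readers[node].add(group)

    def can_read(self, group, node):
        return any(group in self.readers[n] for n in self._path_to_root(node))


db = AccessDB("MPI", "corpus_manager")
db.add_node("DOBES", "MPI")
db.add_node("Trumai", "DOBES")
db.delegate("corpus_manager", "personX", "DOBES")  # manager delegates to personX
db.delegate("personX", "personY", "Trumai")        # personX delegates further
db.grant("personY", "trumai_team", "Trumai")       # personY gives rights to a group
```

Delegation keeps administration tractable because the central manager only hands over subtrees; here personY can give rights on the Trumai node but has no say over the rest of DOBES.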

  15. Web-based Annotation Exploitation
      [Diagram labels: Archive Utility Layer; User Authentication; Access Rights; Metadata Tools; Data Ingestion & Management; Archive Exploitation; Annotation Exploitation; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]

  16. Web-based Lexicon Exploitation
      [Diagram labels: Archive Utility Layer; User Authentication; Access Rights; Metadata Tools; Data Ingestion & Management; Archive Exploitation; Annotation Exploitation; Lexicon Exploitation; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]

  17. Web-based Text Exploitation
      [Diagram labels: Archive Utility Layer; User Authentication; Access Rights; Metadata Tools; Data Ingestion & Management; Archive Exploitation; Annotation Exploitation; Lexicon Exploitation; Text Exploitation; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]
      • meant for field notes, grammars, ethno notes, etc.
      • nothing concrete yet, but least complex to implement

  18. Web-based Archive Exploitation
      [Diagram labels: Archive Utility Layer; User Authentication; Access Rights; Metadata Tools; Data Ingestion & Management; Archive Exploitation; Annotation Exploitation; Lexicon Exploitation; Text Exploitation; ?; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]

  19. Ontology Support Necessary
      [Diagram labels: Archive Utility Layer; Ontological Knowledge; User Authentication; Access Rights; Metadata Tools; Data Ingestion & Management; Archive Exploitation; Annotation Exploitation; Lexicon Exploitation; Text Exploitation; ?; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]
      [Example abbreviations shown: mo = morpho; n = noun; …]

  20. The Problem
      Annotation: trans = dog, POS = noun
      Lexicon: form = dog, wordclass = no
      Annotation: ortho = dog, PS = n
      • this is not the same for a stupid search engine

  21. Central Solution
      Annotation: trans = dog, POS = noun (trans = cat 107, POS = cat 229, noun = cat 531)
      Lexicon: form = dog, wordclass = no (form = cat 107, wordclass = cat 229, no = cat 531)
      Annotation: ortho = dog, PS = n (ortho = cat 107, PS = cat 229, n = cat 531)
      Central ISO DCR: cat 107 = orthographic transcription; cat 229 = part-of-speech; cat 531 = noun
      • contains all relevant linguistic definitions
      • can refer to them
      • given linguistic differences not realistic
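The central solution can be sketched as a lookup in which every resource declares how its own field names and values refer to the shared category numbers. The category numbers and tag names are the ones on the slide; the dictionary layout, the resource keys and the `to_categories` function are hypothetical and only illustrate the idea.

```python
# Central registry, with the three example categories from the slide.
DCR = {
    107: "orthographic transcription",
    229: "part-of-speech",
    531: "noun",
}

# Each resource refers to the central categories with its own terminology.
RESOURCE_MAPPINGS = {
    "annotation_a": {"trans": 107, "POS": 229, "noun": 531},
    "lexicon": {"form": 107, "wordclass": 229, "no": 531},
    "annotation_b": {"ortho": 107, "PS": 229, "n": 531},
}


def to_categories(resource, record):
    """Rewrite a {field: value} record in terms of central category numbers.

    Values without a category (such as the word form "dog") are kept as-is.
    """
    mapping = RESOURCE_MAPPINGS[resource]
    return {mapping[field]: mapping.get(value, value)
            for field, value in record.items()}


a = to_categories("annotation_a", {"trans": "dog", "POS": "noun"})
b = to_categories("lexicon", {"form": "dog", "wordclass": "no"})
c = to_categories("annotation_b", {"ortho": "dog", "PS": "n"})
```

After the rewrite the three records are identical, so a search engine that compares category numbers instead of tag names no longer sees them as different.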

  22. Individual Solution
      Annotation: trans = dog, POS = noun
      Lexicon: form = dog, wordclass = no
      Annotation: ortho = dog, PS = n
      Linguist’s mapping file: trans = ortho = form; POS = PS = gramcat; n = noun = no
      • means a lot of work for all individuals
      • given time constraints not realistic
      • will start with this version
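A linguist's mapping file of this kind can be read as a list of equivalence classes. The sketch below assumes one class per line with names separated by `=`, which is a guess at the file layout based on the slide; the `parse_mapping` and `normalise` functions are hypothetical.

```python
# The three equivalence classes from the slide, one per line (assumed layout).
MAPPING_FILE = """\
trans = ortho = form
POS = PS = gramcat
n = noun = no
"""


def parse_mapping(text):
    """Map every name to the first name of its equivalence class."""
    canonical = {}
    for line in text.splitlines():
        names = [name.strip() for name in line.split("=") if name.strip()]
        for name in names:
            canonical[name] = names[0]
    return canonical


def normalise(record, canonical):
    """Rewrite field names and values to their canonical form where known."""
    return {canonical.get(field, field): canonical.get(value, value)
            for field, value in record.items()}


canonical = parse_mapping(MAPPING_FILE)
first = normalise({"trans": "dog", "POS": "noun"}, canonical)
second = normalise({"ortho": "dog", "PS": "n"}, canonical)
```

Note the limitation of this naive version: it rewrites any value that happens to match a mapped name, so a word form spelled "no" would be treated as the noun tag. A real mapping would need to say which fields carry category values.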

  23. Proper Solution
      [Diagram labels: Domain of Ontologies; Search Engine; central ISO DCR, MPI DCR and personal DCR, each with relations]
      • there will be many knowledge sources
      • how long will it take to be there?
      • nevertheless, we have to start now!
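One way to picture relations between a personal DCR, the MPI DCR and the central ISO DCR is a chain of references that a search engine follows until it reaches a term with no further relation. Everything in this sketch is an assumption made for illustration: the registry names, the example relations and the `resolve` function are hypothetical and do not describe an existing service.

```python
# registry -> {term: (target registry, target term)}; all entries are examples.
RELATIONS = {
    "personal": {"PS": ("mpi", "gramcat"), "n": ("mpi", "noun")},
    "mpi": {"gramcat": ("iso", "part-of-speech"), "noun": ("iso", "noun")},
    "iso": {},
}


def resolve(registry, term):
    """Follow relations until a term has no outgoing relation."""
    seen = set()
    while (registry, term) not in seen and term in RELATIONS.get(registry, {}):
        seen.add((registry, term))  # guards against circular relations
        registry, term = RELATIONS[registry][term]
    return registry, term


resolved = resolve("personal", "PS")
unchanged = resolve("personal", "mytag")
```

Terms without any relation stay in their own registry, which matches the expectation on the slide that many knowledge sources will coexist.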

  24. Web-Based Annotation
      [Diagram labels: Archive Utility Layer; Ontological Knowledge; User Authentication; Access Rights; Metadata Tools; Data Ingestion & Management; Archive Exploitation; Annotation Exploitation; Lexicon Exploitation; Text Exploitation; Archive Enrichment; Media Annotation; ?; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]
      • for now: first download, annotate and upload
      • online annotation later

  25. Web-based Lexicon Editing
      [Diagram labels: Archive Utility Layer; Ontological Knowledge; User Authentication; Access Rights; Metadata Tools; Data Ingestion & Management; Archive Exploitation; Annotation Exploitation; Lexicon Exploitation; Text Exploitation; Archive Enrichment; Media Annotation; Lexical Encoding; ?; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]

  26. Web-based Commentary
      [Diagram labels: Archive Utility Layer; Ontological Knowledge; User Authentication; Access Rights; Metadata Tools; Data Ingestion & Management; Archive Exploitation; Annotation Exploitation; Lexicon Exploitation; Text Exploitation; Archive Enrichment; Media Annotation; Lexical Encoding; Web Commentary; ?; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]
      [Example commentary shown: Comment: This is an interesting relation; Type: Semantic; Author: Peter Wittenburg; Date: 27.9.2004]

  27. Language Archives - The Vision
      [Diagram labels: Archive Utility Layer; Ontological Knowledge; User Authentication; Access Rights; Metadata Tools; Data Ingestion & Management; Archive Exploitation; Annotation Exploitation; Lexicon Exploitation; Text Exploitation; Archive Enrichment; Media Annotation; Lexical Encoding; Web Commentary; ?; The Archive; Domain of Registered Primary and Secondary Resources; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies; User]

  28. Cross-Archive Dimension: DELAMAN / DAM-LR Visions

  29. DELAMAN / DAM-LR Map
      [Map labels: EMELD, ELAR, INL, MPI, Lund, ANLC, AILLA, AMPM, LACITO, AIATSIS, PARADISEC]

  30. Exchange Resources
      [Diagram labels: archive A (Raw Data, Metadata); archive B (Raw Data, Metadata); data exchange for data survival reasons]
      • have to take care of long-term data preservation
      • only chance is world-wide distribution

  31. Joint Access Domain
      • users want to work across administrative boundaries
      [Diagram labels: DOBES Archive (Raw Data: DOBES Trumai; Metadata); AILLA Archive (Raw Data: AILLA Trumai; Metadata); my personal Trumai archive: not just copies but the result of one's own creative process]

  32. Goals
      • it’s about future usage scenarios with distributed archives
      • it’s about federated language resource archives
      • it’s about eScience scenarios in linguistics
      • want to exchange data automatically (list driven)
      • want to allow people to create integrated virtual working spaces
      • want to have an integrated access management domain (one identity, rights go with the copies, …)
      • first talks in Nijmegen and at HRELP workshops 2003
      • foundation at PARADISEC meeting in Sydney 2003
      • last workshop in Nijmegen, November 2004
        • linguists
        • archivists
        • (GRID) technologists

  33. Technologies
      • much of the technology needed to achieve our goals is available
        • A-Select authentication system
        • Shibboleth authorization system
        • Handle System for URID resolving
        • distributed metadata environment such as IMDI
        • Storage Resource Broker for federated resources
        • Web Services for layered services
        • …

  34. Links
      • DELAMAN Web-Site: www.delaman.org
      • DELAMAN Workshop-Site: www.mpi.nl/delaman/workshop
      • DOBES Web-Site: www.mpi.nl/DOBES
      • MPI Archive Web-Site: www.mpi.nl/world/corpus
      • MPI Tools Web-Site: www.mpi.nl/tools
