XML-Based Language Archiving

P. Wittenburg, H. Brugman, D. Broeder, A. Russel
Max-Planck-Institute for Psycholinguistics
peter.wittenburg@mpi.nl
www.mpi.nl
www.mpi.nl/DOBES
XML Workshop, Lisbon, May 2004

The MPI Archive
• the MPI language resource archive is the backbone of our research
• it can be compared with a fusion reactor in physics
• for more than 100 people it is the research instrument
• it is an instrument not only for our researchers but also for others
  • international collaborators
  • speech communities (not yet ready for them)
  • classes (university, schools)
  • journalists
  • …
• it is a dynamic instrument: it changes constantly and its size varies
• many researchers and teams contribute, all in different ways and at different speeds
  • teams from outside and inside
• what are we talking about?
  • in total more than 30,000 sessions (recording units)
  • every session has media files, annotation files, etc.
  • in addition many textual resources (lexica, field notes, …) and images
  • all together (> 8/2 TB)

Some terms
• Archive: the full and organized collection of all language resources
• Corpus: a subset of resources from the archive, created by a researcher or a research team with a specific linguistic purpose in mind (recursive definition)
• Metadata (in general): all secondary data derived from primary data such as recordings, texts, …
• Metadata (here): keyword-type descriptions of typical characteristics of sessions, for discovery and management purposes, embedded in a metadata organization

What is there?
• Gesture & Speech data
• Multimodal data
• Sign Language resources
• Split-brain resources
• Child Language Acquisition data
• Adult Language Acquisition data
• Speech Corpora (Dutch Spoken Corpus for in-house use)
• Cross-lingual resources
• Minority language resources, elicited
• Minority language resources, non-elicited
• …
• Endangered language resources (DOBES)

DOBES programme
[Figure labels: Chintang/Puma, Tofa, Svan/Tush, Hocank, Wichita, Salar/Monguor, Chol, Mawe, Lacandon, Tsafiki, Ega, Waima’a, Kuikuro, Uru-Chipaya, Trumai, Teop, Aweti, Chaco, Hai//om, !Xoo, Iwaidja, Marquesan]
• started September 2000 with 8 teams in a pilot phase
• now 25 documentation teams
UNESCO Seminar Vilnius March 2004

DOBES programme
[Figure labels: Tofa, Kuikuro, Salar/Monguor, Aweti, Waima’a, Trumai; sample text: “la enen i bu taha k’omu ruo bu wai-dura loo ligasaun ini sire ruo laka khuu rahmhutu busa”]

The No Organization
[Figure label: The CHAOS X]
• all individuals and teams act in a completely uncoordinated way
• MPI had this situation and still suffers from it sometimes

Archive as a Multi-User Instrument
[Figure label: The Archive]
• all individuals and teams create independently but ingest in a coordinated manner
• corpus management is expensive -> LAMS

Motivations at the beginning
• our archive is one of many on the Internet: create an integrated domain of language resources for the users
  • easy integration with others
• tools have to operate in a local environment as well (field linguist)
• different types of users would like to access the material
• the physical layer will change continuously (new storage technology, …)
  • access to data via a virtual layer (almost ready for URIDs)
• different types of metadata descriptions
  • core plus X

IMDI Metadata Model
• metadata is the glue that keeps everything together
  • bundles media and annotations
  • bundles lexica, grammars etc. with languages
  • bundles field notes with trips
  • contains references to physical locations
  • etc.
• the physical layer is for the system managers (you never know what they do)
• look at IMDI metadata also as a virtual distributed file system
• all in schema-based XML
[Figure labels: IMDI domain; the “boring” layer; different organization layers; Lund; MPI; Kilivila; Trumai; Spencer; Dialect; info files; lexica; grammar; text; sound; image; movie; annotations; eye movements]

IMDI Metadata Model
• metadata is the vehicle to support discovery (browsing & searching)
• metadata is the vehicle to carry out archive management (starting)
  • check consistency
  • carry out copying actions for others
  • take care of access management for in-house and external users
  • associate Unique Resource Identifiers (URIDs)
• MPI/Lund/INL are now turning this into an Archive Management system
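The "check consistency" task can be illustrated with a minimal sketch: confirm that every resource a metadata record points to is actually stored. The element names (`Session`, `MediaFile`, `AnnotationFile`, `ResourceLink`), the paths, and the function name `missing_resources` are assumptions made for this illustration. They are not taken from the IMDI schema or from the MPI management tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified session record (not the real IMDI schema).
SESSION_XML = """
<Session>
  <Name>s1</Name>
  <MediaFile><ResourceLink>media/s1.mpg</ResourceLink></MediaFile>
  <AnnotationFile><ResourceLink>annot/s1.eaf</ResourceLink></AnnotationFile>
</Session>
"""

def missing_resources(xml_text, exists):
    """Return the resource links in a record for which exists(link) is false."""
    root = ET.fromstring(xml_text)
    links = [el.text.strip() for el in root.iter("ResourceLink") if el.text]
    return [link for link in links if not exists(link)]

# Stand-in for the physical layer: the set of paths that are really stored.
stored = {"media/s1.mpg"}
missing = missing_resources(SESSION_XML, stored.__contains__)
print(missing)
```

Against a real file system, `os.path.exists` could be passed as the `exists` argument instead of the set lookup.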

IMDI Metadata Set
• details can be found at www.mpi.nl/IMDI and www.mpi.nl/ISLE
• stabilized over > 4 years
• emerged from broad discussions with LE, FL, SL, …
• is a result of the ISLE project
• is used in INTERA, ECHO, DOBES and other institutions and initiatives
• is a structured set (participants -> age, language, …)
• is a rich metadata set compared to Dublin Core
• is based on proper concept definitions using linguistic terminology
• besides core elements
  • also elements for multimodal corpora, lexica, written resources
• is based on an XML schema
• is based on several schema-based controlled vocabularies
• allows extensions by key-value pairs
• supports profiles (special extensions, for example for Sign Language Com)
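Three properties from the list above (structured elements, controlled vocabularies, key-value extensions) can be shown together in a small sketch. Everything named here is hypothetical: the element names, the vocabulary values, and the function `check_session` are illustrative simplifications. The authoritative definitions are in the IMDI schema itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified record: a structured participant description
# plus a free key-value extension. Not the real IMDI schema.
SESSION_XML = """
<Session>
  <Name>demo_session</Name>
  <Participant>
    <Age>34</Age>
    <Language>Trumai</Language>
    <Sex>Female</Sex>
  </Participant>
  <Keys>
    <Key Name="ProjectCode">demo-01</Key>
  </Keys>
</Session>
"""

# Hypothetical controlled vocabulary for one element.
SEX_VOCABULARY = {"Male", "Female", "Unknown", "Unspecified"}

def check_session(xml_text):
    """Validate vocabulary-controlled values and collect key-value extensions."""
    root = ET.fromstring(xml_text)
    problems = []
    for participant in root.iter("Participant"):
        sex = participant.findtext("Sex", default="")
        if sex not in SEX_VOCABULARY:
            problems.append(f"Sex value not in vocabulary: {sex!r}")
    extensions = {key.get("Name"): (key.text or "") for key in root.iter("Key")}
    return problems, extensions

problems, extensions = check_session(SESSION_XML)
print(problems, extensions)
```

The extension keys are returned as an open dictionary, while the core elements are checked against fixed values. This mirrors the "core plus X" idea.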

IMDI Infrastructure
• the IMDI basis is made up of linked XML files
• a distributed infrastructure is simple to achieve
• everyone can build his/her own services!
• MPI (and others …) provide open source tools
• databases are special instances for special purposes (searching, access management, OAI harvesting, …)
[Figure labels: XML files; browsing; harvesting search tool; XML DB for searching & management; HTML browsing tool; XSLT on-the-fly conversion; management tools; XML DB with DC records; OAI-type harvesting; other services]
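Because the basis is a set of linked XML files, a service such as a harvester only needs to follow links and flatten what it finds. The sketch below walks a small in-memory set of linked files and emits one flat, Dublin-Core-like record per session. The URLs, element names, and the function `harvest` are assumptions for illustration. The conversion is written in plain Python here, not in XSLT, and it is not the IMDI tool implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory "web" of linked metadata files: URL -> XML text.
DOCUMENTS = {
    "http://example.org/root.xml": """
        <Corpus><Name>Archive</Name>
          <Link>http://example.org/trumai.xml</Link>
          <Link>http://example.org/tofa.xml</Link>
        </Corpus>""",
    "http://example.org/trumai.xml": """
        <Corpus><Name>Trumai</Name>
          <Link>http://example.org/trumai_s1.xml</Link>
        </Corpus>""",
    "http://example.org/tofa.xml": "<Corpus><Name>Tofa</Name></Corpus>",
    "http://example.org/trumai_s1.xml": """
        <Session><Name>s1</Name><Language>Trumai</Language></Session>""",
}

def harvest(url, fetch=DOCUMENTS.__getitem__, seen=None):
    """Follow links depth-first and yield one flat record per session file."""
    seen = set() if seen is None else seen
    if url in seen:          # guard against cyclic links
        return
    seen.add(url)
    root = ET.fromstring(fetch(url))
    if root.tag == "Session":
        yield {
            "identifier": url,
            "title": root.findtext("Name"),
            "language": root.findtext("Language"),
        }
    for link in root.iter("Link"):
        yield from harvest(link.text.strip(), fetch, seen)

records = list(harvest("http://example.org/root.xml"))
print(records)
```

Replacing `fetch` with a function that retrieves files over HTTP would let the same traversal run across several institutions' files.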

IMDI Tools
• browsing & searching: IMDI Browser & IE
• corpus structure generation: Excel, Treebuilder
• metadata editing: IMDI Editor, Excel
• session exploitation via several immediately executable programs
[Figure labels: IMDI Domain via INTERNET; Lund University; MPI; ESF; DOBES; Tofa; Trumai; S (repeated)]
[Slide footers: DOBES Training, DOBES Overview, May 10-14, 2004; HRELP Workshop London, November 2003]

What about the resources?
• immediate strategy
  • convert everything to archivable formats
  • get as much coherence as possible
• video: MPEG2 (derived objects such as MPEG1/4, SMIL, …)
• audio: 16 bit linear PCM/48 kHz (derived objects such as MP3)
• images: JPEG (although compressed), TIFF
• annotations: EAF, a modern XML-based annotation format
  • receive CHAT, Shoebox, Word, Database stuff, Transcriber, …
• lexica: nothing yet; rely now on LMF (coming ISO norm)
  • receive Shoebox, Word, Excel, Database stuff
• texts: plain text, html
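A check of incoming audio against the stated archival target (16 bit linear PCM at 48 kHz) might look like the sketch below. It uses Python's standard `wave` module. The function names `is_archival_audio` and `make_wav` are made up for this example, and the check is an illustration, not the archive's actual ingest procedure.

```python
import io
import wave

def is_archival_audio(wav_file):
    """Return True for WAV audio with 16-bit samples at 48 kHz.

    The wave module reads uncompressed PCM data, so a file that opens
    successfully here is linear PCM.
    """
    with wave.open(wav_file, "rb") as w:
        return w.getsampwidth() == 2 and w.getframerate() == 48000

def make_wav(sample_rate, sample_width):
    """Build a tiny silent mono WAV in memory (test helper for this sketch)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(sample_width)   # bytes per sample
        w.setframerate(sample_rate)
        w.writeframes(b"\x00" * sample_width * 10)
    buf.seek(0)
    return buf

print(is_archival_audio(make_wav(48000, 2)))
print(is_archival_audio(make_wav(44100, 2)))
```

`wave.open` also accepts a file path, so the same function can be pointed at files on disk.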

ELAN Annotation Format
• basis is the Abstract Corpus Model
  • checked whether it has enough representational power
  • very much in line with AG from Bird & Liberman
• ordered annotations on typed tiers
• time references or symbolic references
• dependencies
• details in schema or papers
• flexible with respect to tier number and types
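The core ideas listed above (typed tiers, ordered annotations, time references versus symbolic references, dependencies between tiers) can be sketched as a small data structure. The class and field names below are assumptions chosen for illustration. They do not reproduce the EAF schema or the Abstract Corpus Model.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Annotation:
    ann_id: str
    value: str
    start_ms: Optional[int] = None   # time reference (set for time-aligned annotations)
    end_ms: Optional[int] = None
    ref_id: Optional[str] = None     # symbolic reference to another annotation

@dataclass
class Tier:
    tier_id: str
    tier_type: str                      # e.g. "utterance", "translation"
    parent_tier: Optional[str] = None   # dependency on another tier
    annotations: list = field(default_factory=list)

def time_span(ann, index):
    """Resolve an annotation's time span, following symbolic references."""
    while ann.start_ms is None:
        if ann.ref_id is None:
            raise ValueError(f"{ann.ann_id} has neither time nor reference")
        ann = index[ann.ref_id]
    return ann.start_ms, ann.end_ms

utterances = Tier("utt", "utterance", annotations=[
    Annotation("a1", "first utterance", start_ms=0, end_ms=1500),
])
translations = Tier("tr", "translation", parent_tier="utt", annotations=[
    Annotation("a2", "its translation", ref_id="a1"),
])

# Index all annotations by id so references can be followed.
index = {a.ann_id: a for t in (utterances, translations) for a in t.annotations}
print(time_span(index["a2"], index))
```

Only the time-aligned annotation stores times. The dependent one inherits them through its reference, so the two tiers cannot drift apart.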

Resource Exploitation Tools
• browsing & searching: IMDI Browser & IE
• combined Web-based exploitation & commentary frameworks
[Figure labels: IMDI Domain via INTERNET; Lund University; MPI; ESF; DOBES; Tofa; Trumai; S (repeated); ELAN; HTML; WMP; SMIL]

Looking Back
• made the basic decisions for all the work about 5 years ago
• decisions were not too bad
  • in particular the decision to rely on XML as the basic representation format and to use a DB only for specific purposes
  • everything is open (given access rights) and in a good state
  • everything can be distributed
  • everything is in well-documented archival formats
  • have developed supporting tools
  • have thought about long-term persistence (5 copies right now)
• had to pay a price for going this way
  • everything was fairly new at the beginning
  • less support for
    • nice UI
    • DB-intrinsic integrity checks
    • integrated “search”
    • …

Looking Forward
• have a clear vision
• need to integrate across archives
  • people don’t want to see MPI
  • they want to see Trumai, Tofa, …
  • therefore ISO is important
• have to open up our archives for simple exploitation
• have to create simple commentary frameworks
• have to create mobility frameworks (David Nathan, ELAR)
• need middleware to establish stable and manageable Data GRIDs
• XML will remain our key pillar
[Figure labels: The DELAMAN GRID; MPI; ESF; DOBES/MPI; EMELD; ELAR; ANLC; AILLA; AMPM; PARADISEC; LACITO; Combined Web-based exploitation & commentary frameworks]

End
Thanks for your attention
