
XML-Based Language Archiving

P. Wittenburg, H. Brugman, D. Broeder, A. Russel

Max-Planck-Institute for Psycholinguistics

[email protected]

www.mpi.nl

www.mpi.nl/DOBES

XML Workshop

Lisbon

May 2004


The MPI Archive

  • the MPI language resource ARCHIVE is the backbone for the research

  • it can be compared with a fusion reactor in physics

  • for more than 100 persons it is the research instrument

  • it is an instrument not only for our researchers but also for others

    • international collaborators

    • speech communities (not yet ready for them)

    • classes (university, schools)

    • journalists

  • it is a dynamic instrument: it changes constantly and its size varies

  • many researchers and teams contribute, all in different ways and at different speeds

  • teams from outside and inside

  • what are we talking about?

    • in total more than 30,000 sessions (recording units)

    • every session has media files, annotation files, etc.

    • further many textual resources (lexica, field notes, …) and images

    • all together (> 8/2 TB)



Some terms

Archive: full and organized collection of all language resources

Corpus: a subset of resources from the archive created by a researcher or a research team with a specific linguistic purpose in mind (recursive definition)

Metadata (in general): all secondary data derived from primary data such as recordings, texts, …

Metadata (here): keyword-type description of typical characteristics of sessions for discovery and management purposes, embedded in a metadata organization



What is there?

  • Gesture & Speech data

  • Multimodal data

  • Sign Language resources

  • Split-brain resources

  • Child Language Acquisition data

  • Adult Language Acquisition data

  • Speech Corpora (Dutch Spoken Corpus for in-house use)

  • Cross-lingual resources

  • Minority languages resources elicited

  • Minority languages resources non-elicited

  • Endangered languages resources (DOBES)



DOBES programme

[Map of DOBES documentation projects; labels: Chintang/Puma, Tofa, Svan/Tush, Hocank, Wichita, Salar/Monguor, Chol, Mawe, Lacandon, Tsafiki, Ega, Waima’a, Kuikuro, Uru-Chipaya, Trumai, Teop, Aweti, Chaco, Hai//om, !Xoo, Iwaidja, Marquesan]

  • started September 2000 with 8 teams in a pilot phase

  • now 25 documentation teams

UNESCO Seminar

Vilnius

March 2004


DOBES programme (continued)

[Figure: panels labelled Tofa, Kuikuro, Salar/Monguor, Aweti, Waima’a and Trumai; embedded text sample:]

la enen i

bu taha k’omu ruo bu wai-dura loo ligasaun ini

sire ruo laka khuu rahmhutu busa


The No Organization

The CHAOS

all individuals and teams act in a completely uncoordinated way

the MPI had this situation and still suffers from it sometimes



Archive as a Multi-User Instrument

The Archive

all individuals and teams create independently but ingest in a coordinated manner

corpus management is expensive -> LAMS



Motivations at the beginning

  • our archive is one of many on the Internet: make an integrated domain of language resources for the users

  • easy integration with others

  • tools have to operate in a local environment as well (field linguist)

  • different types of users would like to access the material

  • physical layer will change continuously (new storage technology, …)

  • access to data via virtual layer (almost ready for URIDs)

  • different types of metadata descriptions

  • core plus X



IMDI Metadata Model

  • metadata is the glue that keeps all together

    • bundles media and annotations

    • bundles lexica, grammars etc with languages

    • bundles field notes with trips

    • contains references to physical locations

    • etc

  • the physical layer is for the system managers (never know what they do)

[Diagram: the IMDI domain with different organization layers on top of the “boring” physical layer. Node labels: Lund, MPI, Kilivila, Trumai, Spencer, Dialect, info files, lexica, grammar, …. Resource types: text, sound, image, movie, annotations, eye movements]

look at IMDI Metadata also as a virtual distributed file system

all in schema-based XML
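
The idea of metadata as a virtual distributed file system can be sketched in a few lines. This is only an illustration under stated assumptions: the element names (`Corpus`, `CorpusLink`, `Session`, `Resource`), the attribute names and the file names are hypothetical and are not taken from the IMDI schema, and the "distributed" files are simulated by an in-memory dictionary.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory "archive": each key stands for the location of one
# metadata file. Element and attribute names are illustrative, not the IMDI schema.
DOCS = {
    "mpi.xml": """<Corpus name="MPI">
                    <CorpusLink href="trumai.xml"/>
                    <CorpusLink href="kilivila.xml"/>
                  </Corpus>""",
    "trumai.xml": """<Corpus name="Trumai">
                       <Session name="story01">
                         <Resource type="sound" href="media/story01.wav"/>
                         <Resource type="annotations" href="annot/story01.eaf"/>
                       </Session>
                     </Corpus>""",
    "kilivila.xml": """<Corpus name="Kilivila">
                         <Session name="interview03">
                           <Resource type="movie" href="media/interview03.mpg"/>
                         </Session>
                       </Corpus>""",
}


def list_resources(location, docs, path=()):
    """Walk the linked metadata files depth-first and yield
    (virtual path, resource type, physical reference) triples."""
    root = ET.fromstring(docs[location])
    here = path + (root.get("name"),)
    for session in root.findall("Session"):
        for res in session.findall("Resource"):
            yield "/".join(here + (session.get("name"),)), res.get("type"), res.get("href")
    # Follow links to other metadata files, which could live on other servers.
    for link in root.findall("CorpusLink"):
        yield from list_resources(link.get("href"), docs, here)


listing = list(list_resources("mpi.xml", DOCS))
for virtual_path, kind, href in listing:
    print(f"{virtual_path:25} {kind:12} {href}")
```

The user only sees the virtual path (corpus, sub-corpus, session); the physical reference can change without affecting how the resource is found.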



IMDI Metadata Model

  • metadata is the vehicle to support discovery (browsing & searching)

  • metadata is the vehicle to carry out archive management (starting)

    • check consistency

    • carry out copying actions for others

    • take care of access management for in-house and externals

    • associate Unique Resource Identifiers (URIDs)

    • MPI/Lund/INL now turn this into Archive Management system
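
Two of the management tasks listed above, consistency checking and associating identifiers, might look roughly as follows. This is a minimal sketch, not the archive management system itself: the element names, the set standing in for the physical storage, and the `urid:` identifier scheme are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical session description; element names are illustrative only.
SESSION_XML = """<Session name="story01">
  <Resource type="sound" href="media/story01.wav"/>
  <Resource type="annotations" href="annot/story01.eaf"/>
  <Resource type="movie" href="media/story01.mpg"/>
</Session>"""

# Stand-in for the physical layer: the set of files that actually exist.
STORED_FILES = {"media/story01.wav", "annot/story01.eaf"}


def check_session(xml_text, stored_files):
    """Return the resource references that do not resolve to a stored file."""
    session = ET.fromstring(xml_text)
    return [res.get("href") for res in session.findall("Resource")
            if res.get("href") not in stored_files]


def assign_identifiers(xml_text, prefix):
    """Map an identifier (made-up scheme) to each resource reference."""
    session = ET.fromstring(xml_text)
    return {f"{prefix}:{session.get('name')}:{i}": res.get("href")
            for i, res in enumerate(session.findall("Resource"), start=1)}


missing = check_session(SESSION_XML, STORED_FILES)
ids = assign_identifiers(SESSION_XML, "urid")
print("dangling references:", missing)
print(ids)
```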



IMDI Metadata Set

  • details can be found at www.mpi.nl/IMDI and www.mpi.nl/ISLE

  • stabilized over > 4 years

  • emerged from broad discussions with LE, FL, SL, …

  • is a result of the ISLE project

  • is used in INTERA, ECHO, DOBES and other institutions and initiatives

  • is a structured set (participants -> age, language, …)

  • is a rich metadata set compared to Dublin Core

  • is based on proper concept definitions using linguistic terminology

  • besides core elements, also has elements for multimodal corpora, lexica and written resources

  • is based on an XML schema

  • is based on several schema-based controlled vocabularies

  • allows extensions by key-value pairs

  • supports profiles (special extensions for example for Sign Language Com)
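
The combination of a structured core set, controlled vocabularies and key-value extensions can be illustrated with a small data model. The field names, the vocabulary entries and the `Handedness` key below are hypothetical; the real element names and vocabularies are defined by the IMDI schema and are not reproduced here.

```python
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary; the real IMDI vocabularies are
# published separately and are not reproduced here.
GENRE_VOCABULARY = {"narrative", "conversation", "elicitation", "song"}


@dataclass
class Participant:
    name: str
    age: int
    language: str


@dataclass
class SessionRecord:
    title: str
    genre: str
    participants: list = field(default_factory=list)
    # Free key-value pairs for project-specific extensions.
    keys: dict = field(default_factory=dict)

    def problems(self):
        """List violations of the (assumed) core constraints."""
        found = []
        if self.genre not in GENRE_VOCABULARY:
            found.append(f"genre {self.genre!r} is not in the controlled vocabulary")
        if not self.participants:
            found.append("no participants described")
        return found


record = SessionRecord(
    title="story01",
    genre="narrative",
    participants=[Participant("speaker A", 54, "Trumai")],
    keys={"Handedness": "left"},  # extension outside the core set
)
bad = SessionRecord(title="story02", genre="chat")
print(record.problems())
print(bad.problems())
```

The core fields are checked, while the key-value pairs pass through unchecked; a profile could be thought of as an agreed list of such extra keys.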



IMDI Infrastructure

  • the IMDI basis is made up of linked XML files

  • a distributed infrastructure is simple to achieve

  • everyone can build his/her own services!!!

  • MPI (and others …) provide open source tools

  • databases are special instances for special purposes (searching, access management, OAI harvesting, …)

[Diagram: linked XML files at the center, serving a browsing, harvesting and search tool; a DB for searching & management; an HTML browsing tool via on-the-fly XSLT conversion; management tools; a DB with DC records for OAI-type harvesting via on-the-fly XSLT conversion; and other services]
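
The conversion of rich metadata into Dublin Core (DC) records for OAI-type harvesting can be sketched as follows. The slides name XSLT for this step; the sketch swaps in plain Python with `xml.etree.ElementTree` so that it runs without extra libraries. The source element names and the mapping table are assumptions for illustration; only the Dublin Core namespace URI is standard.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical source record; element names are illustrative, not the IMDI schema.
SOURCE = """<Session>
  <Title>story01</Title>
  <Date>2003-07-14</Date>
  <Language>Trumai</Language>
  <Genre>narrative</Genre>
  <Participants>speaker A</Participants>
</Session>"""

# Assumed mapping from source elements to Dublin Core elements.
MAPPING = {"Title": "title", "Date": "date", "Language": "language", "Genre": "type"}


def to_dublin_core(xml_text):
    """Build a flat Dublin Core record from the richer source record."""
    source = ET.fromstring(xml_text)
    record = ET.Element("record")
    for child in source:
        dc_name = MAPPING.get(child.tag)
        if dc_name is None:
            continue  # richer elements with no DC counterpart are dropped
        ET.SubElement(record, f"{{{DC_NS}}}{dc_name}").text = child.text
    return record


dc_record = to_dublin_core(SOURCE)
print(ET.tostring(dc_record, encoding="unicode"))
```

The mapping is lossy by design: the DC record is a derived view for harvesters, while the XML files stay the authoritative source.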



IMDI Tools

[Diagram: the IMDI domain, reached via Internet, with nodes Lund University, MPI, ESF, DOBES, Tofa and Trumai and sessions marked S; surrounding tool groups:]

  • browsing & searching: IMDI Browser & IE

  • corpus structure generation: Excel, Treebuilder

  • metadata editing: IMDI Editor, Excel

  • session exploitation via several immediately executable programs

DOBES Training

DOBES Overview

May 10-14, 2004

HRELP Workshop

London

November 2003



What about the resources?

  • immediate strategy

    • convert everything to archivable formats

    • get as much coherence as possible

  • video: MPEG2 (derived objects such as MPEG1/4, SMIL, …)

  • audio: 16-bit linear PCM at 48 kHz (derived objects such as MP3)

  • images: JPEG (although compressed), TIFF

  • annotations: EAF, a modern XML-based annotation format

    • incoming formats: CHAT, Shoebox, Word, database exports, Transcriber, …

  • lexica: nothing yet; we now rely on LMF (upcoming ISO standard)

    • incoming formats: Shoebox, Word, Excel, database exports

  • texts: plain text, html
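
A format rule such as "16-bit linear PCM at 48 kHz" can be checked automatically at ingest. The following is a minimal sketch using Python's standard `wave` module; the function names are made up for this example, and the test files are generated in memory instead of being read from an archive.

```python
import io
import wave


def is_archival_pcm(wav_file):
    """True if the WAV stream is uncompressed 16-bit PCM sampled at 48 kHz."""
    with wave.open(wav_file, "rb") as w:
        return (w.getcomptype() == "NONE"
                and w.getsampwidth() == 2      # 2 bytes per sample = 16 bit
                and w.getframerate() == 48000)


def make_wav(rate, sample_width):
    """Create a one-second silent mono WAV in memory for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(sample_width)
        w.setframerate(rate)
        w.writeframes(b"\x00" * rate * sample_width)
    buf.seek(0)
    return buf


ok_48k = is_archival_pcm(make_wav(48000, 2))
ok_44k = is_archival_pcm(make_wav(44100, 2))
ok_8bit = is_archival_pcm(make_wav(48000, 1))
print(ok_48k, ok_44k, ok_8bit)
```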



ELAN Annotation Format

  • basis is the Abstract Corpus Model

  • checked whether it has enough representational power

  • very much in line with the Annotation Graph (AG) model of Bird & Liberman

  • ordered annotations on typed tiers

    • time references or symbolic references

    • dependencies

  • details in schema or papers

  • flexible with respect to tier number and types
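
The notions of typed tiers, time references versus symbolic references, and dependencies can be made concrete with a small data model. This is a sketch of the idea only: the class and field names are hypothetical and do not reproduce the EAF schema, and the annotation values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    ident: str
    value: str
    # Time-aligned annotations carry begin/end in milliseconds ...
    begin_ms: Optional[int] = None
    end_ms: Optional[int] = None
    # ... symbolically referenced ones point at another annotation instead.
    ref: Optional[str] = None


@dataclass
class Tier:
    name: str
    tier_type: str
    parent: Optional[str]   # dependency on another tier, if any
    annotations: list


def interval(ann_id, index):
    """Resolve the time interval of an annotation, following symbolic
    references up the dependency chain until a time-aligned one is found."""
    ann = index[ann_id]
    while ann.ref is not None:
        ann = index[ann.ref]
    return ann.begin_ms, ann.end_ms


tiers = [
    Tier("utterance", "transcription", None,
         [Annotation("a1", "utterance 1", 0, 1200),
          Annotation("a2", "utterance 2", 1200, 2500)]),
    Tier("translation", "free-translation", "utterance",
         [Annotation("a3", "translation of a1", ref="a1")]),
    Tier("comment", "note", "translation",
         [Annotation("a4", "note on a3", ref="a3")]),
]
index = {a.ident: a for t in tiers for a in t.annotations}
print(interval("a4", index))
```

Because the number and types of tiers are not fixed, a project can add a tier (for example gesture coding) without changing the model.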



Resource Exploitation Tools

[Diagram: the IMDI domain, reached via Internet, with nodes Lund University, MPI, ESF, DOBES, Tofa and Trumai and sessions marked S; browsing & searching through the IMDI Browser & IE; combined Web-based exploitation & commentary frameworks: ELAN, HTML, WMP, SMIL]


Looking Back

  • made the basic decisions for all the work about 5 years ago

  • decisions were not too bad

  • in particular the decision to rely on XML as the basic representation format and to use DBs only for specific purposes

    • everything is open (given access rights) and in good state

    • everything can be distributed

    • everything in well-documented archival formats

  • have developed supporting tools

  • have thought about long-term persistence (5 copies right now)

  • had to pay a price for this approach

    • everything was fairly new at the beginning

    • less support for a nice UI

    • no DB-intrinsic integrity checks

    • no integrated “search”



Looking Forward

  • have a clear vision

  • need to integrate across archives

    • people don’t want to see MPI

    • they want to see Trumai, Tofa, …

    • therefore ISO is important

  • have to open up our archives for simple exploitation

  • have to create simple commentary frameworks

  • have to create mobility frameworks (David Nathan, ELAR)

  • need middleware to establish stable and manageable Data GRIDs

  • XML will remain our key pillar

[Diagram: the DELAMAN GRID linking DOBES/MPI, EMELD, ELAR, ANLC, AILLA, AMPM, PARADISEC and LACITO; labels MPI and ESF; combined Web-based exploitation & commentary frameworks]



End

Thanks for the attention
