slide1

Language Archiving at the MPI

Peter Wittenburg

MPI for Psycholinguistics

DOBES Archive (DOkumentation BEdrohter Sprachen, Documentation of Endangered Languages), funded by the VolkswagenFoundation

[Map labels: NL, G, Rhein, Nijmegen]

slide2

Still a large variety of languages

  • currently 6500 languages world-wide
  • Distribution
    • Africa 1995
    • S/SE Asia 1400
    • New Guinea 1109
    • South America 419
    • North Asia 380
    • Central America 300
    • Pacific Area 250
    • Australia 250
    • North-America 209
    • Europe 209
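
As a quick arithmetic check on the distribution above, the regional counts can be summed and converted to shares. This is only a minimal sketch: the counts are copied from the slide, while the dictionary name, the English region names and the rounding are illustrative choices.

```python
# Regional language counts as listed on the slide.
languages_by_region = {
    "Africa": 1995,
    "S/SE Asia": 1400,
    "New Guinea": 1109,
    "South America": 419,
    "North Asia": 380,
    "Central America": 300,
    "Pacific Area": 250,
    "Australia": 250,
    "North America": 209,
    "Europe": 209,
}

total = sum(languages_by_region.values())

# Share of each region in percent, rounded to one decimal place.
shares = {
    region: round(100 * count / total, 1)
    for region, count in languages_by_region.items()
}

if __name__ == "__main__":
    print(f"total: {total}")
    for region, share in shares.items():
        print(f"{region:16s} {share:5.1f}%")
```

The regional counts add up to 6521, which is consistent with the rounded figure of 6500 languages world-wide.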


slide3

Language endangerment

  • 97% of the people use 4% of the languages
    • 96% of the languages are spoken by 3% of the people
    • approx. 6000 of the languages are spoken by about 200 million people
    • on average: 30,000 speakers per language
      • for 50% fewer than 10,000, for 25% fewer than 1,000
  • for 50% the number of speakers is decreasing dramatically
  • pessimistic view (according to Crystal):
    • 90% of the languages will be extinct around 2100!!
    • i.e. every second week a language becomes extinct!!

slide4

What can we do?

  • Documentation + Revitalization
  • 2000: DOBES Programme of the VolkswagenFoundation
  • many other initiatives and institutions, all meant to be complementary
  • the VolkswagenFoundation is primarily devoted to supporting research
    • teams get funds for documentation (in general 3 years +)
    • there was a very intensive pilot phase full of useful discussions
    • it was obvious that all teams felt the need to help the language communities (including the archiving team)

slide5

How to do a language documentation?

  • based on N. Himmelmann “Documentary and Descriptive Linguistics”
  • Documentation: primary focus is on collection, transcription and translation of primary data (observations, elicitations, ...)
  • Description: primary focus is on linguistic analysis and special phenomena
  • the methods and the results are different

slide6

How to do a language documentation?

  • there is an overlap between the two poles: documentation and description
  • no interlinear description without a morphological analysis
  • Documentation has to
    • deliver a comprehensive representation of the “linguistic habits and traditions”
    • document spoken language in its communicative and cultural context
      • observed linguistic habits and meta-knowledge
      • holistic view of language is important
    • be interesting for other disciplines – in particular primary data
    • help the language community
    • therefore a natural focus on audio and video recordings

slide7

DOBES language documentation

  • language against its cultural background
  • “theory-neutral” representation
  • lots of multimedia (audio, video) recordings as basis
  • where possible base everything on primary data
  • linguistic goals
    • annotations (orthographic transcription, translation, ...)
    • morphological/syntactic analysis for only a small part
    • sketch grammar, limited topic-oriented lexicon
  • ethnologists, musicologists and ethnobiologists are also involved
  • in total about 3 years
  • idea: later generations should be able to reproduce the language
  • material could later be extended

slide8

Traditional annotation

Text Annotation

slide9

Modern annotation

Multimedia Annotation

slide10

DOBES Map

[Map of DOBES documentation projects. Labels: Svan/Udi/Tsova-Tush, Chintang/Puma, Tofa, Nenets, Sami, Hocank, Beaver, Wichita, Mawe/Bakairi/Katxuyana, Salar/Monguor, Chontal, Totoli, Lacandon, Tsafiki, Sri Lanka Malay, Kuikuro, Bora/Ocaina, Semang, Teop, Trumai, Saliba, Waima’a, Aweti, Chipaya, Akhoe Hai//om, !Xoo, Iwaidja, Marquesan, Chaco Languages, Jaminjung; the archive ("Archiv") is also marked.]

  • 30 documentation teams (at MPI also 30 expeditions per year)
  • 1 Archiving Team

slide11

Waima’a (East Timor)

  • Mauricio Belo, Caisido village
  • John Bowden, Australian National University
  • John Hajek, University of Melbourne
  • Nikolaus Himmelmann, Ruhr-Universität Bochum

Consonant inventory (row labels shown as ??? are unreadable in the source):

                                    Labial  (Post-)alveolar  Velar  Glottal
Stops        voiceless unaspirated  (p)     t                k      '
             voiceless aspirated    (ph)    th               kh
             voiceless ejective     p'      t'               k'
             voiced                 b       d                g
Fricatives   plain                  (f)     s                       h
             ???                            s'
Nasals       plain                  m       n
             ???                    mh      nh
             ???                    m'      n'
Laterals     plain                          l
             ???                            lh
             ???                            l'
Tap / trill                                 r, rr
Glides       plain                  w
             ???                    wh
             ???                    w'

Sample text with interlinear glosses:

la enen i
at before PTL
Once upon a time

bu taha k’omu ruo bu wai-dura loo ligasaun ini
HON mud ball and HON cricket make closeness RCP
A ball of mud and a cricket were friends

sire ruo laka khuu rahmhutu busa
3p two go clean together garden
The two of them went to clean the gardens together

slide12

Trumai (Amazonas)

  • Stephen Levinson, MPI Nijmegen 
  • Raquel Guirardello-Damian, Museu Paraense Emílio Goeldi
  • about 100 people
  • about 51 speaking Trumai


slide13

Salar/Monguor (China)

  • Shaman in Huzhu Mongghul county
  • Drummers in the Nadun festival, Minhe county
  • Salar villages along the Yellow River
  • Salar children above Dashyinix village
  • Painting the faces of possessed Wutu, Niandehu township

slide14

Tofa, Tozhu, Tsengel Tuva, Tuha (Siberia)

  • David Harrison (Yale)
  • Brian Donahoe (Manchester)
  • Sven Grawunder (Halle)
  • Language: its structure and sounds.
  • Oral folklore: texts, narratives and personal stories, belief systems, naming systems.
  • Music: singing and sound mimesis.
  • Traditional ecology: nomadism, pastoralism, hunting and reindeer herding.

[Photo: Shaman Ceremony]

slide15

Language documentation for whom?

  • for interested researchers
  • for students and schools
  • for journalists
  • for the interested public
  • for the language communities
  • for future generations


slide16

For language communities

  • language maintenance or even revitalization
  • maintenance of the language, identity, self-confidence
  • creation of school and other educational material
  • support local/regional centers (create and dl complete copies)
  • improve access to archives
  • communities show great interest in recordings – in particular video

slide17

For future generations

  • in a future world of monocultures it will become important to know about earlier diversity
  • as now, it will be important to know one's own roots
  • it may be relevant to point to the different types of languages
  • let’s be honest: we don’t know what future generations will do with the material

slide18

Why archives?

  • many reasons
    • Dietrich Schüller: 80% of our recordings about culture and languages are endangered!
    • storage inadequate (media, formats, PC, ...)
    • selection of suitable technologies requires expert knowledge
    • creation of redundant storage and migration is important
    • requires discipline and has to be independent of individual persons
    • migration to new technologies can be very expensive
  • only centres can do this
  • AND: requires explicitness – at the end a viewable corpus
  • international trend: DOBES, AILLA, ELAR, PARADISEC, LACITO, ...

slide19

What is a “modern” digital archive?

  • traditional archives
    • focus on preserving physical content
    • access not permitted
  • digital archives
    • physical object is almost irrelevant (Tape, CDROM)
    • content has to be preserved
    • why this revolutionary change?
      • copies can be made lossless (let’s be careful with compression)
      • copies can be created with low costs
  • modern digital archive
    • long-term preservation of the content (Migration, Distribution)
    • access to the content
    • enrichment without affecting the content
    • sensitive management of access
  • DOBES has to be a living archive (interactive, expandable)

slide20

Long-term preservation

[Figure: time axis marked 0, 250, 500, 1000 and 2000 years, with the labels "various e-media" and "clay tablets"]

  • can we guarantee survival of bit-streams? NO
  • can we increase the chances of survival? YES
  • our storage media are not adequate
  • how to do it
    • continuous migration (copies to new generation)
    • world-wide distribution (now within Germany/NL)
    • problem of interpretability not solvable
    • have to take care of ethical/legal aspects
    • crucial for survival are maintenance costs
    • all MPI material is available in 7 copies at different locations
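
Keeping several copies at different locations only helps if damaged copies can be detected. One common way to do that is to compare checksums across the copies. The following is a minimal sketch under assumptions: the slides do not say how the MPI archive verifies its copies, the function names are hypothetical, and the majority rule is one possible policy among several.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_damaged_copies(copies: list[Path]) -> list[Path]:
    """Return the copies whose digest differs from the most common digest.

    Assumption: the majority of copies is intact, so the most common
    digest is taken as the reference.
    """
    digests = {path: sha256_of(path) for path in copies}
    reference, _ = Counter(digests.values()).most_common(1)[0]
    return [path for path, digest in digests.items() if digest != reference]
```

A damaged copy found this way would be replaced from an intact one. The same check can be run after each migration to a new storage generation.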

slide21

Pillars of Digital Archives I

  • strict separation of physical and logical access layers
    • the physical domain is for System Managers and Archive Managers, and it changes
    • the logical domain (created by linguists) remains and is stable
    • metadata is the glue and has to be maintained

[Diagram labels: domain of physical resources; conceptual domain of resources; system manager; corpus manager; user; creator]

slide22

Pillars of Digital Archives II

Archive Organization

[Diagram: corpus tree with a layer of languages and a layer of sessions. Resource types shown: Song, Book, Video Recording, Intro, Films, Notes, Sound Recording, Lexicon, Annotations]

slide23

Pillars of Digital Archives III

  • separation between object and instance
    • need Unique Resource IDs (URIDs) and a robust “resolving” mechanism
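
The separation between object and instance can be sketched as a small lookup service: the URID names the object, and the resolver knows where its instances currently are. This is an illustration under assumptions, not the MPI implementation; the class, the ID scheme and the URLs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Resolver:
    """Maps a stable resource ID (URID) to the current copies of the object."""
    _locations: dict[str, list[str]] = field(default_factory=dict)

    def register(self, urid: str, location: str) -> None:
        """Record one more physical instance of the object."""
        self._locations.setdefault(urid, []).append(location)

    def move(self, urid: str, old: str, new: str) -> None:
        """Update a location after a migration; the URID itself never changes."""
        instances = self._locations[urid]
        instances[instances.index(old)] = new

    def resolve(self, urid: str) -> list[str]:
        """Return all known instances; raises KeyError for an unknown URID."""
        return list(self._locations[urid])


resolver = Resolver()
# Hypothetical URID and locations, for illustration only.
resolver.register("urid:example:0001", "https://repo-a.example.org/data/rec1.wav")
resolver.register("urid:example:0001", "https://repo-b.example.org/mirror/rec1.wav")
resolver.move(
    "urid:example:0001",
    "https://repo-a.example.org/data/rec1.wav",
    "https://repo-a.example.org/store2/rec1.wav",
)
```

Metadata and portals would refer only to the URID, so a migration inside one repository changes the resolver's table and nothing else.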

[Diagram labels: MPI Repository, GWDG Repository, MPI Portal (metadata), XYZ Portal (metadata), mapping, URID Resolver]

slide24

Pillars of Digital Archives IV

  • need Versioning
    • nothing may be deleted, but annotations will be changed!
    • research world is dynamic – we want enrichment/extension
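
The rule "nothing may be deleted, but annotations will be changed" amounts to an append-only history per object. A minimal sketch follows, assuming an in-memory store; the class and method names are hypothetical and the slides do not describe the actual versioning design.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VersionedResource:
    """Append-only history of one archived object (e.g. an annotation file)."""
    urid: str
    _versions: list[bytes] = field(default_factory=list)

    def add_version(self, content: bytes) -> int:
        """Store new content and return its version number (starting at 1)."""
        self._versions.append(content)
        return len(self._versions)

    def get(self, version: Optional[int] = None) -> bytes:
        """Return a given version, or the latest one if none is specified."""
        if version is None:
            return self._versions[-1]
        if not 1 <= version <= len(self._versions):
            raise KeyError(f"{self.urid} has no version {version}")
        return self._versions[version - 1]


# Hypothetical URID and contents, for illustration only.
annotation = VersionedResource("urid:example:0002")
annotation.add_version(b"first transcription")
annotation.add_version(b"corrected transcription")
```

There is deliberately no delete or overwrite method: a correction becomes a new version and earlier states remain citable.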

[Diagram labels: access lists (userx=read, usery=read, etc.; userx=write, usery=read, etc.), URID Resolver]

slide25

Principles V – Authentication & Authorization

  • authentication and authorization have to be separated
    • URIDs are the central link to authorization information
    • need to have space for policies, procedures, declarations etc
    • but administrative effort has to be minimized!!!
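
The separation of the two steps can be sketched as two independent functions: one establishes who the user is, the other looks up what that user may do with a given URID. This is only an illustration; the token table, the URIDs and the function names are hypothetical, and the user names are borrowed from the slide's diagram.

```python
from typing import Optional

# Authentication: who is this? (Hypothetical token table; a real archive
# would delegate this step to an identity provider.)
TOKENS = {"token-x": "userx", "token-y": "usery"}

# Authorization: what may they do? Rights are attached to URIDs, not to files.
ACCESS = {
    "urid:example:0001": {"userx": {"read"}, "usery": {"read"}},
    "urid:example:0002": {"userx": {"read", "write"}, "usery": {"read"}},
}


def authenticate(token: str) -> Optional[str]:
    """Return the user name for a valid token, else None."""
    return TOKENS.get(token)


def is_authorized(user: Optional[str], urid: str, action: str) -> bool:
    """Check the access list of the resource; unknown users get nothing."""
    if user is None:
        return False
    return action in ACCESS.get(urid, {}).get(user, set())
```

Because the access lists are keyed by URID, they stay valid when files move between storage locations.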

[Diagram labels: access lists (userx=read, usery=read, etc.; userx=write, usery=read, etc.), URID Resolver]

slide26

Principles VI – Formats

  • only open, well-documented and widely used formats (encoding standards) should be used in the archive
  • where possible generic schemas should be the basis
    • in DOBES strong recommendations for a few archival formats
      • JPEG/TIFF/PNG, MPEG2, Linear PCM, UNICODE, XML
      • Plain Text, HTML, (PDF) possible
    • at MPI less restrictive (therefore great danger with some types)
    • for presentation purposes also MPEG1/4, MP3, HTML
    • as import formats large variety (Shoebox, CHAT, WORD, ...)
    • conversion as much as possible towards generic files (LMF, EAF, ?)
    • archived objects have to be stored in a neutral way and accessible as individual objects
    • no encapsulation for primary objects
    • nevertheless: MPI archive takes almost all data (even 16mm films)
      • but conversion can be very costly
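
A format policy like the one above can be expressed as a simple table consulted at ingest time. The sketch below is an assumption-laden illustration: the extension-to-format mapping is invented for the example, and a file extension alone does not prove the encoding (a .wav file need not contain linear PCM), so a real check would inspect the file content.

```python
from pathlib import PurePath

# Assumed mapping from file extension to the archival formats named above.
ARCHIVAL_FORMATS = {
    ".jpg": "JPEG", ".jpeg": "JPEG", ".tif": "TIFF", ".tiff": "TIFF",
    ".png": "PNG", ".mpg": "MPEG2", ".wav": "Linear PCM",
    ".xml": "XML", ".txt": "Plain Text", ".html": "HTML", ".pdf": "PDF",
}


def classify(filename: str) -> str:
    """Return the archival format for a file name, or 'needs conversion'."""
    suffix = PurePath(filename).suffix.lower()
    return ARCHIVAL_FORMATS.get(suffix, "needs conversion")
```

For example, `classify("session1.WAV")` yields "Linear PCM", while an import format such as `classify("lexicon.doc")` yields "needs conversion".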


slide27

MPI Archive – state

  • more than 150,000 objects (in the online archive, ~1/3 of the data)
  • in total more than 15 TB
  • about 4 TB added per year
  • several sub-archives (EL, SL, ESF, CGN, ...)
  • MPI archive ingest is open to other people!!!
  • completely structured by open XML files based on the IMDI schema
  • a complete machinery is available
  • URIDs and versioning are being worked on at this moment

slide28

MPI Archive – Access

[Diagram labels, in source order: Archive Utility Layer (Ontological Knowledge, User Authentication, Access Rights, Metadata Tools); Archive Access (Annotation Exploitation, Lexicon Exploitation, Text Exploitation); Data Ingestion & Management; Archive Enrichment (Lexical Encoding, Web Commentary, Media Annotation); The Archive: Domain of Registered Primary and Secondary Resources; User; Domain of Descriptive Metadata; Primary Resources: Texts, Images, Sound, Movies]

slide29

MPI Archive – Metadata and Simple Access

  • metadata is open!
  • what is minimal metadata? – ongoing discussion
  • IMDI Editor
  • BatchModifier (to change lots of IMDI files)
  • IMDI XML Browser (operates in distributed XML domain)
  • IMDI HTML Browsing (on the fly transformation of XML)
  • structured search in XML and HTML domain
  • unstructured search in XML and HTML domain
  • searchable via Google
  • geographic browsing via Google Earth (work in progress)
  • DC/OLAC bridge via OAI port (all IMDI stuff can be harvested)
  • manuals and training courses
  • direct access to simple objects via plug-ins
  • complete sub-tree download
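
Harvesting through an OAI port follows the OAI-PMH protocol, in which a harvester sends HTTP requests with a `verb` parameter. The sketch below only builds such a request URL. The base URL is hypothetical, since the slides do not give the address of the archive's OAI port; `ListRecords` and the `oai_dc` metadata prefix are defined by OAI-PMH itself.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real address of the archive's OAI port is not
# given in the slides.
BASE_URL = "https://archive.example.org/oai"


def list_records_url(metadata_prefix: str = "oai_dc", set_spec: str = "") -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{BASE_URL}?{urlencode(params)}"
```

A harvester would fetch this URL, parse the returned XML and follow any resumption token until all records have been retrieved.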


slide30

Geographic Browsing

slide31

Geographic Browsing

slide32

Geographic Browsing

slide33

MPI Archive – Upload Access

  • two options
    • manual integration: exceptions are easy, but there are too many teams (~60)
    • LAMUS-controlled integration: exceptions are difficult
  • users do it themselves (?)
  • LAMUS features
    • web-based operation
    • request of a work space
    • specification of an accepted upload node (archive anchor)
    • extend and manipulate the corpus structure
    • upload metadata descriptions
    • upload any type of resources (configurable format control)
    • create a linked sub-archive in the workspace and integrate this into the archive
    • checks to guarantee consistency and format compliance

slide34

MPI Archive – Utilization Access

tool is ANNEX


slide35

MPI Archive – Utilization Access

tool is LEXUS


slide36

MPI Archive – Utilization Access

  • Problem
    • different structures and formats
    • different terminologies

tools are ANNEX/LEXUS

slide37

MPI Archive – State of Access

  • at this moment almost everything from DOBES is closed
  • lots of requests by journalists
  • the first 15 teams have to finish within the coming months
    • working hard
    • changing a lot until the last minute, of course
    • expect some material to become open
    • but much will be handled on request

slide38

End

Mark Abley (Canadian)

Each time we lose a language

the ghosts who made use of it

cast a new bell.

The voices magnify. Soon,

listen, they’ll outpeal

the tongues of earth.

Thanks for your attention.


slide39

Lots of differences

  • Differences at all linguistic layers
    • Phonemic
    • Prosody
    • Phonology
    • Morphology
    • Syntax
    • Semantics
    • Pragmatics
  • Reduced Languages
    • whistling of Gomera fishermen
    • sign language of the Plains Indians
    • “computer” languages
    • ...

slide40

Sound Systems

[Figures: vowel distribution (28 languages) plotted over F1 and F2; spectra and formants (F1 to F5); formants over time (F1 to F5)]

  • Rotokas (Papua New Guinea)
    • vowels a/e/i/o/u
    • 6 consonants p/t/k/v/r/g
  • !Xoo (southern Africa)
    • 141 sounds incl. click sounds

slide41

Tone Systems

  • modulation of segmental information by prosody
  • stretches across phrases and sentences
  • Tones: meaning of words
  • Swedish: 2 tones (anden – ándén)
  • German: aufbäumen – aufBäumen
  • Mandarin Chinese: 4 tones
  • Cantonese: 9 tones
  • Vietnamese: 8 tones
  • some SE Asian languages: up to 15 tones

[Figure labels: Intonation; "dr ai st"; Mandarin Chin. 4: i "stuff", i "to suspect", i "chair/armchair", i "meaning"]

slide42

Morphosyntax

  • rules for the generation of words and grammatical structures
  • strictly isolating languages: one morpheme – one word
  • Chinese is an isolating language
  • another extreme are the polysynthetic languages
  • example from Yup’ik (Inuit):
    • tuntussurqatarniksaitengqiggtuq
    • tuntu ssur qatar ni ksaite ngqiggt uq
    • reindeer hunt FUT say NEG again 3SG:IND
    • “he had not yet said that he wanted to hunt reindeer”
  • basic principle: the stem is inflected by many affixes
  • unusual for us: isolated core morphemes cannot be interpreted
  • “ssur” uttered in isolation does not make sense

[Figure label: verb stem]
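
The Yup’ik example can be written down as two parallel lists, which is essentially what an interlinear annotation tier stores. The sketch below checks that the segmentation reproduces the word and prints the two tiers aligned. It is an illustration only: the glosses are given in English, and the function name and layout are assumptions, not the format of any DOBES tool.

```python
word = "tuntussurqatarniksaitengqiggtuq"
morphemes = ["tuntu", "ssur", "qatar", "ni", "ksaite", "ngqiggt", "uq"]
glosses = ["reindeer", "hunt", "FUT", "say", "NEG", "again", "3SG:IND"]

# The segmentation must reproduce the word and pair up one-to-one with glosses.
assert "".join(morphemes) == word
assert len(morphemes) == len(glosses)


def interlinear(morphemes: list[str], glosses: list[str]) -> str:
    """Return two aligned lines: morphemes above, glosses below."""
    widths = [max(len(m), len(g)) for m, g in zip(morphemes, glosses)]
    top = " ".join(m.ljust(w) for m, w in zip(morphemes, widths))
    bottom = " ".join(g.ljust(w) for g, w in zip(glosses, widths))
    return f"{top.rstrip()}\n{bottom.rstrip()}"


if __name__ == "__main__":
    print(interlinear(morphemes, glosses))
```

Keeping morphemes and glosses as separate, index-aligned tiers makes it possible to search either tier and still recover the other.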

slide43

Dialog style

  • norms for expressing things/activities differ
  • example from Kilivila (Trobriand Islands – New Guinea)
  • Person: Ambeya
  • Where are you going?
  • Gunter: (wants to say: I will wash myself)
  • Bala bakakaya
  • I will go, I will take a bath
  • Host:
  • Bila bikakaya bike’ita bisisu bipaisewa
  • 3.Fut-go 3.Fut-bathe 3.Fut-come.back 3.Fut-be 3.Fut-work
  • He will go – he will take a bath – he will come back – he will stay – he will work.
  • He will take a bath, come back again and work with us

slide44

Pronouns

  • in Kilivila
    • the inclusive and exclusive dual
    • we two – myself and the other, excluding you
  • in Paamese (Vanuatu archipelago)
    • in addition the paucal
    • “a few”

slide45

Spatial orientation

[Diagram: a scene described in an absolute system (north, east, south, west, above, below) and in an egocentric system (e.g. right, behind, above, below)]

  • Herberger would use the egocentric system to describe the scene
  • Aboriginal Australians would choose the absolute system, which is hardly possible for us: “the ball lies east of the player”

slide46

Awareness

  • since 1866 efforts to preserve diversity in nature
  • 1991 problem in focus of the American Linguistic Society
  • 1992 discussion at the International Conference of Linguistics
  • 1992 working group (AG) on endangered languages in the German linguistic society
  • 1993 UNESCO project to create the red list
  • 2000 DOBES programme of the VolkswagenFoundation
  • within 2 decades broad awareness amongst linguists
  • David Crystal, on first-semester students:
    • 75% don’t know anything about the problem
    • most don’t see a problem
  • how come there is attention for tigers etc. but not for languages?

slide47

Factors are known

  • external factors
    • military suppression
    • religious conversion
    • economic dominance
    • cultural dominance
    • educational suppression
  • internal factors
    • negative attitude towards own language
    • avoidance of discrimination
    • hope to earn (more) money
    • improvement of mobility
    • youngsters are trend followers
    • ...


slide48

MPI Archive – Content Overview