Corpus Management 101: Creating archive-ready language documentation



  1. Corpus Management 101: Creating archive-ready language documentation. Heidi Johnson, The Archive of the Indigenous Languages of Latin America (AILLA), The University of Texas at Austin

  2. Definitions • Archive: a secure repository created and maintained by an institution with a demonstrated commitment to permanence and to the preservation of archived resources over the long term. • Language documentation corpus: the collection of documentation materials created by researchers and native speakers. OLAC Tutorial, Jan. 6, 2006, Albuquerque

  3. What you should archive - I • Recordings, both audio & video: • public events: ceremonies, oratory, dances, ... • narratives: historical, traditional, myths, personal, children's stories, ... • instructions: how to build a house, how to weave a mat, how to catch a fish, ... • literature: oral or written - any creative work • conversations: anything that's not too personal

  4. What you should archive - II • Secondary (derived) materials: • transcriptions, translations, & annotations • field notes, elicitation lists, orthographies • datasets, databases, spreadsheets • sketches, e.g. grammar, ethnography • Photographs • Otherwise unpublished or out-of-print articles

  5. What you should archive - III • Teaching and learning materials: • primers – children’s readers • calendars, posters, etc. • illustrated dictionaries, encyclopedias • curriculum designs • Anything that other people might find inspiring and useful in their own programs.

  6. What you should NOT archive • Anything that could cause injury, arrest, or embarrassment to the speakers, e.g.: • Pamela Munro's interviews with Zapotecs in L.A. about entering the U.S. illegally. • Gossip that hasn’t aged enough (ancient gossip becomes history & narrative). • Sacred works with highly restricted uses.

  7. When you should archive • As soon as possible: • to prevent accidental damage or loss; • to get back handy presentation formats; • to build your CV even before you are ready to publish results. • Restrict access to works in progress. • Add transcriptions, annotations, etc. later.

  8. Why you should archive - I • to preserve resources for future generations. • to facilitate the re-use of materials for: • language maintenance & revitalization programs; • typological, historical, comparative studies; • any kind of linguistic, anthropological, psychological, etc., study that you yourself won't do.

  9. Why you should archive - II • to foster development of both oral and written literatures for endangered languages. • to make known what documentation there is for which languages. • to build your CV and get credit for all your hard work.

  10. Archiving is a form of publishing • Metadata is always public. • List Archived Resources on your CV. • Cite data from archived resources: Sánchez Morales, Germán. (1994). "Satornino y los soldados." [audio] Heidi Johnson, (Researcher.) [online] ZOH001R010. Access=public. http://www.ailla.utexas.org: Archive of the Indigenous Languages of Latin America.

  11. How to build an archive-ready corpus I • Rule #1: Label everything you produce with RUTHLESS CONSISTENCY. If we don’t know what it is, we can’t archive it. • Rule #2: Get in touch with your friendly local archive and ask them to help you. • Rule #3: Test your system before you leave: equipment, catalog method, and labels.

  12. How to build an archive-ready corpus II • Define a policy concerning IPR and develop a consistent practice for obtaining consent, e.g., forms and/or recorded statements. • Always get permission for everything: • recording • archiving • excerpting, publishing, etc. • Learn how to talk to your consultants about IPR. • Visit the School of Best Practice at: http://emeld.org/school/classroom/ethics/index.html

  13. Labelling I: recordings • Audio - record a “header” with basic information, in a contact language: • Your name, speakers’ names • Date & place • Name of the language • Brief statement of genre and/or title of work. • ailla_audio_header1.mp3 • Video - go Hollywood: use a clapboard with basic info written on it.

  14. Labelling II: media and files • Decide on the fundamental organizing theme for your labelling system - • media, e.g. CDs, notebooks • consultants’ names or initials • languages/dialects • linguists’ names or initials • genres, e.g. wordlists, narratives, … • - AND STICK WITH IT!

  15. Labelling III: related items Language documentation materials typically come in related sets, or bundles: • recording of a narrative + interlinear text + revised translation + commentary • interview + photographs • recorded elicitation session + field notes • a box of file slips

  16. Labelling IV: types of relations • derivation: a transcription is derived from a recording • series: a long recording that spans several media (CDs hold only 700 MB/80 min) • part-whole: video & audio recordings made simultaneously of the same event • association: (fuzzy) photographs of the narrator of a recording, commentaries
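The four relation types above could be recorded alongside the labels so that related items stay findable. The following is only a minimal sketch: the names `Relation`, `Link`, and `related_to` are hypothetical and are not part of any AILLA, OLAC, or IMDI tool, and the file labels are illustrative.

```python
from enum import Enum
from typing import NamedTuple


class Relation(Enum):
    """The four relation types named on the slide."""
    DERIVATION = "derivation"
    SERIES = "series"
    PART_WHOLE = "part-whole"
    ASSOCIATION = "association"


class Link(NamedTuple):
    """One relation between two labelled items (hypothetical structure)."""
    source: str
    relation: Relation
    target: str


def related_to(label: str, links: list[Link]) -> list[tuple[Relation, str]]:
    """Return every (relation, other item) pair that involves `label`."""
    found = []
    for link in links:
        if link.source == label:
            found.append((link.relation, link.target))
        elif link.target == label:
            found.append((link.relation, link.source))
    return found


# Illustrative labels only: a transcription derived from a recording,
# and a photograph associated with the same recording.
links = [
    Link("cd1t1.db", Relation.DERIVATION, "cd1t1.wav"),
    Link("photo1.jpg", Relation.ASSOCIATION, "cd1t1.wav"),
]
print(related_to("cd1t1.wav", links))
```

A flat list of links like this is easy to keep in a spreadsheet as three columns, which may suit field conditions better than a database.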

  17. Labelling Example 1: AILLA resource ID • ZOH001R040I001.mp3 • ZOH = language code • 001 = deposit number • R040 = 40th resource in that deposit • I001 = 1st item in that resource • .mp3 = what kind of file • Supports our administrative needs: many languages, many resources in each deposit.
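A consistent label scheme like this one can be checked mechanically. The sketch below splits an ID of the shape shown above into its parts. It assumes, from the single example on the slide, that the language code is three capital letters and that each number is exactly three digits; the names `ResourceId` and `parse_resource_id` are hypothetical, not an AILLA API.

```python
import re
from typing import NamedTuple


class ResourceId(NamedTuple):
    language: str    # e.g. "ZOH"
    deposit: int     # deposit number
    resource: int    # resource within the deposit
    item: int        # item within the resource
    extension: str   # file type, lower-cased


# Assumed layout: LLL + DDD + "R" + DDD + "I" + DDD + "." + extension.
_ID_PATTERN = re.compile(
    r"^(?P<language>[A-Z]{3})"
    r"(?P<deposit>\d{3})"
    r"R(?P<resource>\d{3})"
    r"I(?P<item>\d{3})"
    r"\.(?P<extension>[A-Za-z0-9]+)$"
)


def parse_resource_id(file_name: str) -> ResourceId:
    """Split a file name into its labelled parts, or raise ValueError."""
    match = _ID_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"not a recognised resource ID: {file_name!r}")
    return ResourceId(
        language=match["language"],
        deposit=int(match["deposit"]),
        resource=int(match["resource"]),
        item=int(match["item"]),
        extension=match["extension"].lower(),
    )


print(parse_resource_id("ZOH001R040I001.mp3"))
```

Running a check like this over a whole folder before leaving the field is one way to catch labels that drifted from the scheme.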

  18. Labelling Example 2: media object is primary • Facilitates keeping track of things in the field. File extensions identify type of item. • cd1t1.wav - CD 1, track 1 • cd1t1.db - the Shoebox interlinear database • cd1t1.doc - a Word doc w/notes about cd1t1 • ds19.xls - spreadsheet dataset (verb roots) • ds5.db - Shoebox dataset (deictics) • nb1 - field notebook
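When the media object is the primary label, files that share a label form a bundle, and the extension tells the items apart. A minimal sketch of that grouping follows; `group_by_label` is a hypothetical helper name, and the placeholder text for items without an extension is an arbitrary choice.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_label(file_names: list[str]) -> dict[str, list[str]]:
    """Group file names by the part before the extension.

    Returns a mapping from label (e.g. "cd1t1") to the sorted list of
    extensions found for that label.
    """
    bundles = defaultdict(list)
    for name in file_names:
        path = PurePath(name)
        kind = path.suffix.lstrip(".") or "(no extension)"
        bundles[path.stem].append(kind)
    return {label: sorted(kinds) for label, kinds in bundles.items()}


names = ["cd1t1.wav", "cd1t1.db", "cd1t1.doc", "ds19.xls", "ds5.db", "nb1"]
for label, kinds in group_by_label(names).items():
    print(label, kinds)
```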

  19. Labelling Example 3: Humongous file names • “122, Vida en Oluta.wav” • “123, Vida en Oluta, continuación de IV.wav” Note the numbers that uniquely identify these files, and the file extension that tells you what they are.

  20. Labelling Example 4: A tidy CD label

  21. Labelling Example 4: A tiny, yet nearly complete, label

  22. Corpus catalog/Metadata I • Catalog information for digital resources is called metadata. • Metadata supports: • keeping related items together • protection of sensitive materials • searching for the thing you want • use of resources by many people • proper citation of archived resources

  23. Metadata II: Minimum info • Label of corresponding object (CD, notebook). • Creators' full names: you and the speakers. • Language: be specific and/or use the ISO code. • Date of creation: YYYY-MM-DD. • Place of creation: be specific. • Access restrictions, and any special instructions concerning future uses. • Genre keyword, e.g. narrative.
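The minimum fields listed above map naturally onto a small record type with a completeness check. This is a sketch under stated assumptions: the class name `MinimalRecord`, its field names, and the `problems` method are hypothetical and do not follow the OLAC or IMDI schemas; only the YYYY-MM-DD date format comes from the slide.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class MinimalRecord:
    """Hypothetical container for the minimum metadata listed above."""
    label: str            # label of the corresponding object
    creators: list[str]   # full names: researcher and speakers
    language: str         # specific name and/or ISO code
    created: str          # date of creation, YYYY-MM-DD
    place: str            # place of creation
    access: str           # access restrictions / special instructions
    genre: str            # genre keyword, e.g. "narrative"

    def problems(self) -> list[str]:
        """Return a list of human-readable problems; empty if none found."""
        found = []
        for name in ("label", "language", "place", "access", "genre"):
            if not getattr(self, name).strip():
                found.append(f"{name} is empty")
        if not self.creators:
            found.append("creators is empty")
        # Require the exact YYYY-MM-DD shape, then check it is a real date.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", self.created):
            found.append("created is not a valid YYYY-MM-DD date")
        else:
            try:
                date.fromisoformat(self.created)
            except ValueError:
                found.append("created is not a valid YYYY-MM-DD date")
        return found


# Placeholder values for illustration only.
record = MinimalRecord(
    label="cd1",
    creators=["Researcher Name", "Speaker Name"],
    language="Language name",
    created="2002-08-06",
    place="Town, Region",
    access="public",
    genre="narrative",
)
print(record.problems())
```

Checking each record as it is entered, rather than at deposit time, keeps gaps from accumulating while the details are still easy to recover.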

  24. Metadata III: Additional info • Project info: name, director, sponsor, etc. • Participants’ roles (e.g. narrator), demographic data, contact info • Resource info: provenance, formats, etc. • Content info: descriptions of creation context, content, etc. – the more detail, the better. • References: relevant publications

  25. Metadata IV Two recommended (interoperable) schemas. Choose either as your base and modify to suit your needs. • OLAC – Open Language Archives Community – http://www.language-archives.org • IMDI – International Standards for Language Engineering Metadata Initiative – http://www.mpi.nl/IMDI

  26. Metadata Example 1: Courtesy of Jonathan Amith
      CD 1
      \file 2002_08_06_CF1_Am
      \id NAH001R001I001
      \author Flores Medina, Cristino
      \orig Ameyaltepec
      \age 68
      \sex male
      \date 6 Aug. 2002
      \record_by Amith, Jonathan D.
      \genre cuentos
      \subgenre secular
      \transcribed Inocencio Díaz (primera versión)
      \theme Name: "Pedro de la tierra y Pedro del Cielo"
      \device Haskins Laboratory recording studio; recorded onto hard disk; recorded onto hard disk in aiff, converted to .wav
      \mike Earthlinks
      \format 44,100/16 bits/mono wav
      \duration 25:49 (25:48.792 from original file of 26:12.617)
      \folder Digital #1
      \diskname C. Flores stories 2002/08
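A record in this backslash-marker style is simple to read into a dictionary, which makes it easy to check or convert later. The sketch below is not the Shoebox software's own parser. It assumes one field per line starting with `\marker`, treats unmarked lines as continuations of the previous field, ignores lines before the first marker, and lets a repeated marker overwrite the earlier value. `parse_record` is a hypothetical name.

```python
def parse_record(text: str) -> dict[str, str]:
    """Parse backslash-marker lines into a {marker: value} dictionary."""
    fields: dict[str, str] = {}
    marker = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # "\id NAH001R001I001" -> marker "id", value "NAH001R001I001"
            marker, _, value = line[1:].partition(" ")
            fields[marker] = value.strip()
        elif marker is not None:
            # Continuation line: append to the field that is still open.
            fields[marker] = f"{fields[marker]} {line}".strip()
    return fields


# A few fields from the example above, written as a raw string so the
# backslashes are kept literally.
sample = r"""
\file 2002_08_06_CF1_Am
\id NAH001R001I001
\genre cuentos
\duration 25:49 (25:48.792 from original file of 26:12.617)
"""
record = parse_record(sample)
print(record["id"], record["genre"])
```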

  27. Metadata Example 2: Courtesy of Joel Sherzer

  28. Corpus management tools • IMDI Browser & IMDI Data entry (http://www.mpi.nl/IMDI) • AILLA’s Shoebox 2.0 & 5.0 templates (http://www.ailla.utexas.org/site/download_md_forms_sp.html) • Any database or spreadsheet or Word template that you create. • A looseleaf binder with a standard (xeroxable) form.

  29. Useful websites • AILLA: http://www.ailla.utexas.org/ • DELAMAN: http://www.delaman.org/ • IMDI: http://www.mpi.nl/ISLE • OLAC: http://www.language-archives.org • EMELD: http://emeld.org • Write to me: ailla@ailla.utexas.org
