
ELAR and Digital Archiving for Documentation of Endangered Languages

ELAR and Digital Archiving for Documentation of Endangered Languages. David Nathan, Endangered Languages Archive, SOAS, University of London. LingDy, Feb 15, 2013.


Presentation Transcript


  1. ELAR and Digital Archiving for Documentation of Endangered Languages David Nathan Endangered Languages Archive SOAS University of London LingDy Feb 15, 2013

  2. What is a digital language archive? • a trusted repository created and maintained by an institution with a commitment to the long-term preservation of archived material • has policies and processes for acquiring, cataloguing, preserving, disseminating, and migrating (updating formats) • a platform for building and supporting relationships between data providers and data users

  3. General archiving functions • advise • acquire • preserve • add value • provide access • develop trust

  4. Why is language archiving different? • what is a language? • unlike business data, it is not conventionalised (like $, age, year of publication etc) – what and how to code? • varying and competing expectations

  5. And endangered languages archiving? • extremely diverse context – languages, cultures, communities, individuals, projects • typical source - fieldworkers • typical materials - documentation • difficult for archive staff to manage • sensitivities and restrictions

  6. What can a language archive offer? • Security - keep your electronic materials safe • Preservation - store your materials for the long term • Discovery - help others to find out about your materials, and you to find out about users • Protocols - respect and implement sensitivities, restrictions • Sharing - share results of your work, if appropriate • Acknowledgement - create citable acknowledgement • Mobilisation - create usable language materials • Quality and standards - advice for assuring your materials are of the highest quality and robust standards

  7. There are different kinds of language archives • from local to global - different coverage, contexts, methods, collection policies • consider placing your materials in more than one … • there are also sites for aggregating different archives’ holdings, eg Virtual Language Observatory, OLAC

  8. Why digital? • preservation: digitisation is the only way that audio and video (non-symbolic material) can be preserved for the future … because it can be copied and transmitted with zero loss • also good for cataloguing, sharing, dissemination, repurposing
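The "zero loss" claim on the slide above can be checked in practice by comparing checksums before and after a copy. The following is a minimal sketch, not ELAR's procedure: the function names `sha256_of`, `copy_and_verify` and `demo`, and the file name `recording.wav`, are hypothetical and chosen only for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, target: Path) -> bool:
    """Copy source to target and report whether the copy is bit-identical."""
    shutil.copyfile(source, target)
    return sha256_of(source) == sha256_of(target)


def demo() -> bool:
    """Create a stand-in 'recording' (hypothetical data), copy it, verify it."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "recording.wav"
        source.write_bytes(bytes(range(256)) * 100)
        return copy_and_verify(source, Path(tmp) / "recording_copy.wav")
```

If the two digests differ, the copy was damaged in transit and should be repeated. Storing the digest alongside the file lets the same check be rerun years later.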

  9. Digital disadvantages • digital data is fragile and ephemeral • cost (human, equipment, maintenance) • requires strategy and luck to get right • preservation depends on file and data formats • depend on tools and software • some formats require particular software (can we archive the software?) • formats: prefer standard, stable, open, explicit, long-lasting • some materials may have to be ‘migrated’

  10. What do depositors have to do? • select and contact an archive • prepare materials • select • structure • suitable encodings and formats • complete metadata, metadocumentation, agreements • send materials to archive(s) • work with archive during curation etc • ongoing management, updating, dissemination

  11. OAIS model • OAIS archives define three types of ‘packages’: ingestion, archive, dissemination • flow (diagram): Producers → Ingestion → Archive → Dissemination → Designated communities

  12. ELAR - architecture • reduced boundaries between depositors, users and archive • users add, update content; negotiate access • diagram: Producers, Archive and Users, linked by the actions request, contribute, edit, give access

  13. Redefining the digital EL archive • a platform for developing and conducting relationships between knowledge producers and knowledge users– a social networking archive • level the playing field between researchers and community members/other stakeholders • encourage, recognise and cater for diversity

  14. Data management and archiving • use good data management practices whether or not you plan to archive materials • document decisions, steps, conventions, structures, encodings • appropriate and conventional data encoding methods (e.g. Unicode) • be explicit and consistent • plan for flowing data, working with others, across different systems (cf Bird and Simons, ‘Seven Dimensions of Portability’) • good data management practices will make a future archiving process easier and better
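One concrete case of "be explicit and consistent" with Unicode: the same visible character can be stored as different code point sequences, which breaks searching and sorting unless one normalisation form is chosen and documented. The sketch below uses Python's standard `unicodedata` module. The choice of NFC and the helper name `normalise` are assumptions for illustration, not a recommendation made in the slides.

```python
import unicodedata


def normalise(text: str, form: str = "NFC") -> str:
    """Return text in one agreed Unicode normalisation form (NFC by default)."""
    return unicodedata.normalize(form, text)


composed = "\u00e9"      # "é" stored as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The two strings look identical on screen but compare as different...
looks_same_but_differs = composed != decomposed

# ...until both are brought into the same normalisation form.
equal_after_normalising = normalise(composed) == normalise(decomposed)

# Being explicit about the byte encoding as well, here UTF-8.
encoded = normalise(decomposed).encode("utf-8")
```

Whichever form is chosen, recording that decision in the metadocumentation is what makes it useful to later users.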

  15. Users and potential users • depositors – deposit, access or update materials • speakers and their descendants • other researchers - comparative/historical linguists, typologists, theoreticians, anthropologists, historians, musicologists etc etc • other “stakeholders”, eg educationalists, funders • journalists and the wider public

  16. ELAR facts and figures • archived collections: ~200 • online (published) collections: 150 • average collection size about 80 GB • online data bundles: ~25,000 • online bundles access: unrestricted 10,000, restricted 15,000 • total number of files held: around 200,000 • total volume of files held: around 10 TB • registered users: ~800 • annual number of website "hits": 230,000

  17. ELAR facts and figures – users • increasing number of community members, including Aleut (Canada), Tai-Ahom, Wadar (India), Burushaski (Pakistan), Serrano, Cahuilla, Arapaho (USA), Iraqi Jewish (Iraq), Saami (Finland), Wabena (Tanzania), Torwali (Pakistan), Hani, Bai (China), Irish • comments: “I found your site while looking up my grandmother, and i found her on your site speaking our language. and i would love for my children her great grandchildren to hear our language coming from her". • many interdisciplinary researchers, particularly archivists and anthropologists

  18. Our task • … to preserve and disseminate documentation of endangered languages

  19. Why is this important? • over 50% of the world’s 7000 languages: • are endangered • likely to cease to be spoken this century • little or nothing known about the majority of them • language documentations and the archives that support, preserve, and disseminate them, will become the means of transmission of many languages

  20. A perfect storm? • documentation performed by and for linguists and “others” • documentation methods expose sensitivities & vulnerabilities • “open data” – push for unmoderated access • “big data” – resources channeled to analysis, broader audiences

  21. Protocol • the sensitivities and access restrictions associated with EL resources • need to be discussed, collected and recorded in the field

  22. Protocol and access control • principles: • granularity – file, bundle or collection • access is a relation between object and user • protocol values can be changed over time • ELAR’s URCS system • User • Researcher • Community member • Subscriber

  23. ELAR’s protocol values • U – resource available to all registered users • R – resource available to users registered as researchers • C – resource available to users endorsed as members of relevant language community • S – resource available to users who have been given individual access rights for that resource
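The idea that access is a relation between an object and a user, with U, R, C and S as protocol values, can be sketched as a small check function. This is an illustrative reading of the two slides above, not ELAR's actual implementation: the class names, fields and the `can_access` function are all hypothetical, and the sketch assumes every user passed in is already registered.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class User:
    name: str
    is_researcher: bool = False                           # registered as a researcher
    communities: Set[str] = field(default_factory=set)    # endorsed community memberships


@dataclass
class Resource:
    identifier: str
    language: str
    protocol: str                                         # one of "U", "R", "C", "S"
    subscribers: Set[str] = field(default_factory=set)    # users granted individual access


def can_access(user: User, resource: Resource) -> bool:
    """Decide access from the pairing of one user with one resource."""
    if resource.protocol == "U":
        return True                                       # any registered user
    if resource.protocol == "R":
        return user.is_researcher
    if resource.protocol == "C":
        return resource.language in user.communities
    if resource.protocol == "S":
        return user.name in resource.subscribers
    raise ValueError(f"unknown protocol value: {resource.protocol!r}")
```

Because the protocol value lives on the resource, it can be changed over time without touching user records, and an approved subscription application amounts to adding one name to `subscribers`.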

  24. Subscription application: formal User xx has just applied for access to restricted material in the deposit solega-107128. The following message was attached to the application: "Hi [depositor], Please delegate me for access to the material on Solegas."

  25. Subscription response: formal This email is to inform you that user xx's application for access to restricted material in the deposit musgrave2007tulehu has just been approved. The depositor included the following note to the user: "The researcher is known to me personally and I know that his interest is legitimate."

  26. Subscription application: “curious” User xx has just applied for access to restricted material in the deposit budd2008beirebo. The following message was attached to the application: "I'm xx. I like to learn Bislama language, but never heard what it sounds like. Am very curious "

  27. Subscription application: establish credentials and reason User xx has just applied for access to restricted material in the deposit verstraete2010paman. The following message was attached to the application: "I am currently doing my masters in Linguistics and I'm researching on an endangered language in Malaysia. I would like to see a sample of the data from the fieldwork since I'm not use to this yet. I hope that I can gain more understanding in carrying out the fieldwork."

  28. Subscription response: rejected, with reason This email is to inform you that user xx's application for access to restricted material in the deposit verstraete2010paman has just been rejected. The depositor included the following note: "Dear xx, I am sorry we cannot give you access to this deposit. The Lamalama community has asked us to restrict access to community members. With best wishes, [depositor]"

  29. Subscription response: offering further help This email is to inform you that user xx’s application for access to restricted material in the deposit caballero2009raramuri has just been approved. The depositor included the following note to the user: "Please let me know if you're looking for any specific materials or if you have any questions."

  30. Response: further info and offer to meet This email is to inform you that user xx's application for access to restricted material in the deposit kunbarlang-389 has just been approved. The depositor included the following note to the user: "Hi xx I've approved your access to this collection, but you should know that there is an update in the material I've just deposited, with much more information on both music and texts. I'd be happy to give you access to that when it is processed. Next time I come to London (October or November this year) I'd be happy to meet up if you would like to discuss."

  31. What can you archive (at ELAR)? • media - audio, video • graphics - images, scans • texts - fieldnotes, grammars, description, analysis • structured data - aligned and annotated transcriptions, databases, lexica • metadata, metadocumentation - contextual information about the materials, both structured and unstructured

  32. Archive objects • an “object” could be a file, a set of files, a directory, or a set of files with their relationships explicitly defined • like other archives, ELAR uses a set principle, which we call “bundles” (like DoBeS’ sessions) See bundles at ELAR

  33. Archive objects • hierarchy (diagram): ELAR → Collections → Bundles → Files
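The collection, bundle and file hierarchy can be written down as a small data structure. This is a sketch only: the class and field names are hypothetical, and a real archive catalogue would carry far more metadata at each level.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ArchiveFile:
    name: str


@dataclass
class Bundle:
    """A set of related files, e.g. one recording and its transcription."""
    title: str
    files: List[ArchiveFile] = field(default_factory=list)


@dataclass
class Collection:
    """All bundles belonging to one deposit."""
    deposit_id: str
    bundles: List[Bundle] = field(default_factory=list)

    def iter_files(self) -> Iterator[ArchiveFile]:
        """Walk every file in every bundle of this collection."""
        for bundle in self.bundles:
            yield from bundle.files


# Hypothetical example content for illustration.
example = Collection(
    deposit_id="demo-deposit",
    bundles=[
        Bundle("session 1", [ArchiveFile("TRS00065.wav"), ArchiveFile("bjt_02.txt")]),
        Bundle("session 2", [ArchiveFile("TRS00066.wav"), ArchiveFile("krs_43.txt")]),
    ],
)
```

Keeping the bundle as an explicit level is what allows access protocols and metadata to be attached to a recording and its transcription together, rather than file by file.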

  34. What is required to make a deposit? • resource(s) for an endangered language • it could be just one file • catalogue / metadata • deposit form view • existing deposits can also be updated, added to, and metadata added/modified

  35. Archive material should be selected • example: Depositor’s question: How much video can I archive? • answer: ...

  36. How can I deliver data? • hard disks • we return them • we also send them out • flash cards and USB sticks • email • good for samples for evaluation • OK for most text materials • Dropbox etc • a web upload facility may be provided one day • we can download from your server

  37. What about CDs and DVDs? • we have found CDs, and especially DVDs, to be very unreliable • DVD fail rate > 10% • cause confusion as files are allocated to fit on disks, not according to corpus structure • create a lot of work for depositors and for ELAR

  38. Express yourself - Metadata • metadata is • data about data containers • data about data • its functions • for identification, management, retrieval of data • provides the context and understanding of that data • carries those understandings into the future, and to others

  39. Express yourself - Metadata • metadata reflects the knowledge and practices of data providers • … and therefore defines and constrains audiences and usages for the data • all value-adding to recordings of events (annotations, transcriptions, translations, glosses, comments, interpretations, part of speech tagging etc) can be considered metadata • data and metadata lie on a spectrum and depend on how they are used rather than being absolutely different things

  40. Express yourself - Metadata • distinguish between • metadata scheme (eg set of categories) and • the way that scheme is expressed

  41. filename: sessions.xls (relational) • filename: sessions.xml (tagged):

      <sessions>
        <session id="1">
          <audio>TRS00065.wav</audio>
          <transcription>bjt_02.txt</transcription>
        </session>
        <session id="2">
          <audio>TRS00066.wav</audio>
          <transcription>krs_43.txt</transcription>
        </session>
      </sessions>
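To illustrate that the tagged and relational forms express the same scheme, the sketch below reads the `sessions.xml` content from the slide and re-expresses it as rows, then as CSV text. The column names `id`, `audio` and `transcription` are an assumption, since the spreadsheet shown on the original slide is not preserved in this transcript. The function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

SESSIONS_XML = """<sessions>
  <session id="1">
    <audio>TRS00065.wav</audio>
    <transcription>bjt_02.txt</transcription>
  </session>
  <session id="2">
    <audio>TRS00066.wav</audio>
    <transcription>krs_43.txt</transcription>
  </session>
</sessions>"""

FIELDS = ["id", "audio", "transcription"]   # assumed spreadsheet columns


def sessions_to_rows(xml_text: str) -> list:
    """Turn each <session> element into one row (a dict keyed by FIELDS)."""
    root = ET.fromstring(xml_text)
    rows = []
    for session in root.findall("session"):
        rows.append({
            "id": session.get("id"),
            "audio": session.findtext("audio", default="").strip(),
            "transcription": session.findtext("transcription", default="").strip(),
        })
    return rows


def rows_to_csv(rows: list) -> str:
    """Express the same rows as CSV text, one line per session."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The categories stay the same in both directions. Only the way they are expressed changes, which is the distinction the preceding slide draws between a metadata scheme and its expression.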

  42. Express yourself - Metadata • example • you could choose categories from OLAC, IMDI etc schemes or formulate your own • this would be a scheme of logical categories (speaker, location, date etc) • you could express these in different language(s) • you could structure the categories and values in different ways, eg as spreadsheet, database, XML

  43. Express yourself - Metadata • you need to choose • a set of metadata categories applying across whole collection + • metadata categories that apply to particular types of objects (eg transcriptions, video), or to individual objects + • ways of expressing and encoding all that metadata

  44. Example • Ju|’hoan (Biesele)

  45. Potential sources of metadata • deposit form • spreadsheets • MS Word tables, CSV etc • IMDI and OLAC XML files • custom XML • notes, correspondence and reports • filenames • direct input to ELAR interface • audio files • images (/captions) • meta-metadata files
