
Archiving




  1. Archiving David Nathan Endangered Languages Archive Hans Rausing Endangered Languages Project SOAS, University of London

  2. Topics • Introducing ELAR and digital language archives • Preservation • Archive interactions with documentation • What and how to archive • Protocol • Metadata • Evaluation of audio • Archives and revitalisation • Archivism : mobilisation • Video • Conclusions

  3. Introducing ELAR and digital language archives

  4. Endangered Languages ARchive (ELAR) • one of 3 semi-autonomous programs of the Hans Rausing Endangered Languages Project • staff of 3: archivist, software developer, technician (plus research assistants etc) • develop preservation infrastructure, cataloguing and dissemination; policies; facilities; training and advice; materials development and publishing

  5. What is a digital language archive? • a trusted repository created and maintained by an institution with a commitment to the long-term preservation of archived material • will have policies and processes for materials acquisition, cataloguing, preservation, dissemination, migration to new digital formats • a collection of managed materials

  6. What is archiving of language materials? • preparing materials in a structured form suitable for long-term preservation • creating long-term relationships • it is not backup • it is not dissemination/publication • it should not impinge on good linguistic practice

  7. What can a language archive offer? • Security - keep your electronic materials safe • Preservation - store your materials for the long term • Discovery - help others to find out about your materials • Protocols - respect and implement sensitivities, restrictions • Sharing - share results of your work, if appropriate • Acknowledgement - create citable acknowledgement • Mobilisation - create usable language materials for communities • Quality and standards - advice for assuring your materials are of the highest quality and robust standards

  8. Kinds of language archives • many cross-cutting classifications: • Indigenous vs outsider, eg. Squamish Nation • regional vs international, eg. AILLA, Paradisec; DoBeS, ELAR • associated with research institute, eg. AIATSIS, ANLC • granter-funded, eg. DoBeS, ELAR, OTA • digital vs physical vs mixed, eg. DoBeS vs Vienna Sound Archive, ANLC

  9. Potential users • speakers and their descendants - up to 95% of users of UCB are community members • depositors - to create or renew materials • other researchers - comparative/historical linguists, typologists, theoreticians, anthropologists, historians, musicologists etc • other “stakeholders”, eg educationalists • journalists and the wider public

  10. Archives networks and bodies • Digital Endangered Languages and Archives Network (DELAMAN) • ELAR, DOBES, ANLC, Paradisec, EMELD, LACITO, AIATSIS, AMPM (Maori) • Open Language Archives Community (OLAC) • others, eg. D-LIB • http://www.dlib.org/ • Open Archives Initiative

  11. Digital archive architectures • OAIS archives define three types of ‘packages’: ingestion, archive, dissemination • (diagram: Producers → Ingestion → Archive → Dissemination → Designated communities)

  12. ‘Live Archives’ - architecture • Boundary between depositors, users and archive: • users add, update content; customise outputs • (diagram: Producers → Ingestion → Archive → Dissemination → Designated communities)

  13. The way we were ... • eg 1993: ASEDA Aboriginal Studies Electronic Data Archive at AIATSIS Canberra (modelled on Oxford Text Archive) • opportunistically collect and catalogue electronic materials that were at risk or not accessible • lexica • grammars • texts • etc

  14. How things have changed ... • types of data (modalities and some genres) • means of storage • standardisation and metadata • dissemination • (most explosive) expanded into practice and workflow of linguists

  15. ELAR’s holdings • ELAR currently holds about 45 deposits with a total volume of approx 1.1 TB • the average deposit is about 25 GB, but sizes vary widely, with a few much larger deposits; the median size is around 10 GB • we expect volume to nearly double over the next year • see next slides for distribution of data types

  16. ELAR holdings by data type • data types for a representative sample (70%) of holdings • data type by volume (MB) and number of files, sorted by volume

  17. If you are a depositor, ELAR will • preserve your deposited materials • provide for making changes where possible • provide web-based metadata management • implement your access restrictions etc • give feedback about materials • provide advice, general and specific • assistance, eg data conversion • provide some equipment and services • on a case by case basis, develop resources

  18. Preservation

  19. Preservation issues • making materials robust • making storage robust • organisational, ownership and policy issues • changing technologies • refreshing • migrating

  20. Changing technologies • advantages of digital preservation • primarily: copying • items no longer unique • also transmission, dissemination • other implications • robust formats (standard, open, explicit) • formats with long horizons • formats easy to refresh • formats that don’t require particular software (sometimes software is intrinsic!) • may have to describe software or even archive the software

  21. Two preservation models • “preserve the bytestream” • keep the exact original at all costs • LOCKSS • “lots of copies keep stuff safe” • http://lockss.stanford.edu/ • guess which community it came from!

  22. Some backup issues • risk management • undetected problems and useless backups • aspects of professional backup: • scheduled frequencies, eg monthly, weekly, daily • retention • media and locations • naming/versions • proven restoration
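
The points about "undetected problems and useless backups" and "proven restoration" can be illustrated with a checksum manifest. This is a minimal sketch, not ELAR's procedure: the function names (`sha256_of`, `build_manifest`, `verify_restore`) and the choice of SHA-256 are assumptions made for illustration. The idea is to record a digest for every file at backup time, then compare a test restoration against that record.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest at backup time."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(restored_root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or altered in a restored copy."""
    problems = []
    for relative_name, expected in manifest.items():
        candidate = restored_root / relative_name
        if not candidate.is_file():
            problems.append(f"missing: {relative_name}")
        elif sha256_of(candidate) != expected:
            problems.append(f"changed: {relative_name}")
    return problems
```

An empty list from `verify_restore` means every file in the manifest was restored byte for byte. A backup that has never been restored and checked in some such way remains unproven.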

  23. Top 10 worst ways to collect/manage data • 1. No backup • 2. Divergent versions of same data • 3. Unlabeled disks/media • 4. Non-standard or undocumented filenames • 5. Master recordings used to review/analyse data • 6. Don’t know how characters are encoded • 7. Never tried to convert/export data • 8. Unprocessed or unedited audio and video • 9. Inconsistent recording • 10. Unmonitored recording

  24. Archive interactions with documentation

  25. Documenter and archive interactions • grant formulation and application • communications, questions, advice • training • archiving services

  26. Documenter & archive interactions

  27. Query/interaction topics • analysis of approx 150 queries from documenters/linguists over nearly 2 years

  28. What and how to archive

  29. What can you archive (at ELAR)? • media - sound, video • graphics - images, scans • text - fieldnotes, grammars, description, analysis • structured data - aligned and annotated transcriptions, databases, lexica • metadata - structured, standardised contextual information about the materials

  30. Archive objects • informed by traditions, eg document archives • sometimes called “resources”, bundles • it could be a file, a set of files, a directory, a “session” or a coherent item with many parts • should have archival qualities eg Bird & Simons “7 Dimensions” (or see Thieberger in LDD2) • may impose standard structures or formats • need deposit event and processes • legal and protocol • verification • accession • ongoing processes

  31. Archive objects should be selected • example: video: How much volume allocated? • answer: ... • however, e.g.: • unlikely that linguist is in position to plan and consistently create excellent video, so selection is unavoidable • data has always been selected!

  32. (... selection) • in your typical work you also: • selected • labeled • transformed/processed/edited • added, corrected, expanded • made links • made or assumed relationships between “whole” and processed units; invented labels, IDs, scope etc • imposed formats

  33. Data portability • Bird and Simons 2003: (for language documentation) our data should have integrity, flexibility, longevity and utility

  34. Data portability • complete • explicit • documented • preservable • transferable • accessible • adaptable • not technology-specific • (also appropriate, accurate, useful etc!!)

  35. Formats - media - preferred • sound - WAV • image - BMP, TIFF, JPEG • video - MPEG2

  36. Formats - documents - preferred • plain text, with or without markup • PDF (PDF/A) • XML, other systematic markup (with description of markup system) • well-structured documents in common Office formats - ELAR will eventually convert them to archive formats • character encoding : • preferred encoding is ASCII or Unicode • clearly document any other encodings used, e.g. ISO 8859-5 • discuss with us if you use font substitution to handle non-Roman characters

  37. Formats - characters - preferred • character encoding : • ASCII or Unicode (UTF-8) • you must clearly document any other encodings used, e.g. ISO 8859-9 • discuss with us if you use font substitution to handle non-Roman characters
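
One way to check a file against the encoding preferences above is sketched here, under stated assumptions. The helper names `classify_encoding` and `to_utf8` are hypothetical, and the check only tells you whether the bytes are valid ASCII or UTF-8. It cannot guess which legacy encoding was used, which is why other encodings need to be documented.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for a byte string."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: the encoding must come from documentation.
        return "other"


def to_utf8(data: bytes, documented_encoding: str) -> bytes:
    """Re-encode bytes from a documented legacy encoding into UTF-8."""
    return data.decode(documented_encoding).encode("utf-8")
```

For example, text saved as ISO 8859-9 would be classified as "other" as soon as it contains a non-ASCII letter, and `to_utf8(data, "iso-8859-9")` would convert it once the encoding is known.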

  38. Filenames and directories • characters [A-Z], [a-z], [0-9], underscore and a single full stop before the extension • correct MIME extension • favour lower case letters • maximum length 30 characters • maximum directory depth 8 • = ASCII only, no spaces
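
The filename rules above can be checked mechanically. The following is a sketch only, with several assumptions: the function name `check_path` is hypothetical, the 30-character limit is taken to include the extension, directory depth is counted as the number of directories above the file, directory names are held to the same character set, and the MIME-extension rule is not checked.

```python
import re
from pathlib import PurePosixPath

# Letters, digits, underscore, then a single full stop and an extension.
FILENAME_RE = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9]+")
DIRNAME_RE = re.compile(r"[A-Za-z0-9_]+")
MAX_NAME_LENGTH = 30   # assumption: the limit includes the extension
MAX_DIR_DEPTH = 8      # assumption: directories above the file


def check_path(relative_path: str) -> list[str]:
    """Return a list of rule violations for a path relative to the deposit root."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return ["empty path"]
    *directories, filename = parts
    problems = []
    if len(directories) > MAX_DIR_DEPTH:
        problems.append(f"directory depth {len(directories)} exceeds {MAX_DIR_DEPTH}")
    for directory in directories:
        if not DIRNAME_RE.fullmatch(directory):
            problems.append(f"directory name has disallowed characters: {directory}")
    if not FILENAME_RE.fullmatch(filename):
        problems.append(f"filename has disallowed characters or dots: {filename}")
    if len(filename) > MAX_NAME_LENGTH:
        problems.append(f"filename longer than {MAX_NAME_LENGTH} characters: {filename}")
    if filename != filename.lower():
        problems.append(f"prefer lower case: {filename}")
    return problems
```

A name such as `session_01/rec6_lel.wav` passes with no problems, while `Rec6-LEL.wav` is flagged both for the hyphen and for upper case.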

  39. Semantics of filenames • don’t stuff meaningful information into filenames - use metadata instead • versions • use directory structures wisely

  40. Data format duty cycle examples

  41. Evaluation and conversion examples

  42. Characters • did my characters come through? • answer: ... • however: • perhaps ELAR should do it? hápa ki hená mázaska wikcémna núpa iyóphe-wa-ye kst DBW wóz?az?a-s?ni yeló DB OK wash things-NEG ASS.M 'he didn't do the wash' wózaza-sni yeló DB OK wash things-NEG ASS.M 'he didn't do the wash'

  43. Preservation • Is my file preservable? • Note: • characters? • inconsistent segmentation • data as comments • conventions/metadata Text transcription: “Korimáka” Language: Choguita Rarámuri Language used for transcription: Spanish Consultant: Luz Elena León Ramírez Linguist: abriela Cabaero Transcription: erth Fuen & Gabrela Cabaero Date recorded: 11/02/2006 Date tranbscribed: 11/02/2006 Recording: rec6-LEL.wav

  44. Knowledge representation 1 - before wama momol chi naron mon chayako (LB) / wama momol chi naron chayako (MD) wama momol chi nan mon chayako (more emphatic(LB) / wama momol chi nan chayako (MD) Why don't you and him do it? + Notes have both of these sentences without the negator mon. OK runon naynangkroy ile ri He ate their sago. * kipin kannangkroy ngolu intended: We ate their cassowary. OK kipin kanangkroy ngolu We ate their cassowary.

  45. Knowledge representation 1 - after * kipin kannangkroy ngolu intended: We ate their cassowary. OK kipin kanangkroy ngolu We ate their cassowary. <sentence.set num="75"> <version> <walman>Kipin kannangkroy ngolu</walman> <judgement>*</judgement> </version> <english>We ate their cassowary. </english> </sentence.set> <sentence.set num="76"> <version> <walman>Kipin kanangkroy ngolu</walman> <judgement>OK</judgement> </version> <english>We ate their cassowary.</english> </sentence.set>
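
One benefit of the explicit markup in the "after" version is that judgements become queryable. The sketch below uses Python's standard `xml.etree.ElementTree` to list the starred forms. The enclosing `<sentences>` root element and the function name `starred_forms` are assumptions added for illustration, since the slide shows only the two `<sentence.set>` fragments.

```python
import xml.etree.ElementTree as ET

# The two fragments from the slide, wrapped in an assumed <sentences> root.
SAMPLE = """<sentences>
<sentence.set num="75">
  <version>
    <walman>Kipin kannangkroy ngolu</walman>
    <judgement>*</judgement>
  </version>
  <english>We ate their cassowary.</english>
</sentence.set>
<sentence.set num="76">
  <version>
    <walman>Kipin kanangkroy ngolu</walman>
    <judgement>OK</judgement>
  </version>
  <english>We ate their cassowary.</english>
</sentence.set>
</sentences>"""


def starred_forms(xml_text: str) -> list[tuple[str, str]]:
    """Return (sentence number, Walman form) for every version judged '*'."""
    root = ET.fromstring(xml_text)
    results = []
    for sentence_set in root.iter("sentence.set"):
        for version in sentence_set.findall("version"):
            if version.findtext("judgement") == "*":
                results.append((sentence_set.get("num"), version.findtext("walman")))
    return results
```

The same question is much harder to ask of the "before" notes, where the judgement, the form and the translation are distinguished only by layout.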

  46. Knowledge representation 2 • avoid generic software “convert to XML” <?xml version="1.0" encoding="UTF-8"?> <FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult"> <PRODUCT BUILD="06/26/2002" NAME="FileMaker Pro" VERSION="6.0v2"/> <DATABASE DATEFORMAT="M/d/yyyy" LAYOUT="" NAME="Videos" RECORDS="13" TIMEFORMAT="h:mm:ss a"/> <METADATA> <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Index name" TYPE="TEXT"/> <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Image desc" TYPE="TEXT"/> <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Date" TYPE="TEXT"/> <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Content" TYPE="TEXT"/> </METADATA> <RESULTSET FOUND="13"> <ROW MODID="16" RECORDID="40"> <COL><DATA>Morly Beeta</DATA></COL> <COL><DATA>Interview with Morly Beeta</DATA></COL> <COL><DATA>Jan/13/05</DATA></COL> <COL><DATA>Obu history by Morly Beeta</DATA></COL> </ROW>

  47. ELAR conversion - original

  48. ELAR conversion - XHTML

  49. ELAR conversion - XHTML
