
Customizing the IMDI metadata schema for endangered languages




  1. Customizing the IMDI metadata schema for endangered languages • Heidi Johnson (AILLA) • Arienne Dwyer (DOBES)

  2. Introduction • IMDI: ISLE (International Standards for Language Engineering) Metadata Initiative • DOBES: Volkswagen Foundation’s Documentation of Endangered Languages initiative • AILLA: the Archive of the Indigenous Languages of Latin America

  3. Types of resources • Audio and video recordings in various digital formats • Annotation text files, e.g. transcriptions and translations • Standalone texts, e.g. dictionaries, poetry • Wide range of genres: from verbal art to scholarly analyses

  4. Bundles of resources • Session (IMDI, 2001): the resources resulting from a linguistic elicitation session, i.e. recordings and annotations. • Models only one kind of resource production: a recording session. • Collections will include a greater variety of resources, organized in sets of related materials.

  5. Types of bundles • Canonical bundle: the original session. A digitized recording, in different formats, and some textual annotation files, also in different formats. • Minimal bundle: a single file. Examples: dictionary, poem, recording of uninterpretable chants. • Meta-bundle: a bundle containing other bundles. Example: a book about a set of annotated recordings.
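The three bundle types can be told apart by what a bundle contains. A minimal sketch in Python, assuming a hypothetical `Bundle` class; the class, its field names, the classification rule, and the file names are illustrative and not part of IMDI:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bundle:
    """Hypothetical model of a bundle: files plus optional nested bundles."""
    name: str
    files: List[str] = field(default_factory=list)
    sub_bundles: List["Bundle"] = field(default_factory=list)

    def kind(self) -> str:
        # A bundle that contains other bundles is a meta-bundle.
        if self.sub_bundles:
            return "meta-bundle"
        # A single file on its own is a minimal bundle.
        if len(self.files) == 1:
            return "minimal"
        # Otherwise treat it as a canonical session-style bundle.
        return "canonical"


# Illustrative instances (file names are made up).
session = Bundle("session", ["rec.wav", "rec.mp3", "transcript.txt"])
poem = Bundle("poem", ["poem.pdf"])
book = Bundle("book", ["book.pdf"], sub_bundles=[session])
```

Here `session.kind()`, `poem.kind()`, and `book.kind()` classify the three examples from the slide as canonical, minimal, and meta-bundle respectively.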

  6. Bundle elements • Current: name of bundle; date and place of production • Proposed: resource relations; date archived; last modified

  7. Major subschemas • Project • Collector • Content • Participants • Resources • References

  8. The Content Subschema • Genre is the top-level category: • Interaction: conversation, interview … • Explanation: description, recipe … • Performance: narrative, poem, oratory … • Teaching: primer, textbook … • Analysis: grammar, dictionary …

  9. Other Content categories • Modality: speech, writing, gesture • Communication context: interactivity, planning, involvement • Languages • Task • Description • Keys

  10. AILLA’s Content Keys • Register: a characterization of how the discourse reflects the social context. Example: honorific speech. • Style: poetic and stylistic effects. Examples: parallelism, metered verse.
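Locally defined keys such as Register and Style can be pictured as name/value pairs attached to a Content record. A rough sketch in Python, assuming a plain dictionary layout and a hypothetical `set_key` helper; neither reflects the actual IMDI file format:

```python
# Hypothetical Content record: fixed elements plus an open-ended "Keys" map.
content = {
    "Genre": "Performance",
    "Modality": "speech",
    "Keys": {},
}


def set_key(record: dict, name: str, value: str) -> None:
    """Attach a locally defined key without touching the fixed elements."""
    record["Keys"][name] = value


# AILLA's two content keys, with example values taken from the slide.
set_key(content, "Register", "honorific speech")
set_key(content, "Style", "parallelism")
```

The fixed elements stay the same for every archive, while each archive is free to add its own entries under `Keys`.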

  11. The Project subschema • Current elements: Name (a nickname or acronym); Title (official title); ID (a unique identifier); Contact information • Proposed element: Funder (name of funding organization)

  12. The Collector subschema • AILLA renames this Depositor, since this is the individual we have to keep track of (e.g. for Level 3 access permission). When the Depositor is not also the Collector, Collector can be listed under Participants.

  13. The Participants subschema • Type: functional role, e.g. creator • Role: family relationship • Name/Full name • Language(s) • Ethnic group, age, sex • Education • Anonymous: True if participant’s Full name is reserved; False otherwise
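One way to read the Anonymous element is as a switch on which name may be shown publicly. A minimal sketch in Python, assuming a hypothetical `Participant` class and `public_label` method; the names and the display rule are illustrative assumptions, not IMDI behavior:

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical participant record with only the name-related elements."""
    name: str               # short name or code
    full_name: str          # reserved when anonymous is True
    anonymous: bool = False

    def public_label(self) -> str:
        # Show the full name only when it is not reserved.
        return self.name if self.anonymous else self.full_name


# Illustrative records (made-up names).
speaker = Participant(name="SP1", full_name="Example Speaker", anonymous=True)
collector = Participant(name="COL", full_name="Example Collector")
```

With these records, `speaker.public_label()` falls back to the short name, while `collector.public_label()` gives the full name.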

  14. AILLA additions to Participants • Origin: place (country, region, etc.) of origin of the creator of the primary resource in the bundle (e.g. the speaker whose voice is recorded). • Occupation: can be relevant in assessing the accuracy of some kinds of data.

  15. The Resources subschema • Resources contains information about formats and provenance of files in a bundle. • Media Files: audio, video, etc. • Annotation Files: text files. • Proposal: call them all Media Files, to reduce redundancy in the database. (All have URL, size, etc. elements.)
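The proposal to treat every file as a Media File can be sketched as a single record type with a field that tells file types apart. A rough illustration in Python, assuming a hypothetical `MediaFile` class; the field names, the `kind` values, and the example URLs are made up:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MediaFile:
    """Hypothetical unified record: one shape for recordings and annotations."""
    url: str
    size_bytes: int
    kind: str  # e.g. "audio", "video", "annotation"


# Illustrative files in one bundle (URLs and sizes are made up).
files: List[MediaFile] = [
    MediaFile("https://example.org/session1.wav", 52_000_000, "audio"),
    MediaFile("https://example.org/session1.mpg", 310_000_000, "video"),
    MediaFile("https://example.org/session1.txt", 18_000, "annotation"),
]

# Annotation files are still easy to pick out when needed.
annotations = [f for f in files if f.kind == "annotation"]
```

Because the shared elements (URL, size, and so on) live in one record type, a database needs only one table for them, and the distinction between media and annotation files survives as a field value.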

  16. Text resources • Current elements: • Type: type of annotation, e.g. phonetic transcription. • Content encoding: annotation encoding scheme, e.g. EUROTYP. • Character encoding: character set(s) used in a text file.

  17. Text resources 2 • Proposed elements: • Transcription type • Translation (aka Glossing) type • Software: used to produce transcriptions, translations, other annotations (e.g. Shoebox) • Describe Annotator in Participants (along with Translator, etc.)

  18. Proposed subschema • Place, composed of several elements: Continent, Country, Region, Subregion (address) • Used at least twice: in Bundle and in Participants (Origin). • Might also be useful in the Language subschema.
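Defining Place once and reusing it wherever a location is needed might look like the following Python sketch. The `Place` class, its `label` method, the dictionary layout, and the example values are all assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    """Hypothetical Place subschema with the four proposed elements."""
    continent: str
    country: str
    region: str = ""
    subregion: str = ""

    def label(self) -> str:
        # Join only the elements that were actually filled in.
        parts = [self.continent, self.country, self.region, self.subregion]
        return ", ".join(p for p in parts if p)


# The same structure serves both uses named on the slide (values are made up).
recording_place = Place("South America", "Peru", "Cusco")
bundle = {"name": "session1", "place": recording_place}
participant = {"name": "SP1", "origin": Place("South America", "Bolivia")}
```

Both the bundle's place of production and the participant's Origin are then described with the same elements, which keeps the two comparable.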

  19. Conclusion • IMDI schema is a flexible tool. • Customization through Key/Value pairs allows local modifications. • Most of the proposed changes are terminological, moving from the DOBES in-house terminology to more general usage.
