
Transcribing and annotating spoken language with EXMARaLDA

LREC-Workshop on XML-based richly annotated corpora, Lisbon, 29 May 2004. Thomas Schmidt, Sonderforschungsbereich 538 Mehrsprachigkeit, University of Hamburg.


Presentation Transcript


  1. Transcribing and annotating spoken language with EXMARaLDA. LREC-Workshop on XML-based richly annotated corpora, Lisbon, 29 May 2004. Thomas Schmidt, Sonderforschungsbereich 538 Mehrsprachigkeit, University of Hamburg

  2. Richly annotated corpora?
     • Richly annotable corpora?
       • Corpus creation
       • Exchangeability
     • Framework for things to be annotated? Framework for annotations
     [Diagram labels: CHAT Corpus, HIAT-DOS Corpus, WordBase Corpus, Verbmobil Corpus, syncWriter Corpus; Transcription framework; Annotation framework]

  3. Partitur Transcriptions

  4. Partitur Transcriptions • Structural relations: • Temporal sequence

  5. Partitur Transcriptions • Structural relations: • Temporal sequence • Simultaneity

  6. Partitur Transcriptions • Structural relations: • Temporal sequence • Simultaneity • Equivalence („Flat“ annotation)

  7. Single timeline, multiple tiers

  8. Single timeline, multiple tiers

  9. Single timeline, multiple tiers
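The "single timeline, multiple tiers" idea and the three structural relations named on the Partitur slides (temporal sequence, simultaneity, equivalence) can be illustrated with a small data structure. This is a minimal sketch, not the EXMARaLDA format itself: the class names `Event`, `Tier` and `Transcription`, the point identifiers and the sample utterances are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    start: str  # id of a point on the shared timeline
    end: str    # id of a later point on the shared timeline
    text: str


@dataclass
class Tier:
    speaker: str
    category: str  # hypothetical labels, e.g. "verbal" or "annotation"
    events: List[Event] = field(default_factory=list)


@dataclass
class Transcription:
    timeline: List[str]  # ordered point ids shared by all tiers
    tiers: List[Tier] = field(default_factory=list)

    def _pos(self, point_id: str) -> int:
        return self.timeline.index(point_id)

    def precedes(self, a: Event, b: Event) -> bool:
        """Temporal sequence: a ends before (or exactly when) b starts."""
        return self._pos(a.end) <= self._pos(b.start)

    def simultaneous(self, a: Event, b: Event) -> bool:
        """Simultaneity: the two events share some stretch of the timeline."""
        return (self._pos(a.start) < self._pos(b.end)
                and self._pos(b.start) < self._pos(a.end))

    def equivalent(self, a: Event, b: Event) -> bool:
        """Equivalence ("flat" annotation): identical start and end points."""
        return a.start == b.start and a.end == b.end


# Invented sample data: two speakers plus one annotation tier.
hello = Event("T0", "T2", "Hello there")
bye = Event("T2", "T3", "Bye")
hi = Event("T1", "T2", "Hi")
gloss = Event("T0", "T2", "greeting")

transcription = Transcription(
    timeline=["T0", "T1", "T2", "T3"],
    tiers=[
        Tier("A", "verbal", [hello, bye]),
        Tier("B", "verbal", [hi]),
        Tier("A", "annotation", [gloss]),
    ],
)
```

In this sketch, an annotation is simply an event on another tier that spans the same timeline points as the event it describes, which is why only flat annotations fit the model.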

  10. EXMARaLDA Partitur-Editor Graphical User Interface

  11. EXMARaLDA Partitur-Editor Manipulating tiers, the timeline and events

  12. EXMARaLDA Partitur-Editor Visualization as a wrapped partitur ... as a line transcript ... in column notation

  13. TASX-Annotator

  14. PRAAT

  15. ELAN

  16. Variants of „single timeline, multiple tiers“

  17. Beyond the single timeline

  18. Beyond the single timeline • Simple annotation: Part of speech tagging • each word a single entity • add suitable points to the timeline
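Making each word a single entity by adding points to the timeline could look roughly like the following. It is a sketch under stated assumptions: events are plain `(start, end, text)` tuples, words are separated by whitespace, the new point ids (`T0.1`, `T0.2`, ...) are invented, and no other timeline point lies between the start and end of the event being split.

```python
def split_into_words(timeline, event):
    """Split one utterance-level event into word-level events.

    New points are inserted into `timeline` (in place) between the
    event's start and end points. Assumes start and end are adjacent
    on the timeline; the point naming scheme is hypothetical.
    """
    start, end, text = event
    words = text.split()
    # One new point for every word boundary inside the event.
    new_points = [f"{start}.{i}" for i in range(1, len(words))]
    insert_at = timeline.index(end)
    timeline[insert_at:insert_at] = new_points
    bounds = [start] + new_points + [end]
    return [(bounds[i], bounds[i + 1], word) for i, word in enumerate(words)]


timeline = ["T0", "T1"]
word_events = split_into_words(timeline, ("T0", "T1", "I don't know"))
```

Each resulting word event can then carry a part-of-speech label on a parallel annotation tier with the same start and end points.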

  19. Beyond the single timeline
      • Determine order of words (syllables, phonemes, ...) in overlaps, or
      • Allow bifurcations of the timeline
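One way to picture a bifurcation is as a partial order: points on the shared timeline are ordered with respect to everything, while points on different forks of the same interval are left unordered. The following is only an illustrative sketch; the `Point` class, its fields and the fork names are assumptions, not the representation EXMARaLDA actually uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Point:
    anchor: int                 # index of the shared point this point is, or follows
    fork: Optional[str] = None  # None for a shared point, else a fork name (e.g. a speaker)
    offset: int = 0             # position inside the fork (1, 2, ...)


def compare(a: Point, b: Point) -> Optional[int]:
    """Return -1, 0 or 1 if a and b are ordered, or None if they lie on
    different forks of the same interval and their order is undetermined."""
    if a.anchor != b.anchor:
        return -1 if a.anchor < b.anchor else 1
    if a.fork is None and b.fork is None:
        return 0
    if a.fork is None:
        return -1  # a shared point precedes the fork points that follow it
    if b.fork is None:
        return 1
    if a.fork != b.fork:
        return None  # overlap: no order is imposed across forks
    return (a.offset > b.offset) - (a.offset < b.offset)


shared_1 = Point(1)
shared_2 = Point(2)
a_word_1 = Point(1, "A", 1)
a_word_2 = Point(1, "A", 2)
b_word_1 = Point(1, "B", 1)
```

With this representation, word boundaries inside an overlap can be added for each speaker without having to decide which speaker's word came first.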

  20. Segmentation
      • EXMARaLDA Basic Transcription
        • „Single timeline, multiple tiers“
        • Intuitive transcription of verbal and non-verbal behaviour
        • Visualization
        • Exchange with TASX, PRAAT and ELAN
        • Simple (utterance level) annotation, e.g.:
          • Utterance translation
          • Prosody (Dynamic Modulation etc.)
      Finite State Machine (HIAT, GAT, DIDA, CHAT, ...)
      • EXMARaLDA Segmented Transcription
        • „Bifurcated timeline, multiple tiers“
        • Advanced (word, syllable, phoneme level) annotation, e.g.:
          • POS-Tagging
          • Morphological transliteration
          • Intonation contour
          • Tone
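The slide names a finite state machine as the step that turns a basic transcription into a segmented one. The toy machine below shows the general idea on a single utterance string. Its two states and its character classes are invented for illustration and do not implement the actual HIAT, GAT, DIDA or CHAT conventions.

```python
def segment(text):
    """Segment an utterance into word and punctuation units with a
    two-state machine (OUTSIDE / IN_WORD). The rules are a toy example."""
    segments = []
    state = "OUTSIDE"
    current = ""
    for ch in text:
        if ch.isalnum() or ch in "'-":
            # Word characters extend the current word.
            current += ch
            state = "IN_WORD"
        else:
            if state == "IN_WORD":
                # Leaving a word: emit it.
                segments.append(("word", current))
                current = ""
            state = "OUTSIDE"
            if not ch.isspace():
                segments.append(("punctuation", ch))
    if state == "IN_WORD":
        segments.append(("word", current))
    return segments


units = segment("Well, I don't know.")
```

A real transcription convention would need more states, for example for pauses, interruptions or non-phonological material, but the principle of deriving the segmented form automatically from the basic form stays the same.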

  21. Meta Data EXMARaLDA Corpus Manager (CoMa): Annotation of speakers and whole interactions

  22. Summary
      • EXMARaLDA Transcription Framework
      • „Single timeline, multiple tiers“ data model:
        • Common basis for different existing transcription systems
        • Intuitive, efficient data model suitable for
          • User-friendly input
          • Flexible visualization
          • Simple flat annotations
          • Exchange with other tools
      • Extended data model „Segmented transcription“:
        • Automatically generated from „Basic transcription“
        • More advanced flat annotations
      • Meta data annotation

  23. Open questions 1
      • Limitations
        • Hierarchical annotation (e.g. Phrase structure)?
        • Discontinuous constituents (e.g. German particle verbs)?
        • „Cross level“ (= cross tier) annotation?
        • Visualization?
      Exchange
      [Diagram labels: EXMARaLDA Basic Transcription, TASX Level 1, ELAN, PRAAT; Abstract Corpus Model; EXMARaLDA Segmented Transcription, TASX Level 2; Annotation graphs; several connections marked "?"]

  24. Open questions 2
      Hierarchy-based data models:
      • XML: standardized storage
      • DTDs/Schemas: validity check
      • XSLT: transformation
      • XPath / XQuery: query
      • DOM / NOM: in-memory representation
      Time-based data models:
      • XML: standardized storage
      • How to check validity?
      • How to transform?
      • How to query?
      • AGLIB?
      First step: Understand differences and commonalities between existing time-based data models
      Second step: „Harmonize“ existing time-based models
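The question of how to query a time-based model can be made concrete with a small example. Finding events that overlap a given interval depends on the order of points on the timeline, which is stored apart from the events, so a plain path expression over the element hierarchy is not enough. The sketch below resolves the order in Python first. The element and attribute names (`transcription`, `timeline`, `point`, `tier`, `event`) are a simplified, hypothetical layout, not the actual EXMARaLDA schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified time-based XML document.
DOC = """
<transcription>
  <timeline>
    <point id="T0"/><point id="T1"/><point id="T2"/><point id="T3"/>
  </timeline>
  <tier speaker="A" category="verbal">
    <event start="T0" end="T2">Hello there</event>
  </tier>
  <tier speaker="B" category="verbal">
    <event start="T1" end="T3">Hi</event>
  </tier>
</transcription>
"""

root = ET.fromstring(DOC)

# The hierarchy stores only point ids; their order has to be looked up.
order = {p.get("id"): i for i, p in enumerate(root.find("timeline"))}


def overlapping(root, start, end):
    """Return (speaker, text) for every event overlapping [start, end)."""
    hits = []
    for tier in root.findall("tier"):
        for ev in tier.findall("event"):
            if (order[ev.get("start")] < order[end]
                    and order[start] < order[ev.get("end")]):
                hits.append((tier.get("speaker"), ev.text))
    return hits


late = overlapping(root, "T2", "T3")
early = overlapping(root, "T0", "T1")
```

The lookup table `order` is the part that the hierarchy-based tools listed on the slide do not provide directly, which is one way to read the "How to query?" question.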
