Querying Spoken Language Corpora

Thomas Schmidt

IDS Mannheim



Outline

  • Background: EXMARaLDA, FOLKER, AGD, DGD2

  • Transcription: Data models, data formats, TEI

  • Corpora: Recordings, transcripts, metadata

  • Query requirements

  • Query technologies

  • Demo

  • Future directions



Background

  • EXMARaLDA: System for building and querying spoken language corpora

  • Used in many individual projects, at the HZSK CLARIN Centre

  • Transcription editor, Corpus management tool, query tool EXAKT

  • FOLKER: Transcription tool – same technical basis, optimised for Research and Teaching Corpus of Spoken German (FOLK)



  • Archive for Spoken German (AGD): central archive for oral corpora in Germany, IDS Mannheim

  • Dialect corpora, conversation corpora

  • Database for Spoken German (DGD2): access (browsing and query) for AGD data

Model: Single timeline, multiple tiers

  • Annotation tuples: text label + timeline reference

  • Timeline: fully ordered, reference to a recording

  • Tiers: collections of annotations of a specific category and a specific speaker; annotations within a tier do not overlap

     → Annotation Graph Framework (Bird/Liberman 2001)
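
The model above can be sketched as a small data structure. This is a minimal illustration, not EXMARaLDA's actual implementation: the class names `Annotation` and `Tier`, the category code "v" and the sample utterances are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    start: int  # index of the start point in the shared timeline
    end: int    # index of the end point in the shared timeline
    label: str  # text label


@dataclass
class Tier:
    speaker: str
    category: str
    annotations: list = field(default_factory=list)

    def add(self, ann):
        # An annotation must cover at least one timeline interval.
        if ann.end <= ann.start:
            raise ValueError("annotation must end after it starts")
        # Annotations within one tier must not overlap.
        for other in self.annotations:
            if ann.start < other.end and other.start < ann.end:
                raise ValueError("annotations in a tier must not overlap")
        self.annotations.append(ann)
        self.annotations.sort(key=lambda a: a.start)


# Fully ordered timeline: offsets (in seconds) into one recording.
timeline = [0.0, 1.2, 1.9, 2.6]

verbal_a = Tier(speaker="A", category="v")
verbal_a.add(Annotation(0, 1, "so you go "))
verbal_a.add(Annotation(1, 2, "left "))

# Speaker B overlaps A in time, which is fine: the overlap is across tiers.
verbal_b = Tier(speaker="B", category="v")
verbal_b.add(Annotation(1, 3, "mhm "))
```

Overlap between speakers is expressed by two tiers pointing at the same timeline points; only overlap inside a single tier is rejected.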

EXMARaLDA Basic Transcription:

(Flat) hierarchy of events in tiers

Use of ID and IDREFS to encode temporal relations

No additional markup, no „deep“ semantics
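
To illustrate how ID/IDREF pairs encode temporal relations, the sketch below resolves event references against a timeline. The XML is a simplified stand-in written for this example; its element and attribute names are assumptions and are not guaranteed to match the real EXMARaLDA schema.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative document: events point at timeline items by ID.
XML = """
<basic-transcription>
  <common-timeline>
    <tli id="T0" time="0.0"/>
    <tli id="T1" time="1.2"/>
    <tli id="T2" time="1.9"/>
  </common-timeline>
  <tier id="TIE0" speaker="A" category="v">
    <event start="T0" end="T1">so you go </event>
    <event start="T1" end="T2">left </event>
  </tier>
  <tier id="TIE1" speaker="B" category="v">
    <event start="T1" end="T2">mhm </event>
  </tier>
</basic-transcription>
""".strip()

root = ET.fromstring(XML)

# ID -> absolute time in seconds.
times = {tli.get("id"): float(tli.get("time")) for tli in root.iter("tli")}


def resolve_events(root, times):
    """Return (speaker, start_s, end_s, text) rows ordered by start time."""
    rows = []
    for tier in root.iter("tier"):
        for event in tier.iter("event"):
            rows.append((
                tier.get("speaker"),
                times[event.get("start")],  # IDREF resolved to seconds
                times[event.get("end")],
                event.text,
            ))
    return sorted(rows, key=lambda r: (r[1], r[0]))


events = resolve_events(root, times)
```

Two events that share the same start and end references (here A's "left" and B's "mhm") are simultaneous; nothing else in the markup says so.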






Data formats

  • Schmidt, Loehr et al. (2008): An exchange format for multimodal annotations.

    • XML format for data exchange between seven tools with STMT data models

       → improves interoperability for data creation

  • Drawbacks

    • no document order (non-linear, non-hierarchical)

    • what is the „full text“ / the „primary data“ / the „character data“?

    • no explicit representation of dependencies

    • temporal structure, not linguistic structure

       → bad for querying?

STMT to OHCO transformation

  • Segment chain = any temporally connected chain of annotations within one tier

  • Assumption: all other hierarchical structure beneath the level of segment chains

  • Correspondence: segment chain ↔ <u>
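
A minimal sketch of the segment-chain idea, assuming annotations are given as `(start, end, label)` tuples with integer timeline indices. The function names `segment_chains` and `to_u`, and the `who`/`start`/`end` attributes on the generated element, are choices made for this illustration rather than the actual conversion code.

```python
import xml.etree.ElementTree as ET


def segment_chains(annotations):
    """Group the annotations of ONE tier into temporally connected chains."""
    chains = []
    for ann in sorted(annotations):
        if chains and chains[-1][-1][1] == ann[0]:
            # Starts exactly where the previous annotation ended: same chain.
            chains[-1].append(ann)
        else:
            # Gap in the timeline: a new chain begins.
            chains.append([ann])
    return chains


def to_u(speaker, chain):
    """Map one segment chain to a single <u> element."""
    u = ET.Element("u", who=speaker,
                   start=f"T{chain[0][0]}", end=f"T{chain[-1][1]}")
    u.text = "".join(label for _, _, label in chain).strip()
    return u


tier = [(0, 1, "so you go "), (1, 2, "left "), (4, 5, "okay ")]
chains = segment_chains(tier)
utterances = [to_u("A", chain) for chain in chains]
```

Here the first two annotations form one chain and the third, separated by a gap, forms another, so the tier yields two `<u>` elements.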

Unparsed (EXAKT)

Parsed (DGD2)

Free annotation (EXAKT)

Token annotation (DGD2)

  • Schmidt (2011): A TEI-based Approach to Standardising Spoken Language Transcription. jTEI (1)

  • Romary, Witt, Schmidt: ISO/DIN PWI 24624: Transcription of Speech

Transcripts, recordings, metadata

  • Interaction metadata

    • date, „genre“, place, degree of formality, etc.

    • pertains to a (set of) transcription(s)

  • Speaker metadata

    • age, sex, language biography, speech impediments, etc.

    • pertains to (a) part(s) of a transcription

  • Audio and video recordings

    • for checking transcription quality

    • for obtaining information not encoded in transcripts

  • Transcripts

    • not (the) primary data!

    • a „convenient index into the recording“?

    • selective, theory-dependent, …


AGD Corpora: 8 mill. tokens

CGN Corpus: 9 mill. tokens

BNC Spoken: 10 mill. tokens

MICASE: 2 mill. tokens

Most other corpora: < 1 mill. tokens

(at least) one order of magnitude smaller than written corpora

Query speed is (not that) important

  • „In informal conversation in Northern Scotland, older female speakers tend to use ‚aye‘ as a backchannel signal with a rising intonation“

    • Situational context → Interaction metadata

    • Speaker metadata

    • Text data / Surface form → Transcript text

    • Interactional context → Temporal transcript structure

    • Prosodic properties → Recording

      Requirement #1: Access to all types of context

      Requirement #2: (Manual) postprocessing of query results
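
As a rough sketch of what requirements #1 and #2 imply for a query result, the example below attaches interaction metadata, speaker metadata and an audio offset to every hit, and leaves a slot for a manual decision. All data, field names and the age threshold for "older" are invented for the illustration.

```python
# Interaction metadata pertains to whole transcriptions.
interactions = {
    "I1": {"region": "Northern Scotland", "formality": "informal"},
    "I2": {"region": "London", "formality": "formal"},
}
# Speaker metadata pertains to parts of a transcription.
speakers = {
    "S1": {"sex": "f", "age": 71},
    "S2": {"sex": "m", "age": 34},
}
# (interaction id, speaker id, start in s, end in s, token)
tokens = [
    ("I1", "S2", 0.0, 0.4, "right"),
    ("I1", "S1", 0.5, 0.8, "aye"),
    ("I2", "S2", 3.0, 3.3, "aye"),
    ("I1", "S1", 4.1, 4.4, "aye"),
]


def find_hits(form, interaction_filter, speaker_filter):
    """Return hits for a surface form together with all of their context."""
    hits = []
    for inter_id, spk_id, start, end, token in tokens:
        if token != form:
            continue
        if not interaction_filter(interactions[inter_id]):
            continue
        if not speaker_filter(speakers[spk_id]):
            continue
        hits.append({
            "interaction": inter_id,
            "speaker": spk_id,
            "audio": (start, end),  # where to listen in the recording
            "selected": None,       # filled in during manual post-processing
        })
    return hits


hits = find_hits(
    "aye",
    lambda i: i["formality"] == "informal" and i["region"] == "Northern Scotland",
    lambda s: s["sex"] == "f" and s["age"] >= 60,  # "older" is an assumed cut-off
)

# Intonation is not in the transcript: a human listens and decides per hit.
hits[0]["selected"] = True
```

The query narrows the candidates by metadata and surface form; whether the intonation is rising can only be judged from the recording, which is why the result keeps the time offsets and a per-hit decision field.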

  • „After a cut-off word followed by a pause of more than 0.3 seconds, the cut-off word is frequently repeated“

    • special word tokens (incomplete words, semi-lexical material, …)

    • non-word tokens (pauses, non-verbal articulations, …)

    • temporal measurements (pause length)

      Requirement #3: Queries for „special“ tokens

      Requirement #4: Queries with special properties (numerical values, repetition)
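
The cut-off example can be sketched as a scan over typed tokens. The token types (`word`, `cutoff`, `pause`), the sample data and the prefix test used to decide whether a word counts as "repeated" are assumptions made for this illustration.

```python
tokens = [
    {"type": "word", "form": "then"},
    {"type": "cutoff", "form": "tur"},
    {"type": "pause", "dur": 0.45},
    {"type": "word", "form": "turn"},
    {"type": "cutoff", "form": "le"},
    {"type": "pause", "dur": 0.2},
    {"type": "word", "form": "left"},
    {"type": "cutoff", "form": "a"},
    {"type": "pause", "dur": 0.6},
    {"type": "word", "form": "well"},
]


def cutoff_pause_matches(tokens, min_pause=0.3):
    """Find cut-off word + pause longer than min_pause + following word."""
    results = []
    for i in range(len(tokens) - 2):
        first, second, third = tokens[i:i + 3]
        if (first["type"] == "cutoff"
                and second["type"] == "pause"
                and second["dur"] > min_pause      # numerical condition
                and third["type"] == "word"):
            # Assumed operationalisation of "repeated": the next word
            # starts with the cut-off fragment.
            repeated = third["form"].lower().startswith(first["form"].lower())
            results.append((first["form"], second["dur"], third["form"], repeated))
    return results


matches = cutoff_pause_matches(tokens)
```

The query needs three things a plain word search does not offer: token types beyond ordinary words, a numeric comparison on pause length, and a relation between two tokens in the same match.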

  • „Filled pauses are less frequent in overlapping speech than at the beginning of turns“

  • „Modal particles and modal adverbs often occur near one another in an utterance“ vs. „Filled pauses occur more frequently near another speaker's backchannel“

    Requirement #5: Queries for position in temporal structure

    Requirement #6: Multiple distance measures, query scopes
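
A minimal sketch of requirements #5 and #6, assuming every token carries a speaker and start/end times in seconds. The `Token` type, the helper names and the sample data are hypothetical; the point is only that "near" can mean distance in tokens within an utterance or distance in time across speakers.

```python
from collections import namedtuple

Token = namedtuple("Token", "speaker start end form")

tokens = [
    Token("A", 0.0, 0.4, "uh"),
    Token("A", 0.4, 0.9, "maybe"),
    Token("B", 0.7, 1.0, "mhm"),
    Token("A", 0.9, 1.3, "uh"),
    Token("A", 1.3, 1.8, "tomorrow"),
]


def in_overlap(token, tokens):
    """True if any token of another speaker overlaps this token in time."""
    return any(
        other.speaker != token.speaker
        and other.start < token.end
        and token.start < other.end
        for other in tokens
    )


def token_distance(a, b, tokens):
    """Distance in tokens between two tokens of the SAME speaker."""
    same = [t for t in tokens if t.speaker == a.speaker]
    return abs(same.index(a) - same.index(b))


def time_distance(a, b):
    """Gap in seconds between two tokens (0.0 if they overlap or touch)."""
    return max(0.0, max(a.start, b.start) - min(a.end, b.end))


filled_pauses = [t for t in tokens if t.form == "uh"]
overlap_flags = [in_overlap(t, tokens) for t in filled_pauses]
```

Token distance suits the modal-particle example, where both items sit in one speaker's utterance; time distance suits the backchannel example, where the two items belong to different speakers and have no common token sequence.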


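  • Requirements

    Access to all types of context

    Manual post-processing of query results

    Queries for „special“ tokens

    Queries with special properties (numerical values, repetition)

    Queries for position in temporal structure

    Multiple distance measures, query scopes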




Query result






  • EXAKT

    • Regular expression on „full text“ of <u>

    • (XPath on <u> with markup)

    • (XSL on transcripts)

  • DGD2

    • Oracle full text on documents

    • SQL on <w> with attributes
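
The two styles of querying listed above can be contrasted in a small sketch. The TEI-like fragment, the table layout and the tag values are invented for this example, and SQLite merely stands in for the Oracle database mentioned above; none of this reflects the actual EXAKT or DGD2 code.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Invented fragment: utterances containing token-annotated words.
TEI = """<body>
  <u who="A"><w lemma="go" pos="VB">go</w> <w lemma="left" pos="RB">left</w></u>
  <u who="B"><w lemma="okay" pos="UH">okay</w> <w lemma="leave" pos="VBD">left</w></u>
</body>"""
root = ET.fromstring(TEI)

# Style 1: regular expression on the "full text" of each <u>.
pattern = re.compile(r"\bleft\b")
regex_hits = [
    u.get("who") for u in root.iter("u")
    if pattern.search("".join(u.itertext()))
]

# Style 1b: XPath on <u> with markup (ElementTree's limited XPath subset).
xpath_hits = [w.text for w in root.findall(".//u/w[@pos='RB']")]

# Style 2: SQL on <w> with attributes, one table row per token.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE w (who TEXT, position INTEGER, form TEXT, lemma TEXT, pos TEXT)"
)
for u in root.iter("u"):
    for position, w in enumerate(u.iter("w")):
        con.execute(
            "INSERT INTO w VALUES (?, ?, ?, ?, ?)",
            (u.get("who"), position, w.text, w.get("lemma"), w.get("pos")),
        )
sql_hits = con.execute(
    "SELECT who, form FROM w WHERE lemma = ? ORDER BY who", ("leave",)
).fetchall()
```

The regular expression finds the surface form "left" in both utterances, while the attribute-based queries can tell the adverb from the verb form because they see the token annotation.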

  • Demo 1: EXAKT with HaMaTaC corpus

  • HaMaTaC: Hamburg Map Task Corpus

    • advanced L2 learners of German

    • solving a map task

    • Orthographic transcription with lemma, POS, disfluency annotation

Demo 2: DGD2 with FOLK Corpus

FOLK: Research & Teaching Corpus of Spoken German

Future directions:

Support a „real“ query language: CQL

CQPweb as a test case

User survey DGD2 (approaching 2000 users!)

TEI as common ground

for different spoken language corpora query platforms?

for querying spoken and written data side-by-side?