Querying spoken language corpora

Thomas Schmidt

IDS Mannheim


Outline

  • Background: EXMARaLDA, FOLKER, AGD, DGD2

  • Transcription: Data models, data formats, TEI

  • Corpora: Recordings, transcripts, metadata

  • Query requirements

  • Query technologies

  • Demo

  • Future directions


Background

  • EXMARaLDA: System for building and querying spoken language corpora

  • Used in many individual projects, at the HZSK CLARIN Centre

  • Transcription editor, Corpus management tool, query tool EXAKT

  • FOLKER: Transcription tool – same technical basis, optimised for Research and Teaching Corpus of Spoken German (FOLK)


Background

  • Archive for Spoken German (AGD): central archive for oral corpora in Germany, IDS Mannheim

  • Dialect corpora, conversation corpora

  • Database for Spoken German (DGD2): access (browsing and query) for AGD data


Model: Single timeline, multiple tiers

  • Annotation tuples: text label + timeline reference

  • Timeline: fully ordered, reference to a recording

  • Tiers: collections of annotations of a specific category and a specific speaker; annotations within a tier do not overlap

     → Annotation Graph Framework (Bird/Liberman 2001)
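
The model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names (Annotation, Tier, Transcription, time_of) are invented here and are not taken from EXMARaLDA or any other tool. Annotations point at timeline positions by index, and a tier rejects annotations that overlap an existing one.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    label: str  # text label
    start: int  # index of the start point on the timeline
    end: int    # index of the end point on the timeline


@dataclass
class Tier:
    speaker: str
    category: str
    annotations: list = field(default_factory=list)

    def add(self, ann: Annotation) -> None:
        if ann.start >= ann.end:
            raise ValueError("annotation must span at least one timeline interval")
        # Annotations within one tier must not overlap.
        for other in self.annotations:
            if ann.start < other.end and other.start < ann.end:
                raise ValueError(f"{ann.label!r} overlaps {other.label!r}")
        self.annotations.append(ann)


@dataclass
class Transcription:
    timeline: list  # fully ordered offsets (in seconds) into the recording
    tiers: list = field(default_factory=list)

    def time_of(self, ann: Annotation) -> tuple:
        # Resolve timeline references to absolute times in the recording.
        return self.timeline[ann.start], self.timeline[ann.end]


# Hypothetical sample data.
timeline = [0.0, 0.8, 1.5, 2.1]
verbal = Tier(speaker="A", category="verbal")
verbal.add(Annotation("good morning", 0, 1))
verbal.add(Annotation("how are you", 1, 3))
transcription = Transcription(timeline, [verbal])
```

Calling transcription.time_of(verbal.annotations[1]) resolves the second annotation to its start and end time in the recording.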


EXMARaLDA Basic Transcription:

(Flat) hierarchy of events in tiers

Use of ID and IDREFS to encode temporal relations

No additional markup, no „deep“ semantics
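
A minimal sketch of how ID/IDREF references can be resolved when reading such a file. The element and attribute names in SAMPLE are modeled on the EXMARaLDA basic transcription format as far as I recall it, but the document is simplified and should be read as an assumption, not as the exact schema. The helper resolve_events is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; real files carry more metadata.
SAMPLE = """
<basic-transcription>
  <common-timeline>
    <tli id="T0" time="0.0"/>
    <tli id="T1" time="0.8"/>
    <tli id="T2" time="1.5"/>
  </common-timeline>
  <tier id="TIE0" speaker="A" category="v">
    <event start="T0" end="T1">good morning </event>
    <event start="T1" end="T2">how are you </event>
  </tier>
  <tier id="TIE1" speaker="B" category="v">
    <event start="T1" end="T2">morning </event>
  </tier>
</basic-transcription>
"""

root = ET.fromstring(SAMPLE)

# Map each timeline item ID to its offset in the recording.
times = {tli.get("id"): float(tli.get("time")) for tli in root.iter("tli")}


def resolve_events(root, times):
    """Return (speaker, start, end, text) rows with IDREFs replaced by times."""
    rows = []
    for tier in root.iter("tier"):
        for event in tier.iter("event"):
            rows.append((
                tier.get("speaker"),
                times[event.get("start")],
                times[event.get("end")],
                event.text.strip(),
            ))
    # Sort by start time, then speaker, to get a readable interleaving.
    return sorted(rows, key=lambda row: (row[1], row[0]))


events = resolve_events(root, times)
```

The temporal relation between the two speakers (B's "morning" runs parallel to A's "how are you") is only visible after the references are resolved, which is the point the slide makes about the absence of deeper structure in the file itself.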


[Figure panels: EXMARaLDA, ELAN, Praat]


Data formats

  • Schmidt, Loehr et al. (2008): An exchange format for multimodal annotations.

    • XML format for data exchange between seven tools with STMT (single timeline, multiple tiers) data models

       → improves interoperability for data creation

  • Drawbacks

    • no document order (non-linear, non-hierarchical)

    • what is the „full text“ / the „primary data“ / the „character data“?

    • no explicit representation of dependencies

    • temporal structure, not linguistic structure

       → bad for querying?


STMT to OHCO transformation

  • Segment chain = any temporally connected chain of annotations within one tier

  • Assumption: all other hierarchical structure beneath the level of segment chains

  • Correspondence: segment chain ↔ <u>
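
The segment chain definition can be sketched as follows. The assumption made here is that two annotations are "temporally connected" when the end point of one is the start point of the next; the function name segment_chains and the tuple layout are invented for illustration.

```python
def segment_chains(annotations):
    """Group (start, end, label) tuples from ONE tier into segment chains.

    Two annotations belong to the same chain if the first ends exactly
    where the second starts (assumed meaning of "temporally connected").
    """
    chains = []
    for ann in sorted(annotations):
        if chains and chains[-1][-1][1] == ann[0]:
            chains[-1].append(ann)   # continues the current chain
        else:
            chains.append([ann])     # gap on the timeline: new chain
    return chains


# Hypothetical tier: timeline indices 3 and 4 are not covered by this speaker.
tier = [(0, 1, "so "), (1, 2, "I went "), (2, 3, "home "), (5, 6, "yeah ")]
chains = segment_chains(tier)

# Each chain corresponds to one <u>; here we only join the text.
utterances = ["".join(label for _, _, label in chain).strip() for chain in chains]
```

With the sample tier, the first three annotations form one chain and the isolated "yeah" forms a second, so two utterance units result.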


Unparsed (EXAKT)

Parsed (DGD2)


Free annotation (EXAKT)

Token annotation (DGD2)


  • Schmidt (2011): A TEI-based Approach to Standardising Spoken Language Transcription. jTEI (1)

  • Romary, Witt, Schmidt: ISO/DIN PWI 24624: Transcription of Speech


Transcripts, recordings, metadata

  • Interaction metadata

    • date, „genre“, place, degree of formality, etc.

    • pertains to a (set of) transcription(s)

  • Speaker metadata

    • age, sex, language biography, speech impediments, etc.

    • pertains to (a) part(s) of a transcription

  • Audio and video recordings

    • for checking transcription quality

    • for obtaining information not encoded in transcripts

  • Transcripts

    • not (the) primary data!

    • a „convenient index into the recording“?

    • selective, theory-dependent, …


Corpora

AGD Corpora: 8 mill. tokens

CGN Corpus: 9 mill. tokens

BNC Spoken: 10 mill. tokens

MICASE: 2 mill. tokens

Most other corpora: < 1 mill. tokens

(at least) one order of magnitude smaller than written corpora

Query speed is (not that) important


  • „In informal conversation in Northern Scotland, older female speakers tend to use ‚aye‘ as a backchannel signal with a rising intonation“

    • Situational context → Interaction metadata

    • Speaker metadata

    • Text data / Surface form → Transcript text

    • Interactional context → Temporal transcript structure

    • Prosodic properties → Recording

      Requirement #1: Access to all types of context

      Requirement #2: (Manual) postprocessing of query results
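
A rough sketch of what these two requirements mean for a query result. All data, identifiers and place names below are made up, and the age threshold standing in for "older" is an arbitrary assumption. Each hit carries references to speaker and interaction metadata, plus time offsets into the recording so that prosody can be checked by ear afterwards.

```python
# Hypothetical metadata tables.
speakers = {
    "S1": {"sex": "f", "age": 71},
    "S2": {"sex": "m", "age": 34},
    "S3": {"sex": "f", "age": 25},
}
interactions = {
    "I1": {"genre": "informal conversation", "place": "Aberdeen"},
    "I2": {"genre": "interview", "place": "Inverness"},
}

# Hypothetical raw hits for the surface form "aye".
hits = [
    {"form": "aye", "speaker": "S1", "interaction": "I1", "start": 12.4, "end": 12.7},
    {"form": "aye", "speaker": "S2", "interaction": "I1", "start": 15.0, "end": 15.3},
    {"form": "aye", "speaker": "S1", "interaction": "I2", "start": 3.1, "end": 3.4},
    {"form": "aye", "speaker": "S3", "interaction": "I1", "start": 20.2, "end": 20.5},
]


def with_context(hit):
    """Attach speaker and interaction metadata to a hit."""
    return {
        **hit,
        "speaker_meta": speakers[hit["speaker"]],
        "interaction_meta": interactions[hit["interaction"]],
    }


OLDER_THAN = 60  # arbitrary threshold for "older", an assumption

selected = [
    hit for hit in map(with_context, hits)
    if hit["speaker_meta"]["sex"] == "f"
    and hit["speaker_meta"]["age"] >= OLDER_THAN
    and hit["interaction_meta"]["genre"] == "informal conversation"
]
```

Metadata narrows the hits automatically, but whether the remaining "aye" is a backchannel with rising intonation still has to be judged manually from the transcript context and the recording at the given offsets.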


  • „After a cut-off word followed by a pause of more than 0.3 seconds, the cut-off word is frequently repeated“

    • special word tokens (incomplete words, semi-lexical material, …)

    • non-word tokens (pauses, non-verbal articulations, …)

    • temporal measurements (pause length)

      Requirement #3: Queries for „special“ tokens

      Requirement #4: Queries with special properties (numerical values, repetition)
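
The example hypothesis can be expressed as a query over a typed token sequence. This is a sketch under assumptions: the Token class, its kind values and the repetition test (the following word starts with the cut-off fragment) are all invented here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    form: str
    kind: str                         # "word", "cutoff" or "pause"
    duration: Optional[float] = None  # seconds, only set for pauses


def cutoff_pause_hits(tokens, min_pause=0.3):
    """Find cut-off words followed by a pause longer than min_pause.

    Returns (index, repeated) pairs; 'repeated' is True if the token after
    the pause starts with the cut-off fragment (assumed repetition test).
    Cut-offs with no token after the pause are not reported.
    """
    hits = []
    for i in range(len(tokens) - 2):
        first, pause, nxt = tokens[i:i + 3]
        if (first.kind == "cutoff"
                and pause.kind == "pause"
                and pause.duration is not None
                and pause.duration > min_pause):
            repeated = nxt.kind in ("word", "cutoff") and nxt.form.startswith(first.form)
            hits.append((i, repeated))
    return hits


# Hypothetical token sequence.
tokens = [
    Token("I", "word"),
    Token("wan", "cutoff"), Token("", "pause", 0.5), Token("wanted", "word"),
    Token("to", "word"),
    Token("le", "cutoff"), Token("", "pause", 0.2), Token("leave", "word"),
    Token("ear", "cutoff"), Token("", "pause", 0.4), Token("soon", "word"),
]
hits = cutoff_pause_hits(tokens)
```

The query needs token types (Requirement #3), a numerical comparison on pause length and a comparison between two matched tokens (Requirement #4), none of which a plain string search over the transcript text provides directly.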


  • „Filled pauses are less frequent in overlapping speech than at the beginning of turns“

  • „Modal particles and modal adverbs often occur near one another in an utterance“ vs. „Filled pauses occur more frequently near another speaker’s backchannel“

    Requirement #5: Queries for position in temporal structure

    Requirement #6: Multiple distance measures, query scopes

    […]
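
A small sketch of the two kinds of test these requirements imply: whether a token lies in overlapping speech, and how far apart two tokens are in time rather than in token positions. Function names, tuple layouts and sample values are assumptions for illustration only.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open time intervals share any stretch of time."""
    return a_start < b_end and b_start < a_end


def in_overlap(token, contributions):
    """True if the token overlaps speech by a different speaker.

    token: (speaker, start, end, form)
    contributions: iterable of (speaker, start, end)
    """
    speaker, start, end, _form = token
    return any(
        other != speaker and intervals_overlap(start, end, o_start, o_end)
        for other, o_start, o_end in contributions
    )


def time_distance(first, second):
    """Gap in seconds between two tokens; 0.0 if they touch or overlap."""
    return max(0.0, max(first[1], second[1]) - min(first[2], second[2]))


# Hypothetical data: speaker B starts while A is still talking.
contributions = [("A", 0.0, 2.0), ("B", 1.5, 3.0)]
tokens = [
    ("A", 0.2, 0.5, "uh"),
    ("A", 1.6, 1.9, "uh"),
    ("B", 2.5, 2.8, "uh"),
]
flags = [in_overlap(token, contributions) for token in tokens]
```

Here only the second filled pause falls inside overlapping speech. A distance "within an utterance" can be counted in tokens, while a distance to another speaker's backchannel is more naturally measured in seconds, which is why more than one distance measure is needed.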


  • Requirements

    Access to all types of context

    Manual post-processing of query results

    Queries for special tokens

    Queries with special properties

    Queries for position in temporal structure

    Multiple distance measures, query scopes


[Diagram labels: Corpus, Transcripts, Recordings, Metadata, Query, Query result, Context, Postprocessing]


  • EXAKT

    • Regular expression on „full text“ of <u>

    • (XPath on <u> with markup)

    • (XSL on transcripts)

  • DGD2

    • Oracle full text on documents

    • SQL on <w> with attributes
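
The two query styles can be contrasted in a short sketch. It does not reproduce either system: sqlite3 stands in for the Oracle database, the table layout and column names are invented, and the sample utterances and POS tags are illustrative.

```python
import re
import sqlite3

# Style 1: regular expression over the "full text" of each utterance.
utterances = {
    "u1": "ja ähm das ist gut",
    "u2": "nein das geht nicht",
}
pattern = re.compile(r"\bdas\s+\w+")
regex_hits = {uid: pattern.findall(text) for uid, text in utterances.items()}

# Style 2: SQL over word tokens with attributes (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE w (u_id TEXT, position INTEGER, form TEXT, lemma TEXT, pos TEXT)"
)
conn.executemany(
    "INSERT INTO w VALUES (?, ?, ?, ?, ?)",
    [
        ("u1", 1, "ja", "ja", "PTKANT"),
        ("u1", 2, "ähm", "ähm", "NGHES"),
        ("u1", 3, "das", "das", "PDS"),
        ("u1", 4, "ist", "sein", "VAFIN"),
        ("u1", 5, "gut", "gut", "ADJD"),
        ("u2", 1, "nein", "nein", "PTKANT"),
        ("u2", 2, "das", "das", "PDS"),
        ("u2", 3, "geht", "gehen", "VVFIN"),
        ("u2", 4, "nicht", "nicht", "PTKNEG"),
    ],
)

# Demonstrative pronoun immediately followed by a finite verb.
sql_hits = conn.execute(
    """
    SELECT a.u_id, a.form, b.form
    FROM w AS a
    JOIN w AS b ON a.u_id = b.u_id AND b.position = a.position + 1
    WHERE a.pos = 'PDS' AND b.pos LIKE 'V%FIN'
    ORDER BY a.u_id
    """
).fetchall()
conn.close()
```

Both queries find the same two places in this sample, but the regular expression only sees surface strings, whereas the SQL query can refer to lemma and POS attributes and generalises to any demonstrative followed by any finite verb.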


  • Demo 1: EXAKT with HaMaTaC corpus

  • HaMaTaC: Hamburg Map Task Corpus

    • advanced L2 learners of German

    • solving a map task

    • Orthographic transcription with lemma, POS, disfluency annotation


Demo 2: DGD2 with FOLK Corpus

FOLK: Research & Teaching Corpus of Spoken German


Future directions:

Support a „real“ query language: CQL

CQPWeb as a test case

User survey DGD2 (approaching 2000 users!)

TEI as common ground

for different spoken language corpora query platforms?

for querying spoken and written data side-by-side?

