
Structural Metadata Annotation of Speech Corpora: Comparing Broadcast News and Broadcast Conversations

Jáchym Kolář, Jan Švec

University of West Bohemia in Pilsen, Czech Republic

Talk Overview
  • Structural metadata annotation
  • Speech data
  • Statistics about fillers
  • Statistics about edit disfluencies
  • Statistics about sentence-like units
  • Summary

J. Kolar and J. Svec

Structural Metadata Extraction
  • Metadata Extraction (MDE) research started as part of DARPA EARS program
  • Metadata annotation scheme for MDE introduced by LDC (originally for English; we have extended it to Czech)
  • ULTIMATE GOAL of MDE:

Automatic conversion of raw speech recognition output to forms more useful to humans and downstream automatic processes

MDE Annotation Subtasks
  • Boundaries of syntactic/semantic units (SUs)
    • Statements, Interrogatives, Incompletes
    • Coordination breaks, Clausal breaks
  • Non-content words (fillers):
    • Filled pauses (FPs)
    • Discourse markers (DMs)
  • Speech disfluencies (edits):
    • Deletable regions (DelRegs), Interruption points, Explicit editing terms, Corrections

MDE Annotation Example

but I you know really [pre-]* uh prefer this form [of]* of um presentation/.
[she]* Sheila told me [on Tuesday]* no on Wednesday/, she didn’t/.
so let’s move on/, because we [don’t have]* uh don’t have time/.
well do you like [this]* this example/?

but I you know really pre- uh prefer this form of of um presentation she Sheila told me on Tuesday no on Wednesday she didn’t so let’s move on because we don’t have uh don’t have time well do you like this this example
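
The notation in the example can be read mechanically: `[...]*` encloses a deletable region followed by its interruption point, and `/.`, `/?`, `/,` mark SU boundaries. The following is a minimal sketch, not the authors' tooling, of how annotated text might be turned into a cleaned, sentence-segmented form. The `FILLERS` list is a hypothetical stand-in for the filler marking that the slide showed visually, and the naive string matching would also delete content uses of "you know".

```python
import re

# Two SUs from the slide's example, re-spaced, with straight apostrophes.
ANNOTATED = (
    "but I you know really [pre-]* uh prefer this form [of]* of um presentation/. "
    "so let's move on/, because we [don't have]* uh don't have time/."
)

# Hypothetical filler inventory (assumption): real annotation marks fillers per token.
FILLERS = ["you know", "uh", "um"]


def clean(annotated: str) -> list[str]:
    """Drop deletable regions and fillers, then split into SUs."""
    # Remove every deletable region together with its interruption-point marker.
    text = re.sub(r"\[[^\]]*\]\*", " ", annotated)
    # Remove fillers as whole words, longest first so multi-word fillers win.
    for filler in sorted(FILLERS, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(filler)}\b", " ", text)
    cleaned = []
    # "/." and "/?" are treated as SU-final symbols; "/," as an SU-internal break.
    for su in re.split(r"/[.?]", text):
        words = su.replace("/,", " ").split()
        if words:
            cleaned.append(" ".join(words))
    return cleaned


for su in clean(ANNOTATED):
    print(su)
```

Explicit editing terms such as the "no" in "[on Tuesday]* no on Wednesday" are not handled by this sketch; they would need their own annotation label.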

Goal of This Paper
  • Analyse and compare two Czech MDE corpora from different domains in terms of metadata statistics
  • Compare Czech Broadcast News (BN) vs. Broadcast Conversations (BC)
  • Also compare Czech and English MDE corpora – English Broadcast News and Conversational Telephone Speech (CTS)

Czech Broadcast News Data
  • News from 3 TV channels and 4 radio stations
  • Both public and commercial broadcast companies
  • Differing in presentation style
  • 26 hours of transcribed speech
  • ~ 300 speakers
  • Speech recordings and verbatim transcripts publicly available from LDC

Broadcast Conversation Data
  • 52 recordings of a Czech radio talk show – Radioforum
  • 24 hours of transcribed speech
  • ~ 100 speakers
  • 1-3 guests spontaneously answer questions asked by 1-2 interviewers
  • Mostly political debates
  • Currently being extended by an additional 20 recordings (~10 hours)

Statistics about Fillers
  • Filled pauses more frequent in Czech Broadcast Conversations (3.8% of words) than in News (0.5%)
  • English MDE: CTS – 2.2%, BN – 1.4%
  • Discourse markers also more frequent in Czech Conversations (1.6%) than in News (0.1%)
  • English MDE: CTS – 4.4%, BN – 0.5%
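
The figures above are rates relative to the number of words. As a rough illustration of that metric, the sketch below computes filler rates from a list of labeled tokens. The label names (`FP`, `DM`, `O`), the hand-assigned labels, and the choice to count each word of a multi-word discourse marker separately are all assumptions; the slides do not state how multi-word markers were counted.

```python
from collections import Counter


def filler_rates(tokens):
    """Return FP and DM rates as a percentage of all word tokens."""
    counts = Counter(label for _, label in tokens)
    total = len(tokens)
    if total == 0:
        return {"FP": 0.0, "DM": 0.0}
    return {label: 100.0 * counts[label] / total for label in ("FP", "DM")}


# Toy utterance taken from the annotation example; labels are hand-assigned.
TOKENS = [
    ("but", "O"), ("I", "O"), ("you", "DM"), ("know", "DM"), ("really", "O"),
    ("pre-", "O"), ("uh", "FP"), ("prefer", "O"), ("this", "O"), ("form", "O"),
    ("of", "O"), ("of", "O"), ("um", "FP"), ("presentation", "O"),
]

rates = filler_rates(TOKENS)
print({label: round(value, 1) for label, value in rates.items()})
```

A single short utterance gives far higher rates than the corpus-level figures, which are averaged over many hours of speech.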

Statistics about Edit Disfluencies
  • Deletable regions – 2.8% of words in Conversations and 0.2% in News
  • English MDE: 5.4% in CTS and 1.5% in BN
  • Percentage of disfluencies having a correction larger in News (94.6%) than in Conversations (83.8%)
  • Explicit editing terms rare in both corpora; they occur in just 4% of disfluencies

POS Analysis of Edit Disfluencies
  • Tagged the Czech corpora employing an automatic POS tagger
  • Czech uses structured tags with 15 positions; we only used the first position, distinguishing 10 basic POS

  • Computed and compared three POS distributions:
    • Whole corpus
    • Deletable regions only
    • Corrections only
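
The three distributions listed above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the tag strings are made-up 15-character examples written in the style of a positional tagset, and the region labels (`delreg`, `correction`, `other`) are hypothetical names for the annotation each token would carry.

```python
from collections import Counter


def pos_distribution(tags):
    """Relative frequency of the basic POS, i.e. the first tag position."""
    counts = Counter(tag[0] for tag in tags)
    total = sum(counts.values())
    return {pos: n / total for pos, n in counts.items()}


# (tag, region) pairs; tags are illustrative, regions are hypothetical labels.
TOKENS = [
    ("NNMS1-----A----", "other"),
    ("VB-S---3P-AA---", "other"),
    ("RR--4----------", "delreg"),
    ("RR--4----------", "correction"),
    ("NNFS4-----A----", "correction"),
    ("Db-------------", "other"),
]

whole = pos_distribution(tag for tag, _ in TOKENS)
delregs = pos_distribution(tag for tag, region in TOKENS if region == "delreg")
corrections = pos_distribution(tag for tag, region in TOKENS if region == "correction")

print("whole corpus:", whole)
print("deletable regions:", delregs)
print("corrections:", corrections)
```

Comparing the second and third dictionaries against the first shows which basic POS are over- or under-represented inside disfluencies.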

Statistics about SUs
  • Average SU length: Conversations (14.5 words) show longer SUs than News (13.0)
  • English BN (12.5) similar to Czech, but CTS shows much shorter SUs (7.0) than Broadcast Conversations
  • SU-internal breaks (clausal and coordination) more frequent in Conversations than in News (49% vs. 31% of all SU symbols)
  • Hence, complex and compound sentences are more common in spontaneous conversations than in prearranged news
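
Both SU statistics can be derived from the boundary symbols alone. The sketch below assumes, based on the annotation example earlier in the talk, that `/.` and `/?` close an SU and that `/,` is an SU-internal break; the full symbol inventory and whether words inside deletable regions count toward SU length are not stated in the slides, so the input here is already cleaned text.

```python
import re

SU_FINAL = {"/.", "/?"}   # assumed SU-final symbols
SU_INTERNAL = {"/,"}      # assumed SU-internal break symbol


def su_stats(annotated: str) -> dict:
    """Average SU length in words and share of SU-internal breaks."""
    # Split into boundary symbols and words, even when a symbol is glued to a word.
    tokens = re.findall(r"/[.?,]|[^\s/]+", annotated)
    lengths, current, n_internal = [], 0, 0
    for tok in tokens:
        if tok in SU_FINAL:
            lengths.append(current)
            current = 0
        elif tok in SU_INTERNAL:
            n_internal += 1
        else:
            current += 1
    # Words after the last SU-final symbol are ignored (unterminated SU).
    n_symbols = len(lengths) + n_internal
    return {
        "su_lengths": lengths,
        "avg_su_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "internal_break_pct": 100.0 * n_internal / n_symbols if n_symbols else 0.0,
    }


TEXT = (
    "she told me on Wednesday/, she didn't/. "
    "so let's move on/, because we don't have time/. "
    "well do you like this example/?"
)

stats = su_stats(TEXT)
print(stats)
```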

Summary
  • Broadcast Conversations contain significantly more fillers and disfluencies than News
  • Conversations also show longer SUs and contain a higher number of complex sentences than News
  • Deletable regions and corrections in both corpora show different POS distributions in comparison with the general POS distributions
  • We plan to make Czech MDE corpora publicly available
