Quick Rich Transcriptions of Arabic Broadcast News Speech Data

Niklas Paulsson, Djamel Mostefa, Chomicha Bendahman
ELDA (Evaluations and Language resources Distribution Agency)
Meghan Glenn, Stephanie Strassel
LDC (Linguistic Data Consortium)

Outline
Overview
Transcription method
Sources
Collection
Selection
Transcription
  Segmentation, Sentence Units, Overlapping Speech, Markup
Quality Control
Conclusion

Overview
Broadcast news transcripts
Arabic (MSA, MCA)
Sources: radio and TV, mostly from the Middle East
Verbatim orthographic transcripts, time-aligned, with minimal markup
QRTR used to reduce transcription time

Transcription Method (1)
QRTR: Quick Rich Transcription
Transcription methods (QTR / QRTR / CTR) differ in:
  Amount of detail in markup
  Number of features identified
  Degree of accuracy
  Completeness
  Amount of time
  Number of quality checks

Transcription Method (2)

Sources (1)
Two types of recordings:
  Broadcast news (BN): talking-head style news reports
  Broadcast conversation (BC): more interactive; talk shows, interviews, call-in programs, roundtable discussions
Mainly MSA from the Middle East
MCA from North Africa and the Middle East
Overlapping speech
30 – 60 minutes of recordings collected from TV and radio sources

Sources (2)

Collection
Sources recorded from satellite
Daily and weekly recordings
Video stream is recorded
Audio is extracted from the video
Saved in WAV or SPH format, 16 bits, 16 kHz
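A minimal sketch of how a collected WAV file might be checked against the stated format (16 bits, 16 kHz), using only Python's standard `wave` module. The function name `matches_collection_format` is hypothetical and not part of the authors' pipeline; SPH (NIST SPHERE) files are not readable by `wave` and would need a separate tool.

```python
import wave


def matches_collection_format(path, sample_rate=16000, sample_width_bytes=2):
    """Return True if the WAV file at `path` has the expected sample rate
    and sample width (2 bytes = 16 bits). Hypothetical helper for illustration."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == sample_rate
            and wav.getsampwidth() == sample_width_bytes
        )
```

A call such as `matches_collection_format("recording.wav")` (the file name is a placeholder) would return `False` for, say, an 8 kHz file, which could then be flagged before auditing.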

Audit
Manual audit of all programs
Procedure:
  Auditors listen to 30-second samples from 3 sections: beginning, middle, end
  Auditors can listen to additional segments if necessary
  Auditors fill in a form for each audited recording
  Web-based auditing interface
Checks:
  Is there a recording?
  Is the audio quality OK?
  What is the language?
  Is it speech from the right program?
  What is the data type?
  What is the topic?
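The sampling step of the audit can be sketched as follows. This is an illustration only: the slides say that 30-second samples are taken from the beginning, middle and end of a recording, but the exact placement of each sample (here, flush with the start, centred, and flush with the end) is an assumption, and `audit_windows` is a hypothetical name.

```python
def audit_windows(duration_s, sample_s=30.0):
    """Return three (start, end) windows in seconds: beginning, middle, end.

    Placement is an assumption: the first window starts at 0, the middle
    window is centred in the recording, and the last window ends at the
    end of the recording.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    # A recording shorter than one sample is covered by a single full-length window.
    sample_s = min(sample_s, duration_s)
    starts = [0.0, (duration_s - sample_s) / 2.0, duration_s - sample_s]
    return [(start, start + sample_s) for start in starts]


# A 30-minute (1800 s) recording:
print(audit_windows(1800.0))
# [(0.0, 30.0), (885.0, 915.0), (1770.0, 1800.0)]
```

Called with `sample_s=180.0`, the same helper would yield three 3-minute windows, matching the shape of the quick verification described under Quality Control.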

Selection
Recordings rejected for:
  Poor quality
  Wrong language
Recordings that passed the audit: eligible for transcription
Selection criteria based on:
  Data amount
  Sources
  Dates
2000 hours in 24 sets
Sent for transcription in packages of 20 – 300 hours
Period: Apr. 2004 – Aug. 2007

Transcription
Orthographic, verbatim transcripts
Arabic script
No vowels
Segmented and time-stamped
Speaker names
Sentence Units
Noise markers
Overlapping speech
Foreign language markup

XTrans (1)
Tool for broadcast news and conversation
Multilingual (UTF-8)
Multi-platform (Windows, Linux, FreeBSD)
Output in TDF format
Compatible with Transcriber format

XTrans (2)

Segmentation (1)
Segment data into sections:
  Speech delimited by pause or silence
  Non-speech sections: music, silence, ads, etc.
  Each lasts 5 – 20 seconds
Sections are classified as:
  News report (BN)
  Conversation (BC)
  Non-news
Sections are then grouped into speaker turns
  Single speaker or overlapping turn
  Sentence Units (SU)
  Speaker ID or name for each turn

Sentence Units
Group utterances into clusters of words
Each cluster represents a sentence-like unit
Each unit receives a label:
  Statement
  Question
  Incomplete
  Non-Speech

Overlapping Speech
Many recordings include conversations
Portions of speech that are overlapping
  Segmented and annotated
  No SU type
Could be quite challenging
Difficult portions annotated as non-speech
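The segmentation and labelling scheme from the preceding slides could be represented roughly as below. This is a sketch under stated assumptions, not the XTrans data model or the TDF layout: the class and field names (`SUType`, `Segment`, `count_by_type`) are hypothetical. The four labels are the ones listed on the Sentence Units slide, and `su_type=None` stands for overlapping speech, which receives no SU type.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SUType(Enum):
    """The four sentence-unit labels named on the slides."""
    STATEMENT = "statement"
    QUESTION = "question"
    INCOMPLETE = "incomplete"
    NON_SPEECH = "non-speech"


@dataclass
class Segment:
    """One time-stamped segment (hypothetical structure for illustration)."""
    start: float               # seconds from the start of the recording
    end: float                 # seconds from the start of the recording
    speaker: str               # speaker ID or name
    text: str                  # orthographic transcript of the segment
    su_type: Optional[SUType]  # None for overlapping speech (no SU type)

    def __post_init__(self):
        if self.end <= self.start:
            raise ValueError("segment end must be after start")


def count_by_type(segments):
    """Count segments per SU label; overlapping speech is counted under None."""
    counts = {}
    for seg in segments:
        counts[seg.su_type] = counts.get(seg.su_type, 0) + 1
    return counts
```

Keeping the label optional rather than adding a fifth "overlap" value reflects the slide's wording that overlapping portions are segmented and annotated but carry no SU type.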

Markup (1)
Minimal set of markers:
  Hesitations
  Truncated words
  Mispronunciations
  Made-up words
  Noise
  Difficult speech

Markup (2)
Noise markers:
  Background noise
  Speaker noise: laugh, cough, sneeze, lipsmack
Dialect / language markup:
  Non-MSA (MCA)
  English
  French
  Foreign language

Quality Control
Limited quality control due to time constraints
Quick verification procedure
Max 18 min / file
Focus:
  Transcription matches speech
  Segmentation
  Speaker names
  Orthography
Procedure:
  Check 3 segments of 3 min each: beginning, middle and end
  Transcriptions that did not pass are sent back to the transcribers

Conclusion
Arabic broadcast data
>2000 hours transcribed
330k words
Useful for producing manual transcripts in quantity
Limited timeframe
Minimal but useful markup
Quality control
Training ASR systems for MT

Thanks for your attention
