Quick Rich Transcriptions of Arabic Broadcast News Speech Data



Quick Rich Transcriptions of Arabic Broadcast News Speech Data. Niklas Paulsson, Djamel Mostefa, Chomicha Bendahman ELDA (Evaluations and Language resources Distribution Agency) Meghan Glenn, Stephanie Strassel LDC (Linguistic Data Consortium). Overview Transcription method Sources

Presentation Transcript

Quick Rich Transcriptions of

Arabic Broadcast News Speech Data

Niklas Paulsson, Djamel Mostefa, Chomicha Bendahman

ELDA (Evaluations and Language resources Distribution Agency)

Meghan Glenn, Stephanie Strassel

LDC (Linguistic Data Consortium)

outline
Overview

Transcription method

Sources

Collection

Selection

Transcription

Segmentation, Sentence Units, Overlapping Speech, Markup

Quality Control

Conclusion

Outline
overview
Broadcast News Transcripts

Arabic (MSA, MCA)

Sources: radio + TV, mostly Middle East

Verbatim orthographic transcripts, time-aligned, minimal mark-up

QRTR to reduce time

Overview
transcription method 1
QRTR – Quick Rich Transcription

(QTR / QRTR / CTR)

Amount of detail in markup

Number of features identified

Degree of accuracy

Completeness

Amount of time

Number of quality checks

Transcription Method (1)
sources 1
Two types of recordings:

Broadcast news (BN): talking head style news reports

Broadcast conversation (BC): more interactive, talk shows, interviews, call-in programs, roundtable discussions

Mainly MSA from Middle East

MCA from North Africa and Middle East

Overlapping speech

30 – 60 minutes of recordings

collected from TV and Radio sources

Sources (1)
collection
Sources recorded from satellite

Daily and weekly recordings

Records video stream

Audio extracted from video

Saved in WAV or SPH

16 bits, 16 kHz

Collection
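
The audio format stated above (16 bits, 16 kHz) can be checked programmatically. The following is a minimal sketch using Python's standard `wave` module; the function name `matches_collection_format` is hypothetical, and the check covers WAV files only, since the `wave` module does not read SPH files.

```python
import wave


def matches_collection_format(path, sample_width_bytes=2, sample_rate_hz=16000):
    """Return True if the WAV file at `path` is 16-bit audio sampled at 16 kHz.

    Hypothetical helper: the slides state the target format but do not
    describe how (or whether) it was verified.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getsampwidth() == sample_width_bytes  # 2 bytes = 16 bits
            and wav.getframerate() == sample_rate_hz
        )
```

A check of this kind could run before the manual audit so that auditors only listen to files that already meet the technical format.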
audit
Manual audit of all programs

Procedure:

Listen to 30 sec samples of 3 sections: beginning, middle, end

Auditors can listen to additional segments if necessary

Auditor fills in an audit form for each recording

Web-based auditing interface

Checks:

Is there a recording?

Is the audio quality ok?

What is the language?

Is it speech from the right program?

What is the data type?

What is the topic?

Audit
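
The sampling step of the audit procedure (30-second samples from the beginning, middle and end of a recording) can be sketched as follows. This is an illustration only: the function name `audit_windows` and the choice to centre the middle sample on the midpoint of the recording are assumptions, not details given in the slides.

```python
def audit_windows(duration_s, sample_s=30.0):
    """Return (start, end) times in seconds for three audit samples.

    The samples sit at the beginning, middle and end of the recording.
    Centring the middle sample on the midpoint is an assumption.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # A recording shorter than one sample is audited in full.
    sample_s = min(sample_s, duration_s)
    starts = [0.0, (duration_s - sample_s) / 2.0, duration_s - sample_s]
    return [(start, start + sample_s) for start in starts]
```

For a 30-minute recording (1800 seconds), this yields samples at 0–30 s, 885–915 s and 1770–1800 s.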
selection
Recordings rejected:

poor quality

wrong language

Passed audit: eligible for transcription

Criteria based on:

data amount

sources

dates

2000 hours in 24 sets

Sent for transcription in packages of 20 – 300 hours

Period: Apr. 2004 – Aug. 2007

Selection
transcription
Orthographic, verbatim transcripts

Arabic script

No vowels

Segmented and time stamped

Speaker names

Sentence Units

Noise markers

Overlapping speech

Foreign language markup

Transcription
xtrans 1
Tool for broadcast news and conversation

Multi-lingual (UTF-8)

Multi-platform (Windows, Linux, FreeBSD)

Output in TDF format

Compatible with Transcriber format

XTrans (1)
segmentation 1
Segment data into sections:

Speech delimited by pause or silence

Non speech sections: music, silence, ads, etc

Each section lasts 5 – 20 seconds

Sections are classified as:

News report (BN)

Conversation (BC)

Non-news

Sections are next grouped into speaker turns

Single speaker or overlapping turn

Sentence Units (SU)

Speaker ID or name for each turn

Segmentation (1)
sentence units
Group utterances into clusters of words

Each cluster represents a sentence-like unit

Each unit receives a label:

Statement

Question

Incomplete

Non-Speech

Sentence Units
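
One way to picture the four labels above is as a small data model. The sketch below is hypothetical: the class and field names (`SUType`, `SentenceUnit`, `start_s`, `end_s`) are chosen for illustration and do not reflect the actual transcript file layout, which the slides do not specify.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class SUType(Enum):
    """The four sentence unit labels listed on the slide."""
    STATEMENT = "statement"
    QUESTION = "question"
    INCOMPLETE = "incomplete"
    NON_SPEECH = "non-speech"


@dataclass(frozen=True)
class SentenceUnit:
    """A time-stamped, labelled sentence-like unit (illustrative fields)."""
    start_s: float
    end_s: float
    speaker: str
    text: str
    su_type: SUType

    def __post_init__(self):
        if self.end_s <= self.start_s:
            raise ValueError("end_s must be greater than start_s")


def count_by_type(units):
    """Count how many units carry each label."""
    return Counter(unit.su_type for unit in units)
```

A tally such as `count_by_type(units)` gives a quick view of how a transcript splits across statements, questions, incomplete units and non-speech.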
overlapping speech
Many recordings include conversations

Portions of speech that are overlapping

Segmented and annotated

No SU type

Could be quite challenging

Difficult portions annotated as non-speech

Overlapping Speech
markup 1
Minimal set of markers:

Hesitations

Truncated words

Mispronunciations

Made up words

Noise

Difficult speech

Markup (1)
markup 2
Noise markers:

Background noise

Speaker noise: laugh, cough, sneeze, lipsmack

Dialect / language markup:

Non-MSA (MCA)

English

French

Foreign Language

Markup (2)
quality control
Limited quality control due to time constraints

Quick Verification procedure

Max 18 min / file

Focus:

Transcription matches speech

Segmentation

Speaker names

Orthography

Procedure:

Checks 3 segments of 3 min: beginning, middle and end

Transcriptions that did not pass: sent back to transcribers

Quality Control
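
The quick verification step (three 3-minute portions at the beginning, middle and end of a file) could be supported by a helper that picks out the transcript segments to review. This is a sketch under assumptions: the function name `qc_segments`, the `(start_s, end_s, text)` tuple layout and the centred middle window are all illustrative and not taken from the slides.

```python
def qc_segments(segments, duration_s, window_s=180.0):
    """Select segments that overlap the three quality-control windows.

    `segments` is a list of (start_s, end_s, text) tuples (assumed layout).
    The windows cover the beginning, middle and end of the file.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    window_s = min(window_s, duration_s)
    starts = [0.0, (duration_s - window_s) / 2.0, duration_s - window_s]
    windows = [(start, start + window_s) for start in starts]
    return [
        seg
        for seg in segments
        # Keep a segment if it overlaps any window, even partially.
        if any(seg[0] < w_end and seg[1] > w_start for w_start, w_end in windows)
    ]
```

Three windows of 3 minutes cover 9 minutes of audio, so if the 18-minute limit refers to reviewer time, it allows roughly twice real time for checking.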
conclusion
Arabic broadcast data

>2000 hours transcribed

330k words

Useful for quantitative manual transcripts

Limited timeframe

Minimal but useful markup

Quality control

Training ASR systems for MT

Conclusion