Quick Rich Transcriptions of Arabic Broadcast News Speech Data

Niklas Paulsson, Djamel Mostefa, Chomicha Bendahman

ELDA (Evaluations and Language resources Distribution Agency)

Meghan Glenn, Stephanie Strassel

LDC (Linguistic Data Consortium)


Outline

Overview

Transcription method

Sources

Collection

Selection

Transcription

Segmentation, Sentence Units, Overlapping Speech, Markup

Quality Control

Conclusion


Overview

Broadcast News Transcripts

Arabic: Modern Standard Arabic (MSA) and Modern Colloquial Arabic (MCA)

Sources: radio + TV, mostly Middle East

Verbatim orthographic transcripts, time-aligned, minimal mark-up

QRTR method used to reduce transcription time


Transcription method 1

QRTR – Quick Rich Transcription

Three methods (QTR: quick / QRTR: quick rich / CTR: careful), which differ in:

Amount of detail in markup

Number of features identified

Degree of accuracy

Completeness

Amount of time

Number of quality checks


Transcription method 2


Sources 1

Two types of recordings:

Broadcast news (BN): talking head style news reports

Broadcast conversation (BC): more interactive, talk shows, interviews, call-in programs, roundtable discussions

Mainly MSA from Middle East

MCA from North Africa and Middle East

Overlapping speech

30 – 60 minutes of recordings

collected from TV and Radio sources


Sources 2


Collection

Sources recorded from satellite

Daily and weekly recordings

Records video stream

Audio extracted from video

Saved in WAV or SPH

16 bits, 16 kHz
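
As an illustration of the audio format above, here is a minimal sketch that checks whether a WAV file is 16-bit and sampled at 16 kHz, using Python's standard `wave` module. The function names and the mono test signal are assumptions made for this example and are not part of the collection pipeline described in the slides; SPH files are not handled.

```python
import io
import wave


def matches_collection_format(wav_file, sample_width_bytes=2, sample_rate_hz=16000):
    """Return True if the WAV data is 16-bit (2 bytes per sample) at 16 kHz."""
    with wave.open(wav_file, "rb") as wav:
        return (wav.getsampwidth() == sample_width_bytes
                and wav.getframerate() == sample_rate_hz)


def make_silent_wav(sample_rate_hz, sample_width_bytes, seconds=1):
    """Build an in-memory mono WAV of silence (hypothetical test helper)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)  # assumption: mono; the slides do not state a channel count
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00" * sample_rate_hz * sample_width_bytes * seconds)
    buffer.seek(0)
    return buffer
```

`matches_collection_format` accepts either a file path or an open binary file object, so it can be pointed at a file on disk in the same way.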


Audit

Manual audit of all programs

Procedure:

Listen to 30 sec samples of 3 sections: beginning, middle, end

Auditors can listen to additional segments if necessary

Auditors fill in an audit form for each recording

Web-based auditing interface

Checks:

Is there a recording?

Is the audio quality ok?

What is the language?

Is it speech from the right program?

What is the data type?

What is the topic?
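
The sampling step of the procedure (30-second samples from the beginning, middle and end) can be sketched as follows. This is only an illustration: the function name and the handling of very short recordings are assumptions, not a description of the web-based auditing interface.

```python
def audit_windows(duration_s, sample_s=30.0):
    """Return (start, end) times in seconds for the three audit samples.

    Windows may overlap when the recording is shorter than three samples.
    """
    if duration_s <= sample_s:
        # Assumption: a recording shorter than one sample is audited in full.
        return [(0.0, duration_s)]
    middle_start = (duration_s - sample_s) / 2
    return [
        (0.0, sample_s),                          # beginning
        (middle_start, middle_start + sample_s),  # middle
        (duration_s - sample_s, duration_s),      # end
    ]
```

For a 30-minute recording (1800 seconds), the samples fall at 0–30 s, 885–915 s and 1770–1800 s.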


Selection

Recordings rejected:

poor quality

wrong language

Passed audit: eligible for transcription

Criteria based on:

data amount

sources

dates

2000 hours in 24 sets

Sent for transcription in packages of 20 – 300 hours

Period: Apr. 2004 – Aug. 2007
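
The rejection rule on this slide can be sketched as a simple filter over audit results. The record layout below (`AuditResult` and its field names) is hypothetical and introduced only for this example; the slides do not describe how audit forms are stored, and the further criteria (data amount, sources, dates) are left out.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    """Hypothetical audit record; field names are assumptions."""
    program: str
    has_recording: bool
    audio_quality_ok: bool
    language: str


def eligible_for_transcription(result, expected_language="Arabic"):
    """A recording passes if it exists, sounds acceptable and is in the expected language."""
    return (result.has_recording
            and result.audio_quality_ok
            and result.language == expected_language)


def select_eligible(results):
    """Keep only the recordings that passed the audit."""
    return [r for r in results if eligible_for_transcription(r)]
```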


Transcription

Orthographic, verbatim transcripts

Arabic script

No vowels

Segmented and time stamped

Speaker names

Sentence Units

Noise markers

Overlapping speech

Foreign language markup


XTrans 1

Tool for broadcast news and conversation

Multi-lingual (UTF-8)

Multi-platform (Windows, Linux, FreeBSD)

Output in TDF format

Compatible with Transcriber format


XTrans 2


Segmentation 1

Segment data into sections:

Speech delimited by pause or silence

Non speech sections: music, silence, ads, etc

Lasts 5 – 20 seconds

Sections are classified as:

News report (BN)

Conversation (BC)

Non-news

Sections are next grouped into speaker turns

Single speaker or overlapping turn

Sentence Units (SU)

Speaker ID or name for each turn


Sentence units

Group utterances into clusters of words

Each cluster represents a sentence-like unit

Each unit receives a label:

Statement

Question

Incomplete

Non-Speech
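
One way to picture a labelled sentence unit is as a small record carrying its time span, speaker, text and one of the four labels above. The sketch below is an assumption made for illustration: the class and field names are hypothetical and do not reproduce the TDF layout written by XTrans.

```python
from dataclasses import dataclass
from enum import Enum


class SUType(Enum):
    """The four sentence-unit labels listed on the slide."""
    STATEMENT = "statement"
    QUESTION = "question"
    INCOMPLETE = "incomplete"
    NON_SPEECH = "non-speech"


@dataclass
class SentenceUnit:
    """Hypothetical in-memory representation of one time-stamped unit."""
    start_s: float
    end_s: float
    speaker: str
    text: str
    su_type: SUType

    def __post_init__(self):
        # A unit must cover a positive stretch of audio.
        if self.end_s <= self.start_s:
            raise ValueError("end_s must be greater than start_s")
```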


Overlapping speech

Many recordings include conversations

Portions of speech that are overlapping

Segmented and annotated

No SU type

Could be quite challenging

Difficult portions annotated as non-speech
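
To make the notion of overlapping portions concrete, here is a minimal sketch that finds the time regions where segments from two different speakers overlap. The `(start, end, speaker)` tuple layout and the function name are assumptions for this example; in the project itself, overlap was identified and annotated manually by transcribers.

```python
def overlapping_regions(segments):
    """Find regions where two different speakers talk at the same time.

    segments: list of (start_s, end_s, speaker) tuples.
    Returns a list of (overlap_start, overlap_end, (speaker_a, speaker_b)).
    """
    overlaps = []
    ordered = sorted(segments)  # sort by start time
    for i, (start_a, end_a, spk_a) in enumerate(ordered):
        for start_b, end_b, spk_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # later segments start even later, so none can overlap
            if spk_a != spk_b:
                overlaps.append((start_b, min(end_a, end_b), (spk_a, spk_b)))
    return overlaps
```

For example, segments `(0.0, 5.0, "A")` and `(4.0, 8.0, "B")` produce one overlapping region from 4.0 to 5.0 seconds.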


Markup 1

Minimal set of markers:

Hesitations

Truncated words

Mispronunciations

Made up words

Noise

Difficult speech


Markup 2

Noise markers:

Background noise

Speaker noise: laugh, cough, sneeze, lipsmack

Dialect / language markup:

Non-MSA (MCA)

English

French

Foreign Language


Quality control

Limited quality control due to time constraints

Quick Verification procedure

Max 18 min / file

Focus:

Transcription matches speech

Segmentation

Speaker names

Orthography

Procedure:

Checks 3 segments of 3 min: beginning, middle and end

Transcriptions that did not pass: sent back to transcribers


Conclusion

Arabic broadcast data

>2000 hours transcribed

330k words

Useful for quantitative manual transcripts

Limited timeframe

Minimal but useful markup

Quality control

Training ASR systems for MT


Thanks for your attention