
Speech Transcription for Broadcast Activities: The science, the art, and business realities



Presentation Transcript


  1. Speech Transcription for Broadcast Activities: The science, the art, and business realities Sara H. Basson Michael Picheny Bhuvana Ramabhadran IBM T.J. Watson Research Center

  2. Agenda • Captioning and Transcription: The need • The options • Automated speech transcription: state of the art • Is it ready for prime time? – samples from network transcripts • Quality control • Near-term solutions • The future

  3. Lack of Captioning and Transcription – The Problem • Proliferation of multimedia information • Audio: not always the medium of choice • Lack of captioning violates accessibility requirements • 22,000,000 Americans listed as deaf or hard of hearing • Aging users • US Federal Gov't: the 2001 amendment to Section 508 of the Rehabilitation Act mandates that information that federal agencies provide to the public or to their employees be accessible • Time for editing (= cost of captioning) decreases as speech recognition accuracy improves

  4. Transcription of Audio Material: It’s the Law Telecommunications Act of 1996: • 100% of new English-language programming must be captioned by 2006 • 100% of Spanish-language programming must be captioned by 2010

  5. Transcription Contrasted with Other Speech Recognition • Transcription: closed captioning, general dictation, call center data mining, government intelligence applications – unconstrained, conversational speech; large vocabulary; high resource; telephone, broadcast, speeches • Transaction: "For mortgage rates, say or press 1…", "Please say your tracking number…", name dialer – more constrained, more directed; large vocabulary; lower resource; telephone • Embedded: direction giving in car, spoken commands in car, phrase translation on a PDA – most constrained, most directed; smaller vocabulary; lowest resource; embedded in a device

  6. Audio requiring transcription/captioning • Webcasts • Podcasts • Television programming • Movies • Digitized lectures • e-Learning materials • Corporate training • Meetings • Conferences • Tourist information • Medical transcription • Legal transcription • Call center data = Strong accessibility requirement (user demand, and corporate/legal mandates)

  7. Speech Recognition Challenges Over Time (increasing complexity) • Connected Digit Sequences (TI Digits) • TIMIT Acoustic-Phonetic Continuous Speech Corpus • Broadcast News (BN) • Speech in Noisy Environments (SPINE) • Switchboard (SWB): telephone conversations (about 70 topics) • MALACH Corpus

  8. Progress in Base Technology Research • Base speech recognition technology has improved steadily over the last 15 years. • Current error rates are low enough for many practical applications. • Charts: Progress in Conversational Speech (NIST benchmarks, IBM Superhuman Speech Project, human performance on conversational telephony) and Progress in IBM Speech Products (IBM WebSphere Voice Server – telephony, IBM Embedded ViaVoice in car). • The NIST benchmark uses different test datasets each year, focusing on conversational speech. • Average error rates are reported for 10 simple tasks (digits, name dialing, etc.). • In-car tests are performed at several speed/noise levels.

  9. MALACH: A challenging speech corpus • Multimedia digital archive: 116,000 hours of interviews with over 52,000 survivors, liberators, rescuers and witnesses of the Nazi Holocaust, recorded in 32 languages. • Goal: improved access to large multilingual spoken archives • Challenges: • Frequent interruptions: "CHURCH TWO DAYS these were the people who were to go to march TO MARCH and your brother smuggled himself SMUGGLED IN IN IN IN" • Disfluencies: "A- a- a- a- band with on- our- on- our- arm" • Emotional speech: "young man they ripped his teeth and beard out they beat him"

  10. Named Entity Detection in Segmentation • 31 named entity tags: Person, Organization, Location, Country, Cardinal number, Ordinal number, Percentage, Money, Date, Duration, Age, Animal, Plant, Substance, Occupation, Disease, …
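To make concrete what tagging caption text with such labels involves, here is a minimal rule-based sketch in Python. The regex patterns, the handful of tags covered, and the function name are illustrative assumptions only; they are not the statistical tagger (or the full 31-tag inventory) the slide refers to.

```python
import re

# Illustrative-only named-entity tagger for caption text, covering a few of the
# tags listed above (Money, Percentage, Date, Cardinal number). Real systems
# use trained statistical models rather than hand-written regexes like these.
PATTERNS = [
    ("MONEY",      re.compile(r"\$\d[\d,]*(\.\d+)?")),
    ("PERCENTAGE", re.compile(r"\d+(\.\d+)?\s*(%|percent)")),
    ("DATE",       re.compile(r"\b(January|February|March|April|May|June|July|"
                              r"August|September|October|November|December)\s+\d{1,2}\b")),
    ("CARDINAL",   re.compile(r"\b\d+\b")),
]

def tag_entities(text):
    """Return (tag, matched_text, start, end) tuples, earliest match first."""
    spans = []
    for tag, pattern in PATTERNS:
        for m in pattern.finditer(text):
            # Skip spans already claimed by a higher-priority pattern.
            if not any(s <= m.start() < e for _, _, s, e in spans):
                spans.append((tag, m.group(), m.start(), m.end()))
    return sorted(spans, key=lambda span: span[2])

caption = "The shuttle touched down on July 5 after a 12 day mission costing $1.3 billion."
for tag, matched, start, end in tag_entities(caption):
    print(f"{tag:10s} {matched!r}")   # DATE 'July 5', CARDINAL '12', MONEY '$1.3'
```

The fixed priority ordering of the patterns stands in for the disambiguation a real tagger would learn from data.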

  11. Captioning audio: What are the options? • Options vs. issues (table) • Captioning and transcribing audio material: additional advantages • Text-based search vs. audio-based search • Reading text: faster than listening to the auditory equivalent • Second language learners • Individuals with certain learning disabilities

  12. Understandability… ASR vs. stenocaptioning: Manageable errors
ASR: a picture perfect landing for the space shuttle atlantis this morning the shuttle touched down at the kennedy space center in florida about six twenty one this morning IN ending a twelve day mission
TRUTH: a picture perfect landing for the space shuttle atlantis this morning the shuttle touched down at the kennedy space center in florida about six twenty one this morning ** ending a twelve day mission
ASR: since the diet drug combination FEN fen was pulled off the market some dieters **** been looking for something that would work as well we will see what's in the works
TRUTH: since the diet drug combination PHEN fen was pulled off the market some dieters HAVE been looking for something that would work as well we will see what's in the works

  13. Understandability… ASR vs. stenocaptioning: Distracting/confusing
ASR: ** TOOK IT makes a lot of FOLKS and also ** THAT e. mail volleys more than twice pick up the phone
TRUTH: O. K. THAT makes a lot of SENSE and also IF AN e. mail volleys more than twice pick up the phone
ASR: STAY connected through e. mail has become very common in a lot of homes IN on the job but ********* on how it's used it can be terrific FOR disastrous we will look at some e. mail problems THAT possible solutions
TRUTH: STAYING connected through e. mail has become very common in a lot of homes AND on the job but DEPENDING on how it's used it can be terrific OR disastrous we will look at some e. mail problems AND possible solutions
ASR: so they do not have to make their own interpretation makes a lot of THINGS another tip TO write an e. mail IS WHAT IT a news paper article in other words state the most pertinent information first we always say in the news business do not bury the lead
TRUTH: so they do not have to make their own interpretation makes a lot of SENSE another tip TOO write an e. mail AS YOU WOULD a news paper article in other words state the most pertinent information first we always say in the news business do not bury the lead
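The CAPITALIZED words and asterisks in these examples mark where the ASR hypothesis diverges from the reference (TRUTH) transcript. Below is a minimal sketch of how such an alignment, and the word error rate behind the accuracy figures quoted elsewhere in the talk, can be produced with standard Levenshtein (edit-distance) alignment; the function names and marking convention are illustrative, not the tooling actually used for these transcripts.

```python
# Minimal sketch: align an ASR hypothesis against a reference transcript with
# Levenshtein (edit-distance) alignment, then mark differences in the spirit of
# the slides above (capitals for wrong words, '**' for dropped words).

def align(ref, hyp):
    """Return (wer, ops), where ops is a list of (op, ref_word, hyp_word)."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match or substitution
    # Backtrace to recover the alignment operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    errors = sum(op != "ok" for op, _, _ in ops)
    return errors / max(n, 1), ops

def mark(op, ref_word, hyp_word):
    """Render one alignment op in the slides' convention."""
    if op == "ok":
        return hyp_word
    if op == "del":
        return "**"              # reference word the ASR dropped
    return hyp_word.upper()      # substitution or spurious insertion

ref = "some dieters have been looking for something that would work".split()
hyp = "some dieters been looking for something that would work".split()
wer, ops = align(ref, hyp)
print(f"WER: {wer:.0%}")                     # 1 deletion / 10 reference words = 10%
print(" ".join(mark(*o) for o in ops))       # some dieters ** been looking for ...
```

Substitutions and insertions come out in capitals and deletions as '**', mirroring the convention used on these two slides.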

  14. Text and punctuation

  15. Quality control for broadcast captioning • Headline, Thursday, July 05, 2007: "Closed Captions On Ohio TV: 24/7 Gibberish Dished To The Disabled"

  16. Quality control for Broadcast Captioning • Q: Do captions have to meet accuracy requirements, such as having only so many spelling errors per program? • A: At present, captions are not required to meet any particular quality or accuracy standards. The Federal Communications Commission concluded that program providers have incentives to offer high quality captions, in keeping with the overall quality of the programs they offer. The FCC also concluded that it would be difficult to develop and monitor quality standards at this time. However, viewers may let video providers know whether they are satisfied with the captions through purchases of advertised products, subscriptions to program services, or contacts with providers concerning the programs. The above information has been excerpted from the FCC guidelines and the Captioned Media Program of the National Association of the Deaf.

  17. Using ASR for captioning… incrementally… UK Media and re-speaking

  18. Using ASR for Broadcast Captioning… incrementally… Protitle Live System • Enables creation of subtitles in all major languages, using speech recognition • Functions: correction in real time; validation in real time • Timing: total cycle time between 2 and 7 seconds, 5 seconds on average • Economics: re-speaking costs about 1/10th that of a real-time stenographer

  19. Using ASR for Broadcast Captioning… incrementally… Real-time editing • Assume a speaker obtains 80 percent ASR accuracy when speaking at a rate of 150 words a minute • The editor then needs to correct 15 words a minute to increase the accuracy to 90 percent • By choosing the 15 most important errors to correct, the remaining 15 errors may not detract significantly from understanding • In classrooms in the UK and in other countries, disabled students have people taking notes for them who try to type or write much faster than 15 words/minute to record as much as possible. If, instead, the speaker used speech recognition, the note taker would only need to type the corrections • People can read four or more times faster than somebody speaks • Therefore it is possible to do 'something else' while reading words displayed at speaking speeds • Real-time editing can be separated into three activities: finding the error and highlighting it; entering the correction; replacing the error with the correction • Using foot pedals to move the highlight to the exact position and to trigger the replacement could leave the hands free for entering the corrections. Source: Professor M. Wald, Southampton University
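The arithmetic in the first two bullets is easy to make explicit. Here is a small sketch using the slide's numbers (150 words/minute, 80% ASR accuracy, 90% target); the linear error model and function name are assumptions for illustration, not a formula from the source.

```python
# Back-of-the-envelope editing load for real-time correction, using the figures
# from the slide: 150 words/minute at 80% ASR accuracy, with a 90% target.

def corrections_per_minute(speaking_rate_wpm, asr_accuracy, target_accuracy):
    """Words the editor must fix each minute to lift accuracy to the target."""
    errors = speaking_rate_wpm * (1.0 - asr_accuracy)        # 150 * 0.20 = 30
    residual = speaking_rate_wpm * (1.0 - target_accuracy)   # 150 * 0.10 = 15
    return max(errors - residual, 0.0)

rate, asr, target = 150, 0.80, 0.90
fixes = corrections_per_minute(rate, asr, target)
print(f"{rate * (1 - asr):.0f} errors/min, fix {fixes:.0f}/min to reach {target:.0%}")
# -> 30 errors/min, fix 15/min to reach 90%
# Typing 15 corrections a minute is a far lighter load than the roughly
# 150 words/minute needed to transcribe the speech verbatim.
```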

  20. Automated measures of accuracy Proposal from the WGBH National Center for Accessible Media (NCAM) • Use language-processing tools to develop an automated caption accuracy assessment system for real-time captions on live news programming • Can text-based data mining and speech-to-text technologies produce meaningful data about stenocaption accuracy? • Explore the capabilities of data-mining software agents to identify discrepancies between errors contained in stenocaption data sets and speech-to-text data sets, and to generate a caption accuracy analysis of the data set under review. Through these methods, the goal is to: • Improve the ability of the television community to monitor and maintain the quality of live captioning they offer to viewers who are deaf or hard of hearing • Ease the current burden on caption viewers to document and advocate for comprehensible captions.
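One conceivable starting point for the automated comparison NCAM proposes is a word-level diff between the stenocaption stream and an independent speech-to-text output, with any disagreeing region flagged for human review. The sketch below uses Python's standard difflib; the workflow, names, and example text are assumptions, not NCAM's actual system.

```python
import difflib

# Sketch: flag regions where a stenocaption stream and an independent ASR
# transcript disagree. Agreement proves little on its own, but disagreement
# marks spans worth auditing for caption accuracy. Purely illustrative.

def discrepancy_regions(stenocaption, asr_transcript):
    """Yield (caption_words, asr_words) for every non-matching aligned region."""
    steno = stenocaption.lower().split()
    asr = asr_transcript.lower().split()
    matcher = difflib.SequenceMatcher(a=steno, b=asr, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            yield " ".join(steno[i1:i2]), " ".join(asr[j1:j2])

steno = "the shuttle touched down at the kennedy space center about six twenty one"
asr = "the shuttle touched down at the kennedy space center about six twenty one in"
for caption_span, asr_span in discrepancy_regions(steno, asr):
    print(f"caption: {caption_span!r}  asr: {asr_span!r}")
# -> caption: ''  asr: 'in'   (only the disagreeing span is reported)
```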

  21. Future vision… • Automatic Speech Transcription for less regulated arenas • Captioning podcasts, lectures, meetings, presentations… • Easier tools to modify and customize • Easier and more cost-effective mechanisms to deliver • Understanding quality control issues: what is accuracy, and what is the cost of an error? • Back-up options • More pervasive usage → higher quality deliverables
