DUTIE Speech: Determining Utility Thresholds for Information Extraction from Speech

John Makhoul, Rich Schwartz, Alex Baron, Ivan Bulyko, Long Nguyen, Lance Ramshaw, Dave Stallard, Bing Xiang

Objective • Estimate speech recognition accuracy required to support utility in the form of question answering (QA) • Follow-on to earlier DUTIE study from text • Entities and relations extracted into database, which was used by human subjects for QA task • Measured human QA performance as function of information extraction (IE) scores • Extension to speech recognition • Measure effect of speech recognition error on IE scores • Assume same relation between IE scores and QA, infer effect of speech recognition on QA performance

Original DUTIE Study with Text Input • Databases of fully automated IE and manual annotation • Populated with entities, relations, co-reference links • 946 articles • Two databases were blended to produce a continuum of database qualities, as measured by • Entity Value Score (EVS) • Relation Value Score (RVS) • For each database, measured human performance • QA performance • Time taken to answer each question in seconds

DUTIE Results • Need to cut the IE error rate roughly in half to achieve 70% QA performance

Relative QA Performance vs. EVS • Same results, just scaled by QA with perfect IE scores

DUTIE Speech Corpus • The DUTIE speech corpus consists of 946 articles with 34.7 hours of audio data in total • Same articles as in the original DUTIE study • 15.5 hours of TDT broadcast news data • ABC, CNN, PRI, VOA (Jan. 1998 ~ June 1998, Oct. 2000 ~ Dec. 2000) • MNB, NBC (Oct. 2000 ~ Dec. 2000) • 19.2 hours of Newswire read speech recorded at LDC • APW, NYT (Feb. 1998 ~ June 1998, Oct. 2000 ~ Dec. 2000)

DUTIE Speech Process • Speech Recognition • Takes audio; outputs text in SNOR format • Run at four different levels of accuracy • Punctuation • Takes recognition output; adds periods/commas • Two methods: Forced alignment vs. automatic punctuation • Information Extraction (IE) • Takes punctuated text and finds entities and relations • Produces ACE Program Format (APF) XML • Scoring IE • Compares test and reference APFs and computes Entity Value Score and Relation Value Score

Block Diagram

Speech Recognition • Four systems to produce a range of word error rates • System I: BBN RT04 stand-alone 10xRT system, with heavily-weighted DUTIE text in language model training (cheating) • System II: BBN RT04 stand-alone 10xRT system, with normally-weighted DUTIE text in language model training (some cheating) • System III: BBN RT02 system (fair) • System IV: BBN RT02 system, with decreased grammar weight in decoding (degraded)

Sentence Boundary Detection Model • Sentence boundaries include periods, question marks, exclamation points • Use a 3-gram LM to compute the probability of a sentence boundary at each word position [Stolcke 1996] • Training data • TDT3 closed captions (12M words) • HUB4 transcripts (120M words) • Gigaword News articles from 2000 (100M words) • Use Viterbi to find the most likely sequence of tags
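The Viterbi search over boundary tags can be sketched as follows. This is a minimal illustration, not the BBN system: the function name `viterbi_boundaries`, the `<s>` boundary token, and the `logp(context, token)` interface are assumptions made here, and `toy_logp` is a hypothetical, unnormalized stand-in for a 3-gram LM trained on the corpora listed above. The search keeps one best hypothesis per two-token history, which is the recombination step that makes it Viterbi rather than exhaustive search.

```python
import math

BOUNDARY = "<s>"  # hypothetical token standing for a sentence boundary


def viterbi_boundaries(words, logp):
    """Return the indices i (> 0) where a boundary is placed before words[i].

    logp(context, token) must return the log probability of `token` given
    the two preceding tokens `context` (a 3-gram model).
    """
    if not words:
        return []
    # Map each two-token history to its best (log score, boundary indices).
    best = {(BOUNDARY, words[0]): (logp((BOUNDARY, BOUNDARY), words[0]), [])}
    for i in range(1, len(words)):
        word = words[i]
        new_best = {}
        for (t2, t1), (score, cuts) in best.items():
            options = [
                # No boundary: the word follows the previous two tokens.
                ((t1, word), score + logp((t2, t1), word), cuts),
                # Boundary: emit the boundary token, then the word.
                ((BOUNDARY, word),
                 score + logp((t2, t1), BOUNDARY) + logp((t1, BOUNDARY), word),
                 cuts + [i]),
            ]
            for state, new_score, new_cuts in options:
                # Recombine: keep only the best hypothesis per history.
                if state not in new_best or new_score > new_best[state][0]:
                    new_best[state] = (new_score, new_cuts)
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


def toy_logp(context, token):
    """Hypothetical toy scores: a boundary is likely only after 'tea'."""
    if token == BOUNDARY:
        return math.log(0.9 if context[1] == "tea" else 0.01)
    # A word directly after 'tea' (no boundary in between) is unlikely.
    return math.log(0.01 if context[1] == "tea" else 0.1)


print(viterbi_boundaries("i like tea you like coffee".split(), toy_logp))
```

With a real LM, `logp` would wrap smoothed 3-gram lookups; only the search logic is shown here.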

Automatic Punctuation Results • 3-gram word LM gives near-state-of-the-art period error rate (state-of-the-art is 60% as reported at RT-04) • Punctuation performance is sensitive to WER (in part due to LM being trained on errorless text) • Further improvements possible with new models or prosodic features

Reference Punctuation • Tokenize reference into words labeled with punctuation triplets: 1) punctuation attached to beginning of word, 2) punctuation attached to end of word, 3) unattached punctuation (e.g. hyphens) to right of word • Align reference and hypothesis words • Attach each reference word’s punctuation to the hypothesis word it is aligned to
Ref text: Hello, I’m looking for a size ten shoe. I prefer black, and don’t care about price.
ASR out: JELLO I’M LOOKING FOR * SHOE I PREFER * AND DON’T CARE ABOUT PRICE
Output: JELLO, I’M LOOKING FOR SHOE. I PREFER, AND DON’T CARE ABOUT PRICE.
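The align-and-attach procedure can be sketched in a few lines. This is an illustration under stated assumptions, not the original tool: `difflib.SequenceMatcher` stands in for whatever word aligner was actually used, the names `split_punct` and `transfer_punctuation` are invented here, unattached punctuation (case 3) is not handled, and the rule that a deleted reference word passes its trailing punctuation to the preceding hypothesis word is inferred from the slide's example ("black," is deleted, yet the output reads "I PREFER,").

```python
import re
from difflib import SequenceMatcher


def split_punct(token):
    """Split a token into (leading punctuation, core word, trailing punctuation)."""
    return re.match(r"^(\W*)(.*?)(\W*)$", token).groups()


def transfer_punctuation(ref_text, hyp_text):
    """Copy punctuation from reference words onto the aligned hypothesis words."""
    ref = [split_punct(tok) for tok in ref_text.split()]
    hyp = hyp_text.split()
    lead = [""] * len(hyp)
    trail = [""] * len(hyp)
    matcher = SequenceMatcher(
        a=[core.upper() for _, core, _ in ref],
        b=[word.upper() for word in hyp],
        autojunk=False,
    )
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            continue  # hypothesis-only words receive no punctuation
        for k in range(i1, i2):
            ref_lead, _, ref_trail = ref[k]
            if tag == "delete":
                # Assumption: a deleted word hands its trailing punctuation
                # to the preceding hypothesis word, if there is one.
                if j1 > 0:
                    trail[j1 - 1] += ref_trail
                continue
            # "equal" or "replace": pair words positionally within the block.
            j = min(j1 + (k - i1), j2 - 1)
            lead[j] += ref_lead
            trail[j] += ref_trail
    return " ".join(l + w + t for l, w, t in zip(lead, hyp, trail))


ref = "Hello, I'm looking for a size ten shoe. I prefer black, and don't care about price."
asr = "JELLO I'M LOOKING FOR SHOE I PREFER AND DON'T CARE ABOUT PRICE"
print(transfer_punctuation(ref, asr))
# JELLO, I'M LOOKING FOR SHOE. I PREFER, AND DON'T CARE ABOUT PRICE.
```

Note that `SequenceMatcher` finds longest matching blocks rather than a true minimum-edit-distance alignment; the two agree on this example but can differ on noisier output.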

Information Extraction • Finds entities and relations between them • Identifies entities by character offset interval in the input text file • Character offset is defined literally: All whitespace and punctuation is included! • Produces ACE Program Format (APF) XML expression
Input text: MOSCOW (AP) _ Presidents Leonid Kuchma of Ukraine and Boris Yeltsin of Russia signed an economic cooperation plan Friday ``We have covered the entire list of questions and discussed how we will be tackling them,'' Yeltsin was quoted as saying .
APF output: ….. <entity ID="2" TYPE="GPE" SUBTYPE="Other"> <entity_mention TYPE="NAM" ID="104-1"> <extent> <charseq START="75" END="82"></charseq> </extent> </entity_mention> </entity> ….

Scoring IE, Part I • IE scoring program compares the character offset intervals of entities in reference and test APFs • Requires 30% overlap • Problem #1: Character offsets in reference APFs reflect all whitespace formatting in original text file • But recognizer output will have different character offsets, so offsets will be wrong • Solution • Align words in reference and test • Based on this alignment, compute character offset mapping between reference and test • Change character positions in test APF using mapping • Compute IE scores
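The offset-mapping step can be sketched as below. It is a simplified, word-granularity illustration, not the scoring pipeline itself: the function names are invented here, `difflib.SequenceMatcher` is an assumed stand-in for the word aligner, and a test span is mapped to the reference span running from the first to the last aligned word it touches. Offsets are inclusive at both ends, as in the APF `charseq` example.

```python
import re
from difflib import SequenceMatcher


def word_spans(text):
    """Return (word, start, end) triples with inclusive character offsets."""
    return [(m.group(), m.start(), m.end() - 1) for m in re.finditer(r"\S+", text)]


def normalize(word):
    """Uppercase and drop non-alphanumerics so '(AP)' can match 'AP'."""
    return re.sub(r"[^A-Za-z0-9]", "", word).upper()


def remap_span(test_text, ref_text, start, end):
    """Map an inclusive character span in the test text onto the reference text.

    Returns (ref_start, ref_end), or None if no word in the span is aligned.
    """
    test, ref = word_spans(test_text), word_spans(ref_text)
    matcher = SequenceMatcher(
        a=[normalize(w) for w, _, _ in test],
        b=[normalize(w) for w, _, _ in ref],
        autojunk=False,
    )
    word_map = {}  # test word index -> reference word index
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            for k in range(i1, i2):
                word_map[k] = min(j1 + (k - i1), j2 - 1)
    # Test words that overlap the span and have an aligned reference word.
    covered = [
        i for i, (_, s, e) in enumerate(test)
        if s <= end and e >= start and i in word_map
    ]
    if not covered:
        return None
    return ref[word_map[covered[0]]][1], ref[word_map[covered[-1]]][2]


ref_text = "MOSCOW (AP) _ Presidents Leonid Kuchma of Ukraine and Boris Yeltsin of Russia signed a plan"
test_text = "MOSCOW AP PRESIDENTS LEONID KUCHMA OF UKRAINE AND BORIS YELTSIN OF RUSSIA SIGNED A PLAN"
s = test_text.index("UKRAINE")
span = remap_span(test_text, ref_text, s, s + len("UKRAINE") - 1)
print(span, ref_text[span[0]:span[1] + 1])
```

After remapping, the rewritten test APF can be scored against the reference APF with the usual overlap criterion.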

Scoring IE, Part II • Problem #2: IE scoring program only compares character offset intervals, not the words in them • So it may ignore word errors in a name • “George Hush” vs. “George Bush” • Solution: Modify scoring program to require match of alphanumeric characters in the test and reference character intervals • Modification courtesy of George Doddington • Requires 50% content overlap

Detailed Results

Effect of Punctuation on Entity Value Score • Sentence boundaries are required but locations are not critical (loss is 2.8% relative with 62% period error rate) • Loss of commas results in a 9.5% reduction in Entity score • Shows the importance of appositives to IE (“George W. Bush, President of the United States, said this morning …”)

Entity Value Scores as Function of WER • Effect of WER on Entity score is linear • Automatic punctuation loses 13.5% relative to reference punctuation

Relation Value Scores as Function of WER • Automatic punctuation loses 25% relative to reference punctuation

Relation Between WER and IE Scores • Entity Value Score (EVS) and Relation Value Score (RVS) are linear functions of WER • Automatic punctuation has a multiplicative effect on scores • Relative QA as a function of EVS
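The first two bullets can be combined into one expression, written here as a sketch under assumptions. The symbols are introduced for illustration only: w is the WER, a and b are the intercept and slope of the linear fit (the slides do not give their values), and ρ is the multiplicative punctuation factor, equal to 1 with reference punctuation. The values of ρ for automatic punctuation follow from the 13.5% and 25% relative losses reported on the two preceding slides.

```latex
\mathrm{EVS}(w) \approx \rho_{E}\,\bigl(a_{E} - b_{E}\,w\bigr), \qquad
\mathrm{RVS}(w) \approx \rho_{R}\,\bigl(a_{R} - b_{R}\,w\bigr)

% Automatic punctuation, from the reported relative losses:
\rho_{E} \approx 1 - 0.135 = 0.865, \qquad
\rho_{R} \approx 1 - 0.25 = 0.75
```

Predicted relative QA at a given WER is then read off the QA-versus-EVS relation from the original text study, evaluated at EVS(w).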

Predicted Relative QA vs. WER and EVS(ref) • At 12% WER with today’s IE, we get 33% of maximum QA • Near zero for 25% WER (e.g., non-English) • With half the IE error rate, half WER, half the loss from punctuation, we estimate 72% of maximum QA

Conclusions • IE scores degrade linearly with WER • Sentence boundaries are required but locations are not critical • Commas are important for IE • With current technology (e.g., 12% WER and 60% EVS on text), we can only achieve 33% of maximum QA performance • If IE error and WER were cut by half and loss due to commas cut in half, QA performance could increase to over 70% of maximum
