Early error detection on word level

Gabriel Skantze and Jens Edlund
{gabriel,edlund}@speech.kth.se
Centre for Speech Technology, Department of Speech, Music and Hearing, KTH, Sweden

Overview
• How do we handle errors in conversational human-computer dialogue?
• Which features are useful for error detection in ASR results?
• Two studies on selected features:
  • Machine learning
  • Human subjects’ judgement

Error detection
• Early error detection
  • Detect if a given recognition result contains errors
  • e.g. Litman, D. J., Hirschberg, J., & Swerts, M. (2000).
• Late error detection
  • Feed back the interpretation of the utterance to the user (grounding)
  • Based on the user’s reaction to that feedback, detect errors in the original utterance
  • e.g. Krahmer, E., Swerts, M., Theune, M., & Weegels, M. E. (2001).
• Error prediction
  • Detect that errors may occur later on in the dialogue
  • e.g. Walker, M. A., Langkilde-Geary, I., Wright Hastie, H., Wright, J., & Gorin, A. (2002).

Why early error detection?
• ASR errors reflect errors in acoustic and language models. Why not fix them there?
• Post-processing may consider systematic errors in the models, due to mismatched training and usage conditions.
• Post-processing may help to pinpoint the actual problems in the models.
• Post-processing can include factors not considered by the ASR, such as:
  • Prosody
  • Semantics
  • Dialogue history

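Corpus collection
[Figure: corpus collection setup. The user speaks, and the operator reads the ASR output; the operator speaks, and the user listens through a vocoder.]
• User says: "I have the lawn on my right and a house with number two on my left"
• ASR output: "i have the lawn on right is and a house with from two on left"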

Study I: Machine learning
• 4470 words
• 73.2% correct (baseline)
• 4/5 training data, 1/5 test data
• Two ML algorithms tested
  • Transformation-based learning (µ-TBL)
    • Learn a cascade of rules that transforms the classification
  • Memory-based learning (TiMBL)
    • Simply store each training instance in memory
    • Compare the test instance to the stored instances and find the closest match
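The memory-based idea (store every training instance, then label a test instance by its closest stored match) can be sketched in a few lines of Python. This is a minimal illustration and not TiMBL itself: the three features (confidence bucket, word type, word length) and all instances are hypothetical, and the distance is a plain unweighted count of mismatching features.

```python
from collections import Counter


def overlap_distance(a, b):
    """Count the feature positions on which two instances disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def classify(instance, memory, k=1):
    """Return the majority label among the k stored instances closest to `instance`."""
    nearest = sorted(memory, key=lambda item: overlap_distance(instance, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training instances: ((confidence bucket, word type, length), label)
memory = [
    (("high", "content", "long"), "correct"),
    (("high", "function", "short"), "correct"),
    (("low", "content", "long"), "incorrect"),
    (("low", "function", "short"), "incorrect"),
]

# An exact match with a stored instance takes that instance's label.
print(classify(("low", "content", "long"), memory))

# A partial match takes the label of the instance with the fewest mismatches.
print(classify(("high", "function", "long"), memory))
```

There is no training step beyond filling `memory`; all the work happens at classification time, which is what distinguishes this approach from the rule learner.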

Features

Results
• Content words:
  • Baseline: 69.8%, µ-TBL: 87.7%, TiMBL: 87.0%
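One way to read these accuracies is as relative error reduction over the baseline, that is, the share of the baseline's misclassified content words that each classifier gets right. The measure is not stated on the slide; the short calculation below is an added illustration that uses only the three percentages given above.

```python
def relative_error_reduction(baseline_acc, model_acc):
    """Share of the baseline's errors removed by the model, in percent."""
    baseline_error = 100.0 - baseline_acc
    model_error = 100.0 - model_acc
    return 100.0 * (baseline_error - model_error) / baseline_error


baseline = 69.8
results = {"µ-TBL": 87.7, "TiMBL": 87.0}

for name, acc in results.items():
    print(f"{name}: {relative_error_reduction(baseline, acc):.1f}% fewer errors than baseline")
```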

Rules learned by µ-TBL
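To show what a cascade of transformation rules does once it has been learned, here is a minimal Python sketch. The two rules, the thresholds, and the word features are invented for illustration and are not the rules learned in the study; the greedy learning step that selects rules is omitted. Every word starts with the majority label, and each rule in turn rewrites the labels of the words it matches.

```python
def apply_cascade(words, rules, initial_label="correct"):
    """Label every word with `initial_label`, then apply each rule in order.

    A rule is (condition, old_label, new_label): words currently labelled
    `old_label` for which `condition(word)` holds are relabelled `new_label`.
    """
    labels = [initial_label] * len(words)
    for condition, old_label, new_label in rules:
        for i, word in enumerate(words):
            if labels[i] == old_label and condition(word):
                labels[i] = new_label
    return labels


# Hypothetical recognised words with made-up confidence scores.
words = [
    {"text": "lawn", "confidence": 0.90, "content": True},
    {"text": "is", "confidence": 0.30, "content": False},
    {"text": "from", "confidence": 0.45, "content": False},
    {"text": "two", "confidence": 0.20, "content": True},
]

# Hypothetical rules, NOT the ones from the study.
rules = [
    # Rule 1: low-confidence words are relabelled as incorrect.
    (lambda w: w["confidence"] < 0.5, "correct", "incorrect"),
    # Rule 2: function words with borderline confidence are switched back.
    (lambda w: not w["content"] and w["confidence"] > 0.4, "incorrect", "correct"),
]

labels = apply_cascade(words, rules)
for word, label in zip(words, labels):
    print(word["text"], label)
```

Because later rules operate on the output of earlier ones, the order of the cascade matters, and each rule stays readable on its own, which is what makes this kind of learner useful for pinpointing specific problems.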

Study II: Human error detection
• First 15 user utterances from 4 dialogues with high WER
• 50% of the words correct (baseline)
• 8 judges
• Features were varied for each utterance:
  • ASR information
  • Context information

Features

The judges’ interface
[Figure: screenshot of the judges’ interface, with the following callouts.]
• Correction field
• Dialogue so far
• 5-best list
• Grey scale reflects word confidence
• Utterance confidence

Results

Conclusions & Discussion
• ML can be used for early error detection on word level, especially for content words.
• Word confidence scores have some use.
• Utterance context and lexical information improve the ML performance.
• A rule-learning algorithm such as transformation-based learning can be used to pinpoint the specific problems.
• N-best lists are useful for human subjects. How do we operationalise them for ML?

Conclusions & Discussion
• The ML performance improved only slightly from the discourse context.
• Further work on operationalising context for ML should focus on the previous utterance.
• The classifier should be tested together with a parser or keyword spotter to see if it can improve performance.
• Other features, such as prosody, should be investigated. These may improve performance further.

The End Thank you for your attention! Questions?
