350 likes | 359 Views
Automatic Classification of Speech Recognition Hypotheses Using Acoustic and Pragmatic Features. Malte Gabsdil Universität des Saarlandes. The Problem (theoretical). Grounding: establishing common ground between dialogue participants “Did H correctly understand what S said?”
E N D
Automatic Classification of Speech Recognition Hypotheses Using Acoustic and Pragmatic Features Malte Gabsdil Universität des Saarlandes
The Problem (theoretical) • Grounding: establishing common ground between dialogue participants • “Did H correctly understand what S said?” • Combination of bottom-up (“signal”) and top-down (“expectation”) information • Clark (1996): Action ladders • upward completion • downward evidence
The Problem (practical) • Assessment of recognition quality for spoken dialogue systems • Information sources • speech/recognition output (“acoustic”) • dialogue/task context (“pragmatic”) • Crucial for usability and user satisfaction • avoid misunderstandings • promote dialogue flow and efficiency
interpretation Dialogue Manager dialogue history dialogue model response selection The General Picture • Dialogue System Architecture ASR Generation
interpretation Best hypothesis + confidence n-Best hypotheses + confidence Dialogue Manager dialogue history dialogue model Confidence rejection thresholds response selection Acoustic features Pragmatic features A Closer Look • How to assess recognition quality? • decision problem ASR Machine Learning Classifier
Overview • Machine learning classifiers • Acoustic and pragmatic features • Experiment 1: Chess • exemplary domain • Experiment 2: WITAS • complex spoken dialogue system • Conclusions & Topics for Future Work
Machine Learning Classifiers • Concept learners • learn decision function • training: present feature vectors annotated with correct class • testing: classify unseen feature vectors • Combine acoustic and pragmatic features to classify recognition hypotheses as accept, (clarify), reject, or ignore
Acoustic Information • Derived from speech waveforms and recognition output • Low level features • amplitude, pitch (f0), duration, tempo (e.g. Levow 1998, Litman et al. 2000) • Recogniser confidences • normalised probability that a sequence of recognised words is correct (e.g. Wessel et al. 2001)
Pragmatic Information • Derived from the dialogue context and task knowledge • Dialogue features • adjacency pairs: current/previous dialogue move, DM bigram frequencies • reference: unresolvable definite NPs/PROs • Task features (scenario dependent) • evaluation of move scores (Chess), conflicts in action preconds and effects of (WITAS)
Experiment 1: Chess • Recognise spoken chess move instructions • speech interface to computer chess program • Exemplary domain to test methodology • nice properties, easy to control • Pragmatic features: automatic move evaluation scores (Crafty) • Acoustic features: recogniser confidence scores (Nuance 8.0)
Data & Design • Subjects replay given chess games • instruct each other to move pieces • approx. 2000 move instructions in different data sets (devel, train, test) • 5 x 2 x 6 design • 5 systems for classifying recognition results (main effect) • 2 game levels (strong vs. weak) • 6 pairs of players
Systems • Task: accept or reject rec. hypotheses • Baseline • confidence rejection threshold • binary classification of best hypothesis • ML System • SVM learner (best on dev. set) • binary classication of 10-best results • choose first classified as accept, else reject
Results • Accuracy: • Baseline: 64.3%, ML System: 97.2%
Evaluation • 82.2% relative error rate reduction • χ² test on confusion matrices • highly significant (p<.001) • Combination of acoustic and pragmatic information outperforms standard approach • System reacts appropriately more often → increased usability
Experiment 2: WITAS • Operator interaction with robot helicopter • Multi-modal references, collaborative activities, multi-tasking • Differences to chess experiment • complex dialogue scenario • complex system (ISU-based, planning, …) • much larger grammar and vocabulary • Chess 37 GR, Vocab 50 FEHLT WAS • open mic recordings (ignore class)
Data Preparation • 30 dialogues (6 users, 303 utterances) • Manual transcriptions • Offline recognition (10best) and parsing→ quasi-logical forms • Hypothesis labelling: • accept: same QLF • reject: out-of-grammar or different QLF • ignore: “crosstalk” not directed to system
Example Features • Acoustic • low level: amplitude (RMS) • recogniser: hypothesis confidence score, rank in nbest list • Pragmatic • dialogue: current/previous DM, DM bigram probability, #unresolvable definite NPs • task: #conflicts in planning operators (e.g. already satisfied effects)
Results • Context-sensitive LMs • Accuracy: 65.68% • Wfscore: 61.81% • (higher price) • TiMBL (optimised) • Accuracy: 86.14% • Wfscore: 86.39%
Evaluation • 59.6% relative error rate reduction • χ² test on confusion matrices • highly significant (p<.001) • Combination of acoustic and pragmatic information outperforms grammar switching approach • System reacts appropriately more often → increased usability
WITAS Features • Importance according to χ² 1. confidence (ac: recogniser) 2. DMBigramFrequency (pr: dialogue) 3. currentDM (pr: dialogue) 4. minAmp (ac: low level) 5. hypothesisLength (ac: recogniser) 6. RMSamp (ac: low level) 7. currentCommand (pr: dialogue) 8. minWordConf (ac: recogniser) 9. aqMatch (pr: dialogue) 10. nbestRank (ac: recogniser)
Summary/Achievements • Assessment of recognition quality for spoken dialogue systems (grounding) • Combination of acoustic and pragmatic information via machine learning • Highly significant improvements in classification accuracy over standard methods (incl. “grammar switching”) • Expect better system behaviour and user satisfaction
Topics for Future Work • Usability evaluation • systems with and w/o classification module • Generic and system-specific features • which features are available across systems? • Tools for ISU-based systems • module in DIPPER software library • Clarification • flexible generation (alternative questions, word-level clarification)
Our Proposal • Combine acoustic and pragmatic information in a principled way • Machine learning to predict the grounding status of competing recognition hypotheses of user utterances • Evaluation against standard methods in spoken dialogue system engineering • confidence rejection thresholds
1. 1. 2. 2. 3. 3. 4. 4. 5. 5. 6. 6. dialogue contexttask knowledge 7. 7. 8. 8. 9. 9. … … Dialogue Manager ASR ML classifier Acoustic information Pragmatic information Application
Chess Results • Accuracy: • Base: 64.3%, LM: 93.5%, ML: 97.2%
Data/Baseline • Data from user study with WITAS • 6 subjects, 5 tasks each (“open mic”) • 30 dialogues (303 user utterances) • recorded utterances and logs of WITAS Information State (dialogue history) • Originally collected to evaluate a “grammar switching” version of WITAS (= Baseline; Lemon 2004)
Data Preparation/Setup • Manual transcription of all utterances • Offline recognition (10best) with “full” grammar and processing with NLU component (quasi-logical forms) • Hypothesis labelling: • accept: same QLF • reject: out-of-grammar or different QLF • ignore: “crosstalk” not directed to system
Acoustic Features • Low level: • RMSamp, minAmp (abs), meanAmp (abs) • motiv: detect crosstalk • Recogniser output/confidence scores: • nbest rank, hypothesisLength (words) • hyp. confidence, confidence zScore, confidence SD, minWordConf • motiv: quality estimation within and across hypotheses
Pragmatic Features • Dialogue: • currentDM, DMTactiveNode, qaMatch, aqMatch, DMbigramFreq, currentCommand • #unresNPs, #unresPROs, #uniqueIndefs • motiv: adjacency pairs, unlikely references • Task: • taskConflict (same command already active), taskConstraintConflict (fly vs. land)
Importance of Features • What are the most predictive features? • χ² statistics: correlate feature values with different classes • computed for each feature from value/class contingency tables • Oij: observed frequencies; Eij: expected frequencies • n.j: sum over column j; ni.: sum over row i;n..: #instances
Simple Example • Feature A • Feature B • χ² = 0 • χ² = 2*(30²/50)+2*0+ 2*(55²/75)+ 2*(85²/200) = 188.92