
Characterizing Task-Oriented Dialog using a Simulated ASR Channel

Presentation Transcript


  1. Jason D. Williams, Machine Intelligence Laboratory, Cambridge University Engineering Department. Characterizing Task-Oriented Dialog using a Simulated ASR Channel

  2. SACTI-1 Corpus: Simulated ASR-Channel – Tourist Information
     • Motivation for the data collection
     • Experimental set-up
     • Transcription & annotation
     • Effects of ASR error rate on: turn length / dialog length; perception of error rate; task completion; "initiative"; overall satisfaction (PARADISE)

  3. HH dialog: ASR channel vs. HH channel
     Properties
     • HH channel: "instant" communication; effectively perfect recognition of words; prosodic information carries additional information
     • ASR channel: turns explicitly segmented, barge-in, end-pointed; prosody virtually eliminated; ASR & parsing errors
     Observations
     • HH channel: frequent but brief overlaps; 80% of utterances contain fewer than 12 words (50% fewer than 5); approximately equal turn length; approximately equal balance of initiative; about half of turns are ACK (often spliced)
     • ASR channel: few overlaps; longer system turns, shorter user turns; initiative more often with the system; virtually no turns are ACK; virtually no splicing
     Are models of HC dialog/grounding appropriate in the presence of the ASR channel?

  4. My approach
     • Study the ASR channel in the abstract: WoZ experiments using a simulated ASR channel
     • Understand how people behave with an "ideal" dialog manager – for example, a grounding model
     • Use these insights to inform state space and action set selection
     • Note that the collected data has properties uniquely useful for RL-based systems, hidden-state estimation, and user modeling
     • Formulate the dialog management problem as a POMDP: decompose the state into BN nodes – for example, conversation state (grounding state), user action, user belief (goal); train using the collected data; solve using approximations (see the sketch below)
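The following Python sketch only illustrates the factored belief described in the last bullet – separate distributions over the user's goal, the user's last action, and the grounding state, with a Bayesian goal update driven by an ASR N-best list. All class, function, goal, and slot names here are hypothetical; the paper's actual model and training procedure are not shown.

```python
from dataclasses import dataclass, field

@dataclass
class FactoredBelief:
    user_goal: dict = field(default_factory=dict)        # P(user goal)
    user_action: dict = field(default_factory=dict)      # P(last user action)
    grounding_state: dict = field(default_factory=dict)  # P(conversation/grounding state)

def normalize(dist):
    total = sum(dist.values())
    return {k: v / total for k, v in dist.items()} if total else dist

def update_goal(belief, asr_nbest, goal_likelihood):
    """Bayesian update of the goal distribution from an ASR N-best list.

    asr_nbest: list of (hypothesis, confidence) pairs
    goal_likelihood: user-model function returning P(hypothesis | goal)
    """
    posterior = {}
    for goal, prior in belief.user_goal.items():
        evidence = sum(conf * goal_likelihood(hyp, goal) for hyp, conf in asr_nbest)
        posterior[goal] = prior * evidence
    belief.user_goal = normalize(posterior)
    return belief

# Example with a trivial keyword-matching "user model" (hypothetical goals).
belief = FactoredBelief(user_goal={"find_hotel": 0.5, "find_restaurant": 0.5})
belief = update_goal(belief, [("a quiet hotel", 0.7), ("a quiet hostel", 0.3)],
                     lambda hyp, goal: 0.8 if goal.split("_")[1] in hyp else 0.1)
print(belief.user_goal)   # probability mass shifts toward "find_hotel"
```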

  5. The paradox of "dialog data"
     • To build a user model, we need to see the user's reaction to all kinds of misunderstandings
     • However, most systems use a fixed policy: they typically do not take different actions in the same situation
     • Taking random actions is clearly not an option!
     • Constraining actions means building very complex systems…
     • …and which actions should be in the system's repertoire?

  6. An ideal data collection…
     • …would show users' reactions to a variety of error-handling strategies (no fixed policy) – but would not produce nonsense dialogs!
     • …would use the ASR channel
     • …would explore a variety of operating conditions – e.g., word error rate (WER)
     • …would not assume a particular state space
     • …would somehow "discover" the set of system actions

  7. Data collection set-up

  8. ASR simulation state machine
     • Simple energy-based barge-in: the user can interrupt the wizard
     • [State diagram] States: SILENCE, USER_TALKING, TYPIST_TYPING, WIZARD_TALKING; transition labels: "user starts talking", "user stops talking", "typist done; reco result displayed", "wizard starts talking", "wizard stops talking" (a minimal sketch follows below)
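A minimal Python sketch of the turn-taking machine above. The state names come from the slide; the exact transition set and event names are assumptions reconstructed from the diagram labels, not the experiment's actual code.

```python
# State names from the slide; event names and transition policy are illustrative assumptions.
SILENCE, USER_TALKING, TYPIST_TYPING, WIZARD_TALKING = (
    "SILENCE", "USER_TALKING", "TYPIST_TYPING", "WIZARD_TALKING")

TRANSITIONS = {
    (SILENCE,        "user_starts_talking"):   USER_TALKING,
    (SILENCE,        "wizard_starts_talking"): WIZARD_TALKING,
    (USER_TALKING,   "user_stops_talking"):    TYPIST_TYPING,
    (TYPIST_TYPING,  "typist_done"):           SILENCE,        # reco result displayed to the wizard
    (WIZARD_TALKING, "wizard_stops_talking"):  SILENCE,
    (WIZARD_TALKING, "user_starts_talking"):   USER_TALKING,   # energy-based barge-in
}

def step(state, event):
    """Return the next state; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

# Example: the user barges in while the wizard is talking.
state = WIZARD_TALKING
state = step(state, "user_starts_talking")
print(state)   # USER_TALKING
```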

  9. ASR simulation
     • Simplified FSM-based recognizer built on a weighted finite state transducer (WFST)
     • Flow: reference input → spell-checked against the full dictionary → converted to a phonetic string using the full dictionary → phonetic lattice generated from a confusion model → word lattice produced → language model composed to re-score the lattice → "decoded" to produce word strings → N-best list extracted
     • Various free variables induce random behavior; the N-best list is mined for variability (toy sketch of the stages below)
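Purely to make the pipeline concrete, here is a toy Python sketch of those stages. The real system composes WFSTs and a language model; this version uses a tiny hand-made lexicon, a hand-made phone-confusion table, and a crude overlap-based "decoder", so it is closer to the naive confusion baseline mentioned on the next slide than to the WFST model.

```python
import random

# Toy lexicon and phone-confusion table; purely illustrative.
LEXICON = {"i": "AY", "need": "N IY D", "a": "AH", "pizza": "P IY T S AH",
           "pita": "P IY T AH", "beet": "B IY T"}
CONFUSIONS = {"IY": ["IY", "IH"], "AH": ["AH", "AA"], "P": ["P", "B"], "D": ["D", "T"]}

def confuse(phones, error_rate, rng):
    """Apply the confusion model to one word's phone string."""
    return [rng.choice(CONFUSIONS[p]) if p in CONFUSIONS and rng.random() < error_rate else p
            for p in phones]

def decode_word(phones):
    """Crude stand-in for word-lattice decoding + LM re-scoring:
    pick the lexicon word whose pronunciation overlaps the corrupted phones most."""
    return max(LEXICON, key=lambda w: len(set(LEXICON[w].split()) & set(phones)))

def simulate_nbest(reference, n=5, error_rate=0.3, seed=0):
    """Corrupt a reference utterance and return up to n distinct hypotheses."""
    rng = random.Random(seed)
    words = reference.lower().split()
    hyps = []
    for _ in range(n * 4):                                   # oversample for variability
        phones = [confuse(LEXICON.get(w, w).split(), error_rate, rng) for w in words]
        hyp = " ".join(decode_word(p) for p in phones)
        if hyp not in hyps:
            hyps.append(hyp)
        if len(hyps) == n:
            break
    return hyps

print(simulate_nbest("i need a pizza"))
```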

  10. ASR simulation evaluation
      • Hypothesis: the simulation produces errors similar to errors induced by additive noise, w.r.t. concept accuracy
      • Concept accuracy assessed as F-measure using an automated, data-driven procedure (HVS model)
      • Plot concept accuracy for: real additive noise; a naïve confusion model (simple insertion, substitution, deletion); the WFST confusion model
      • The WFST model appears to follow the real data much more closely than the naïve model
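For reference, a minimal sketch of F-measure scoring over concepts, assuming the HVS-based extractor has already produced (slot, value) pairs for the reference and the recognized hypothesis; the slot names in the example are invented.

```python
def concept_f_measure(reference_concepts, hypothesis_concepts):
    """F-measure over sets of (slot, value) tuples."""
    if not reference_concepts and not hypothesis_concepts:
        return 1.0
    correct = len(reference_concepts & hypothesis_concepts)
    precision = correct / len(hypothesis_concepts) if hypothesis_concepts else 0.0
    recall = correct / len(reference_concepts) if reference_concepts else 0.0
    return (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

# Example: two of three reference concepts survive the simulated ASR channel.
ref = {("food", "pizza"), ("area", "centre"), ("pricerange", "cheap")}
hyp = {("food", "pizza"), ("area", "centre")}
print(round(concept_f_measure(ref, hyp), 2))   # 0.8
```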

  11. Scenario & Tasks
      • Tourist / tourist-information scenario: intentionally goal-directed, intentionally simple tasks; mixtures of simple information gathering and basic planning
      • Wizard (information giver): access to bus times, tram times, restaurants, hotels, bars, tourist attraction information, etc.
      • User given a series of tasks; Likert scores collected at the end of each task
      • 4 dialogs per user; 3 users per wizard
      • Example task – finding the perfect hotel: "You're looking for a hotel for you and your travelling partner that meets a number of requirements. You'd like the following: en suite rooms, quiet rooms, as close to the main square as possible. Given those desires, find the least expensive hotel. You'd prefer not to compromise on your requirements, but of course you will if you must! Please indicate the location of the hotel on the map and fill in the boxes below."

  12. User’s Map

  13. Wizard’s map

  14. Likert-scale questions
      • User and wizard were each given 6 questions after each task. Example (as given to the subject):
        1. In this task, I accomplished the goal.
        2. In this task, I thought the speech recognition was accurate.
        3. In this task, I found it difficult to communicate because of the speech recognition.
        4. In this task, I believe the other subject was very helpful.
        5. In this task, the other subject found using the speech recognition difficult.
        6. Overall, I was very satisfied with this past task.

  15. Transcription
      • User side transcribed during the experiments, prioritized for speed, e.g. "I NEED UH I'M LOOKING FOR A PIZZA"
      • Wizard side transcribed using a subset of the LDC transcription guidelines, in more detail, e.g. "ok %uh% (()) sure you -- i can" with epErrorEnd=true

  16. Annotation (acts) • Each turn is a sequence of tags • Inspired by Traum’s “Grounding Acts” • More detailed / easier to infer from surface words

  17. Annotation (understanding) • Each wizard turn was labeled to indicate whether the wizard understood the previous user turn

  18. Corpus summary

  19. How accurately do users & wizards perceive WER? Perceptions of recognition quality broadly reflected actual performance, but users consistently gave higher quality scores than wizards for the same WER Perception of ASR accuracy
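The "actual performance" referenced here is the measured word error rate. As a reference point, a minimal sketch of the standard edit-distance WER computation (not tied to the corpus tooling):

```python
def wer(reference, hypothesis):
    """Word error rate = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("i need a pizza", "i need the pizza"))   # 0.25
```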

  20. How does WER affect wizard & user turn length? Wizard turn length increases User turn length stays relatively constant Average turn length (words)

  21. How does WER affect wizard grounding behavior? As WER increases, wizard grounding behaviors become increasingly prevalent Grounding behavior

  22. How does WER affect wizard understanding status? Misunderstanding increases with WER… …and task completion falls (83%, 83%, 77%, 42%) Wizard understanding

  23. Wizard strategies • Classify each wizard turn into one of 5 “strategies”

  24. Which strategies are most successful after known dialog trouble? This plot shows wizard understanding status one turn after known dialog trouble: the effect of 'REPAIR' vs. 'ASKQ'. 'S' indicates significant differences. Wizard strategies
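The significance marks could come from a standard contingency-table test. As an illustration only (the counts below are placeholders, not the corpus figures), a chi-square test comparing understanding outcomes after REPAIR vs. ASKQ might look like this:

```python
from scipy.stats import chi2_contingency

counts = [[40, 20],   # REPAIR: understood, misunderstood (placeholder counts)
          [55, 10]]   # ASKQ:   understood, misunderstood (placeholder counts)
chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2={chi2:.2f}, p={p_value:.3f}")
```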

  25. How does a user respond after being misunderstood? Surprisingly little explicit indication! User reactions to misunderstandings

  26. How does “initiative” vary with WER? Define wizard “initiative” using strategies, above Level of wizard “initiative”

  27. Reward measures / PARADISE
      • Satisfaction = Task completion + Dialog cost metrics
      • 2 kinds of user satisfaction: Single, Combi
      • 3 kinds of task completion: User, Obj, Hyb
      • Cost metrics: PerDialogWER, %UnFlaggedMis, %FlaggedMis, %Non, Turns, %REPAIR, %ASKQ
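PARADISE fits user satisfaction as a weighted linear combination of a task-completion measure and the cost metrics. Below is a minimal numpy sketch of that regression; the predictor names follow the slide, but the data rows are placeholders, not corpus values.

```python
import numpy as np

# Columns: TaskCompletion, PerDialogWER, %UnFlaggedMis, Turns   (placeholder rows)
X = np.array([[1.0, 0.10, 0.05, 20],
              [1.0, 0.25, 0.12, 28],
              [0.0, 0.45, 0.30, 35],
              [1.0, 0.30, 0.10, 25],
              [0.0, 0.50, 0.35, 40]])
satisfaction = np.array([6.5, 5.8, 3.0, 5.5, 2.5])      # e.g., summed Likert scores

# Standardize predictors (as PARADISE does) and add an intercept column.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
A = np.hstack([np.ones((len(Z), 1)), Z])
weights, *_ = np.linalg.lstsq(A, satisfaction, rcond=None)
print(dict(zip(["intercept", "Task", "WER", "UnFlaggedMis", "Turns"], weights.round(2))))
```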

  28. In almost all experiments using the User task completion metric, it was the only significant predictor The single/combi metrics almost always selected the same predictors Reward measures/PARADISE

  29. What indicators best predict user satisfaction? When run on all data, mixtures of Task, Turns, and %UnFlaggedMis best predict user satisfaction. %UnFlaggedMis serves as a better measurement of understanding accuracy than WER alone, since it effectively combines recognition accuracy with a measure of confidence. Broadly speaking: task completion is most important at the high WER level; task completion and dialog quality are most important at the medium WER level; efficiency is most important at the low WER level. These patterns mirror findings from other PARADISE experiments using human/computer data, which gives us some confidence that this data set is valid for training human/computer systems. Reward measures/PARADISE

  30. At moderate WER levels, asking task-related questions appears to be more successful than direct dialog repair. Levels of expert “initiative” increase with WER, primarily as a result of grounding behavior. Users infrequently give a direct indication of having been misunderstood, with no clear correlation to WER. When run on all data, mixtures of Task, Turns, %UnFlaggedMis best predict user satisfaction. Task completion appears to be most predictive of user satisfaction; however, efficiency shows some influence at lower WERs. Next… apply this corpus to statistical systems. Conclusions/Next steps

  31. Thanks! Jason D. Williams jdw30@cam.ac.uk
