Belief Updating in Spoken Dialog Systems

Dialogs on Dialogs Reading Group, June, 2005
Dan Bohus
Carnegie Mellon University, January 2004

Misunderstandings
• Misunderstandings are an important problem in spoken dialog systems
  • The system obtains an incorrect semantic interpretation of the user’s utterance
• 15-40% of turns
• Significant negative impact on overall success rate

Confidence annotation
• Use confidence scores to guard against potential misunderstandings
• Traditionally: from the speech recognition engine [Chase, Bansal, Cox, Kemp, etc.]
  • Focuses on WER, not tuned to the task at hand
• More recently: system-specific semantic confidence scores [Carpenter, Walker, San-Segundo, etc.]
  • Integrate knowledge from different levels in the system: speech recognition, language understanding, dialog management

Correction Detection
• Detect whether or not the user is trying to correct the system
• Related: aware-site detection
• Similar ML approaches using multiple sources of knowledge [Litman, Swerts, Krahmer, etc.]

Proposed: Belief Updating
• Integrate confidence annotation and correction detection in a unified framework for continuously tracking beliefs
• A “belief updating” problem:
  S: Where are you flying from?
  U: [CityName={Aspen/0.6; Austin/0.2}]
  S: Did you say you wanted to fly out of Aspen?
  U: [No/0.6] [CityName={Boston/0.8}]
• initial belief + system action + user response → updated belief
  [CityName={Aspen/?; Austin/?; Boston/?}]

Formally…
• Given:
  • An initial belief P_initial(C) over concept C
  • A system action SA
  • A user response R
• Construct an updated belief P_updated(C)
  • As “accurate” as possible
• P_updated(C) ← f(P_initial(C), SA, R)
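The interface of f can be sketched in code. This is only a minimal illustration of the inputs and output, not the learned update discussed later: the `UserResponse` container, its field names, and the `keep_initial_belief` baseline (which simply ignores SA and R) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A belief over one concept: hypothesis value -> confidence.
Belief = Dict[str, float]


@dataclass
class UserResponse:
    # Hypothetical container; the field names are illustrative only.
    confirmation: Dict[str, float]  # e.g. {"No": 0.6}
    concept_values: Belief          # e.g. {"Boston": 0.8}


# Signature of f: (initial belief, system action, user response) -> updated belief.
BeliefUpdater = Callable[[Belief, str, UserResponse], Belief]


def keep_initial_belief(initial: Belief, system_action: str,
                        response: UserResponse) -> Belief:
    """Trivial baseline for f: ignore SA and R, return a copy of the initial belief."""
    return dict(initial)


# The flight example: an explicit confirmation ("EC") followed by a correction.
initial = {"Aspen": 0.6, "Austin": 0.2}
response = UserResponse(confirmation={"No": 0.6}, concept_values={"Boston": 0.8})
updated = keep_initial_belief(initial, "EC", response)
print(updated)
```

Any real updater would replace the baseline while keeping the same signature, which is what makes the comparison against "do nothing" straightforward.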

Examples

Examples - continued

Outline
• Introduction
• Data
• A simplified version of the problem. Approach
• User behaviors
• Learning: Preliminary results
• More on evaluation
• Where to from here?
data : problem/approach : user behaviors : preliminary results : more on evaluation : what next?

Data
• Collected in an experiment with RoomLine
  • Phone-based, mixed-initiative system for making conference room reservations
  • Equipped with explicit and implicit confirmations
• Corpus statistics
  • 46 participants
  • 449 sessions, 8278 turns
  • 13.5% misunderstandings [9.8% / 22.5%]
  • 25.6% WER [19.6% / 39.5%]
  • 11362 concept updates

System actions and concept updates
• Explicit and implicit confirmations
  • Start time: Explicit Confirmation/grounding [EC]
  • Date: Implicit Confirmation/grounding [IC]

System actions and concept updates
• Implicit Confirmations Task
  • Date: Implicit Confirmation/grounding [IC]
  • Start time: Implicit Confirmation/grounding [IC]
  • End time: Implicit Confirmation/task [ICT]

# of Conflicting Hypotheses
• Fewer than 3% involve more than 1 hypothesis
  • System not using multiple hypotheses
  • [Future work: regenerate multiple hypotheses in batch]

A Simplified Version
Given that fewer than 3% involve more than 1 hypothesis:
• Update the belief in the top hypothesis after implicit and explicit confirmations
• Instead of
  • P_updated(C) ← f(P_initial(C), SA, R)
• Do
  • ConfTop_updated(C) ← f(ConfTop_initial(C), SA, R)
• For SA ∈ {EC, IC, ICT}

Approach
• Use machine learning
• Dataset
  • Concept updates for ECs, ICs, ICTs
• Features
  • Initial confidence score ConfTop_initial(C)
  • System action (SA)
  • User response (R)
• Target
  • Updated confidence score ConfTop_updated(C)
  • Data is labeled, so we have a binary target

User behaviors
• Study of user behaviors in response to ICs and ECs
  • Can inform feature selection and feature development
  • Provide insights into where the difficulties are
  • Can inform potential strategy refinements

User responses to ECs
• Transcripts
• Decoded
[Tables of response counts not recovered; surviving annotation: ~10%]

“Other” Responses to EC
• “Eyeball” estimates (out of 146 responses)
  • ~70% simply repeat the correct concept value
    • That should come in as a handy feature
  • ~10% change conversation focus
  • ~10% turn overtaking issues
    • Maybe inhibit barge-in until Antoine finishes his thesis
  • ~10% other

User responses to ICs
• Transcripts
• Decoded
[Tables of response counts not recovered]

Users Don’t Always Correct ICs
• Actually, they corrected in 45% of the cases
  • That means that even if we knew exactly when they correct, we’d still have (126+1)/788 = 16% error
• So what do users do when they don’t correct?
  • They may actually correct partially
  • Completely ignore the error … (if non-essential)
  • Readjust to accommodate task
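The 16% figure is plain arithmetic on the two counts quoted on the slide. The slide does not say what each count stands for, so the variable names below are neutral placeholders.

```python
# Counts as quoted on the slide; what each one measures is not stated there.
numerator = 126 + 1
denominator = 788

residual_error = numerator / denominator
print(f"{residual_error:.1%}")
```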

More questions…
• Understand this “ignore” phenomenon better
  • Impact on task success?
    • IC correction rate: 49% (successful tasks) vs 41% (unsuccessful)
  • Fixed vs more “flexible” scenarios
  • Impact of prompt length on P(user will correct)?
  • “Essential” vs “non-essential” concepts?

Which ML technique?
• Need good probability outputs
  • Margins produced by discriminant classifiers are inadequate
• If you want probability scores, i.e. conf = 0.85 means that in 85% of cases with conf = 0.85 the concept is right
  • evaluate on a soft metric [I’ll contradict myself later!!]
• Step-wise logistic regression
  • Sample-efficient
  • Feature selection
  • Good soft-metric performance
    • optimizes for avg. log-likelihood of data
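The "conf = 0.85 means right in 85% of cases" requirement is a calibration property, and one common way to inspect it is a reliability table. The sketch below is an assumption about how such a check could look, not something described on the slides; `reliability_table` and the toy data are made up for the illustration.

```python
from collections import defaultdict


def reliability_table(confidences, labels, n_bins=10):
    """Group predictions into equal-width confidence bins and report, per bin,
    (mean confidence, fraction correct, count)."""
    bins = defaultdict(list)
    for conf, label in zip(confidences, labels):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, label))
    table = {}
    for index, items in sorted(bins.items()):
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(l for _, l in items) / len(items)
        table[index] = (mean_conf, accuracy, len(items))
    return table


# Toy data: 20 concept updates scored 0.85, of which 17 are actually correct.
confs = [0.85] * 20
labels = [1] * 17 + [0] * 3
table = reliability_table(confs, labels)
for index, (mean_conf, accuracy, count) in table.items():
    print(index, round(mean_conf, 3), round(accuracy, 3), count)
```

A well-calibrated model shows mean confidence close to the fraction correct in every populated bin; a classifier margin rescaled to [0, 1] typically does not.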

Data. Features
• For each system action {EC, IC, ICT}
• Initial confidence score
• Other indicators about the current state:
  • How well has the dialog been going
  • Which concept are we talking about
  • How far back was this concept acquired
• Features on the user response
  • Confirmation and disconfirmation markers
  • Acoustic / prosodic: f0 (min, max, range, maxslope, etc.) + normalized versions
  • Num words; turn length (secs)
  • Concept information: expected / repeated / new concepts and grammar slots…
  • Confidence
  • Barge-in & timeout info
  • Lexical features (preselected by MI with “target” or confirm/disconfirm markers)

Results
• Actually using a 1-level logistic model-tree
  • Split on answer_type = {yes, no, other, no_parse}
  • Perform step-wise logistic regression on the 4 leaves
    • P-entry = 0.05
    • P-reject = 0.30
    • BIC stopping criterion
• Also tried a full-blown model tree; results are similar, maybe marginally worse
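The structure of a 1-level logistic model tree can be sketched as one logistic model per answer_type leaf. This is a simplified illustration under stated assumptions: it uses plain batch gradient ascent in pure Python and leaves out the step-wise feature selection (P-entry, P-reject, BIC) entirely; the function names, the single feature, and the toy rows are invented for the example.

```python
import math
from collections import defaultdict


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient ascent on the average log-likelihood."""
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = y - sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi + lr * g / len(xs) for wi, g in zip(w, grad)]
    return w


def fit_model_tree(rows):
    """rows: (answer_type, features, label). Fits one logistic model per leaf."""
    leaves = defaultdict(lambda: ([], []))
    for answer_type, x, y in rows:
        leaves[answer_type][0].append(x)
        leaves[answer_type][1].append(y)
    return {a: fit_logistic(xs, ys) for a, (xs, ys) in leaves.items()}


def predict(tree, answer_type, x):
    w = tree[answer_type]
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


# Toy data (made up): the single feature is the initial confidence,
# the label says whether the top hypothesis was actually correct.
rows = [
    ("yes", [0.9], 1), ("yes", [0.6], 1), ("yes", [0.3], 1), ("yes", [0.2], 0),
    ("no", [0.9], 0), ("no", [0.6], 0), ("no", [0.3], 0), ("no", [0.8], 1),
]
tree = fit_model_tree(rows)
print(predict(tree, "yes", [0.7]), predict(tree, "no", [0.5]))
```

Splitting first on the answer type lets each leaf learn its own weights, so the same initial confidence can map to very different updated confidences after a "yes" and after a "no".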

Explicit Confirmation

Implicit Confirmation

What can Logistic Regression / AVG-LL do for you?
• D = {d_1, d_2, d_3, d_4, …}, d_i ∈ {0, 1}
• P(D) = ∏ P(d_i | x_i)
• Express the density P(d=1 | x) as:
  • P(d=1 | x) = 1 / (1 + exp(-wx))
  • You can actually derive this if you start with P(x | d) Gaussian
• Find parameters w to maximize P(D)
  • argmax P(D) = argmax ∏ P(d_i | x_i)
  • argmax P(D) = argmin ∑ -log P(d_i | x_i)
• Hence we maximize the average log-likelihood
• But what does that mean?
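The remark that the logistic form follows from Gaussian P(x | d) can be spelled out. The derivation below is the standard one and assumes, beyond what the slide states, that both classes share one covariance matrix Σ; the class means μ_0, μ_1 and priors π_0, π_1 are notation introduced here.

```latex
\begin{align*}
P(d=1 \mid x)
  &= \frac{p(x \mid d=1)\,\pi_1}{p(x \mid d=1)\,\pi_1 + p(x \mid d=0)\,\pi_0}
   = \frac{1}{1 + \exp(-a(x))},
  \qquad
  a(x) = \log\frac{p(x \mid d=1)\,\pi_1}{p(x \mid d=0)\,\pi_0} \\[4pt]
\text{with } p(x \mid d=k) &= \mathcal{N}(x;\, \mu_k, \Sigma): \\
a(x) &= -\tfrac{1}{2}(x-\mu_1)^{\top}\Sigma^{-1}(x-\mu_1)
        + \tfrac{1}{2}(x-\mu_0)^{\top}\Sigma^{-1}(x-\mu_0)
        + \log\frac{\pi_1}{\pi_0} \\
     &= w^{\top}x + b, \\
w &= \Sigma^{-1}(\mu_1 - \mu_0), \qquad
b = -\tfrac{1}{2}\mu_1^{\top}\Sigma^{-1}\mu_1
    + \tfrac{1}{2}\mu_0^{\top}\Sigma^{-1}\mu_0
    + \log\frac{\pi_1}{\pi_0}
\end{align*}
```

The quadratic terms in x cancel only because Σ is shared, which is what leaves a linear argument inside the sigmoid; the bias b is absorbed into w when x carries a constant feature, matching the wx form on the slide.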

Loss function in Logistic Regression
• Log-likelihood loss function
  • If d=1, then P(d=1)=0.01 is ten times worse than P(d=1)=0.1, but P(d=1)=0.7 is about the same as P(d=1)=0.8
  • Things are mirrored for d=0
• This does not match the “threshold” model commonly used to engage actions
[Figure: log-likelihood loss for d=1 as a function of P(d=1), with 0.01, 0.1, 0.7, 0.8 and 1 marked on the axis]
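The asymmetry can be checked numerically. The sketch below evaluates the per-example negative log-likelihood at the four probabilities named on the slide, for a correct concept (d=1): the factor of ten in likelihood between 0.01 and 0.1 becomes a fixed gap of ln 10 in log loss, while 0.7 and 0.8 are separated by a much smaller gap. The function name `log_loss` is chosen for the example.

```python
import math


def log_loss(p_correct):
    """Negative log-likelihood of one example whose true label got probability p_correct."""
    return -math.log(p_correct)


for p in (0.01, 0.1, 0.7, 0.8):
    print(f"P(d=1) = {p:<4}  loss = {log_loss(p):.3f}")

gap_low = log_loss(0.01) - log_loss(0.1)   # equals ln(10)
gap_high = log_loss(0.7) - log_loss(0.8)   # equals ln(0.8 / 0.7)
print(f"gap 0.01 vs 0.1: {gap_low:.3f}, gap 0.7 vs 0.8: {gap_high:.3f}")
```

A threshold-based dialog policy would treat 0.01 and 0.1 identically if both fall below the rejection threshold, which is the mismatch the slide points at.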

A New Loss Function: T2
• A loss function that better matches our domain: T2 (or even T3)
[Figure: step-shaped T2 functions of the confidence for d=1 and d=0, with thresholds t1 and t2 on the 0 to 1 axis and cost levels C1, C2, C3, C4]
• Optimize argmax ∑ T2(P(d_i=c | x_i))
• Not differentiable
• Not convex

Smoothed version
• A loss function that better matches our domain: T2 (or even T3)
• SmoothT2(p) = σ_1(p) + σ_2(p)
  • σ_i(p) = 1 / (1 + exp(k_i(p - θ_i)))
  • with ks and θs chosen accordingly
[Figure: smoothed T2 for d=1, with thresholds t1 and t2 on the 0 to 1 axis and cost levels C1, C2]
• Optimize argmax ∑ SmoothT2(P(d_i=c | x_i))
• Differentiable!
• But still not convex … multiple local maxima
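The SmoothT2 formula can be evaluated directly. The sketch below implements it as written, with unit step heights; the particular values of k and θ are assumptions picked for the illustration (the slide only says they are "chosen accordingly"), and `hard_steps` is simply the large-k limit of the same formula, not necessarily the exact T2 from the figure.

```python
import math


def sigma(p, k, theta):
    """One smooth step: close to 1 for p well below theta, close to 0 well above it (k > 0)."""
    return 1.0 / (1.0 + math.exp(k * (p - theta)))


def smooth_t2(p, k1=50.0, theta1=0.3, k2=50.0, theta2=0.7):
    # k and theta values are illustrative assumptions, not taken from the slides.
    return sigma(p, k1, theta1) + sigma(p, k2, theta2)


def hard_steps(p, theta1=0.3, theta2=0.7):
    """Piecewise-constant limit of smooth_t2 as k1, k2 grow large."""
    return float(p < theta1) + float(p < theta2)


for p in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"p = {p:.1f}  smooth = {smooth_t2(p):.4f}  hard = {hard_steps(p):.0f}")
```

Larger k makes the smooth version hug the steps more tightly but also makes gradients vanish away from the thresholds, which is one practical reason the optimization remains hard even though the function is differentiable.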

Costs & Thresholds
• Costs: where from?
  • “Expert” knowledge
  • Derive from data (might be tricky)
• Thresholds: where from?
  • Fixed
  • Actually optimize at the same time
    • SmoothT2 = SmoothT2(w, th1, th2)
    • Differentiable in th1 and th2, so we can do a gradient search for them
    • Calibrates in one step both the belief updating and the thresholds to minimize loss

Questions: What Next?
• ICT: can we do anything there?
  • Looks really tough
• Push for better performance
  • … Add more features?
  • … Debug the models more, eliminate singularities
  • … Why doesn’t the model-tree do better?
• Push for better understanding
  • … What are the other interesting questions …
• Optimize for new loss function
• More in the future: look at the full belief updating problem

Thank You!

Encoding System Actions
• For each concept update, define a system action signature: <IC, ICT, EC, REQ>
  • IC: Implicit Confirm [grounding]
  • ICT: Implicit Confirm [task]
  • EC: Explicit Confirm
  • REQ: Request
• Each variable can have 1 of 4 values
  • 0
  • C (action happens on concept of interest)
  • OC (action happens on some other concept)
  • C&OC (action happens both on concept of interest and some other concept)
• Only certain combinations are valid and appear in the data
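One way to represent such a signature in code is an enum for the four values and a small record for the four slots. This is a sketch only: the class and method names are invented here, the one-hot layout is an assumption about how the signature might be fed to a learner, and the validity constraint on combinations is not enforced because the slide does not list the valid ones.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    NONE = "0"         # action does not happen
    C = "C"            # action happens on the concept of interest
    OC = "OC"          # action happens on some other concept
    C_AND_OC = "C&OC"  # action happens on both


@dataclass(frozen=True)
class ActionSignature:
    # Slot order follows the slide: <IC, ICT, EC, REQ>.
    ic: Target = Target.NONE
    ict: Target = Target.NONE
    ec: Target = Target.NONE
    req: Target = Target.NONE

    def as_tuple(self):
        return (self.ic.value, self.ict.value, self.ec.value, self.req.value)

    def one_hot(self):
        """Four slots times four values, flattened to 16 binary features."""
        slots = (self.ic, self.ict, self.ec, self.req)
        return [int(slot is level) for slot in slots for level in Target]


# Explicit confirmation of the concept of interest, nothing else in the turn.
sig = ActionSignature(ec=Target.C)
print(sig.as_tuple())
print(sig.one_hot())
```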
