baselines for recovering from non-understandings

baselines for recovering from non-understandings: a planned investigation. Dan Bohus, DoD Reading Group, May 2004

understanding errors Two main types of errors: • System cannot form a valid interpretation for the user’s input → non-understanding • System forms an incorrect interpretation of the user’s input → misunderstanding

non-understandings • Can be a big pain, there’s not much to act upon… • S: When do you need a room? • U: blah bla blahaha [3 unparsed words] • S: I’m sorry I didn’t understand you. can you please repeat that? • What would you do in the system’s place?

an example
1 S: Ok. I will now make the reservation for Newell Simon thirty three oh five. Could I have your name please? If you are not registered with the system, just say 'guest'.
  U: guest [YES]
2 S: Sorry, I'm not sure I understood what you said.
  U: guest [DATE]
3 S: I am currently trying to make a room reservation for Newell Simon thirty three oh five. Right now I need to know your name in order to make a conference room reservation. Tell me your name, or say 'guest' if you are not registered with the system. If you do not want to make this reservation, say 'cancel reservation'.
  U: guest [THE]
4 S: I apologize, I still did not understand you. Could you try to rephrase that please?
  U: guest [YEAH]
5 S: Sorry, I didn't catch that.
  U: oh my god guest we're a guest [.WHAT. .YES. .YES. FOR TUESDAY]
6 S: Newell Simon thirty five oh one is the only room available Tuesday, April twentieth between ten and noon. This room can accommodate seventeen people, and has a projector, a whiteboard and network access. Do you want a reservation for Newell Simon thirty five oh one?
  U: yes [YES]
7 S: Ok. I will now make the reservation for Newell Simon thirty five oh one. Could I have your name please? If you are not registered with the system, just say 'guest'.
  U: guest [OKAY]
8 S: I'm sorry, I don't think I understood you correctly. Could you please repeat that?
  U: I'm a f****** guest [UP AGAIN]
9 S: Sorry, I still did not understand you.

causes • User’s language does not match system’s input language • Languages match, but recognition problems • Recognition OK, but parsing problems • Parsing OK, but interpretation problems • To Do: a more detailed analysis!

what can one do? • Notify non-understanding • Repeat system prompt • Ask repeat • Ask rephrase • Help: give state specific help • Help: give help about what the user can say • Help: establish the context

what else? • Try an alternative dialog plan to achieve the same goal • including ignore, take default value • Extract more information/content from the non-understanding, and do something smarter with that: • Use fall-back parses on the recognition hypothesis • Explicit confirm turn (Antoine) • Targeted help • Other ideas?

the decision process • Handcraft a policy • Learn it: for instance in a reinforcement learning framework [diagram: true causes, observables / indicators, POLICY, strategies]

markov decision processes • States • Various non-understanding states • 1 understanding state (final) • Actions • Recovery strategies • Rewards • -10 on each transition to a non-understanding state [diagram: states NU1, NU2, NU3 and U; action label Repeat; reward labels -10, -10 and 0]
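The reward structure on this slide can be sketched in a few lines. This is only an illustration, not the system's implementation: it assumes three non-understanding states (NU1 to NU3) plus one final understanding state U, as labeled in the diagram, and it assumes the 0 label in the diagram is the reward for reaching U. The names `reward` and `episode_return` are hypothetical.

```python
# Hypothetical encoding of the MDP on the slide (names are illustrative).
NON_UNDERSTANDING_STATES = ["NU1", "NU2", "NU3"]
FINAL_STATE = "U"
NU_PENALTY = -10  # reward on each transition into a non-understanding state


def reward(next_state: str) -> int:
    """Reward received on entering next_state (assumes 0 for reaching U)."""
    return 0 if next_state == FINAL_STATE else NU_PENALTY


def episode_return(entered_states: list[str]) -> int:
    """Undiscounted sum of rewards over the states entered in one segment."""
    return sum(reward(state) for state in entered_states)


# A segment that enters two non-understanding states before the user is understood.
print(episode_return(["NU1", "NU2", "U"]))  # -20
```

Under this reward, maximizing the return is the same as minimizing the number of non-understandings before the system recovers.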

pros and cons of learning • Cons: • Would a heuristic be good enough? • Is there going to be enough data? • Pros: • Adaptive (different levels) • Harder to devise heuristics with a large number of strategies (~); more justification • Less development effort (?)

better policy or strategies? • Hypothesis: • This set of strategies is sufficient, and a good policy would make a whole lot of difference [diagram: true causes, observables / indicators, POLICY, strategies, with two question marks]

a checkpoint experiment • Run an experiment: • Let a human make the non-understanding recovery decisions • Goal: can we do significantly better than a random policy? (given a fixed set of strategies) • Create a second, higher (“upper-bound”) baseline, and hence a frame for the learning approach • Validating the set of strategies/ “Green light” for concentrating on the policy (?)

experimental design • Goal • How well does random do? Preliminary results • Variables • System / Setup • Participants • Tasks • Potential outcomes, alternatives, discussion

random baseline (preliminary) • 103 sessions (1040 utterances) RoomLine • 274 non-understandings (26.3%) • 172 non-understanding segments • [1 – 6] turns (distribution on next slide) • avg. segment length ~ 1.6 turns • To Do: more stats • Identify trouble spots • Correlation of success to various indicators
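The two summary figures above follow directly from the counts on the slide. A small check, assuming the average segment length is the number of non-understandings divided by the number of segments (that is, every non-understanding belongs to exactly one segment):

```python
# Counts taken from the slide.
utterances = 1040
non_understandings = 274
segments = 172

# Fraction of utterances that were non-understandings.
nu_rate = non_understandings / utterances
# Average number of non-understanding turns per segment (assumed definition).
avg_segment_length = non_understandings / segments

print(f"{nu_rate:.1%}")             # 26.3%
print(f"{avg_segment_length:.1f}")  # 1.6
```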

random baseline (preliminary)

confidence intervals

variables • Independent variable: recovery policy • 2 levels: random and human • 3 levels? expert-designed policy? • Dependent variable: “recovery performance” • Evaluating efficiencies of each strategy • Data requirements are problematic in WoZ condition • Evaluating global, dialog-level metrics • Task completion rates • Various statistics of error segments • To Do: Assess data requirements

variables (2) • Potential confounding variable: response time • Wizard response will be slower (how much so?) • Compensate? • Using distribution of wait times from pilot experiments • Conditions would be consistent, but both different from reality (lowered performance) • Don’t compensate? (it will presumably lower the performance) • Hmm … Other ideas?

system setup • Random condition • RoomLine: current system • Wizard condition: • RoomLine guides all interaction, except for the non-understanding recovery decisions → wizard • Physical setup: all in speech lab, wizard @ rack • noise conditions okay? • Alternative: for random condition, call from home • can be done for both between and within-subjects • are there other confounding variables? (phone line?)

system setup / strategies • Notify non-understanding • Repeat prompt / w. notify • Ask repeat / w. notify • Ask rephrase / w. notify • Help: state dependent / w. notify • Help: you can say / w. notify • Help: full help / w. notify • To Do: add “Alternative plans”
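For reference, a random policy over these seven strategies could look like the sketch below. It assumes a uniform choice among the strategies, which the slides do not state explicitly; the strategy labels are paraphrased from the list above and `random_policy` is a hypothetical name.

```python
import random

# The seven strategies listed on the slide (labels paraphrased).
STRATEGIES = [
    "notify non-understanding",
    "repeat prompt",
    "ask repeat",
    "ask rephrase",
    "help: state dependent",
    "help: you can say",
    "help: full help",
]


def random_policy(rng: random.Random) -> str:
    """Pick a recovery strategy uniformly at random (assumed distribution)."""
    return rng.choice(STRATEGIES)


rng = random.Random(0)  # fixed seed so runs are repeatable
print(random_policy(rng))
```

In the wizard condition, the same decision point would be handed to the human instead of `random_policy`.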

system setup / who is the wizard • Me? • Pros: already familiar with the process • Cons: might already be biased in various ways • does bias matter if I’m trying to do my best? • should I avoid biasing myself? • or should I actively try and “do my homework”? • Someone else? • Cons: will have to train, explain • Multiple wizards? • Would probably be the way to go, but too expensive

system setup / what should the wizard see? • Full Knowledge • audio • recognition results, conf scores, etc • parsing results • non-understanding type • System Knowledge • no audio – only what the system knows • that seems like a hard task for a human

participants / data • ~100 trials / strategy (0.15 conf interval) → ~200 sessions for each condition (this is @ 7 strategies) • Within subjects (?): • 40 users, 5 sessions in each condition (randomized) • Between subjects (?): • 2x20 users, 10 sessions • 20 “random condition”: can they call from home? • System could still have simulated response delay (?) • Balance for gender, computer-savviness (?) • Anything else?
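The slide does not say how the 0.15 confidence interval was derived. As a rough aid for thinking about data requirements, the sketch below computes the full width of a normal-approximation (Wald) interval for a strategy's success rate. Both the choice of interval and the example success rates are assumptions, and `wald_interval_width` is a hypothetical helper.

```python
import math


def wald_interval_width(p: float, n: int, z: float = 1.96) -> float:
    """Full width of the normal-approximation (Wald) interval for a proportion.

    p: assumed success rate of a strategy, n: number of trials,
    z: normal quantile (1.96 corresponds to a 95% interval).
    """
    return 2 * z * math.sqrt(p * (1 - p) / n)


# Width at 100 trials per strategy for two assumed success rates.
for p in (0.2, 0.5):
    print(p, round(wald_interval_width(p, 100), 3))
```

The width shrinks with the square root of the number of trials, so halving the interval takes roughly four times as many trials per strategy.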

tasks • 5/10 scenarios (out of a pool of multiple?) • How does one design those? • Any papers? Any rules? • Use graphical representation? to avoid lexical entrainment • 2 free interactions, 1 @ beginning, 1 @ end • Briefing • Debriefing: SASSI

outcomes /when wizard knows all … • There is a statistically significant improvement • We have a frame for learning • There’s space for improvement given this set of strategies • But: we can’t really claim an upper baseline! • Can use data for further analysis: • correlation of indicators to strategy invocation & success • There is no statistically significant difference • Not guaranteed what that means • Is the set of strategies too inefficient? *** • Are strategies insensitive to conditions? • Is task too complex for a human? (least likely)

outcomes /when wizard knows system • There is a statistically significant improvement • That result is even stronger than before • There is no statistically significant difference • Probably task is inappropriate for a human, but other explanations could be valid, too

most likely plan (*as of before this talk) • wizard has full audio … • i am the wizard … • train myself … • add the alternative plan strategy … • between-subjects experiments …

most likely plan (*as of now)

alternative directions … • Concentrate more on strategies • A comparative experiment to assess the benefits of having more strategies [diagram: true causes, observables / indicators, POLICY, strategies]

alternative directions … • Different approach: • Infer true causes and use a “simple” policy [diagram: observables / indicators, true causes, POLICY, strategies, with a question mark]

conclusion next time …
