
baselines for recovering from non-understandings

Baselines for recovering from non-understandings: a planned investigation. Dan Bohus, DoD Reading Group, May 2004.


Presentation Transcript


  1. baselines for recovering from non-understandings: a planned investigation. Dan Bohus, DoD Reading Group, May 2004

  2. understanding errors Two main types of errors: • System cannot form a valid interpretation for the user’s input → non-understanding • System forms an incorrect interpretation of the user’s input → misunderstanding

  3. non-understandings • Can be a big pain; there’s not much to act upon… • S: When do you need a room? • U: blah bla blahaha [3 unparsed words] • S: I’m sorry, I didn’t understand you. Can you please repeat that? • What would you do in the system’s place?

  4. an example
     1 S: Ok. I will now make the reservation for Newell Simon thirty three oh five. Could I have your name please? If you are not registered with the system, just say 'guest'.
       U: guest [YES]
     2 S: Sorry, I'm not sure I understood what you said.
       U: guest [DATE]
     3 S: I am currently trying to make a room reservation for Newell Simon thirty three oh five. Right now I need to know your name in order to make a conference room reservation. Tell me your name, or say 'guest' if you are not registered with the system. If you do not want to make this reservation, say 'cancel reservation'.
       U: guest [THE]
     4 S: I apologize, I still did not understand you. Could you try to rephrase that please?
       U: guest [YEAH]
     5 S: Sorry, I didn't catch that.
       U: oh my god guest we're a guest [.WHAT. .YES. .YES. FOR TUESDAY]
     6 S: Newell Simon thirty five oh one is the only room available Tuesday, April twentieth between ten and noon. This room can accommodate seventeen people, and has a projector, a whiteboard and network access. Do you want a reservation for Newell Simon thirty five oh one?
       U: yes [YES]
     7 S: Ok. I will now make the reservation for Newell Simon thirty five oh one. Could I have your name please? If you are not registered with the system, just say 'guest'.
       U: guest [OKAY]
     8 S: I'm sorry, I don't think I understood you correctly. Could you please repeat that?
       U: I'm a f****** guest [UP AGAIN]
     9 S: Sorry, I still did not understand you.

  5. causes • User’s language does not match system’s input language • Languages match, but recognition problems • Recognition OK, but parsing problems • Parsing OK, but interpretation problems

  6. causes • User’s language does not match system’s input language • Languages match, but recognition problems • Recognition OK, but parsing problems • Parsing OK, but interpretation problems • To Do: a more detailed analysis!

  7. what can one do? • Notify non-understanding • Repeat system prompt • Ask repeat • Ask rephrase • Help: give state specific help • Help: give help about what the user can say • Help: establish the context

  8. what else? • Try an alternative dialog plan to achieve the same goal • including ignore, take default value • Extract more information/content from the non-understanding, and do something smarter with that: • Use fall-back parses on the recognition hypothesis • Explicit confirm turn (Antoine) • Targeted help • Other ideas?

  9. the decision process • Handcraft a policy • Learn it: for instance, in a reinforcement learning framework [diagram labels: True causes; Observables / Indicators; POLICY; Strategies]

  10. markov decision processes • States • Various non-understanding states • 1 understanding state (final) • Actions • Recovery strategies • Rewards • -10 on each transition to a non-understanding state [diagram labels: NU1, NU2, NU3, U; Repeat; -10, -10, 0]
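The MDP framing on this slide can be made concrete with a small value-iteration sketch. It keeps the slide's ingredients (several non-understanding states, one final understanding state, recovery strategies as actions, -10 on each transition into a non-understanding state), but everything else is an assumption for illustration: the chain NU1 → NU2 → NU3, the three action names, and all success probabilities are hypothetical and not taken from the talk or from RoomLine data.

```python
# Hypothetical illustration: the transition structure and probabilities
# below are invented, not taken from the talk.
STATES = ["NU1", "NU2", "NU3"]      # non-understanding states
ACTIONS = ["repeat_prompt", "ask_rephrase", "give_help"]  # recovery strategies
NU_REWARD = -10.0                   # reward on each transition into a non-understanding state

# P_SUCCESS[state][action]: assumed probability that the next user turn is
# understood, i.e. that the dialog moves to the final understanding state U.
P_SUCCESS = {
    "NU1": {"repeat_prompt": 0.5, "ask_rephrase": 0.4, "give_help": 0.3},
    "NU2": {"repeat_prompt": 0.3, "ask_rephrase": 0.5, "give_help": 0.4},
    "NU3": {"repeat_prompt": 0.2, "ask_rephrase": 0.3, "give_help": 0.6},
}


def next_nu(state):
    """On failure, move one step down the chain (assumed); NU3 loops on itself."""
    i = STATES.index(state)
    return STATES[min(i + 1, len(STATES) - 1)]


def q_value(state, action, values):
    """Expected return of taking `action` in `state` under the current value estimates."""
    p = P_SUCCESS[state][action]
    # Success: reward 0 and the final state U has value 0.
    # Failure: pay -10 and continue from the next non-understanding state.
    return (1.0 - p) * (NU_REWARD + values[next_nu(state)])


def value_iteration(sweeps=200):
    """Synchronous value iteration over the non-understanding states."""
    values = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        values = {s: max(q_value(s, a, values) for a in ACTIONS) for s in STATES}
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, values)) for s in STATES}
    return values, policy


if __name__ == "__main__":
    values, policy = value_iteration()
    for s in STATES:
        print(s, policy[s], round(values[s], 2))
```

With these made-up numbers the best action differs per state, which is the kind of state-dependent behaviour a learned policy would be expected to capture and a fixed heuristic might miss.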

  11. pros and cons of learning • Cons: • Would a heuristic be good enough? • Is there going to be enough data? • Pros: • Adaptive (different levels) • Harder to devise heuristics with a large number of strategies (~); more justification • Less development effort (?)

  12. better policy or strategies? • Hypothesis: • This set of strategies is sufficient, and a good policy would make a whole lot of difference [diagram labels: True causes; Observables / Indicators; POLICY; Strategies; two elements marked “?”]

  13. a checkpoint experiment • Run an experiment: • Let a human make the non-understanding recovery decisions • Goal: can we do significantly better than a random policy? (given a fixed set of strategies) • Create a second, higher (“upper-bound”) baseline, and hence a frame for the learning approach • Validating the set of strategies / “Green light” for concentrating on the policy (?)

  14. experimental design • Goal • How well does random do? Preliminary results • Variables • System / Setup • Participants • Tasks • Potential outcomes, alternatives, discussion

  15. random baseline (preliminary) • 103 sessions (1040 utterances) RoomLine • 274 non-understandings (26.3%) • 172 non-understanding segments • [1 – 6] turns (distribution on next slide) • avg. segment length ~ 1.6 turns • To Do: more stats • Identify trouble spots • Correlation of success to various indicators
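The percentages on this slide can be rechecked from the raw counts. The short sketch below does that arithmetic; it assumes that the average segment length is the number of non-understandings divided by the number of segments (each non-understanding belonging to exactly one segment), which the slide does not state explicitly. The per-session figure is derived here and does not appear on the slide.

```python
# Counts as reported on the slide.
sessions = 103
utterances = 1040
non_understandings = 274
segments = 172

# Share of utterances that ended in a non-understanding.
nu_rate = non_understandings / utterances

# Assumed definition: every non-understanding belongs to exactly one segment.
avg_segment_length = non_understandings / segments

# Derived figure (not on the slide): non-understandings per session.
nu_per_session = non_understandings / sessions

print(f"non-understanding rate: {nu_rate:.1%}")
print(f"average segment length: {avg_segment_length:.1f} turns")
print(f"non-understandings per session: {nu_per_session:.2f}")
```

Both slide figures (26.3% and roughly 1.6 turns) are consistent with these counts under that assumption.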

  16. random baseline (preliminary)

  17. random baseline (preliminary)

  18. random baseline (preliminary)

  19. random baseline (preliminary)

  20. confidence intervals

  21. experimental design • Goal • How well does random do? Preliminary results • Variables • System / Setup • Participants • Tasks • Potential outcomes, alternatives, discussion

  22. variables • Independent variable: recovery policy • 2 levels: random and human • 3 levels? expert-designed policy? • Dependent variable: “recovery performance” • Evaluating efficiencies of each strategy • Data requirements are problematic in WoZ condition • Evaluating global, dialog-level metrics • Task completion rates • Various statistics of error segments • To Do: Assess data requirements

  23. variables (2) • Potential confounding variable: response time • Wizard response will be slower (how much so?) • Compensate? • Using distribution of wait times from pilot experiments • Conditions would be consistent, but both different from reality (lowered performance) • Don’t compensate? (it will presumably lower the performance) • Hmm … Other ideas?

  24. experimental design • Goal • How well does random do? Preliminary results • Variables • System / Setup • Participants • Tasks • Potential outcomes, alternatives, discussion

  25. system setup • Random condition • RoomLine: current system • Wizard condition: • RoomLine guides all interaction, except for the non-understanding recovery decisions → wizard • Physical setup: all in speech lab, wizard @ rack • noise conditions okay? • Alternative: for random condition, call from home • can be done for both between and within-subjects • are there other confounding variables? (phone line?)

  26. system setup / strategies • Notify non-understanding • Repeat prompt / w. notify • Ask repeat / w. notify • Ask rephrase / w. notify • Help: state dependent / w. notify • Help: you can say / w. notify • Help: full help / w. notify • To Do: add “Alternative plans”
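As a point of reference for the random condition, a random policy over this strategy list could be sketched as follows. The slides do not say how the random choice is distributed, so uniform selection is an assumption here, and the "w. notify" variants are left out for brevity; `random_policy` is a hypothetical name, not a RoomLine function.

```python
import random

# The seven strategies listed on the slide (without the "w. notify" variants).
STRATEGIES = [
    "notify non-understanding",
    "repeat prompt",
    "ask repeat",
    "ask rephrase",
    "help: state dependent",
    "help: you can say",
    "help: full help",
]


def random_policy(rng=random):
    """Pick a recovery strategy uniformly at random (assumed), ignoring dialog state."""
    return rng.choice(STRATEGIES)


if __name__ == "__main__":
    rng = random.Random(2004)  # fixed seed so a run can be repeated
    print([random_policy(rng) for _ in range(5)])
```

In the wizard condition, the same decision point would be handed to a human instead of this function.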

  27. system setup / who is the wizard • Me? • Pros: already familiar with the process • Cons: might already be biased in various ways • does bias matter if I’m trying to do my best? • should I avoid biasing myself? • or should I actively try and “do my homework”? • Someone else? • Cons: will have to train, explain • Multiple wizards? • Would probably be the way to go, but too expensive

  28. system setup / what should the wizard see? • Full Knowledge • audio • recognition results, conf scores, etc • parsing results • non-understanding type • System Knowledge • no audio – only what the system knows • that seems like a hard task for a human

  29. experimental design • Goal • How well does random do? Preliminary results • Variables • System / Setup • Participants • Tasks • Potential outcomes, alternatives, discussion

  30. participants / data • ~100 trials / strategy (0.15 conf interval) → ~200 sessions for each condition (this is @ 7 strategies) • Within subjects (?): • 40 users, 5 sessions in each condition (randomized) • Between subjects (?): • 2x20 users, 10 sessions • 20 “random condition”: can they call from home? • System could still have simulated response delay (?) • Balance for gender, computer-savviness (?) • Anything else?
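To see how the number of trials per strategy drives the width of a confidence interval on a strategy's success rate, one can use the normal-approximation (Wald) interval for a proportion. This is only a sketch: the slide does not say which confidence level, success rate, or interval formula lies behind the "0.15" figure, so the function below is not claimed to reproduce it, and `wald_half_width` is a hypothetical helper name.

```python
import math


def wald_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion.

    p: assumed success rate, n: number of trials,
    z: normal quantile (1.96 corresponds to a 95% interval).
    """
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Worst case p = 0.5; smaller or larger p gives a narrower interval.
    for n in (50, 100, 200):
        print(n, round(wald_half_width(0.5, n), 3))
```

Because the half-width shrinks with the square root of `n`, halving the interval takes roughly four times as many trials, which is why the data requirements grow quickly with the number of strategies.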

  31. experimental design • Goal • How well does random do? Preliminary results • Variables • System / Setup • Participants • Tasks • Potential outcomes, alternatives, discussion

  32. tasks • 5/10 scenarios (out of a pool of multiple?) • How does one design those? • Any papers? Any rules? • Use graphical representation? to avoid lexical entrainment • 2 free interactions, 1 @ beginning, 1 @ end • Briefing • Debriefing: SASSI

  33. experimental design • Goal • How well does random do? Preliminary results • Variables • System / Setup • Participants • Tasks • Potential outcomes, alternatives, discussion

  34. outcomes / when wizard knows all … • There is a statistically significant improvement • We have a frame for learning • There’s space for improvement given this set of strategies • But: we can’t really claim an upper baseline! • Can use data for further analysis: • correlation of indicators to strategy invocation & success • There is no statistically significant difference • Not guaranteed what that means • Is the set of strategies too inefficient? *** • Are strategies insensitive to conditions? • Is task too complex for a human? (least likely)

  35. outcomes / when wizard knows system • There is a statistically significant improvement • That result is even stronger than before • There is no statistically significant difference • Probably task is inappropriate for a human, but other explanations could be valid, too

  36. most likely plan (*as of before this talk) • wizard has full audio … • i am the wizard … • train myself … • add the alternative plan strategy … • between-subjects experiments …

  37. most likely plan (*as of now)

  38. alternative directions … • Concentrate more on strategies • A comparative experiment to assess the benefits of having more strategies [diagram labels: True causes; Observables / Indicators; POLICY; Strategies]

  39. alternative directions … • Different approach: • Infer true causes and use a “simple” policy [diagram labels: Observables / Indicators; True causes; POLICY; Strategies; one element marked “?”]

  40. conclusion next time …
