Learning Optimal Strategies for Spoken Dialogue Systems

Learning Optimal Strategies for Spoken Dialogue Systems Diane Litman University of Pittsburgh Pittsburgh, PA 15260 USA ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Outline • Motivation • Markov Decision Processes and Reinforcement Learning • NJFun: A Case Study • Advanced Topics ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Motivation • Builders of real-time spoken dialogue systems face fundamental design choices that strongly influence system performance • when to confirm/reject/clarify what the user just said? • when to ask a directive versus open prompt? • when to use user, system, or mixed initiative? • when to provide positive/negative/no feedback? • etc. • Can such decisions be automatically optimized via reinforcement learning? ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Spoken Dialogue Systems (SDS) • Provide voice access to back-end via telephone or microphone • Front-end: ASR (automatic speech recognition) and TTS (text to speech) • Back-end: DB, web, etc. • Middle: dialogue policy(what action to take at each point in a dialogue) ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Speech Recognition Text to Speech Typical SDS Architecture LanguageUnderstanding Dialogue Policy Domain Back-end Language Generation ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Reinforcement Learning (RL) • Learning is associated with a reward • By optimizing reward, algorithm learns optimal strategy • Application to SDS • Key assumption: SDS can be represented as a Markov Decision Process • Key benefit: Formalization (when in a state, what is the reward for taking a particular action, among all action choices?) ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Speech Recognition Speech Synthesis Reinforcement Learning and SDS LanguageUnderstanding noisy semantic input • debate over design choices • learn choices using reinforcement learning • agent interacting with an environment • noisy inputs • temporal / sequential aspect • task success / failure Dialogue Manager Domain Back-end actions (semantic output) Language Generation ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Sample Research Questions • Which aspects of dialogue management are amenable to learning and what reward functions are needed? • What representation of the dialogue state best serves this learning? • What reinforcement learningmethods are tractable with large scale dialogue systems? ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Markov Decision Processes (MDP) • Characterized by: • a set of states S an agent can be in • a set of actions A the agent can take • A reward r(a,s) that the agent receives for taking an action in a state • (+ Some other things I’ll come back to (gamma, state transition probabilities)) ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Modeling a Spoken Dialogue System as a Probabilistic Agent • A SDS can be characterized by: • The current knowledge of the system • A set of statesS the agent can be in • a set of actionsA the agent can take • A goalG, which implies • A success metric that tells us how well the agent achieved its goal • A way of using this metric to create a strategy or policy for what action to take in any particular state. ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Reinforcement Learning • The agent interacts with its environment to achieve a goal • It receives reward (possibly delayed reward) for its actions • it is not told what actions to take • instead, it learns from indirect, potentially delayed reward, to choose sequences of actions that produce the greatest cumulative reward • Trial-and-error search • neither exploitation nor exploration can be pursued exclusively without failing at the task • Life-long learning • on-going exploration ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

 ReinforcementLearning Policy  : S  A state reward action a0 a1 a2 . . . s0 s1 s2 r0 r1 r2 ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

State Value Function, V V(s) predicts the future total reward we can obtain by entering state s p(s0, a1, s1) = 0.7 s1 r(s0, a1) = 2  can exploit V greedily, i.e. in s, choose action a for which the following is largest: p(s0, a1, s2) = 0.3 s2 s0 p(s0, a2, s2) = 0.5 r(s0, a2) = 5 s3 p(s0, a2, s3) = 0.5 Choosing a1: 2 + 0.7 × 10 + 0.3 × 15 = 13.5 Choosing a2: 5 + 0.5 × 15 + 0.5 × 6 = 15.5 ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Action Value Function, Q Q(s, a) predicts the future total reward we can obtain by executing a in s  can exploit Q greedily, i.e. in s, choose action a for which Q(s, a) is largest s0 ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

One-step temporal difference update rule, TD(0) Q Learning Exploration versus exploitation For each (s, a), initialise Q(s, a) arbitrarily Observe current state, s Do until reach goal state Select action a by exploiting Q ε-greedily, i.e. with probability ε, choose a randomly; else choose the a for which Q(s, a) is largest Execute a, entering state s’ and receiving immediate reward r Update the table entry for Q(s, a) s  s’ Watkins 1989 ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

More on Q Learning s a Q(s, a) r s’ a’ Q(s’, a’) ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

A Brief Tutorial Example • A Day-and-Month dialogue system • Goal: fill in a two-slot frame: • Month: November • Day: 12th • Via the shortest possible interaction with user • Levin, E., Pieraccini, R. and Eckert, W. A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies. IEEE Transactions on Speech and Audio Processing. 2000. ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

What is a State? • In principle, MDP state could include any possible information about dialogue • Complete dialogue history so far • Usually use a much more limited set • Values of slots in current frame • Most recent question asked to user • Users most recent answer • ASR confidence • etc ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

State in the Day-and-Month Example • Values of the two slots day and month. • Total: • 2 special initial state si and sf. • 365 states with a day and month • 1 state for leap year • 12 states with a month but no day • 31 states with a day but no month • 411 total states ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Actions in MDP Models of Dialogue • Speech acts! • Ask a question • Explicit confirmation • Rejection • Give the user some database information • Tell the user their choices • Do a database query ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Actions in the Day-and-Month Example • ad: a question asking for the day • am: a question asking for the month • adm: a question asking for the day+month • af: a final action submitting the form and terminating the dialogue ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

A Simple Reward Function • For this example, let’s use a cost function for the entire dialogue • Let • Ni=number of interactions (duration of dialogue) • Ne=number of errors in the obtained values (0-2) • Nf=expected distance from goal • (0 for complete date, 1 if either data or month are missing, 2 if both missing) • Then (weighted) cost is: • C = wiNi + weNe + wfNf ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

3 Possible Policies Dumb P1=probability of error in open prompt Open prompt P2=probability of error in directive prompt Directive prompt ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

3 Possible Policies Strategy 3 is better than strategy 2 when improved error rate justifies longer interaction: P1=probability of error in open prompt OPEN P2=probability of error in directive prompt DIRECTIVE ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

That was an Easy Optimization • Only two actions, only tiny # of policies • In general, number of actions, states, policies is quite large • So finding optimal policy is harder • We need reinforcement learning • Back to MDPs: ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

MDP • We can think of a dialogue as a trajectory in state space • The best policy is the one with the greatest expected reward over all trajectories • How to compute a reward for a state sequence? ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Reward for a State Sequence • One common approach: discounted rewards • Cumulative reward Q of a sequence is discounted sum of utilities of individual states • Discount factor  between 0 and 1 • Makes agent care more about current than future rewards; the more future a reward, the more discounted its value ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

The Markov Assumption • MDP assumes that state transitions are Markovian ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Expected Reward for an Action • Expected cumulative reward Q(s,a) for taking a particular action from a particular state can be computed by Bellman equation: • immediate reward for current state • + expected discounted utility of all possible next states s’ • weighted by probability of moving to that state s’ • and assuming once there we take optimal action a’ ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Needed for Bellman Equation • A model of p(s’|s,a) and estimate of R(s,a) • If we had labeled training data • P(s’|s,a) = C(s,s’,a)/C(s,a) • If we knew the final reward for whole dialogue R(s1,a1,s2,a2,…,sn) • Given these parameters, can use value iteration algorithm to learn Q values (pushing back reward values over state sequences) and hence best policy ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

How to estimate p(s’|s,a) without labeled data • Random conversations with real people • Carefully hand-tune small number of states and policies • Build a dialogue system which explores state space by generating a few hundred random conversations with real humans • Set probabilities from this corpus • Random conversations with simulated people • Have millions of conversations with simulated people • Can have a slightly larger state space ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Final Reward • What is the final reward for whole dialogue R(s1,a1,s2,a2,…,sn)? • The general goodness of a whole dialogue! • Can be automatically computed using PARADISE evaluation framework (more later) ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

NJFun: A Case Study • Idea: use reinforcement learning to make a small setof optimal policydecisions while interacting with real people, and empirically evaluate • Singh, S., D. Litman, M. Kearns, and M. Walker. Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System. Journal of AI Research. 2002. ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

NJFun (aka RLDS) Dialogues Task Description:"You have lived in Stanhope for many years but have never managed to visit its historic sites. Today you are feeling virtuous. Use RLDS to find out what you can see this morning.” Unoptimized System Optimized System TTS: Welcome to RLDS. Please say an activity name or say 'list activities' for a list of activities I know about. USER: I would like to visit a historic site in Stanhope this morning. TTS: Did you say you are interested in going to a zoo? USER: No. TTS: I know about amusement parks, aquariums, cruises, historic sites, museums, parks, theaters, wineries, and zoos. Please say an activity name from this list. USER: I would like to visit a historic site. TTS: Did you say you are interested in going to a historic site? USER: Yes. TTS: Please give me more information. USER: ... TTS: Please tell me the location that you are interested in. You can also tell me the time. …. (8 turns omitted) TTS: Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'. USER: Bad. TTS: Goodbye and have a nice day! TTS: Welcome to RLDS. How may I help you? USER: I would like to um find about Stanhope historic sites. TTS: Did you say you are interested in Stanhope? USER: Yes. TTS: What time of the day do you want to go? USER: Uh...morning. TTS: Did you say you want to go in the morning? USER: Yes. TTS: I found 2 historic sites in Stanhope that are open in the morning. They are the Lenape Indian Village, and the Waterloo Village. Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'. USER: Good. TTS: Goodbye and have a nice day! ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Some Issues in Automatic Speech Recognition (ASR) • Inputs: audio file; grammar/language model; acoustic model • Outputs: utterance matched from grammar, or no match; confidence score • Performance tradeoff: • “small” grammar --> high accuracy on constrained utterances, lots of no-matches • “large” grammar --> match more utterances, but with lower confidence ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Some Issues in Dialogue Policy Design • Initiative policy • Confirmation policy • Criteria to be optimized ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Initiative Policy • System initiative vs. user initiative: • “Please state your departure city.” • “How can I help you?” • Influences expectations • ASR grammar must be chosen accordingly • Best choice may differ from state to state • May depend on user population & task ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Confirmation Policy • High ASR confidence: accept ASR match and move on • Moderate ASR confidence: confirm • Low ASR confidence: re-ask • How to set confidence thresholds? • Early mistakes can be costly later, but excessive confirmation is annoying ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Criteria to be Optimized • Task completion • Sales revenues • User satisfaction • ASR performance • Number of turns ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Typical System Design: Sequential Search • Choose and implement several “reasonable” dialogue policies • Field systems, gather dialogue data • Do statistical analyses • Refield system with “best” dialogue policy • Can only examine a handful of policies ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Why Reinforcement Learning? • Agents can learn to improve performance by interacting with their environment • Thousands of possible dialogue policies, and want to automate the choice of the “optimal” • Can handle many features of spoken dialogue • noisy sensors (ASR output) • stochastic behavior (user population) • delayed rewards, and many possible rewards • multiple plausible actions • However, many practical challenges remain ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Proposed Approach • Build initial system that is deliberately exploratory wrt state and action space • Use dialogue data from initial system to build a Markov decision process(MDP) • Use methods of reinforcement learning to compute optimal policy (here, dialogue policy) of the MDP • Refield (improved?) system given by the optimal policy • Empirically evaluate ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

State-Based Design • System state: contains information relevant for deciding the next action • info attributes perceived so far • individual and average ASR confidences • data on particular user • etc. • In practice, need a compressed state • Dialogue policy: mapping from each state in the state space to a system action ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Markov Decision Processes • System state s (in S) • System action a in (in A) • Transition probabilities P(s’|s,a) • Reward functionR(s,a) (stochastic) • Our application: P(s’|s,a) models the population of users ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

SDSs as MDPs Initial system utterance Initial user utterance Actions have prob. outcomes + system logs a e a e a e ... 1 1 2 2 3 3 estimate transition probabilities... P(next state | current state & action) ...and rewards... R(current state, action) ...from set ofexploratorydialogues (random action choice) Violations of Markov property! Will this work? ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Computing the Optimal • Given parameters P(s’|s,a), R(s,a), can efficiently compute policy maximizing expected return • Typically compute the expected cumulative reward (or Q-value) Q(s,a), using value iteration • Optimal policy selects the action with the maximum Q-value at each dialogue state ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Potential Benefits • A principled and general framework for automated dialogue policy synthesis • learn the optimal action to take in each state • Compares all policies simultaneously • data efficient because actions are evaluated as a function of state • traditional methods evaluate entire policies • Potential for “lifelong learning” systems, adapting to changing user populations ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

The Application: NJFun • Dialogue system providing telephone access to a DB of activities in NJ • Want to obtain 3 attributes: • activity type (e.g., wine tasting) • location (e.g., Lambertville) • time (e.g., morning) • Failure to bind an attribute: query DB with don’t-care ACL/HCSNet Advanced Program In Natural Language Processing (University of Melbourne)

Learning Optimal Strategies for Spoken Dialogue Systems

Learning Optimal Strategies for Spoken Dialogue Systems

Presentation Transcript

Spoken Dialogue Systems

Spoken Dialogue Systems

Spoken Dialogue Systems: System Overview

Spoken Dialogue Systems

Spoken Dialogue Systems A Tutorial

Building Spoken Dialogue Systems for Embodied Agents

Evaluating Spoken Dialogue Systems

Discourse Annotation for Improving Spoken Dialogue Systems

Spoken Dialogue Systems

User Simulation for Spoken Dialogue Systems

Discourse Annotation for Improving Spoken Dialogue Systems

Spoken Dialogue Systems

Learning, Adaptation and Personalization in Spoken Dialogue Systems

Components of Spoken Dialogue Systems

Spoken Dialogue Systems

Spoken Dialogue Systems and the Learning Sciences

Spoken Dialogue Systems

Spoken Dialogue Systems

Spoken Dialogue Systems: Human and Machine

Spoken Dialogue Systems