Using Reinforcement Learning to Build a Better Model of Dialogue State


Presentation Transcript


  1. Using Reinforcement Learning to Build a Better Model of Dialogue State Joel Tetreault & Diane Litman University of Pittsburgh LRDC April 7, 2006

  2. Problem • Problems with designing spoken dialogue systems: • What features to use? • How to handle noisy data or miscommunications? • Hand-tailoring policies for complex dialogues? • Previous work used machine learning to improve the dialogue manager of spoken dialogue systems [Singh et al., ‘02; Walker, ‘00; Henderson et al., ‘05] • However, very little empirical work on testing the utility of adding specialized features to construct a better dialogue state

  3. Goal • Lots of features can be used to describe the user state, so which ones do you use? • Goal: show that adding more complex features to a state is a worthwhile pursuit, since it alters what actions a system should take • 5 features: certainty, student dialogue move, concept repetition, frustration, student performance • All are important to tutoring systems, but also to dialogue systems in general

  4. Outline • Markov Decision Processes (MDP) • MDP Instantiation • Experimental Method • Results

  5. Markov Decision Processes • What is the best action for an agent to take in any state to maximize the reward at the end? • MDP Input: • States • Actions • Reward Function

  6. MDP Output • Use policy iteration to propagate final reward to the states to determine: • V-value: the worth of each state • Policy: optimal action to take for each state • Values and policies are based on the reward function but also on the probabilities of getting from one state to the next given a certain action
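
To make the policy-iteration step concrete, here is a minimal sketch (not the toolkit the authors used), assuming the MDP is supplied as a transition array P[s, a, s'], an expected-reward array R[s, a], and a discount factor gamma; all names are illustrative.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P: (S, A, S) transition probabilities; R: (S, A) expected rewards."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)        # start from an arbitrary policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi @ V for the current policy
        P_pi = P[np.arange(n_states), policy]     # (S, S) transitions under the policy
        R_pi = R[np.arange(n_states), policy]     # (S,) rewards under the policy
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to the current V-values
        Q = R + gamma * (P @ V)                   # (S, A) action values
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):    # stable policy -> done
            return V, policy                      # worth of each state, optimal action per state
        policy = new_policy
```

The returned V and policy correspond to the two outputs listed on this slide: the worth of each state and the optimal action to take in it.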

  7. What’s the best path to the fly?

  8. MDP Frog Example (figure: a grid of per-step rewards; the final state with the fly is worth +1 and every other hop costs -1)

  9. MDP Frog Example (figure: the same grid with V-values propagated back from the final state)
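
As a worked illustration of how the final reward is propagated, the sketch below assumes a 3x3 pond where every hop costs -1 and the cell with the fly is a terminal state worth +1; the grid size and the fly's position are assumptions, not the slide's exact layout.

```python
ROWS, COLS = 3, 3
GOAL = (0, 2)             # assumed cell containing the fly
STEP_REWARD = -1          # cost of each hop
GOAL_REWARD = +1          # reward for reaching the fly

def propagate_values(n_sweeps=10):
    V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
    for _ in range(n_sweeps):
        new_V = {}
        for (r, c) in V:
            if (r, c) == GOAL:
                new_V[(r, c)] = GOAL_REWARD       # terminal state keeps its reward
                continue
            neighbors = [(r + dr, c + dc)
                         for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                         if 0 <= r + dr < ROWS and 0 <= c + dc < COLS]
            # Best hop: pay the step cost, then collect the neighbor's current value
            new_V[(r, c)] = max(STEP_REWARD + V[n] for n in neighbors)
        V = new_V
    return V   # cells next to the fly settle at 0, farther cells at -1, -2, -3
```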

  10. MDP’s in Spoken Dialogue (figure: the MDP works offline, learning a policy from training data; the dialogue system then uses that policy online in interactions with a user simulator or human users)

  11. ITSPOKE Corpus • 100 dialogues with the ITSPOKE spoken dialogue tutoring system [Litman et al. ’04] • All possible dialogue paths were authored by physics experts • Dialogues informally follow a question-answer format • 50 turns per dialogue on average • Each student session has 5 dialogues, bookended by a pretest and posttest to calculate how much the student learned

  12. Corpus Annotations • Manual annotations: • Tutor and Student Moves (similar to Dialog Acts) [Forbes-Riley et al., ’05] • Frustration and certainty [Litman et al. ’04] [Liscombe et al. ’05] • Automated annotations: • Correctness (based on student’s response to last question) • Concept Repetition (whether a concept is repeated) • %Correctness (past performance)

  13. MDP State Features

  14. MDP Action Choices

  15. MDP Reward Function • Reward Function: use normalized learning gain to do a median split on the corpus: • 10 students are “high learners” and the other 10 are “low learners” • High learner dialogues had a final state with a reward of +100; low learner dialogues had -100
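
A small sketch of that reward assignment, assuming the usual definition of normalized learning gain, (posttest - pretest) / (1 - pretest); the dict-based input format and the function name are hypothetical, not the authors' code.

```python
import statistics

def final_state_rewards(pretest, posttest):
    """pretest/posttest: hypothetical dicts mapping student id -> test score in [0, 1)."""
    # Normalized learning gain: the fraction of possible improvement actually realized
    gain = {s: (posttest[s] - pretest[s]) / (1.0 - pretest[s]) for s in pretest}
    median = statistics.median(gain.values())
    # Median split: students above the median are "high learners" (+100), the rest -100
    return {s: (+100 if g > median else -100) for s, g in gain.items()}
```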

  16. Infrastructure • 1. State Transformer: • Based on RLDS [Singh et al., ’99] • Outputs State-Action probability matrix and reward matrix • 2. MDP Matlab Toolkit (from INRA) to generate policies
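
For intuition about what such a state transformer produces, here is a rough sketch (not the RLDS-based tool itself) that turns logged dialogues, each a sequence of (state, action) pairs plus a final reward, into an empirical transition-probability matrix and a reward matrix; the input format, the crediting of the final reward to the last state-action pair, and all names are assumptions.

```python
import numpy as np

def build_matrices(dialogues, states, actions):
    """dialogues: list of ([(state, action), ...], final_reward) tuples (hypothetical format)."""
    s_idx = {s: i for i, s in enumerate(states)}
    a_idx = {a: i for i, a in enumerate(actions)}
    counts = np.zeros((len(states), len(actions), len(states)))
    rewards = np.zeros((len(states), len(actions)))
    for turns, final_reward in dialogues:
        # Count each observed transition: state --action--> next state
        for (s, a), (s_next, _) in zip(turns, turns[1:]):
            counts[s_idx[s], a_idx[a], s_idx[s_next]] += 1
        # Simplification: credit the dialogue's final reward to its last state-action pair
        last_s, last_a = turns[-1]
        rewards[s_idx[last_s], a_idx[last_a]] += final_reward
    totals = counts.sum(axis=2, keepdims=True)
    P = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return P, rewards   # the shapes the policy-iteration sketch above expects
```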

  17. Methodology • Construct MDP’s to test the inclusion of new state features against a baseline: • Develop baseline state and policy • Add a feature to the baseline and compare policies • A feature is deemed important if adding it results in a change in policy from the baseline policy (“shifts”) • For each MDP: verify policies are reliable (V-value convergence)
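
The "shift" count can be made precise in a few lines. The sketch below assumes policies are dicts from states to actions and that an expanded state (with the extra feature) can be projected back onto its baseline state; the example policies at the bottom are hypothetical, not the paper's.

```python
def count_shifts(baseline_policy, new_policy, project):
    """Count states whose optimal action differs from the action the baseline policy
    prescribes for the corresponding (projected) baseline state."""
    return sum(1 for state, action in new_policy.items()
               if action != baseline_policy[project(state)])

# Hypothetical example: adding a certainty feature to a correctness-only baseline
baseline = {("C",): "Feed", ("I",): "Feed"}
expanded = {("C", "certain"): "NonFeed", ("C", "neutral"): "Feed",
            ("I", "certain"): "NonFeed", ("I", "neutral"): "Mix"}
print(count_shifts(baseline, expanded, project=lambda s: (s[0],)))   # -> 3 shifts
```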

  18. Hypothetical Policy Change Example (figure: two hypothetical policies compared against a baseline, one showing 0 shifts and one showing 5 shifts)

  19. Tests (figure: Baseline 1 = Correctness; Baseline 2 = Baseline 1 + Certainty; then +SMove, +Goal, +Frustration, and +%Correct are each added to Baseline 2)

  20. Baseline • Actions: {Feed, NonFeed, Mix} • Baseline State: {Correctness} • (figure: baseline network in which the states [C] and [I] are linked to each other and to FINAL by transitions labeled F|NF|Mix)

  21. Baseline 1 Policies • Trend: if student correctness is the only model of student state, the best tactic is to always give simple feedback, regardless of the student’s response

  22. But are our policies reliable? • The best way to test would be to run real experiments with human users and the new dialogue manager, but that is months of work • Our tactic: check whether our corpus is large enough to develop reliable policies by seeing if the V-values converge as we add more data to the corpus • Method: run the MDP on subsets of our corpus (incrementally add a student (5 dialogues) to the data, and rerun the MDP on each subset)
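
The convergence check itself is easy to sketch by reusing the hypothetical helpers above: re-estimate the MDP on progressively larger subsets of the corpus, one student at a time, and record the V-values so they can be plotted against corpus size; dialogues_by_student and the helper names are assumptions.

```python
def v_value_trajectory(dialogues_by_student, states, actions, gamma=0.9):
    """dialogues_by_student: hypothetical dict mapping student id -> that student's dialogues."""
    trajectory, subset = [], []
    for student in sorted(dialogues_by_student):
        subset.extend(dialogues_by_student[student])    # add one student (5 dialogues)
        P, R = build_matrices(subset, states, actions)  # sketch from the Infrastructure slide
        V, _ = policy_iteration(P, R, gamma=gamma)      # sketch from the MDP Output slide
        trajectory.append(V)                            # per-state V-values at this corpus size
    return trajectory   # flat curves over the last few additions suggest convergence
```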

  23. Baseline Convergence Plot

  24. Methodology: Adding more Features • Create more complicated baseline by adding certainty feature (new baseline = B2) • Add other 4 features (student moves, concept repetition, frustration, performance) individually to new baseline • Check that V-values converge • Analyze policy changes

  25. Tests (figure repeated from slide 19: Baseline 1 = Correctness; Baseline 2 = Baseline 1 + Certainty; then +SMove, +Goal, +Frustration, and +%Correct are each added to Baseline 2)

  26. Certainty • Previous work (Bhatt et al., ’04) has shown the importance of certainty in ITS • A student who is certain and correct may not need feedback, but a student who is correct yet shows some doubt may be becoming confused, so give more feedback

  27. B2: Baseline + Certainty Policies Trend: if neutral, give Feed or Mix, else give NonFeed

  28. Baseline 1 and 2 Convergence Plots

  29. Tests (figure repeated from slide 19: Baseline 1 = Correctness; Baseline 2 = Baseline 1 + Certainty; then +SMove, +Goal, +Frustration, and +%Correct are each added to Baseline 2)

  30. % Correct Convergence Plots

  31. Student Move Policies (7 changes) • Trend: give Mix if the student move is shallow (S), give NonFeed if Other (O)

  32. Concept Repetition Policies (4 shifts) • Trend: if the concept is repeated (R), give complex or mix feedback

  33. Frustration Policies (4 shifts) • Trend: if the student is frustrated (F), give NonFeed

  34. Percent Correct Policies (3 shifts) • Trend: if the student is a low performer (L), give NonFeed

  35. Discussion • Incorporating more information into the representation of the student state has an impact on tutor policies • Despite not having human or simulated users, we can still claim that our findings are reliable due to the convergence of V-values and policies • Including Certainty, Student Moves, and Concept Repetition effected the most change

  36. Future Work • Developing user simulations and annotating more human-computer experiments to further verify our policies are correct • More data allows us to develop more complicated policies such as • More complex tutor actions (hints, questions) • Combinations of state features • More refined reward functions (PARADISE) • Developing more complex convergence tests

  37. Related Work • [Paek and Chickering, ‘05] • [Singh et al., ‘99] – optimal dialogue length • [Frampton et al., ‘05] – last dialogue act • [Williams et al., ‘03] – automatically generate good state/action sets

  38. Diff Plots • Diff Plot: compare the final policy (20 students) with the policies generated at smaller cuts of the corpus
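
A diff plot of this kind can be computed directly from the per-cut policies; the sketch below assumes each policy is a dict from states to actions, keyed by how many students were included.

```python
def diff_counts(policies_by_cut, final_policy):
    """policies_by_cut: hypothetical dict {n_students: {state: action}}.
    For each cut, count how many states disagree with the final 20-student policy."""
    return {n: sum(1 for state, action in policy.items()
                   if action != final_policy.get(state))
            for n, policy in sorted(policies_by_cut.items())}
```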
