Using Reinforcement Learning to Build a Better Model of Dialogue State

Joel Tetreault & Diane Litman

University of Pittsburgh

LRDC

April 7, 2006


Problem

  • Problems with designing spoken dialogue systems:

    • What features to use?

    • How to handle noisy data or miscommunications?

    • Hand-tailoring policies for complex dialogues?

  • Previous work used machine learning to improve the dialogue manager of spoken dialogue systems [Singh et al., ‘02; Walker, ‘00; Henderson et al., ‘05]

  • However, very little empirical work on testing the utility of adding specialized features to construct a better dialogue state


Goal

  • Many features can be used to describe the user state; which ones do you use?

  • Goal: show that adding more complex features to the state is a worthwhile pursuit, since it alters what actions a system should take

  • 5 features: certainty, student dialogue move, concept repetition, frustration, student performance

  • All are important to tutoring systems, but also are important to dialogue systems in general


Outline

  • Markov Decision Processes (MDP)

  • MDP Instantiation

  • Experimental Method

  • Results


Markov Decision Processes

  • What is the best action for an agent to take in any state to maximize reward at the end?

  • MDP Input:

    • States

    • Actions

    • Reward Function


MDP Output

  • Use policy iteration to propagate final reward to the states to determine:

    • V-value: the worth of each state

    • Policy: optimal action to take for each state

  • Values and policies are based on the reward function but also on the probabilities of getting from one state to the next given a certain action
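As a rough illustration of what this step computes, here is a minimal policy-iteration sketch in Python (the actual experiments used a Matlab MDP toolkit, described later); the transition probabilities P, per-state rewards R, and discount factor gamma are placeholders rather than values estimated from the corpus.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iters=100):
    """P: (A, S, S) transition probabilities; R: (S,) per-state rewards."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)            # start from an arbitrary policy
    for _ in range(max_iters):
        # Policy evaluation: solve V = R + gamma * P_pi V exactly for the current policy
        P_pi = P[policy, np.arange(n_states), :]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: act greedily with respect to the current V-values
        Q = R[:, None] + gamma * np.einsum('ast,t->sa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):         # policy stable: V and policy are final
            break
        policy = new_policy
    return V, policy
```

The returned V gives the worth of each state and the returned policy gives the optimal action per state, exactly the two outputs listed above.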



MDP Frog Example

[Figure: grid of hops leading to the final state. The final state has reward +1; every other hop has an immediate reward of -1.]


MDP Frog Example (continued)

[Figure: the same grid after the final reward is propagated back; cells now show V-values such as 0, -1, -2, and -3, decreasing with distance from the final state.]
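The grid layout of the frog example does not survive in this transcript, so as a stand-in, here is a hypothetical one-dimensional version of the same idea: each hop costs -1, the final lily pad is worth +1, and backing the rewards up from the goal produces V-values that fall off with distance.

```python
# Hypothetical chain version of the frog example: 4 hops to the goal state.
step_reward, final_reward = -1, +1
n_states = 5
V = [0] * n_states
V[-1] = final_reward                         # the final state is worth +1
for s in range(n_states - 2, -1, -1):        # backward induction from the goal
    V[s] = step_reward + V[s + 1]            # each hop adds its -1 cost
print(V)                                     # [-3, -2, -1, 0, 1]
```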


MDP’s in Spoken Dialogue

[Figure: the MDP is trained offline on training data to produce a policy; the resulting dialogue system then interacts online with a user simulator or with human users.]


ITSPOKE Corpus

  • 100 dialogues with ITSPOKE spoken dialogue tutoring system [Litman et al. ’04]

    • All possible dialogue paths were authored by physics experts

    • Dialogues informally follow question-answer format

    • 50 turns per dialogue on average

  • Each student session has 5 dialogues, bookended by a pretest and posttest to calculate how much the student learned


Corpus Annotations

  • Manual annotations:

    • Tutor and Student Moves (similar to Dialog Acts) [Forbes-Riley et al., ’05]

    • Frustration and certainty [Litman et al. ’04] [Liscombe et al. ’05]

  • Automated annotations:

    • Correctness (based on student’s response to last question)

    • Concept Repetition (whether a concept is repeated)

    • %Correctness (past performance)


MDP State Features
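The feature table on this slide is not reproduced in the transcript. As a purely hypothetical illustration of how such a state could be encoded, the sketch below combines the baseline correctness feature with the five features named earlier; the value labels are inferred from the policy-trend slides where possible and invented otherwise.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DialogueState:
    correctness: str         # "correct" | "incorrect"
    certainty: str           # e.g. "certain" | "neutral" | "uncertain"
    student_move: str        # e.g. "shallow" | "other"
    concept_repetition: str  # "new" | "repeated"
    frustration: str         # "frustrated" | "neutral"
    percent_correct: str     # "high" | "low"

# Example: a correct but hesitant answer on a repeated concept.
state = DialogueState("correct", "uncertain", "shallow", "repeated", "neutral", "high")
```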


MDP Action Choices


MDP Reward Function

  • Reward Function: use normalized learning gain to do a median split on corpus:

  • 10 students are “high learners” and the other 10 are “low learners”

  • High-learner dialogues were given a final-state reward of +100; low-learner dialogues were given -100
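A minimal sketch of this reward assignment, assuming the usual definition of normalized learning gain, (posttest - pretest) / (1 - pretest), and invented test scores (the real corpus has 20 students):

```python
# Hypothetical pretest/posttest scores on a 0..1 scale, one pair per student.
scores = {"s01": (0.40, 0.80), "s02": (0.55, 0.60),
          "s03": (0.30, 0.75), "s04": (0.50, 0.55)}

# Normalized learning gain: fraction of the possible improvement actually realized.
nlg = {s: (post - pre) / (1.0 - pre) for s, (pre, post) in scores.items()}

# Median split: the upper half are "high learners" (+100 final reward), the rest get -100.
cutoff = sorted(nlg.values())[len(nlg) // 2]
final_reward = {s: +100 if gain >= cutoff else -100 for s, gain in nlg.items()}
```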


Infrastructure

  • 1. State Transformer:

    • Based on RLDS [Singh et al., ’99]

    • Outputs the state-action probability matrix and the reward matrix (see the sketch after this list)

  • 2. MDP Matlab Toolkit (from INRA) to generate policies
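A rough sketch of the kind of matrix the state transformer produces: count state-action-to-next-state transitions over the annotated dialogues and normalize them into probabilities (the dialogue log format here is invented for illustration).

```python
from collections import Counter, defaultdict

def transition_matrix(dialogues):
    """dialogues: list of sequences [(state, action), ..., "FINAL"] (hypothetical format).
    Returns P[(state, action)][next_state] as an estimated probability."""
    counts = defaultdict(Counter)
    for turns in dialogues:
        for (state, action), nxt in zip(turns[:-1], turns[1:]):
            next_state = nxt[0] if isinstance(nxt, tuple) else nxt
            counts[(state, action)][next_state] += 1
    return {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
            for sa, c in counts.items()}

# Toy dialogue: correct answer with feedback, then incorrect with mixed feedback, then final.
print(transition_matrix([[("C", "Feed"), ("I", "Mix"), "FINAL"]]))
```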


Methodology

  • Construct MDP’s to test the inclusion of new state features to a baseline:

    • Develop baseline state and policy

    • Add a feature to the baseline and compare policies

    • A feature is deemed important if adding it results in a change in policy from the baseline policy (“shifts”); see the sketch after this list

  • For each MDP: verify policies are reliable (V-value convergence)
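A minimal sketch of the shift count mentioned above: project each extended state onto its baseline portion and count the states whose recommended action changed (all state and action names below are hypothetical).

```python
def count_shifts(baseline_policy, new_policy):
    """baseline_policy: action per baseline state, e.g. {"C": "Feed"}.
    new_policy: action per extended state, keyed by (baseline_state, extra_feature)."""
    shifts = 0
    for (base_state, _extra), action in new_policy.items():
        if action != baseline_policy[base_state]:
            shifts += 1
    return shifts

baseline = {"C": "Feed", "I": "Feed"}
with_certainty = {("C", "certain"): "Feed", ("C", "neutral"): "Mix",
                  ("I", "certain"): "NonFeed", ("I", "neutral"): "Feed"}
print(count_shifts(baseline, with_certainty))   # 2 shifts
```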


Hypothetical Policy Change Example

[Figure: two hypothetical comparisons of a baseline policy with a new policy, one showing 0 shifts and one showing 5 shifts.]


Tests

[Figure: test configurations. Baseline 1 = {Correctness}; Baseline 2 = Baseline 1 + Certainty; Baseline 2 is then extended separately with +SMove, +Goal, +Frustration, and +%Correct.]


Baseline

  • Actions: {Feed, NonFeed, Mix}

  • Baseline State: {Correctness}

Baseline network

[Figure: baseline MDP network. Student answers lead to correct [C] or incorrect [I] states; each transition offers the actions F | NF | Mix, and the dialogue eventually reaches a FINAL state.]


Baseline 1 Policies

  • Trend: if student correctness is the only model of the student state, then regardless of the student’s response, the best tactic is to always give simple feedback


But Are Our Policies Reliable?

  • The best way to test is to run real experiments with human users and the new dialogue manager, but that is months of work

  • Our tack: check whether our corpus is large enough to develop reliable policies by seeing if V-values converge as we add more data to the corpus

  • Method: run the MDP on subsets of our corpus (incrementally add one student (5 dialogues) at a time and rerun the MDP on each subset)
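A sketch of that convergence check, assuming a hypothetical helper train_mdp(dialogues) that returns the V-values for a corpus subset (the real pipeline used the state transformer and the Matlab toolkit):

```python
def convergence_curve(students, train_mdp):
    """students: per-student lists of dialogues, in the order they are added to the corpus.
    train_mdp: hypothetical helper returning {state: V-value} for a set of dialogues."""
    curve, previous = [], None
    for cut in range(1, len(students) + 1):
        subset = [d for s in students[:cut] for d in s]      # first `cut` students' dialogues
        values = train_mdp(subset)
        if previous is not None:
            # Largest change in any state's V-value relative to the previous cut;
            # small changes at the later cuts suggest the corpus is large enough.
            diff = max(abs(values.get(k, 0) - previous.get(k, 0))
                       for k in set(values) | set(previous))
            curve.append(diff)
        previous = values
    return curve
```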



Methodology: Adding More Features

  • Create more complicated baseline by adding certainty feature (new baseline = B2)

  • Add other 4 features (student moves, concept repetition, frustration, performance) individually to new baseline

  • Check that V-values converge

  • Analyze policy changes


Tests

[Figure: the test-configuration diagram repeated.]


Certainty

  • Previous work (Bhatt et al., ’04) has shown the importance of certainty in intelligent tutoring systems

  • A student who is certain and correct may not need feedback, but a student who is correct yet shows some doubt may be becoming confused, so give more feedback


B2: Baseline + Certainty Policies

Trend: if neutral, give Feed or Mix, else give NonFeed



Tests

[Figure: the test-configuration diagram repeated.]



Student Move Policies (7 shifts)

Trend: give Mix if the student move is shallow (S), give NonFeed if other (O)


Concept Repetition Policies (4 shifts)

Trend: if the concept is repeated (R), give complex or mix feedback


Frustration Policies (4 shifts)

Trend: if the student is frustrated (F), give NonFeed


Percent Correct Policies (3 shifts)

Trend: if the student is a low performer (L), give NonFeed


Discussion

  • Incorporating more information into a representation of the student state has an impact on tutor policies

  • Despite not having human or simulated users, we can still claim that our findings are reliable due to the convergence of V-values and policies

  • Including Certainty, Student Moves and Concept Repetition effected the most change


Future Work

  • Developing user simulations and annotating more human-computer experiments to further verify our policies are correct

  • More data allows us to develop more complicated policies such as

    • More complex tutor actions (hints, questions)

    • Combinations of state features

    • More refined reward functions (PARADISE)

  • Developing more complex convergence tests


Related Work

  • [Paek and Chickering, ‘05]

  • [Singh et al., ‘99] – optimal dialogue length

  • [Frampton et al., ‘05] – last dialogue act

  • [Williams et al., ‘03] – automatically generate good state/action sets


Diff Plots

Diff Plot: compare final policy (20 students) with policies generated at smaller cuts

