
Using Reinforcement Learning to Build a Better Model of Dialogue State

Joel Tetreault & Diane Litman

University of Pittsburgh

LRDC

April 7, 2006

Problem
  • Problems with designing spoken dialogue systems:
    • What features to use?
    • How to handle noisy data or miscommunications?
    • How to avoid hand-tailoring policies for complex dialogues?
  • Previous work used machine learning to improve the dialogue manager of spoken dialogue systems [Singh et al., ‘02; Walker, ‘00; Henderson et al., ‘05]
  • However, there has been very little empirical work on testing the utility of adding specialized features to construct a better dialogue state
Goal
  • Many features can be used to describe the user state; which ones do you use?
  • Goal: show that adding more complex features to a state is a worthwhile pursuit, since it alters what actions a system should take
  • 5 features: certainty, student dialogue move, concept repetition, frustration, student performance
  • All are important to tutoring systems, but also to dialogue systems in general
Outline
  • Markov Decision Processes (MDP)
  • MDP Instantiation
  • Experimental Method
  • Results
Markov Decision Processes
  • What is the best action for an agent to take in any state to maximize its reward at the end?
  • MDP Input:
    • States
    • Actions
    • Reward Function
MDP Output
  • Use policy iteration to propagate the final reward back through the states to determine:
    • V-value: the worth of each state
    • Policy: optimal action to take for each state
  • Values and policies are based on the reward function, but also on the probabilities of getting from one state to the next given a certain action; a policy iteration sketch follows this list
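To make the input and output concrete, here is a minimal policy iteration sketch on a toy three-state MDP. The transition probabilities, rewards, and discount factor are invented for illustration and are not the ITSPOKE values:

```python
import numpy as np

# Toy MDP: 3 states, 2 actions. P[a][s][s'] is the transition probability,
# R[s] the reward collected in state s. All numbers are made up.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],  # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([0.0, 0.0, 1.0])  # only the final state pays off
gamma = 0.9                    # discount factor

policy = np.zeros(3, dtype=int)  # start from an arbitrary policy
while True:
    # Policy evaluation: solve V = R + gamma * P_pi @ V exactly
    P_pi = P[policy, np.arange(3)]
    V = np.linalg.solve(np.eye(3) - gamma * P_pi, R)
    # Policy improvement: act greedily with respect to V
    Q = R + gamma * P @ V          # Q[a, s] for every action/state pair
    new_policy = Q.argmax(axis=0)
    if np.array_equal(new_policy, policy):
        break                      # policy is stable, hence optimal
    policy = new_policy

print("V-values:", V, "policy:", policy)
```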
MDP Frog Example

[Figure, shown on two slides: a frog hops across a grid toward a final state worth +1 while every other square costs -1; the second slide shows the V-values (0, -1, -2, -3) that propagate back from the final state. The idea is reproduced in the sketch below.]
MDPs in Spoken Dialogue

[Diagram: the MDP works offline, turning training data into a policy; the dialogue system then applies that policy online in interactions with a user simulator or a human user.]
ITSPOKE Corpus
  • 100 dialogues with the ITSPOKE spoken dialogue tutoring system [Litman et al. ’04]
    • All possible dialogue paths were authored by physics experts
    • Dialogues informally follow question-answer format
    • 50 turns per dialogue on average
  • Each student session has 5 dialogues, bookended by a pretest and posttest to calculate how much the student learned
Corpus Annotations
  • Manual annotations:
    • Tutor and Student Moves (similar to Dialog Acts) [Forbes-Riley et al., ’05]
    • Frustration and certainty [Litman et al. ’04] [Liscombe et al. ’05]
  • Automated annotations:
    • Correctness (based on student’s response to last question)
    • Concept Repetition (whether a concept is repeated)
    • %Correctness (past performance)
MDP Reward Function
  • Reward function: use normalized learning gain to do a median split on the corpus
  • 10 students are “high learners” and the other 10 are “low learners”
  • High-learner dialogues were given a final-state reward of +100; low-learner dialogues, -100 (see the sketch below)
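A rough sketch of the median split; the gain values and the normalized-gain formula are illustrative assumptions, and only the +100/-100 scheme comes from the slides:

```python
import statistics

# Hypothetical normalized learning gains, one per student; a common
# definition is gain = (posttest - pretest) / (1 - pretest).
gains = {"s01": 0.42, "s02": 0.10, "s03": 0.65, "s04": 0.28}

median = statistics.median(gains.values())
# Final-state reward from the slides: +100 for high learners, -100 for low
reward = {s: (100 if g >= median else -100) for s, g in gains.items()}
print(reward)  # {'s01': 100, 's02': -100, 's03': 100, 's04': -100}
```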
Infrastructure
  • 1. State Transformer:
    • Based on RLDS [Singh et al., ’99]
    • Outputs State-Action probability matrix and reward matrix
  • 2. MDP Matlab Toolkit (from INRA) to generate policies (see the transition-counting sketch below)
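In spirit, the state transformer's first output amounts to counting state-action-state transitions in the annotated dialogues and normalizing them into probabilities. The sketch below is an assumption about that computation, not the actual RLDS format:

```python
from collections import defaultdict

# Each dialogue is a sequence of (state, action) pairs ending in a final
# state; the states and actions below are illustrative placeholders.
dialogues = [
    [("C", "Feed"), ("I", "NonFeed"), ("C", "Feed"), ("FINAL", None)],
    [("I", "Mix"), ("I", "Feed"), ("FINAL", None)],
]

# Count s --a--> s' transitions...
counts = defaultdict(lambda: defaultdict(int))
for d in dialogues:
    for (s, a), (s_next, _) in zip(d, d[1:]):
        counts[(s, a)][s_next] += 1

# ...and normalize into a state-action transition probability matrix.
P = {sa: {s2: n / sum(nxt.values()) for s2, n in nxt.items()}
     for sa, nxt in counts.items()}
print(P)  # e.g. P[('C', 'Feed')] == {'I': 0.5, 'FINAL': 0.5}
```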
Methodology
  • Construct MDPs to test the inclusion of new state features against a baseline:
    • Develop a baseline state and policy
    • Add a feature to the baseline and compare policies
    • A feature is deemed important if adding it results in a change in policy from the baseline policy (“shifts”); a shift-counting sketch follows this list
  • For each MDP: verify policies are reliable (V-value convergence)
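Counting shifts is mechanically simple: compare each augmented state's optimal action with the baseline action of its underlying baseline state. A minimal sketch, with hypothetical state and action labels:

```python
# Count "shifts": augmented states whose optimal action differs from the
# baseline action of their underlying baseline state. All labels below
# are hypothetical.
baseline_policy = {"C": "Feed", "I": "Feed"}
augmented_policy = {
    ("C", "certain"): "NonFeed", ("C", "neutral"): "Feed",
    ("I", "certain"): "NonFeed", ("I", "neutral"): "Mix",
}

shifts = sum(1 for (base, _), act in augmented_policy.items()
             if act != baseline_policy[base])
print(shifts, "policy shifts")  # 3 shifts in this made-up example
```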
Tests

[Diagram: Baseline 1 = {Correctness}; Baseline 2 = Baseline 1 + Certainty; the remaining tests add SMove, Goal (concept repetition), Frustration, or %Correct individually on top of Baseline 2.]
Baseline
  • Actions: {Feed, NonFeed, Mix}
  • Baseline State: {Correctness}

Baseline network

[Diagram: from each state, correct [C] or incorrect [I], the tutor chooses F (Feed), NF (NonFeed), or Mix, eventually reaching a FINAL state.]

Baseline 1 Policies
  • Trend: if student correctness is the only model of student state, the best tactic is to always give simple feedback, regardless of the student’s response
But are our policies reliable?
  • The best way to test is to run real experiments with human users and the new dialogue manager, but that is months of work
  • Our tack: check whether our corpus is large enough to develop reliable policies by seeing if V-values converge as we add more data to the corpus
  • Method: run the MDP on subsets of our corpus (incrementally add one student’s 5 dialogues to the data and rerun the MDP on each subset), as sketched below
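A sketch of that convergence check; `solve_mdp` is a hypothetical stand-in for the full state-transformer-plus-MDP pipeline and maps a list of students to a dict of per-state V-values:

```python
# Rerun the MDP on incrementally larger corpus cuts and report how much
# the V-values still move between consecutive cuts.
def check_convergence(students, solve_mdp):
    prev = None
    for i in range(1, len(students) + 1):
        V = solve_mdp(students[:i])        # each student adds 5 dialogues
        if prev is not None:
            shared = set(V) & set(prev)    # states seen in both cuts
            max_diff = max((abs(V[s] - prev[s]) for s in shared),
                           default=0.0)
            print(f"{i} students: max V-value change = {max_diff:.2f}")
        prev = V
    return prev
```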
Methodology: Adding more Features
  • Create more complicated baseline by adding certainty feature (new baseline = B2)
  • Add other 4 features (student moves, concept repetition, frustration, performance) individually to new baseline
  • Check that V-values converge
  • Analyze policy changes
Tests

[Roadmap diagram repeated: next, Certainty is added to Baseline 1 to form Baseline 2.]
Certainty
  • Previous work (Bhatt et al., ’04) has shown the importance of certainty in intelligent tutoring systems
  • A student who is certain and correct may not need feedback, but one who is correct yet shows some doubt may be becoming confused, so the tutor should give more feedback
B2: Baseline + Certainty Policies

Trend: if the student is neutral, give Feed or Mix; otherwise give NonFeed

Tests

[Roadmap diagram repeated: the four remaining features are now added individually to Baseline 2.]
Student Move Policies (7 changes)

Trend: give Mix if shallow (S), give NonFeed if Other (O)

Concept Repetition Policies (4 shifts)

Trend: if a concept is repeated (R), give complex or Mix feedback

Frustration Policies (4 shifts)

Trend: if student is frustrated (F), give NonFeed

Percent Correct Policies (3 shifts)

Trend: if student is a low performer (L), give NonFeed

Discussion
  • Incorporating more information into a representation of the student state has an impact on tutor policies
  • Despite not having human or simulated users, we can still claim that our findings are reliable due to the convergence of V-values and policies
  • Including Certainty, Student Moves and Concept Repetition effected the most change
Future Work
  • Developing user simulations and annotating more human-computer experiments to further verify our policies are correct
  • More data allows us to develop more complicated policies, such as:
    • More complex tutor actions (hints, questions)
    • Combinations of state features
    • More refined reward functions (PARADISE)
  • Developing more complex convergence tests
Related Work
  • [Paek and Chickering, ‘05]
  • [Singh et al., ‘99] – optimal dialogue length
  • [Frampton et al., ‘05] – last dialogue act
  • [Williams et al., ‘03] – automatically generate good state/action sets
Diff Plots

Diff plot: compare the final policy (from all 20 students) with the policies generated at smaller cuts of the corpus; a sketch follows
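In code, a diff plot reduces to comparing each cut's policy against the final 20-student policy and plotting the number of disagreements. A hedged sketch, where `policies` is a hypothetical list of per-cut policy dictionaries:

```python
import matplotlib.pyplot as plt

def diff_plot(policies):
    """policies[i] maps state -> action for the cut with i + 1 students."""
    final = policies[-1]  # the policy learned from all 20 students
    diffs = [sum(1 for s in final if p.get(s) != final[s])
             for p in policies]
    plt.plot(range(1, len(policies) + 1), diffs, marker="o")
    plt.xlabel("students in corpus cut")
    plt.ylabel("states whose action differs from the final policy")
    plt.show()
```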
