
Presentation Transcript


  1. Reinforcement Learning via Practice and Critique Advice. Kshitij Judah, Saikat Roy, Alan Fern, Tom Dietterich. AAMAS-ALIHT 2010 Workshop, Toronto, Canada.

  2. Reinforcement Learning (RL)
  • PROBLEM:
    • RL usually takes a long time (hundreds of episodes) to learn a good policy.
    • In our Real Time Strategy (RTS) game Wargus, OLPOMDP was still unable to win convincingly against the enemy footmen after 500 episodes.
  • GOALS:
    • Non-technical users as teachers
    • Natural interaction methods
  [Diagram: the agent acts in the environment via states, actions, and rewards, while a teacher observes its behavior and provides advice.]
  • RESEARCH QUESTION: Can we make RL perform better with some outside help, such as critique/advice from a teacher, and if so, how?
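For context, here is a minimal sketch of an OLPOMDP-style online policy-gradient update for a linear softmax policy over discrete actions. The feature representation, function names, and parameter values are illustrative assumptions, not the authors' Wargus implementation.

```python
import numpy as np

def softmax_policy(theta, features):
    """Action probabilities for a linear softmax policy.
    features: (num_actions, num_features) matrix of state-action features."""
    logits = features @ theta
    logits = logits - logits.max()      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def olpomdp_update(theta, z, features, action, reward, alpha=0.01, beta=0.9):
    """One OLPOMDP-style update after taking `action` in the current state
    and observing `reward`: fold the score function into a discounted
    eligibility trace z, then move theta along reward * z."""
    probs = softmax_policy(theta, features)
    grad_log_pi = features[action] - probs @ features   # d/dtheta log pi(action | s)
    z = beta * z + grad_log_pi
    theta = theta + alpha * reward * z
    return theta, z
```

Running hundreds of episodes of such updates is the slow baseline the rest of the talk tries to speed up with teacher critiques.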

  3. RL via Practice + Critique Advice
  [Diagram: the learner alternates between a Practice Session, which produces trajectory data, and a Critique Session, in which the teacher uses an Advice Interface to give feedback such as "In a state s_i, action a_i is bad, whereas action a_j is good", producing critique data. The open question (?) is how both kinds of data should update the policy parameters θ.]

  4. Solution Approach
  [Diagram: the same loop as above, now with the trajectory data from practice and the critique data from the teacher jointly updating the policy parameters θ.]
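A hedged sketch of the combined loop the diagram suggests: alternate autonomous practice with supervised updates from the accumulated critiques. The function names (run_practice_episode, collect_critiques, critique_loss_grad) and the round schedule are placeholders, not the authors' code.

```python
def combined_training(theta, run_practice_episode, collect_critiques,
                      critique_loss_grad, num_rounds=3,
                      practice_per_round=100, alpha=0.01, lam=1.0):
    """Alternate practice (RL on trajectory data) with critique updates
    (a supervised loss on the teacher's labels).

    run_practice_episode(theta) -> theta   # e.g. one OLPOMDP episode update
    collect_critiques(theta)    -> list    # (state, good, bad) labels
    critique_loss_grad(theta,C) -> array   # gradient of L(theta, C)
    """
    critiques = []
    for _ in range(num_rounds):
        # Practice session: the agent improves from its own experience.
        for _ in range(practice_per_round):
            theta = run_practice_episode(theta)
        # Critique session: the teacher reviews behavior and labels actions.
        critiques.extend(collect_critiques(theta))
        # Fold the critique-loss gradient into the policy parameters.
        theta = theta - alpha * lam * critique_loss_grad(theta, critiques)
    return theta
```

The next slides define the critique loss L(θ, C) that such a loop would need.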

  5. Critique Data Loss L(θ, C)
  • Through the Advice Interface the teacher labels some actions as good, some actions as bad, and leaves some actions unlabeled.
  • Imagine our teacher is an Ideal Teacher who provides the set of all good actions O(s_i) for each critiqued state: all actions in O(s_i) are equally good, and any action not in O(s_i) is suboptimal according to the Ideal Teacher.

  6. ‘Any Label Learning’ (ALL)
  • Learning Goal: Find a probabilistic policy, or classifier, that has a high probability of returning an action in O(s) when applied to s.
  • (Still imagining the Ideal Teacher, who provides the full set O(s_i) of equally good actions; any action not in O(s_i) is suboptimal.)
  • ALL Likelihood: L_ALL(θ, C) = ∏_i π_θ(O(s_i) | s_i), where π_θ(O(s_i) | s_i) = Σ_{a ∈ O(s_i)} π_θ(a | s_i) is the probability of selecting an action in O(s_i) given state s_i.
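A minimal sketch of the ALL log-likelihood for a linear softmax policy, assuming critique data stored as (state features, good-action indices) pairs; this mirrors the definition above but is not the authors' implementation.

```python
import numpy as np

def all_log_likelihood(theta, critique_data):
    """log L_ALL(theta, C) = sum_i log( sum_{a in O(s_i)} pi_theta(a | s_i) ).

    critique_data: list of (features, good_actions) pairs, where `features`
    is the (num_actions, num_features) matrix for state s_i and
    `good_actions` indexes the set O(s_i)."""
    total = 0.0
    for features, good_actions in critique_data:
        logits = features @ theta
        logits = logits - logits.max()                 # numerical stability
        p = np.exp(logits)
        p = p / p.sum()                                # pi_theta(. | s_i)
        total += np.log(p[list(good_actions)].sum() + 1e-12)  # mass on O(s_i)
    return total
```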

  7. Critique Data Loss L(θ, C)
  • Coming back to reality: not all teachers are ideal! The good and bad labels given through the Advice Interface provide only partial evidence about O(s_i).
  • What about the naïve approach of treating the set of actions labeled good as the true set O(s_i)?
  • Difficulties:
    • When there are actions outside the labeled-good set that are equally good compared to those in it, the learning problem becomes even harder.
    • We want a principled way of handling the situation where either the labeled-good set or the labeled-bad set can be empty.

  8. Expected Any-Label Learning
  • The good and bad labels provide only partial evidence about O(s_i), so we introduce a User Model describing how the teacher's labels relate to the underlying set O(s_i).
  • Assume independence among different states, so the distribution over the sets O(s_i) factors across the critiqued states and can be obtained from the corresponding labels for all states.
  • Expected Loss: the expectation of the ALL loss under the user model's distribution over the sets O(s_i).
  • The gradient of the expected loss has a compact closed form.
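To make the idea concrete, here is a heavily simplified sketch of an expected-ALL-style objective under a toy user model in which each unlabeled action is independently in O(s_i) with a fixed probability. The user-model parameter q_unlabeled, the function names, and the specific objective are illustrative assumptions; this is not the paper's exact expected loss or its closed-form gradient.

```python
import numpy as np

def membership_probs(num_actions, good, bad, q_unlabeled=0.2):
    """Toy user model: actions labeled good are in O(s) with probability 1,
    actions labeled bad with probability 0, and each unlabeled action with a
    fixed probability q_unlabeled (a hypothetical parameter)."""
    mu = np.full(num_actions, q_unlabeled)
    mu[list(good)] = 1.0
    mu[list(bad)] = 0.0
    return mu

def expected_good_mass_objective(theta, critique_data, q_unlabeled=0.2):
    """Sum over critiqued states of log of the expected probability mass the
    policy puts on O(s_i), with the expectation taken under the toy user
    model above. A simplified stand-in for the expected ALL loss, for
    illustration only."""
    total = 0.0
    for features, good, bad in critique_data:
        logits = features @ theta
        logits = logits - logits.max()
        p = np.exp(logits)
        p = p / p.sum()                                  # pi_theta(. | s_i)
        mu = membership_probs(len(p), good, bad, q_unlabeled)
        total += np.log((mu * p).sum() + 1e-12)
    return total
```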

  9. Experimental Setup
  [Screenshots: Map 1 and Map 2]
  • Our domain: micro-management in tactical battles in the Real Time Strategy (RTS) game of Wargus.
  • 5 friendly footmen against a group of 5 enemy footmen (Wargus AI).
  • Two battle maps, which differed only in the initial placement of the units.
  • Both maps had winning strategies for the friendly team and were of roughly the same difficulty.

  10. Advice Interface
  • Difficulty:
    • Fast pace and multiple units acting in parallel.
  • Our setup:
    • Provide end-users with an Advice Interface that allows them to watch a battle and pause it at any moment.

  11. User Study
  • Goal is to evaluate two systems:
    • Supervised System = no practice session
    • Combined System = includes practice and critique
  • The user study involved 10 end-users:
    • 6 with a CS background
    • 4 without a CS background
  • Each user trained both the supervised and combined systems:
    • 30 minutes total for supervised
    • 60 minutes for combined, due to the additional time for practice
  • Since repeated runs are not practical, the user-study results are qualitative.
  • To provide statistical results, we first present simulated experiments.

  12. Simulated Experiments
  • After the user study, we selected the worst- and best-performing users on each map when training the combined system.
  • Total critique data: User #1 – 36, User #2 – 91, User #3 – 115, User #4 – 33.
  • For each user: divide the critique data into 4 equal-sized segments, creating four data sets per user containing 25%, 50%, 75%, and 100% of their respective critique data.
  • We provided the combined system with each of these data sets and allowed it to practice for 100 episodes. All results are averaged over 5 runs.
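For concreteness, a small sketch of the splitting step described above, assuming each user's critiques are kept in collection order so the four data sets are cumulative prefixes; that ordering assumption and the helper name are mine, not stated on the slide.

```python
def cumulative_splits(critiques, fractions=(0.25, 0.50, 0.75, 1.00)):
    """Build the four data sets containing 25%, 50%, 75%, and 100% of a
    user's critique data (assumed here to be kept in collection order)."""
    n = len(critiques)
    return [critiques[: max(1, round(n * f))] for f in fractions]
```

Each split would then be given to the combined system for 100 practice episodes, with the resulting curves averaged over 5 runs.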

  13. Simulated Experiments Results: Benefit of Critiques (User #1)
  • RL is unable to learn a winning policy (i.e., achieve a positive value).

  14. Simulated Experiments Results: Benefit of Critiques (User #1)
  • With more critiques, performance increases a little.

  15. Simulated Experiments Results: Benefit of Critiques (User #1)
  • As the amount of critique data increases, the performance improves for a fixed number of practice episodes.
  • RL did not get past a health difference of 12 on any map, even after 500 trajectories.

  16. Simulated Experiments Results: Benefit of Practice (User #1)
  • Even with no practice, the critique data was sufficient to outperform RL.
  • RL did not get past a health difference of 12.

  17. Simulated Experiments Results: Benefit of Practice (User #1)
  • With more practice, performance increases too.

  18. Simulated Experiments Results: Benefit of Practice (User #1)
  • Our approach is able to leverage practice episodes in order to improve the effectiveness of a given amount of critique data.

  19. Results for Actual User Study
  • Goal is to evaluate two systems:
    • Supervised System = no practice session
    • Combined System = includes practice and critique
  • The user study involved 10 end-users:
    • 6 with a CS background
    • 4 without a CS background
  • Each user trained both the supervised and combined systems:
    • 30 minutes total for supervised
    • 60 minutes for combined, due to the additional time for practice

  20. Results of User Study

  21. Results of User Study
  • Comparing to RL:
    • 9 out of 10 users achieved a performance of 50 or more using the Supervised System.
    • 6 out of 10 users achieved a performance of 50 or more using the Combined System.
    • RL achieved -7.4 on Map 1 and 12.2 on Map 2 after 500 episodes.
  • Users effectively performed better than RL using either the Supervised or Combined System.

  22. Results of User Study
  • Comparing Combined and Supervised:
    • The end-users had slightly greater success with the supervised system versus the combined system.
    • More users were able to achieve performance levels of 50 and 80 using the supervised system.
  • Frustrating problems for users:
    • Long delays were experienced (not an issue in many use settings).
    • The policy returned after practice was sometimes poor and seemed to be ignoring the advice (perhaps the practice sessions were too short).

  23. Future Work
  • An important part of our future work will be to conduct further user studies in order to pursue the most relevant directions, including:
    • Studying more sophisticated user models that further approximate real users.
    • Enriching the forms of advice.
    • Increasing the stability of autonomous practice.
    • Designing realistic user studies.

  24. Questions?
