
  1. Developing Learning Systems that Interact and Learn from Human Teachers. Kshitij Judah, EECS, OSU. Dissertation Proposal Presentation.

  2. Outline • PART I: User-Initiated Learning • PART II: RL via Practice and Critique Advice • PART III: Proposed future directions for the PhD program • Extending RL via practice and critique advice • Active Learning for sequential decision making

  3. PART I User-Initiated Learning

  4. User-Initiated Learning (UIL) • All of CALO’s learning components can perform Learning In The Wild (LITW) • But the learning tasks are all pre-defined by CALO’s engineers: • What to learn • What information is relevant for learning • How to acquire training examples • How to apply the learned knowledge • UIL Goal: Make it possible for the user to define new learning tasks after the system is deployed

  5. Motivating Scenario: Forgetting to Set Sensitivity. Timeline: a scientist collaborates with a research team on a classified project. The scientist sends an email to the team and sets its sensitivity to confidential; sends a "Lunch today?" email to a colleague and does not set sensitivity to confidential; then sends another email to the team but forgets to set its sensitivity to confidential.

  6. Motivating Scenario: Forgetting to Set Sensitivity. Timeline: the scientist teaches CALO to learn to predict whether the user has forgotten to set sensitivity. Later, when the scientist sends an email to the research team, CALO reminds the user to set sensitivity: "Please do not forget to set sensitivity when sending email."

  7. User-CALO Interaction: Teaching CALO to Predict Sensitivity. Architecture diagram components: Instrumented Outlook (events, e.g., compose new email); Integrated Task Learning (procedure demonstration and learning task creation, yielding a SPARK procedure the user can modify); Feature Guidance (a learning user interface for feature guidance, backed by a SAT-based reasoning system over the CALO ontology, producing legal features from which the user selects features); Machine Learner (training examples built from the email and related objects in the knowledge base, class labels, and feature guidance, producing a trained classifier).

  8. User-CALO Interaction: Teaching CALO to Predict Sensitivity (same diagram as slide 7).

  9. User-CALO Interaction: Teaching CALO to Predict Sensitivity (same diagram as slide 7).

  10. User-CALO Interaction: Teaching CALO to Predict Sensitivity (same diagram as slide 7).

  11. Assisting the User: Reminding

  12. The Learning Component • Logistic Regression is used as the core learning algorithm • Features • Relational features extracted from the ontology • Incorporate User Advice on Features • Apply a large prior variance to user-selected features • Select the prior variance on the rest of the features through cross-validation • Automated Model Selection • Parameters: prior variance on weights, classification threshold • Technique: maximization of a leave-one-out cross-validation estimate of kappa (κ)
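As a concrete illustration of this setup, here is a minimal sketch (not the CALO implementation; the function names, grid values, and data layout are assumptions) of MAP logistic regression with a per-feature Gaussian prior, where user-selected features receive a large prior variance and the remaining variance and classification threshold are chosen by maximizing a leave-one-out estimate of kappa:

```python
# Illustrative sketch: MAP logistic regression with a per-feature Gaussian prior,
# plus model selection by leave-one-out cross-validated kappa.
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import cohen_kappa_score

def fit_logreg(X, y, prior_var):
    """MAP weights; prior_var[j] is the prior variance of weight j."""
    n, d = X.shape

    def neg_log_posterior(w):
        z = X @ w
        ll = np.sum(y * z - np.logaddexp(0.0, z))      # Bernoulli log-likelihood, y in {0,1}
        prior = -0.5 * np.sum(w ** 2 / prior_var)      # large variance => weak penalty
        return -(ll + prior)

    return minimize(neg_log_posterior, np.zeros(d), method="L-BFGS-B").x

def loo_kappa(X, y, prior_var, threshold):
    """Leave-one-out cross-validation estimate of Cohen's kappa."""
    preds = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w = fit_logreg(X[keep], y[keep], prior_var)
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        preds.append(int(p >= threshold))
    return cohen_kappa_score(y, preds)

def select_model(X, y, user_selected, big_var=100.0,
                 var_grid=(0.01, 0.1, 1.0), thr_grid=(0.3, 0.5, 0.7)):
    """Grid-search the prior variance of non-selected features and the threshold."""
    best = None
    for v in var_grid:
        prior_var = np.where(user_selected, big_var, v)
        for t in thr_grid:
            score = loo_kappa(X, y, prior_var, t)
            if best is None or score > best[0]:
                best = (score, v, t)
    return best
```

With a uniform L2 penalty, an equivalent trick is to rescale the user-selected feature columns: scaling a column by c is equivalent to giving its weight a prior variance larger by a factor of c².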

  13. Empirical Evaluation • Problems: • Attachment Prediction • Importance Prediction • Learning Configurations Compared: • No User Advice + Fixed Model Parameters • User Advice + Fixed Model Parameters • No User Advice + Automatic Parameter Tuning • User Advice + Automatic Parameter Tuning • User Advice: 18 keywords in the body text for each problem

  14. Empirical Evaluation: Data Set • Set of 340 emails obtained from a real desktop user • 256 training emails + 84 test emails • For each training set size, compute mean kappa (κ) on the test set to generate learning curves • κ is a statistical measure of inter-rater agreement for discrete classes • κ is a common evaluation metric when the classes have a skewed distribution
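For reference, Cohen's kappa compares the observed agreement p_o (here, how often the classifier's predictions match the true labels) with the agreement p_e expected by chance from the marginal class frequencies:

κ = (p_o − p_e) / (1 − p_e)

κ = 1 indicates perfect agreement and κ = 0 indicates agreement no better than chance, which is why it is more informative than raw accuracy when class distributions are skewed.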

  15. Empirical Evaluation: Learning Curves Attachment Prediction

  16. Empirical Evaluation: Learning Curves Importance Prediction

  17. Empirical Evaluation: Robustness to Bad Advice • We tested the robustness of the system to bad advice • Bad advice was generated as follows (sketched below): • Use SVM-based feature selection in WEKA to produce a ranking of the user-provided keywords • Replace the top three words in the ranking with randomly selected words from the vocabulary
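A rough sketch of this corruption step, using scikit-learn's LinearSVC weights in place of WEKA's SVM-based feature selection; the function and variable names are illustrative:

```python
# Illustrative sketch: corrupt user advice by swapping the three highest-ranked
# advice keywords for random vocabulary words.
import numpy as np
from sklearn.svm import LinearSVC

def make_bad_advice(X, y, vocabulary, advice_keywords, n_swap=3, seed=0):
    rng = np.random.default_rng(seed)
    svm = LinearSVC(C=1.0).fit(X, y)                  # X columns align with vocabulary
    weight = dict(zip(vocabulary, np.abs(svm.coef_[0])))
    ranked = sorted(advice_keywords, key=lambda w: weight.get(w, 0.0), reverse=True)
    random_words = rng.choice([w for w in vocabulary if w not in advice_keywords],
                              size=n_swap, replace=False)
    return list(random_words) + ranked[n_swap:]       # top-3 keywords replaced
```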

  18. Empirical Evaluation: Robustness to Bad Advice Attachment Prediction

  19. Empirical Evaluation: Robustness to Bad Advice Importance Prediction

  20. Lessons Learned • User interfaces should support rich instrumentation, automation, and intervention • User interfaces should come with models of their behavior • User advice is helpful but not critical • Self-tuning learning algorithms are critical for success

  21. PART II Reinforcement Learning via Practice and Critique Advice

  22. Reinforcement Learning (RL) • Diagram: the agent interacts with the environment through state, action, and reward; a teacher additionally provides advice on the agent's behavior. • PROBLEM: Usually RL takes a long time to learn a good policy. • GOALS: • Non-technical users as teachers • Natural interaction methods • RESEARCH QUESTION: Can we make RL perform better with some outside help, such as critique/advice from a teacher, and how?

  23. RL via Practice + Critique Advice • Example critique: "In state s_i, action a_i is bad, whereas action a_j is good." • Diagram: the practice session produces trajectory data; in the critique session the teacher uses the advice interface to produce critique data; both are used to update the policy parameters θ.

  24. Solution Approach • Example critique: "In state s_i, action a_i is bad, whereas action a_j is good." • Diagram: the practice session yields trajectory data, from which expected utility is estimated using importance sampling (Peshkin & Shelton, ICML 2002); the critique session, via the advice interface, yields critique data; both feed into the policy parameters θ.
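A minimal sketch of the importance-sampling idea referenced here, in the spirit of Peshkin & Shelton (2002): reweight the returns of trajectories gathered under earlier behavior policies by the ratio of action probabilities under the current parameters θ. The data layout and function names below are assumptions, not the thesis implementation.

```python
# Illustrative sketch: off-policy estimate of expected utility via importance
# sampling over previously collected practice trajectories.
import numpy as np

def expected_utility(theta, trajectories, policy_prob):
    """
    trajectories: list of dicts with keys
        'states', 'actions'  : visited states and chosen actions
        'behavior_probs'     : probability each action had under the policy that generated it
        'return'             : total reward of the trajectory
    policy_prob(theta, s, a): probability of action a in state s under parameters theta.
    """
    estimates = []
    for traj in trajectories:
        # importance weight = product over steps of pi_theta(a|s) / q(a|s)
        w = 1.0
        for s, a, q in zip(traj['states'], traj['actions'], traj['behavior_probs']):
            w *= policy_prob(theta, s, a) / q
        estimates.append(w * traj['return'])
    return np.mean(estimates)
```

A weighted (self-normalized) variant, dividing by the sum of the importance weights instead of the trajectory count, is a common way to reduce variance.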

  25. Critique Data Loss L(θ, C) • Through the advice interface the teacher labels some actions as good, some as bad, and leaves some actions unlabeled. • Imagine our teacher is an Ideal Teacher who provides all good actions: the set O(s_i) of all good actions, all equally good; any action not in O(s_i) is suboptimal according to the Ideal Teacher.

  26. ‘Any Label Learning’ (ALL) • Imagine our teacher is an Ideal Teacher who provides the set O(s_i) of all good actions, all equally good; any action not in O(s_i) is suboptimal. • Learning Goal: Find a probabilistic policy, or classifier, that has a high probability of returning an action in O(s) when applied to s. • ALL Likelihood L_ALL(θ, C): for each critiqued state s_i, the probability of selecting an action in O(s_i) given s_i.
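Written out (a plausible reconstruction from the slide's wording; the exact form in the proposal may differ), the ALL likelihood multiplies, over the critiqued states, the probability mass the policy places on the good set:

L_ALL(θ, C) = ∏_i Σ_{a ∈ O(s_i)} π_θ(a | s_i)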

  27. Critique Data Loss L(θ, C) • Coming back to reality: not all teachers are ideal! The actions labeled good and bad through the advice interface provide only partial evidence about O(s_i). • What about the naïve approach of treating the labeled-good actions as the true set O(s_i)? • Difficulties: • When there are actions outside the labeled-good set that are equally good compared to those in it, the learning problem becomes even harder. • We want a principled way of handling the situation where either the labeled-good or the labeled-bad set can be empty.

  28. Expected Any-Label Learning • The actions labeled good and bad provide partial evidence about O(s_i); a user model turns this evidence into a distribution over the unknown set O(s_i). • Assume independence among different states. • From the corresponding distributions for all states, we can compute the expected ALL loss. • The gradient of the expected loss has a compact closed form.
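A minimal sketch of how such an expected loss could be computed, assuming a softmax policy and a user model that returns, for each critiqued state, a few candidate good-sets with probabilities; this is an illustration, not the proposal's exact formulation (and in practice the gradient would use the closed form rather than this enumeration):

```python
# Illustrative sketch: expected ALL loss under a simple user model.
import numpy as np

def softmax_policy(theta, features):
    """features[a] is the feature vector of action a in this state."""
    scores = features @ theta
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

def expected_all_loss(theta, critique):
    """
    critique: list of per-state entries, each a dict with
      'features'   : array of shape (n_actions, n_dims)
      'candidates' : list of (good_set, prob) pairs from the user model
    Independence across states lets the expected loss decompose into a sum.
    """
    loss = 0.0
    for entry in critique:
        pi = softmax_policy(theta, entry['features'])
        for good_set, prob in entry['candidates']:
            mass = pi[list(good_set)].sum()          # prob. of choosing a good action
            loss += prob * -np.log(mass + 1e-12)     # expected negative log-likelihood
    return loss
```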

  29. Experimental Setup • Our Domain: Micro-management in tactical battles in the Real Time Strategy (RTS) game of Wargus. • 5 friendly footmen against a group of 5 enemy footmen (Wargus AI). • Two battle maps (Map 1 and Map 2), which differed only in the initial placement of the units. • Both maps had winning strategies for the friendly team and were of roughly the same difficulty.

  30. Advice Interface • Difficulty: Fast pace and multiple units acting in parallel • Our setup: Provide end-users with an Advice Interface that allows them to watch a battle and pause it at any moment.

  31. User Study • Goal is to evaluate two systems • Supervised System = no practice session • Combined System = includes practice and critique • The user study involved 10 end-users • 6 with a CS background • 4 with no CS background • Each user trained both the supervised and combined systems • 30 minutes total for supervised • 60 minutes for combined, due to additional time for practice • Since repeated runs with human users are not practical, these results are qualitative • To provide statistical results, we first present simulated experiments

  32. Simulated Experiments • After the user study, we selected the worst and best performing users on each map when training the combined system • Total critique data: User #1: 36, User #2: 91, User #3: 115, User #4: 33 • For each user: divide the critique data into 4 equal-sized segments, creating four data sets per user containing 25%, 50%, 75%, and 100% of their respective critique data • We provided the combined system with each of these data sets and allowed it to practice for 100 episodes. All results are averaged over 5 runs.

  33. Simulated Experiments Results: Benefit of Critiques (User #1) • RL is unable to learn a winning policy (i.e. achieve a positive value).

  34. Simulated Experiments Results: Benefit of Critiques (User #1) • With more critiques, performance increases slightly.

  35. Simulated Experiments Results: Benefit of Critiques (User #1) • As the amount of critique data increases, performance improves for a fixed number of practice episodes. • RL never exceeded a health difference of 12 on any map, even after 500 trajectories.

  36. Simulated Experiments Results: Benefit of Practice (User #1) • Even with no practice, the critique data was sufficient to outperform RL. • RL never exceeded a health difference of 12.

  37. Simulated Experiments Results: Benefit of Practice (User #1) • With more practice, performance increases as well.

  38. Simulated Experiments Results: Benefit of Practice (User #1) • Our approach is able to leverage practice episodes to improve effectiveness for a given amount of critique data.

  39. Results for Actual User Study • Goal is to evaluate two systems • Supervised System = no practice session • Combined System = includes practice and critique • The user study involved 10 end-users • 6 with a CS background • 4 with no CS background • Each user trained both the supervised and combined systems • 30 minutes total for supervised • 60 minutes for combined, due to additional time for practice

  40. Results of User Study

  41. Results of User Study • Comparing to RL: • 9 out of 10 users achieved a performance of 50 or more using the Supervised System • 6 out of 10 users achieved a performance of 50 or more using the Combined System • RL never exceeded a health difference of 12 on any map, even after 500 trajectories. • Users effectively performed better than RL using either the Supervised or the Combined System.

  42. Results of User Study • Comparing Combined and Supervised: • The end-users had slightly greater success with the supervised system versus the combined system. • More users were able to achieve performance levels of 50 and 80 using the supervised system. • Frustrating problems for users: • Large delays experienced (not an issue in many realistic settings) • The policy returned after practice was sometimes poor and seemed to ignore advice (perhaps the practice sessions were too short)

  43. PART III Future Directions

  44. Future Direction 1: Extending RL via Practice and Critique Advice • Understanding the effects of user models: • Study the sensitivity of our algorithm to various settings of the model parameters. • Robustness of our algorithm against inaccurate parameter settings. • Study the benefits of using more elaborate user models. • Understanding the effects of mixing advice from multiple teachers: • Pros: addresses the incompleteness and limited quality of advice from a single teacher. • Cons: introduces variations and more complex patterns that are harder to generalize over. • Study the benefits and harms of mixing advice. • Understanding the effects of advice types: • Study the effects of feedback-only versus mixed advice.

  45. Future Direction 2: Active Learning for Sequential Decision Making • The current advice collection mechanism is very basic: • An entire episode is played before the teacher. • The teacher scans the episode to locate places where critique is needed. • Only one episode is played. • Problems with the current advice collection mechanism: • The teacher is fully responsible for locating places where critique is needed. • Scanning an entire episode is very cognitively demanding. • There is a good chance of missing places where advice is critical. • Showing only one episode is a limitation, especially in stochastic domains. • GOAL: The learner should itself discover places where it needs advice and query the teacher at those places.

  46. Active Learning for Sequential Decision Making: Problem Description • Example critique: "In state s_i, action a_i is bad, whereas action a_j is good." • Diagram: the practice session generates a full episode (trajectory data) under the current policy; in the critique session the teacher reviews it through the advice interface, producing critique data that updates the policy parameters θ.

  47. Active Learning for Sequential Decision Making: Problem Description • Example critique: "In state s_i, action a_i is bad, whereas action a_j is good." • Diagram: an active learning module, equipped with a cost model ($$), selects the best sequence from the trajectory data generated under the current policy and presents it to the teacher through the advice interface; the resulting critique data updates the policy parameters θ. • Problem: How do we select the sequence that best optimizes the benefit-to-cost tradeoff?
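As one purely illustrative instantiation of this selection problem (nothing here is specified in the proposal), the learner could score candidate segments of its sampled trajectories by policy uncertainty and charge a per-step critique cost:

```python
# Illustrative sketch: pick the trajectory segment to show the teacher by trading
# off a simple uncertainty score against a per-step critique cost.
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def select_query_segment(trajectory_probs, segment_len=10, cost_per_step=0.1):
    """
    trajectory_probs: list of action-probability vectors, one per visited state,
                      under the current policy (sampled during practice).
    Returns (start, end) of the segment maximizing uncertainty minus cost.
    """
    scores = [entropy(p) for p in trajectory_probs]
    best, best_val = None, -np.inf
    for start in range(0, len(scores) - segment_len + 1):
        val = sum(scores[start:start + segment_len]) - cost_per_step * segment_len
        if val > best_val:
            best, best_val = (start, start + segment_len), val
    return best
```

In the proposed work, the scoring criterion and cost model would come from the benefit-to-cost analysis described on this slide rather than from raw policy entropy.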

  48. What about existing techniques? • Few techniques exist for the problem of 'active learning' in sequential decision making with an external teacher • All of them make assumptions that work only for certain applications • Some techniques request a full demonstration from the start state • Some techniques assume the teacher is always available and request a single- or multi-step demonstration when needed • Some techniques remove the assumption that the teacher is present at all times, but they pause until the request for a demonstration is satisfied

  49. What about existing techniques? • We feel such assumptions are unnecessary in general • Providing demonstrations is quite labor intensive and sometimes not even practical • We instead seek feedback and guidance on potential execution traces of our policy • Pausing and waiting for the teacher is also inefficient • We never want to pause; instead, we keep generating execution traces from our policy for the teacher to critique later, when he/she is available

  50. What about Supervised Active Learning techniques? • Active learning is well developed for the supervised setting • All instances come from a single distribution of interest • The best instance is selected based on some criterion and queried for its label • In our setting, the distribution of interest is the distribution of states along the teacher's policy (or a good policy) • Asking queries about states far off the teacher's policy is unlikely to produce useful feedback (e.g., losing states in Chess or Wargus) • The learner faces the additional challenge of identifying states that occur along the teacher's policy and querying in those states
