
  1. Developing Learning Systems that Interact and Learn from Human Teachers. Kshitij Judah, EECS, OSU. Dissertation Proposal Presentation.

  2. Outline • PART I: User-Initiated Learning • PART II: RL via Practice and Critique Advice • PART III: Proposed future directions for the PhD program • Extending RL via practice and critique advice • Active Learning for sequential decision making

  3. PART I User-Initiated Learning

  4. User-Initiated Learning (UIL) • All of CALO’s learning components can perform Learning In The Wild (LITW) • But the learning tasks are all pre-defined by CALO’s engineers: • What to learn • What information is relevant for learning • How to acquire training examples • How to apply the learned knowledge • UIL Goal: Make it possible for the user to define new learning tasks after the system is deployed

  5. Motivating Scenario: Forgetting to Set Sensitivity. Timeline: a scientist collaborates with a research team on a classified project. The scientist sends an email to the team and sets its sensitivity to confidential; sends a "Lunch today?" email to a colleague and does not set sensitivity to confidential; then sends another email to the team but forgets to set its sensitivity to confidential.

  6. Motivating Scenario: Forgetting to Set Sensitivity. Timeline: the scientist teaches CALO to learn to predict whether the user has forgotten to set sensitivity. Later, when the scientist sends an email to the research team, CALO reminds the user to set sensitivity: "Please do not forget to set sensitivity when sending email."

  7. User-CALO Interaction: Teaching CALO to Predict Sensitivity. Architecture diagram components: Instrumented Outlook (events, e.g., compose new email); Integrated Task Learning (procedure demonstration and learning task creation, yielding a SPARK procedure the user can modify); Feature Guidance (a learning user interface for feature guidance, backed by a SAT-based reasoning system over the CALO ontology, producing legal features from which the user selects features); Machine Learner (training examples built from the email and related objects in the knowledge base, class labels, and feature guidance, producing a trained classifier).

  8. User-CALO Interaction: Teaching CALO to Predict Sensitivity (same diagram as slide 7).

  9. User-CALO Interaction: Teaching CALO to Predict Sensitivity (same diagram as slide 7).

  10. User-CALO Interaction: Teaching CALO to Predict Sensitivity (same diagram as slide 7).

  11. Assisting the User: Reminding

  12. The Learning Component • Logistic Regression is used as the core learning algorithm • Features • Relational features extracted from the ontology • Incorporate User Advice on Features • Apply a large prior variance to user-selected features • Select the prior variance on the rest of the features through cross-validation • Automated Model Selection • Parameters: prior variance on weights, classification threshold • Technique: maximization of a leave-one-out cross-validation estimate of kappa (κ)
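As a concrete illustration of this setup, here is a minimal sketch (not the CALO implementation; the function names, grid values, and data layout are assumptions) of MAP logistic regression with a per-feature Gaussian prior, where user-selected features receive a large prior variance and the remaining variance and classification threshold are chosen by maximizing a leave-one-out estimate of kappa:

```python
# Illustrative sketch: MAP logistic regression with a per-feature Gaussian prior,
# plus model selection by leave-one-out cross-validated kappa.
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import cohen_kappa_score

def fit_logreg(X, y, prior_var):
    """MAP weights; prior_var[j] is the prior variance of weight j."""
    n, d = X.shape

    def neg_log_posterior(w):
        z = X @ w
        ll = np.sum(y * z - np.logaddexp(0.0, z))      # Bernoulli log-likelihood, y in {0,1}
        prior = -0.5 * np.sum(w ** 2 / prior_var)      # large variance => weak penalty
        return -(ll + prior)

    return minimize(neg_log_posterior, np.zeros(d), method="L-BFGS-B").x

def loo_kappa(X, y, prior_var, threshold):
    """Leave-one-out cross-validation estimate of Cohen's kappa."""
    preds = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w = fit_logreg(X[keep], y[keep], prior_var)
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        preds.append(int(p >= threshold))
    return cohen_kappa_score(y, preds)

def select_model(X, y, user_selected, big_var=100.0,
                 var_grid=(0.01, 0.1, 1.0), thr_grid=(0.3, 0.5, 0.7)):
    """Grid-search the prior variance of non-selected features and the threshold."""
    best = None
    for v in var_grid:
        prior_var = np.where(user_selected, big_var, v)
        for t in thr_grid:
            score = loo_kappa(X, y, prior_var, t)
            if best is None or score > best[0]:
                best = (score, v, t)
    return best
```

With a uniform L2 penalty, an equivalent trick is to rescale the user-selected feature columns: scaling a column by c is equivalent to giving its weight a prior variance larger by a factor of c².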

  13. Empirical Evaluation • Problems: • Attachment Prediction • Importance Prediction • Learning Configurations Compared: • No User Advice + Fixed Model Parameters • User Advice + Fixed Model Parameters • No User Advice + Automatic Parameter Tuning • User Advice + Automatic Parameter Tuning • User Advice: 18 keywords in the body text for each problem

  14. Empirical Evaluation: Data Set • Set of 340 emails obtained from a real desktop user • 256 training emails + 84 test emails • For each training set size, compute mean kappa (κ) on the test set to generate learning curves • κ is a statistical measure of inter-rater agreement for discrete classes • κ is a common evaluation metric when the classes have a skewed distribution
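For reference, Cohen's kappa compares the observed agreement p_o (here, how often the classifier's predictions match the true labels) with the agreement p_e expected by chance from the marginal class frequencies:

κ = (p_o − p_e) / (1 − p_e)

κ = 1 indicates perfect agreement and κ = 0 indicates agreement no better than chance, which is why it is more informative than raw accuracy when class distributions are skewed.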

  15. Empirical Evaluation: Learning Curves Attachment Prediction

  16. Empirical Evaluation: Learning Curves Importance Prediction

  17. Empirical Evaluation: Robustness to Bad Advice • We tested the robustness of the system to bad advice • Bad advice was generated as follows (sketched below): • Use SVM-based feature selection in WEKA to produce a ranking of the user-provided keywords • Replace the top three words in the ranking with randomly selected words from the vocabulary
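A rough sketch of this corruption step, using scikit-learn's LinearSVC weights in place of WEKA's SVM-based feature selection; the function and variable names are illustrative:

```python
# Illustrative sketch: corrupt user advice by swapping the three highest-ranked
# advice keywords for random vocabulary words.
import numpy as np
from sklearn.svm import LinearSVC

def make_bad_advice(X, y, vocabulary, advice_keywords, n_swap=3, seed=0):
    rng = np.random.default_rng(seed)
    svm = LinearSVC(C=1.0).fit(X, y)                  # X columns align with vocabulary
    weight = dict(zip(vocabulary, np.abs(svm.coef_[0])))
    ranked = sorted(advice_keywords, key=lambda w: weight.get(w, 0.0), reverse=True)
    random_words = rng.choice([w for w in vocabulary if w not in advice_keywords],
                              size=n_swap, replace=False)
    return list(random_words) + ranked[n_swap:]       # top-3 keywords replaced
```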

  18. Empirical Evaluation: Robustness to Bad Advice Attachment Prediction

  19. Empirical Evaluation: Robustness to Bad Advice Importance Prediction

  20. Lessons Learned • User interfaces should support rich instrumentation, automation, and intervention • User interfaces should come with models of their behavior • User advice is helpful but not critical • Self-tuning learning algorithms are critical for success

  21. PART II Reinforcement Learning via Practice and Critique Advice

  22. Reinforcement Learning (RL) • Diagram: the agent interacts with the environment through state, action, and reward; a teacher additionally provides advice on the agent's behavior. • PROBLEM: Usually RL takes a long time to learn a good policy. • GOALS: • Non-technical users as teachers • Natural interaction methods • RESEARCH QUESTION: Can we make RL perform better with some outside help, such as critique/advice from a teacher, and how?

  23. RL via Practice + Critique Advice • Example critique: "In state s_i, action a_i is bad, whereas action a_j is good." • Diagram: the practice session produces trajectory data; in the critique session the teacher uses the advice interface to produce critique data; both are used to update the policy parameters θ.

  24. Solution Approach • Example critique: "In state s_i, action a_i is bad, whereas action a_j is good." • Diagram: the practice session yields trajectory data, from which expected utility is estimated using importance sampling (Peshkin & Shelton, ICML 2002); the critique session, via the advice interface, yields critique data; both feed into the policy parameters θ.
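A minimal sketch of the importance-sampling idea referenced here, in the spirit of Peshkin & Shelton (2002): reweight the returns of trajectories gathered under earlier behavior policies by the ratio of action probabilities under the current parameters θ. The data layout and function names below are assumptions, not the thesis implementation.

```python
# Illustrative sketch: off-policy estimate of expected utility via importance
# sampling over previously collected practice trajectories.
import numpy as np

def expected_utility(theta, trajectories, policy_prob):
    """
    trajectories: list of dicts with keys
        'states', 'actions'  : visited states and chosen actions
        'behavior_probs'     : probability each action had under the policy that generated it
        'return'             : total reward of the trajectory
    policy_prob(theta, s, a): probability of action a in state s under parameters theta.
    """
    estimates = []
    for traj in trajectories:
        # importance weight = product over steps of pi_theta(a|s) / q(a|s)
        w = 1.0
        for s, a, q in zip(traj['states'], traj['actions'], traj['behavior_probs']):
            w *= policy_prob(theta, s, a) / q
        estimates.append(w * traj['return'])
    return np.mean(estimates)
```

A weighted (self-normalized) variant, dividing by the sum of the importance weights instead of the trajectory count, is a common way to reduce variance.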

  25. Critique Data Loss L(θ, C) • Through the advice interface the teacher labels some actions as good, some as bad, and leaves some actions unlabeled. • Imagine our teacher is an Ideal Teacher who provides all good actions: the set O(s_i) of all good actions, all equally good; any action not in O(s_i) is suboptimal according to the Ideal Teacher.

  26. ‘Any Label Learning’ (ALL) • Imagine our teacher is an Ideal Teacher who provides the set O(s_i) of all good actions, all equally good; any action not in O(s_i) is suboptimal. • Learning Goal: Find a probabilistic policy, or classifier, that has a high probability of returning an action in O(s) when applied to s. • ALL Likelihood L_ALL(θ, C): for each critiqued state s_i, the probability of selecting an action in O(s_i) given s_i.
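Written out (a plausible reconstruction from the slide's wording; the exact form in the proposal may differ), the ALL likelihood multiplies, over the critiqued states, the probability mass the policy places on the good set:

L_ALL(θ, C) = ∏_i Σ_{a ∈ O(s_i)} π_θ(a | s_i)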

  27. Critique Data Loss L(θ, C) • Coming back to reality: not all teachers are ideal! The actions labeled good and bad through the advice interface provide only partial evidence about O(s_i). • What about the naïve approach of treating the labeled-good actions as the true set O(s_i)? • Difficulties: • When there are actions outside the labeled-good set that are equally good compared to those in it, the learning problem becomes even harder. • We want a principled way of handling the situation where either the labeled-good or the labeled-bad set can be empty.

  28. Expected Any-Label Learning • The actions labeled good and bad provide partial evidence about O(s_i); a user model turns this evidence into a distribution over the unknown set O(s_i). • Assume independence among different states. • From the corresponding distributions for all states, we can compute the expected ALL loss. • The gradient of the expected loss has a compact closed form.
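A minimal sketch of how such an expected loss could be computed, assuming a softmax policy and a user model that returns, for each critiqued state, a few candidate good-sets with probabilities; this is an illustration, not the proposal's exact formulation (and in practice the gradient would use the closed form rather than this enumeration):

```python
# Illustrative sketch: expected ALL loss under a simple user model.
import numpy as np

def softmax_policy(theta, features):
    """features[a] is the feature vector of action a in this state."""
    scores = features @ theta
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

def expected_all_loss(theta, critique):
    """
    critique: list of per-state entries, each a dict with
      'features'   : array of shape (n_actions, n_dims)
      'candidates' : list of (good_set, prob) pairs from the user model
    Independence across states lets the expected loss decompose into a sum.
    """
    loss = 0.0
    for entry in critique:
        pi = softmax_policy(theta, entry['features'])
        for good_set, prob in entry['candidates']:
            mass = pi[list(good_set)].sum()          # prob. of choosing a good action
            loss += prob * -np.log(mass + 1e-12)     # expected negative log-likelihood
    return loss
```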

  29. Experimental Setup • Our Domain: Micro-management in tactical battles in the Real Time Strategy (RTS) game of Wargus. • 5 friendly footmen against a group of 5 enemy footmen (Wargus AI). • Two battle maps (Map 1 and Map 2), which differed only in the initial placement of the units. • Both maps had winning strategies for the friendly team and were of roughly the same difficulty.

  30. Advice Interface • Difficulty: Fast pace and multiple units acting in parallel • Our setup: Provide end-users with an Advice Interface that allows them to watch a battle and pause it at any moment.

  31. User Study • Goal is to evaluate two systems • Supervised System = no practice session • Combined System = includes practice and critique • The user study involved 10 end-users • 6 with a CS background • 4 with no CS background • Each user trained both the supervised and combined systems • 30 minutes total for supervised • 60 minutes for combined, due to additional time for practice • Since repeated runs with human users are not practical, these results are qualitative • To provide statistical results, we first present simulated experiments

  32. Simulated Experiments • After the user study, we selected the worst and best performing users on each map when training the combined system • Total critique data: User #1: 36, User #2: 91, User #3: 115, User #4: 33 • For each user: divide the critique data into 4 equal-sized segments, creating four data sets per user containing 25%, 50%, 75%, and 100% of their respective critique data • We provided the combined system with each of these data sets and allowed it to practice for 100 episodes. All results are averaged over 5 runs.

  33. Simulated Experiments Results: Benefit of Critiques (User #1) • RL is unable to learn a winning policy (i.e. achieve a positive value).

  34. Simulated Experiments Results: Benefit of Critiques (User #1) • With more critiques, performance increases slightly.

  35. Simulated Experiments Results: Benefit of Critiques (User #1) • As the amount of critique data increases, performance improves for a fixed number of practice episodes. • RL never exceeded a health difference of 12 on any map, even after 500 trajectories.

  36. Simulated Experiments Results: Benefit of Practice (User #1) • Even with no practice, the critique data was sufficient to outperform RL. • RL never exceeded a health difference of 12.

  37. Simulated Experiments Results: Benefit of Practice (User #1) • With more practice, performance increases as well.

  38. Simulated Experiments Results: Benefit of Practice (User #1) • Our approach is able to leverage practice episodes to improve effectiveness for a given amount of critique data.

  39. Results for Actual User Study • Goal is to evaluate two systems • Supervised System = no practice session • Combined System = includes practice and critique • The user study involved 10 end-users • 6 with a CS background • 4 with no CS background • Each user trained both the supervised and combined systems • 30 minutes total for supervised • 60 minutes for combined, due to additional time for practice

  40. Results of User Study

  41. Results of User Study • Comparing to RL: • 9 out of 10 users achieved a performance of 50 or more using the Supervised System • 6 out of 10 users achieved a performance of 50 or more using the Combined System • RL never exceeded a health difference of 12 on any map, even after 500 trajectories. • Users effectively performed better than RL using either the Supervised or the Combined System.

  42. Results of User Study • Comparing Combined and Supervised: • The end-users had slightly greater success with the supervised system versus the combined system. • More users were able to achieve performance levels of 50 and 80 using the supervised system. • Frustrating problems for users: • Large delays experienced (not an issue in many realistic settings) • The policy returned after practice was sometimes poor and seemed to ignore advice (perhaps the practice sessions were too short)

  43. PART III Future Directions

  44. Future Direction 1: Extending RL via Practice and Critique Advice • Understanding the effects of user models: • Study the sensitivity of our algorithm to various settings of the model parameters. • Robustness of our algorithm against inaccurate parameter settings. • Study the benefits of using more elaborate user models. • Understanding the effects of mixing advice from multiple teachers: • Pros: addresses the incompleteness and limited quality of advice from a single teacher. • Cons: introduces variations and more complex patterns that are harder to generalize over. • Study the benefits and harms of mixing advice. • Understanding the effects of advice types: • Study the effects of feedback-only versus mixed advice.

  45. Future Direction 2: Active Learning for Sequential Decision Making • The current advice collection mechanism is very basic: • An entire episode is played before the teacher. • The teacher scans the episode to locate places where critique is needed. • Only one episode is played. • Problems with the current advice collection mechanism: • The teacher is fully responsible for locating places where critique is needed. • Scanning an entire episode is very cognitively demanding. • There is a good chance of missing places where advice is critical. • Showing only one episode is a limitation, especially in stochastic domains. • GOAL: The learner should itself discover places where it needs advice and query the teacher at those places.

  46. Active Learning for Sequential Decision Making: Problem Description • Example critique: "In state s_i, action a_i is bad, whereas action a_j is good." • Diagram: the practice session generates a full episode (trajectory data) under the current policy; in the critique session the teacher reviews it through the advice interface, producing critique data that updates the policy parameters θ.

  47. Active Learning for Sequential Decision Making: Problem Description • Example critique: "In state s_i, action a_i is bad, whereas action a_j is good." • Diagram: an active learning module, equipped with a cost model ($$), selects the best sequence from the trajectory data generated under the current policy and presents it to the teacher through the advice interface; the resulting critique data updates the policy parameters θ. • Problem: How do we select the sequence that best optimizes the benefit-to-cost tradeoff?
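As one purely illustrative instantiation of this selection problem (nothing here is specified in the proposal), the learner could score candidate segments of its sampled trajectories by policy uncertainty and charge a per-step critique cost:

```python
# Illustrative sketch: pick the trajectory segment to show the teacher by trading
# off a simple uncertainty score against a per-step critique cost.
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def select_query_segment(trajectory_probs, segment_len=10, cost_per_step=0.1):
    """
    trajectory_probs: list of action-probability vectors, one per visited state,
                      under the current policy (sampled during practice).
    Returns (start, end) of the segment maximizing uncertainty minus cost.
    """
    scores = [entropy(p) for p in trajectory_probs]
    best, best_val = None, -np.inf
    for start in range(0, len(scores) - segment_len + 1):
        val = sum(scores[start:start + segment_len]) - cost_per_step * segment_len
        if val > best_val:
            best, best_val = (start, start + segment_len), val
    return best
```

In the proposed work, the scoring criterion and cost model would come from the benefit-to-cost analysis described on this slide rather than from raw policy entropy.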

  48. What about existing techniques? • Few techniques exist for the problem of 'active learning' in sequential decision making with an external teacher • All of them make assumptions that work only for certain applications • Some techniques request a full demonstration from the start state • Some techniques assume the teacher is always available and request a single- or multi-step demonstration when needed • Some techniques remove the assumption that the teacher is present at all times, but they pause until the request for a demonstration is satisfied

  49. What about existing techniques? • We feel such assumptions are unnecessary in general • Providing demonstrations is quite labor intensive and sometimes not even practical • We instead seek feedback and guidance on potential execution traces of our policy • Pausing and waiting for the teacher is also inefficient • We never want to pause; instead, we keep generating execution traces from our policy for the teacher to critique later, when he/she is available

  50. What about Supervised Active Learning techniques? • Active learning is well developed for the supervised setting • All instances come from a single distribution of interest • The best instance is selected based on some criterion and queried for its label • In our setting, the distribution of interest is the distribution of states along the teacher's policy (or a good policy) • Asking queries about states far off the teacher's policy is unlikely to produce useful feedback (e.g., losing states in Chess or Wargus) • The learner faces the additional challenge of identifying states that occur along the teacher's policy and querying in those states
