
Presentation Transcript


  1. Active Imitation Learning via Reduction to I.I.D. Active Learning. Kshitij Judah, Alan Fern, Tom Dietterich. UAI-2012, Catalina Island, CA, USA.

  2. GOAL: The teacher wants to teach a policy to the learner. In imitation learning, the teacher provides demonstration trajectories, a machine learning algorithm is trained on them, and the result is a policy for the learner. Example applications: car driving [Pomerleau, 89], flight simulation [Sammut et al., 92], and electronic games [Ross and Bagnell, 2010].

  3. Passive Imitation Learning: reduction to passive i.i.d. learning. The teacher's demonstration trajectories are broken into labeled (state, action) pairs, which are handed to a passive i.i.d. learner; the resulting classifier is used as the learner's policy. The regret of this policy grows with the classifier's error rate and the horizon H [Ross and Bagnell, 2010; Syed and Schapire, 2010].
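A minimal sketch of this passive reduction (behavioral cloning), assuming a generic scikit-learn-style classifier and hypothetical `env` / `teacher_policy` interfaces; it illustrates the idea rather than the paper's exact setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # any off-the-shelf classifier

def passive_imitation(env, teacher_policy, n_trajectories, horizon):
    """Behavioral cloning: collect teacher trajectories, then train a classifier
    on the resulting (state, action) pairs treated as i.i.d. labeled data."""
    states, actions = [], []
    for _ in range(n_trajectories):
        s = env.reset()
        for _ in range(horizon):
            a = teacher_policy(s)            # the teacher demonstrates the action
            states.append(s)
            actions.append(a)
            s, done = env.step(a)            # assumed simulator interface
            if done:
                break
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.array(states), np.array(actions))
    return lambda s: clf.predict(np.array([s]))[0]   # the learned policy
```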

  4. DRAWBACK of passive imitation learning: generating full demonstration trajectories can be tedious and may even be impractical, e.g., real-time low-level control of multiple game agents, controlling a quadruped robot on rough terrain [Ratliff et al., 2006], or simulated snake control.

  5. Active Imitation Learning via Action Queries. The learner has access to a dynamics simulator (no rewards). It selects a state and poses an action query; the teacher answers with the correct action to take in that state, and the (s, a) pair is added to the current training data. GOAL: learn a good policy using as few queries as possible.
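A rough sketch of the action-query loop, with hypothetical `select_query_state`, `teacher`, and `train_classifier` functions; how the query state should be chosen is exactly the question the rest of the talk addresses:

```python
def active_imitation(select_query_state, teacher, train_classifier, n_queries):
    """Generic active imitation loop: repeatedly pick a state, ask the teacher
    for the correct action there, and retrain on the growing (s, a) dataset."""
    data = []                                  # current training data of (state, action) pairs
    policy = None
    for _ in range(n_queries):
        s = select_query_state(policy, data)   # choose the most informative state to ask about
        a = teacher(s)                         # the teacher answers the action query
        data.append((s, a))
        policy = train_classifier(data)        # update the learner's policy
    return policy
```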

  6. Key Questions (label complexity is measured as the # of queries posed by the active learner versus the # of actions demonstrated by the teacher): • Empirical performance: active vs. passive? • Theoretical label complexity: active vs. passive? In the i.i.d. setting, active learning has significant empirical and theoretical advantages. Our approach: "reduce" active imitation learning to i.i.d. active learning.

  7. I.I.D. PAC Learning. Goal: train a classifier such that, with probability at least 1 - δ, its error rate is at most ε. Passive setting: examples are drawn from the unlabeled data distribution, labeled with their true labels by the oracle/teacher, and given to a passive i.i.d. learner; the passive label complexity is the number of labeled examples required. Active setting: the i.i.d. active learner draws from the same unlabeled data distribution, selects the best query, asks the oracle/teacher for the true label, and adds the pair to its current training data; the active label complexity is the number of label queries required.
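In standard notation (a reconstruction of the symbols missing from the slide, not a quote from it), the PAC goal and the typical passive-versus-active gap look like:

```latex
% PAC goal: with probability at least 1 - \delta over the training data,
% the learned classifier h has error at most \epsilon on distribution D.
\Pr\big[\, \mathrm{err}_{D}(h) \le \epsilon \,\big] \;\ge\; 1 - \delta

% Illustrative label-complexity scaling in favorable (e.g., realizable) cases,
% with constants and confidence terms omitted:
N_{\text{passive}}(\epsilon) = O\!\left(\tfrac{1}{\epsilon}\right)
\qquad
N_{\text{active}}(\epsilon) = O\!\left(\log\tfrac{1}{\epsilon}\right)
```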

  8. Reducing Active Imitation Learning to I.I.D. Active Learning. The i.i.d. active learner is given an unlabeled data distribution over states; it selects the best action query, and the teacher answers with the correct action to take in the queried state. The answered (s, a) pair is added to the current training data.

  9. Reducing Active Imitation Learning to I.I.D. Active Learning. Key Challenge: what unlabeled data distribution do we give to the i.i.d. active learner?

  10. Reducing Active Imitation Learning to I.I.D. Active Learning. Ideal distribution: the state distribution induced by the teacher's policy. But this distribution is unknown, since we don't know the teacher's policy; learning that policy is the very problem we are trying to solve. Chicken-and-egg problem!
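To make "state distribution of the teacher policy" concrete (notation assumed here, not taken from the slide): it is the distribution over states encountered when executing a policy from the initial state distribution over the horizon.

```latex
% State distribution induced by policy \pi over an H-step horizon (assumed notation):
d_{\pi}(s) \;=\; \frac{1}{H}\sum_{t=1}^{H}
  \Pr\big(s_t = s \;\big|\; s_1 \sim \mu_0,\; a_{t'} \sim \pi(s_{t'})\big)
% The ideal unlabeled distribution for the reduction is d_{\pi^*}, where \pi^* is the
% teacher's policy -- but d_{\pi^*} cannot be sampled without already knowing \pi^*.
```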

  11. Reducing Active Imitation Learning to I.I.D. Active Learning. Naïve approach: use an arbitrary state distribution. Learning on an arbitrary distribution often leads to poor performance.

  12. Reducing Active Imitation Learning to I.I.D. Active Learning. Naïve approach: use an arbitrary state distribution. Learning on an arbitrary distribution often leads to poor performance, and queries on states the learned policy will never visit are wasted.

  13. Reducing Active Imitation Learning to I.I.D. Active Learning. Key Challenge (revisited): what unlabeled data distribution do we give to the i.i.d. active learner?

  14. Reducing Active Imitation Learning to I.I.D. Active Learning. Two reductions: • For non-stationary policies: Active Forward Training, based on the forward training algorithm of Ross and Bagnell (2010) (sketched below). • For stationary policies: RAIL, "Reduction-based Active Imitation Learning". Both reductions achieve exponentially better label complexity than any known passive imitation learning algorithm.
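A compressed sketch of the non-stationary reduction (active forward training), assuming hypothetical helpers `active_learn` (an i.i.d. active learner that poses queries to the teacher) and the same simulator interface as before; it trains one classifier per time step on states reached by the already-learned earlier steps, in the spirit of Ross and Bagnell's forward training:

```python
def active_forward_training(env, teacher, active_learn, horizon, k):
    """Train a non-stationary policy (one classifier per time step).
    At step t, the unlabeled states are those reached at time t by executing
    the classifiers already learned for steps 0..t-1."""
    classifiers = []
    for t in range(horizon):
        # Roll out the partial policy to collect states seen at time step t.
        unlabeled_states = [rollout(env, classifiers, t) for _ in range(100)]
        # The i.i.d. active learner picks k of these states and queries the teacher.
        clf_t = active_learn(unlabeled_states, teacher, n_queries=k)
        classifiers.append(clf_t)
    # The non-stationary policy applies classifier t at time step t.
    return lambda s, t: classifiers[t].predict([s])[0]

def rollout(env, classifiers, t):
    """Execute the learned classifiers for the first t steps; return the state at step t."""
    s = env.reset()
    for step in range(t):
        s, _ = env.step(classifiers[step].predict([s])[0])
    return s
```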

  15. RAIL Algorithm, iteration t = 1: generate trajectories with the current (initial) policy and extract unlabeled states from them. The i.i.d. active learner selects the best action queries over these states and poses k queries to the teacher, who answers with the correct action for each queried state; the (s, a) pairs are added to the current training data, from which this iteration's policy is trained.

  16. RAIL Algorithm, iteration t = 2: generate trajectories by executing the policy learned at iteration 1, extract unlabeled states, and again let the i.i.d. active learner pose k queries to the teacher, adding the answered (s, a) pairs to the current training data.

  17. RAIL Algorithm, iteration t = H: the same steps are repeated with trajectories generated by the previous iteration's policy, and the policy learned at iteration H is returned as the final learned policy!

  18. RAIL Algorithm (accuracy view), iteration t = 1: generate trajectories, extract unlabeled states, and pose k queries to the teacher; the i.i.d. active learner returns an accurate classifier with respect to this iteration's state distribution.

  19. RAIL Algorithm (accuracy view), iteration t = 2: the same steps produce an accurate classifier with respect to the state distribution induced by the previous iteration's policy.

  20. RAIL Algorithm (accuracy view), iteration t = H: after the final iteration, the policy learned at iteration H is returned as the final learned policy!
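Putting the iterations above together, a minimal RAIL sketch under assumed interfaces (`env`, `teacher`, and an i.i.d. active learner `active_learn`); it mirrors the walkthrough but is not the paper's exact pseudocode:

```python
def rail(env, teacher, active_learn, initial_policy, horizon, k):
    """RAIL: at iteration t, the previous policy's state distribution supplies the
    unlabeled data for an i.i.d. active learner, which poses k action queries."""
    policy = initial_policy
    for t in range(1, horizon + 1):
        # Generate trajectories with the current policy and extract unlabeled states.
        unlabeled_states = []
        for _ in range(50):                       # number of rollouts is arbitrary here
            s = env.reset()
            for _ in range(horizon):
                unlabeled_states.append(s)
                s, done = env.step(policy(s))
                if done:
                    break
        # The i.i.d. active learner selects k states, queries the teacher for the
        # correct actions, and returns a classifier trained on the answers.
        clf = active_learn(unlabeled_states, teacher, n_queries=k)
        policy = lambda s, clf=clf: clf.predict([s])[0]
    return policy                                  # the policy from iteration H
```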

  21. RAIL Algorithm: Main Result. Assume an active i.i.d. PAC learner La with parameters ε and δ. Theorem: if La is run at each iteration with parameters ε and δ/H, then with probability at least 1 - δ the learned policy satisfies a regret bound that scales with ε and the horizon H. A corresponding bound holds for passive imitation learning [Ross and Bagnell, 2010]. What are the label complexities of RAIL and of passive imitation learning needed to achieve a given target regret?

  22. Label Complexity: RAIL versus Passive Imitation Learning. In the realizable setting, RAIL achieves exponentially better label complexity than passive imitation learning [Ross and Bagnell, 2010].
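The exact expressions appear in the original slide's images; as an illustration of what "exponentially better" means here, the standard realizable-setting gap from i.i.d. learning theory, which the reduction inherits per iteration, is roughly:

```latex
% Illustrative scaling only (constants, horizon factors, and confidence terms omitted;
% not the exact bounds from the paper):
N_{\text{passive}}(\epsilon) = O\!\left(\tfrac{1}{\epsilon}\right)
\qquad\text{vs.}\qquad
N_{\text{RAIL}}(\epsilon) = O\!\left(\log \tfrac{1}{\epsilon}\right)
\quad\text{queries per iteration.}
```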

  23. Practical RAIL: RAIL-DW. In the idealized RAIL analyzed above, each iteration poses k queries to the teacher, and the training data collected at an iteration is discarded after that iteration's learning step.

  24. Practical RAIL: RAIL-DW. The practical variant keeps the training data from all previous iterations, selects queries with density-weighted QBC [McCallum & Nigam, 1998], and poses only k = 1 query to the teacher per iteration.
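A small sketch of density-weighted query-by-committee in the spirit of McCallum & Nigam (1998): the committee's disagreement about a candidate state is weighted by how typical that state is among the unlabeled states. The vote-entropy disagreement and Gaussian-similarity density used below are illustrative choices, not necessarily the ones used in RAIL-DW, and states are assumed to be feature vectors:

```python
import numpy as np

def density_weighted_qbc(unlabeled_states, committee):
    """Return the index of the state with the highest (disagreement * density) score.
    committee: list of classifiers, e.g., trained on bootstrap samples of the labeled data."""
    X = np.array(unlabeled_states)                       # (n_states, n_features)

    # Disagreement: vote entropy of the committee's predicted actions for each state.
    votes = np.stack([clf.predict(X) for clf in committee])   # (n_members, n_states)
    def vote_entropy(col):
        _, counts = np.unique(col, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log(p + 1e-12)).sum()
    disagreement = np.apply_along_axis(vote_entropy, 0, votes)

    # Density: average Gaussian similarity of each state to the other unlabeled states,
    # so queries concentrate on states that are representative of the distribution.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    density = np.exp(-sq_dists).mean(axis=1)

    return int(np.argmax(disagreement * density))
```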

  25. Experiments. We performed experiments in the following domains: bicycle balancing, Wargus, cart-pole, and structured prediction. We compared RAIL-DW against the following baselines:
  • unif-RAND: selects states to query uniformly at random.
  • unif-QBC: treats all states as i.i.d. and applies standard QBC.
  • Passive imitation learning (Passive): simulates standard passive imitation learning.
  • Confidence-based autonomy (CBA) [Chernova & Veloso, 2009]: queries the teacher whenever its confidence falls below an automatically computed threshold; performance can be quite sensitive to how this threshold is adjusted.

  26. Bicycle Balancing

  27. Bicycle Balancing: Results

  28. Bicycle Balancing: Results

  29. Bicycle Balancing: Results

  30. Bicycle Balancing: Results

  31. Bicycle Balancing: Results

  32. Wargus

  33. Wargus: Results. We did not run unif-QBC and unif-RAND in this domain because of the difficulty of defining a space of feasible states over which to sample uniformly.

  34. Summary and Future Work. Theoretical label complexity: active is better than passive. Empirical performance: active is better than passive. Future work: active querying in the RL setting, theoretical analysis of RAIL-DW, and multiple query types.

  35. Questions?
