
Apprenticeship Learning

Pieter Abbeel

Stanford University

In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov, Sebastian Thrun.

Machine Learning
  • Large number of success stories:
    • Handwritten digit recognition
    • Face detection
    • Disease diagnosis

All learn from examples a direct mapping from inputs to outputs.

  • Reinforcement learning / Sequential decision making:
    • Humans still greatly outperform machines.
Reinforcement learning

[Diagram: the Dynamics Model Psa (a probability distribution over next states given the current state and action) and the Reward Function R (describing how desirable, or costly, it is to be in a state) feed into Reinforcement Learning, which outputs a Controller π (prescribing which actions to take).]

Apprenticeship learning

[Diagram: a Teacher Demonstration (s0, a0, s1, a1, …) is added to the picture; together with the Dynamics Model Psa and the Reward Function R, Reinforcement Learning produces a Controller π.]

Learning from demonstrations
  • Learn direct mapping from states to actions
    • Assumes controller simplicity.
    • E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002.
  • Inverse reinforcement learning [Ng & Russell, 2000]
    • Tries to recover the reward function from demonstrations.
    • Inherent ambiguity makes the reward function impossible to recover exactly.
  • Apprenticeship learning [Abbeel & Ng, 2004]
    • Exploits reward function structure + provides strong guarantees.
    • Related work since: Ratliff et al., 2006, 2007; Neu & Szepesvari, 2007; Syed & Schapire, 2008.
Apprenticeship learning
  • Key desirable properties:
    • Returns a controller π with a performance guarantee: E[R(s0) + … + R(sT) | π] ≥ E[R(s0) + … + R(sT) | π*] − ε.
    • Short running time.
    • Small number of demonstrations required.
Apprenticeship learning algorithm
  • Assume the reward function is a linear combination of known features: Rw(s) = wᵀφ(s).
  • Initialize: pick some controller π0.
  • Iterate for i = 1, 2, … :
    • Make the current best guess for the reward function. Concretely, find the reward function Rw such that the teacher maximally outperforms all previously found controllers.
    • Find the optimal controller πi for the current guess of the reward function Rw.
    • If the teacher outperforms all previously found controllers by at most ε, exit the algorithm.
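
To make the loop concrete, here is a minimal sketch in Python. It assumes user-supplied helpers solve_mdp (an RL solver returning the optimal controller for reward weights w) and feature_expectations (an estimate of E[Σt φ(st)] under a controller); both are hypothetical placeholders rather than code from the talk, and the max-margin reward step is solved with cvxpy.

```python
import numpy as np
import cvxpy as cp

def apprenticeship_learning(mu_teacher, solve_mdp, feature_expectations,
                            pi0, eps=1e-2, max_iter=50):
    """Max-margin apprenticeship learning loop (sketch in the spirit of Abbeel & Ng 2004).

    mu_teacher:           teacher's feature expectations, shape (d,)
    solve_mdp(w):         returns a controller optimal for reward R_w(s) = w . phi(s)
    feature_expectations: maps a controller to its feature expectations, shape (d,)
    """
    d = mu_teacher.shape[0]
    controllers = [pi0]
    mus = [feature_expectations(pi0)]          # feature expectations of controllers found so far
    for i in range(max_iter):
        # Reward step: find w (||w||_2 <= 1) for which the teacher
        # maximally outperforms all previously found controllers.
        w, t = cp.Variable(d), cp.Variable()
        constraints = [cp.norm(w, 2) <= 1]
        constraints += [w @ (mu_teacher - mu) >= t for mu in mus]
        cp.Problem(cp.Maximize(t), constraints).solve()
        if t.value <= eps:                     # teacher's margin is small: done
            break
        # RL step: compute the optimal controller for the current reward guess.
        pi_i = solve_mdp(np.asarray(w.value).ravel())
        controllers.append(pi_i)
        mus.append(feature_expectations(pi_i))
    return controllers, np.asarray(w.value).ravel()
```
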
Highway driving

[Video: driving demonstration (input, left panel) and learned behavior (output, right panel).]

The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.

Parking lot navigation

Reward function trades off: curvature, smoothness, distance to obstacles, and alignment with principal directions.

Quadruped
  • Reward function trades off 25 features.
  • Learn on training terrain.
  • Test on previously unseen terrain.

[NIPS 2008]

Apprenticeship learning

[Diagram: the teacher's flight (s0, a0, s1, a1, …) is used to learn the Reward Function R; the Dynamics Model Psa and the learned Reward Function R feed into Reinforcement Learning, which outputs a Controller π.]

Motivating example

[Diagram: two routes to an accurate dynamics model Psa. Route 1: textbook model + specification. Route 2: collect flight data and learn the model from data.]

  • How to fly the helicopter for data collection?
  • How to ensure that the entire flight envelope is covered by the data collection process?

Learning the dynamics model

[Flowchart: Have a good model of the dynamics? NO: "Explore". YES: "Exploit".]

  • State-of-the-art: the E3 algorithm, Kearns and Singh (1998, 2002). (And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)
  • Exploration policies are impractical: they do not even try to perform well.
  • Can we avoid explicit exploration and just exploit?

Apprenticeship learning of the model

[Diagram: both the teacher's flight and the autonomous flights (s0, a0, s1, a1, …) are used to learn Psa; the learned Dynamics Model Psa and the Reward Function R feed into Reinforcement Learning, which outputs a Controller π.]

Theoretical guarantees
  • Here, polynomial is with respect to 1/ε, 1/δ (δ is the failure probability), the horizon T, the maximum reward R, and the size of the state space.

Model Learning: Proof Idea
  • From the initial pilot demonstrations, our model/simulator Psa will be accurate for the part of the state space (s, a) visited by the pilot.
  • Our model/simulator will correctly predict the helicopter's behavior under the pilot's controller π*.
  • Consequently, there is at least one controller (namely π*) that looks capable of flying the helicopter well in our simulation.
  • Thus, each time we solve for the optimal controller using the current model/simulator Psa, we will find a controller that successfully flies the helicopter according to Psa.
  • If, on the actual helicopter, this controller fails to fly the helicopter even though the model Psa predicts that it should, then it must be visiting parts of the state space that are inaccurately modeled.
  • Hence, we get useful training data to improve the model. This can happen only a small number of times.
Learning the dynamics model
  • Exploiting structure from physics:
    • Explicitly encode gravity and inertia.
    • Estimate the remaining dynamics from data.
  • Lagged learning criterion (see the sketch below):
    • Maximize the prediction accuracy of the simulator over the time scales relevant for control (vs. the numerical-integration time scale).
    • Similar in spirit to discriminative vs. generative approaches in machine learning.

[Abbeel et al. {NIPS 2005, NIPS 2006}]
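
As an illustration of the lagged criterion, the sketch below fits a toy linear model by minimizing H-step rollout error rather than one-step error. The linear model, the random "trajectory", and the horizon H are illustrative assumptions only; the models in the talk encode gravity and inertia explicitly.

```python
import numpy as np
from scipy.optimize import minimize

def step(theta, s, a, ns, na):
    """Toy linear dynamics model: s_{t+1} = A s_t + B a_t, with A and B packed in theta."""
    A = theta[:ns * ns].reshape(ns, ns)
    B = theta[ns * ns:].reshape(ns, na)
    return A @ s + B @ a

def lagged_loss(theta, states, actions, H, ns, na):
    """Sum of squared H-step prediction errors (lagged criterion); H = 1 recovers
    the usual one-step criterion."""
    loss, T = 0.0, len(actions)
    for t in range(T - H):
        s_hat = states[t]
        for k in range(H):                      # roll the model forward H steps
            s_hat = step(theta, s_hat, actions[t + k], ns, na)
        loss += np.sum((s_hat - states[t + H]) ** 2)
    return loss

# Hypothetical logged trajectory (stands in for recorded flight data).
rng = np.random.default_rng(0)
ns, na, T, H = 4, 2, 60, 5
states = rng.standard_normal((T + 1, ns))
actions = rng.standard_normal((T, na))

theta0 = np.zeros(ns * ns + ns * na)
fit = minimize(lagged_loss, theta0, args=(states, actions, H, ns, na), method="L-BFGS-B")
```
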

Related work
  • Bagnell & Schneider, 2001; LaCivita et al., 2006; Ng et al., 2004a; Roberts et al., 2003; Saripalli et al., 2003; Ng et al., 2004b; Gavrilets, Martinos, Mettler and Feron, 2002.
  • The maneuvers presented here are significantly more difficult than those flown by any other autonomous helicopter.
Apprenticeship learning

[Diagram: the teacher's flight and the autonomous flights (s0, a0, s1, a1, …) are used to learn both Psa and R; the Dynamics Model Psa and the Reward Function R feed into Reinforcement Learning, which outputs a Controller π.]

  • Model predictive control
  • Receding horizon differential dynamic programming
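
The two bullets above refer to how the controller is computed online. Below is a minimal receding-horizon (MPC) sketch: at every step a short action sequence is optimized against the model and only the first action is applied. The toy model, quadratic tracking cost, and horizon are illustrative assumptions, not the differential-dynamic-programming controller used in the talk.

```python
import numpy as np
from scipy.optimize import minimize

# Toy learned dynamics model and quadratic tracking cost (hypothetical stand-ins).
def model(s, a):
    return s + 0.1 * a                      # placeholder for the learned Psa

def cost(s, a, s_ref):
    return np.sum((s - s_ref) ** 2) + 0.01 * np.sum(a ** 2)

def mpc_action(s, s_ref, horizon=10, na=2):
    """Receding-horizon control: optimize a short action sequence against the
    model, then return only the first action."""
    def total_cost(flat_actions):
        a_seq = flat_actions.reshape(horizon, na)
        s_sim, c = s.copy(), 0.0
        for a in a_seq:
            c += cost(s_sim, a, s_ref)
            s_sim = model(s_sim, a)
        return c
    res = minimize(total_cost, np.zeros(horizon * na), method="L-BFGS-B")
    return res.x.reshape(horizon, na)[0]

# Closed loop: re-plan at every step from the newly observed state.
s, s_ref = np.ones(2), np.zeros(2)
for t in range(20):
    a = mpc_action(s, s_ref)
    s = model(s, a)                         # in reality: the physical system
```
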
Apprenticeship learning: summary

[Diagram: the teacher's flight and the autonomous flights (s0, a0, s1, a1, …) are both used to learn the Dynamics Model Psa and the Reward Function R; Reinforcement Learning then produces a Controller π.]

Current and future work
  • Applications:
    • Autonomous helicopters to assist in wildland fire fighting.
    • Fixed-wing formation flight: Estimated fuel savings for three aircraft formation: 20%.
  • Learning from demonstrations only scratches the surface of how humans learn (and teach).
    • Safe autonomous learning.
    • More general advice taking.
References

  • Apprenticeship Learning via Inverse Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2004.
  • Learning First Order Markov Models for Control, Pieter Abbeel and Andrew Y. Ng. In NIPS 17, 2005.
  • Exploration and Apprenticeship Learning in Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2005.
  • Modeling Vehicular Dynamics, with Application to Modeling Helicopters, Pieter Abbeel, Varun Ganapathi and Andrew Y. Ng. In NIPS 18, 2006.
  • Using Inaccurate Models in Reinforcement Learning, Pieter Abbeel, Morgan Quigley and Andrew Y. Ng. In Proc. ICML, 2006.
  • An Application of Reinforcement Learning to Aerobatic Helicopter Flight, Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng. In NIPS 19, 2007.
  • Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, J. Zico Kolter, Pieter Abbeel and Andrew Y. Ng. In NIPS 20, 2008.
Full Inverse RL Algorithm
  • Initialize: pick some arbitrary reward weights w.
  • For i = 1, 2, …
    • RL step: compute the optimal controller πi for the current estimate of the reward function Rw.
    • Inverse RL step: re-estimate the reward function Rw, i.e., find the weights w for which the teacher maximally outperforms all controllers π1, …, πi found so far.
    • If the teacher's margin over all controllers found so far is at most ε, exit the algorithm.
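
Written out in the feature-expectation notation μ(π) = E[Σt φ(st) | π] (notation assumed here, not shown on the slide), the inverse RL step can be posed as a max-margin problem in the spirit of Abbeel & Ng (2004):

```latex
\max_{w,\;t} \; t
\quad \text{subject to} \quad
w^\top \mu(\pi^*) \;\ge\; w^\top \mu(\pi_j) + t \;\; \text{for } j = 1,\dots,i,
\qquad \|w\|_2 \le 1,
```

with the algorithm exiting once the optimal margin satisfies t ≤ ε.
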

Apprenticeship learning

[Diagram repeated: the teacher's flight and the autonomous flights (s0, a0, s1, a1, …) are used to learn Psa and R; Reinforcement Learning then produces a Controller π.]

Algorithm Idea
  • Input to algorithm: approximate model.
  • Start by computing the optimal controller according to the model.

[Figure: the target trajectory vs. the real-life trajectory flown by the model-based controller.]
Algorithm Idea (2)
  • Update the model such that it becomes exact for the current controller.
[Video: first trial (model-based controller) vs. after learning (10 iterations).]

Performance guarantee intuition
  • Intuition by example:
    • Let the reward be a linear combination of two features: R(s) = w1·φ1(s) + w2·φ2(s).
    • If the returned controller π matches the teacher's expected accumulated feature values E[Σt φ1(st)] and E[Σt φ2(st)],
    • then no matter what the values of w1 and w2 are, the controller π performs as well as the teacher's controller π*.
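
The one-line argument behind this intuition, in the feature-expectation notation μ(π) = E[Σt φ(st) | π] (notation assumed here): matching feature expectations up to ε guarantees near-identical expected reward for every bounded weight vector.

```latex
\Bigl|\, E\bigl[\textstyle\sum_t R(s_t) \mid \pi\bigr] - E\bigl[\textstyle\sum_t R(s_t) \mid \pi^*\bigr] \Bigr|
= \bigl| w^\top \mu(\pi) - w^\top \mu(\pi^*) \bigr|
\;\le\; \|w\|_2 \,\|\mu(\pi) - \mu(\pi^*)\|_2
\;\le\; \epsilon
\quad \text{if } \|w\|_2 \le 1 \text{ and } \|\mu(\pi) - \mu(\pi^*)\|_2 \le \epsilon .
```
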
Summary

[Diagram: the teacher's (human pilot) flight and the autonomous flights (a1, s1, a2, s2, a3, s3, …) are used to learn and improve the Dynamics Model Psa and to learn the Reward Function R; Reinforcement Learning then produces a Controller π.]

  • When given a demonstration:
    • Automatically learn the reward function, rather than (time-consumingly) hand-engineer it.
    • Unlike exploration methods, our algorithm concentrates on the task of interest, and always tries to fly as well as possible.
    • High-performance control with a crude model + a small number of trials.

Reward: Intended trajectory
  • Perfect demonstrations are extremely hard to obtain.
  • Multiple trajectory demonstrations:
    • Every demonstration is a noisy instantiation of the intended trajectory.
    • Noise model captures (among others):
      • Position drift.
      • Time warping.
  • If different demonstrations are suboptimal in different ways, they can capture the “intended” trajectory implicitly.
  • [Related work: Atkeson & Schaal, 1997.]
Outline
  • Preliminaries: reinforcement learning.
  • Apprenticeship learning algorithms.
  • Experimental results on various robotic platforms.
Reinforcement learning (RL)

[Diagram: starting from state s0, taking action a0 leads (via the system dynamics Psa) to state s1, then action a1 leads to s2, and so on through sT-1 and aT-1 to sT; the accumulated reward is R(s0) + R(s1) + R(s2) + … + R(sT-1) + R(sT).]

Goal: pick actions over time so as to maximize the expected score: E[R(s0) + R(s1) + … + R(sT)].

Solution: a controller π which specifies an action for each possible state, for all times t = 0, 1, …, T-1.
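
For a concrete (toy) instance of this setup, the sketch below computes such a controller π for a small finite-horizon MDP by backward dynamic programming; the random transition matrix and rewards are placeholders, not the helicopter dynamics.

```python
import numpy as np

# Small finite-horizon MDP: S states, A actions, horizon T.
# P[a, s, s'] = Psa(s' | s, a); R[s] = reward for being in state s.
rng = np.random.default_rng(0)
S, A, T = 5, 3, 10
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random(S)

# Backward dynamic programming: V[t, s] = best expected R(s_t) + ... + R(s_T) from s at time t.
V = np.zeros((T + 1, S))
pi = np.zeros((T, S), dtype=int)        # pi[t, s] = action the controller takes
V[T] = R
for t in range(T - 1, -1, -1):
    Q = R[None, :] + P @ V[t + 1]       # Q[a, s] = R(s) + E[V_{t+1}(s') | s, a]
    pi[t] = Q.argmax(axis=0)
    V[t] = Q.max(axis=0)

print("Expected score from state 0:", V[0, 0])
```
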

Model-based reinforcement learning

[Diagram: run the reinforcement learning algorithm in the simulator (the learned dynamics model) to obtain the controller π.]

Apprenticeship learning for the dynamics model
  • Algorithms such as E3 (Kearns and Singh, 2002) learn the dynamics by using exploration policies, which are dangerous/impractical for many systems.
  • Our algorithm:
    • Initializes the model from a demonstration.
    • Repeatedly executes "exploitation policies" that try to maximize rewards.
    • Provably achieves near-optimal performance (compared to the teacher).
  • Machine learning theory:
    • Complicated non-IID sample-generating process.
    • Standard learning theory bounds are not applicable.
    • The proof uses a martingale construction over relative losses.

[ICML 2005]

Non-stationary maneuvers
  • Modeling is extremely complex:
    • Our dynamics model state: position, orientation, velocity, angular rate.
    • True state: air (!), head speed, servos, deformation, etc.
  • Key observation:
    • In the vicinity of a specific point along a specific trajectory, these unknown state variables tend to take on similar values.
Local model learning algorithm

1. Time-align the trajectories.

2. Learn locally weighted models in the vicinity of the trajectory, weighting data at time t′ by its temporal distance from the query time t: W(t′) = exp(−(t − t′)²/σ²).
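
A minimal sketch of fitting one such local model at a single time index t by weighted least squares; the linear model form, the array shapes, and the bandwidth sigma are illustrative assumptions.

```python
import numpy as np

def local_model(states, actions, t, sigma=5.0):
    """Fit s_{t'+1} ~ A s_{t'} + B a_{t'} by weighted least squares, with
    Gaussian time weights W(t') = exp(-(t - t')^2 / sigma^2) centered at t."""
    T = len(actions)
    tprime = np.arange(T)
    w = np.exp(-((t - tprime) ** 2) / sigma ** 2)
    X = np.hstack([states[:T], actions])            # inputs  [s_t', a_t']
    Y = states[1:T + 1]                             # targets s_{t'+1}
    WX = X * w[:, None]
    theta, *_ = np.linalg.lstsq(WX.T @ X, WX.T @ Y, rcond=None)  # weighted normal equations
    return theta                                    # stacked [A; B]

# Hypothetical time-aligned trajectory data.
rng = np.random.default_rng(0)
states = rng.standard_normal((101, 4))
actions = rng.standard_normal((100, 2))
theta_t = local_model(states, actions, t=50)
```
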

Algorithm Idea w/Teacher
  • Input to algorithm:
    • Teacher demonstration.
    • Approximate model.

[Figure: the teacher's trajectory vs. the trajectory predicted by the simulator/model for the same inputs.]

[ICML 2006]

Algorithm Idea w/Teacher (2)
  • Update the model such that it becomes exact for the demonstration.
  • The updated model perfectly predicts the state sequence obtained during the demonstration.
  • We can use the updated model to find a feedback controller.
Algorithm w/Teacher
  • Record the teacher's demonstration s0, s1, …
  • Update the (crude) model/simulator to be exact for the teacher's demonstration by adding appropriate time-dependent bias terms for each time step (sketched below).
  • Return the policy π that is optimal according to the updated model/simulator.
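
Computing those bias terms is a one-liner per time step: each bias is simply the discrepancy between the observed next state and the crude model's prediction. The crude model and the random demonstration below are hypothetical stand-ins.

```python
import numpy as np

def time_biases(crude_model, demo_states, demo_actions):
    """Bias b_t chosen so that crude_model(s_t, a_t) + b_t reproduces the
    teacher's next state exactly, for every step of the demonstration."""
    return [demo_states[t + 1] - crude_model(demo_states[t], demo_actions[t])
            for t in range(len(demo_actions))]

# Hypothetical crude model and recorded teacher demonstration.
crude_model = lambda s, a: s + 0.08 * a
rng = np.random.default_rng(0)
demo_states = rng.standard_normal((21, 4))
demo_actions = rng.standard_normal((20, 4))
biases = time_biases(crude_model, demo_states, demo_actions)
```
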
Algorithm [iterative]
  • Record the teacher's demonstration s0, s1, …
  • Update the (crude) model/simulator to be exact for the teacher's demonstration by adding appropriate time-dependent bias terms for each time step.
  • Find the policy π that is optimal according to the updated model/simulator.
  • Execute the policy π and record the state trajectory.
  • Update the (crude) model/simulator to be exact along the trajectory obtained with the policy π.
  • Go to step 3.
    • Related work: iterative learning control (ILC).
Algorithm
  • Find the (locally) optimal policy π for the model.
  • Execute the current policy π and record the state trajectory.
  • Update the model such that the new model is exact for the current policy π.
  • Use the new model to compute the policy gradient and update the policy parameters: θ := θ + α·(policy gradient).
  • Go back to Step 2.

Notes:

    • The step-size parameter α is determined by a line search.
    • Instead of the policy gradient, any algorithm that provides a local policy improvement direction can be used. In our experiments we used differential dynamic programming.
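
The toy sketch below runs this loop end-to-end with a linear policy, a deliberately inaccurate model, and a numerical policy gradient; every ingredient (the dynamics, the cost, the fixed step size in place of a line search) is an illustrative assumption rather than the DDP-based setup from the talk.

```python
import numpy as np

# Hypothetical stand-ins: a crude model, the "real" system, and a linear policy.
def crude_model(s, a): return s + 0.08 * a            # inaccurate simulator
def real_system(s, a): return s + 0.10 * a            # stands in for the physical system
def policy(theta, s):  return theta @ s               # linear policy a = theta s

def rollout(step_fn, theta, s0, T):
    traj = [s0]
    for t in range(T):
        traj.append(step_fn(traj[-1], policy(theta, traj[-1])))
    return traj

def utility(traj):
    return -sum(np.sum(s ** 2) for s in traj)          # reward: drive the state to the origin

s0, T, alpha = np.ones(2), 20, 0.01
theta = np.zeros((2, 2))
for it in range(10):
    # Steps 1-2: execute the current policy on the real system; record the trajectory.
    real_traj = rollout(real_system, theta, s0, T)
    # Step 3: make the model exact along this trajectory via time-indexed bias terms.
    bias = [real_traj[t + 1] - crude_model(real_traj[t], policy(theta, real_traj[t]))
            for t in range(T)]
    def corrected_rollout(th):
        traj = [s0]
        for t in range(T):
            traj.append(crude_model(traj[-1], policy(th, traj[-1])) + bias[t])
        return traj
    # Step 4: policy-gradient step on the corrected model (numerical gradient for brevity).
    def U(th_flat):
        return utility(corrected_rollout(th_flat.reshape(2, 2)))
    eps, flat = 1e-4, theta.ravel()
    grad = np.array([(U(flat + eps * e) - U(flat - eps * e)) / (2 * eps)
                     for e in np.eye(flat.size)])
    theta = (flat + alpha * grad).reshape(2, 2)         # (a line search would set alpha)
```
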
Acknowledgments
  • J. Zico Kolter, Andrew Y. Ng
  • Adam Coates, Morgan Quigley, Andrew Y. Ng
  • Andrew Y. Ng
  • Morgan Quigley, Andrew Y. Ng
Teacher demonstration for quadruped
  • Full teacher demonstration = sequence of footsteps.
  • Much simpler to “teach hierarchically”:
    • Specify a body path.
    • Specify best footstep in a small area.
Hierarchical inverse RL
  • Quadratic programming problem (QP):
    • quadratic objective, linear constraints.
  • Constraint generation for path constraints.
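
As a schematic of what such a QP looks like (an illustrative structured-margin form, not the exact constraints from the hierarchical apprenticeship learning paper): the weights w are chosen so that, at each labeled location i, the teacher's footstep has lower cost than every alternative footstep f in the surrounding region, up to a slack ξi.

```latex
\min_{w,\;\xi \ge 0} \;\; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad
w^\top \phi(\text{teacher's footstep}_i) \;\le\; w^\top \phi(f) - 1 + \xi_i
\quad \text{for all alternative footsteps } f \text{ near } i .
```
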
Experimental setup
  • Training:
    • Have quadruped walk straight across a fairly simple board with fixed-spaced foot placements.
    • Around each foot placement: label the best foot placement. (about 20 labels)
    • Label the best body-path for the training board.
  • Use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels.
  • Test on hold-out terrains:
    • Plan a path across the test-board.
Helicopter Flight
  • Task:
    • Hover at a specific point.
    • Initial state: tens of meters away from target.
  • Reward function trades off:
    • Position accuracy,
    • Orientation accuracy,
    • Zero velocity,
    • Zero angular rate,
    • … (11 features total)
More driving examples

[Videos: pairs of driving demonstration (left) and learned behavior (right).]

In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching the demonstration.