## Apprenticeship Learning

Pieter Abbeel, Stanford University


In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov, and Sebastian Thrun.

**Machine Learning**

• Large number of success stories: handwritten digit recognition, face detection, disease diagnosis, …
• All of these learn from examples a direct mapping from inputs to outputs.
• Reinforcement learning / sequential decision making: humans still greatly outperform machines.

**Reinforcement learning**

• Dynamics model Psa: probability distribution over next states, given the current state and action.
• Reward function R: describes how desirable (or costly) it is to be in a state.
• Controller π: prescribes which actions to take. Reinforcement learning computes π from the dynamics model Psa and the reward function R.

**Apprenticeship learning**

• A teacher demonstration (s0, a0, s1, a1, …) is added as input alongside the dynamics model Psa and the reward function R; reinforcement learning then produces the controller π.

**Learning from demonstrations**

• Learn a direct mapping from states to actions. Assumes controller simplicity. E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002.
• Inverse reinforcement learning [Ng & Russell, 2000]: tries to recover the reward function from demonstrations. Inherent ambiguity makes the reward function impossible to recover exactly.
• Apprenticeship learning [Abbeel & Ng, 2004]: exploits reward function structure and provides strong guarantees. Related work since: Ratliff et al., 2006, 2007; Neu & Szepesvari, 2007; Syed & Schapire, 2008.

**Apprenticeship learning**

• Key desirable properties:
• Returns a controller with a performance guarantee.
• Short running time.
• Small number of demonstrations required.

**Apprenticeship learning algorithm**

• Assume the reward is linear in known features: Rw(s) = wᵀφ(s).
• Initialize: pick some controller π0.
• Iterate for i = 1, 2, …:
• Make the current best guess for the reward function: concretely, find the reward function such that the teacher maximally outperforms all previously found controllers.
• Find the optimal controller πi for the current guess of the reward function Rw.
• If the margin by which the teacher outperforms all previously found controllers falls below a tolerance ε, exit the algorithm. (A code sketch of this loop follows below.)
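As a concrete illustration, here is a minimal sketch of the projection variant of this loop from Abbeel & Ng (2004). The helpers `estimate_mu` (a Monte Carlo estimate of a controller's feature expectations μ(π) = E[Σt γᵗ φ(st)]) and `solve_mdp` (an MDP solver returning the optimal controller for reward weights w) are hypothetical stand-ins, not part of the original slides:

```python
import numpy as np

def apprenticeship_learning(mu_expert, estimate_mu, solve_mdp, pi0,
                            eps=1e-3, max_iters=100):
    """Projection-style apprenticeship learning loop (after Abbeel & Ng, 2004).

    mu_expert: teacher's feature expectations mu_E = E[sum_t gamma^t phi(s_t)].
    estimate_mu(pi): Monte Carlo estimate of a controller's feature expectations.
    solve_mdp(w): optimal controller for the linear reward R_w(s) = w . phi(s).
    """
    mu_bar = estimate_mu(pi0)          # feature expectations of pi_0
    controllers = [pi0]
    for _ in range(max_iters):
        w = mu_expert - mu_bar         # reward guess: direction in which the
        t = np.linalg.norm(w)          # teacher outperforms controllers so far
        if t <= eps:                   # within eps of the teacher: done
            break
        pi = solve_mdp(w / t)          # RL step: optimal controller for R_w
        controllers.append(pi)
        mu = estimate_mu(pi)
        d = mu - mu_bar                # project mu_E onto the line through
        mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d
    return controllers
```

The projection step replaces the max-margin QP with a closed-form update while keeping the same guarantee: the returned set of controllers contains one whose performance is within ε of the teacher's.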
**Highway driving**

• Input: a driving demonstration. Output: learned driving behavior.
• The only input to the learning algorithm was the driving demonstration; no reward function was provided.

**Parking lot navigation**

• The learned reward function trades off curvature, smoothness, distance to obstacles, and alignment with the principal directions of the lot.

**Quadruped**

• The reward function trades off 25 features.
• Learn on training terrain; test on previously unseen terrain. [NIPS 2008]

**Apprenticeship learning**

• For helicopter flight, the teacher's flight (s0, a0, s1, a1, …) is used to learn the reward function R; reinforcement learning with the dynamics model Psa then yields the controller π.

**Motivating example**

• Collect flight data: how should the helicopter be flown for data collection, and how do we ensure the entire flight envelope is covered by the data collection process?
• Two routes to an accurate dynamics model Psa: build a textbook model from the specification, or learn the model from data.

**Learning the dynamics model**

• Have a good model of the dynamics? If yes, "exploit"; if no, "explore."
• State of the art: the E3 algorithm, Kearns and Singh (1998, 2002), and its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.
• Exploration policies are impractical: they do not even try to perform well. Can we avoid explicit exploration and just exploit?

**Apprenticeship learning of the model**

• Learn the dynamics model Psa from both the teacher's flight and subsequent autonomous flights (s0, a0, s1, a1, …); reinforcement learning with the reward function R then yields the controller π.

**Theoretical guarantees**

• The required amount of data is polynomial with respect to 1/ε (accuracy), 1/δ (failure probability), the horizon T, the maximum reward R, and the size of the state space.

**Model Learning: Proof Idea**

• From the initial pilot demonstrations, our model/simulator Psa will be accurate for the part of the state space (s, a) visited by the pilot.
• Our model/simulator will therefore correctly predict the helicopter's behavior under the pilot's controller π*.
• Consequently, there is at least one controller (namely π*) that looks capable of flying the helicopter well in our simulation.
• Thus, each time we solve for the optimal controller using the current model/simulator Psa, we will find a controller that successfully flies the helicopter according to Psa.
• If, on the actual helicopter, this controller fails to fly the helicopter, despite the model Psa predicting that it should, then it must be visiting parts of the state space that are inaccurately modeled.
• Hence, we get useful training data to improve the model. This can happen only a small number of times.

**Learning the dynamics model**

• Exploit structure from physics: explicitly encode gravity and inertia; estimate the remaining dynamics from data.
• Lagged learning criterion: maximize the prediction accuracy of the simulator over the time scales relevant for control, rather than the digital integration time scale (a sketch follows below). This parallels the discriminative vs. generative distinction in machine learning. [Abbeel et al., NIPS 2005, NIPS 2006]
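One way such a lagged criterion could look in code, as a minimal sketch: score model parameters by the H-step open-loop prediction error along recorded flight data. The one-step model `simulate_step(s, a, theta)` is a hypothetical helper assumed to already encode the known physics (gravity, inertia), with `theta` capturing the remaining dynamics; the horizon `H` is illustrative:

```python
import numpy as np

def lagged_prediction_loss(theta, states, actions, simulate_step, H=10):
    """H-step prediction error of the simulator along a recorded trajectory.

    states, actions: arrays of observed states s_0..s_T and actions a_0..a_{T-1}.
    simulate_step(s, a, theta): hypothetical one-step dynamics model.
    """
    loss = 0.0
    for t in range(len(states) - H):
        s = states[t]
        for k in range(H):           # roll the model forward H steps open-loop,
            s = simulate_step(s, actions[t + k], theta)  # feeding real actions
        loss += np.sum((s - states[t + H]) ** 2)  # compare to the real state
    return loss
```

Minimizing this loss over `theta` rewards a simulator that stays close to the real trajectory over a control-relevant horizon, rather than one that is merely accurate for a single integration step.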
**Related work**

• Bagnell & Schneider, 2001; LaCivita et al., 2006; Ng et al., 2004a; Roberts et al., 2003; Saripalli et al., 2003; Ng et al., 2004b; Gavrilets, Martinos, Mettler and Feron, 2002.
• The maneuvers presented here are significantly more difficult than those flown by any other autonomous helicopter.

**Apprenticeship learning**

• With both Psa and R learned from the teacher's and autonomous flights, the controller π is computed by:
• Model predictive control.
• Receding horizon differential dynamic programming.

**Apprenticeship learning: summary**

• Learn the dynamics model Psa and the reward function R from the teacher's flight and from autonomous flights (s0, a0, s1, a1, …); reinforcement learning then yields the controller π.

**Current and future work**

• Applications: autonomous helicopters to assist in wildland fire fighting; fixed-wing formation flight (estimated fuel savings for a three-aircraft formation: 20%).
• Learning from demonstrations only scratches the surface of how humans learn (and teach).
• Safe autonomous learning.
• More general advice taking.

**References**

• Apprenticeship Learning via Inverse Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2004.
• Learning First Order Markov Models for Control, Pieter Abbeel and Andrew Y. Ng. In NIPS 17, 2005.
• Exploration and Apprenticeship Learning in Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2005.
• Modeling Vehicular Dynamics, with Application to Modeling Helicopters, Pieter Abbeel, Varun Ganapathi and Andrew Y. Ng. In NIPS 18, 2006.
• Using Inaccurate Models in Reinforcement Learning, Pieter Abbeel, Morgan Quigley and Andrew Y. Ng. In Proc. ICML, 2006.
• An Application of Reinforcement Learning to Aerobatic Helicopter Flight, Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng. In NIPS 19, 2007.
• Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, J. Zico Kolter, Pieter Abbeel and Andrew Y. Ng. In NIPS 20, 2008.

**Full Inverse RL Algorithm**

• Initialize: pick some arbitrary reward weights w.
• For i = 1, 2, …:
• RL step: compute the optimal controller πi for the current estimate of the reward function Rw.
• Inverse RL step: re-estimate the reward function Rw.
• If the margin by which the teacher outperforms the controllers found so far falls below a tolerance ε, exit the algorithm.

**Algorithm Idea**

• Input to the algorithm: an approximate model.
• Start by computing the optimal controller according to the model.

(Figure: the controller's real-life trajectory vs. the target trajectory.)

**Algorithm Idea (2)**

• Update the model such that it becomes exact for the current controller, then repeat (see the sketch below).
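These last two slides outline the idea behind Using Inaccurate Models in Reinforcement Learning (Abbeel, Quigley and Ng, ICML 2006): alternate between optimizing the controller in the model and correcting the model with real data so that it becomes exact along the current controller's trajectory. A minimal sketch under that reading; the helpers `model_step`, `optimize_controller`, and `run_real_system` are hypothetical:

```python
import numpy as np

def iterative_model_correction(model_step, optimize_controller,
                               run_real_system, n_iters=10):
    """Alternate between optimizing in the model and correcting the model.

    model_step(s, a): approximate one-step dynamics model.
    optimize_controller(model_step, bias): controller that is optimal for the
        bias-corrected model, i.e. s_{t+1} = model_step(s_t, a_t) + bias[t].
    run_real_system(pi): one real-life rollout, returning (states, actions).
    """
    bias = None   # time-indexed additive corrections to the model
    pi = None
    for _ in range(n_iters):
        # Controller that is optimal for the current bias-corrected model.
        pi = optimize_controller(model_step, bias)
        # One real-life rollout of that controller.
        states, actions = run_real_system(pi)
        # Residuals that make the corrected model reproduce the real
        # trajectory exactly under the current controller.
        bias = np.array([states[t + 1] - model_step(states[t], actions[t])
                         for t in range(len(actions))])
    return pi
```

Each iteration, the model is wrong only where the current controller has not yet been corrected, so real-life failures generate exactly the data needed to fix the model along the trajectory that matters.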