
Apprenticeship Learning

Pieter Abbeel

Stanford University

In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov, Sebastian Thrun.

Machine Learning
  • Large number of success stories:
    • Handwritten digit recognition
    • Face detection
    • Disease diagnosis

All learn from examples a direct mapping from inputs to outputs.

  • Reinforcement learning / Sequential decision making:
    • Humans still greatly outperform machines.
Reinforcement learning

[Diagram: the Dynamics Model Psa (a probability distribution over next states given the current state and action) and the Reward Function R (describing how desirable, or costly, it is to be in a state) feed into Reinforcement Learning, which outputs a Controller π (prescribing which actions to take).]

Apprenticeship learning

[Diagram: a Teacher Demonstration (s0, a0, s1, a1, …) is added to the picture; together with the Dynamics Model Psa and the Reward Function R, Reinforcement Learning produces a Controller π.]

Learning from demonstrations
  • Learn direct mapping from states to actions
    • Assumes controller simplicity.
    • E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002.
  • Inverse reinforcement learning [Ng & Russell, 2000]
    • Tries to recover the reward function from demonstrations.
    • Inherent ambiguity makes the reward function impossible to recover exactly.
  • Apprenticeship learning [Abbeel & Ng, 2004]
    • Exploits reward function structure + provides strong guarantees.
    • Related work since: Ratliff et al., 2006, 2007; Neu & Szepesvari, 2007; Syed & Schapire, 2008.
Apprenticeship learning
  • Key desirable properties:
    • Returns a controller π with a performance guarantee: E[R(s0) + … + R(sT) | π] ≥ E[R(s0) + … + R(sT) | π*] − ε.
    • Short running time.
    • Small number of demonstrations required.
Apprenticeship learning algorithm
  • Assume the reward function is a linear combination of known features: Rw(s) = wᵀφ(s).
  • Initialize: pick some controller π0.
  • Iterate for i = 1, 2, … :
    • Make the current best guess for the reward function. Concretely, find the reward function Rw such that the teacher maximally outperforms all previously found controllers.
    • Find the optimal controller πi for the current guess of the reward function Rw.
    • If the teacher outperforms all previously found controllers by at most ε, exit the algorithm.
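
To make the loop concrete, here is a minimal sketch in Python. It assumes user-supplied helpers solve_mdp (an RL solver returning the optimal controller for reward weights w) and feature_expectations (an estimate of E[Σt φ(st)] under a controller); both are hypothetical placeholders rather than code from the talk, and the max-margin reward step is solved with cvxpy.

```python
import numpy as np
import cvxpy as cp

def apprenticeship_learning(mu_teacher, solve_mdp, feature_expectations,
                            pi0, eps=1e-2, max_iter=50):
    """Max-margin apprenticeship learning loop (sketch in the spirit of Abbeel & Ng 2004).

    mu_teacher:           teacher's feature expectations, shape (d,)
    solve_mdp(w):         returns a controller optimal for reward R_w(s) = w . phi(s)
    feature_expectations: maps a controller to its feature expectations, shape (d,)
    """
    d = mu_teacher.shape[0]
    controllers = [pi0]
    mus = [feature_expectations(pi0)]          # feature expectations of controllers found so far
    for i in range(max_iter):
        # Reward step: find w (||w||_2 <= 1) for which the teacher
        # maximally outperforms all previously found controllers.
        w, t = cp.Variable(d), cp.Variable()
        constraints = [cp.norm(w, 2) <= 1]
        constraints += [w @ (mu_teacher - mu) >= t for mu in mus]
        cp.Problem(cp.Maximize(t), constraints).solve()
        if t.value <= eps:                     # teacher's margin is small: done
            break
        # RL step: compute the optimal controller for the current reward guess.
        pi_i = solve_mdp(np.asarray(w.value).ravel())
        controllers.append(pi_i)
        mus.append(feature_expectations(pi_i))
    return controllers, np.asarray(w.value).ravel()
```
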
Highway driving

[Video: driving demonstration (input, left panel) and learned behavior (output, right panel).]

The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.

Parking lot navigation

Reward function trades off: curvature, smoothness, distance to obstacles, and alignment with principal directions.

Quadruped
  • Reward function trades off 25 features.
  • Learn on training terrain.
  • Test on previously unseen terrain.

[NIPS 2008]

Apprenticeship learning

[Diagram: the teacher's flight (s0, a0, s1, a1, …) is used to learn the Reward Function R; the Dynamics Model Psa and the learned Reward Function R feed into Reinforcement Learning, which outputs a Controller π.]

Motivating example

[Diagram: two routes to an accurate dynamics model Psa. Route 1: textbook model + specification. Route 2: collect flight data and learn the model from data.]

  • How to fly the helicopter for data collection?
  • How to ensure that the entire flight envelope is covered by the data collection process?

Learning the dynamics model

[Flowchart: Have a good model of the dynamics? NO: "Explore". YES: "Exploit".]

  • State-of-the-art: the E3 algorithm, Kearns and Singh (1998, 2002). (And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)
  • Exploration policies are impractical: they do not even try to perform well.
  • Can we avoid explicit exploration and just exploit?

Apprenticeship learning of the model

[Diagram: both the teacher's flight and the autonomous flights (s0, a0, s1, a1, …) are used to learn Psa; the learned Dynamics Model Psa and the Reward Function R feed into Reinforcement Learning, which outputs a Controller π.]

Theoretical guarantees
  • Here, polynomial is with respect to 1/ε, 1/δ (δ is the failure probability), the horizon T, the maximum reward R, and the size of the state space.

Model Learning: Proof Idea
  • From the initial pilot demonstrations, our model/simulator Psa will be accurate for the part of the state space (s, a) visited by the pilot.
  • Our model/simulator will correctly predict the helicopter's behavior under the pilot's controller π*.
  • Consequently, there is at least one controller (namely π*) that looks capable of flying the helicopter well in our simulation.
  • Thus, each time we solve for the optimal controller using the current model/simulator Psa, we will find a controller that successfully flies the helicopter according to Psa.
  • If, on the actual helicopter, this controller fails to fly the helicopter even though the model Psa predicts that it should, then it must be visiting parts of the state space that are inaccurately modeled.
  • Hence, we get useful training data to improve the model. This can happen only a small number of times.
Learning the dynamics model
  • Exploiting structure from physics:
    • Explicitly encode gravity and inertia.
    • Estimate the remaining dynamics from data.
  • Lagged learning criterion (see the sketch below):
    • Maximize the prediction accuracy of the simulator over the time scales relevant for control (vs. the numerical-integration time scale).
    • Similar in spirit to discriminative vs. generative approaches in machine learning.

[Abbeel et al. {NIPS 2005, NIPS 2006}]
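
As an illustration of the lagged criterion, the sketch below fits a toy linear model by minimizing H-step rollout error rather than one-step error. The linear model, the random "trajectory", and the horizon H are illustrative assumptions only; the models in the talk encode gravity and inertia explicitly.

```python
import numpy as np
from scipy.optimize import minimize

def step(theta, s, a, ns, na):
    """Toy linear dynamics model: s_{t+1} = A s_t + B a_t, with A and B packed in theta."""
    A = theta[:ns * ns].reshape(ns, ns)
    B = theta[ns * ns:].reshape(ns, na)
    return A @ s + B @ a

def lagged_loss(theta, states, actions, H, ns, na):
    """Sum of squared H-step prediction errors (lagged criterion); H = 1 recovers
    the usual one-step criterion."""
    loss, T = 0.0, len(actions)
    for t in range(T - H):
        s_hat = states[t]
        for k in range(H):                      # roll the model forward H steps
            s_hat = step(theta, s_hat, actions[t + k], ns, na)
        loss += np.sum((s_hat - states[t + H]) ** 2)
    return loss

# Hypothetical logged trajectory (stands in for recorded flight data).
rng = np.random.default_rng(0)
ns, na, T, H = 4, 2, 60, 5
states = rng.standard_normal((T + 1, ns))
actions = rng.standard_normal((T, na))

theta0 = np.zeros(ns * ns + ns * na)
fit = minimize(lagged_loss, theta0, args=(states, actions, H, ns, na), method="L-BFGS-B")
```
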

Related work
  • Bagnell & Schneider, 2001; LaCivita et al., 2006; Ng et al., 2004a; Roberts et al., 2003; Saripalli et al., 2003; Ng et al., 2004b; Gavrilets, Martinos, Mettler and Feron, 2002.
  • The maneuvers presented here are significantly more difficult than those flown by any other autonomous helicopter.
Apprenticeship learning

[Diagram: the teacher's flight and the autonomous flights (s0, a0, s1, a1, …) are used to learn both Psa and R; the Dynamics Model Psa and the Reward Function R feed into Reinforcement Learning, which outputs a Controller π.]

  • Model predictive control
  • Receding horizon differential dynamic programming
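
The two bullets above refer to how the controller is computed online. Below is a minimal receding-horizon (MPC) sketch: at every step a short action sequence is optimized against the model and only the first action is applied. The toy model, quadratic tracking cost, and horizon are illustrative assumptions, not the differential-dynamic-programming controller used in the talk.

```python
import numpy as np
from scipy.optimize import minimize

# Toy learned dynamics model and quadratic tracking cost (hypothetical stand-ins).
def model(s, a):
    return s + 0.1 * a                      # placeholder for the learned Psa

def cost(s, a, s_ref):
    return np.sum((s - s_ref) ** 2) + 0.01 * np.sum(a ** 2)

def mpc_action(s, s_ref, horizon=10, na=2):
    """Receding-horizon control: optimize a short action sequence against the
    model, then return only the first action."""
    def total_cost(flat_actions):
        a_seq = flat_actions.reshape(horizon, na)
        s_sim, c = s.copy(), 0.0
        for a in a_seq:
            c += cost(s_sim, a, s_ref)
            s_sim = model(s_sim, a)
        return c
    res = minimize(total_cost, np.zeros(horizon * na), method="L-BFGS-B")
    return res.x.reshape(horizon, na)[0]

# Closed loop: re-plan at every step from the newly observed state.
s, s_ref = np.ones(2), np.zeros(2)
for t in range(20):
    a = mpc_action(s, s_ref)
    s = model(s, a)                         # in reality: the physical system
```
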
Apprenticeship learning: summary

[Diagram: the teacher's flight and the autonomous flights (s0, a0, s1, a1, …) are both used to learn the Dynamics Model Psa and the Reward Function R; Reinforcement Learning then produces a Controller π.]

Current and future work
  • Applications:
    • Autonomous helicopters to assist in wildland fire fighting.
    • Fixed-wing formation flight: Estimated fuel savings for three aircraft formation: 20%.
  • Learning from demonstrations only scratches the surface of how humans learn (and teach).
    • Safe autonomous learning.
    • More general advice taking.
References

  • Apprenticeship Learning via Inverse Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2004.
  • Learning First Order Markov Models for Control, Pieter Abbeel and Andrew Y. Ng. In NIPS 17, 2005.
  • Exploration and Apprenticeship Learning in Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2005.
  • Modeling Vehicular Dynamics, with Application to Modeling Helicopters, Pieter Abbeel, Varun Ganapathi and Andrew Y. Ng. In NIPS 18, 2006.
  • Using Inaccurate Models in Reinforcement Learning, Pieter Abbeel, Morgan Quigley and Andrew Y. Ng. In Proc. ICML, 2006.
  • An Application of Reinforcement Learning to Aerobatic Helicopter Flight, Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng. In NIPS 19, 2007.
  • Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, J. Zico Kolter, Pieter Abbeel and Andrew Y. Ng. In NIPS 20, 2008.
Full Inverse RL Algorithm
  • Initialize: pick some arbitrary reward weights w.
  • For i = 1, 2, …
    • RL step: compute the optimal controller πi for the current estimate of the reward function Rw.
    • Inverse RL step: re-estimate the reward function Rw, i.e., find the weights w for which the teacher maximally outperforms all controllers π1, …, πi found so far.
    • If the teacher's margin over all controllers found so far is at most ε, exit the algorithm.
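
Written out in the feature-expectation notation μ(π) = E[Σt φ(st) | π] (notation assumed here, not shown on the slide), the inverse RL step can be posed as a max-margin problem in the spirit of Abbeel & Ng (2004):

```latex
\max_{w,\;t} \; t
\quad \text{subject to} \quad
w^\top \mu(\pi^*) \;\ge\; w^\top \mu(\pi_j) + t \;\; \text{for } j = 1,\dots,i,
\qquad \|w\|_2 \le 1,
```

with the algorithm exiting once the optimal margin satisfies t ≤ ε.
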

Apprenticeship learning

[Diagram repeated: the teacher's flight and the autonomous flights (s0, a0, s1, a1, …) are used to learn Psa and R; Reinforcement Learning then produces a Controller π.]

Algorithm Idea
  • Input to algorithm: approximate model.
  • Start by computing the optimal controller according to the model.

[Figure: the target trajectory vs. the real-life trajectory flown by the model-based controller.]
Algorithm Idea (2)
  • Update the model such that it becomes exact for the current controller.
[Video: first trial (model-based controller) vs. after learning (10 iterations).]

Performance guarantee intuition
  • Intuition by example:
    • Let the reward be a linear combination of two features: R(s) = w1·φ1(s) + w2·φ2(s).
    • If the returned controller π matches the teacher's expected accumulated feature values E[Σt φ1(st)] and E[Σt φ2(st)],
    • then no matter what the values of w1 and w2 are, the controller π performs as well as the teacher's controller π*.
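
The one-line argument behind this intuition, in the feature-expectation notation μ(π) = E[Σt φ(st) | π] (notation assumed here): matching feature expectations up to ε guarantees near-identical expected reward for every bounded weight vector.

```latex
\Bigl|\, E\bigl[\textstyle\sum_t R(s_t) \mid \pi\bigr] - E\bigl[\textstyle\sum_t R(s_t) \mid \pi^*\bigr] \Bigr|
= \bigl| w^\top \mu(\pi) - w^\top \mu(\pi^*) \bigr|
\;\le\; \|w\|_2 \,\|\mu(\pi) - \mu(\pi^*)\|_2
\;\le\; \epsilon
\quad \text{if } \|w\|_2 \le 1 \text{ and } \|\mu(\pi) - \mu(\pi^*)\|_2 \le \epsilon .
```
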
Summary

[Diagram: the teacher's (human pilot) flight and the autonomous flights (a1, s1, a2, s2, a3, s3, …) are used to learn and improve the Dynamics Model Psa and to learn the Reward Function R; Reinforcement Learning then produces a Controller π.]

  • When given a demonstration:
    • Automatically learn the reward function, rather than (time-consumingly) hand-engineer it.
    • Unlike exploration methods, our algorithm concentrates on the task of interest, and always tries to fly as well as possible.
    • High-performance control with a crude model + a small number of trials.

Reward: Intended trajectory
  • Perfect demonstrations are extremely hard to obtain.
  • Multiple trajectory demonstrations:
    • Every demonstration is a noisy instantiation of the intended trajectory.
    • Noise model captures (among others):
      • Position drift.
      • Time warping.
  • If different demonstrations are suboptimal in different ways, they can capture the “intended” trajectory implicitly.
  • [Related work: Atkeson & Schaal, 1997.]
Outline
  • Preliminaries: reinforcement learning.
  • Apprenticeship learning algorithms.
  • Experimental results on various robotic platforms.
Reinforcement learning (RL)

[Diagram: starting from state s0, taking action a0 leads (via the system dynamics Psa) to state s1, then action a1 leads to s2, and so on through sT-1 and aT-1 to sT; the accumulated reward is R(s0) + R(s1) + R(s2) + … + R(sT-1) + R(sT).]

Goal: pick actions over time so as to maximize the expected score: E[R(s0) + R(s1) + … + R(sT)].

Solution: a controller π which specifies an action for each possible state, for all times t = 0, 1, …, T-1.
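
For a concrete (toy) instance of this setup, the sketch below computes such a controller π for a small finite-horizon MDP by backward dynamic programming; the random transition matrix and rewards are placeholders, not the helicopter dynamics.

```python
import numpy as np

# Small finite-horizon MDP: S states, A actions, horizon T.
# P[a, s, s'] = Psa(s' | s, a); R[s] = reward for being in state s.
rng = np.random.default_rng(0)
S, A, T = 5, 3, 10
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random(S)

# Backward dynamic programming: V[t, s] = best expected R(s_t) + ... + R(s_T) from s at time t.
V = np.zeros((T + 1, S))
pi = np.zeros((T, S), dtype=int)        # pi[t, s] = action the controller takes
V[T] = R
for t in range(T - 1, -1, -1):
    Q = R[None, :] + P @ V[t + 1]       # Q[a, s] = R(s) + E[V_{t+1}(s') | s, a]
    pi[t] = Q.argmax(axis=0)
    V[t] = Q.max(axis=0)

print("Expected score from state 0:", V[0, 0])
```
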

Model-based reinforcement learning

[Diagram: run the reinforcement learning algorithm in the simulator (the learned dynamics model) to obtain the controller π.]

Apprenticeship learning for the dynamics model
  • Algorithms such as E3 (Kearns and Singh, 2002) learn the dynamics by using exploration policies, which are dangerous/impractical for many systems.
  • Our algorithm:
    • Initializes the model from a demonstration.
    • Repeatedly executes "exploitation policies" that try to maximize rewards.
    • Provably achieves near-optimal performance (compared to the teacher).
  • Machine learning theory:
    • Complicated non-IID sample-generating process.
    • Standard learning theory bounds are not applicable.
    • The proof uses a martingale construction over relative losses.

[ICML 2005]

Non-stationary maneuvers
  • Modeling is extremely complex:
    • Our dynamics model state: position, orientation, velocity, angular rate.
    • True state: air (!), head speed, servos, deformation, etc.
  • Key observation:
    • In the vicinity of a specific point along a specific trajectory, these unknown state variables tend to take on similar values.
Local model learning algorithm

1. Time-align the trajectories.

2. Learn locally weighted models in the vicinity of the trajectory, weighting data at time t′ by its temporal distance from the query time t: W(t′) = exp(−(t − t′)²/σ²).
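
A minimal sketch of fitting one such local model at a single time index t by weighted least squares; the linear model form, the array shapes, and the bandwidth sigma are illustrative assumptions.

```python
import numpy as np

def local_model(states, actions, t, sigma=5.0):
    """Fit s_{t'+1} ~ A s_{t'} + B a_{t'} by weighted least squares, with
    Gaussian time weights W(t') = exp(-(t - t')^2 / sigma^2) centered at t."""
    T = len(actions)
    tprime = np.arange(T)
    w = np.exp(-((t - tprime) ** 2) / sigma ** 2)
    X = np.hstack([states[:T], actions])            # inputs  [s_t', a_t']
    Y = states[1:T + 1]                             # targets s_{t'+1}
    WX = X * w[:, None]
    theta, *_ = np.linalg.lstsq(WX.T @ X, WX.T @ Y, rcond=None)  # weighted normal equations
    return theta                                    # stacked [A; B]

# Hypothetical time-aligned trajectory data.
rng = np.random.default_rng(0)
states = rng.standard_normal((101, 4))
actions = rng.standard_normal((100, 2))
theta_t = local_model(states, actions, t=50)
```
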

Algorithm Idea w/Teacher
  • Input to algorithm:
    • Teacher demonstration.
    • Approximate model.

[Figure: the teacher's trajectory vs. the trajectory predicted by the simulator/model for the same inputs.]

[ICML 2006]

Algorithm Idea w/Teacher (2)
  • Update the model such that it becomes exact for the demonstration.
  • The updated model perfectly predicts the state sequence obtained during the demonstration.
  • We can use the updated model to find a feedback controller.
Algorithm w/Teacher
  • Record the teacher's demonstration s0, s1, …
  • Update the (crude) model/simulator to be exact for the teacher's demonstration by adding appropriate time-dependent bias terms for each time step (sketched below).
  • Return the policy π that is optimal according to the updated model/simulator.
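
Computing those bias terms is a one-liner per time step: each bias is simply the discrepancy between the observed next state and the crude model's prediction. The crude model and the random demonstration below are hypothetical stand-ins.

```python
import numpy as np

def time_biases(crude_model, demo_states, demo_actions):
    """Bias b_t chosen so that crude_model(s_t, a_t) + b_t reproduces the
    teacher's next state exactly, for every step of the demonstration."""
    return [demo_states[t + 1] - crude_model(demo_states[t], demo_actions[t])
            for t in range(len(demo_actions))]

# Hypothetical crude model and recorded teacher demonstration.
crude_model = lambda s, a: s + 0.08 * a
rng = np.random.default_rng(0)
demo_states = rng.standard_normal((21, 4))
demo_actions = rng.standard_normal((20, 4))
biases = time_biases(crude_model, demo_states, demo_actions)
```
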
Algorithm [iterative]
  • Record the teacher's demonstration s0, s1, …
  • Update the (crude) model/simulator to be exact for the teacher's demonstration by adding appropriate time-dependent bias terms for each time step.
  • Find the policy π that is optimal according to the updated model/simulator.
  • Execute the policy π and record the state trajectory.
  • Update the (crude) model/simulator to be exact along the trajectory obtained with the policy π.
  • Go to step 3.
    • Related work: iterative learning control (ILC).
Algorithm
  • Find the (locally) optimal policy π for the model.
  • Execute the current policy π and record the state trajectory.
  • Update the model such that the new model is exact for the current policy π.
  • Use the new model to compute the policy gradient and update the policy parameters: θ := θ + α·(policy gradient).
  • Go back to Step 2.

Notes:

    • The step-size parameter α is determined by a line search.
    • Instead of the policy gradient, any algorithm that provides a local policy improvement direction can be used. In our experiments we used differential dynamic programming.
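
The toy sketch below runs this loop end-to-end with a linear policy, a deliberately inaccurate model, and a numerical policy gradient; every ingredient (the dynamics, the cost, the fixed step size in place of a line search) is an illustrative assumption rather than the DDP-based setup from the talk.

```python
import numpy as np

# Hypothetical stand-ins: a crude model, the "real" system, and a linear policy.
def crude_model(s, a): return s + 0.08 * a            # inaccurate simulator
def real_system(s, a): return s + 0.10 * a            # stands in for the physical system
def policy(theta, s):  return theta @ s               # linear policy a = theta s

def rollout(step_fn, theta, s0, T):
    traj = [s0]
    for t in range(T):
        traj.append(step_fn(traj[-1], policy(theta, traj[-1])))
    return traj

def utility(traj):
    return -sum(np.sum(s ** 2) for s in traj)          # reward: drive the state to the origin

s0, T, alpha = np.ones(2), 20, 0.01
theta = np.zeros((2, 2))
for it in range(10):
    # Steps 1-2: execute the current policy on the real system; record the trajectory.
    real_traj = rollout(real_system, theta, s0, T)
    # Step 3: make the model exact along this trajectory via time-indexed bias terms.
    bias = [real_traj[t + 1] - crude_model(real_traj[t], policy(theta, real_traj[t]))
            for t in range(T)]
    def corrected_rollout(th):
        traj = [s0]
        for t in range(T):
            traj.append(crude_model(traj[-1], policy(th, traj[-1])) + bias[t])
        return traj
    # Step 4: policy-gradient step on the corrected model (numerical gradient for brevity).
    def U(th_flat):
        return utility(corrected_rollout(th_flat.reshape(2, 2)))
    eps, flat = 1e-4, theta.ravel()
    grad = np.array([(U(flat + eps * e) - U(flat - eps * e)) / (2 * eps)
                     for e in np.eye(flat.size)])
    theta = (flat + alpha * grad).reshape(2, 2)         # (a line search would set alpha)
```
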
Acknowledgments
  • J. Zico Kolter, Andrew Y. Ng
  • Adam Coates, Morgan Quigley, Andrew Y. Ng
  • Andrew Y. Ng
  • Morgan Quigley, Andrew Y. Ng
Teacher demonstration for quadruped
  • Full teacher demonstration = sequence of footsteps.
  • Much simpler to “teach hierarchically”:
    • Specify a body path.
    • Specify best footstep in a small area.
Hierarchical inverse RL
  • Quadratic programming problem (QP):
    • quadratic objective, linear constraints.
  • Constraint generation for path constraints.
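
As a schematic of what such a QP looks like (an illustrative structured-margin form, not the exact constraints from the hierarchical apprenticeship learning paper): the weights w are chosen so that, at each labeled location i, the teacher's footstep has lower cost than every alternative footstep f in the surrounding region, up to a slack ξi.

```latex
\min_{w,\;\xi \ge 0} \;\; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad
w^\top \phi(\text{teacher's footstep}_i) \;\le\; w^\top \phi(f) - 1 + \xi_i
\quad \text{for all alternative footsteps } f \text{ near } i .
```
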
Experimental setup
  • Training:
    • Have quadruped walk straight across a fairly simple board with fixed-spaced foot placements.
    • Around each foot placement: label the best foot placement. (about 20 labels)
    • Label the best body-path for the training board.
  • Use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels.
  • Test on hold-out terrains:
    • Plan a path across the test-board.
Helicopter Flight
  • Task:
    • Hover at a specific point.
    • Initial state: tens of meters away from target.
  • Reward function trades off:
    • Position accuracy,
    • Orientation accuracy,
    • Zero velocity,
    • Zero angular rate,
    • … (11 features total)
More driving examples

[Videos: pairs of driving demonstration (left) and learned behavior (right).]

In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching the demonstration.