
Apprenticeship learning for robotic control



Presentation Transcript


  1. Apprenticeship learning for robotic control Pieter Abbeel Stanford University Joint work with Andrew Y. Ng, Adam Coates, Morgan Quigley.

  2. [Diagram: the dynamics model P_sa and the reward function R feed into reinforcement learning, which outputs a control policy π; “This talk” labels the part of this pipeline covered here.] Recurring theme: Apprenticeship learning.

  3. Motivation • In practice, reward functions are hard to specify, and people tend to tweak them a lot. • Motivating example: helicopter tasks, e.g. a flip. • Another motivating example: highway driving.

  4. Apprenticeship Learning • Learning from observing an expert. • Previous work: • Learn to predict the expert’s actions as a function of states. • Usually lacks strong performance guarantees. • (E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; …) • Our approach: • Based on inverse reinforcement learning (Ng & Russell, 2000). • Returns a policy with performance as good as the expert’s, as measured according to the expert’s unknown reward function. • [Most closely related work: Ratliff et al., 2005, 2006.]

  5. Algorithm • For t = 1, 2, … • Inverse RL step: • Estimate the expert’s reward function R(s) = wᵀφ(s) such that under R(s) the expert performs better than all previously found policies {π_i}. • RL step: • Compute the optimal policy π_t for the estimated reward w. [Abbeel & Ng, 2004]
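A minimal Python sketch of this loop, assuming hypothetical helpers `solve_mdp`, `rollout`, `estimate_feature_expectations`, and `irl_step` (none of which appear in the original slides; see the IRL-step sketch below):

```python
import numpy as np

# Hypothetical helpers, not from the slides: solve_mdp(mdp, reward_weights),
# rollout(mdp, policy), estimate_feature_expectations(trajectories, phi),
# and irl_step(mu_expert, mus).

def apprenticeship_learning(mdp, expert_trajectories, phi, n_iters=20, eps=1e-2):
    """Sketch of the main apprenticeship-learning loop (Abbeel & Ng, 2004)."""
    mu_expert = estimate_feature_expectations(expert_trajectories, phi)
    k = len(mu_expert)

    # Start from an arbitrary initial policy so the IRL step has
    # something to compare the expert against.
    pi = solve_mdp(mdp, reward_weights=np.zeros(k))
    policies = [pi]
    mus = [estimate_feature_expectations(rollout(mdp, pi), phi)]
    w = np.zeros(k)

    for t in range(n_iters):
        # Inverse RL step: reward weights under which the expert
        # outperforms every previously found policy pi_i.
        w, margin = irl_step(mu_expert, mus)
        if margin <= eps:            # expert's advantage is small -> stop
            break
        # RL step: optimal policy for the estimated reward R(s) = w^T phi(s).
        pi = solve_mdp(mdp, reward_weights=w)
        policies.append(pi)
        mus.append(estimate_feature_expectations(rollout(mdp, pi), phi))

    return policies, w
```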

  6. Algorithm: IRL step • Maximize γ over w: ||w||₂ ≤ 1 • s.t. U_w(π_E) ≥ U_w(π_i) + γ, i = 1, …, t-1 • γ = margin of the expert’s performance over the performance of previously found policies. • U_w(π) = E[Σ_{t=1}^T R(s_t) | π] = E[Σ_{t=1}^T wᵀφ(s_t) | π] = wᵀ E[Σ_{t=1}^T φ(s_t) | π] = wᵀ μ(π) • μ(π) = E[Σ_{t=1}^T φ(s_t) | π] are the “feature expectations”.
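One way to implement the IRL step above is as the small convex program it describes; the sketch below uses numpy and cvxpy as the solver (a choice of convenience, not something the slides prescribe), together with a Monte Carlo estimate of the feature expectations:

```python
import cvxpy as cp
import numpy as np

def irl_step(mu_expert, mu_policies):
    """Max-margin IRL step: maximize gamma over w with ||w||_2 <= 1,
    subject to w^T mu(pi_E) >= w^T mu(pi_i) + gamma for all previous pi_i.

    Assumes at least one previously found policy; with none, the
    problem is unbounded."""
    k = len(mu_expert)
    w = cp.Variable(k)
    gamma = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ (np.asarray(mu_expert) - np.asarray(mu_i)) >= gamma
                    for mu_i in mu_policies]
    cp.Problem(cp.Maximize(gamma), constraints).solve()
    return w.value, gamma.value

def estimate_feature_expectations(trajectories, phi):
    """Monte Carlo estimate of mu(pi) = E[sum_{t=1}^T phi(s_t) | pi]
    from sampled state trajectories."""
    return np.mean([np.sum([phi(s) for s in traj], axis=0)
                    for traj in trajectories], axis=0)
```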

  7. Feature Expectation Closeness and Performance • If we can find a policy π such that • ||μ(π_E) - μ(π)||₂ ≤ ε, • then for any underlying reward R*(s) = w*ᵀφ(s) with ||w*||₂ ≤ 1, • we have that • |U_{w*}(π_E) - U_{w*}(π)| = |w*ᵀ μ(π_E) - w*ᵀ μ(π)| • ≤ ||w*||₂ ||μ(π_E) - μ(π)||₂ (by Cauchy–Schwarz) • ≤ ε.

  8. Theoretical Results: Convergence • Theorem. Let an MDP (without reward function), a k-dimensional feature vector φ and the expert’s feature expectations μ(π_E) be given. Then after at most • kT²/ε² • iterations, the algorithm outputs a policy π that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*ᵀφ(s), i.e., • U_{w*}(π) ≥ U_{w*}(π_E) - ε.

  9. Case study: Highway driving Input: Driving demonstration Output: Learned behavior The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.

  10. More driving examples Driving demonstration Learned behavior Driving demonstration Learned behavior In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching the demonstration.

  11. Inverse reinforcement learning summary • Our algorithm returns a policy with performance as good as the expert’s, as evaluated according to the expert’s unknown reward function. • The algorithm is guaranteed to converge in poly(k, 1/ε) iterations. • The algorithm exploits reward “simplicity” (vs. policy “simplicity” in previous approaches).

  12. The dynamics model • [Diagram: the dynamics model P_sa and the reward function R feed into reinforcement learning, which outputs a control policy π; this part of the talk focuses on the dynamics model.]

  13. Collecting data to learn the dynamics model

  14. Learning the dynamics model P_sa from data • [Diagram: the dynamics model P_sa and the reward function R feed into reinforcement learning, which outputs a control policy π.] • Estimate P_sa from data. • For example, in discrete-state problems, estimate P_sa(s') as the fraction of times you transitioned to state s' after taking action a in state s. • Challenge: collecting enough data to guarantee that you can model the entire flight envelope.
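For the discrete-state case described above, a count-based estimate of P_sa might look like the following sketch (illustrative, not code from the talk):

```python
import numpy as np

def estimate_transition_model(transitions, n_states, n_actions):
    """Estimate P_sa(s') as the empirical fraction of times the system moved
    to s' after taking action a in state s.

    `transitions` is an iterable of (s, a, s_next) index triples.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs are left uniform; any other default would also do.
    P = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
    return P  # P[s, a, s'] = estimated probability of s' given (s, a)
```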

  15. Collecting data to learn the dynamics model • State of the art: the E³ algorithm (Kearns and Singh, 2002). • [Flowchart: Have a good model of the dynamics? If YES, “exploit”; if NO, “explore”.]

  16. Aggressive exploration (Manual flight) Aggressively exploring the edges of the flight envelope isn’t always a good idea.

  17. Learning the dynamics • [Diagram: an expert human pilot flight produces a trajectory (a1, s1, a2, s2, a3, s3, …) from which P_sa is learned; the dynamics model P_sa together with the reward function R feeds reinforcement learning, which outputs a control policy π; autonomous flight under that policy produces another trajectory (a1, s1, a2, s2, a3, s3, …) from which P_sa is re-learned.]

  18. Apprenticeship learning of the model • Theorem. Suppose we obtain m = O(poly(S, A, T, 1/ε)) examples from a human expert demonstrating the task. Then after a polynomial number k of iterations of testing/re-learning, with high probability, we will obtain a policy π whose performance is comparable to the expert’s: • U(π) ≥ U(π_E) - ε • Thus, so long as a demonstration is available, it isn’t necessary to explicitly explore. In practice, k = 1 or 2 is almost always enough. [Abbeel & Ng, 2005]
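A minimal sketch of the testing/re-learning loop the theorem refers to, with hypothetical placeholders (`fit_dynamics_model`, `solve_mdp`, `fly_and_record`, `performance`) standing in for the real learner, planner, and helicopter:

```python
def apprenticeship_learn_model(expert_trajectories, target_performance, max_iters=5):
    """Learn-the-model loop in the spirit of Abbeel & Ng (2005): no explicit
    exploration, only expert data plus data from executing our own policies."""
    data = list(expert_trajectories)          # (s, a, s') triples flown by the pilot
    policy = None
    for k in range(max_iters):
        P_sa = fit_dynamics_model(data)       # accurate wherever we have data
        policy = solve_mdp(P_sa)              # "exploit" in the current simulator
        trajectory = fly_and_record(policy)   # try the policy on the real system
        data += trajectory                    # failures expose unmodeled regions
        if performance(trajectory) >= target_performance:
            break                             # in practice k = 1 or 2 suffices
    return policy
```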

  19. Proof idea • From the initial pilot demonstrations, our model/simulator P_sa will be accurate for the part of the flight envelope (s, a) visited by the pilot. • Our model/simulator will correctly predict the helicopter’s behavior under the pilot’s policy π_E. • Consequently, there is at least one policy (namely π_E) that looks like it’s able to fly the helicopter in our simulation. • Thus, each time we solve the MDP using the current simulator P_sa, we will find a policy that successfully flies the helicopter according to P_sa. • If, on the actual helicopter, this policy fails to fly the helicopter (despite the model P_sa predicting that it should), then it must be visiting parts of the flight envelope that the model fails to model accurately. • Hence, this gives useful training data for modeling new parts of the flight envelope.

  20. Configurations flown (exploitation only)

  21. Tail-in funnel

  22. Nose-in funnel

  23. In-place rolls

  24. In-place flips

  25. Acknowledgements Andrew Ng, Adam Coates, Morgan Quigley

  26. Thank You!
