
PEGASUS: A policy search method for large MDP’s and POMDP’s

PEGASUS: A policy search method for large MDP’s and POMDP’s. Andrew Ng, Michael Jordan Presented by: Geoff Levine. Motivation. For large, complicated domains, estimation of value functions/Q-functions can take a long time.


Presentation Transcript


  1. PEGASUS: A policy search method for large MDP’s and POMDP’s Andrew Ng, Michael Jordan Presented by: Geoff Levine

  2. Motivation • For large, complicated domains, estimation of value functions/Q-functions can take a long time. • However, there often exist far simpler policies than the optimal that perform nearly as well. • Can directly search through a policy space

  3. Preliminaries • MDP – M = (S, D, A, {Psa(.)}, γ, R) • S – set of states • D – initial state distribution • A – set of actions • Psa(.) : S -> [0,1] – transition probabilities • γ – discount factor • R – deterministic rewards (function of state)

  4. Policies • Policy π : S -> A • Value function V^π : S -> Reals, V^π(s) = R(s) + γ E_{s’ ~ P(s, π(s))}[V^π(s’)] • For convenience, also define: V(π) = E_{s0 ~ D}[V^π(s0)]

  5. Application Domain • Helicopter Flight (Hovering in Place) • 12-d continuous state space ([0,1]^12) • (x, y, z, pitch, roll, yaw, x’, y’, z’, pitch’, roll’, yaw’) • 4-d continuous action space ([0,1]^4) • (front/back cyclic pitch control, left/right cyclic pitch control, main rotor pitch control, tail rotor pitch control) • Timesteps correspond to 1/50th of a second • γ = .9995 • R(s) = -(a(x-x*)^2 + b(y-y*)^2 + c(z-z*)^2 + (yaw-yaw*)^2)
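The hovering reward on this slide can be written out as a short function. This is only a sketch: the function name, the argument layout, and the default weights are assumptions for illustration, since the slides do not give values for a, b and c.

```python
def hover_reward(state, target, a=1.0, b=1.0, c=1.0):
    """Negative weighted squared distance from the hover target.

    state and target are (x, y, z, yaw) tuples. The weights a, b, c are
    placeholders; the slides do not state their values.
    """
    x, y, z, yaw = state
    tx, ty, tz, tyaw = target
    return -(a * (x - tx) ** 2 + b * (y - ty) ** 2
             + c * (z - tz) ** 2 + (yaw - tyaw) ** 2)
```

The reward is zero exactly at the target pose and negative everywhere else, so maximizing discounted reward pushes the helicopter back toward the hover point.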

  6. Helicopter

  7. Transformation of MDP’s • Given M = (S, D, A, {Psa(.)}, γ, R) we construct M’ = (S’, D’, A, {P’sa(.)}, γ, R’), an MDP with deterministic state transitions • Intuition: Instead of rolling the dice when we move from state to state, we will roll all the dice we need ahead of time, and store their results as part of our state.

  8. Parcheesi … …

  9. Deterministic Simulative Model • Assume we have a deterministic functional representation of our MDP transitions • g : S x A x [0,1]^dp -> S such that if p is distributed uniformly in [0,1]^dp, then Pr_p[g(s, a, p) = s’] = Psa(s’). • More powerful than a generative model.
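For a finite MDP, one way to obtain such a g (with dp = 1) is inverse-CDF sampling: walk the cumulative transition distribution until it exceeds p. The sketch below assumes a hypothetical table format P[s][a] = list of (next_state, probability) pairs and a made-up two-state chain; neither comes from the slides.

```python
def make_simulative_model(P):
    """Build g(s, a, p) from a table P[s][a] = [(next_state, prob), ...].

    g maps a uniform random number p in [0, 1) to a next state by walking
    the cumulative distribution, so Pr_p[g(s, a, p) = s'] = P_sa(s').
    """
    def g(s, a, p):
        cumulative = 0.0
        for next_state, prob in P[s][a]:
            cumulative += prob
            if p < cumulative:
                return next_state
        return P[s][a][-1][0]  # guard against floating-point round-off
    return g


# Hypothetical two-state chain, used only for illustration.
P = {0: {"go": [(0, 0.3), (1, 0.7)]},
     1: {"go": [(1, 1.0)]}}
g = make_simulative_model(P)
```

Calling g twice with the same (s, a, p) always gives the same next state; all randomness lives in p.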

  10. Transformations of MDP’s • S’ = S x [0,1]^∞ • D’ – (s, p1, p2, p3, …) such that s ~ D, and the pi’s are drawn iid from Uniform[0,1] • P’ta(t’) = 1 if g(s, a, p1) = s’, 0 otherwise (taking dp = 1) • R’(t) = R(s) • Here t = (s, p1, p2, p3, …) and t’ = (s’, p2, p3, …)

  11. Policies • Given a policy space Π for S, consider a corresponding policy space Π’ for S’, s.t. • ∀ π ∈ Π, ∃ π’ ∈ Π’, ∀ s ∈ S, ∀ p1, p2, … : π’((s, p1, p2, p3, …)) = π(s) • As the transition probabilities and rewards are equivalent in the transformed MDP: V^π_M(s) = E_{p ~ Unif[0,1]^∞}[V^π’_M’((s, p))] and V_M(π) = V_M’(π’)

  12. Policy Search • V^π_M(s0) = R(s0) + γ E_{s’ ~ P(s0, π(s0))}[V^π_M(s’)] • V^π’_M’((s0, p1, p2, …)) = R(s0) + γ R(s1) + γ^2 R(s2) + … • s1 = g(s0, π’(s0), p1), s2 = g(s1, π’(s1), p2), … • As V_M(π) = V_M’(π’), we can estimate V_M(π) = E_{t0 ~ D’}[V^π’_M’(t0)]
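Because the transformed MDP is deterministic, the value of a policy from one transformed state is a plain discounted sum along a single rollout. A minimal sketch, assuming a finite prefix of dice rolls and hypothetical callables `policy`, `g` and `reward` supplied by the caller:

```python
def scenario_return(s0, dice, policy, g, reward, gamma):
    """Discounted return of `policy` along one scenario (s0, p1, p2, ...).

    `dice` is the finite prefix of pre-drawn uniform numbers. The rollout
    consumes one per step, so the result is a deterministic function of
    the scenario and the policy.
    """
    total, discount, s = reward(s0), 1.0, s0
    for p in dice:
        s = g(s, policy(s), p)      # s_{t+1} = g(s_t, pi(s_t), p_{t+1})
        discount *= gamma
        total += discount * reward(s)
    return total
```

Re-running the function with the same scenario and policy reproduces the same number, which is what lets policies be compared on equal footing.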

  13. PEGASUS: Policy Evaluation-of-Goodness and Search Using Scenarios • Draw a sample of m initial states (scenarios) {s0(1), s0(2), s0(3), …, s0(m)} iid from D’ • Estimate V_M(π) by the sample mean over the scenarios: (1/m) Σ_{i=1..m} V^π’_M’(s0(i))

  14. PEGASUS • Given {s0(1), s0(2), s0(3), …, s0(m)}, the estimate of V_M(π) is a deterministic function of π • The discounted reward sum for each scenario is infinite, but we can truncate it after Hε = log_γ(ε(1-γ)/(2Rmax)) steps, introducing at most ε/2 error. Also, this allows us to store our “dice rolls” in finite space.
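The truncation horizon and the scenario-based estimate can be sketched together. Everything named here (`horizon`, `draw_scenarios`, `pegasus_estimate`, and the caller-supplied `sample_initial_state`, `g`, `reward`) is a hypothetical helper for illustration, not code from the paper; dp = 1 is assumed, so each step uses a single dice roll.

```python
import math
import random


def horizon(eps, gamma, r_max):
    """H_eps = log_gamma(eps * (1 - gamma) / (2 * r_max)), rounded up."""
    return math.ceil(math.log(eps * (1 - gamma) / (2 * r_max), gamma))


def draw_scenarios(m, h, sample_initial_state, rng):
    """Draw m scenarios: an initial state plus h uniform dice rolls each."""
    return [(sample_initial_state(rng), [rng.random() for _ in range(h)])
            for _ in range(m)]


def pegasus_estimate(policy, scenarios, g, reward, gamma):
    """Average truncated discounted return over the fixed scenarios."""
    total = 0.0
    for s0, dice in scenarios:
        ret, discount, s = reward(s0), 1.0, s0
        for p in dice:
            s = g(s, policy(s), p)
            discount *= gamma
            ret += discount * reward(s)
        total += ret
    return total / len(scenarios)
```

The scenarios are drawn once and reused for every policy, so the estimate never changes between calls for the same policy. Note that with the helicopter's γ = .9995 the effective horizon 1/(1-γ) is already 2000 timesteps, i.e. 40 seconds at 50 steps per second, so the stored dice rolls per scenario are finite but not small.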

  15. PEGASUS • Given the deterministic function V_M’(π), we can use an optimization technique to find argmax_π V_M’(π). • If working in a continuous, smooth, differentiable domain, we can use gradient ascent • If R is discontinuous, may need to use “continuation” methods to smooth it out
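Since the estimate is an ordinary deterministic function of the policy parameters, any standard optimizer applies. The sketch below uses a simple random-walk search on a made-up 1-D system (state s, action a = -k*s, reward -s^2); the system, the single parameter k, and all constants are assumptions chosen only to keep the example small.

```python
import random


def make_estimate(m=20, h=30, gamma=0.9, seed=0):
    """Return V_hat(k) for a hypothetical 1-D system, with scenarios fixed."""
    rng = random.Random(seed)
    scenarios = [(rng.uniform(-1, 1), [rng.random() for _ in range(h)])
                 for _ in range(m)]

    def v_hat(k):
        total = 0.0
        for s0, dice in scenarios:
            s, discount, ret = s0, 1.0, -(s0 ** 2)
            for p in dice:
                s = s - k * s + 0.1 * (p - 0.5)   # g(s, a, p) with a = -k*s
                discount *= gamma
                ret += discount * -(s ** 2)
            total += ret
        return total / m

    return v_hat


def random_walk_search(v_hat, k0=0.0, step=0.1, iters=200, seed=1):
    """Keep a random perturbation of k only when the estimate improves."""
    rng = random.Random(seed)
    best_k, best_v = k0, v_hat(k0)
    for _ in range(iters):
        cand = best_k + rng.uniform(-step, step)
        v = v_hat(cand)
        if v > best_v:
            best_k, best_v = cand, v
    return best_k, best_v
```

Because the scenarios are frozen inside `v_hat`, an accepted step reflects a real improvement on those scenarios rather than a lucky draw of noise, which is the point of fixing the dice before searching.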

  16. Results • On 5x5 Gridworld POMDP, discovers near optimal policy in very few scenarios (~5) • On continuous space/action bicycle riding problem, results near optimal and far better than earlier reward shaping methods.

  17. Helicopter Hovering • Policy represented by a hand-crafted neural network. • PEGASUS used to search through set of possible ANN weights. • Tried both gradient ascent and random walk searches

  18. Neural Network Structure • (x, y, z) = (forward, sideways, down) • a1 = front/back cyclic pitch control • a2 = left/right cyclic pitch control • a3 = main rotor pitch control • a4 = tail rotor pitch control

  19. Results • Able to keep helicopter stable on its maiden flight. Hovering • Neural network modified to fly competition class maneuvers Triangle • Finally, hovering upside down accomplished • http://ai.stanford.edu/~ang/rl-videos/helicopter/

  20. Pseudo-Dimension • H set of functions X -> Reals • H shatters x1, x2, …, xd ∈ X if there exists a sequence of real numbers t1, t2, …, td s.t. {(h(x1) – t1, h(x2) – t2, …, h(xd) – td) | h ∈ H} intersects all 2^d orthants of R^d • The pseudo-dimension of H (dimP(H)) is the size of the largest set shattered by H

  21. Lipschitz Continuity • A function f is Lipschitz continuous with Lipschitz bound B if ||f(x) – f(y)|| <= B||x – y|| (with respect to Euclidean norm on range and domain)

  22. Realizable Dynamics in an MDP • Let S = [0, 1]^ds, g : S x A x [0, 1]^dp -> S be given. • We can define Fi as a set of functions {Fia : S x [0, 1]^dp -> [0, 1], Fia(s, p1, …, pdp) = Ii(g(s, a, p1, …, pdp)) | a ∈ A}, where Ii(x) returns the ith coordinate of x

  23. PEGASUS Theoretical Result • Let S = [0, 1]^ds, policy class Π, and model g : S x A x [0, 1]^dp -> S be given. • F is the family of realizable dynamics in the MDP and Fi the resulting family of coordinate functions. For all i, let dimP(Fi) <= d, and let Fi be uniformly Lipschitz continuous with bound B • Reward function R is Lipschitz continuous with bound BR. • Then if: with probability at least 1 – δ, the PEGASUS estimate V’(π) will be uniformly close to the actual value: |V’(π) – V(π)| <= ε

  24. Proof (1) • Think of the reward at step i as a random variable: V^π(s0(1)) = R(s0(1)) + γ R(s1(1)) + γ^2 R(s2(1)) + … ; V^π(s0(2)) = R(s0(2)) + γ R(s1(2)) + γ^2 R(s2(2)) + … ; V^π(s0(3)) = R(s0(3)) + γ R(s1(3)) + γ^2 R(s2(3)) + … • By bounding properties of each R(si(j)), we can prove uniform convergence for V(π)

  25. Proof (2) • Calling on work by Haussler, we show that if the pseudo-dimension of each Fi, dimP(Fi) <= d, we can “nearly” represent our world dynamics functions Fia by a smaller set of functions of size

  26. Proof (3) • Similarly if Fi uniformly has Lipschitz bound B, and the Reward function R has Lipschitz bound BR, we can “nearly” represent a function mapping from scenarios to ith step rewards by a set of size

  27. Proof (4) • A result by Haussler then shows that with probability 1 – δ, our ith step reward will be ε-close to the mean if we select a number of scenarios bounded by

  28. Proof (5) • Strengthening the bound to account for all Hε rewards and employing the Union bound, we find that a number of scenarios bounded by is sufficient.

  29. Critique • Success limited to very small fairly linear control problem, with high frequency controller • Lots of human bias incorporated into system • Restrictions/Linear Regression for model identification • Structure of neural net for each of the tasks • PAC learning guarantees still out of reach • No theoretical bounds on final policy

  30. Bibliography • Haussler, D. Chapter on PAC learning model, and decision-theoretic generalizations, with applications to neural nets. From Mathematical Perspectives on Neural Networks, Lawrence Erlbaum Associates, 1995; Information and Computation, Vol. 100, September 1992, pp. 78-150. • Ng, A. Y., Jordan, M. I. PEGASUS: A policy search method for large MDP’s and POMDP’s. In Uncertainty in Artificial Intelligence, Sixteenth Conference, 2000. • Ng, A. Y., Kim, H. J., Jordan, M. I., & Sastry, S. Autonomous helicopter flight via reinforcement learning. Advances in Neural Information Processing Systems 16, 2004. • Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E. Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics, 2004.

  31. Application – Helicopter Flight • PEGASUS has been used to derive policies for hovering in place. • Later generalized to handle slow motion maneuvers and upside down hovering. • GPS system relays state information (position and velocity) to an off-board computer, which calculates a 4-dimensional action

  32. Model Identification • Construction of an MDP representation of the world dynamics • Transition Dynamics learned from several minutes of data based on human flight • Fit using linear regression • Forced to respect innate properties of the domain (gravity, symmetry)
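To give a feel for fitting dynamics by linear regression, here is a deliberately tiny sketch for a scalar state s and scalar action u, solving the 2x2 normal equations for s’ ≈ a*s + b*u. The scalar setup and the function name are assumptions for illustration; the helicopter model itself has 12 state and 4 action dimensions plus the domain constraints listed above, none of which are reproduced here.

```python
def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of s' ~ a*s + b*u for scalar state s and action u."""
    ss = sum(s * s for s in states)
    su = sum(s * u for s, u in zip(states, actions))
    uu = sum(u * u for u in actions)
    sn = sum(s * n for s, n in zip(states, next_states))
    un = sum(u * n for u, n in zip(actions, next_states))
    det = ss * uu - su * su
    if det == 0:
        raise ValueError("states and actions are collinear; fit is not unique")
    # Cramer's rule on the normal equations.
    a = (sn * uu - un * su) / det
    b = (un * ss - sn * su) / det
    return a, b
```

Given logged (state, action, next state) triples from piloted flight, a fit of this kind yields the coefficients of a linear transition model; adding a noise term driven by the dice rolls would turn it into a simulative model g.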
