
Efficient Policy Gradient Optimization/Learning of Feedback Controllers


Presentation Transcript


  1. Efficient Policy Gradient Optimization/Learning of Feedback Controllers (Chris Atkeson)

  2. Punchlines • Optimize and learn policies. Switch from “value iteration” to “policy iteration”. • This is a big switch from optimizing and learning value functions. • Use gradient-based policy optimization.

  3. Motivations • Efficiently design nonlinear policies • Make policy-gradient reinforcement learning practical.

  4. Model-Based Policy Optimization • Simulate policy u = π(x, p) from some initial states x0 to find policy cost. • Use favorite local or global optimizer to optimize simulated policy cost. • If gradients are used, they are typically numerically estimated. • Δp = -ε ∑_{x0} w(x0) V_p (1st-order gradient) • Δp = -(∑_{x0} w(x0) V_pp)^{-1} ∑_{x0} w(x0) V_p (2nd order)
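
A minimal sketch of this loop, assuming a placeholder linear model, quadratic cost, and linear policy u = -Kx with p = vec(K) (none of these choices come from the slides): simulate the policy from a set of weighted initial states, estimate V_p by central differences, and apply the first-order update Δp = -ε ∑_{x0} w(x0) V_p.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])      # placeholder f_x
B = np.array([[0.0], [0.1]])                # placeholder f_u

def dynamics(x, u):
    return A @ x + B @ u

def cost(x, u):                             # placeholder L(x, u) = x'x + 0.01 u'u
    return float(x @ x + 0.01 * u @ u)

def policy(x, p):                           # u = pi(x, p) = -K x, p = vec(K)
    K = p.reshape(1, 2)
    return -K @ x

def rollout_cost(x0, p, horizon=50):
    """V(x0, p): total cost of simulating the policy from x0."""
    x, total = x0.copy(), 0.0
    for _ in range(horizon):
        u = policy(x, p)
        total += cost(x, u)
        x = dynamics(x, u)
    return total

def numerical_policy_gradient(initial_states, weights, p, eps=1e-5):
    """Central-difference estimate of sum_x0 w(x0) * V_p."""
    grad = np.zeros_like(p)
    for i in range(p.size):
        dp = np.zeros_like(p)
        dp[i] = eps
        for x0, w in zip(initial_states, weights):
            grad[i] += w * (rollout_cost(x0, p + dp) - rollout_cost(x0, p - dp)) / (2 * eps)
    return grad

# First-order update: delta_p = -epsilon * sum_x0 w(x0) V_p
initial_states = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
weights = [0.5, 0.5]
p = np.zeros(2)
for _ in range(200):
    p = p - 1e-3 * numerical_policy_gradient(initial_states, weights, p)
```

Each numerical gradient costs two rollouts per policy parameter, which is what motivates the question on the next slide and the analytic gradients on slide 6.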

  5. Can we make model-based policy gradient more efficient?

  6. Analytic Gradients • Deterministic policy: u = π(x, p) • Policy Iteration (Bellman Equation): V^{k-1}(x, p) = L(x, π(x, p)) + V(f(x, π(x, p)), p) • Linear models: f(x, u) = f_0 + f_x Δx + f_u Δu; L(x, u) = L_0 + L_x Δx + L_u Δu; π(x, p) = π_0 + π_x Δx + π_p Δp; V(x, p) = V_0 + V_x Δx + V_p Δp • Policy Gradient: V_x^{k-1} = L_x + L_u π_x + V_x (f_x + f_u π_x); V_p^{k-1} = (L_u + V_x f_u) π_p + V_p
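
A sketch of this backward recursion on the same placeholder model and linear policy as above: one forward rollout stores the trajectory, then a single backward pass accumulates V_p using the two recursions on the slide, so the gradient cost no longer scales with the number of policy parameters.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])      # f_x (placeholder model)
B = np.array([[0.0], [0.1]])                # f_u
Q = np.eye(2)                               # L = x'Qx + u'Ru
R = np.array([[0.01]])

def analytic_V_p(x0, p, horizon=50):
    """dV(x0, p)/dp for the linear policy u = -K x, p = vec(K)."""
    K = p.reshape(1, 2)
    # Forward pass: simulate and store the trajectory.
    xs = [x0]
    for _ in range(horizon):
        x = xs[-1]
        xs.append(A @ x + B @ (-K @ x))
    # Backward pass (slide 6):
    #   V_p^{k-1} = (L_u + V_x f_u) pi_p + V_p
    #   V_x^{k-1} = L_x + L_u pi_x + V_x (f_x + f_u pi_x)
    V_x = np.zeros((1, 2))                  # no terminal cost in this sketch
    V_p = np.zeros((1, p.size))
    for x in reversed(xs[:-1]):
        u = -K @ x
        L_x = 2.0 * (x @ Q).reshape(1, 2)
        L_u = 2.0 * (u @ R).reshape(1, 1)
        pi_x = -K                           # d pi / d x
        pi_p = -x.reshape(1, 2)             # d pi / d p  for u = -K x, p = vec(K)
        V_p = (L_u + V_x @ B) @ pi_p + V_p  # uses V_x at the *next* state
        V_x = L_x + L_u @ pi_x + V_x @ (A + B @ pi_x)
    return V_p.ravel()
```

For this placeholder problem the result should agree with the central-difference estimate in the previous sketch (up to finite-difference error), at the cost of one rollout plus one backward sweep.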

  7. Handling Constraints • Lagrange multiplier approach, with constraint violation value function.
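
The slide does not spell out the formulation; one plausible reading (an assumption here, not a statement of the author's method) is to carry a second, constraint-violation value function G(x0, p) through the same rollout, compute its parameter gradient G_p with the same backward recursion used for V_p, and update the multipliers by dual ascent:

\[
\min_p \max_{\lambda \ge 0} \sum_{x_0} w(x_0)\,\bigl[ V(x_0,p) + \lambda^{\top} G(x_0,p) \bigr],
\qquad
\Delta p = -\varepsilon \sum_{x_0} w(x_0)\,\bigl[ V_p + \lambda^{\top} G_p \bigr],
\qquad
\Delta \lambda = \eta \sum_{x_0} w(x_0)\, G(x_0,p).
\]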

  8. V_pp: Second Order Models

  9. Regularization
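
The slide gives no details; a common choice in second-order trajectory and policy optimization (an assumption here, not stated on the slide) is Levenberg-Marquardt style damping of the summed Hessian, increased whenever it is not positive definite or the step fails to reduce the cost:

\[
\Delta p = -\Bigl(\sum_{x_0} w(x_0)\, V_{pp} + \mu I\Bigr)^{-1} \sum_{x_0} w(x_0)\, V_p, \qquad \mu > 0 .
\]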

  10. LQBR: Linear (dynamics) Quadratic (cost) Bilinear (policy) Regulator
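
Read together with the conditions on slide 13, LQBR appears to be static output feedback for a linear plant with quadratic cost; a sketch of that setup, with shapes and the F_u symbol assumed here:

\[
x_{k+1} = F_x x_k + F_u u_k, \qquad y_k = C x_k, \qquad
L(x,u) = x^{\top} L_{xx}\, x + u^{\top} L_{uu}\, u, \qquad
u_k = \pi(x_k, p) = -K y_k, \quad p = \mathrm{vec}(K),
\]

so the policy is bilinear: linear in the state (through y = Cx) for fixed parameters, and linear in the parameters for fixed state.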

  11. Timing Test

  12. Antecedents • Optimizing control “parameters” in DDP: Dyer and McReynolds 1970. • Optimal output feedback design (1960s-1970s) • Multiple model adaptive control (MMAC) • Policy gradient reinforcement learning • Adaptive critics, Werbos: HDP, DHP, GDHP, ADHDP, ADDHP

  13. When Will LQBR Work? • Initial stabilizing policy is known (“output stabilizable”) • L_uu is positive definite. • L_xx is positive semidefinite and (L_xx^{1/2}, F_x) is detectable. • Measurement matrix C has full row rank.

  14. Locally Linear Policies

  15. Local Policies (figure: GOAL)

  16. Cost Of One Gradient Calculation

  17. Continuous Time

  18. Other Issues • Model Following • Stochastic Plants • Receding Horizon Control/MPC • Adaptive RHC/MPC • Combine with Dynamic Programming • Dynamic Policies -> Learn State Estimator

  19. Optimize Policies • Policy iteration, with a gradient-based policy improvement step. • Analytic gradients are easy. • Non-overlapping sub-policies make second-order gradient calculations fast (see the sketch below). • Big problem: how to choose the policy structure?
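
On the third bullet, a brief sketch of why non-overlapping sub-policies help (the partitioning itself is assumed here, not detailed in the slides): if each state is handled by exactly one sub-policy with its own parameter block, π_p is nonzero only in that block, so the summed V_pp is block diagonal and the second-order step factors into small independent solves.

```python
import numpy as np

def newton_step_blockwise(V_pp_blocks, V_p_blocks):
    """One second-order step per sub-policy: delta_p_i = -V_pp_i^{-1} V_p_i.

    Solving m blocks of size b costs O(m b^3), instead of the O((m b)^3)
    needed to factor the full, dense Hessian of an overlapping policy.
    """
    return [-np.linalg.solve(H, g) for H, g in zip(V_pp_blocks, V_p_blocks)]
```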
