Efficient Policy Gradient Optimization/Learning of Feedback Controllers

Chris Atkeson


Punchlines

  • Optimize and learn policies.

    Switch from “value iteration” to “policy iteration”.

  • This is a big switch from optimizing and learning value functions.

  • Use gradient-based policy optimization.


Motivations

  • Efficiently design nonlinear policies

  • Make policy-gradient reinforcement learning practical.


Model-Based Policy Optimization

  • Simulate policy u = π(x,p) from some initial states x0 to find policy cost.

  • Use favorite local or global optimizer to optimize simulated policy cost.

  • If gradients are used, they are typically numerically estimated.

  • Δp = -ε ∑_{x0} w(x0) V_p   (1st-order gradient step)

  • Δp = -(∑_{x0} w(x0) V_pp)^{-1} ∑_{x0} w(x0) V_p   (2nd-order step)
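
A minimal sketch of this loop in Python. The dynamics f, per-step cost L, policy pi, initial states x0s, and weights w are hypothetical placeholders for the user's own model; the finite-difference gradient stands in for the "numerically estimated" gradients mentioned above.

    import numpy as np

    def rollout_cost(p, x0, f, L, pi, horizon=100):
        """Simulate the policy u = pi(x, p) from x0 and accumulate the cost."""
        x, cost = x0, 0.0
        for _ in range(horizon):
            u = pi(x, p)
            cost += L(x, u)
            x = f(x, u)
        return cost

    def policy_cost(p, x0s, w, f, L, pi):
        """Weighted policy cost over a set of initial states x0."""
        return sum(wi * rollout_cost(p, x0, f, L, pi) for wi, x0 in zip(w, x0s))

    def numerical_gradient(p, x0s, w, f, L, pi, eps=1e-5):
        """Finite-difference estimate of V_p, the policy-cost gradient."""
        base = policy_cost(p, x0s, w, f, L, pi)
        g = np.zeros(len(p))
        for i in range(len(p)):
            dp = np.zeros(len(p))
            dp[i] = eps
            g[i] = (policy_cost(p + dp, x0s, w, f, L, pi) - base) / eps
        return g

    def first_order_step(p, x0s, w, f, L, pi, step=1e-2):
        """Δp = -ε ∑_{x0} w(x0) V_p, the first-order update above."""
        return p - step * numerical_gradient(p, x0s, w, f, L, pi)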



Analytic Gradients

  • Deterministic policy: u = π(x, p)

  • Policy Iteration (Bellman Equation):

    V^{k-1}(x, p) = L(x, π(x, p)) + V(f(x, π(x, p)), p)

  • Linear models: f(x, u) = f_0 + f_x Δx + f_u Δu

    L(x, u) = L_0 + L_x Δx + L_u Δu

    π(x, p) = π_0 + π_x Δx + π_p Δp

    V(x, p) = V_0 + V_x Δx + V_p Δp

  • Policy Gradient:

    V_x^{k-1} = L_x + L_u π_x + V_x (f_x + f_u π_x)

    V_p^{k-1} = (L_u + V_x f_u) π_p + V_p
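
A sketch of evaluating these gradients with a backward pass along a nominal trajectory, assuming the per-step linearizations (f_x, f_u, L_x, L_u, π_x, π_p) have already been computed; the container format used here is an assumption, not part of the slides.

    import numpy as np

    def backward_policy_gradient(traj):
        """Backward recursion for V_x and V_p along a nominal trajectory.
        traj: list of per-step dicts with keys 'fx', 'fu', 'Lx', 'Lu', 'pix', 'pip'
        (linearizations about the nominal states and controls). Returns V_p."""
        n_x = traj[-1]['fx'].shape[0]
        n_p = traj[-1]['pip'].shape[1]
        Vx = np.zeros(n_x)   # dV/dx at the final step (no terminal cost assumed)
        Vp = np.zeros(n_p)   # dV/dp, accumulated backward
        for step in reversed(traj):
            fx, fu = step['fx'], step['fu']
            Lx, Lu = step['Lx'], step['Lu']
            pix, pip = step['pix'], step['pip']
            # V_p^{k-1} = (L_u + V_x f_u) π_p + V_p
            Vp = (Lu + Vx @ fu) @ pip + Vp
            # V_x^{k-1} = L_x + L_u π_x + V_x (f_x + f_u π_x)
            Vx = Lx + Lu @ pix + Vx @ (fx + fu @ pix)
        return Vp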


Handling Constraints

  • Lagrange multiplier approach, with constraint violation value function.
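
One plausible reading of this bullet, sketched below: keep a second value function for constraint violation, obtain its parameter gradient Cp with the same recursion as the cost gradient Vp, and combine the two through a Lagrange multiplier updated by dual ascent. All names are illustrative; the slides do not spell out this update.

    def lagrangian_policy_step(p, lam, Vp, Cp, violation, step=1e-2, dual_step=1e-1):
        """One primal-dual update: descend the Lagrangian V + lam*C in p,
        and raise lam while the constraint-violation value stays positive."""
        p_new = p - step * (Vp + lam * Cp)
        lam_new = max(0.0, lam + dual_step * violation)
        return p_new, lam_new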


V_pp: Second-Order Models




Timing Test


Antecedents

  • Optimizing control “parameters” in DDP: Dyer and McReynolds 1970.

  • Optimal output feedback design (1960s-1970s)

  • Multiple model adaptive control (MMAC)

  • Policy gradient reinforcement learning

  • Adaptive critics, Werbos: HDP, DHP, GDHP, ADHDP, ADDHP


When Will LQBR Work?

  • Initial stabilizing policy is known (“output stabilizable”)

  • L_uu is positive definite.

  • L_xx is positive semi-definite and (sqrt(L_xx), F_x) is detectable.

  • Measurement matrix C has full row rank.
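
A rough sketch of checking the matrix conditions above numerically for a discrete-time problem. The matrices Luu, Lxx (cost Hessians), Fx (dynamics matrix), and C (measurement matrix) are assumptions standing in for the user's problem data; the PBH-style detectability test is a standard check, not something stated on the slide.

    import numpy as np

    def is_positive_definite(M, tol=1e-9):
        return np.all(np.linalg.eigvalsh(M) > tol)

    def is_positive_semidefinite(M, tol=1e-9):
        return np.all(np.linalg.eigvalsh(M) > -tol)

    def is_detectable(Fx, H, tol=1e-9):
        """PBH test: (H, Fx) is detectable if [lam*I - Fx; H] has full column
        rank for every eigenvalue of Fx with |lam| >= 1 (discrete time)."""
        n = Fx.shape[0]
        for lam in np.linalg.eigvals(Fx):
            if abs(lam) >= 1.0 - tol:
                M = np.vstack([lam * np.eye(n) - Fx, H])
                if np.linalg.matrix_rank(M) < n:
                    return False
        return True

    def check_lqbr_conditions(Luu, Lxx, Fx, C):
        """Check the bullet-point matrix conditions; returns a dict of booleans."""
        # sqrt(Lxx) via Cholesky of a slightly regularized Lxx
        sqrt_Lxx = np.linalg.cholesky(Lxx + 1e-12 * np.eye(Lxx.shape[0])).T
        return {
            "Luu positive definite": is_positive_definite(Luu),
            "Lxx positive semidefinite": is_positive_semidefinite(Lxx),
            "(sqrt(Lxx), Fx) detectable": is_detectable(Fx, sqrt_Lxx),
            "C full row rank": np.linalg.matrix_rank(C) == C.shape[0],
        }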



Local Policies

[Figure: GOAL]



Continuous Time


Other Issues

  • Model Following

  • Stochastic Plants

  • Receding Horizon Control/MPC

  • Adaptive RHC/MPC

  • Combine with Dynamic Programming

  • Dynamic Policies -> Learn State Estimator


Optimize Policies

  • Policy Iteration, with gradient-based policy improvement step.

  • Analytic gradients are easy.

  • Non-overlapping sub-policies make second-order gradient calculations fast (see the sketch below).

  • Big problem: how to choose the policy structure?
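
A small sketch of the sub-policy point, assuming each sub-policy owns a disjoint block of the parameter vector, so V_pp is block diagonal and the second-order step Δp = -V_pp^{-1} V_p splits into independent small solves. The block layout here is illustrative.

    import numpy as np

    def blockwise_newton_step(Vp, Vpp_blocks, index_blocks):
        """Second-order update when V_pp is block diagonal.
        Vp: full gradient; Vpp_blocks[i]: Hessian block for sub-policy i;
        index_blocks[i]: parameter indices owned by sub-policy i."""
        dp = np.zeros_like(Vp)
        for H, idx in zip(Vpp_blocks, index_blocks):
            dp[idx] = -np.linalg.solve(H, Vp[idx])  # small per-block solve
        return dp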

