Efficient Policy Gradient Optimization/Learning of Feedback Controllers

Chris Atkeson


Punchlines

  • Optimize and learn policies.

    Switch from “value iteration” to “policy iteration”.

  • This is a big switch from optimizing and learning value functions.

  • Use gradient-based policy optimization.


Motivations

  • Efficiently design nonlinear policies

  • Make policy-gradient reinforcement learning practical.


Model-Based Policy Optimization

  • Simulate the policy u = π(x, p) from a set of initial states x_0 to find the policy cost.

  • Use your favorite local or global optimizer to minimize the simulated policy cost.

  • If gradients are used, they are typically estimated numerically.

  • Δp = -ε ∑_x0 w(x_0) V_p    (first-order gradient step)

  • Δp = -(∑_x0 w(x_0) V_pp)^(-1) ∑_x0 w(x_0) V_p    (second-order step; see the sketch below)
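A minimal sketch of this loop, assuming generic callables f (dynamics), L (one-step cost), and a policy pi(x, p), with central finite-difference gradients; the function names (rollout_cost, policy_cost, first_order_step) and step sizes are illustrative, not from the slides:

```python
import numpy as np

# Sketch of simulation-based policy optimization (assumed interfaces):
#   f(x, u)  -> next state        (dynamics model)
#   L(x, u)  -> scalar step cost  (one-step cost)
#   pi(x, p) -> action            (policy with parameter vector p)

def rollout_cost(f, L, pi, x0, p, N):
    """Simulate the policy from x0 for N steps and accumulate cost."""
    x, cost = x0, 0.0
    for _ in range(N):
        u = pi(x, p)
        cost += L(x, u)
        x = f(x, u)
    return cost

def policy_cost(f, L, pi, x0s, weights, p, N):
    """Weighted policy cost over a set of initial states x0."""
    return sum(w * rollout_cost(f, L, pi, x0, p, N)
               for x0, w in zip(x0s, weights))

def first_order_step(f, L, pi, x0s, weights, p, N, eps=1e-5, step=1e-2):
    """One update dp = -step * sum_x0 w(x0) V_p, with V_p estimated
    by central finite differences (the 'numerically estimated' case)."""
    grad = np.zeros_like(p)
    for i in range(p.size):
        d = np.zeros_like(p)
        d[i] = eps
        grad[i] = (policy_cost(f, L, pi, x0s, weights, p + d, N)
                   - policy_cost(f, L, pi, x0s, weights, p - d, N)) / (2 * eps)
    return p - step * grad
```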


Can we make model-based policy gradient more efficient?


Analytic Gradients

  • Deterministic policy: u = π(x,p)

  • Policy Iteration (Bellman Equation):

    V^{k-1}(x, p) = L(x, π(x, p)) + V(f(x, π(x, p)), p)

  • Linear models: f(x, u) = f_0 + f_x Δx + f_u Δu

    L(x, u) = L_0 + L_x Δx + L_u Δu

    π(x, p) = π_0 + π_x Δx + π_p Δp

    V(x, p) = V_0 + V_x Δx + V_p Δp

  • Policy Gradient (backward recursion; a sketch of the sweep follows below):

    V_x^{k-1} = L_x + L_u π_x + V_x (f_x + f_u π_x)

    V_p^{k-1} = (L_u + V_x f_u) π_p + V_p
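A minimal sketch of the backward sweep these equations define, assuming a trajectory of length N with precomputed local derivatives (L_x, L_u, f_x, f_u, π_x, π_p) at each step and a zero terminal value gradient; the names and shapes are illustrative, not from the slides:

```python
import numpy as np

# Sketch of the analytic policy-gradient backward sweep (assumed setup):
# a trajectory of length N with local derivatives at each step k,
#   L_x[k] (n,),  L_u[k] (m,)      cost gradients
#   f_x[k] (n,n), f_u[k] (n,m)     dynamics Jacobians
#   pi_x[k] (m,n), pi_p[k] (m,n_p) policy Jacobians
# n = state dim, m = action dim, n_p = number of policy parameters.

def analytic_policy_gradient(L_x, L_u, f_x, f_u, pi_x, pi_p):
    N = len(L_x)
    n = f_x[0].shape[0]
    n_p = pi_p[0].shape[1]
    V_x = np.zeros(n)    # terminal value gradient (assumed zero here)
    V_p = np.zeros(n_p)
    for k in reversed(range(N)):
        # V_p^{k-1} = (L_u + V_x f_u) pi_p + V_p   (uses V_x from step k)
        V_p = (L_u[k] + V_x @ f_u[k]) @ pi_p[k] + V_p
        # V_x^{k-1} = L_x + L_u pi_x + V_x (f_x + f_u pi_x)
        V_x = L_x[k] + L_u[k] @ pi_x[k] + V_x @ (f_x[k] + f_u[k] @ pi_x[k])
    return V_x, V_p    # V_p is the policy gradient from this initial state
```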


Handling Constraints

  • Lagrange multiplier approach, with constraint violation value function.


V_pp: Second-Order Models


Regularization


LQBR: Linear (dynamics) Quadratic (cost) Bilinear (policy) Regulator


Timing Test


Antecedents

  • Optimizing control “parameters” in DDP: Dyer and McReynolds 1970.

  • Optimal output feedback design (1960s-1970s)

  • Multiple model adaptive control (MMAC)

  • Policy gradient reinforcement learning

  • Adaptive critics, Werbos: HDP, DHP, GDHP, ADHDP, ADDHP


When Will LQBR Work?

  • Initial stabilizing policy is known (“output stabilizable”).

  • L_uu is positive definite.

  • L_xx is positive semi-definite and (sqrt(L_xx), F_x) is detectable.

  • Measurement matrix C has full row rank (a check of these conditions is sketched below).
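A small NumPy sketch of how these conditions might be checked, assuming a discrete-time plant with dynamics Jacobian F_x, cost Hessians L_xx and L_uu, and measurement matrix C; the discrete-time PBH detectability test and all names here are assumptions, not from the slides (the first condition, knowledge of an initial stabilizing policy, is not checked):

```python
import numpy as np

# Sketch of checking the LQBR applicability conditions (assumed setup):
#   F_x  dynamics Jacobian, L_xx / L_uu cost Hessians, C measurement matrix.
# Knowledge of an initial stabilizing policy is not checked here.

def is_positive_definite(M, tol=1e-9):
    return bool(np.all(np.linalg.eigvalsh((M + M.T) / 2) > tol))

def is_positive_semidefinite(M, tol=1e-9):
    return bool(np.all(np.linalg.eigvalsh((M + M.T) / 2) > -tol))

def is_detectable(A, C, tol=1e-9):
    """Discrete-time PBH test: every eigenvalue of A with |lam| >= 1
    must be observable, i.e. rank([lam*I - A; C]) = n."""
    n = A.shape[0]
    for lam in np.linalg.eigvals(A):
        if abs(lam) >= 1.0 - tol:
            if np.linalg.matrix_rank(np.vstack([lam * np.eye(n) - A, C])) < n:
                return False
    return True

def lqbr_conditions_hold(F_x, L_xx, L_uu, C):
    n = L_xx.shape[0]
    # Any factor H with H.T @ H = L_xx serves as sqrt(L_xx); use the
    # Cholesky factor of a slightly regularized L_xx for simplicity.
    H = np.linalg.cholesky(L_xx + 1e-12 * np.eye(n)).T
    return (is_positive_definite(L_uu)
            and is_positive_semidefinite(L_xx)
            and is_detectable(F_x, H)
            and np.linalg.matrix_rank(C) == C.shape[0])  # full row rank
```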


Locally Linear Policies


Local Policies

[Figure: local policies around the GOAL]


Cost Of One Gradient Calculation


Continuous Time


Other Issues

  • Model Following

  • Stochastic Plants

  • Receding Horizon Control/MPC

  • Adaptive RHC/MPC

  • Combine with Dynamic Programming

  • Dynamic Policies -> Learn State Estimator


Optimize Policies

  • Policy Iteration, with gradient-based policy improvement step.

  • Analytic gradients are easy.

  • Non-overlapping sub-policies make second order gradient calculations fast.

  • Big problem: How do we choose the policy structure?

