Efficient Policy Gradient Optimization/Learning of Feedback Controllers

Chris Atkeson

Punchlines
  • Optimize and learn policies.
  • Switch from “value iteration” to “policy iteration”.
  • This is a big switch from optimizing and learning value functions.
  • Use gradient-based policy optimization.
Motivations
  • Efficiently design nonlinear policies.
  • Make policy-gradient reinforcement learning practical.
Model-Based Policy Optimization
  • Simulate policy u = π(x,p) from some initial states x0 to find policy cost.
  • Use favorite local or global optimizer to optimize simulated policy cost.
  • If gradients are used, they are typically numerically estimated (see the sketch after this list).
  • Δp = -ε ∑_x0 w(x0) V_p      (1st-order gradient)
  • Δp = -(∑_x0 w(x0) V_pp)^-1 ∑_x0 w(x0) V_p      (2nd order)
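A minimal sketch of this simulate-and-optimize loop, assuming a user-supplied dynamics model f, stage cost L, and parameterized policy pi; the function names, finite-difference gradient, and plain gradient step below are illustrative assumptions, not from the slides:

```python
import numpy as np

def policy_cost(p, x0, f, L, pi, horizon):
    """Simulate u = pi(x, p) forward from x0 and accumulate the policy cost V."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = pi(x, p)
        cost += L(x, u)
        x = f(x, u)
    return cost

def numerical_policy_gradient(p, starts, weights, f, L, pi, horizon, eps=1e-5):
    """Central-difference estimate of sum_x0 w(x0) V_p."""
    grad = np.zeros_like(p)
    for x0, w in zip(starts, weights):
        for i in range(len(p)):
            dp = np.zeros_like(p)
            dp[i] = eps
            grad[i] += w * (policy_cost(p + dp, x0, f, L, pi, horizon) -
                            policy_cost(p - dp, x0, f, L, pi, horizon)) / (2.0 * eps)
    return grad

# 1st-order update: delta_p = -epsilon * sum_x0 w(x0) V_p
# p_new = p - step_size * numerical_policy_gradient(p, starts, weights, f, L, pi, horizon)
```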
Analytic Gradients
  • Deterministic policy: u = π(x,p)
  • Policy Iteration (Bellman Equation):

V^(k-1)(x,p) = L(x, π(x,p)) + V^k(f(x, π(x,p)), p)

  • Linear models: f(x,u) = f_0 + f_x Δx + f_u Δu

L(x,u) = L_0 + L_x Δx + L_u Δu

π(x,p) = π_0 + π_x Δx + π_p Δp

V(x,p) = V_0 + V_x Δx + V_p Δp

  • Policy Gradient (a code sketch of this recursion follows these equations):

V_x^(k-1) = L_x + L_u π_x + V_x^k (f_x + f_u π_x)

V_p^(k-1) = (L_u + V_x^k f_u) π_p + V_p^k
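A sketch of this backward recursion, assuming the local model terms L_x, L_u, f_x, f_u, π_x, π_p have already been evaluated at each step of a simulated trajectory; the data layout, row-vector convention, and zero terminal conditions are assumptions:

```python
import numpy as np

def analytic_policy_gradient(traj):
    """Backward pass for V_x and V_p along one simulated trajectory.

    traj: list of per-step dicts with local model terms (illustrative names):
      'L_x' (1,n), 'L_u' (1,m), 'f_x' (n,n), 'f_u' (n,m),
      'pi_x' (m,n), 'pi_p' (m,P).
    Value-function gradients use a row-vector convention.
    """
    n = traj[0]['f_x'].shape[0]
    P = traj[0]['pi_p'].shape[1]
    V_x = np.zeros((1, n))   # terminal V_x (zero terminal cost assumed)
    V_p = np.zeros((1, P))   # terminal V_p
    for step in reversed(traj):
        L_x, L_u = step['L_x'], step['L_u']
        f_x, f_u = step['f_x'], step['f_u']
        pi_x, pi_p = step['pi_x'], step['pi_p']
        # V_p^(k-1) = (L_u + V_x^k f_u) pi_p + V_p^k  (uses V_x before it is updated)
        V_p = (L_u + V_x @ f_u) @ pi_p + V_p
        # V_x^(k-1) = L_x + L_u pi_x + V_x^k (f_x + f_u pi_x)
        V_x = L_x + L_u @ pi_x + V_x @ (f_x + f_u @ pi_x)
    return V_x, V_p
```

One forward simulation plus this backward pass yields ∑_x0 w(x0) V_p without any finite differencing.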

Handling Constraints
  • Lagrange multiplier approach, with a constraint-violation value function (one possible form is written out below).
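One way to write this out, as an assumption rather than the author's exact formulation: augment the stage cost with a multiplier term on the constraints c(x,u) and back up the augmented value function through the same Bellman recursion as above:

\[
\tilde{L}(x,u,\lambda) = L(x,u) + \lambda^{\top} c(x,u),
\qquad
V^{k-1}(x,p,\lambda) = \tilde{L}\big(x,\pi(x,p),\lambda\big) + V^{k}\big(f(x,\pi(x,p)),\,p,\,\lambda\big)
\]

Under this reading, the multipliers λ would be adjusted (for example, by dual ascent) to drive the constraint-violation value function toward zero.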
Antecedents
  • Optimizing control “parameters” in DDP: Dyer and McReynolds 1970.
  • Optimal output feedback design (1960s-1970s)
  • Multiple model adaptive control (MMAC)
  • Policy gradient reinforcement learning
  • Adaptive critics, Werbos: HDP, DHP, GDHP, ADHDP, ADDHP
When Will LQBR Work?
  • Initial stabilizing policy is known (“output stabilizable”)
  • L_uu is positive definite.
  • L_xx is positive semi-definite and (√L_xx, F_x) is detectable.
  • Measurement matrix C has full row rank.
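Given a local model, the last three conditions can be checked numerically; a rough sketch follows (discrete-time PBH detectability test; the matrix names Luu, Lxx, Fx, C follow the slide, everything else is an illustrative assumption, and the first condition, knowing an initial stabilizing policy, is not something a static check can verify):

```python
import numpy as np

def is_positive_definite(M, tol=1e-9):
    return np.all(np.linalg.eigvalsh((M + M.T) / 2) > tol)

def is_positive_semidefinite(M, tol=1e-9):
    return np.all(np.linalg.eigvalsh((M + M.T) / 2) > -tol)

def symmetric_sqrt(M):
    """Square root of a symmetric PSD matrix via its eigendecomposition."""
    w, U = np.linalg.eigh((M + M.T) / 2)
    return U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T

def is_detectable(C_obs, A, tol=1e-9):
    """PBH test (discrete time): rank([A - lam*I; C_obs]) = n for every
    eigenvalue lam of A with |lam| >= 1."""
    n = A.shape[0]
    for lam in np.linalg.eigvals(A):
        if abs(lam) >= 1.0 - tol:
            if np.linalg.matrix_rank(np.vstack([A - lam * np.eye(n), C_obs])) < n:
                return False
    return True

def lqbr_conditions(Luu, Lxx, Fx, C):
    return (is_positive_definite(Luu) and
            is_positive_semidefinite(Lxx) and
            is_detectable(symmetric_sqrt(Lxx), Fx) and
            np.linalg.matrix_rank(C) == C.shape[0])
```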
Other Issues
  • Model Following
  • Stochastic Plants
  • Receding Horizon Control/MPC
  • Adaptive RHC/MPC
  • Combine with Dynamic Programming
  • Dynamic Policies -> Learn State Estimator
Optimize Policies
  • Policy Iteration, with a gradient-based policy improvement step (sketched after this list).
  • Analytic gradients are easy.
  • Non-overlapping sub-policies make second-order gradient calculations fast.
  • Big problem: How to choose the policy structure?
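Putting the earlier sketches together, the gradient-based improvement step inside this policy iteration might look like the following loop (reusing the illustrative numerical_policy_gradient from the sketch above; the fixed step size and iteration count are assumptions):

```python
def optimize_policy(p, starts, weights, f, L, pi, horizon,
                    step_size=1e-2, iters=200):
    """Policy iteration with a plain gradient-based improvement step."""
    for _ in range(iters):
        grad = numerical_policy_gradient(p, starts, weights, f, L, pi, horizon)
        p = p - step_size * grad   # delta_p = -epsilon * sum_x0 w(x0) V_p
    return p
```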