
# Efficient Policy Gradient Optimization/Learning of Feedback Controllers - PowerPoint PPT Presentation





## Presentation Transcript

### Efficient Policy Gradient Optimization/Learning of Feedback Controllers

Chris Atkeson

### Punchlines

• Optimize and learn policies.

• Switch from “value iteration” to “policy iteration”.

• This is a big switch from optimizing and learning value functions.

• Efficiently design nonlinear policies

• Make policy-gradient reinforcement learning practical.

• Simulate policy u = π(x,p) from some initial states x0 to find policy cost.

• Use favorite local or global optimizer to optimize simulated policy cost.

• If gradients are used, they are typically numerically estimated.

• Δp = −ε ∑_x0 w(x0) V_p (1st-order gradient step)

• Δp = −(∑_x0 w(x0) V_pp)⁻¹ ∑_x0 w(x0) V_p (2nd-order Newton step)
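A minimal numerical sketch of these two updates. The helper name `policy_cost(p, x0)` is hypothetical (not from the slides): it is assumed to simulate u = π(x, p) from x0 and return the accumulated cost, with a central-difference estimate standing in for the numerically estimated V_p:

```python
import numpy as np

def finite_difference_gradient(policy_cost, p, x0, eps=1e-5):
    """Numerically estimate V_p, the gradient of the simulated policy
    cost with respect to the parameters p, by central differences."""
    g = np.zeros_like(p)
    for i in range(p.size):
        dp = np.zeros_like(p)
        dp[i] = eps
        g[i] = (policy_cost(p + dp, x0) - policy_cost(p - dp, x0)) / (2 * eps)
    return g

def first_order_update(policy_cost, p, x0s, weights, eps_step=0.1):
    """Delta_p = -eps * sum_x0 w(x0) V_p  (the 1st-order rule above)."""
    g = sum(w * finite_difference_gradient(policy_cost, p, x0)
            for x0, w in zip(x0s, weights))
    return -eps_step * g

def second_order_update(grads, hessians, weights, damping=1e-6):
    """Delta_p = -(sum_x0 w V_pp)^-1 sum_x0 w V_p; the small damping
    term is an added assumption (not in the slides) to guard against
    a singular Hessian."""
    H = sum(w * Hi for w, Hi in zip(weights, hessians))
    g = sum(w * gi for w, gi in zip(weights, grads))
    return -np.linalg.solve(H + damping * np.eye(H.shape[0]), g)
```

When the weighted cost is quadratic in p, the second-order step jumps essentially straight to the minimizer; the first-order step only guarantees descent for a small enough ε.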

• Deterministic policy: u = π(x,p)

• Policy Iteration (Bellman Equation):

V^(k-1)(x,p) = L(x, π(x,p)) + V^k(f(x, π(x,p)), p)

• Linear models:

f(x,u) = f_0 + f_x Δx + f_u Δu

L(x,u) = L_0 + L_x Δx + L_u Δu

π(x,p) = π_0 + π_x Δx + π_p Δp

V(x,p) = V_0 + V_x Δx + V_p Δp

V_x^(k-1) = L_x + L_u π_x + V_x (f_x + f_u π_x)

V_p^(k-1) = (L_u + V_x f_u) π_p + V_p
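The two recursions above can be run as a single backward pass along a nominal trajectory. A sketch, assuming the local derivatives (f_x, f_u, L_x, L_u) and policy Jacobians (π_x, π_p) have already been computed at each step, and a zero terminal value function (an added assumption):

```python
import numpy as np

def backward_policy_gradient(steps, n_p):
    """Accumulate V_x and V_p backward in time using the recursions
    on the slide:
        V_x^(k-1) = L_x + L_u pi_x + V_x (f_x + f_u pi_x)
        V_p^(k-1) = (L_u + V_x f_u) pi_p + V_p
    `steps` is a list of dicts of local derivatives, one per time step."""
    Vx = np.zeros(steps[0]["Lx"].shape)  # terminal V_x = 0 (assumption)
    Vp = np.zeros(n_p)                   # terminal V_p = 0 (assumption)
    for s in reversed(steps):
        Lx, Lu, fx, fu = s["Lx"], s["Lu"], s["fx"], s["fu"]
        pix, pip = s["pix"], s["pip"]
        # V_p update must use the not-yet-updated Vx (i.e., V_x^k).
        Vp = (Lu + Vx @ fu) @ pip + Vp
        Vx = Lx + Lu @ pix + Vx @ (fx + fu @ pix)
    return Vx, Vp
```

Note the ordering: V_p at step k−1 depends on V_x at step k, so V_p is updated before V_x inside the loop.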

• Lagrange multiplier approach, with constraint violation value function.

### V_pp: Second-Order Models

### Timing Test

### Antecedents

• Optimizing control “parameters” in DDP: Dyer and McReynolds 1970.

• Optimal output feedback design (1960s-1970s)

• Multiple model adaptive control (MMAC)

### When Will LQBR Work?

• Initial stabilizing policy is known (“output stabilizable”)

• Luu is positive definite.

• Lxx is positive semi-definite and (sqrt(Lxx),Fx) is detectable.

• Measurement matrix C has full row rank.
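These conditions can be checked numerically. A sketch in discrete time, with matrix names (`Luu`, `Lxx`, `Fx`, `C`) standing in for the slide's symbols and the PBH rank test used for detectability:

```python
import numpy as np

def is_positive_definite(M):
    """Luu > 0: all eigenvalues of the (symmetric) matrix are positive."""
    return bool(np.all(np.linalg.eigvalsh(M) > 0))

def is_positive_semidefinite(M, tol=1e-10):
    """Lxx >= 0 up to a numerical tolerance."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

def has_full_row_rank(C):
    """Measurement matrix C has full row rank."""
    return np.linalg.matrix_rank(C) == C.shape[0]

def is_detectable(Q_sqrt, Fx, tol=1e-9):
    """PBH test: (sqrt(Lxx), Fx) is detectable iff every unstable
    eigenvalue of Fx (|lambda| >= 1 in discrete time) is observable."""
    n = Fx.shape[0]
    for lam in np.linalg.eigvals(Fx):
        if abs(lam) >= 1:
            M = np.vstack([lam * np.eye(n) - Fx, Q_sqrt])
            if np.linalg.matrix_rank(M, tol) < n:
                return False
    return True
```

These are diagnostic checks only; they say whether a stabilizing solution can exist, not how to find it.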

### Local Policies


### Continuous-Time Regulator

### Other Issues

• Model Following

• Stochastic Plants

• Receding Horizon Control/MPC

• Combine with Dynamic Programming

• Dynamic Policies -> Learn State Estimator

### Optimize Policies

• Policy Iteration, with gradient-based policy improvement step.
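Putting the pieces together, the evaluate/improve loop might look like the following sketch, with a finite-difference gradient as the policy improvement step (`cost_fn` is a hypothetical stand-in for the simulated policy cost, not code from the talk):

```python
import numpy as np

def policy_iteration(p, cost_fn, step=0.05, iters=100, eps=1e-5):
    """Repeat: evaluate the policy (simulated cost at parameters p),
    then improve p with a gradient step on that cost."""
    for _ in range(iters):
        # Policy evaluation + improvement: central-difference gradient of
        # the simulated cost, followed by a fixed-step gradient update.
        g = np.zeros_like(p)
        for i in range(p.size):
            dp = np.zeros_like(p)
            dp[i] = eps
            g[i] = (cost_fn(p + dp) - cost_fn(p - dp)) / (2 * eps)
        p = p - step * g
    return p
```

In practice the fixed step size would be replaced by a line search or the second-order (Newton) update from the earlier slides.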