
# Efficient Policy Gradient Optimization/Learning of Feedback Controllers - PowerPoint PPT Presentation





## Presentation Transcript

### Efficient Policy Gradient Optimization/Learning of Feedback Controllers

Chris Atkeson

### Punchlines

• Optimize and learn policies.

• Switch from “value iteration” to “policy iteration”.

• This is a big switch from optimizing and learning value functions.

• Efficiently design nonlinear policies

• Make policy-gradient reinforcement learning practical.

• Simulate policy u = π(x,p) from some initial states x0 to find policy cost.

• Use favorite local or global optimizer to optimize simulated policy cost.

• If gradients are used, they are typically numerically estimated.

• Δp = −ε ∑_x0 w(x0) V_p (1st-order gradient step)

• Δp = −(∑_x0 w(x0) V_pp)⁻¹ ∑_x0 w(x0) V_p (2nd-order Newton step)
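A minimal numerical sketch of these two updates. The helper name `policy_cost(p, x0)` is hypothetical (not from the slides): it is assumed to simulate u = π(x, p) from x0 and return the accumulated cost, with a central-difference estimate standing in for the numerically estimated V_p:

```python
import numpy as np

def finite_difference_gradient(policy_cost, p, x0, eps=1e-5):
    """Numerically estimate V_p, the gradient of the simulated policy
    cost with respect to the parameters p, by central differences."""
    g = np.zeros_like(p)
    for i in range(p.size):
        dp = np.zeros_like(p)
        dp[i] = eps
        g[i] = (policy_cost(p + dp, x0) - policy_cost(p - dp, x0)) / (2 * eps)
    return g

def first_order_update(policy_cost, p, x0s, weights, eps_step=0.1):
    """Delta_p = -eps * sum_x0 w(x0) V_p  (the 1st-order rule above)."""
    g = sum(w * finite_difference_gradient(policy_cost, p, x0)
            for x0, w in zip(x0s, weights))
    return -eps_step * g

def second_order_update(grads, hessians, weights, damping=1e-6):
    """Delta_p = -(sum_x0 w V_pp)^-1 sum_x0 w V_p; the small damping
    term is an added assumption (not in the slides) to guard against
    a singular Hessian."""
    H = sum(w * Hi for w, Hi in zip(weights, hessians))
    g = sum(w * gi for w, gi in zip(weights, grads))
    return -np.linalg.solve(H + damping * np.eye(H.shape[0]), g)
```

When the weighted cost is quadratic in p, the second-order step jumps essentially straight to the minimizer; the first-order step only guarantees descent for a small enough ε.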

• Deterministic policy: u = π(x,p)

• Policy Iteration (Bellman Equation):

V^(k-1)(x,p) = L(x, π(x,p)) + V^k(f(x, π(x,p)), p)

• Linear models:

f(x,u) = f_0 + f_x Δx + f_u Δu

L(x,u) = L_0 + L_x Δx + L_u Δu

π(x,p) = π_0 + π_x Δx + π_p Δp

V(x,p) = V_0 + V_x Δx + V_p Δp

V_x^(k-1) = L_x + L_u π_x + V_x (f_x + f_u π_x)

V_p^(k-1) = (L_u + V_x f_u) π_p + V_p
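The two recursions above can be run as a single backward pass along a nominal trajectory. A sketch, assuming the local derivatives (f_x, f_u, L_x, L_u) and policy Jacobians (π_x, π_p) have already been computed at each step, and a zero terminal value function (an added assumption):

```python
import numpy as np

def backward_policy_gradient(steps, n_p):
    """Accumulate V_x and V_p backward in time using the recursions
    on the slide:
        V_x^(k-1) = L_x + L_u pi_x + V_x (f_x + f_u pi_x)
        V_p^(k-1) = (L_u + V_x f_u) pi_p + V_p
    `steps` is a list of dicts of local derivatives, one per time step."""
    Vx = np.zeros(steps[0]["Lx"].shape)  # terminal V_x = 0 (assumption)
    Vp = np.zeros(n_p)                   # terminal V_p = 0 (assumption)
    for s in reversed(steps):
        Lx, Lu, fx, fu = s["Lx"], s["Lu"], s["fx"], s["fu"]
        pix, pip = s["pix"], s["pip"]
        # V_p update must use the not-yet-updated Vx (i.e., V_x^k).
        Vp = (Lu + Vx @ fu) @ pip + Vp
        Vx = Lx + Lu @ pix + Vx @ (fx + fu @ pix)
    return Vx, Vp
```

Note the ordering: V_p at step k−1 depends on V_x at step k, so V_p is updated before V_x inside the loop.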

• Lagrange multiplier approach, with constraint violation value function.

### V_pp: Second-Order Models

### Timing Test

### Antecedents

• Optimizing control “parameters” in DDP: Dyer and McReynolds 1970.

• Optimal output feedback design (1960s-1970s)

• Multiple model adaptive control (MMAC)

### When Will LQBR Work?

• Initial stabilizing policy is known (“output stabilizable”)

• Luu is positive definite.

• Lxx is positive semi-definite and (sqrt(Lxx),Fx) is detectable.

• Measurement matrix C has full row rank.
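These conditions can be checked numerically. A sketch in discrete time, with matrix names (`Luu`, `Lxx`, `Fx`, `C`) standing in for the slide's symbols and the PBH rank test used for detectability:

```python
import numpy as np

def is_positive_definite(M):
    """Luu > 0: all eigenvalues of the (symmetric) matrix are positive."""
    return bool(np.all(np.linalg.eigvalsh(M) > 0))

def is_positive_semidefinite(M, tol=1e-10):
    """Lxx >= 0 up to a numerical tolerance."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

def has_full_row_rank(C):
    """Measurement matrix C has full row rank."""
    return np.linalg.matrix_rank(C) == C.shape[0]

def is_detectable(Q_sqrt, Fx, tol=1e-9):
    """PBH test: (sqrt(Lxx), Fx) is detectable iff every unstable
    eigenvalue of Fx (|lambda| >= 1 in discrete time) is observable."""
    n = Fx.shape[0]
    for lam in np.linalg.eigvals(Fx):
        if abs(lam) >= 1:
            M = np.vstack([lam * np.eye(n) - Fx, Q_sqrt])
            if np.linalg.matrix_rank(M, tol) < n:
                return False
    return True
```

These are diagnostic checks only; they say whether a stabilizing solution can exist, not how to find it.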

### Local Policies


### Continuous-Time Regulator

### Other Issues

• Model Following

• Stochastic Plants

• Receding Horizon Control/MPC

• Combine with Dynamic Programming

• Dynamic Policies -> Learn State Estimator

### Optimize Policies

• Policy Iteration, with gradient-based policy improvement step.
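Putting the pieces together, the evaluate/improve loop might look like the following sketch, with a finite-difference gradient as the policy improvement step (`cost_fn` is a hypothetical stand-in for the simulated policy cost, not code from the talk):

```python
import numpy as np

def policy_iteration(p, cost_fn, step=0.05, iters=100, eps=1e-5):
    """Repeat: evaluate the policy (simulated cost at parameters p),
    then improve p with a gradient step on that cost."""
    for _ in range(iters):
        # Policy evaluation + improvement: central-difference gradient of
        # the simulated cost, followed by a fixed-step gradient update.
        g = np.zeros_like(p)
        for i in range(p.size):
            dp = np.zeros_like(p)
            dp[i] = eps
            g[i] = (cost_fn(p + dp) - cost_fn(p - dp)) / (2 * eps)
        p = p - step * g
    return p
```

In practice the fixed step size would be replaced by a line search or the second-order (Newton) update from the earlier slides.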