Efficient Policy Gradient Optimization/Learning of Feedback Controllers
Chris Atkeson

Punchlines:
Optimize and learn policies.
Switch from “value iteration” to “policy iteration”. This is a big switch from optimizing and learning value functions.
Use gradient-based policy optimization.
Switch from “value iteration” to “policy iteration”.
The Bellman recursion for the value of a policy $\pi(x,p)$ with parameters $p$, running cost $L$, and dynamics $f$:

$V_{k-1}(x,p) = L(x,\pi(x,p)) + V_k(f(x,\pi(x,p)),\,p)$

First-order models about the nominal trajectory:

$L(x,u) \approx L_0 + L_x\,\Delta x + L_u\,\Delta u$
$\pi(x,p) \approx \pi_0 + \pi_x\,\Delta x + \pi_p\,\Delta p$
$V(x,p) \approx V_0 + V_x\,\Delta x + V_p\,\Delta p$

Differentiating the recursion with respect to $x$ and $p$ gives backward recursions for the value gradients:

$V_x^{k-1} = L_x + L_u\,\pi_x + V_x^{k}\,(f_x + f_u\,\pi_x)$
$V_p^{k-1} = \left(L_u + V_x^{k}\,f_u\right)\pi_p + V_p^{k}$

At the initial step, $V_p$ is the gradient of the total cost with respect to the policy parameters, which is exactly what gradient-based policy optimization needs.
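Below is a minimal numerical sketch of these recursions, not code from the talk. It assumes a concrete setup chosen for illustration: double-integrator dynamics $f(x,u) = Ax + Bu$, quadratic running cost $L(x,u) = x^TQx + u^TRu$, and a linear policy $\pi(x,p) = -Kx$ whose gains $K$ are the parameters $p$. It runs one forward rollout, propagates $V_x$ and $V_p$ backward, checks $V_p$ against a finite-difference gradient, and then does gradient descent on $p$.

```python
import numpy as np

# Illustrative setup (not from the slides): double integrator, quadratic cost,
# linear feedback policy u = -K x with parameters p = K.ravel().
n, m = 2, 1                    # state and control dimensions
dt, H = 0.1, 50                # time step and horizon

A = np.array([[1.0, dt],
              [0.0, 1.0]])     # f_x: dynamics Jacobian w.r.t. x
B = np.array([[0.0],
              [dt]])           # f_u: dynamics Jacobian w.r.t. u
Q = np.eye(n)                  # state cost weight
R = 0.1 * np.eye(m)            # control cost weight

def cost_and_gradient(p, x0):
    """Forward rollout, then the backward V_x / V_p recursion."""
    Kg = p.reshape(m, n)                   # policy gains: u = -Kg x
    xs, us = [x0], []
    for _ in range(H):                     # forward pass
        u = -Kg @ xs[-1]                   # pi(x, p)
        us.append(u)
        xs.append(A @ xs[-1] + B @ u)      # f(x, u)

    Vx = np.zeros((1, n))                  # V_x at the horizon: no terminal cost
    Vp = np.zeros((1, m * n))              # V_p at the horizon
    J = 0.0
    for k in reversed(range(H)):           # backward pass
        x, u = xs[k], us[k]
        J += x @ Q @ x + u @ R @ u
        Lx = (2 * Q @ x).reshape(1, n)     # dL/dx
        Lu = (2 * R @ u).reshape(1, m)     # dL/du
        pi_x = -Kg                         # dpi/dx for u = -Kg x
        pi_p = -np.kron(np.eye(m), x)      # dpi/dp, shape (m, m*n)
        # V_p^{k-1} = (L_u + V_x f_u) pi_p + V_p  (update V_p first:
        # it uses the incoming next-step V_x)
        Vp = (Lu + Vx @ B) @ pi_p + Vp
        # V_x^{k-1} = L_x + L_u pi_x + V_x (f_x + f_u pi_x)
        Vx = Lx + Lu @ pi_x + Vx @ (A + B @ pi_x)
    return J, Vp.ravel()

x0 = np.array([1.0, 0.0])
p = np.zeros(m * n)

# Sanity check: the recursion reproduces a finite-difference gradient.
J0, g = cost_and_gradient(p, x0)
eps = 1e-5
fd = np.array([(cost_and_gradient(p + eps * e, x0)[0]
              - cost_and_gradient(p - eps * e, x0)[0]) / (2 * eps)
               for e in np.eye(m * n)])
print("analytic:", g, " finite-diff:", fd)

# "Policy iteration" as gradient descent on the policy parameters.
for it in range(500):
    J, g = cost_and_gradient(p, x0)
    p -= 0.02 * g / (np.linalg.norm(g) + 1e-12)   # normalized step
print("cost:", J0, "->", cost_and_gradient(p, x0)[0], " gains:", p)
```

Because the cost, dynamics, and policy here are all linear or quadratic, the backward pass is exact; for nonlinear $f$, $L$, or $\pi$ the same recursion runs on Jacobians evaluated along the nominal trajectory.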