Implementing DHP in Software: Taking Control of the Pole-Cart System Lars Holmstrom
Overview • Provides a brief overview of Dual Heuristic Programming (DHP) • Describes a software implementation of DHP for designing a non-linear controller for the pole-cart system • Follows the methodology outlined in • Lendaris, G.G., & Neidhoefer, J.S. (2004). "Guidance in the Use of Adaptive Critics for Control," Ch. 4 in Handbook of Learning and Approximate Dynamic Programming, J. Si et al. (Eds.), IEEE Press & Wiley-Interscience, pp. 97–124.
DHP Foundations • Reinforcement Learning • A process in which an agent learns behaviors through trial-and-error interactions with its environment, based on "reinforcement" signals acquired over time • Unlike Supervised Learning, where an error signal based on the desired outcome of an action is known, reinforcement signals indicate only that one action is "better" or "worse" than another, not which action is "best"
DHP Foundations (continued) • Dynamic Programming • Provides a mathematical formalism for finding optimal solutions to control problems within a Markovian decision process • "Cost to Go" function: J(t) = Σₖ γᵏ U(t+k), the discounted sum of future utilities U • Bellman's Recursion: J(t) = U(t) + γ J(t+1)
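A minimal numeric sketch (not from the talk) showing that the backward Bellman recursion reproduces the discounted cost-to-go; the utility sequence and discount factor are illustrative assumptions:

```python
# Hypothetical finite-horizon example: per-step utilities and discount factor
# are made up for illustration.
gamma = 0.9
U = [1.0, 2.0, 0.5, 3.0]  # assumed utility at t = 0..3

# Direct definition: J(t) = sum over k of gamma^k * U(t+k)
def cost_to_go(t):
    return sum(gamma**k * U[t + k] for k in range(len(U) - t))

# Bellman's recursion: J(t) = U(t) + gamma * J(t+1), swept backwards in time.
J = [0.0] * (len(U) + 1)
for t in reversed(range(len(U))):
    J[t] = U[t] + gamma * J[t + 1]

# Both formulations agree at every time step.
assert all(abs(J[t] - cost_to_go(t)) < 1e-12 for t in range(len(U)))
```

Adaptive critic methods exploit exactly this recursion: the critic need only relate J at consecutive time steps rather than sum over the whole future.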
DHP Foundations (continued) • Adaptive Critics • An application of Reinforcement Learning for solving Dynamic Programming problems • The Critic is charged with the task of estimating J for a particular control policy π • The Critic’s knowledge about J, in turn, allows us to improve the control policy π • This process is iterated until the optimal J surface, J*, is found along with the associated optimal control policy π*
The Pole-Cart Problem • The dynamical system (plant) consists of a cart on a length of track with an inverted pendulum attached to it. • The control problem is to balance the inverted pendulum while keeping the cart near the center of the track by applying a horizontal force to the cart. • Pole Cart Animation
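The slides do not reproduce the plant equations, but this system is commonly modeled with the cart-pole equations of motion popularized by Barto, Sutton & Anderson (1983). A minimal Euler-integration sketch; the parameter values here are assumptions, not necessarily those used in the talk:

```python
import math

# Assumed parameters: cart mass, pole mass, pole half-length, gravity, time step.
M_CART, M_POLE, L_HALF, G, DT = 1.0, 0.1, 0.5, 9.8, 0.02

def cart_pole_step(x, x_dot, theta, theta_dot, force):
    """One Euler step of the standard cart-pole dynamics.
    theta is the pole angle from vertical; force is the horizontal push."""
    total_m = M_CART + M_POLE
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    tmp = (force + M_POLE * L_HALF * theta_dot**2 * sin_t) / total_m
    theta_acc = (G * sin_t - cos_t * tmp) / (
        L_HALF * (4.0 / 3.0 - M_POLE * cos_t**2 / total_m))
    x_acc = tmp - M_POLE * L_HALF * theta_acc * cos_t / total_m
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)
```

With zero force, the upright equilibrium (all state variables zero) is a fixed point, and any small tilt grows, which is what makes the control problem nontrivial.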
Calculating the Model Jacobians • Analytically • Numerical approximation • Backpropagation
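For the "numerical approximation" option above, a generic central-difference sketch (this is a standard technique, not the talk's actual code):

```python
def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f: R^n -> R^m, evaluated at x.
    Returns m rows of n entries, J[i][j] = d f_i / d x_j."""
    n = len(x)
    m = len(f(x))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return J
```

In DHP, f would be the plant model mapping the current state (and control) to the next state; analytic derivatives or backpropagation through a neural network model are the faster alternatives when available.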
Defining a Utility Function • The utility function, together with the plant dynamics, defines the optimal control policy • For this example, I choose a utility that penalizes only the cart's position and the pole's angle • Note: there is no penalty for effort, horizontal velocity (the cart), or angular velocity (the pole)
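The slide's utility expression did not survive transcription; a plausible form consistent with the note above is a quadratic penalty on cart position and pole angle only. The quadratic shape and unit coefficients are assumptions. DHP trains on derivatives of utility, so the gradient is included:

```python
# Assumed quadratic utility consistent with the slide's note: only cart
# position x and pole angle theta are penalized; effort and velocities are not.
# State layout and unit coefficients are illustrative assumptions.
def utility(state):
    x, x_dot, theta, theta_dot = state
    return -(x**2 + theta**2)  # larger (less negative) is better

def utility_gradient(state):
    """dU/dstate, the 'utility derivative' that DHP needs at each step."""
    x, x_dot, theta, theta_dot = state
    return [-2.0 * x, 0.0, -2.0 * theta, 0.0]
```

The zero entries in the gradient reflect the note directly: velocities carry no penalty, so their partial derivatives vanish.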
Setting Up the DHP Training Loop • For each training iteration (step in time) • Measure the current state • Calculate the control to apply • Calculate the control Jacobian • Iterate the model • Calculate the model Jacobian • Calculate the utility derivative • Calculate the present lambda, λ(t) = ∂J(t)/∂x(t), from the critic • Calculate the future lambda, λ(t+1), from the critic at the predicted next state • Calculate the reinforcement signal for the controller • Train the controller • Calculate the desired target for the critic • Train the critic
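The core of the loop above can be sketched as follows. This is a generic illustration of the DHP update signals with assumed list-based interfaces, not the talk's actual software; since the example's utility has no effort penalty, ∂U/∂u = 0 and the controller's signal reduces to γ·Fᵤᵀλ(t+1):

```python
GAMMA = 0.95  # assumed discount factor

def dhp_signals(state, u, F_x, F_u, dU_dx, lam_next):
    """Compute the two DHP training signals for one time step.
    F_x[k][i] = d next_state[k] / d state[i]   (model Jacobian)
    F_u[k][j] = d next_state[k] / d u[j]       (control Jacobian)
    dU_dx     = utility derivative at the current state
    lam_next  = critic's estimate of dJ(t+1)/dx(t+1) at the next state"""
    n, m = len(state), len(u)
    # Desired target for the critic (Bellman's recursion differentiated
    # with respect to the state): lambda(t) = dU/dx + gamma * F_x^T lambda(t+1)
    lam_target = [dU_dx[i] + GAMMA * sum(F_x[k][i] * lam_next[k]
                                         for k in range(n))
                  for i in range(n)]
    # Reinforcement signal for the controller: dJ/du = gamma * F_u^T lambda(t+1)
    # (the dU/du term is zero here because effort carries no penalty).
    dJ_du = [GAMMA * sum(F_u[k][j] * lam_next[k] for k in range(n))
             for j in range(m)]
    return lam_target, dJ_du
```

The critic is then trained toward `lam_target` at the current state, and the controller is adjusted along `dJ_du`; the present lambda λ(t) enters as the critic's current output whose error is `lam_target - λ(t)`.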
Defining an Experiment • Define the neural network architectures for the action and critic networks • Define the constants to be used for the model • Set up the lesson plan • Define incremental steps in the learning process • Set up a test plan
Software Availability • This software is available to anyone who would like to make use of it • We also have software available for performing backpropagation through time (BPTT) experiments • Set up an appointment with me or come in during my office hours to get more information about the software