Implementing DHP in Software: Taking Control of the Pole-Cart System Lars Holmstrom
Overview • Provides a brief overview of Dual Heuristic Programming (DHP) • Describes a software implementation of DHP for designing a non-linear controller for the pole-cart system • Follows the methodology outlined in • Lendaris, G.G., & Neidhoefer, J.S. (2004). "Guidance in the Use of Adaptive Critics for Control," Ch. 4 in Handbook of Learning and Approximate Dynamic Programming, J. Si et al. (Eds.), IEEE Press & Wiley-Interscience, pp. 97–124.
DHP Foundations • Reinforcement Learning • A process in which an agent learns behaviors through trial-and-error interactions with its environment, based on "reinforcement" signals acquired over time • Unlike Supervised Learning, where an error signal based on the desired outcome of an action is known, reinforcement signals indicate only that one action is "better" or "worse" than another, not which action is "best"
DHP Foundations (continued) • Dynamic Programming • Provides a mathematical formalism for finding optimal solutions to control problems within a Markovian decision process • "Cost to Go" function: J(t) = Σₖ γᵏ U(t+k), the discounted sum of future utilities U • Bellman's Recursion: J(t) = U(t) + γ J(t+1)
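A minimal numeric sketch (not from the talk) showing that the backward Bellman recursion reproduces the discounted cost-to-go; the utility sequence and discount factor are illustrative assumptions:

```python
# Hypothetical finite-horizon example: per-step utilities and discount factor
# are made up for illustration.
gamma = 0.9
U = [1.0, 2.0, 0.5, 3.0]  # assumed utility at t = 0..3

# Direct definition: J(t) = sum over k of gamma^k * U(t+k)
def cost_to_go(t):
    return sum(gamma**k * U[t + k] for k in range(len(U) - t))

# Bellman's recursion: J(t) = U(t) + gamma * J(t+1), swept backwards in time.
J = [0.0] * (len(U) + 1)
for t in reversed(range(len(U))):
    J[t] = U[t] + gamma * J[t + 1]

# Both formulations agree at every time step.
assert all(abs(J[t] - cost_to_go(t)) < 1e-12 for t in range(len(U)))
```

Adaptive critic methods exploit exactly this recursion: the critic need only relate J at consecutive time steps rather than sum over the whole future.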
DHP Foundations (continued) • Adaptive Critics • An application of Reinforcement Learning for solving Dynamic Programming problems • The Critic is charged with the task of estimating J for a particular control policy π • The Critic’s knowledge about J, in turn, allows us to improve the control policy π • This process is iterated until the optimal J surface, J*, is found along with the associated optimal control policy π*
The Pole-Cart Problem • The dynamical system (plant) consists of a cart on a length of track with an inverted pendulum attached to it. • The control problem is to balance the inverted pendulum while keeping the cart near the center of the track by applying a horizontal force to the cart. • Pole Cart Animation
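The slides do not reproduce the plant equations, but this system is commonly modeled with the cart-pole equations of motion popularized by Barto, Sutton & Anderson (1983). A minimal Euler-integration sketch; the parameter values here are assumptions, not necessarily those used in the talk:

```python
import math

# Assumed parameters: cart mass, pole mass, pole half-length, gravity, time step.
M_CART, M_POLE, L_HALF, G, DT = 1.0, 0.1, 0.5, 9.8, 0.02

def cart_pole_step(x, x_dot, theta, theta_dot, force):
    """One Euler step of the standard cart-pole dynamics.
    theta is the pole angle from vertical; force is the horizontal push."""
    total_m = M_CART + M_POLE
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    tmp = (force + M_POLE * L_HALF * theta_dot**2 * sin_t) / total_m
    theta_acc = (G * sin_t - cos_t * tmp) / (
        L_HALF * (4.0 / 3.0 - M_POLE * cos_t**2 / total_m))
    x_acc = tmp - M_POLE * L_HALF * theta_acc * cos_t / total_m
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)
```

With zero force, the upright equilibrium (all state variables zero) is a fixed point, and any small tilt grows, which is what makes the control problem nontrivial.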
Calculating the Model Jacobians • Analytically • Numerical approximation • Backpropagation
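For the "numerical approximation" option above, a generic central-difference sketch (this is a standard technique, not the talk's actual code):

```python
def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f: R^n -> R^m, evaluated at x.
    Returns m rows of n entries, J[i][j] = d f_i / d x_j."""
    n = len(x)
    m = len(f(x))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return J
```

In DHP, f would be the plant model mapping the current state (and control) to the next state; analytic derivatives or backpropagation through a neural network model are the faster alternatives when available.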
Defining a Utility Function • The utility function, together with the plant dynamics, defines the optimal control policy • For this example, I choose a utility that penalizes only the cart's position and the pole's angle • Note: there is no penalty for effort, horizontal velocity (the cart), or angular velocity (the pole)
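The slide's utility expression did not survive transcription; a plausible form consistent with the note above is a quadratic penalty on cart position and pole angle only. The quadratic shape and unit coefficients are assumptions. DHP trains on derivatives of utility, so the gradient is included:

```python
# Assumed quadratic utility consistent with the slide's note: only cart
# position x and pole angle theta are penalized; effort and velocities are not.
# State layout and unit coefficients are illustrative assumptions.
def utility(state):
    x, x_dot, theta, theta_dot = state
    return -(x**2 + theta**2)  # larger (less negative) is better

def utility_gradient(state):
    """dU/dstate, the 'utility derivative' that DHP needs at each step."""
    x, x_dot, theta, theta_dot = state
    return [-2.0 * x, 0.0, -2.0 * theta, 0.0]
```

The zero entries in the gradient reflect the note directly: velocities carry no penalty, so their partial derivatives vanish.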
Setting Up the DHP Training Loop • For each training iteration (step in time) • Measure the current state • Calculate the control to apply • Calculate the control Jacobian • Iterate the model • Calculate the model Jacobian • Calculate the utility derivative • Calculate the present lambda, λ(t) = ∂J(t)/∂x(t), from the critic • Calculate the future lambda, λ(t+1), from the critic at the predicted next state • Calculate the reinforcement signal for the controller • Train the controller • Calculate the desired target for the critic • Train the critic
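The core of the loop above can be sketched as follows. This is a generic illustration of the DHP update signals with assumed list-based interfaces, not the talk's actual software; since the example's utility has no effort penalty, ∂U/∂u = 0 and the controller's signal reduces to γ·Fᵤᵀλ(t+1):

```python
GAMMA = 0.95  # assumed discount factor

def dhp_signals(state, u, F_x, F_u, dU_dx, lam_next):
    """Compute the two DHP training signals for one time step.
    F_x[k][i] = d next_state[k] / d state[i]   (model Jacobian)
    F_u[k][j] = d next_state[k] / d u[j]       (control Jacobian)
    dU_dx     = utility derivative at the current state
    lam_next  = critic's estimate of dJ(t+1)/dx(t+1) at the next state"""
    n, m = len(state), len(u)
    # Desired target for the critic (Bellman's recursion differentiated
    # with respect to the state): lambda(t) = dU/dx + gamma * F_x^T lambda(t+1)
    lam_target = [dU_dx[i] + GAMMA * sum(F_x[k][i] * lam_next[k]
                                         for k in range(n))
                  for i in range(n)]
    # Reinforcement signal for the controller: dJ/du = gamma * F_u^T lambda(t+1)
    # (the dU/du term is zero here because effort carries no penalty).
    dJ_du = [GAMMA * sum(F_u[k][j] * lam_next[k] for k in range(n))
             for j in range(m)]
    return lam_target, dJ_du
```

The critic is then trained toward `lam_target` at the current state, and the controller is adjusted along `dJ_du`; the present lambda λ(t) enters as the critic's current output whose error is `lam_target - λ(t)`.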
Defining an Experiment • Define the neural network architectures for the action and critic networks • Define the constants to be used for the model • Set up the lesson plan • Define incremental steps in the learning process • Set up a test plan
Software Availability • This software is available to anyone who would like to make use of it • We also have software available for performing backpropagation through time (BPTT) experiments • Set up an appointment with me or come in during my office hours to get more information about the software