
Presentation Transcript


  1. ECE 517: Reinforcement Learning in Artificial Intelligence. Lecture 20: Approximate & Neuro-Dynamic Programming, Policy Gradient Methods. November 15, 2010. Dr. Itamar Arel, College of Engineering, Department of Electrical Engineering and Computer Science, The University of Tennessee. Fall 2010

  2. Introduction
  • We will discuss methods that heuristically approximate the dynamic programming problem:
    • Approximate Dynamic Programming (ADP)
    • Direct Neuro-Dynamic Programming
  • Assumptions:
    • DP methods assume fully observed systems
    • ADP methods assume no model and (usually) only partial observability (POMDPs)
  • Relationship to classical control theory:
    • Optimal control: in the linear case this is a solved problem; the state vector is estimated using Kalman-filter methodologies
    • Adaptive control answers the question: what can we do when the dynamics of the system are unknown?
      • We have a model of the plant but lack its parameter values
      • Often focuses on stability properties rather than performance

  3. Reference to classic control theories (cont.)
  • Robust control
    • Attempts to find a controller design that guarantees stability, i.e. that the plant will not "blow up"
    • Regardless of what the unknown parameter values are
    • e.g. Lyapunov-based analysis (e.g. queueing systems)
  • Adaptive control
    • Attempts to adapt the controller in real time, based on real-time observations of how the plant actually behaves
    • ADP may be viewed as an adaptive control framework
    • "Neural observers" are used to predict the next set of observations, based on which the controller acts

  4. Core principles of ADP
  • Three general principles are at the core of ADP:
    • Value approximation: instead of solving for V(s) exactly, we use a universal approximation function V(s, W) ≈ V(s) (a minimal sketch follows below)
    • Alternate starting points: instead of always starting from the Bellman equation directly, we can start from related recurrence equations
    • Hybrid design: combining multiple ADP systems into more complex hybrid designs
      • Usually in order to scale better
      • Mixtures of continuous and discrete variables
      • Multiple spatio-temporal scales
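To make the value-approximation principle concrete, below is a minimal sketch of a linear approximator V(s, W) = W·φ(s) trained with semi-gradient TD(0). The feature map, step sizes, and all names are illustrative assumptions, not details from the lecture.

```python
import numpy as np

# Minimal sketch of value approximation: V(s, W) = W . phi(s), updated with
# semi-gradient TD(0). phi_s / phi_s_next stand in for feature vectors of
# successive states; all names and constants here are illustrative.

def td0_update(W, phi_s, phi_s_next, reward, gamma=0.99, alpha=0.01):
    """One semi-gradient TD(0) step on the weight vector W."""
    v_s = W @ phi_s              # approximate V(s, W)
    v_next = W @ phi_s_next      # approximate V(s', W)
    td_error = reward + gamma * v_next - v_s
    # For a linear approximator, the gradient of V(s, W) w.r.t. W is phi(s).
    return W + alpha * td_error * phi_s

# Illustrative usage with random feature vectors.
rng = np.random.default_rng(0)
W = np.zeros(8)
phi_s, phi_s_next = rng.normal(size=8), rng.normal(size=8)
W = td0_update(W, phi_s, phi_s_next, reward=1.0)
```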

  5. Direct Neuro-Dynamic Programming (Direct NDP)
  • Motivation
    • The intuitive appeal of Reinforcement Learning, in particular the actor/critic design
    • The power of the calculus of variations, used in the form of backpropagation to solve optimal control problems
    • Can inherently deal with POMDPs (using RNNs, for example)
  • The method is considered "direct" in that
    • It has no explicit state representation
    • Temporal progression: everything is a function of time rather than of state/action sets
  • It is also model-free, as it does not assume a model or attempt to directly estimate the model's dynamics/structure

  6. Direct NDP Architecture
  • Critic network: estimates the future reward-to-go J (i.e. the value function)
  • Action network: adjusts the action to minimize the difference between the estimated J and the ultimate objective Uc (a sketch of the two error signals follows below)
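As a rough illustration of how the two networks are trained, the sketch below writes down the two error signals that common direct-NDP formulations use: the critic is fit to the recurrence between successive reward-to-go estimates (the J(t-1) - r(t) signal appearing in the next slide's diagram), and the action network is pushed so that J(t) approaches the ultimate objective Uc. The discount factor, default Uc, and names are assumptions made for illustration, not the exact formulation from the lecture.

```python
# Hedged sketch of the two direct-NDP training signals; the network
# architectures are omitted, and the gamma / U_c defaults are illustrative.

def critic_error(J_t, J_prev, r_t, gamma=0.95):
    """Critic error: mismatch in the recurrence J(t-1) ~ r(t) + gamma * J(t)."""
    return gamma * J_t - (J_prev - r_t)

def actor_error(J_t, U_c=0.0):
    """Actor error: gap between the predicted reward-to-go J(t) and the
    ultimate objective U_c that the action network tries to close."""
    return J_t - U_c

def losses(J_t, J_prev, r_t, gamma=0.95, U_c=0.0):
    """Squared-error losses minimized by the critic and action networks."""
    return (0.5 * critic_error(J_t, J_prev, r_t, gamma) ** 2,
            0.5 * actor_error(J_t, U_c) ** 2)
```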

  7. Direct NDP vs. Classic RL
  • [Block diagram: the direct-NDP agent, an Action Network producing u(t) and a Critic Network producing J(t), trained with the signal J(t-1) - r(t) and the ultimate objective Uc(t), shown alongside the classic RL agent-environment loop of state, action, and reward.]

  8. Inverted Helicopter Flight (Ng et al., Stanford, 2004)

  9. Solving POMDPs with RNNs
  • Case study: a framework for obtaining an optimal policy in model-free POMDPs using Recurrent Neural Networks
  • Uses an NDP version of Q-Learning
  • TRTRL is employed (an efficient version of RTRL)
  • Goal: investigate a scenario in which two states have the same observation (yet different optimal actions); the sketch below shows how a recurrent hidden state can tell them apart
  • Method: RNNs in a TD framework (more later)
  • The model is unknown!
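A minimal sketch of why a recurrent network can handle the aliased-observation scenario above: the Q-values depend on a hidden state that summarizes the observation history, so two states with identical observations can still produce different Q-values. The shapes, weight names, and tanh nonlinearity are illustrative assumptions, not details of the TRTRL-based implementation.

```python
import numpy as np

# Illustrative RNN step for Q-learning under partial observability: the
# hidden state carries history, so identical observations reached through
# different histories can map to different Q-values (and actions).

def rnn_q_step(obs, hidden, W_in, W_rec, W_out):
    hidden = np.tanh(W_in @ obs + W_rec @ hidden)   # history-dependent state
    q_values = W_out @ hidden                       # one Q output per action
    return q_values, hidden

# Illustrative shapes: 4-dim observation, 10 hidden neurons, 3 actions.
rng = np.random.default_rng(1)
W_in, W_rec, W_out = (rng.normal(size=(10, 4)),
                      rng.normal(size=(10, 10)),
                      rng.normal(size=(3, 10)))
q, h = rnn_q_step(rng.normal(size=4), np.zeros(10), W_in, W_rec, W_out)
```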

  10. Direct NDP architecture using RNNs
  • [Block diagram: the observation Ot feeds an RNN that approximates Q(st, at); a softmax over the Q outputs selects the final action at, the environment returns rt, and a TD signal trains the RNN.]
  • The method is good for small action sets. Q: why? (A softmax sketch follows below.)
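Below is a sketch of softmax (Boltzmann) action selection over the RNN's Q outputs. It also hints at the answer to the question above: the network needs one Q output per action, so the scheme is only practical for small, finite action sets. The temperature value and function name are illustrative.

```python
import numpy as np

# Softmax (Boltzmann) selection over Q-values; the temperature controls how
# greedy the choice is. Requires one Q output per discrete action.

def softmax_action(q_values, temperature=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()                              # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)

action = softmax_action([0.2, 1.5, -0.3])     # e.g. a 3-action set
```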

  11. Simulation results – 10 neurons

  12. Training Robots (1st-gen AIBOs) to walk (faster)
  • 1st-generation AIBOs were used (internal CPU)
  • Fundamental motor capabilities were prescribed
    • e.g. apply torque to a given joint, turn in a given direction
    • In other words, a finite action set
  • Observations were limited to distance (a.k.a. a "radar view")
  • The goal was to cross the field in a short time
    • The reward grew increasingly negative as time progressed
    • A large positive reward was given when the goal was met
  • Multiple robots were trained to observe variability in the learning process

  13. The general RL approach revisited
  • RL will solve all of your problems, but...
    • We need lots of experience to train from
    • Taking random actions can be dangerous
    • It can take a long time to learn
    • Not all problems fit into the NDP framework
  • An alternative approach to RL is to reward whole policies, rather than individual actions
    • Run the whole policy, then receive a single reward
    • The reward measures the success of the entire policy
  • If there is a small number of policies, we can exhaustively try them all
    • However, this is not possible in most interesting problems

  14. Policy Gradient Methods
  • Assume that our policy, π, has a set of n real-valued parameters, θ = {θ1, θ2, θ3, ..., θn}
  • Running the policy with a particular θ results in a reward, rθ
  • Estimate the reward gradient, ∂r/∂θi, for each θi
  • Update each parameter along the estimated gradient, θi ← θi + α ∂r/∂θi, where the step size α is another learning rate (a finite-difference sketch follows below)
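One simple way to estimate ∂r/∂θi is by finite differences: perturb each parameter in turn, rerun the policy, and difference the rewards. The sketch below assumes a hypothetical run_policy(theta) that returns the reward of one rollout; the function name, step sizes, and one-sided differencing are illustrative choices, not the lecture's prescription.

```python
import numpy as np

# Finite-difference estimate of the reward gradient dr/dtheta_i.
# run_policy(theta) is a hypothetical rollout returning a scalar reward.

def estimate_gradient(run_policy, theta, eps=1e-2):
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    r0 = run_policy(theta)
    for i in range(len(theta)):
        perturbed = theta.copy()
        perturbed[i] += eps
        grad[i] = (run_policy(perturbed) - r0) / eps   # one-sided difference
    return grad

def gradient_ascent_step(run_policy, theta, lr=0.1, eps=1e-2):
    """theta <- theta + lr * estimated gradient; lr is the learning rate."""
    return np.asarray(theta, dtype=float) + lr * estimate_gradient(run_policy, theta, eps)
```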

  15. Policy Gradient Methods (cont.)
  • This amounts to hill-climbing in policy space
    • So it is subject to all the problems of hill-climbing
    • But we can also use tricks from search theory, like random restarts and momentum terms (see the momentum sketch below)
  • This is a good approach if you have a parameterized policy
    • Assume we have a "reasonable" starting policy
    • Typically faster than value-based methods
    • "Safe" exploration, if you have a good starting policy
    • Learns the locally best parameters for that policy
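A momentum term, as mentioned above, can be added to the same hill-climbing update so that a single noisy reward estimate does not dominate the step direction; the names velocity and beta are illustrative.

```python
import numpy as np

# Momentum variant of the hill-climbing update: keep a running "velocity"
# that smooths successive (noisy) gradient estimates.

def momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    velocity = beta * np.asarray(velocity) + lr * np.asarray(grad)
    return np.asarray(theta) + velocity, velocity
```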

  16. An Example: Learning to Walk
  • RoboCup 4-legged league
    • Walking quickly is a big advantage
    • Until recently, the gait was tuned manually
  • The robots have a parameterized gait controller
    • 12 parameters
    • Controls step length, height, etc.
  • The robot walks across the soccer field and is timed
    • The reward is a function of the time taken
    • The robots know when to stop (distance measure)

  17. An Example: Learning to Walk (cont.)
  • Basic idea (a runnable sketch of one iteration follows below):
    1. Pick an initial θ = {θ1, θ2, ..., θ12}
    2. Generate N test parameter settings by perturbing θ: θ^j = {θ1 + δ1, θ2 + δ2, ..., θ12 + δ12}, with each δi drawn from {-ε, 0, +ε}
    3. Test each setting and observe the rewards: θ^j → r^j
    4. For each θi in θ, calculate the average rewards Avg(+ε), Avg(0), and Avg(-ε), where e.g. Avg(-ε) is the average reward over the settings in which θ^j_i = θi - ε; set θ'i one step toward the best of the three averages (leaving θi unchanged if Avg(0) is highest)
    5. Set θ ← θ', and go to step 2
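Here is a runnable sketch of one iteration of the perturbation-based search described above, in the spirit of the AIBO gait-optimization procedure. evaluate_gait is a hypothetical function that times one walk and returns its reward; N, ε, and the step size are illustrative constants.

```python
import numpy as np

# One iteration of the perturbation-based policy search sketched above.
# evaluate_gait(theta) is a hypothetical rollout returning the gait's reward.

def policy_search_step(theta, evaluate_gait, N=15, eps=0.05, step=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta, dtype=float)
    deltas = rng.choice([-eps, 0.0, eps], size=(N, len(theta)))  # N perturbations
    rewards = np.array([evaluate_gait(theta + d) for d in deltas])

    new_theta = theta.copy()
    for i in range(len(theta)):
        # Average reward over the settings where theta_i was perturbed by -eps, 0, +eps.
        avgs = {}
        for sign in (-eps, 0.0, eps):
            mask = deltas[:, i] == sign
            avgs[sign] = rewards[mask].mean() if mask.any() else -np.inf
        if avgs[0.0] >= max(avgs[-eps], avgs[eps]):
            continue                        # leave theta_i unchanged
        new_theta[i] += step if avgs[eps] > avgs[-eps] else -step
    return new_theta                        # then repeat from the perturbation step
```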

  18. An Example: Learning to Walk (cont.)
  • Q: Can we translate the policy gradient method into a direct-policy, actor/critic Neuro-Dynamic Programming system?
  • [Videos: the initial gait vs. the final learned gait.]

  19. Value Function or Policy Gradient?
  • When should I use policy gradient?
    • When there is a parameterized policy
    • When there is a high-dimensional state space
    • When we expect the gradient to be smooth
    • Typically on episodic tasks (e.g. AIBO walking)
  • When should I use a value-based method?
    • When there is no parameterized policy
    • When we have no idea how to solve the problem (i.e. no known structure)

  20. Direct NDP with RNNs – Backpropagation through a model
  • RNNs have memory and can create temporal context
    • Applies to both the actor and the critic
  • Much harder to train (time and logic/memory resources)
    • e.g. RTRL issues

  21. Consolidated Actor-Critic Model (Z. Liu, I. Arel, 2007)
  • A single network (feedforward or recurrent) is sufficient for both the actor and critic functions
  • A two-pass (TD-style) procedure corrects both the action and the value estimate
  • Training via standard techniques, e.g. backpropagation
