
An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING



  1. An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING
  Andrew G. Barto, Department of Computer Science, University of Massachusetts – Amherst
  Lecture 3
  Autonomous Learning Laboratory – Department of Computer Science

  2. The Overall Plan
  • Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes
  • Lecture 2: Dynamic Programming; Basic Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience
  • Lecture 3: Function approximation; Model-based methods; Dimensions of Reinforcement Learning
  A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998.

  3. The Overall Plan (outline repeated from slide 2)

  4. Lecture 3, Part 1: Generalization and Function Approximation
  Objectives of this part:
  • Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part
  • Overview of function approximation (FA) methods and how they can be adapted to RL

  5. Value Prediction with FA
  As usual, Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function.
  In earlier chapters, value functions were stored in lookup tables; here the value function is represented by a parameterized function approximator.
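The equations on this slide appeared as images in the original deck. Reconstructed in LaTeX following the standard Sutton and Barto (1998) presentation, the setup is

  V_t(s) \approx V^{\pi}(s), \qquad V_t \text{ determined by a parameter vector } \vec{\theta}_t \in \mathbb{R}^n, \quad n \ll |\mathcal{S}|,

so changing one component of \vec{\theta}_t changes the estimated value of many states at once.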

  6. Adapt Supervised Learning Algorithms
  [Diagram: Inputs → Supervised Learning System → Outputs, with Training Info = desired (target) outputs fed to the system]
  Training example = {input, target output}
  Error = (target output – actual output)

  7. Backups as Training Examples
  Each backup can be viewed as a training example: the input is the state being backed up, and the target output is the backed-up value.
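The example itself was an image in the deck. For a TD(0) backup of state s_t, the corresponding training pair is

  \text{input: description of } s_t, \qquad \text{target output: } r_{t+1} + \gamma V_t(s_{t+1}),

and for a Monte Carlo backup the target output is the complete return from s_t.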

  8. Any FA Method?
  • In principle, yes:
    • artificial neural networks
    • decision trees
    • multivariate regression methods
    • etc.
  • But RL has some special requirements:
    • usually want to learn while interacting
    • ability to handle nonstationarity
    • other?

  9. Gradient Descent Methods

  10. Performance Measures
  • Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P (see the reconstruction below).
  • Why P? Why minimize MSE?
  • Let us assume that P is always the distribution of states with which backups are done.
  • The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
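The MSE expression on this slide was an image; in LaTeX it reads

  \mathrm{MSE}(\vec{\theta}_t) = \sum_{s \in S} P(s)\,\bigl[V^{\pi}(s) - V_t(s)\bigr]^2,

where P weights the errors of the different states.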

  11. Gradient Descent
  Iteratively move down the gradient:

  12. Gradient Descent Cont.
  For the MSE given above and using the chain rule:

  13. Gradient Descent Cont.
  Use just the sample gradient instead. Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if the step size α decreases appropriately with t.
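The update equations on slides 11–13 were images; reconstructed from the standard presentation, batch gradient descent on the MSE and its per-sample (stochastic) version are

  \vec{\theta}_{t+1} = \vec{\theta}_t - \tfrac{1}{2}\alpha\,\nabla_{\vec{\theta}_t}\mathrm{MSE}(\vec{\theta}_t)
                     = \vec{\theta}_t + \alpha \sum_{s} P(s)\,\bigl[V^{\pi}(s) - V_t(s)\bigr]\,\nabla_{\vec{\theta}_t} V_t(s),

  \vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\,\bigl[v_t - V_t(s_t)\bigr]\,\nabla_{\vec{\theta}_t} V_t(s_t),

where v_t is an unbiased sample of V^{\pi}(s_t), e.g., a Monte Carlo return.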

  14. But We Don’t have these Targets

  15. What about TD(λ) Targets?
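The TD(λ) update shown on this slide was an image; in the backward view it is

  \vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\,\delta_t\,\vec{e}_t, \qquad
  \delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), \qquad
  \vec{e}_t = \gamma\lambda\,\vec{e}_{t-1} + \nabla_{\vec{\theta}_t} V_t(s_t).

Note that for λ < 1 the λ-return target is not an unbiased estimate of V^{\pi}(s_t), so the convergence argument for the sample-gradient method does not carry over directly.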

  16. On-Line Gradient-Descent TD(λ)
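The algorithm box on this slide was an image. Below is a minimal Python sketch of on-line gradient-descent TD(λ) with linear function approximation and accumulating traces; the `features(s)` function, the `env` interface with `reset()`/`step(a)`, and the fixed `policy(s)` are illustrative assumptions, not part of the slides.

```python
import numpy as np

def gradient_td_lambda(env, policy, features, n_features,
                       alpha=0.01, gamma=1.0, lam=0.9, n_episodes=100):
    """On-line gradient-descent TD(lambda) for policy evaluation
    with a linear value function V(s) = theta . features(s)."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        e = np.zeros(n_features)              # eligibility trace vector
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            phi = features(s)
            v = theta @ phi
            v_next = 0.0 if done else theta @ features(s_next)
            delta = r + gamma * v_next - v    # TD error
            e = gamma * lam * e + phi         # accumulate trace (gradient of V is phi)
            theta += alpha * delta * e        # gradient-descent TD(lambda) update
            s = s_next
    return theta
```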

  17. Linear Methods
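The defining formulas for linear methods were images in the deck; reconstructed, the approximate value function is linear in the parameters,

  V_t(s) = \vec{\theta}_t^{\,\top}\vec{\phi}_s = \sum_{i=1}^{n} \theta_t(i)\,\phi_s(i), \qquad \nabla_{\vec{\theta}_t} V_t(s) = \vec{\phi}_s,

where \vec{\phi}_s is the feature vector of state s.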

  18. Nice Properties of Linear FA Methods
  • The gradient is very simple: the gradient of V_t(s) is just the feature vector of s.
  • For MSE, the error surface is simple: a quadratic surface with a single minimum.
  • Linear gradient-descent TD(λ) converges if:
    • the step size decreases appropriately, and
    • states are sampled on-line from the on-policy distribution.
  • It converges to a parameter vector whose MSE is within a bounded factor of the best achievable MSE (Tsitsiklis & Van Roy, 1997).

  19. Coarse Coding

  20. Learning and Coarse Coding

  21. Tile Coding
  • Binary feature for each tile
  • Number of features present at any one time is constant
  • Binary features mean the weighted sum is easy to compute
  • Easy to compute the indices of the features present
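To make these bullet points concrete, here is a minimal sketch of grid tile coding for a 2-D continuous state; the function name, value ranges, and offset scheme are illustrative choices, not taken from the slides.

```python
import numpy as np

def tile_indices(x, y, n_tilings=8, tiles_per_dim=10,
                 x_range=(0.0, 1.0), y_range=(0.0, 1.0)):
    """Return one active (binary) feature index per tiling for point (x, y).
    Each tiling is a uniform grid shifted by a different fraction of a tile width.
    Total feature count: n_tilings * (tiles_per_dim + 1) ** 2."""
    active = []
    x_scale = tiles_per_dim / (x_range[1] - x_range[0])
    y_scale = tiles_per_dim / (y_range[1] - y_range[0])
    for t in range(n_tilings):
        offset = t / n_tilings                    # fractional tile offset for this tiling
        xi = int(np.floor((x - x_range[0]) * x_scale + offset))
        yi = int(np.floor((y - y_range[0]) * y_scale + offset))
        xi = min(max(xi, 0), tiles_per_dim)       # clip to the (padded) grid
        yi = min(max(yi, 0), tiles_per_dim)
        tiles_per_tiling = (tiles_per_dim + 1) ** 2
        active.append(t * tiles_per_tiling + yi * (tiles_per_dim + 1) + xi)
    return active                                 # exactly n_tilings features are active

# The approximate value is then a sum of one weight per active tile:
# V(s) = sum(theta[i] for i in tile_indices(*s))
```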

  22. Tile Coding Cont.
  • Irregular tilings
  • Hashing
  • CMAC: “Cerebellar Model Arithmetic Computer” (Albus, 1971)

  23. Radial Basis Functions (RBFs)
  e.g., Gaussians
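The formula on this slide was an image; a Gaussian RBF feature is

  \phi_s(i) = \exp\!\left( -\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2} \right),

where c_i is the center and \sigma_i the width of feature i. Unlike binary tile-coding features, RBF features vary smoothly between 0 and 1.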

  24. Can you beat the “curse of dimensionality”?
  • Can you keep the number of features from going up exponentially with the dimension?
  • Function complexity, not dimensionality, is the problem.
  • Kanerva coding:
    • Select a set of binary prototypes
    • Use Hamming distance as the distance measure
    • Dimensionality is no longer a problem, only complexity
  • “Lazy learning” schemes:
    • Remember all the data
    • To get a new value, find the nearest neighbors and interpolate
    • e.g., locally weighted regression

  25. Control with FA
  Learning state-action values: training examples are now of the form (state–action pair, target output).
  • The general gradient-descent rule carries over, with Q in place of V.
  • Gradient-descent Sarsa(λ) uses the backward view with eligibility traces (a sketch follows the next slide).

  26. Linear Gradient Descent Sarsa(λ)
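The algorithm box here was an image. Below is a minimal Python sketch of linear gradient-descent Sarsa(λ) with binary features; the hypothetical `active_features(s, a)` (e.g., tile-coding indices), the episodic `env` with `reset()`/`step(a)`, ε-greedy action selection, and replacing traces are illustrative choices.

```python
import numpy as np

def sarsa_lambda_linear(env, active_features, n_features, n_actions,
                        alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1,
                        n_episodes=500):
    """Linear gradient-descent Sarsa(lambda) with binary features and replacing traces."""
    theta = np.zeros(n_features)

    def q(s, a):
        return sum(theta[i] for i in active_features(s, a))

    def select_action(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        e = np.zeros(n_features)                # eligibility traces
        s = env.reset()
        a = select_action(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            delta = r - q(s, a)                 # start of the TD error
            for i in active_features(s, a):
                e[i] = 1.0                      # replacing traces on active features
            if not done:
                a_next = select_action(s_next)
                delta += gamma * q(s_next, a_next)
            theta += alpha * delta * e          # gradient-descent Sarsa(lambda) update
            e *= gamma * lam                    # decay traces
            if not done:
                s, a = s_next, a_next
    return theta
```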

  27. Mountain-Car Task

  28. Mountain-Car Results

  29. Baird’s Counterexample

  30. Baird’s Counterexample Cont.

  31. Should We Bootstrap?

  32. Summary
  • Generalization
  • Adapting supervised-learning function approximation methods
  • Gradient-descent methods
  • Linear gradient-descent methods
    • Radial basis functions
    • Tile coding
    • Kanerva coding
  • Nonlinear gradient-descent methods? Backpropagation?
  • Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction

  33. The Overall Plan (outline repeated from slide 2; next: Lecture 3, Part 2)

  34. Lecture 3, Part 2: Model-Based Methods
  Objectives of this part:
  • Use of environment models
  • Integration of planning and learning methods

  35. Models
  • Model: anything the agent can use to predict how the environment will respond to its actions
  • Distribution model: a description of all possibilities and their probabilities, e.g., the transition probabilities and expected rewards for every state-action pair
  • Sample model: produces sample experiences, e.g., a simulation model
  • Both types of models can be used to produce simulated experience
  • Often sample models are much easier to come by

  36. Planning
  • Planning: any computational process that uses a model to create or improve a policy
  • Planning in AI:
    • state-space planning
    • plan-space planning (e.g., partial-order planner)
  • We take the following (unusual) view:
    • all state-space planning methods involve computing value functions, either explicitly or implicitly
    • they all apply backups to simulated experience

  37. Planning Cont.
  • Classical DP methods are state-space planning methods
  • Heuristic search methods are state-space planning methods
  • A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning (sketched below)
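The boxed algorithm was an image; a minimal tabular sketch, assuming a hypothetical `sample_model(s, a)` that returns a sampled next state and reward, is:

```python
import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_steps,
               alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning:
    repeatedly back up Q-values using simulated experience from a sample model."""
    Q = defaultdict(float)                          # Q[(s, a)], initialized to 0
    for _ in range(n_steps):
        s = random.choice(states)                   # 1. pick a state-action pair at random
        a = random.choice(actions)
        s_next, r = sample_model(s, a)              # 2. query the sample model
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # 3. one-step Q-learning backup
    return Q
```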

  38. Learning, Planning, and Acting
  • Two uses of real experience:
    • model learning: to improve the model
    • direct RL: to directly improve the value function and policy
  • Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

  39. Direct vs. Indirect RL
  • Indirect (model-based) methods: make fuller use of experience; get a better policy with fewer environment interactions
  • Direct methods: simpler; not affected by bad models
  • But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel

  40. The Dyna Architecture (Sutton, 1990)

  41. The Dyna-Q Algorithm
  The algorithm interleaves direct RL, model learning, and planning (see the sketch below).
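The Dyna-Q pseudocode on this slide was an image. The sketch below follows the standard tabular Dyna-Q loop (direct RL, model learning, and n planning steps per real step); the deterministic model and the `env` interface with `reset()`/`step(a)` are assumptions for illustration.

```python
import random
from collections import defaultdict

def dyna_q(env, n_actions, n_episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: each real step triggers a Q-learning update (direct RL),
    a model update (model learning), and n_planning simulated updates (planning)."""
    Q = defaultdict(float)
    model = {}                                    # (s, a) -> (r, s', done); deterministic model

    def greedy(s):
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.randrange(n_actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # (a) direct RL: one-step Q-learning backup from real experience
            target = r + (0 if done else gamma * max(Q[(s_next, b)] for b in range(n_actions)))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (b) model learning: remember the observed transition
            model[(s, a)] = (r, s_next, done)
            # (c) planning: n backups from randomly replayed model transitions
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0 if pdone else gamma * max(Q[(ps_next, b)] for b in range(n_actions)))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
```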

  42. Dyna-Q on a Simple Maze
  Rewards are 0 on all steps until the goal is reached, when the reward is 1.

  43. Dyna-Q Snapshots: Midway in 2nd Episode

  44. Prioritized Sweeping
  • Which states or state-action pairs should be generated during planning?
  • Work backwards from states whose values have just changed:
    • Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
    • When a new backup occurs, insert predecessors according to their priorities
    • Always perform backups from the first pair in the queue
  • Moore and Atkeson, 1993; Peng and Williams, 1993
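As a concrete illustration of the queue mechanics in the bullets above, here is a minimal sketch of one planning phase for the deterministic tabular case; `Q` (a defaultdict of action values), `model`, and `predecessors` are assumed data structures built from experience (as in Dyna-Q), not part of the original slides.

```python
import heapq
import itertools

def prioritized_sweeping_planning(Q, model, predecessors, n_actions,
                                  s, a, alpha=0.1, gamma=0.95,
                                  theta=1e-4, n_planning=10):
    """One planning phase of prioritized sweeping after real experience with (s, a).
    Assumes a deterministic model: model[(s, a)] = (r, s_next);
    predecessors[s] = set of (s_bar, a_bar) pairs observed to lead to s."""
    tie = itertools.count()                      # tie-breaker so the heap never compares states

    def priority(state, action):
        r, s_next = model[(state, action)]
        target = r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
        return abs(target - Q[(state, action)])

    pq = []
    p = priority(s, a)
    if p > theta:
        heapq.heappush(pq, (-p, next(tie), (s, a)))

    for _ in range(n_planning):
        if not pq:
            break
        _, _, (ps, pa) = heapq.heappop(pq)       # pair with the largest expected change
        r, s_next = model[(ps, pa)]
        target = r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
        Q[(ps, pa)] += alpha * (target - Q[(ps, pa)])
        for (sb, ab) in predecessors.get(ps, ()):   # predecessors may now need backing up too
            pb = priority(sb, ab)
            if pb > theta:
                heapq.heappush(pq, (-pb, next(tie), (sb, ab)))
    return Q
```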

  45. Full and Sample (One-Step) Backups

  46. Full vs. Sample Backups
  Setting: b successor states, equally likely; initial error = 1; assume all next states’ values are correct.

  47. Trajectory Sampling
  • Trajectory sampling: perform backups along simulated trajectories
  • This samples from the on-policy distribution
  • Advantages when function approximation is used
  • Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored
  [Figure labels: initial states; irrelevant states; states reachable under optimal control]

  48. Heuristic Search
  • Used for action selection, not for changing a value function (= heuristic evaluation function)
  • Backed-up values are computed, but typically discarded
  • Extension of the idea of a greedy policy, only deeper
  • Also suggests ways to select states to back up: smart focusing

  49. Summary
  • Emphasized the close relationship between planning and learning
  • Important distinction between distribution models and sample models
  • Looked at some ways to integrate planning and learning
    • synergy among planning, acting, and model learning
  • Distribution of backups: focus of the computation
    • trajectory sampling: backing up along trajectories
    • prioritized sweeping
    • heuristic search
  • Size of backups: full vs. sample; deep vs. shallow

  50. The Overall Plan (outline repeated from slide 2)
