
Decision Making in Intelligent Systems Lecture 5



  1. Decision Making in Intelligent Systems, Lecture 5 BSc course Kunstmatige Intelligentie 2007 Bram Bakker Intelligent Systems Lab Amsterdam Informatics Institute Universiteit van Amsterdam bram@science.uva.nl

  2. Overview of this lecture • Solving the full RL problem • Given that we do not have the MDP model • Temporal Difference (TD) methods • Making TD methods more efficient with eligibility traces • Making TD methods more efficient with function approximation

  3. Simplest TD Method [figure: TD(0) backup diagram]
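
The update behind this slide is the standard tabular TD(0) rule, V(s) ← V(s) + α[r + γV(s') − V(s)]. A minimal sketch in Python, assuming V is a table of state values (e.g. a defaultdict(float)) and the transition (s, r, s') has just been observed:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Tabular TD(0): update V(s) from one observed transition (s, r, s_next).
    td_target = r + gamma * V[s_next]   # bootstrap on the next state's current estimate
    td_error = td_target - V[s]         # the TD error, delta_t
    V[s] += alpha * td_error
    return td_error
```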

  4. TD methods bootstrap and sample • Bootstrapping: update involves an estimate • MC does not bootstrap • DP bootstraps • TD bootstraps • Sampling: update does not involve an expected value • MC samples • DP does not sample • TD samples

  5. Learning An Action-Value Function

  6. Sarsa: On-Policy TD Control Turn TD into a control method by always updating the policy to be (epsilon-)greedy with respect to the current estimate. S A R S A: State Action Reward State Action
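
A sketch of one Sarsa step under these assumptions: Q is a table indexed by (state, action) pairs (e.g. a defaultdict(float)), and epsilon_greedy is an illustrative helper, not something defined in the slides:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # Behave greedily w.r.t. Q most of the time, explore with probability epsilon.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    # On-policy: the target uses the action a_next actually chosen by the policy.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```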

  7. Windy Gridworld undiscounted, episodic, reward = –1 until goal

  8. Results of Sarsa on the Windy Gridworld

  9. Q-Learning: Off-Policy TD Control
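
For comparison with Sarsa, a sketch of the Q-learning update under the same assumed Q-table representation; the only change is that the target maximizes over next actions instead of using the action the behaviour policy will actually take:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0):
    # Off-policy: the target bootstraps on the greedy (max) action in s_next.
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```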

  10. Cliffwalking • ε-greedy, ε = 0.1

  11. Summary one-step tabular TD methods • Introduced one-step tabular model-free TD methods • Extend prediction to control by employing some form of GPI • On-policy control: Sarsa • Off-policy control: Q-learning • These methods bootstrap and sample, combining aspects of DP and MC methods

  12. End of first part of the course • This concludes the material for the midterm exam (chapters 1-6 of Sutton & Barto) • Midterm exam (deeltoets): 25 March, 9:00-12:00, building B/B-B (Nieuwe Achtergracht 166, exam hall)

  13. The Book • Part I: The Problem • Introduction • Evaluative Feedback • The Reinforcement Learning Problem • Part II: Elementary Solution Methods • Dynamic Programming • Monte Carlo Methods • Temporal Difference Learning • Part III: A Unified View • Eligibility Traces • Generalization and Function Approximation • Planning and Learning • Dimensions of Reinforcement Learning • Case Studies

  14. TD(λ) • New variable called eligibility trace e • On each step, decay all traces by γλ and increment the trace for the current state by 1 • Accumulating trace

  15. Prediction: On-line Tabular TD(λ)
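
A sketch of one episode of on-line tabular TD(λ) with accumulating traces. The env.reset()/env.step() interface and the policy callable are assumptions for illustration, not part of the slides:

```python
from collections import defaultdict

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.9, lam=0.8):
    # One episode of on-line tabular TD(lambda) with accumulating traces.
    # V is assumed to be a defaultdict(float) or a pre-initialised table.
    e = defaultdict(float)                 # eligibility trace per state
    s, done = env.reset(), False
    while not done:
        s_next, r, done = env.step(policy(s))
        delta = r + gamma * V[s_next] * (not done) - V[s]
        e[s] += 1.0                        # accumulating trace: increment current state
        for state in list(e):
            V[state] += alpha * delta * e[state]   # update in proportion to eligibility
            e[state] *= gamma * lam                # decay every trace by gamma*lambda
        s = s_next
```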

  16. Standard one-step TD • λ = 0 gives standard one-step TD learning, i.e. TD(0)

  17. Eligibility traces: backward view • Shout δ_t backwards over time • The strength of your voice decreases with temporal distance by γλ per step

  18. Control: Sarsa(λ) • Save eligibility for state-action pairs instead of just states

  19. Sarsa(λ) Algorithm
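
A corresponding sketch of tabular Sarsa(λ), with traces kept per (state, action) pair; it reuses the assumed env interface and the epsilon_greedy helper from the Sarsa sketch above:

```python
from collections import defaultdict

def sarsa_lambda_episode(env, Q, actions, alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1):
    # One episode of tabular Sarsa(lambda) with accumulating traces over (s, a) pairs.
    e = defaultdict(float)
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, epsilon)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, actions, epsilon)
        delta = r + gamma * Q[(s_next, a_next)] * (not done) - Q[(s, a)]
        e[(s, a)] += 1.0                   # mark the visited state-action pair
        for sa in list(e):
            Q[sa] += alpha * delta * e[sa]
            e[sa] *= gamma * lam
        s, a = s_next, a_next
```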

  20. Sarsa(λ) Gridworld Example • With one trial, the agent has much more information about how to get to the goal • not necessarily the best way • Can considerably accelerate learning

  21. Watkins’s Q(λ)
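
Watkins’s Q(λ) combines Q-learning’s greedy target with eligibility traces, but cuts all traces to zero whenever the behaviour policy takes an exploratory (non-greedy) action, since the backed-up information is then no longer about the greedy policy. A sketch of one step, with the same assumed Q-table and trace dictionary as above:

```python
def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next, actions,
                          alpha=0.1, gamma=1.0, lam=0.9):
    # One step of Watkins's Q(lambda). a_next is the action the behaviour policy
    # will actually take in s_next; a_star is the greedy action used in the target.
    a_star = max(actions, key=lambda act: Q[(s_next, act)])
    delta = r + gamma * Q[(s_next, a_star)] - Q[(s, a)]
    e[(s, a)] += 1.0
    for sa in list(e):
        Q[sa] += alpha * delta * e[sa]
        if a_next == a_star:
            e[sa] *= gamma * lam           # greedy step: decay traces as usual
        else:
            e[sa] = 0.0                    # exploratory step: cut all traces
    return delta
```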

  22. Replacing Traces • Using accumulating traces, frequently visited states can have eligibilities greater than 1 • This can be a problem for convergence • Replacing traces: Instead of adding 1 when you visit a state, set that trace to 1
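
The difference between the two trace types is only in how the visited state's trace is set; a small sketch, assuming traces are kept in a dict e:

```python
def mark_visit(e, s, gamma, lam, replacing=True):
    # Decay every trace by gamma*lambda, then mark the state just visited.
    for state in list(e):
        e[state] *= gamma * lam
    if replacing:
        e[s] = 1.0      # replacing trace: reset to 1, so it never exceeds 1
    else:
        e[s] += 1.0     # accumulating trace: can grow above 1 for frequently visited states
```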

  23. Replacing Traces Example • Same 19-state random walk task as before • Replacing traces perform better than accumulating traces over more values of λ

  24. Overview of this lecture • Solving the full RL problem • Given that we do not have the MDP model • Temporal Difference (TD) methods • Making TD methods more efficient with eligibility traces • Making TD methods more efficient with function approximation

  25. Generalization and Function Approximation • Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part. • Overview of function approximation (FA) methods and how they can be adapted to RL

  26. Generalization illustration [figure: a lookup table stores a separate value V for each state s1, s2, s3, ..., sN, while a generalizing function approximator trained on one region ("train here") also changes the values of other states]

  27. Generalization illustration cont. So with function approximation, a single value update affects a larger region of the state space.

  28. Value Prediction with FA Before, value functions were stored in lookup tables.

  29. Value Prediction with FA

  30. Adapt Supervised Learning Algorithms [diagram: inputs → supervised learning system → outputs; training info = desired (target) outputs] • Training example = {input (state), target output} • Error = (target output – actual output)

  31. Backups as Training Examples • As a training example: the input is a description of the state and the target output is the backed-up value estimate
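
In code, forming such a training example from a one-step TD backup might look like the sketch below; features and the value function V are assumed callables, introduced here only for illustration:

```python
def td_training_example(features, V, s, r, s_next, gamma=0.9):
    # Turn one TD(0) backup into a supervised-style training pair:
    # input = feature description of the state,
    # target output = the backed-up value estimate r + gamma * V(s_next).
    x = features(s)
    target = r + gamma * V(s_next)
    return x, target
```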

  32. Any FA Method? • In principle, yes: • artificial neural networks • decision trees • multivariate regression methods • etc. • But RL has some special requirements: • usually want to learn while interacting • ability to handle “moving targets”

  33. Gradient Descent Methods

  34. Performance Measures • Many are applicable but… • a common and simple one is the mean-squared error (MSE) over a distribution P: MSE(θ_t) = Σ_s P(s) [V^π(s) − V_t(s)]² • P is the distribution of states at which backups are done. • The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.

  35. Gradient Descent Iteratively move down the gradient:

  36. Gradient Descent Cont. For the MSE given above and using the chain rule:

  37. Gradient Descent Cont. Use just the sample gradient instead: θ_{t+1} = θ_t + α [V^π(s_t) − V_t(s_t)] ∇_θ V_t(s_t). Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if α decreases appropriately with t.
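
A sketch of this sample-based update with an array-valued parameter vector (e.g. a NumPy array); value_fn and grad_fn stand in for the approximator and its gradient and are assumptions, not an API from the slides:

```python
def gradient_value_update(theta, value_fn, grad_fn, s, target, alpha=0.01):
    # Sample gradient-descent step on the squared error (target - V(s; theta))^2:
    #   theta <- theta + alpha * (target - V(s; theta)) * grad_theta V(s; theta)
    error = target - value_fn(theta, s)
    return theta + alpha * error * grad_fn(theta, s)
```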

  38. But We Don’t have these Targets

  39. What about TD(λ) Targets?

  40. On-Line Gradient-Descent TD(λ)

  41. Linear Methods

  42. Nice Properties of Linear FA Methods • The gradient is very simple: ∇_θ V_t(s) = φ_s, the feature vector of s • Linear gradient-descent TD(λ) converges if: • the step size decreases appropriately • sampling is on-line (states sampled from the on-policy distribution) • Converges to a parameter vector θ_∞ with MSE(θ_∞) ≤ (1 − γλ)/(1 − γ) · MSE(θ*), where θ* is the best parameter vector (Tsitsiklis & Van Roy, 1997)
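
With a linear approximator the gradient is just the feature vector, so gradient-descent TD(λ) reduces to a few vector operations. A sketch of one step, assuming NumPy feature vectors φ(s):

```python
import numpy as np

def linear_td_lambda_step(theta, e, phi_s, phi_s_next, r,
                          alpha=0.01, gamma=0.9, lam=0.8):
    # Linear value function V(s) = theta . phi(s), so grad_theta V(s) = phi(s);
    # e is the eligibility trace vector over parameters (same shape as theta).
    delta = r + gamma * theta @ phi_s_next - theta @ phi_s
    e = gamma * lam * e + phi_s        # decay traces, then add the gradient phi(s)
    theta = theta + alpha * delta * e
    return theta, e

# Example: 8 binary features, one transition with reward 1.
theta, e = np.zeros(8), np.zeros(8)
phi_s, phi_s_next = np.eye(8)[0], np.eye(8)[1]
theta, e = linear_td_lambda_step(theta, e, phi_s, phi_s_next, r=1.0)
```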

  43. Control with FA • Learning state-action values: training examples of the form {description of (s_t, a_t), target v_t} • The general gradient-descent rule: θ_{t+1} = θ_t + α [v_t − Q_t(s_t, a_t)] ∇_θ Q_t(s_t, a_t) • Gradient-descent Sarsa(λ) (backward view): θ_{t+1} = θ_t + α δ_t e_t, with δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t) and e_t = γλ e_{t−1} + ∇_θ Q_t(s_t, a_t)

  44. GPI with Linear Gradient Descent Watkins’s Q(λ)

  45. GPI with Linear Gradient Descent Sarsa(λ)

  46. States as feature vectors But how should the state features be constructed?

  47. Coarse Coding

  48. Shaping Generalization in Coarse Coding

  49. Tile Coding • Binary feature for each tile • Number of features present at any one time is constant • Binary features means weighted sum easy to compute • Easy to compute indices of the features present
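
A sketch of how tile indices might be computed for a 2-D state with several offset tilings; the grid size, offsets, and index layout here are illustrative choices, not taken from the slides. Because the features are binary and exactly one tile per tiling is active, the approximate value is just a sum of a few weights:

```python
import math

def active_tiles(x, y, num_tilings=8, tiles_per_dim=10,
                 limits=((0.0, 1.0), (0.0, 1.0))):
    # Return one active tile index per tiling for a 2-D state (x, y).
    # Each tiling is shifted by a fraction of a tile width, so exactly
    # num_tilings binary features are active for any state.
    (x_lo, x_hi), (y_lo, y_hi) = limits
    x_scaled = (x - x_lo) / (x_hi - x_lo) * tiles_per_dim
    y_scaled = (y - y_lo) / (y_hi - y_lo) * tiles_per_dim
    tiles_per_tiling = (tiles_per_dim + 1) ** 2   # extra row/column absorbs the offsets
    indices = []
    for t in range(num_tilings):
        offset = t / num_tilings
        col = int(math.floor(x_scaled + offset))
        row = int(math.floor(y_scaled + offset))
        indices.append(t * tiles_per_tiling + row * (tiles_per_dim + 1) + col)
    return indices

# With binary features, the approximate value is just a sum of a few weights:
#   V(s) = sum(theta[i] for i in active_tiles(x, y))
```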

  50. Tile Coding Cont. • Irregular tilings • CMAC: “Cerebellar model arithmetic computer” (Albus, 1971)
