
Decision Making in Intelligent Systems Lecture 4


Presentation Transcript


  1. Decision Making in Intelligent Systems, Lecture 4 • BSc course Kunstmatige Intelligentie 2008 • Bram Bakker, Intelligent Systems Lab Amsterdam, Informatics Institute, Universiteit van Amsterdam • bram@science.uva.nl

  2. Overview of this lecture • Solving the full RL problem • Given that we have the MDP model • Dynamic Programming (DP) • Given that we do not have the MDP model • Monte Carlo (MC) methods • Temporal Difference (TD) methods

  3. Summary of last lecture: dynamic programming • Policy evaluation: estimate the value of the current policy • Policy improvement: determine a new current policy by being greedy w.r.t. the current value function • Policy iteration: alternate the above two processes • Value iteration: combine the above two processes into one step • Bootstrapping: updating estimates based on your own estimates • Full backups (to be contrasted with sample backups) • Generalized Policy Iteration (GPI)
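
To make the "full backup" idea concrete, here is a minimal Python sketch of iterative policy evaluation. The model format (P[s][a] as a list of (probability, next state, reward) triples, pi[s][a] as the action probability) and all names are assumptions chosen for illustration, not something given in the lecture.

```python
def policy_evaluation(P, pi, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation with full DP backups.

    Assumed (hypothetical) model format:
      P[s][a]  = list of (probability, next_state, reward) triples
                 (terminal states get an empty action dict)
      pi[s][a] = probability of taking action a in state s
    """
    V = {s: 0.0 for s in P}               # initialize all value estimates to 0
    while True:
        delta = 0.0
        for s in P:
            # Full backup: expectation over all actions and all successor states
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                 # stop when a full sweep changes little
            return V
```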

  4. Monte Carlo Methods • Monte Carlo methods learn from complete sample returns • Only defined for episodic tasks • Monte Carlo methods learn directly from experience • On-line: no model necessary, and still attains optimality • Simulated: no need for a full model

  5. Monte Carlo Policy Evaluation • Goal: learn Vπ(s) • Given: some number of episodes under π which contain s • Idea: average returns observed after visits to s • Every-visit MC: average returns for every time s is visited in an episode • First-visit MC: average returns only for the first time s is visited in an episode • Both converge asymptotically

  6. Monte Carlo [backup diagram]

  7. cf. Dynamic Programming [backup diagram]

  8. First-visit Monte Carlo policy evaluation
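
The algorithm box for this slide is an image in the original; below is a minimal sketch of first-visit MC policy evaluation. It assumes a hypothetical generate_episode(policy) helper that runs one complete episode under the policy and returns a list of (state, reward) pairs, each reward being the one received on leaving that state.

```python
from collections import defaultdict

def first_visit_mc(policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate V^pi(s) as the average return following the first visit to s."""
    returns = defaultdict(list)        # state -> observed first-visit returns
    V = {}
    for _ in range(num_episodes):
        episode = generate_episode(policy)   # [(s_0, r_1), (s_1, r_2), ...]
        G = 0.0
        # Walk the episode backwards, accumulating the return G_t
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            # First-visit check: record G only if s does not occur earlier
            if s not in [x[0] for x in episode[:t]]:
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```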

  9. Blackjack example • Object: have your card sum be greater than the dealer's without exceeding 21 • States (200 of them): • current sum (12-21) • dealer's showing card (ace-10) • do I have a usable ace? • Reward: +1 for winning, 0 for a draw, -1 for losing • Actions: stick (stop receiving cards), hit (receive another card) • Initial policy: stick if my sum is 20 or 21, else hit
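
As a small illustration, the initial policy of this example can be written down directly; the tuple layout of the state is an assumption made here for readability, not the encoding used in the lecture.

```python
# Hypothetical state encoding for the blackjack example:
# (player_sum 12-21, dealer_showing 1-10 with ace = 1, usable_ace True/False)
def initial_policy(state):
    player_sum, dealer_showing, usable_ace = state
    # Stick on 20 or 21, otherwise hit
    return 'stick' if player_sum >= 20 else 'hit'
```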

  10. Blackjack value functions

  11. Monte Carlo Estimation of Action Values (Q) • Monte Carlo is most useful when a model is not available • We want to learn Q*: state-action values • Qπ(s,a): the average return starting from state s, taking action a, and thereafter following π • Also converges asymptotically if every state-action pair is visited • Exploring starts: every state-action pair has a non-zero probability of being the starting pair

  12. Monte Carlo Control • MC policy iteration: Policy evaluation using MC methods followed by policy improvement • Policy improvement step: greedify with respect to value (or action-value) function

  13. Monte Carlo Exploring Starts
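
The Monte Carlo ES (exploring starts) algorithm box is again an image in the original slides; the sketch below conveys the idea, assuming a hypothetical generate_episode_es(policy) helper that starts each episode from a random state-action pair and then follows the current greedy policy, returning (state, action, reward) triples.

```python
from collections import defaultdict

def mc_exploring_starts(actions, generate_episode_es, num_episodes, gamma=1.0):
    """MC control with exploring starts: evaluate Q and greedify the policy."""
    Q = defaultdict(float)                 # (state, action) -> value estimate
    returns = defaultdict(list)
    policy = {}                            # state -> current greedy action
    for _ in range(num_episodes):
        # Exploring start: episode begins with a random (s, a), then follows policy
        episode = generate_episode_es(policy)       # [(s, a, r), ...]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            # First-visit check on the (state, action) pair
            if (s, a) not in [(x[0], x[1]) for x in episode[:t]]:
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                # Policy improvement: greedify at s
                policy[s] = max(actions, key=lambda b: Q[(s, b)])
    return policy, Q
```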

  14. Blackjack example continued • Exploring starts • Initial policy as described before

  15. On-policy Monte Carlo Control • On-policy: learn about the policy currently being executed • So, learn about a policy which explores • Simplest: policy with exploring starts • More sophisticated: soft policies: π(s,a) > 0 for all s and a
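
A minimal sketch of the ε-greedy action selection that makes a policy soft; the Q table layout ((state, action) keys) and the default ε are assumptions.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Soft policy: every action keeps probability at least epsilon / len(actions)."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit greedily
```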

  16. Off-policy Monte Carlo control • Behavior policy generates behavior in the environment • Estimation policy is the policy being learned about • Usually: try to estimate the best deterministic policy, based on a policy which keeps exploring • Weigh returns from the behavior policy to estimate values for the estimation policy • Weights are the estimated relative probabilities of the observed state-action pairs occurring under the two policies

  17. Learning about π while following a different (behavior) policy

  18. Off-policy MC control
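
The off-policy MC control algorithm box is not in the transcript; below is a sketch of the value-estimation part using weighted importance sampling, one standard way to realize the "weigh returns from the behavior policy" idea of slide 16. The episode format and the target_prob / behavior_prob callables are assumptions.

```python
from collections import defaultdict

def off_policy_mc_q(generate_behavior_episode, target_prob, behavior_prob,
                    num_episodes, gamma=1.0):
    """Weighted-importance-sampling estimate of Q for the estimation (target) policy.

    target_prob(s, a) and behavior_prob(s, a) are hypothetical callables giving
    pi(a|s) for the estimation policy and b(a|s) for the (soft) behavior policy.
    """
    Q = defaultdict(float)
    C = defaultdict(float)          # cumulative importance weights per (s, a)
    for _ in range(num_episodes):
        episode = generate_behavior_episode()   # [(s, a, r), ...] under behavior policy
        G, W = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            # Move Q towards the importance-weighted return
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            # Relative probability of this action under the two policies
            W *= target_prob(s, a) / behavior_prob(s, a)
            if W == 0.0:            # remainder of the episode contributes nothing
                break
    return Q
```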

  19. Incremental Implementation • MC can be implemented incrementally • saves memory • Compute the weighted average of each return • non-incremental: V_n = (w_1 R_1 + ... + w_n R_n) / (w_1 + ... + w_n) • incremental equivalent: V_{n+1} = V_n + (w_{n+1} / W_{n+1}) [R_{n+1} - V_n], with W_{n+1} = W_n + w_{n+1} and W_0 = 0

  20. Simple incremental implementation • V(s_t) ← V(s_t) + α [R_t - V(s_t)] • target R_t: the actual return after time t
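
A tiny illustration of the incremental form NewEstimate ← OldEstimate + StepSize · (Target - OldEstimate); the returns used here are made up.

```python
def incremental_update(value, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return value + step_size * (target - value)

# Sample-average case: step size 1/n for the n-th return observed at a state
V, n = 0.0, 0
for G in [1.0, 0.0, 1.0, 1.0]:        # hypothetical returns for one state
    n += 1
    V = incremental_update(V, G, 1.0 / n)
print(V)                               # ~0.75, the plain average of the returns
```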

  21. Summary of MC • MC has several advantages over DP: • Can learn directly from interaction with the environment • No need for full models • No need to learn about ALL states • Less harmed by violations of the Markov property (later in the book) • MC methods provide an alternative policy evaluation process • One issue to watch for: maintaining sufficient exploration • exploring starts, soft policies • Introduced the distinction between on-policy and off-policy methods • No bootstrapping (as opposed to DP)

  22. Overview of this lecture • Solving the full RL problem • Given that we have the MDP model • Dynamic Programming (DP) • Given that we do not have the MDP model • Monte Carlo (MC) methods • Temporal Difference (TD) methods

  23. Temporal Difference (TD) methods • Learn from experience, like MC • Can learn directly from interaction with environment • No need for full models • Estimate values based on estimated values of next states, like DP • Bootstrapping (like DP) • Issue to watch for: maintaining sufficient exploration

  24. Bootstrapping • A full policy-evaluation backup (DP): V(s) ← Σ_a π(s,a) Σ_s' P(s'|s,a) [r(s,a,s') + γ V(s')] • The new estimated value for each state s is based on the estimated old values of all possible successor states s' • Bootstrapping: estimating values based on your own estimates of values

  25. TD Prediction • Policy evaluation (the prediction problem): for a given policy π, compute the state-value function Vπ • Recall constant-α MC: V(s_t) ← V(s_t) + α [R_t - V(s_t)], where the target R_t is the actual return after time t • The simplest TD method, TD(0): V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) - V(s_t)], where the target r_{t+1} + γ V(s_{t+1}) is an estimate of the return

  26. Simple Monte Carlo [backup diagram]

  27. Dynamic Programming [backup diagram]

  28. Simplest TD Method [backup diagram]
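
A minimal tabular TD(0) prediction sketch, mirroring the update on slide 25. The env object with reset() and step(action) returning (next state, reward, done) is a hypothetical interface, not something from the lecture.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)          # hypothetical environment API
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])        # bootstrap on the estimate of s'
            s = s_next
    return V
```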

  29. TD methods bootstrap and sample • Bootstrapping: update involves an estimate • MC does not bootstrap • DP bootstraps • TD bootstraps • Sampling: update does not involve an expected value • MC samples • DP does not sample • TD samples

  30. Driving Home • Changes recommended by Monte Carlo methods (α = 1) • Changes recommended by TD methods (α = 1)

  31. Advantages of TD Learning • TD methods do not require a model of the environment, only experience • TD, but not MC, methods can be fully incremental • You can learn before knowing the final outcome • Less memory • Less peak computation • You can learn without the final outcome • From incomplete sequences • Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

  32. Random Walk Example • Values learned by TD(0) after various numbers of episodes

  33. TD and MC on the Random Walk • Data averaged over 100 sequences of episodes

  34. Optimality of TD(0) • Batch updating: compute updates according to TD(0), but only update estimates after each complete pass through the data (end of the episode). Let's assume we train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence. For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α. Constant-α MC also converges under these conditions, but to a different answer!

  35. You are the Predictor • Suppose you observe the following 8 episodes: • A, 0, B, 0 • B, 1 • B, 1 • B, 1 • B, 1 • B, 1 • B, 1 • B, 0

  36. You are the Predictor

  37. You are the Predictor • The prediction that best matches the training data is V(A) = 0 • This minimizes the mean-squared error on the training set • This is what a batch Monte Carlo method gets • If we consider the sequentiality of the problem, then we would set V(A) = 0.75 • This is correct for the maximum-likelihood estimate of a Markov model generating the data • i.e., if we fit the best Markov model, assume it is exactly correct, and then compute what it predicts (how?) • This is called the certainty-equivalence estimate • This is what TD(0) gets
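
The two answers on this slide can be checked in a few lines; the episodes are the ones listed on slide 35, treated as undiscounted.

```python
# Episodes from slide 35: each element is a list of (state, reward) pairs
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

# Batch Monte Carlo: average the observed (undiscounted) returns per state
returns = {'A': [], 'B': []}
for ep in episodes:
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(r for _, r in ep[i:]))
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc)                      # {'A': 0.0, 'B': 0.75}

# Certainty-equivalence estimate (what batch TD(0) converges to):
# the fitted Markov model says A always goes to B with reward 0, so V(A) = V(B)
V_B = sum(r for ep in episodes for s, r in ep if s == 'B') / 8
print(0 + V_B, V_B)              # 0.75 0.75
```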

  38. Learning An Action-Value Function

  39. Sarsa: On-Policy TD Control • Turn TD into a control method by always updating the policy to be (ε-)greedy with respect to the current estimate • S A R S A: State Action Reward State Action
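
A minimal tabular Sarsa sketch under the same assumed env interface as the TD(0) sketch above; the α, γ and ε values are placeholders.

```python
from collections import defaultdict
import random

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """On-policy TD control:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)              # hypothetical environment API
            a2 = eps_greedy(s2)                    # next action from the same policy
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```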

  40. Windy Gridworld undiscounted, episodic, reward = –1 until goal

  41. Results of Sarsa on the Windy Gridworld

  42. Q-Learning: Off-Policy TD Control
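
The update rule on this slide is an image in the original; the standard Q-learning backup Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)] is sketched below under the same assumed env interface as before.

```python
from collections import defaultdict
import random

def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Off-policy TD control:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy w.r.t. the current Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)              # hypothetical environment API
            best_next = 0.0 if done else max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```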

  43. Cliffwalking • ε-greedy, ε = 0.1

  44. The Book • Part I: The Problem • Introduction • Evaluative Feedback • The Reinforcement Learning Problem • Part II: Elementary Solution Methods • Dynamic Programming • Monte Carlo Methods • Temporal Difference Learning • Part III: A Unified View • Eligibility Traces • Generalization and Function Approximation • Planning and Learning • Dimensions of Reinforcement Learning • Case Studies

  45. Unified View

  46. Summary • TD prediction • Introduced one-step tabular model-free TD methods • Extend prediction to control by employing some form of GPI • On-policy control: Sarsa • Off-policy control: Q-learning • These methods bootstrap and sample, combining aspects of DP and MC methods

  47. Questions • What is common to all three classes of methods? (DP, MC, TD) • What are the principal strengths and weaknesses of each? • In what sense is our RL view complete? • In what senses is it incomplete? • What are the principal things missing? • The broad applicability of these ideas… • What does the term bootstrapping refer to? • What is the relationship between DP and learning?

  48. Next Class • Lecture next Monday, 5 March: • Chapters 6 & 7 of Sutton & Barto
