
Chapter 5: Monte Carlo Methods





1. Chapter 5: Monte Carlo Methods
• Monte Carlo methods learn from complete sample returns
• Defined only for episodic tasks
• Monte Carlo methods learn directly from experience
• On-line: no model is necessary, and optimality is still attained
• Simulated: no need for a full model
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

2. Monte Carlo Policy Evaluation
• Goal: learn Vπ(s)
• Given: some number of episodes under π which contain s
• Idea: average the returns observed after visits to s
• Every-visit MC: average the returns for every time s is visited in an episode
• First-visit MC: average the returns only for the first time s is visited in an episode
• Both converge asymptotically

3. First-visit Monte Carlo Policy Evaluation
• Initialize:
  • π ← policy to be evaluated
  • V ← an arbitrary state-value function
  • Returns(s) ← empty list, for all s ∈ S
• Repeat forever:
  • Generate an episode using π
  • For each state s appearing in the episode:
    • R ← return following the first occurrence of s
    • Append R to Returns(s)
    • V(s) ← average(Returns(s))
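
A minimal Python sketch of the first-visit procedure above. The `generate_episode(policy)` helper, the episode format, and the default γ = 1 are assumptions made for illustration; they are not part of the slides.

```python
from collections import defaultdict

def first_visit_mc_evaluation(policy, generate_episode, num_episodes=10000, gamma=1.0):
    """Estimate V_pi by averaging returns that follow the first visit to each state.

    generate_episode(policy) is an assumed helper returning
    [(s_0, r_1), (s_1, r_2), ..., (s_{T-1}, r_T)].
    """
    returns = defaultdict(list)   # Returns(s): first-visit returns observed for s
    V = defaultdict(float)        # arbitrary initial state-value function

    for _ in range(num_episodes):
        episode = generate_episode(policy)

        # Compute the return following each time step (accumulate backwards).
        G = 0.0
        returns_after = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = gamma * G + reward
            returns_after[t] = G

        # Average only the return following the FIRST occurrence of each state.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns[state].append(returns_after[t])
                V[state] = sum(returns[state]) / len(returns[state])

    return V
```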

4. Blackjack Example
• Object: have your card sum be greater than the dealer's without exceeding 21
• States (200 of them):
  • current sum (12-21)
  • dealer's showing card (ace-10)
  • do I have a usable ace?
• Reward: +1 for winning, 0 for a draw, -1 for losing
• Actions: stick (stop receiving cards), hit (receive another card)
• Policy: stick if my sum is 20 or 21, else hit
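
As a concrete illustration of the state and fixed policy above, here is a small sketch; the tuple layout, action names, and helper functions are assumptions chosen to match the slide, not code from the book.

```python
# State: (player_sum, dealer_showing, usable_ace), e.g. (18, 7, False)
HIT, STICK = "hit", "stick"

def threshold_policy(state):
    """The fixed policy being evaluated: stick only on 20 or 21, otherwise hit."""
    player_sum, dealer_showing, usable_ace = state
    return STICK if player_sum >= 20 else HIT

def final_reward(player_sum, dealer_sum):
    """+1 for a win, 0 for a draw, -1 for a loss (busts handled by the caller)."""
    if player_sum > dealer_sum:
        return +1
    if player_sum == dealer_sum:
        return 0
    return -1
```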

5. Blackjack Value Functions
• After many MC evaluations of state visits, the state-value function is well approximated
• The quantities Dynamic Programming requires are difficult to formulate here: for instance, given a particular hand and a decision to stick, what would the expected return be?

6. Backup Diagram for Monte Carlo
• The entire episode is included, whereas DP backs up only one-step transitions
• Only one choice at each state (unlike DP, which considers all possible transitions in one step)
• Estimates for all states are independent, so MC does not bootstrap (build on other estimates)
• The time required to estimate one state does not depend on the total number of states

7. The Power of Monte Carlo
• Example: elastic membrane (Dirichlet problem)
• How do we compute the shape of a membrane or bubble attached to a fixed frame?

8. Two Approaches
• Relaxation: iterate on the grid, computing local averages (like DP iteration)
• Kakutani's algorithm, 1945: run many random walks and average the boundary values where they terminate (like the MC approach)
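
A sketch of the random-walk idea on a square grid: the membrane height at an interior point is estimated by averaging the boundary heights where unbiased random walks started at that point first hit the frame. The grid layout and the `boundary_value` / `is_boundary` helpers are assumptions for illustration.

```python
import random

def estimate_height(x, y, boundary_value, is_boundary, num_walks=1000):
    """Estimate the membrane height at interior grid point (x, y).

    Each walk steps to one of the four neighbours with equal probability
    until it reaches the frame; the boundary heights where the walks
    terminate are averaged (the MC approach).
    """
    total = 0.0
    for _ in range(num_walks):
        i, j = x, y
        while not is_boundary(i, j):
            di, dj = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            i, j = i + di, j + dj
        total += boundary_value(i, j)
    return total / num_walks
```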

9. Monte Carlo Estimation of Action Values (Q)
• Monte Carlo is most useful when a model is not available
• We want to learn Q*
• Qπ(s, a): average return starting from state s, taking action a, and thereafter following π
• Converges asymptotically if every state-action pair is visited
• To assure this we must maintain exploration so that many state-action pairs are visited
• Exploring starts: every state-action pair has a nonzero probability of being the starting pair

10. Monte Carlo Control (to approximate an optimal policy)
• MC policy iteration: policy evaluation by approximating Qπ with MC methods, followed by policy improvement
• Policy improvement step: make the policy greedy with respect to the Qπ (action-value) function; no model is needed to construct the greedy policy (see the sketch below)
• Generalized Policy Iteration (GPI)
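
The improvement step needs no model because the action values already summarize the one-step lookahead. A minimal sketch, assuming Q is a dict keyed by (state, action) and `actions(s)` lists the available actions (both assumptions):

```python
def greedy_policy_from_q(Q, actions):
    """Policy improvement: in every state, pick an action maximizing Q(s, a).

    No transition model or reward model is consulted, only the action values.
    """
    def policy(state):
        return max(actions(state), key=lambda a: Q.get((state, a), 0.0))
    return policy
```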

11. Convergence of MC Control
• The policy improvement theorem tells us that
  Q^{πk}(s, πk+1(s)) = max_a Q^{πk}(s, a) ≥ Q^{πk}(s, πk(s)) = V^{πk}(s)
• This assumes exploring starts and an infinite number of episodes for MC policy evaluation
• To avoid the latter:
  • update only to a given level of performance, or
  • alternate between evaluation and improvement after each episode

12. Monte Carlo Exploring Starts
• The fixed point of this process is the optimal policy π*
• A formal convergence proof remains one of the fundamental open questions of RL

13. Blackjack Example Continued
• Exploring starts: easy to enforce here by generating state-action pairs randomly
• Initial policy as described before (stick only at 20 or 21)
• Initial action-value function equal to zero

14. On-policy Monte Carlo Control
• On-policy: learn about or improve the policy currently being executed
• How do we get rid of exploring starts?
• We need soft policies: π(s, a) > 0 for all s and a
• e.g., an ε-soft (ε-greedy) policy, with action-selection probabilities:
  • ε / |A(s)| for each nongreedy action
  • 1 − ε + ε / |A(s)| for the greedy action
• Similar to GPI: move the policy toward the greedy policy while keeping it ε-soft (a selection sketch follows below)
• Converges to the best ε-soft policy
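
A sketch of ε-greedy action selection, which realizes the probabilities above: every action gets at least ε / |A(s)|, and the greedy action receives the remaining mass. The `actions(s)` helper and dict-based Q are assumptions.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Select an action under an epsilon-greedy (hence epsilon-soft) policy.

    With probability epsilon the action is drawn uniformly at random, so each
    action has probability at least epsilon / |A(s)|; otherwise the greedy
    action argmax_a Q(s, a) is taken.
    """
    available = actions(state)
    if random.random() < epsilon:
        return random.choice(available)  # exploratory choice: epsilon / |A(s)| each
    return max(available, key=lambda a: Q.get((state, a), 0.0))  # greedy action
```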

15. On-policy MC Control

16. Learning about π while following another policy

17. Off-policy Monte Carlo Control
• On-policy methods estimate the value of a policy while following it, so the policy that generates behavior and the policy being estimated are identical
• Off-policy methods assume separate behavior and estimation policies
  • The behavior policy generates behavior in the environment; it may be randomized so as to sample all actions
  • The estimation policy is the policy being learned about; it may be deterministic (e.g., greedy)
• Off-policy methods estimate one policy while following another
• Returns from the behavior policy are averaged; the behavior policy must select, with nonzero probability, every action the estimation policy might take (see the sketch below)
• Improvement may be slow after a nongreedy action, since learning uses only the portion of an episode that follows it
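
A sketch of one off-policy update pass using a weighted average of importance-sampled returns, consistent with the incremental form on the later slide. The episode format and the `target_prob` / `behavior_prob` helpers (each policy's probability of the action actually taken) are assumptions for illustration.

```python
def off_policy_mc_update(Q, C, episode, target_prob, behavior_prob, gamma=1.0):
    """One backward pass over an episode generated by the behavior policy.

    episode = [(s_0, a_0, r_1), ..., (s_{T-1}, a_{T-1}, r_T)].
    Q maps (s, a) to the current estimate and C to the cumulative weight,
    as in the incremental weighted-average update.
    """
    G, W = 0.0, 1.0  # return and importance-sampling weight of the tail
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        key = (state, action)
        C[key] = C.get(key, 0.0) + W
        Q[key] = Q.get(key, 0.0) + (W / C[key]) * (G - Q.get(key, 0.0))
        # Reweight by how much more (or less) likely this action is under the
        # estimation policy than under the behavior policy that took it.
        W *= target_prob(action, state) / behavior_prob(action, state)
        if W == 0.0:
            break  # the estimation policy would never take this action; stop
```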

18. Off-policy MC Control

19. Incremental Implementation
• MC can be implemented incrementally, which saves memory
• Compute the weighted average of the returns:
  • Non-incremental: V_n = (Σ_{k=1..n} w_k R_k) / (Σ_{k=1..n} w_k)
  • Incremental equivalent: V_{n+1} = V_n + (w_{n+1} / W_{n+1}) (R_{n+1} − V_n), where W_{n+1} = W_n + w_{n+1} and W_0 = 0
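
A direct translation of the incremental form into code; the class and variable names are mine.

```python
class WeightedAverage:
    """Maintain V_n = (sum_k w_k R_k) / (sum_k w_k) without storing past returns."""

    def __init__(self):
        self.value = 0.0         # V_n
        self.total_weight = 0.0  # W_n, starting at W_0 = 0

    def add(self, ret, weight):
        """Incremental equivalent: W <- W + w, then V <- V + (w / W)(R - V)."""
        self.total_weight += weight
        self.value += (weight / self.total_weight) * (ret - self.value)
        return self.value
```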

20. Racetrack Exercise
• States: grid squares plus horizontal and vertical velocity
• Rewards: -1 on the track, -5 off the track
• Only right turns are allowed
• Actions: add +1, -1, or 0 to each velocity component
• 0 < velocity < 5
• Stochastic: 50% of the time the car moves 1 extra square up or to the right
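
A sketch of the state transition implied by the exercise. The grid representation, how the velocity bound is enforced (clipping), and how the 50% extra move is split between "up" and "right" are all assumptions filled in from the slide's brief description.

```python
import random

def racetrack_step(position, velocity, action):
    """One move: adjust each velocity component by -1, 0, or +1, clip it to the
    stated velocity range, move the car, and with probability 0.5 slip one
    extra square up or to the right (the stochastic part of the exercise)."""
    vx, vy = (max(0, min(5, v + dv)) for v, dv in zip(velocity, action))
    x, y = position[0] + vx, position[1] + vy
    if random.random() < 0.5:            # stochastic extra square
        if random.random() < 0.5:
            x += 1                       # extra square to the right
        else:
            y += 1                       # extra square up
    # Reward would be -1 on the track and -5 off the track (handled by the caller).
    return (x, y), (vx, vy)
```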

21. Summary
• MC has several advantages over DP:
  • It can learn directly from interaction with the environment
  • No need for full models
  • No need to learn about ALL states
  • Less harmed by violations of the Markov property (later in the book)
• MC methods provide an alternative policy evaluation process
• One issue to watch for: maintaining sufficient exploration
  • exploring starts, soft policies
• No bootstrapping (as opposed to DP)
