Monte-Carlo Methods - PowerPoint PPT Presentation

Presentation Transcript

  1. Monte-Carlo Methods
Learning methods that average complete episodic returns.
Slides based on [Sutton & Barto: Reinforcement Learning: An Introduction, 1998]

  2. Differences with DP/TD
• Differences with DP methods:
  • Real RL: a complete transition model is not necessary
  • They sample experience and can be used for direct learning
  • They do not bootstrap: successor states are not evaluated
• Differences with TD methods:
  • Again, MC methods do not bootstrap (whereas TD methods do)
  • They average complete episodic returns
Slides prepared by Georgios Chalkiadakis

  3. Overview and Advantages
• Learn from experience: sample episodes, i.e. sequences of states, actions, and rewards
• Either on-line, or from simulated (model-based) interactions with the environment
• No complete model of the environment is required
• Advantages:
  • Provably learn the optimal policy without a model
  • Can be used with sample (easy-to-produce) models
  • Can easily focus on interesting regions of the state space
  • More robust with respect to violations of the Markov property

  4. Policy Evaluation
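The policy-evaluation algorithm shown on this slide is lost in the transcript. As a stand-in, here is a minimal first-visit Monte-Carlo prediction sketch in Python; the `sample_episode` helper and the `(state, reward)` episode format are assumptions made for illustration, not part of the original slides.

```python
from collections import defaultdict

def first_visit_mc_prediction(sample_episode, num_episodes, gamma=1.0):
    """Estimate V(s) for a fixed policy by averaging first-visit returns.

    `sample_episode` is an assumed helper: it runs the policy for one
    episode and returns a list of (state, reward) pairs, where `reward`
    is the reward received after leaving `state`.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = sample_episode()
        # Index of the first visit to each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)

        G = 0.0
        # Walk the episode backwards, accumulating the return G.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:          # average only first-visit returns
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```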

  5. Action-value functions required
• Without a model, we need Q-value (action-value) estimates
• MC methods then average the returns that follow visits to state-action pairs
• All such pairs “need” to be visited: sufficient exploration is required
  • Randomize the episode starts (“exploring starts”)
  • …or behave according to a stochastic (e.g. ε-greedy) policy
• …thus “Monte-Carlo”

  6. Monte-Carlo Control (to generate the optimal policy)
• For now, assume “exploring starts”
• Does “policy iteration” work?
• Yes! Evaluation of each policy is done over multiple episodes, and improvement makes the policy greedy with respect to the current Q-value function

  7. Monte-Carlo Control (to generate the optimal policy)
• Why? π_{k+1} is greedy with respect to Q^{π_k}
• Then the policy-improvement theorem applies, because for all s:
  Q^{π_k}(s, π_{k+1}(s)) = max_a Q^{π_k}(s, a) ≥ Q^{π_k}(s, π_k(s)) = V^{π_k}(s)
• Thus π_{k+1} is uniformly better than (or as good as) π_k, i.e. V^{π_{k+1}}(s) ≥ V^{π_k}(s) for all s

  8. A Monte-Carlo control algorithm
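The algorithm box from this slide also does not survive the transcript. Below is a minimal sketch of Monte-Carlo control with exploring starts (Monte Carlo ES) in Python; the environment interface (`reset_to`, `step`) and the finite `states`/`actions` lists are assumptions made for illustration.

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(env, states, actions, num_episodes, gamma=1.0):
    """Monte-Carlo ES: evaluate Q by averaging first-visit returns and
    improve the policy greedily after every episode.

    Assumed (hypothetical) environment interface:
      env.reset_to(s, a) -> (s', r, done)   start in state s, take action a
      env.step(a)        -> (s', r, done)
    """
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # Exploring start: a random state-action pair begins the episode.
        s0, a0 = random.choice(states), random.choice(actions)
        state, reward, done = env.reset_to(s0, a0)
        episode = [(s0, a0, reward)]
        while not done:
            a = policy[state]
            next_state, reward, done = env.step(a)
            episode.append((state, a, reward))
            state = next_state

        # First-visit returns, walked backwards.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
                # Policy improvement: greedy w.r.t. the current Q.
                policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return policy, Q
```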

  9. What about ε-greedy policies? ε-greedy exploration
• If not “greedy”, each non-greedy action is selected with probability ε/|A(s)|
• Otherwise, the greedy action is selected with probability 1 − ε + ε/|A(s)|
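A quick illustration of those selection probabilities (a sketch, assuming `Q` is the current action-value table and `actions` is the finite action set): choosing uniformly at random with probability ε and greedily otherwise gives exactly ε/|A(s)| per non-greedy action and 1 − ε + ε/|A(s)| for the greedy one.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """ε-greedy selection: with probability ε pick uniformly over all actions
    (so each non-greedy action has probability ε/|A(s)|); otherwise pick the
    greedy action, which therefore has probability 1 - ε + ε/|A(s)|."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```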

  10. Yes, policy iteration works
• See the details in the book
• ε-soft on-policy algorithm (a sketch follows below):
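The ε-soft on-policy algorithm itself is not in the transcript; the following is a minimal sketch of on-policy first-visit MC control with an ε-greedy behaviour policy in Python. The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) is an assumption, not something given on the slides.

```python
import random
from collections import defaultdict

def on_policy_mc_control(env, actions, num_episodes, gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control for ε-soft policies: behave ε-greedily
    w.r.t. the current Q, average first-visit returns, and repeat."""
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    def behave(state):
        # ε-greedy: explore uniformly with probability ε, else act greedily on Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # Generate one episode with the current ε-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            a = behave(state)
            next_state, r, done = env.step(a)
            episode.append((state, a, r))
            state = next_state

        # First-visit MC update of Q, walking the episode backwards.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return Q
```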

  11. …and you can have off-policy learning as well…
• Why? Episodes can be generated by a separate, exploratory behaviour policy, while the returns are re-weighted by importance sampling so that a different target policy is evaluated and improved
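The slide gives no further detail; as one concrete illustration, here is a sketch of off-policy MC prediction with weighted importance sampling in its incremental form (cf. Sutton & Barto, ch. 5). The `target_prob`/`behavior_prob` helpers and the episode format are assumptions made for illustration.

```python
from collections import defaultdict

def off_policy_mc_prediction(episodes, target_prob, behavior_prob, gamma=1.0):
    """Estimate Q for a target policy from episodes generated by a different
    behaviour policy, using weighted importance sampling.

    `episodes` is a list of episodes, each a list of (state, action, reward).
    `target_prob(a, s)` and `behavior_prob(a, s)` give π(a|s) and b(a|s);
    both are assumed helpers, not part of the original slides.
    """
    Q = defaultdict(float)
    C = defaultdict(float)   # cumulative importance-sampling weights

    for episode in episodes:
        G, W = 0.0, 1.0
        # Walk the episode backwards, accumulating return and weight.
        for (s, a, r) in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= target_prob(a, s) / behavior_prob(a, s)
            if W == 0.0:     # earlier steps would contribute nothing
                break
    return Q
```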