## Monte-Carlo Methods


**Monte-Carlo Methods**

Learning methods that average complete episodic returns.

Slides based on [Sutton & Barto: Reinforcement Learning: An Introduction, 1998]. Slides prepared by Georgios Chalkiadakis.

**Differences with DP/TD**

- Differences with DP methods:
  - Real RL: a complete transition model is not necessary
  - They sample experience, so they can be used for direct learning
  - They do not bootstrap: no evaluation of successor states
- Differences with TD methods:
  - Well, they do not bootstrap
  - They average complete episodic returns

**Overview and Advantages**

- Learn from experience: sample episodes
  - Sample sequences of states, actions, and rewards
  - Either on-line, or from simulated (model-based) interactions with the environment
  - But no complete model is required
- Advantages:
  - Provably learn the optimal policy without a model
  - Can be used with sample / easy-to-produce models
  - Can easily focus on interesting regions of the state space
  - More robust with respect to violations of the Markov property

**Policy Evaluation**

- Estimate $V^\pi(s)$ as the average of the returns observed following visits to $s$ in episodes generated by following $\pi$; as more returns are averaged, the estimate converges to $V^\pi(s)$.

**Action-value functions required**

- Without a model, we need Q-value estimates
- MC methods then average the returns following visits to state–action pairs
- All such pairs "need" to be visited!
  - …sufficient exploration is required
  - Randomize episode starts ("exploring starts")
  - …or behave using a stochastic (e.g. ε-greedy) policy
  - …thus "Monte-Carlo"

**Monte-Carlo Control (to generate an optimal policy)**

- For now, assume "exploring starts"
- Does "policy iteration" work?
- Yes! The evaluation of each policy is over multiple episodes, and the improvement step makes the policy greedy with respect to the current Q-value function:

$$\pi_0 \xrightarrow{E} Q^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} Q^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^{*} \xrightarrow{E} Q^{*}$$

- Why? $\pi_{k+1}$ is greedy wrt $Q^{\pi_k}$. Then the policy-improvement theorem applies, because for all $s$:

$$Q^{\pi_k}(s, \pi_{k+1}(s)) = \max_{a} Q^{\pi_k}(s, a) \;\ge\; Q^{\pi_k}(s, \pi_k(s)) = V^{\pi_k}(s)$$

so $\pi_{k+1}$ is uniformly at least as good as $\pi_k$. Thus $\pi_{k+1} \ge \pi_k$.

**A Monte-Carlo control algorithm**

- Monte-Carlo ES (exploring starts): repeat forever: generate an episode from a randomly chosen (state, action) start, then following $\pi$; for each pair $(s, a)$ appearing in the episode, append the return following its first occurrence to $Returns(s, a)$ and set $Q(s, a)$ to the average of $Returns(s, a)$; then, for each $s$ in the episode, set $\pi(s) \leftarrow \arg\max_a Q(s, a)$.

**What about ε-greedy policies?**

- ε-greedy exploration: if not "greedy", select each action with probability

$$\frac{\varepsilon}{|A(s)|}$$

Otherwise (for the greedy action):

$$1 - \varepsilon + \frac{\varepsilon}{|A(s)|}$$

**Yes, policy iteration works**

- See the details in the book
- ε-soft on-policy algorithm: evaluate and improve the same ε-greedy policy that generates the episodes; the improvement step makes the policy ε-greedy (rather than fully greedy) wrt the current Q-value function, and it converges to the best ε-soft policy.

**…and you can have off-policy learning as well…**

- Why? Because returns generated by one (behaviour) policy can be reweighted, via importance sampling, to estimate the value function of a different (target, e.g. greedy) policy.
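As a concrete illustration of "averaging complete episodic returns", here is a minimal sketch of first-visit Monte-Carlo evaluation of action values. The function name `first_visit_mc_q` and the encoding of episodes as `(state, action, reward)` triples are my own conventions, not from the slides:

```python
from collections import defaultdict

def first_visit_mc_q(episodes, gamma=1.0):
    """First-visit Monte-Carlo evaluation of Q.

    episodes: list of episodes, each a list of (state, action, reward)
    triples, where the reward is received after taking the action.
    Returns a dict mapping (state, action) to the average of the
    returns that followed the first visit to that pair in each episode.
    """
    returns = defaultdict(list)            # (s, a) -> observed returns
    for episode in episodes:
        g = 0.0
        first_return = {}                  # return after the FIRST visit
        for s, a, r in reversed(episode):
            g = r + gamma * g              # backward accumulation of return
            first_return[(s, a)] = g       # earlier visits overwrite later ones
        for sa, g_sa in first_return.items():
            returns[sa].append(g_sa)
    return {sa: sum(gs) / len(gs) for sa, gs in returns.items()}
```

Iterating the episode backwards makes each pair's stored value the return following its earliest occurrence, which is exactly the first-visit quantity being averaged.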
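The ε-greedy selection rule described in the slides (every action gets probability ε/|A(s)|, and the greedy action gets the remaining 1 − ε on top) can be sketched as follows; the helper names and the dict representation of Q for one state are assumptions for illustration:

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy in one state.

    q_values: dict mapping action -> Q(s, a).  Every action receives
    probability epsilon / |A(s)|; the greedy action additionally
    receives the remaining 1 - epsilon.
    """
    n = len(q_values)
    greedy = max(q_values, key=q_values.get)
    return {a: epsilon / n + (1.0 - epsilon) * (a == greedy)
            for a in q_values}

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample an action from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    actions, weights = zip(*probs.items())
    return rng.choices(actions, weights=weights)[0]
```

Note that the probabilities sum to one by construction, and setting ε = 0 recovers the purely greedy policy.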
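For the off-policy case, one standard approach from the Sutton & Barto treatment is ordinary importance sampling: returns generated by a behaviour policy are reweighted by the ratio of target to behaviour action probabilities. This is a sketch under the assumption of tabular policies given as probability dicts; the function name is hypothetical:

```python
def off_policy_mc_value(episodes, target, behaviour, gamma=1.0):
    """Ordinary importance-sampling estimate of the start-state value
    under the target policy, from episodes generated by the behaviour
    policy.

    target, behaviour: dicts mapping (state, action) -> probability of
    taking that action in that state.  Each episode is a list of
    (state, action, reward) triples.
    """
    estimates = []
    for episode in episodes:
        rho, g, discount = 1.0, 0.0, 1.0
        for s, a, r in episode:
            # Accumulate the importance ratio over the whole episode...
            rho *= target[(s, a)] / behaviour[(s, a)]
            # ...and the discounted return.
            g += discount * r
            discount *= gamma
        estimates.append(rho * g)
    return sum(estimates) / len(estimates)
```

The behaviour policy must give nonzero probability to every action the target policy might take (coverage); otherwise the ratio is undefined.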