## Optimal Learning & Bayes-Adaptive MDPs


An overview of Ch. 1 of M. Duff's thesis. SDM-RG, Mar-09. Slides prepared by Georgios Chalkiadakis.

### Optimal Learning: Overview

- Behaviour that maximizes expected total reward while interacting with an uncertain world: behave well while learning, learn while behaving well.
- What does it mean to behave optimally under uncertainty? Optimality is defined with respect to a distribution of environments.
- Explore vs. exploit, given prior uncertainty about the environment: what is the "value of information"?

### The Bayesian approach

- Evolve uncertainty about the unknown process parameters.
- The parameters describe prior distributions over the world model (transitions/rewards), that is, over information states.
- The sequential problem is described by a "hyperstate" MDP (a "Bayes-Adaptive MDP"): its states are not just physical states, but physical states + information states.

### A simple "stateless" example

- Two Bernoulli processes, whose parameters θ1, θ2 are the actual (but unknown) probabilities of success.
- Bayesian treatment: describe the uncertainty about the parameters by conjugate prior distributions.

### Conjugate priors

- A prior is conjugate to a likelihood function if the posterior is in the same family as the prior: prior in the family, posterior in the family.
- A simple update of the hyperparameters is enough to obtain the posterior! (For Bernoulli sampling, the conjugate prior is the Beta distribution.)

### Information-state transition diagram

- Each observed success or failure simply moves the information state to the one with the updated hyperparameters. (Diagram on the original slides.)

### Bellman optimality equation (with k steps to go)

- (Equation on the original slides.)

### Enter physical states (MDPs)

- Now take 2 physical states / 2 actions, giving four Bernoulli processes: action 1 at state 1, action 2 at state 1, action 1 at state 2, action 2 at state 2.
- (a_1^1, b_1^1) are the hyperparameters of the Beta distribution capturing the uncertainty about p^1_{11}.
- The full hyperstate pairs the current physical state with the hyperparameters of all the processes. (Shown on the original slides.)
- Note: we now have to be in a specific physical state in order to sample the related process.

### More than 2 physical states… What priors now?

- The Dirichlet distribution: conjugate to multinomial sampling.
- Sampling is now multinomial: from s to many possible s'.
- We will see examples in future readings…

### Certainty equivalence?

- Truncate the horizon and compute terminal values using the means of the current belief distributions…
- …then proceed with a receding-horizon approach: perform DP, take the first "optimal" action, shift the window forward, repeat.
- Or, even more simply, use a myopic certainty-equivalence approach (a horizon of 1):
  - use the means of the current priors to compute DP "optimal" policies;
  - execute the "optimal" action and observe the transition;
  - update the distributions; repeat.

### No, it's not a good idea!…

- Actions / state transitions might be starved forever…
- …even if the initial prior is an accurate model of uncertainty! (Worked example on the original slides.)

### So, we have to be properly Bayesian

- If the prior is an accurate model of uncertainty, "important" actions/states will not be starved.
- There exist Bayesian RL algorithms that do more than a decent job! (future readings)
- However, if the prior paints a distorted picture of reality, then we have no convergence guarantees…
- …but "optimal learning" still applies (assuming the other algorithms operate with the same prior knowledge).
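The Bellman recursion over hyperstates can be sketched for the stateless two-armed Bernoulli example. The following is a minimal illustration (my notation, not Duff's): the information state is the tuple of Beta hyperparameters (a1, b1, a2, b2), pulling an arm pays its posterior-mean success probability, and the hyperstate transitions are exactly the conjugate hyperparameter updates. No discounting; k pulls remain.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def V(k, a1, b1, a2, b2):
    """Optimal expected total reward with k pulls to go, given
    Beta(a_i, b_i) beliefs about each arm's success probability."""
    if k == 0:
        return 0.0
    # Q-value of pulling arm i: the posterior predictive probability of
    # success times (reward 1 + value of the success-updated hyperstate),
    # plus the failure probability times the failure-updated hyperstate.
    p1 = a1 / (a1 + b1)
    q1 = (p1 * (1 + V(k - 1, a1 + 1, b1, a2, b2))
          + (1 - p1) * V(k - 1, a1, b1 + 1, a2, b2))
    p2 = a2 / (a2 + b2)
    q2 = (p2 * (1 + V(k - 1, a1, b1, a2 + 1, b2))
          + (1 - p2) * V(k - 1, a1, b1, a2, b2 + 1))
    return max(q1, q2)
```

With uniform Beta(1, 1) priors on both arms and two pulls to go, `V(2, 1, 1, 1, 1)` evaluates to 13/12, while committing to either arm on the prior means alone is worth exactly 2 × 0.5 = 1; the gap is the "value of information" the slides refer to.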
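The starvation problem with the myopic certainty-equivalence approach can be seen in a toy simulation. This is a hypothetical setup, not the example from the slides: both arms in fact always succeed, but the prior on arm 0 is pessimistic (Beta(1, 9), mean 0.1) while the prior on arm 1 is optimistic (Beta(9, 1), mean 0.9). The greedy loop never samples arm 0, so its belief never changes, no matter how long we run.

```python
def myopic_ce(steps, alpha, beta, env):
    """Myopic certainty-equivalent loop: always pull the arm whose
    current posterior mean is highest, then update its Beta counts."""
    alpha, beta = list(alpha), list(beta)
    pulls = [0] * len(alpha)
    for _ in range(steps):
        means = [a / (a + b) for a, b in zip(alpha, beta)]
        i = max(range(len(means)), key=means.__getitem__)
        r = env(i)          # observe a 0/1 reward for arm i
        pulls[i] += 1
        alpha[i] += r       # conjugate Beta update
        beta[i] += 1 - r
    return pulls

# Hypothetical environment: every pull of every arm succeeds.
pulls = myopic_ce(100, alpha=[1, 9], beta=[9, 1], env=lambda i: 1)
# pulls == [0, 100]: arm 0 is starved forever.
```

Arm 1's posterior mean only ever rises, so it stays above arm 0's fixed prior mean of 0.1 and the loop never explores; this is the starvation the slides warn about.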