
Bayesian Sparse Sampling for On-line Reward Optimization


Presentation Transcript


  1. Bayesian Sparse Sampling for On-line Reward Optimization. Presented in the Value of Information Seminar at NIPS 2005. Based on a previous paper from ICML 2005 by Dale Schuurmans et al.

  2. Background Perspective • Be Bayesian about reinforcement learning • Ideal representation of uncertainty for action selection • Computational barriers: why are Bayesian approaches not prevalent in RL?

  3. Exploration vs. Exploitation • Bayes decision theory • Value of information is measured by the ultimate return in reward • Choose actions to maximize expected value • The exploration/exploitation tradeoff is implicitly handled as a side effect

  4. Bayesian Approach: conceptually clean but computationally disastrous, versus conceptually disastrous but computationally clean

  6. Overview • Efficient lookahead search for Bayesian RL • Sparser sparse sampling • Controllable computational cost • Higher quality action selection than current methods: Greedy; Epsilon-greedy; Boltzmann (Luce 1959); Thompson Sampling (Thompson 1933); Bayes optimal (Hee 1978); Interval estimation (Lai 1987, Kaelbling 1994); Myopic value of perfect info. (Dearden, Friedman, Andre 1999); Standard sparse sampling (Kearns, Mansour, Ng 2001); Péret & Garcia (Péret & Garcia 2004)
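As a concrete reference point, here is a minimal Python sketch (not from the paper) of several of the simpler selection rules listed above, assuming a Bernoulli bandit whose arms carry independent Beta posteriors; the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta posterior parameters for each arm of a Bernoulli bandit
# (alpha = prior + observed successes, beta likewise for failures).
alpha = np.array([2.0, 1.0, 5.0])
beta = np.array([1.0, 1.0, 4.0])

def greedy(alpha, beta):
    """Pick the arm with the highest posterior mean reward."""
    return int(np.argmax(alpha / (alpha + beta)))

def epsilon_greedy(alpha, beta, eps=0.1):
    """Greedy, except explore uniformly at random with probability eps."""
    if rng.random() < eps:
        return int(rng.integers(len(alpha)))
    return greedy(alpha, beta)

def boltzmann(alpha, beta, temperature=0.1):
    """Sample an arm with probability proportional to exp(mean / temperature)."""
    means = alpha / (alpha + beta)
    p = np.exp((means - means.max()) / temperature)
    p /= p.sum()
    return int(rng.choice(len(alpha), p=p))

def thompson(alpha, beta):
    """Draw one plausible mean per arm from the posterior, act greedily on it."""
    return int(np.argmax(rng.beta(alpha, beta)))
```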

  7. Sequential Decision Making ("Planning") • Requires a model P(r, s'|s, a) • How to make an optimal decision? [Diagram: a lookahead tree alternating MAX nodes over actions, which carry the V(s) values, and expectation nodes over rewards r and successor states s', which carry the Q(s, a) values, expanded to depth two] • This is the finite horizon, finite action, finite reward case • General case: fixed point equations (see below)
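The fixed point equations referenced here were rendered inside the slide's figure and did not survive the transcript; the following is a standard reconstruction from the quantities defined above, with a discount factor gamma assumed for the general case.

```latex
% Standard Bellman fixed-point equations, reconstructed from the quantities
% on this slide; a discount factor \gamma covers the general case.
\begin{align*}
  Q(s,a) &= \mathbb{E}\bigl[\, r + \gamma V(s') \,\bigm|\, s, a \bigr]
          = \sum_{r,\,s'} P(r, s' \mid s, a)\,\bigl(r + \gamma V(s')\bigr) \\
  V(s)   &= \max_{a} Q(s, a)
\end{align*}
```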

  8. Reinforcement Learning • We do not have the model P(r, s'|s, a) [Diagram: the same lookahead tree as on slide 7]

  9. Reinforcement Learning: Cannot Compute • Without the model P(r, s'|s, a), the expectations in the lookahead tree cannot be evaluated [Diagram: the slide-7 tree with its expectation nodes marked "Cannot Compute"]

  10. Bayesian Reinforcement Learning • Place a prior P(θ) on the model P(r, s'|s, a, θ) • Belief state b = P(θ) • This gives a meta-level MDP: in meta-level state (s, b), decision a leads to outcome (r, s', b') under the meta-level model P(r, s', b'|s, b, a); likewise, in (s', b'), decision a' leads to (r', s'', b'') under P(r', s'', b''|s', b', a') • Choose actions to maximize long-term reward • We do have a model for meta-level transitions, based on posterior updating and expectations over base-level MDPs
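A minimal sketch of what a meta-level transition looks like in the simplest case, a Bernoulli bandit with a Beta belief per arm; the class and function names are illustrative, not from the paper.

```python
import numpy as np

# Sketch of a meta-level transition for a Bernoulli bandit: the base-level
# state s is trivial, and the belief b is a Beta posterior per arm.

class BetaBelief:
    """Belief state b: independent Beta(alpha, beta) over each arm's mean."""

    def __init__(self, alpha, beta):
        self.alpha = np.asarray(alpha, dtype=float)
        self.beta = np.asarray(beta, dtype=float)

    def update(self, arm, reward):
        """Posterior update: returns the successor belief b' after observing r."""
        alpha, beta = self.alpha.copy(), self.beta.copy()
        if reward:
            alpha[arm] += 1.0
        else:
            beta[arm] += 1.0
        return BetaBelief(alpha, beta)

    def predictive(self, arm):
        """P(r = 1 | b, arm): the meta-level model marginalizes theta out."""
        return self.alpha[arm] / (self.alpha[arm] + self.beta[arm])


def meta_transition(belief, arm, rng):
    """Sample (r, b') from the meta-level model P(r, b' | b, a)."""
    reward = int(rng.random() < belief.predictive(arm))
    return reward, belief.update(arm, reward)
```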

  11. Bayesian RL Decision Making • How to make an optimal decision? Bayes optimal action selection: solve the planning problem in the meta-level MDP for the optimal Q(s, b, a) and V(s, b) values [Diagram: lookahead tree over meta-level states (s, b)] • Problem: the meta-level MDP is much larger than the base-level MDP, so this is impractical

  12. Bayesian RL Decision Making • Current approximation strategies consider only the current belief state b • Greedy approach: current b → mean base-level MDP model → point estimates for Q, V → choose the greedy action; but this does not consider uncertainty • Thompson approach: current b → sample a base-level MDP model → point estimates for Q, V → choose each action with the probability that it is the max-Q action; exploration is based on uncertainty, but the choice is still myopic (see the sketch below) [Diagram: one-step lookahead under the mean model versus under a sampled model]
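To make the contrast concrete, here is a hedged sketch of the two strategies for a small finite MDP whose transitions carry Dirichlet beliefs; rewards are assumed known for brevity, and all names and shapes are our own illustration rather than the paper's.

```python
import numpy as np

# dirichlet_counts has shape (n_states, n_actions, n_states) and holds prior
# pseudo-counts plus observed transition counts; R has shape (n_states, n_actions).

def solve_q(P, R, gamma=0.95, iters=500):
    """Point estimate of Q for a concrete model P[s, a, s'], R[s, a]."""
    n_s, n_a, _ = P.shape
    Q = np.zeros((n_s, n_a))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R + gamma * (P @ V)
    return Q

def greedy_action(dirichlet_counts, R, s, gamma=0.95):
    """Mean base-level MDP: use the expected transition probabilities."""
    P_mean = dirichlet_counts / dirichlet_counts.sum(axis=2, keepdims=True)
    return int(solve_q(P_mean, R, gamma)[s].argmax())

def thompson_action(dirichlet_counts, R, s, rng, gamma=0.95):
    """Sampled base-level MDP: draw one plausible model from the belief."""
    P_sample = np.apply_along_axis(rng.dirichlet, 2, dirichlet_counts)
    return int(solve_q(P_sample, R, gamma)[s].argmax())
```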

  13. Their Approach • Try to better approximate Bayes optimal action selection by performing lookahead • Adapt "sparse sampling" (Kearns, Mansour, Ng) • Make some practical improvements

  14. Sparse Sampling (Kearns, Mansour, Ng 2001)
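A compact sketch of the standard sparse sampling recursion, assuming a generative-model interface model(s, a, rng) -> (reward, next_state); the interface and parameter names are ours, not from Kearns, Mansour, and Ng.

```python
import numpy as np

# Standard sparse sampling: estimate Q(s, a) by drawing C sampled outcomes per
# action from a generative model and recursing to depth H.

def sparse_sample_q(model, s, actions, H, C, gamma, rng):
    """Return a vector of Q-value estimates for each action at state s."""
    if H == 0:
        return np.zeros(len(actions))
    q = np.zeros(len(actions))
    for i, a in enumerate(actions):
        total = 0.0
        for _ in range(C):                      # C sampled outcomes per action
            r, s_next = model(s, a, rng)
            v_next = sparse_sample_q(model, s_next, actions, H - 1, C, gamma, rng).max()
            total += r + gamma * v_next
        q[i] = total / C
    return q

def sparse_sample_action(model, s, actions, H=3, C=5, gamma=0.95, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    return actions[int(sparse_sample_q(model, s, actions, H, C, gamma, rng).argmax())]
```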

  15. Bayesian Sparse Sampling: Observation 1 • Action value estimates are not equally important • Need better Q value estimates for some actions but not all • Preferentially expand the tree under actions that might be optimal • Biased tree growth: use Thompson sampling to select which actions to expand [Diagram: a MAX node with unevenly expanded subtrees]

  16. Bayesian Sparse Sampling: Observation 2 • Correct leaf value estimates to a common depth • Use the mean-MDP Q-value multiplied by the remaining depth [Diagram: alternating MAX and expectation levels at t = 1, 2, 3 with effective horizon N = 3]
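Read literally, the correction amounts to the following; the notation (effective horizon N, leaf depth t, mean-MDP action values Q-bar) is ours rather than the slide's, and the undiscounted form simply mirrors the slide's wording.

```latex
% Leaf value correction (our notation): a leaf (s, b) reached at depth t is
% credited with the mean-MDP value of its best action for each of the
% remaining N - t steps of the effective horizon.
\tilde{V}(s, b) \;\approx\; (N - t)\,\max_{a} \bar{Q}_{\text{mean MDP}}(s, a)
```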

  17. Bayesian Sparse Sampling: Tree growing procedure • Repeat until the tree size limit is reached: descend the sparse tree from the root (s, b), choosing actions by Thompson sampling (1. sample a model from the current belief, 2. solve its action values, 3. select the optimal action) and sampling outcomes (s, b, a) → (r, s', b'), until a new node is added • Then execute the chosen action, observe the reward, and move to the new root (s', b') • Control computation by controlling the tree size (see the sketch below)
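The following is a simplified, illustrative sketch of this loop, specialized to a Bernoulli bandit so the belief is a per-arm Beta posterior; the node structure, backup rule, and helper names are our own simplifications and gloss over details of the paper.

```python
import numpy as np

class Node:
    def __init__(self, alpha, beta, depth):
        self.alpha, self.beta, self.depth = alpha, beta, depth
        self.children = {}          # (arm, reward) -> child Node
        self.returns, self.visits = 0.0, 0

def thompson_arm(node, rng):
    """Sample a model from the belief, 'solve' it, pick its optimal action."""
    return int(np.argmax(rng.beta(node.alpha, node.beta)))

def grow_tree(root, horizon, size_limit, gamma, rng):
    """Grow the sparse tree until the size limit is reached."""
    size = 1
    while size < size_limit:
        node, path = root, []
        while True:                                  # descend from the root
            arm = thompson_arm(node, rng)
            p = node.alpha[arm] / (node.alpha[arm] + node.beta[arm])
            r = int(rng.random() < p)                # sample an outcome
            path.append((node, arm, r))
            if (arm, r) not in node.children:        # stop once a node is added
                a, b = node.alpha.copy(), node.beta.copy()
                a[arm] += r
                b[arm] += 1 - r
                node.children[(arm, r)] = Node(a, b, node.depth + 1)
                size += 1
                break
            node = node.children[(arm, r)]
        leaf = node.children[(arm, r)]
        # Leaf correction (Observation 2): mean-model value times remaining depth.
        value = max(horizon - leaf.depth, 0) * float(np.max(leaf.alpha / (leaf.alpha + leaf.beta)))
        for parent, arm, r in reversed(path):        # back the return up the path
            value = r + gamma * value
            parent.returns += value
            parent.visits += 1
    return root

# Usage sketch: root = Node(np.ones(5), np.ones(5), depth=0), then
# grow_tree(root, horizon=3, size_limit=50, gamma=1.0, rng=np.random.default_rng()).
```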

  18. Simple experiments • 5 Bernoulli bandits • Beta priors • Sampled a model from the prior • Ran the action selection strategies • Repeated 3000 times • Average accumulated reward per step
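A sketch of this protocol; the number of steps per run and the strategy interface are placeholders, since the slide does not specify them.

```python
import numpy as np

# 5 Bernoulli arms with Beta priors, models sampled from the prior, 3000
# repetitions, average accumulated reward per step.

def run_experiment(select_action, n_arms=5, n_steps=50, n_repeats=3000, seed=0):
    rng = np.random.default_rng(seed)
    rewards = []
    for _ in range(n_repeats):
        true_means = rng.beta(1.0, 1.0, size=n_arms)   # sample a bandit from the prior
        alpha = np.ones(n_arms)                        # Beta(1, 1) prior per arm
        beta = np.ones(n_arms)
        total = 0.0
        for _ in range(n_steps):
            arm = select_action(alpha, beta, rng)
            r = float(rng.random() < true_means[arm])
            alpha[arm] += r
            beta[arm] += 1.0 - r
            total += r
        rewards.append(total / n_steps)
    return float(np.mean(rewards))

# Example: average per-step reward of Thompson sampling under this protocol.
print(run_experiment(lambda a, b, rng: int(np.argmax(rng.beta(a, b)))))
```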

  19. That's it
