
Approximate POMDPs using Point-based Value Iteration


Presentation Transcript


  1. Approximate POMDPs using Point-based Value Iteration Ryan Luna 21 March 2013

  2. The More You Know G. Shani, J. Pineau and R. Kaplow. A Survey of Point-based POMDP Solvers. Autonomous Agents and Multi-Agent Systems. 2012. T. Smith and R. Simmons. Heuristic Search Value Iteration for POMDPs. Uncertainty in Artificial Intelligence. 2004. J. Pineau, G. Gordon and S. Thrun. Point-based Value Iteration: An anytime algorithm for POMDPs. International Joint Conference on Artificial Intelligence. 2003. S. Thrun, W. Burgard and D. Fox. Probabilistic Robotics. MIT Press. 2005.

  3. POMDP • Solving a POMDP is very similar to an MDP • The similarities: • State transitions are still stochastic • Value function is a function of our current “state” • We still perform Bellman backups to compute V • The differences: • We have a probability distribution over where we are • We receive stochastic observations of the underlying state and update our belief accordingly
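
To fix notation for the rest of the deck, a minimal formal sketch (standard notation, consistent with the Shani et al. survey): a POMDP is a tuple ⟨S, A, T, R, Ω, O⟩ with T(s, a, s′) = p(s′ | s, a) and O(a, s′, z) = p(z | a, s′). The belief b plays the role of the state, so the Bellman backup mirrors the MDP case:

    V_{n+1}(b) = \max_a \Big[ \sum_s b(s)\, R(s,a) + \gamma \sum_{z \in \Omega} p(z \mid b, a)\, V_n\big(\tau(b,a,z)\big) \Big]

where τ(b, a, z) is the belief update covered on slide 15.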

  4. Let’s solve a POMDP! [Figure: two-state example — states x1 and x2, actions u1 and u2 with their payoffs, a state-transition action u3, and measurements.]

  5. Make a decision… NOW

  6. Make a decision… NOW

  7. Make a decision… NOW

  8. You Said We Were Solving A POMDP • Fine. We sense z1. Now what? • We have gained information. Update our value function! p(z1 | x1) = 0.7, p(z1 | x2) = 0.3

  9. I’m a Beliefer [Figure: the value function V1(b), the belief update b′(b | z1), and the resulting V1(b | z1) after observing z1.]

  10. More Formally (AKA maths)
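
The maths on this slide did not survive the transcript. A reconstruction for the two-state example, consistent with Probabilistic Robotics Ch. 15 and the sensor model p(z1 | x1) = 0.7, p(z1 | x2) = 0.3 from slide 8: writing the belief as b = (p1, 1 − p1), Bayes’ rule gives

    p(z_1) = 0.7\,p_1 + 0.3\,(1 - p_1), \qquad p_1' = \frac{0.7\,p_1}{p(z_1)}

and V1(b | z1) is just V1 evaluated at the updated belief p1′.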

  11. HEY! You said POMDP… • Geez. OK. We don’t know that we observed z1. We must compute expected value.

  12. HEY! You said POMDP… • Geez. OK. We don’t know that we observed z1. We must compute expected value.

  13. HEY! You said POMDP… • Geez. OK. We don’t know that we observed z1. We must compute expected value.
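
The equations lost from these three slides compute that expectation. In the same notation, since we don’t know which measurement will arrive,

    \bar{V}_1(b) = \sum_i p(z_i)\, V_1(b \mid z_i)

and because each V_1(b \mid z_i) is a maximum of linear functions carrying a 1/p(z_i) factor from the belief update, the p(z_i) terms cancel: \bar{V}_1(b) is again piecewise linear and convex in the belief.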

  14. Value of Sensing [Figure: the value function before sensing vs. after sensing.]

  15. Belief Update
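
The update itself is a standard Bayes filter (in the notation of the earlier slides): after taking action a in belief b and observing z,

    b'(s') = \tau(b,a,z)(s') = \frac{O(a, s', z) \sum_s T(s, a, s')\, b(s)}{p(z \mid b, a)}

with normalizer p(z \mid b, a) = \sum_{s'} O(a, s', z) \sum_s T(s, a, s')\, b(s).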

  16. Lather, Rinse, Repeat • We just did a full backup for T=1! • Repeat for T=2.
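
For concreteness, a minimal numpy sketch of one full (exact) backup over a set of α-vectors. The array layouts (T[a, s, s′] = p(s′ | s, a), O[a, s′, z] = p(z | s′, a), R[s, a]) are assumptions for illustration, and pruning of dominated vectors is omitted:

    import numpy as np
    from itertools import product

    def exact_backup(V, T, O, R, gamma):
        """One exact Bellman backup over a list V of alpha vectors (each shape (S,))."""
        A, S, _ = T.shape
        Z = O.shape[2]
        new_V = []
        for a in range(A):
            # Project every alpha vector back through each (action, observation) pair:
            # alpha_{a,z,i}(s) = gamma * sum_{s'} T[a,s,s'] * O[a,s',z] * alpha_i(s')
            proj = [[gamma * T[a] @ (O[a, :, z] * alpha) for alpha in V]
                    for z in range(Z)]
            # Cross-sum: choose one projected vector per observation (this is the
            # |V|^|Omega| blowup) and add the immediate reward for action a.
            for choice in product(*proj):
                new_V.append(R[:, a] + np.sum(choice, axis=0))
        return new_V  # dominated-vector pruning omitted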

  17. Value of Sensing

  18. Curse of Sensing

  19. The POMDP Hex • POMDPs have both the curse of dimensionality and the curse of history. • Scholars maintain that history is the truly unmanageable part. O(|V| × |A| × |Ω| × |S|² + |A| × |S| × |V|^|Ω|) (first term: belief updates after taking an action; second, exponential term: value backups over sensor measurements)
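
Why history is the killer, in one recurrence: each exact backup turns the current vector set into

    |V_{t+1}| = |A| \cdot |V_t|^{|\Omega|}

so with even two observations the exponent roughly doubles every step: the doubly-exponential growth tabulated on the next slide.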

  20. The POMDP Hex • T = 1; |V| = 4 • T = 3; |V| ≈ 64 • T = 20; |V| ≈ 10^547864 • T = 30; |V| ≈ 10^561012337

  21. So, How can we address this? • You fly to Paris to enjoy a nice croque monsieur

  22. Here we go • Point-based Value Iteration • Heuristic-search Value Iteration

  23. Point-based Value Iteration • Addresses two major concerns: • Exact value iteration treats every possible belief equally, no matter how absurd • The exact value function is probably unnecessary anyway • How do we do this? • Maintain a set of belief points over which the value function is computed • Only keep α-vectors that maximize the value of at least one member of the belief set • Focus the search on the most probable beliefs

  24. Point-based Value Iteration • Maintains a fixed set of belief points, B • The value function is computed only over B • B only contains reachable beliefs • The number of constraints (|V|) is fixed at |B| • Value updates are now polynomial-time • PBVI provides an anytime solution that converges to the optimal value function • The error in the value function is bounded

  25. What does it Mean? • Start with a small number of reachable beliefs • Point-based backup instead of full Bellman backup (see the sketch below) • Implicitly prunes the value simplex • Increase the number of beliefs until timeout O(|V| × |A| × |Ω| × |S|² + |B| × |A| × |S| × |Ω|)
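
A minimal numpy sketch of the point-based backup, under the same assumed array layouts as the exact-backup sketch above (T[a, s, s′], O[a, s′, z], R[s, a]). Each belief contributes at most one new vector per sweep, which is where the polynomial-time bound on slide 24 comes from:

    import numpy as np

    def point_based_backup(b, V, T, O, R, gamma):
        """backup(b): the single best alpha vector at belief b."""
        A, S, _ = T.shape
        Z = O.shape[2]
        best, best_val = None, -np.inf
        for a in range(A):
            g_a = R[:, a].astype(float)
            for z in range(Z):
                # Project each alpha vector back through (a, z), keeping only the
                # projection that scores best at this particular belief b.
                proj = [gamma * T[a] @ (O[a, :, z] * alpha) for alpha in V]
                g_a = g_a + max(proj, key=lambda g: g @ b)
            if g_a @ b > best_val:
                best, best_val = g_a, g_a @ b
        return best

    def pbvi_sweep(B, V, T, O, R, gamma):
        # One PBVI iteration: one backup per belief point, so |V'| <= |B|.
        return [point_based_backup(b, V, T, O, R, gamma) for b in B]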

  26. Heuristic Search Value Iteration • Popular flavor of PBVI • Upper and lower bound on value estimate • Uses several powerful heuristics • Anytime. Gets arbitrarily close* to optimal • Performs a depth-first search into belief space • Depth is bounded

  27. Upper and Lower Bounds! • Lower bound is the standard vector simplex • Upper bound is a set of belief/value points • Search ends when the difference in bounds at the initial belief is < ε • Initialization? • Lower bound is easy • Upper bound is a solution to the underlying MDP
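
One standard initialization (this follows the survey’s presentation; it is not recoverable from the slide itself): the lower bound can start from “best action forever” vectors α_a solving α_a = R(·, a) + γ T_a α_a (or even a single conservative vector with every entry min_{s,a} R(s,a)/(1 − γ)), while the upper bound takes the MDP solution, which ignores partial observability and is therefore optimistic:

    \underline{V}(b) = \max_a \sum_s b(s)\, \alpha_a(s), \qquad \bar{V}(b) = \sum_s b(s)\, V_{\mathrm{MDP}}(s)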

  28. We are doing Heuristic Search • Startling observation: the quality of the value estimate at a successor affects its predecessor • width = difference between the upper and lower bounds • Want to choose successors that minimize the width at the initial belief • What does this mean for successors? • We have to pick actions and observations

  29. IE-Max Heuristic • OK. Pick an action… • Select the one with the max upper bound
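
In symbols (per the Smith & Simmons paper): act greedily with respect to the upper bound,

    a^* = \arg\max_a \bar{Q}(b, a) = \arg\max_a \Big[ R(b,a) + \gamma \sum_z p(z \mid b, a)\, \bar{V}\big(\tau(b,a,z)\big) \Big]

i.e., optimism in the face of uncertainty.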

  30. Excess Uncertainty Heuristic • To complete the deal, we need an observation
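
The observation choice, again per Smith & Simmons: at depth t, descend into the child that contributes the most slack to the root,

    z^* = \arg\max_z\; p(z \mid b, a^*)\,\big(\mathrm{width}(\tau(b, a^*, z)) - \epsilon\,\gamma^{-(t+1)}\big)

where width(b) = \bar{V}(b) - \underline{V}(b); the ε γ^{-(t+1)} term is the slack a node that deep is allowed (its “excess uncertainty”).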

  31. Search Depth • Depth-first search strikes fear into the hearts of even the strongest men • Our search is bounded at depth t… phew • Once we expand a belief whose gap satisfies width(b) ≤ ε γ^{-t} (its excess uncertainty is no longer positive), the search ceases

  32. Updating the Value Function • When search ceases, perform full Bellman backups in reverse order • Just insert the constraint vector for the lower bound • Update the upper bound based on expected value • Both bounds are uniformly improvable
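
In symbols, the two local updates at a visited belief b (standard HSVI form): the lower bound adds the point-based backup vector from the PBVI sketch above, \underline{V} ← \underline{V} ∪ {backup(b)}, while the upper bound stores the new point

    \bar{V}(b) \leftarrow \max_a \Big[ R(b,a) + \gamma \sum_z p(z \mid b, a)\, \bar{V}\big(\tau(b,a,z)\big) \Big]

interpolated over the existing belief/value points. Each update can only tighten its bound pointwise, which is what “uniformly improvable” buys us.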

  33. Wait, wat? [Figure: forward-search tree rooted at belief b0; actions a0 … an branch at t = 0, observations z0 … zm branch beneath each action, continuing down to depth t = k.]

  34. Wait, wat? [Figure: the same search tree, annotated: the action branch with the max upper bound and the observation branch with max “excess uncertainty” are highlighted.]

  35. Wait, wat? [Figure: the same search tree, following the selected action and observation branches down to depth t = k.]

  36. Properties of HSVI • Upper and lower bounds monotonically converge to optimal value function • Local updates preserve improvability • Maximum regret is ε • Finite search depth • Finite depth → finite Bellman updates

  37. Rock-Sample • Deterministic motions • Noisy sensor for rock goodness • +10 for sampling a good rock • -10 for sampling a bad rock • +10 for exiting • No other cost/reward

  38. Rock-Sample [Figure: results on the Rock-Sample domain.]

  39. More Results • Lots of comparisons in the original paper

  40. What did we learn • Exact POMDP solution utterly infeasible • Even for the tiniest problem ever • History is (probably) worse than dimensionality • Approximate solutions have better properties • Anytime POMDP solutions are possible • Scales to hundreds or even thousands of states • The problem is really really really really hard
