Fast approximate POMDP planning: Overcoming the curse of history!

Presentation Transcript


  1. Fast approximate POMDP planning: Overcoming the curse of history! Joelle Pineau, Geoff Gordon and Sebastian Thrun, CMU Point-based value iteration: an anytime algorithm for POMDPs Workshop on Advances in Machine Learning - June, 2003

  2. Why use a POMDP? • POMDPs provide a rich framework for sequential decision-making, which can model: • varying rewards across actions and goals • actions with random effects • uncertainty in the state of the world

  3. Existing applications of POMDPs • Maintenance scheduling • Puterman, 1994 • Robot navigation • Koenig & Simmons, 1995; Roy & Thrun, 1999 • Helicopter control • Bagnell & Schneider, 2001; Ng et al., 2002 • Dialogue modeling • Roy, Pineau & Thrun, 2000; Paek & Horvitz, 2000 • Preference elicitation • Boutilier, 2002

  4. POMDP Model A POMDP is an n-tuple { S, A, Ω, T, O, R }: S = state set A = action set Ω = observation set T(s,a,s') = state-to-state transition probabilities O(s,a,o) = observation generation probabilities R(s,a) = reward function [Figure: hidden states st-1, st (what goes on); observations ot-1, ot (what we see); actions at-1, at and beliefs bt-1, bt (what we infer)]
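
To make the tuple concrete, here is a minimal, purely illustrative sketch of how such a model can be written down in Python; the state, action, and observation names, the uniform probabilities, and the zero rewards are placeholders, not values from the talk.

```python
# A minimal sketch of the {S, A, Omega, T, O, R} tuple as plain Python data.
# All names and numbers below are illustrative only.
S = ["s1", "s2"]          # states
A = ["a1", "a2"]          # actions
Omega = ["o1", "o2"]      # observations

# T[(s, a)][s2] = Pr(s2 | s, a): state-to-state transition probabilities
T = {(s, a): {s2: 1.0 / len(S) for s2 in S} for s in S for a in A}

# O[(s2, a)][o] = Pr(o | s2, a): observation generation probabilities
O = {(s2, a): {o: 1.0 / len(Omega) for o in Omega} for s2 in S for a in A}

# R[(s, a)] = immediate reward
R = {(s, a): 0.0 for s in S for a in A}
```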

  5. Understanding the belief state • A belief is a probability distribution over states, where Dim(B) = |S|-1 • E.g. Let S={s1, s2} [Figure: belief simplex is the interval 0 ≤ P(s1) ≤ 1]

  6. Understanding the belief state • A belief is a probability distribution over states, where Dim(B) = |S|-1 • E.g. Let S={s1, s2, s3} [Figure: belief simplex is a triangle in the (P(s1), P(s2)) plane]

  7. Understanding the belief state • A belief is a probability distribution over states, where Dim(B) = |S|-1 • E.g. Let S={s1, s2, s3, s4} [Figure: belief simplex is a tetrahedron over the axes P(s1), P(s2), P(s3)]
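
The belief bt shown in the model slide is maintained by Bayesian filtering. The sketch below is a minimal illustration, assuming the model is given as callables T(s, a, s2) and O(s2, a, o) and a belief is a dict from state to probability; it also shows why the belief lives on an (|S|-1)-dimensional simplex.

```python
def update_belief(b, a, o, S, T, O):
    """Bayes-filter update of a belief (dict state -> probability):
    b'(s2) is proportional to O(s2, a, o) * sum_s T(s, a, s2) * b(s).
    Sketch only: T and O are assumed callables, not the slides' notation."""
    unnormalized = {
        s2: O(s2, a, o) * sum(T(s, a, s2) * b[s] for s in S)
        for s2 in S
    }
    total = sum(unnormalized.values())  # Pr(o | b, a)
    # The normalization constraint (entries sum to 1) is why Dim(B) = |S| - 1.
    return {s2: p / total for s2, p in unnormalized.items()}
```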

  8. The first curse of POMDP planning • The curse of dimensionality: • dimension of planning problem = # of states • related to the MDP curse of dimensionality

  9. POMDP value functions V(b) = expected total discounted future reward starting from b • Represent V as the upper surface of a set of hyper-planes. • V is piecewise-linear convex • Backup operator T: V → TV [Figure: V(b) as the upper surface of hyper-planes over the belief axis P(s1)]
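
Because V is the upper surface of a set of hyper-planes (alpha-vectors), evaluating it at a belief is just a max over dot products. A tiny sketch, with made-up alpha-vectors for a two-state problem:

```python
# A value function as the upper surface of a set of hyper-planes (alpha-vectors).
def value(b, alphas):
    """V(b) = max over alpha-vectors of the dot product alpha . b."""
    return max(sum(a_s * b_s for a_s, b_s in zip(alpha, b)) for alpha in alphas)

# Example with |S| = 2: two hyper-planes over the belief simplex.
alphas = [[1.0, 0.0], [0.2, 0.8]]
print(value([0.5, 0.5], alphas))  # upper surface at b = (0.5, 0.5) -> 0.5
```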

  10. Exact value iteration for POMDPs • Simple problem: |S|=2, |A|=3, |Ω|=2 Iteration / # hyper-planes: 0: 1 [Figure: V0(b) over P(s1)]

  11. Exact value iteration for POMDPs • Simple problem: |S|=2, |A|=3, |Ω|=2 Iteration / # hyper-planes: 0: 1, 1: 3 [Figure: V1(b) over P(s1)]

  12. Exact value iteration for POMDPs • Simple problem: |S|=2, |A|=3, |Ω|=2 Iteration / # hyper-planes: 0: 1, 1: 3, 2: 27 [Figure: V2(b) over P(s1)]

  13. Exact value iteration for POMDPs • Simple problem: |S|=2, |A|=3, |Ω|=2 Iteration / # hyper-planes: 0: 1, 1: 3, 2: 27, 3: 2187 [Figure: V2(b) over P(s1)]

  14. Exact value iteration for POMDPs • Simple problem: |S|=2, |A|=3, |Ω|=2 Iteration / # hyper-planes: 0: 1, 1: 3, 2: 27, 3: 2187, 4: 14,348,907 [Figure: V2(b) over P(s1)]
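
The counts above follow the usual growth of exact backups before pruning, |Γn+1| = |A|·|Γn|^|Ω|. A two-line check reproduces the numbers for |A|=3, |Ω|=2:

```python
# Number of hyper-planes after each exact backup, before pruning:
# |Gamma_{n+1}| = |A| * |Gamma_n| ** |Omega|, with |A| = 3, |Omega| = 2.
n_planes = 1
for n in range(1, 5):
    n_planes = 3 * n_planes ** 2
    print(n, n_planes)  # 1 3, 2 27, 3 2187, 4 14348907
```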

  15. Exact value iteration for POMDPs • Simple problem: |S|=2, |A|=3, |Ω|=2 → Many hyper-planes can be pruned away Iteration / # hyper-planes (after pruning): 0: 1, 1: 3, 2: 5, 3: 9, 4: 7, 5: 13, 10: 27, 15: 47, 20: 59 [Figure: pruned value function over P(s1)]

  16. Is pruning sufficient? |S|=20, |A|=6, |Ω|=8 Iteration / # hyper-planes: 0: 1, 1: 5, 2: 213, 3: ????? … Not for this problem!

  17. Certainly not for this problem! |S|=576, |A|=19, |Ω|=17 State Features: {RobotLocation, ReminderGoal, UserLocation, UserMotionGoal, UserStatus, UserSpeechGoal} [Figure: environment map showing the patient room, robot home, and physiotherapy locations]

  18. The second curse of POMDP planning • The curse of dimensionality: • the dimension of each hyper-plane = # of states • The curse of history: • the number of hyper-planes grows exponentially with the planning horizon

  19. The second curse of POMDP planning • The curse of dimensionality: • the dimension of each hyper-plane = # of states • The curse of history: • the number of hyper-planes grows exponentially with the planning horizon • Complexity of POMDP value iteration: roughly |S|²·|A|·|Γn|^|Ω| per exact backup, where the |S|² factor reflects dimensionality and the |A|·|Γn|^|Ω| factor reflects history

  20. Possible approximation approaches • Ignore the belief: overcomes both curses; very fast; performs poorly in high-entropy beliefs [Littman et al., 1995] • Discretize the belief: overcomes the curse of history (sort of); scales exponentially with # states [Lovejoy, 1991; Brafman, 1997; Hauskrecht, 1998; Zhou & Hansen, 2001] • Compress the belief: overcomes the curse of dimensionality [Poupart & Boutilier, 2002; Roy & Gordon, 2002] • Plan for trajectories: can diminish both curses; requires restricted policy class; local minima, small gradients [Baxter & Bartlett, 2000; Ng & Jordan, 2002] [Figure: sample trajectory tree over states s0, s1, s2]

  21. A new algorithm: Point-based value iteration • Main idea: • Select a small set of belief points [Figure: V(b) over P(s1) with belief points b0, b1, b2]

  22. A new algorithm: Point-based value iteration • Main idea: • Select a small set of belief points • Plan for those belief points only [Figure: V(b) over P(s1) with belief points b0, b1, b2]

  23. A new algorithm: Point-based value iteration • Main idea: • Select a small set of belief points → Focus on reachable beliefs • Plan for those belief points only [Figure: belief points b0, b1, b2 linked by (a,o) transitions]

  24. A new algorithm: Point-based value iteration • Main idea: • Select a small set of belief points → Focus on reachable beliefs • Plan for those belief points only → Learn value and its gradient [Figure: belief points b0, b1, b2 linked by (a,o) transitions]

  25. Point-based value update [Figure: V(b) over P(s1) with belief points b0, b1, b2]

  26. Point-based value update • Initialize the value function (…and skip ahead a few iterations) [Figure: Vn(b) over P(s1) with belief points b0, b1, b2]

  27. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: [Figure: Vn(b) over P(s1), focusing on a single belief point b]

  28. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → ba,o and find best value: [Figure: projected beliefs ba1,o1, ba1,o2, ba2,o1, ba2,o2 around b on the P(s1) axis]

  29. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → ba,o and find best value: [Figure: best hyper-plane found at each projected belief ba1,o1, ba2,o1, ba1,o2, ba2,o2]

  30. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → ba,o and find best value: • Sum over observations: [Figure: best hyper-planes at the projected beliefs ba1,o1, ba2,o1, ba1,o2, ba2,o2]

  31. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → ba,o and find best value: • Sum over observations: [Figure: the summed hyper-planes brought back to b on the P(s1) axis]

  32. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → ba,o and find best value: • Sum over observations: [Figure: one candidate hyper-plane per action, ba1 and ba2, under Vn+1(b)]

  33. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → ba,o and find best value: • Sum over observations: • Max over actions: [Figure: Vn+1(b) keeps the maximizing action's hyper-plane at b]

  34. Point-based value update • Initialize the value function (…and skip ahead a few iterations) • For each b∈B: • For each (a,o): Project forward b → ba,o and find best value: • Sum over observations: • Max over actions: [Figure: Vn+1(b) over P(s1) with belief points b0, b1, b2]
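
Putting the three steps of the update together, here is a compact sketch in Python. It assumes the model is given as callables T(s, a, s2), Oprob(s2, a, o), and R(s, a), that beliefs and alpha-vectors are lists aligned with the state ordering S, and that gamma is a free discount parameter; none of these names come from the slides.

```python
import itertools

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def point_based_backup(B, V, S, A, Obs, T, Oprob, R, gamma=0.95):
    """One point-based backup: project every alpha-vector forward for each
    (a, o), then for each belief b in B sum the best projections over
    observations and keep the maximizing action's vector. Sketch only."""
    # Step I: project every alpha-vector through every (action, observation).
    proj = {}
    for a, o in itertools.product(A, Obs):
        proj[(a, o)] = [
            [gamma * sum(Oprob(s2, a, o) * T(s, a, s2) * alpha[j2]
                         for j2, s2 in enumerate(S))
             for s in S]
            for alpha in V
        ]

    new_V = []
    for b in B:
        best_vec, best_val = None, float("-inf")
        for a in A:
            # Step II: sum over observations the projection that is best at b.
            vec = [R(s, a) for s in S]
            for o in Obs:
                best_proj = max(proj[(a, o)], key=lambda alpha: dot(alpha, b))
                vec = [v + p for v, p in zip(vec, best_proj)]
            # Step III: max over actions, evaluated at this belief point.
            if dot(vec, b) > best_val:
                best_vec, best_val = vec, dot(vec, b)
        new_V.append(best_vec)
    return new_V
```

The returned set has one alpha-vector per belief point, which is what keeps the solution size fixed from one iteration to the next.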

  35. Complexity of value update
Step: Exact update vs. Point-based update
I - Projection: S²·A·Ω·Γn vs. S²·A·Ω·B
II - Sum: S·A·Γn^Ω vs. S·A·Ω·B²
III - Max: S·A·Ω·Γn vs. S·A·Ω·B
where: S = # states, Γn = # solution vectors at iteration n, A = # actions, B = # belief points, Ω = # observations

  36. A bound on the approximation error • Bound error of the point-based backup operator. • Bound depends on how densely we sample belief points. • Let Δ be the set of reachable beliefs. • Let B be the set of belief points. Theorem: For any belief set B and any horizon n, the error of the PBVI algorithm εn = ||VnB - Vn*||∞ is bounded by: εn ≤ (Rmax - Rmin)·εB / (1 - γ)², where εB = max over b'∈Δ of (min over b∈B of ||b - b'||1)
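
The quantity the bound depends on is the density of B in the set of reachable beliefs. A small sketch of computing that density term, assuming beliefs are plain Python lists (names are illustrative):

```python
def l1(b1, b2):
    """L1 distance between two beliefs given as lists of probabilities."""
    return sum(abs(x - y) for x, y in zip(b1, b2))

def sample_density(B, reachable):
    """The density term in the bound: the worst-case L1 distance from any
    reachable belief to its nearest point in B. Sketch only."""
    return max(min(l1(b, bp) for b in B) for bp in reachable)
```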

  37. Experimental results: Lasertag domain State space = RobotPosition × OpponentPosition Observable: RobotPosition - always; OpponentPosition - only if same as Robot Action space = {North, South, East, West, Tag} Opponent strategy: Move away from robot w/ Pr=0.8 |S|=870, |A|=5, |Ω|=30

  38. Performance of PBVI on Lasertag domain [Figure: results comparing two methods; opponent tagged in 70% of trials vs. 17% of trials]

  39. Performance on well-known POMDPs (methods in order: QMDP, Grid, PBUA, PBVI)
Maze33 (|S|=36, |A|=5, |Ω|=17): Reward 0.198 / 0.94 / 2.30 / 2.25; Time(s) 0.19 / n.v. / 12166 / 3448; B - / 174 / 660 / 470
Hallway (|S|=60, |A|=5, |Ω|=20): Reward 0.261 / n.v. / 0.53 / 0.53; %Goal 47 / n.v. / 100 / 95; Time(s) 0.51 / n.v. / 450 / 288; B - / n.v. / 300 / 86
Hallway2 (|S|=92, |A|=5, |Ω|=17): Reward 0.109 / n.v. / 0.35 / 0.34; %Goal 22 / 98 / 100 / 98; Time(s) 1.44 / n.v. / 27898 / 360; B - / 337 / 1840 / 95

  40. Selecting good belief points • What can we learn from policy search methods? • Focus on reachable beliefs. [Figure: successors ba1,o1, ba1,o2, ba2,o1, ba2,o2 reachable from b via each (a,o) pair]

  41. Selecting good belief points • What can we learn from policy search methods? • Focus on reachable beliefs. • How can we avoid including all reachable beliefs? • Reachability analysis considers all actions, but chooses observations stochastically. [Figure: from b, each action is tried but only one sampled observation per action is expanded]

  42. Selecting good belief points • What can we learn from policy search methods? • Focus on reachable beliefs. • How can we avoid including all reachable beliefs? • Reachability analysis considers all actions, but chooses observations stochastically. • What can we learn from our error bound? • Select widely-spaced beliefs, rather than near-by beliefs. [Figure: only the widely-spaced successors ba2,o1 and ba1,o2 are kept]
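
A sketch of the expansion step these three slides describe: from every current point, try each action with a stochastically sampled observation, then keep the candidate successor that is farthest from the current set. Beliefs are assumed to be lists aligned with S, and T, Oprob, and update_belief are assumed callables; these names are illustrative, not from the slides.

```python
import random

def expand_beliefs(B, S, A, Obs, T, Oprob, update_belief):
    """One round of belief-set expansion: for each b in B, generate one
    candidate successor per action (with a sampled observation) and keep
    only the candidate farthest (L1 distance) from the current set B."""
    def l1_to_B(b):
        return min(sum(abs(x - y) for x, y in zip(b, b2)) for b2 in B)

    new_points = []
    for b in B:
        candidates = []
        for a in A:
            s = random.choices(S, weights=b)[0]                        # sample s ~ b
            s2 = random.choices(S, weights=[T(s, a, x) for x in S])[0]
            o = random.choices(Obs, weights=[Oprob(s2, a, x) for x in Obs])[0]
            candidates.append(update_belief(b, a, o))
        farthest = max(candidates, key=l1_to_B)
        if l1_to_B(farthest) > 0:                                      # skip duplicates of B
            new_points.append(farthest)
    return B + new_points
```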

  43. Validation of the belief expansion heuristic • Hallway domain: |S|=60, |A|=5, |Ω|=20

  44. Validation of the belief expansion heuristic • Tag domain: |S|=870, |A|=5, |Ω|=30

  45. The anytime PBVI algorithm • Alternate between: • Growing the set of belief points (e.g., B doubles in size every time) • Planning for those belief points • Terminate when you run out of time or have a good policy.

  46. The anytime PBVI algorithm • Alternate between: • Growing the set of belief points (e.g., B doubles in size every time) • Planning for those belief points • Terminate when you run out of time or have a good policy. • Lasertag results: • 13 phases: |B|=1334 • ran out of time!

  47. The anytime PBVI algorithm • Alternate between: • Growing the set of belief points (e.g., B doubles in size every time) • Planning for those belief points • Terminate when you run out of time or have a good policy. • Lasertag results: • 13 phases: |B|=1334 • ran out of time! • Hallway2 results: • 8 phases: |B|=95 • found good policy.
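
The anytime loop itself is short. A sketch with assumed callables expand, backup, and out_of_time, and an illustrative number of value updates per phase (not a value from the slides):

```python
def anytime_pbvi(B0, V0, expand, backup, out_of_time, updates_per_phase=10):
    """Anytime loop from the slides: alternate between growing the belief set
    (expand roughly doubles it) and planning for the current points, stopping
    when time runs out. All arguments are assumed callables/containers."""
    B, V = list(B0), list(V0)
    while not out_of_time():
        B = expand(B, V)                      # grow the set of belief points
        for _ in range(updates_per_phase):    # plan for those belief points
            V = backup(B, V)
    return B, V
```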

  48. Summary • POMDPs suffer from the curse of history • # of beliefs grows exponentially with the planning horizon • PBVI addresses the curse of history by limiting planning to a small set of likely beliefs. • Strengths of PBVI include: • anytime algorithm; • polynomial-time value updates; • bounded approximation error; • empirical results showing we can solve problems up to 870 states.

  49. Recent work • Current hurdle to solving even larger POMDPs: PBVI complexity is O(S²·A·Ω·B + S·A·Ω·B²) • Addressing S²: • Combine PBVI with belief compression techniques. But sparse transition matrices mean: S² → S • Addressing B²: • Use ball-trees to structure belief points. • Find better belief selection heuristics.
