
Solving POMDPs through Macro Decomposition




  1. Solving POMDPs through Macro Decomposition Larry Bush Tony Jimenez Brian Bairstow

  2. Outline • Introduction to POMDPs • Brian Bairstow • Demonstration of POMDPs • Larry Bush • Approximating POMDPs with Macro Actions • Tony Jimenez

  3. Introduction to POMDPs • Introduction to POMDPs • Markov Decision Processes (MDPs) • Value Iteration • Partially Observable Markov Decision Processes (POMDPs) • Overview of Techniques • Demonstration of POMDPs • Approximating POMDPs with Macro Actions

  4. Navigation of a Building • Robot • Knows building map • Wants to reach goal • Uncertainty in actions • What is the best way to get there?

  5. Markov Decision Processes • Model: states S; actions A; transition probabilities p(s,a); rewards r(s,a) • Process: observe state s_t in S; choose action a_t in A; receive reward r_t = r(s_t, a_t); the state becomes s_{t+1} according to the probabilities p(s,a) • Goal: create a policy π for choosing actions that maximizes the lifetime reward • Discount factor γ: Value = r_0 + γ r_1 + γ^2 r_2 + …
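To make the discounted value concrete, here is a minimal sketch (not from the slides) that accumulates a discounted return; the reward sequence and γ = 0.9 are illustrative.

```python
# Minimal sketch of the discounted value: Value = r_0 + gamma*r_1 + gamma^2*r_2 + ...
# The reward sequence and gamma = 0.9 below are illustrative, not from the slides.
def discounted_return(rewards, gamma=0.9):
    value = 0.0
    for t, r in enumerate(rewards):
        value += (gamma ** t) * r
    return value

print(discounted_return([0.0, 0.2, 1.0]))  # 0 + 0.9*0.2 + 0.81*1.0 = 0.99
```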

  6. MDP Model Example [State-transition diagram: states A, B, C connected by actions 1 and 2, with transition probabilities p and rewards r labeling the arcs] • States A, B, C • Actions 1, 2 • Transition probabilities p and rewards r in diagram • C is terminal state

  7. Decision Tree Representation of MDP [Decision tree from state A over one step: each action branches to successor states with their transition probabilities and rewards, the expected reward is computed for each action, and the max Q value is taken at the root]

  8. Decision Tree Representation of MDP [Decision tree from state A over two steps: leaf Q values are propagated back as total future rewards, and the max Q value is taken at each decision node]

  9. Value Iteration • Finite horizon, 1 step: Q_1(s,a) = r(s,a) • Finite horizon, n steps: Q_n(s,a) = r(s,a) + γ Σ_{s'} p(s'|s,a) max_{a'} Q_{n-1}(s',a') • Infinite horizon: Q(s,a) = r(s,a) + γ Σ_{s'} p(s'|s,a) max_{a'} Q(s',a') • Policy: π(s) = argmax_a Q(s,a)
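The recursion above maps directly onto code. The following is a minimal sketch, assuming the MDP is given as dictionaries r[s][a] (reward) and p[s][a][s2] (transition probability); these names and the fixed iteration count are illustrative, not taken from the presentation's MATLAB code.

```python
# Minimal value-iteration sketch over Q(s, a):
# Q_n(s,a) = r(s,a) + gamma * sum over s' of p(s'|s,a) * max_a' Q_{n-1}(s',a')
def value_iteration(states, actions, r, p, gamma=0.9, n_steps=50):
    Q = {s: {a: 0.0 for a in actions} for s in states}
    for _ in range(n_steps):
        Q_next = {}
        for s in states:
            Q_next[s] = {}
            for a in actions:
                future = sum(p[s][a][s2] * max(Q[s2].values()) for s2 in states)
                Q_next[s][a] = r[s][a] + gamma * future
        Q = Q_next
    policy = {s: max(Q[s], key=Q[s].get) for s in states}  # pi(s) = argmax_a Q(s,a)
    return Q, policy
```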

  10. Q Reinforcement Learning • Q can be calculated only if p(s,a) and r(s,a) are known • Otherwise Q can be trained: Q_{t+1}(s,a) = (1-β) Q_t(s,a) + β [R + γ max_{a'} Q_t(s',a')] • Perform trial runs; get data from observations • Keep running combinations of s and a until convergence
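A minimal sketch of the training update above; Q is a nested dictionary and the learning rate beta and discount gamma are illustrative values, not taken from the presentation. The update is applied once per observed transition (s, a, reward, s_next) during trial runs, until the values converge.

```python
# One tabular Q-learning update:
# Q_{t+1}(s,a) = (1 - beta) * Q_t(s,a) + beta * (R + gamma * max_a' Q_t(s',a'))
def q_update(Q, s, a, reward, s_next, beta=0.1, gamma=0.9):
    best_next = max(Q[s_next].values())
    Q[s][a] = (1 - beta) * Q[s][a] + beta * (reward + gamma * best_next)
```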

  11. Partially Observable MDPs • Can no longer directly observe the state of the system • Instead, at each step make an observation O, and know the probability of each observation for each state, p(O|S) • Also have an initial state distribution p(s_0) [Diagram: over time steps t1, t2, t3, the hidden states S evolve under p(s,a) while observations O are emitted under p(O|S); only the observations and the actions A are observable]

  12. State Estimation • States and observations are not independent of past observations • Maintain a distribution of state probabilities called a belief b: b_t = p(s_t | o_t, a_t, o_{t-1}, a_{t-1}, …, o_0, a_0, s_0) [Block diagram: observation o → State Estimator → belief b → Policy → action a]
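In practice the belief is maintained recursively with a Bayes filter rather than by conditioning on the whole history. Below is a minimal sketch of that state-estimator step, assuming transition probabilities p_trans[s][a][s2] and observation probabilities p_obs[s][o]; these names are illustrative.

```python
# State estimator sketch: after taking action a and seeing observation o,
# b'(s') is proportional to p(o | s') * sum over s of p(s' | s, a) * b(s).
def belief_update(b, a, o, states, p_trans, p_obs):
    b_new = {s2: p_obs[s2][o] * sum(p_trans[s][a][s2] * b[s] for s in states)
             for s2 in states}
    norm = sum(b_new.values())
    return {s2: v / norm for s2, v in b_new.items()}
```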

  13. Value Iteration with POMDPs [Plot: Q(b,a) versus b(s2) on [0, 1], showing the lines Q(a1) and Q(a2)] Q(a1) = Q(s1,a1)(1 - b(s2)) + Q(s2,a1) b(s2) • The finite-horizon value function is piecewise linear and convex • Depending on the value of b(s2), a different action is optimal • This is a one-step finite-horizon problem • Need to consider the effect on future belief states
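For the two-state, one-step case on this slide, each action's value is a line in b(s2) and the value function is the upper surface of those lines. A minimal sketch, with illustrative Q(s,a) numbers (not the slide's):

```python
# One-step POMDP value over belief b(s2): each action gives a line in b(s2),
# and the optimal value is the max over those lines (piecewise linear, convex).
Q = {"a1": {"s1": 1.0, "s2": 0.0},   # illustrative values only
     "a2": {"s1": 0.2, "s2": 0.8}}

def one_step_value(b_s2):
    vals = {a: q["s1"] * (1 - b_s2) + q["s2"] * b_s2 for a, q in Q.items()}
    best = max(vals, key=vals.get)
    return best, vals[best]

print(one_step_value(0.3))  # a1 is optimal for small b(s2)
print(one_step_value(0.9))  # a2 becomes optimal for large b(s2)
```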

  14. Decision Tree Representation of POMDP [Tree: from belief b1, each action (a1, a2) branches on observations (o1, o2) to successor beliefs (b211, b212, b221, b222), each of which branches on actions again]

  15. Approximation: Point-Based • Concentrate on specific points in continuous belief space • How should beliefs be chosen? • Problems with high dimensionality [Diagram: scattered belief points b1–b8 in the belief space]

  16. Approximation: Monte Carlo • Estimate values of actions for samples of belief space • Interpolate for arbitrary values • Computationally difficult to learn a policy [Diagram: the value of action a1 estimated at sampled beliefs b1, bi, bj, bk]
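A minimal sketch of the Monte Carlo idea, under stated assumptions: action values are estimated at a set of sampled beliefs by averaging simulated returns, and queries at arbitrary beliefs fall back to the nearest sampled belief. simulate_return is a hypothetical rollout simulator, and nearest-neighbor is just one possible interpolation scheme.

```python
# Monte Carlo approximation sketch: estimate Q(b, a) at sampled beliefs,
# then interpolate (here: nearest neighbor) for arbitrary beliefs.
def estimate_action_values(sampled_beliefs, actions, simulate_return, n_rollouts=100):
    # simulate_return(b, a) is a hypothetical simulator returning one sampled return.
    return [{a: sum(simulate_return(b, a) for _ in range(n_rollouts)) / n_rollouts
             for a in actions}
            for b in sampled_beliefs]

def value_at(belief, sampled_beliefs, value_table):
    dist = lambda x, y: sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    i = min(range(len(sampled_beliefs)), key=lambda j: dist(belief, sampled_beliefs[j]))
    return value_table[i]
```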

  17. Approximation: Grid-Based • Like point-based approximation, but with regular spacing • Can use variable resolution • Size increases with dimensionality and resolution [Diagram: regularly spaced grid of belief points b1–b14]

  18. Heuristic: Maximum-Likelihood • Assume that the state is the most likely state, and perform the corresponding MDP • Only works if this is a good assumption! • Example: the POMDP belief b = <0.08, 0.02, 0.1, 0.8> is treated as the MDP state b = <0, 0, 0, 1>
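A minimal sketch of the maximum-likelihood heuristic, assuming the belief is a dictionary mapping states to probabilities and mdp_policy is a hypothetical policy precomputed for the underlying MDP:

```python
# Maximum-likelihood heuristic sketch: collapse the belief onto its most likely
# state and act as the corresponding MDP policy would in that state.
def most_likely_action(belief, mdp_policy):
    most_likely_state = max(belief, key=belief.get)
    return mdp_policy[most_likely_state]

# Example: the belief <0.08, 0.02, 0.1, 0.8> over states s1..s4 collapses to s4.
belief = {"s1": 0.08, "s2": 0.02, "s3": 0.1, "s4": 0.8}
print(max(belief, key=belief.get))  # s4
```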

  19. Heuristic: Information Gathering • When the belief is certain (e.g., b = < 1, 0, 0, 0 >), the underlying MDP is easy to solve • When the belief is uncertain (e.g., b = < 0.1, 0.2, 0.4, 0.3 >), it is valuable to take actions that learn about the state

  20. Heuristic: Hierarchical Method • Split into a top-level POMDP and low-level POMDPs • Smaller state space reduces the amount of computation [Diagram: the belief space partitioned into regions, each handled by a low-level POMDP and coordinated by a top-level POMDP of POMDPs]

  21. Heuristic: Policy Search • Reinforcement learning on policy instead of value • Value and Policy Search • Combine value and policy reinforcement learning into a single update rule

  22. Outline • Introduction to POMDPs • Demonstration of POMDPs • Problem : UAV Navigation • MDP Visualizations • POMDP Visualizations • Approximating POMDPs with Macro Actions

  23. Demonstrations* • MATLAB • MDP • Map to Problem Domain • Value Iteration • Policy Execution • Visualization** • POMDP • Map to Problem Domain • Belief Space Search • Policy Execution • Visualization** * All coding is mine except for some basic MDP value iteration functionality. ** Visualizations run in line with code.

  24. Problem: Introduction [Visualization: the UAV's task is to land at the airport; the map marks the landing runway and the take-off runway, with an info legend for the UAV]

  25. Outline • Introduction to POMDPs • Demonstration of POMDPs • Problem : UAV Navigation • MDP Visualizations • Value Iteration • Policy Execution • POMDP Visualizations • Approximating POMDPs with Macro Actions

  26. Problem: Utility Overlay [MDP visualization: grid map with the utility overlay; the legend shows rewards (#), state utilities, the transition model with probabilities .8/.1/.1, a step cost of .04, and an exclusion zone]

  27. MDP Value Iteration [Visualization frame: value iteration run until convergence; the legend shows rewards (#), state utilities, the .8/.1/.1 transition model, the .04 step cost, and the resulting policy]

  28. MDP Value Iteration [Visualization frame: continuation of the value-iteration animation]

  29. MDP Policy Execution [Visualization frame: the UAV executing the converged policy on the grid]

  30. MDP Policy Execution [Visualization frame: continuation of the policy-execution animation]

  31. MDP Policy Execution [Visualization frame: continuation of the policy-execution animation]

  32. MDP Policy Execution [Visualization frame: continuation of the policy-execution animation]

  33. Outline • Introduction to POMDPs • Demonstration of POMDPs • Problem : UAV Navigation • MDP Visualizations • POMDP Visualizations • Belief State Space Search • Policy Comparison • 3 Step Horizon • 10 Step Horizon • AIMA (Russell & Norvig) • Fuel Observation • Approximating POMDPs with Macro Actions

  34. POMDP Belief State Space Search [Visualization frame: belief state space search with no observations and a 3-step horizon; the legend shows rewards (#), state probabilities, the total belief state reward, and the policy]

  35. POMDP Belief State Space Search [Visualization frame: continuation of the belief state space search animation]

  36. POMDP Policy (3 Iterations) [Visualization frame: policy execution with no observations] Policy: < Right, Right, Right >

  37. POMDP Policy (3 Iterations) [Visualization frame: continuation of the policy-execution animation] Policy: < Right, Right, Right >

  38. POMDP Policy (10 Iterations) [Visualization frame: policy execution with no observations] Policy: < Left, Up, Up, Right, Up, Up, Right, Right, Right, Right >

  39. POMDP Policy (10 Iterations) [Visualization frame: continuation of the policy-execution animation] Policy: < Left, Up, Up, Right, Up, Up, Right, Right, Right, Right >

  40. POMDP Policy (AIMA) [Visualization frame: policy execution with no observations on the AIMA (Russell & Norvig) grid world] Policy: < Left, Up, Up, Right, Up, Up, Right, Up, Up, Right >

  41. POMDP Policy (AIMA) [Visualization frame: continuation of the policy-execution animation] Policy: < Left, Up, Up, Right, Up, Up, Right, Up, Up, Right >

  42. POMDP Policy (Fuel Observation) [Visualization frame: policy execution with a fuel observation] Policy: < Up, Up, Right, Right, Right, Up, Right, Right, Right, Right >

  43. POMDP Policy (Fuel Observation) [Visualization frame: continuation of the policy-execution animation] Policy: < Up, Up, Right, Right, Right, Up, Right, Right, Right, Right >

  44. Demonstration Conclusion • POMDPs are difficult to solve exactly • Efficient approximation methods are needed

  45. Outline • Introduction to POMDPs • Demonstration of POMDPs • Approximating POMDPs with Macro Actions • Belief Compression using Grid Approximation • Macro-Actions in the Semi-MDP framework • Reinforcement Learning with a model

  46. Approximation Methods • Point-Based Approximations • Monte Carlo Method • Grid-Based Approximations • Dynamic Grid-Resolution

  47. Policy Heuristics • Maximum-Likelihood • Information Gathering • Hierarchical Methods • Macro-Actions • Policy Search

  48. Grid Approximation [Diagram: the belief simplex over three states S1 (1,0,0), S2 (0,1,0), S3 (0,0,1), discretized with regular grids at resolution 1, resolution 2, and resolution 4]
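A minimal sketch of a regular grid over the belief simplex: at resolution m, grid points are beliefs whose entries are multiples of 1/m that sum to 1, so resolution 1 gives just the corners shown on the slide. This is one common construction; the slides themselves show no code.

```python
# Regular grid on the belief simplex over n states at a given resolution m:
# all beliefs whose probabilities are integer multiples of 1/m and sum to 1.
def simplex_grid(n_states, resolution):
    def compositions(total, parts):
        if parts == 1:
            return [(total,)]
        return [(k,) + rest
                for k in range(total + 1)
                for rest in compositions(total - k, parts - 1)]
    return [tuple(k / resolution for k in point)
            for point in compositions(resolution, n_states)]

print(simplex_grid(3, 1))       # resolution 1: just the corners S1, S2, S3
print(len(simplex_grid(3, 2)))  # resolution 2: 6 grid points
```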

  49. Dynamic Grid Allocation • Allocate grid points from a uniformly spaced grid dynamically • Simulate trajectories of the agent through the belief space • When experiencing belief state: • Find grid point closest to belief state • Add it to set of grid points explicitly considered • Derived set of grid points is small and adapted to parts of belief space typically inhabited by agent
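A minimal sketch of the allocation step described above, assuming beliefs and grid points are tuples of probabilities and that "closest" means Euclidean distance (the slides do not specify the metric):

```python
# Dynamic grid allocation sketch: as simulated trajectories visit beliefs,
# add the closest point of the underlying uniform grid to the active set.
def closest_grid_point(belief, grid):
    return min(grid, key=lambda g: sum((b - p) ** 2 for b, p in zip(belief, g)))

def allocate_grid_points(trajectory_beliefs, grid):
    active = set()
    for b in trajectory_beliefs:
        active.add(closest_grid_point(b, grid))
    return active
```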

  50. Macro-Actions • Temporally extended actions • For example: • Move-Down-Corridor • Go-to-Chicago • Versus primitive actions, such as: • Turn-Right • One-Step-Forward • POMDPs break down into a hierarchical structure of smaller POMDPs
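One common way to represent a macro-action is as an option: an initiation condition, an internal policy over primitive actions, and a termination condition. The sketch below illustrates that structure; it is an assumed representation, not the authors' implementation, and step() is a hypothetical belief-update/environment function.

```python
from dataclasses import dataclass
from typing import Callable

# Macro-action (option) sketch: an initiation test, an internal policy over
# primitive actions, and a termination test, all defined over beliefs.
@dataclass
class MacroAction:
    name: str
    can_start: Callable[[dict], bool]
    policy: Callable[[dict], str]        # belief -> primitive action, e.g. "Turn-Right"
    should_stop: Callable[[dict], bool]

def run_macro(macro, belief, step, max_steps=100):
    # step(belief, action) is a hypothetical function returning the next belief.
    for _ in range(max_steps):
        if macro.should_stop(belief):
            break
        belief = step(belief, macro.policy(belief))
    return belief
```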
