
Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up


Presentation Transcript


  1. Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up. AAMAS 2012. Ekhlas Sonu, Prashant Doshi, Dept. of Computer Science, University of Georgia

  2. Overview
  • We generalize Bounded Policy Iteration for POMDPs to the multiagent decision-making framework of Interactive POMDPs
  • We discuss the challenges associated with this generalization
  • Substantial scalability is achieved using the generalized approach

  3. Introduction: Interactive POMDP
  • Interactive POMDP (Gmytrasiewicz & Doshi, 2005): a generalization of the POMDP to multiagent settings
  • Applications: money laundering (Ng et al., 2010), the lemonade stand game (Wunder et al., 2011), modeling human behavior (Doshi et al., 2010), and more
  • Differs from the Dec-POMDP:
    • Dec-POMDP: a team of agents
    • I-POMDP: an individual agent in the presence of other agents, in cooperative, competitive or neutral settings

  4. Introduction: I-POMDP (finitely nested, two agents)
  • I-POMDPi,l = ⟨ISi,l, A, Ωi, Ti, Oi, Ri, γ⟩
  • ISi,l = S × Θj,l-1, the interactive states, where S is the set of physical states and Θj,l-1 is the set of intentional models of j at level l-1
  • A = Ai × Aj
  • Ωi: set of observations of i
  • Ti: S × Ai × Aj → Δ(S)
  • Oi: S × Ai × Aj → Δ(Ωi)
  • Ri: S × Ai × Aj → ℝ
  [Slide figure: agents i and j acting on the physical states S, with ai/Ti(s, ai, aj, s′), aj/Tj(s, ai, aj, s′), oi/Oi(s′, ai, aj, oi), Ri(s, ai, aj) and oj/Oj(s′, ai, aj, oj), Rj(s, ai, aj)]
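As a minimal illustration of the interactive state space ISi,l = S × Θj,l-1 (not from the paper; the state and model names below are made up), each interactive state pairs a physical state with a model of the other agent:

```python
from itertools import product

# Hypothetical example: physical states of a small domain and a handful of
# level l-1 models of agent j (e.g., nodes of j's controller).
physical_states = ["tiger-left", "tiger-right"]
models_of_j = ["nj_0", "nj_1", "nj_2"]

# The interactive state space is the cartesian product S x Theta_{j,l-1}.
interactive_states = list(product(physical_states, models_of_j))
print(interactive_states)
# [('tiger-left', 'nj_0'), ('tiger-left', 'nj_1'), ..., ('tiger-right', 'nj_2')]
```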

  5. I-POMDP Belief Update and Value Function
  • Belief Update: An agent must predict the other agent's actions by anticipating that agent's updated beliefs over time. The belief update therefore consists of:
    • updating the distribution over physical states, using agent i's transition and observation functions
    • updating the distribution over dynamic models, using the other agent's belief update and its observation function (see the sketch below)
  • Value Function: must incorporate the I-POMDP belief update when computing long-term rewards
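For reference, a sketch of the finitely nested belief update in the form given by Gmytrasiewicz & Doshi (2005); this is a reconstruction from that paper rather than the slide's own equation, with τj denoting agent j's belief update and is = (s, θj,l-1):

$$
b_{i,l}^{t}(is^{t}) \;\propto\; \sum_{is^{t-1}} b_{i,l}^{t-1}(is^{t-1}) \sum_{a_{j}^{t-1}} \Pr(a_{j}^{t-1} \mid \theta_{j,l-1}^{t-1})\; T_{i}(s^{t-1}, a_{i}^{t-1}, a_{j}^{t-1}, s^{t})\; O_{i}(s^{t}, a_{i}^{t-1}, a_{j}^{t-1}, o_{i}^{t}) \sum_{o_{j}^{t}} O_{j}(s^{t}, a_{i}^{t-1}, a_{j}^{t-1}, o_{j}^{t})\; \tau_{j}\big(b_{j,l-1}^{t-1}, a_{j}^{t-1}, o_{j}^{t}, b_{j,l-1}^{t}\big)
$$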

  6. Solving I-POMDP (Related Work)
  • Previous work: value iteration algorithms
    • Interactive particle filtering (I-PF) (Doshi & Gmytrasiewicz, 2009): a nested particle filter, i.e., a sampled recursive representation of the agent's nested belief
    • Interactive point-based value iteration (I-PBVI) (Doshi & Perez, 2008): a point-based domination check
  • Both iteratively apply the backup operator, which is expensive, and scale only to toy problems
  • Over multiple time steps: curse of history and curse of dimensionality
  [Slide figure: the belief b = Δ(ISi,l) over the physical states S and the models of j]

  7. Background
  • Policy iteration: a class of solution algorithms that search the policy space; suffers from exponential growth in solution size
  • Bounded Policy Iteration (Poupart & Boutilier, 2003): fixed solution size (controlled growth); applied to POMDPs and Dec-POMDPs
    • Dec-BPI (Bernstein, Hansen & Zilberstein, 2005) uses an optional correlation device, which may not be feasible in non-cooperative settings
  • Contribution: we present the first (approximate) policy iteration algorithm for I-POMDPs, a generalization of BPI, and show scalability to larger problems

  8. Policy Representation
  • Possible representations of a policy: a tree, or a finite state controller (Hansen, 1998), in which a node is labeled with an action and an edge with an observation
  • A node has an infinite-horizon policy rooted at it, and a value vector associated with it that is linear over the entire belief space
  • Beliefs are mapped to the node n that optimizes the expected reward from that belief, i.e., argmaxn b·Vn (see the sketch below)
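A minimal sketch (with made-up value vectors) of mapping a belief to the controller node that maximizes expected reward, argmaxn b·Vn:

```python
import numpy as np

# Illustrative value vectors of three controller nodes over a two-state problem.
# Each vector is linear over the belief simplex.
V = {
    "n0": np.array([10.0, -5.0]),
    "n1": np.array([-5.0, 10.0]),
    "n2": np.array([2.0, 2.0]),
}

def best_node(belief):
    """Return the node whose value vector maximizes b . V_n at this belief."""
    return max(V, key=lambda n: float(np.dot(belief, V[n])))

print(best_node(np.array([0.9, 0.1])))  # -> 'n0'
print(best_node(np.array([0.4, 0.6])))  # -> 'n1'
```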

  9. Finite State Controller
  • A finite state controller for agent i is defined by its set of nodes and its set of edge labels (Ωi)
  • Each node has a value vector; together, the value vectors of the nodes partition the entire belief space into regions in which one node is optimal

  10. Policy Iteration
  • Starting with an initial controller, iterate over two steps until convergence:
    • Policy evaluation: evaluate Vn for each node by solving a system of linear equations (see the sketch below)
    • Policy improvement: construct a better controller, possibly by adding new nodes
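A minimal sketch of policy evaluation for a stochastic finite state controller in the single-agent POMDP case; the array shapes and names below are assumptions for illustration, not the paper's notation:

```python
import numpy as np

def evaluate_fsc(T, O, R, psi, eta, gamma):
    """Solve V = r + gamma * M V for the node-state values V[n, s] of a
    stochastic finite state controller.

    Assumed shapes (illustrative only):
      T[s, a, s']       transition probabilities
      O[s', a, o]       observation probabilities
      R[s, a]           rewards
      psi[n, a]         P(a | n), node action policy
      eta[n, a, o, n']  P(n' | n, a, o), node transition policy
    """
    N, A = psi.shape
    S = T.shape[0]
    # Expected immediate reward at each (node, state) pair.
    r = np.einsum("na,sa->ns", psi, R)
    # Transition kernel over (node, state) pairs induced by the controller.
    M = np.einsum("na,sap,pao,naom->nsmp", psi, T, O, eta).reshape(N * S, N * S)
    V = np.linalg.solve(np.eye(N * S) - gamma * M, r.reshape(N * S))
    return V.reshape(N, S)
```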

  11. Policy Improvement (Hansen, 1998)
  • Apply the backup operator, i.e., construct new nodes for every possible choice of action and of transition on each observation
  • This yields |A||N|^|Ω| new nodes, which are added to the controller
  • Prune all dominated nodes
  • Drawback: leads to exponential growth in controller size
  [Slide figure: example of policy iteration for a POMDP, showing value vectors V over P(s) from 0 to 1]

  12. Bounded Policy Iteration (BPI) (Poupart & Boutilier, 2003)
  • Instead of performing a complete backup, replace a node with a better node
  • A linear program performs the partial backup; the new node is a convex combination of two backed-up nodes
  • Changes in the controller: the node's stochastic action policy and stochastic observation (transition) policy, with improvement margin ε

  13. Local Optima
  • This form of policy improvement is prone to converging to local optima
  • When all nodes are tangent to the backed-up nodes: ε = 0, no improvement
  • An escape technique is suggested by Poupart & Boutilier (2003) in BPI
  [Slide figure: value vectors V over P(s) from 0 to 1 illustrating the tangency]

  14. I-POMDP Generalization: Nested Controllers
  • Nested controllers: analogous to nested beliefs; they embed recursive reasoning
  • Starting from level 0 upwards, for each level l, construct a finite state controller for each frame of each agent
  • For convenience of representation, assume two agents and one frame per agent at each level
  [Slide figure: agent i's level 2 controller, agent j's level 1 controller, agent i's level 0 controller]

  15. Interactive BPI: Policy Evaluation
  • Compute the value vector of each node, using the estimate of the other agent's model, by solving a system of linear equations
  • For each node ni,l and interactive state is = (s, nj,l-1), solve the evaluation equation (a sketch is given below)
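The evaluation equation itself did not survive extraction; the following is a reconstruction of its likely form under the assumptions above (two agents, one frame each), with Pr(ai | ni,l) and Pr(n′i,l | ni,l, ai, oi) denoting each controller node's stochastic action and transition policies:

$$
V(n_{i,l}, s, n_{j,l-1}) = \sum_{a_i, a_j} \Pr(a_i \mid n_{i,l}) \Pr(a_j \mid n_{j,l-1}) \Big[ R_i(s, a_i, a_j) + \gamma \sum_{s'} T_i(s, a_i, a_j, s') \sum_{o_i, o_j} O_i(s', a_i, a_j, o_i)\, O_j(s', a_i, a_j, o_j) \sum_{n'_{i,l},\, n'_{j,l-1}} \Pr(n'_{i,l} \mid n_{i,l}, a_i, o_i)\, \Pr(n'_{j,l-1} \mid n_{j,l-1}, a_j, o_j)\, V(n'_{i,l}, s', n'_{j,l-1}) \Big]
$$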

  16. I-BPI: Policy Improvement
  • Pick a node ni,l and perform a partial backup, using a linear program, to construct another node n′i,l that pointwise dominates ni,l by some ε > 0
  • The new vector dominates the old vector by ε and hence replaces it
  [Slide figure: the old and new value vectors over P(s) from 0 to 1, separated by ε]

  17. I-BPI: Policy Improvement
  • Pick a node ni,l and perform a partial backup, using a linear program, to construct another node that pointwise dominates ni,l by some ε > 0
  • The linear program maximizes ε over variables encoding the new node's stochastic action and transition policies, subject to pointwise-dominance and probability constraints (a sketch follows below)
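The LP's variables and constraints were lost in extraction; the following is a reconstruction patterned on the BPI linear program of Poupart & Boutilier (2003), not necessarily the paper's exact formulation. Here x(ai) and x(ai, oi, n″i,l) encode the candidate node's action and transition probabilities:

$$
\begin{aligned}
\max_{\epsilon,\; x(a_i),\; x(a_i, o_i, n''_{i,l})} \quad & \epsilon \\
\text{s.t.} \quad & V(n_{i,l}, s, n_{j,l-1}) + \epsilon \;\le\; \sum_{a_i, a_j} \Pr(a_j \mid n_{j,l-1}) \Big[ x(a_i)\, R_i(s, a_i, a_j) \\
& \qquad + \gamma \sum_{s'} T_i(s, a_i, a_j, s') \sum_{o_i, o_j} O_i(s', a_i, a_j, o_i)\, O_j(s', a_i, a_j, o_j) \\
& \qquad\qquad \times \sum_{n''_{i,l},\, n'_{j,l-1}} x(a_i, o_i, n''_{i,l})\, \Pr(n'_{j,l-1} \mid n_{j,l-1}, a_j, o_j)\, V(n''_{i,l}, s', n'_{j,l-1}) \Big] \quad \forall\, (s, n_{j,l-1}) \\
& \sum_{a_i} x(a_i) = 1, \qquad \sum_{n''_{i,l}} x(a_i, o_i, n''_{i,l}) = x(a_i) \ \ \forall\, a_i, o_i, \qquad x(\cdot) \ge 0
\end{aligned}
$$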

  18. Escaping Local Optima
  • Analogous to the escape technique for POMDPs
  [Slide figure: value vectors V over P(s) from 0 to 1, with beliefs bT, bR1 and bR2 marked]

  19. Algorithm: I-BPI
  • Starting from level 0 up to level l, construct a one-node controller for each level, with a random action and a transition to itself
  • Reformulate the interactive state space and evaluate
  [Slide figure: nested controllers at levels L0 through Ll, unfolding over time]

  20. Algorithm: I-BPI
  • Starting from level 0 up to level l, perform one step of the backup operator, giving at most |Ai(j)| nodes
  [Slide figure: nested controllers at levels L0 through Ll, unfolding over time]

  21. Algorithm: I-BPI
  • Starting from level 0 up to level l, reformulate the interactive state space, then perform policy evaluation followed by policy improvement at each level
  [Slide figure: nested controllers at levels L0 through Ll, unfolding over time]

  22. Algorithm: I-BPI
  • Repeat step 4 (policy evaluation and improvement at each level) until convergence
  • If converged, push the nested controller out of the local optimum by adding new nodes
  • A high-level sketch of the overall loop follows below
  [Slide figure: nested controllers at levels L0 through Ll, unfolding over time]
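A high-level sketch of the I-BPI loop as described on slides 19-22; the `ops` object and its methods are hypothetical placeholders for the machinery on the slides, not the authors' implementation:

```python
def interactive_bpi(levels, ops, max_iterations=1000):
    """Sketch of the I-BPI loop. `ops` is assumed to supply per-level operations:
    init, reformulate, evaluate, backup, improve (returns True on improvement)
    and escape (returns True if new nodes were added)."""
    # Step 1: one-node controller per level, random action, self-transition.
    controllers = {l: ops.init(l) for l in range(levels + 1)}

    # Step 2: reformulate the interactive state space and evaluate each level once.
    for l in range(levels + 1):
        ops.reformulate(controllers, l)
        ops.evaluate(controllers, l)

    # Step 3: one step of the backup operator at every level.
    for l in range(levels + 1):
        ops.backup(controllers, l)

    # Step 4, repeated: evaluate, then improve via partial (bounded) backups;
    # on convergence, try to escape the local optimum by adding new nodes.
    for _ in range(max_iterations):
        improved = False
        for l in range(levels + 1):
            ops.reformulate(controllers, l)
            ops.evaluate(controllers, l)
            improved |= ops.improve(controllers, l)
        if not improved and not ops.escape(controllers):
            break
    return controllers
```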

  23. Evaluation
  • AUAV: 81 states, 5 actions, 4 observations
  • Money laundering: 99 states, 11 actions, 9 observations
  • Scales to these larger problems
  [Slide table: runtime of the algorithm and the average rewards from simulations; * denotes expected rewards obtained from the value vectors]

  24. Evaluation
  • Simulation results for the multiagent tiger problem, obtained by simulating the performance of agent controllers of various sizes at levels 1-4
  [Slide figure: simulation results]

  25. Discussion
  • Advantages of I-BPI: significantly quicker and scales to large problems (hundreds of states, tens of actions and observations); mitigates the curses of history and dimensionality; improved solution quality
  • Limitations: prone to local optima; the escape technique may not work for certain local optima; not entirely free from the curses of history and dimensionality
  • Future work: scale to even larger problems and more agents; Mealy machine implementation of the controllers (Amato et al., 2011)

  26. Thank you…Poster #731 today at 16:00-17:00 (Panel 98) Acknowledgement: This research is partially supported by an NSF CAREER grant, #IIS-0845036

  27. Policy Improvement
  • Apply the backup operator, i.e., construct new nodes for every possible choice of action and of transition, on each observation, to nodes in the current controller
  • This yields |A||N|^|Z| new nodes, which are added to the controller
  [Slide figure: a backed-up node choosing among |A| actions and, for each observation Z1 ... Z|Z|, transitioning into one of the |N| existing nodes]

  28. Introduction: POMDP
  • POMDP: framework for optimal sequential decision making under uncertainty in single-agent settings
  • POMDP = ⟨S, A, Z, T, O, R, γ⟩
  • S: set of states; A: set of actions; Z: set of observations
  • T: S × A → Δ(S)
  • O: S × A → Δ(Z)
  • R: S × A → ℝ
  • γ: discount factor; h: horizon
  • The agent maintains a belief b ∈ Δ(S) over the physical states and follows a policy π: b → A
  • Objective: find a policy π that maximizes the long-term expected reward, ER = immediate reward + discounted future reward (a sketch of the belief update is given below)
  [Slide figure: the agent acting on the physical states S via a/T(s, a, s′) and receiving z/O(s′, a, z) and R(s, a)]
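A minimal sketch of the single-agent POMDP belief update, b′(s′) ∝ O(s′, a, z) Σs T(s, a, s′) b(s), which the I-POMDP belief update generalizes; the array shapes are assumptions for illustration:

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayes-filter belief update for a POMDP.

    Assumed shapes (illustrative only): b[s], T[s, a, s'], O[s', a, z].
    """
    # Predict the next-state distribution, then weight by the observation likelihood.
    unnormalized = O[:, a, z] * (b @ T[:, a, :])
    return unnormalized / unnormalized.sum()
```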

  29. Future Work
  • Extend the approach to problems with even larger dimensions
  • Extend to problems with more than two agents
  • Mealy machine implementation of the finite state controllers (Amato et al., 2011)

  30. I-POMDP Belief Update and Value Function
  • Belief Update: An agent must predict the other agent's actions by anticipating that agent's updated beliefs over time. The belief update therefore consists of:
    • updating the distribution over physical states, using agent i's transition and observation functions
    • updating the distribution over dynamic models, using the other agent's belief update and its observation function
  • Value Function: must incorporate the I-POMDP belief update when computing long-term rewards

  31. Solving I-POMDP (Related Work)
  • Previous work: value iteration algorithms
    • I-PF (Doshi & Gmytrasiewicz, 2009): a nested particle filter, i.e., a sampled recursive representation of the agent's nested belief
    • I-PBVI (Doshi & Perez, 2008): a point-based domination check
  • Both iteratively apply the backup operator, which is expensive
  • Over multiple time steps: curse of history and curse of dimensionality
  [Slide figure: the belief b = Δ(ISi,l) over the physical states S, with s, a/T(s, a, s′) and s′/O(s′, a, z), R(s, a)]

  32. I-POMDP Generalization: Nested Controllers
  • Embed recursive reasoning
  • Starting from level 0 upwards, for each level l, construct a finite state controller for each frame of each agent
  • For convenience of representation, assume two agents and one frame per agent at each level
  [Slide figure: nested controllers at levels L0, L1, ..., Ll]

  33. I-POMDP Generalization: Nested Controllers
  • Embed recursive reasoning
  • Starting from level 0 upwards, for each level l, construct a finite state controller for each frame of each agent
  • For convenience of representation, assume two agents and one frame per agent at each level
