
Optimal Sequential Planning in Partially Observable Multiagent Settings


Presentation Transcript


  1. Optimal Sequential Planning in Partially Observable Multiagent Settings Prashant Doshi Department of Computer Science University of Illinois at Chicago Thesis Committee: Piotr Gmytrasiewicz Bing Liu Peter Nelson Gyorgy Turan Avi Pfeffer

  2. Introduction • Background • Well-known framework for planning in single agent partially observable settings: POMDP • Traditional analysis of multiagent interactions: Game Theory • Problem “... there is currently no good way to combine game theoretic and POMDP control strategies.” - Russell and Norvig AI: A Modern Approach, 2nd Ed.

  3. Introduction General Problem Setting [Diagram: each agent maintains beliefs about the environment state, receives observations, and chooses actions] Optimize an agent’s preferences given beliefs

  4. Introduction Significance: Real-world applications 1. Robotics • Planetary exploration • Surface mapping by rovers • Coordinate to explore a pre-defined region optimally; uncertainty due to sensors • Robot soccer • Coordinate with teammates and deceive opponents; anticipate and track others’ actions [Images: the Spirit and Opportunity rovers; the RoboCup competition]

  5. Introduction 2. Defense • Coordinate troop movements in battlefields; exact “ground situation” unknown • Coordinate anti-air defense units (Noh&Gmytrasiewicz04) 3. Distributed Systems • Networked systems • Packet routing • Sensor networks

  6. Introduction Related Work • Game Theory • Learning in repeated games: Convergence to Nash equilibrium • Fictitious play (Fudenberg&Levine97) • Rational (Bayesian) learning (Kalai&Lehrer93, Nyarko97) Shortcomings: Framework of repeated games not realistic • Decision Theory • Multiagent Reinforcement Learning (Littman94, Hu&Wellman98, Bowling&Veloso00) Shortcomings: Assumes state is completely observable, slow in generating an optimal plan • Multi-body Planning: Nash equilibrium • DEC-POMDP (Bernstein et al.02, Nair et al.03) Shortcomings: Restricted to teams, assumes centralized planning

  7. Introduction Limitations of Nash Equilibrium • Not suitable for general control • Incomplete: Does not say what to do off-equilibria • Non-unique: Multiple solutions, no way to choose “…game theory has been used primarily to analyze environments that are at equilibrium, rather than to control agents within an environment.” - Russell and Norvig AI: A Modern Approach, 2nd Ed.

  8. Introduction Our approach – Key ideas: • Integrate game theoretic concepts into a decision theoretic framework • Include possible models of other agents in your decision making → intentional models (types) and subintentional models • Address uncertainty by maintaining beliefs over the state and models of other agents → Bayesian learning • Beliefs over models give rise to interactive belief systems → interactive epistemology, recursive modeling • Computable approximation of the interactive belief system → finitely nested belief system • Compute best responses to your beliefs → subjective rationality

  9. Introduction Claims and Contributions • Framework • Novel framework: Applicable to agents in complex multiagent domains that optimize locally with respect to their beliefs • Addresses limitations of Nash eq.: Solution technique is complete and unique (up to plans of equal expected utility), in contrast to Nash equilibrium • Generality: Combines strategic and long-term planning into one framework. Applicable to non-cooperative and cooperative settings • Better quality plans: Interactive beliefs result in plans that have larger values than approaches that use “flat” beliefs

  10. Introduction Claims and Contributions (Contd.) • Algorithms and Analysis • Approximation method • Interactive particle filter: Online (bounded) anytime approximation technique for addressing the curse of dimensionality • Look ahead reachability tree sampling: Complementary method for mitigating the policy space complexity • Exact solutions: Solutions for several non-cooperative and cooperative versions of the multiagent tiger problem • Approximate solutions: Empirical validation of the approximate method using the multiagent tiger and machine maintenance problems • Convergence to equilibria: Theoretical convergence to subjective equilibrium under a truth compatibility condition. Illustrated the computational obstacles to satisfying the condition

  11. Introduction Claims and Contributions (Contd.) • Application • Simulation of social behaviors: Agent based simulation of commonly observed intuitive social behaviors • Significant applications in robotics, defense, healthcare, economics, and networking

  12. Roadmap • Interactive POMDPs • Background: POMDPs • Generalization to I-POMDPs • Formal Definition and Key Theorems • Results and Limitations • Approximating I-POMDPs • Curses of Dimensionality and History • Interactive Particle Filter • Convergence and Error Bounds • Results • Sampling the Look Ahead Reachability Tree • Subjective Equilibrium in I-POMDPs • Conclusion

  13. Background: POMDPs Planning in single agent complex domains: Partially Observable Markov Decision Processes Single Agent Tiger Problem Task: Maximize collection of gold over a finite or infinite number of steps while avoiding the tiger Tiger emits a growl periodically (GL or GR) Agent may listen or open doors (L, OL, or OR)

  14. Background: POMDPs Steps to compute a plan 1. Model of the decision making situation 2. Update beliefs [Figure: belief updates in the tiger problem; e.g., from the uniform belief (0.5, 0.5), listening and hearing GL shifts the belief to (0.85, 0.15), hearing GR shifts it to (0.15, 0.85), and opening either door (OL or OR) resets it to (0.5, 0.5)]

  15. Background: POMDPs 3. Optimal plan computation: • Build the look ahead reachability tree • Dynamic programming
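To make steps 1 through 3 concrete, here is a minimal Python sketch of the single agent tiger problem, assuming the standard textbook parameters (listening costs 1, opening the correct door yields +10 and the wrong door -100, growls are 85% accurate, and opening a door resets the problem). These numbers are consistent with the beliefs and values shown on the surrounding slides, but they are stated here as assumptions rather than taken from the thesis.

```python
# Minimal sketch of the single agent tiger POMDP (assumed textbook parameters).
# Notation follows the slides: actions L, OL, OR; observations GL, GR; the
# belief is Pr(tiger behind the left door).

STATES = ("TL", "TR")        # tiger left, tiger right
ACTIONS = ("L", "OL", "OR")  # listen, open left, open right
OBS = ("GL", "GR")           # growl left, growl right

def obs_prob(obs, state, action):
    """Pr(obs | state, action): listening is 85% accurate; after opening a
    door the growl carries no information."""
    if action == "L":
        correct = (obs == "GL") == (state == "TL")
        return 0.85 if correct else 0.15
    return 0.5

def reward(belief_tl, action):
    """Expected immediate reward under belief Pr(TL)."""
    if action == "L":
        return -1.0
    if action == "OL":  # opening the left door is disastrous if the tiger is left
        return belief_tl * -100.0 + (1.0 - belief_tl) * 10.0
    return belief_tl * 10.0 + (1.0 - belief_tl) * -100.0

def belief_update(belief_tl, action, obs):
    """Step 2: Bayes update of Pr(TL); opening a door resets the problem."""
    if action != "L":
        return 0.5
    num = obs_prob(obs, "TL", action) * belief_tl
    den = num + obs_prob(obs, "TR", action) * (1.0 - belief_tl)
    return num / den

def value(belief_tl, horizon):
    """Step 3: dynamic programming over the look ahead reachability tree."""
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a in ACTIONS:
        v = reward(belief_tl, a)
        for o in OBS:
            # Pr(o | belief, a) weights the value of the successor belief
            p_o = sum(obs_prob(o, s, a) * (belief_tl if s == "TL" else 1.0 - belief_tl)
                      for s in STATES)
            v += p_o * value(belief_update(belief_tl, a, o), horizon - 1)
        best = max(best, v)
    return best

print(value(0.5, 2))  # approx -2.0: listening is optimal with 2 steps to go from (0.5, 0.5)
```

With these assumed parameters the sketch reproduces the values on the policy computation slide later in the deck: V1 = -1 and V2 = -2 from the uniform belief, with listening optimal in both cases.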

  16. Interactive POMDPs Generalize POMDPs to multiagent settings • Modify the state space: include models of other agents (agent types) in the state space; beliefs over models give rise to uncountably infinite hierarchical belief systems (Mertens&Zamir85, Brandenburger&Dekel93, Aumann&Heifetz02) • Modify the belief update: Predict → Correct

  17. Interactive POMDPs Formal Definition and Key Properties Proposition 1 (Sufficiency): In an I-POMDP, the belief over interactive states is a sufficient statistic for the past history of i’s observations Proposition 2 (Belief Update): Under the BNM and BNO assumptions, the belief update function for I-POMDPi, when mj is intentional, combines a prediction of j’s actions from its model with a correction based on i’s own observation (a sketch of its general form follows) Theorem 1 (Convergence): For any finitely nested I-POMDP, the value iteration algorithm starting from an arbitrary value function converges to a unique fixed point Theorem 2 (PWLC): For any finitely nested I-POMDP, the value function is piecewise linear and convex
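The belief update equation itself is not reproduced in this transcript. As a hedged sketch of its general shape (following the I-POMDP formulation in the JAIR 2005 paper listed at the end; the normalizer β and j’s belief-update function τ are notational choices, not necessarily the slide’s symbols):

$$
b_i^t(is^t) \;=\; \beta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr\!\big(a_j^{t-1} \mid \theta_j^{t-1}\big)\, T\big(s^{t-1}, a^{t-1}, s^t\big)\, O_i\big(s^t, a^{t-1}, o_i^t\big) \sum_{o_j^t} O_j\big(s^t, a^{t-1}, o_j^t\big)\, \tau\big(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t\big)
$$

where an interactive state is $is^t = (s^t, \theta_j^t)$; the term $\Pr(a_j^{t-1} \mid \theta_j^{t-1})$ is the prediction step, and $O_i$ supplies the correction from agent i’s own observation $o_i^t$.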

  18. agents i & j L,GL pi(TL,pj) pi(TL,pj) pi(TL,pj) pi(TL,pj) L,GR L GL,S L, GL,S pj (TL) pj (TL) pj (TL) pj (TL) L,GR pi(TR,pj) pi(TR,pj) pi(TR,pj) pi(TR,pj) L,GL pj (TL) pj (TL) pj (TL) pj (TL) Interactive POMDPs Results Multiagent Tiger Problem Task: Maximize collection of gold over a finite or infinite number of steps while avoiding tiger Each agent hears growls as well as creaks (S, CL, or CR) Each agent may open doors or listen Each agent is unable to perceive other’s observation Understanding the I-POMDP (level 1) belief update

  19. Interactive POMDPs Value Functions • Q. Is the extra modeling effort justified? • Q. What is the computational cost? # of POMDPs that need to be solved for level l and K other agents: [formula shown on slide] [Plots: value function U vs. pi(TL) at horizons 2 and 3, comparing a POMDP with noise (level 0 I-POMDP) against a level 1 I-POMDP]

  20. Interactive POMDPs • Interesting plans in the multiagent tiger problem Rule of thumb: Two consistent observations from the same side lead to opening of doors [Figure: level 1 policy trees over Pr(TL, pj) and Pr(TR, pj); starting from uniform beliefs the agent listens (L), and only after two consistent growl/creak observations from the same side does it open the corresponding door (OL or OR)]

  21. Interactive POMDPs Application • Agent based simulation of intuitive social behaviors: follow the leader [Figure: policy trees for unconditional follow the leader and conditional follow the leader]

  22. Interactive POMDPs Limitations Approximation techniques that trade off solution quality against computation are critically required to apply I-POMDPs to realistic settings

  23. Roadmap • Interactive POMDPs • Approximating I-POMDPs • Curses of Dimensionality and History • Key Idea: Sampling • Interactive Particle Filter • Convergence and Error Bounds • Results • Sampling the Look Ahead Reachability Tree • Subjective Equilibrium in I-POMDPs • Convergence of Bayesian Learning • Subjective Equilibrium • Computational Limitations • Conclusion

  24. Approximating I-POMDPs • Two sources of complexity • Curse of dimensionality • The belief dimension grows with the number of interactive states • Curse of history • The cardinality of the policy space grows with the horizon (an illustrative expression follows)
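As a point of reference for the curse of history (an illustrative count, not a formula taken from the slide): the number of distinct depth-$H$ policy trees for an agent with $|A|$ actions and $|\Omega|$ observations is

$$|A|^{\frac{|\Omega|^H - 1}{|\Omega| - 1}},$$

i.e., doubly exponential in the horizon, which is why sampling the look ahead reachability tree is needed.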

  25. Approximating I-POMDPs Addressing the curse of dimensionality: Monte Carlo sampling Overview of our method: belief projection using the interactive particle filter [Figure: details of particle filtering, illustrated on the single agent tiger problem; each projection consists of propagation (given action L), weighting by the likelihood of the observation (e.g., GL), and resampling]

  26. Approximating I-POMDPs Interactive Particle Filtering • Propagation: • Sample the other agent’s action • Sample the next physical state • For each of the other agent’s possible observations, update its belief (a sketch of this step follows)
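A minimal Python sketch of this propagation step, assuming each particle pairs a physical state with a model of agent j represented simply by j’s nested belief. The names and signatures here are illustrative, not the thesis implementation, and agent i’s own weighting and resampling steps are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Particle:
    """One sample of an interactive state: a physical state plus a model of
    agent j, represented here simply by j's nested belief (e.g., Pr(TL))."""
    state: str
    j_belief: float

def propagate(particles: List[Tuple[Particle, float]],
              a_i: str,
              j_observations: List[str],
              sample_j_action: Callable[[float], str],
              sample_next_state: Callable[[str, str, str], str],
              j_obs_prob: Callable[[str, str, str, str], float],
              j_belief_update: Callable[[float, str, str], float]
              ) -> List[Tuple[Particle, float]]:
    """Propagation step (sketch): for each weighted particle, sample j's
    action from its model, sample the next physical state, then branch on
    each of j's possible observations, updating j's nested belief and scaling
    the weight by that observation's likelihood. Agent i's own weighting by
    its observation, and resampling, follow this step."""
    propagated: List[Tuple[Particle, float]] = []
    for particle, weight in particles:
        a_j = sample_j_action(particle.j_belief)              # sample other's action
        s_next = sample_next_state(particle.state, a_i, a_j)  # sample next physical state
        for o_j in j_observations:                            # for other's observations...
            b_j_next = j_belief_update(particle.j_belief, a_j, o_j)  # ...update its belief
            w = weight * j_obs_prob(o_j, s_next, a_i, a_j)
            propagated.append((Particle(s_next, b_j_next), w))
    return propagated
```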

  27. Approximating I-POMDPs Convergence and Error Bounds • The approximation does not necessarily converge • Theorem: For a singly-nested t-horizon I-POMDP with discount factor γ, the error introduced by our approximation technique is upper bounded [the bound, shown on the slide, follows from the Chernoff-Hoeffding bounds and involves a confidence parameter]
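The slide’s bound is not reproduced in this transcript. For context only, the Chernoff-Hoeffding inequality it invokes controls how far a Monte Carlo estimate from $N$ i.i.d. samples of a quantity bounded in $[0, 1]$ can stray from its expectation:

$$\Pr\big(|\hat{\mu}_N - \mu| \ge \epsilon\big) \;\le\; 2\, e^{-2 N \epsilon^2},$$

so the number of particles $N$ can be chosen to keep the one-step sampling error below $\epsilon$ with confidence $1 - \delta$; the theorem then accounts for how this error compounds over the t-step discounted value.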

  28. Approximating I-POMDPs Empirical Results • Q. How good is the approximation? [Plots: performance profiles for level 1 and level 2 beliefs on the multiagent tiger problem]

  29. Approximating I-POMDPs Performance Profiles (Contd.) [Plots: performance profiles for level 1 and level 2 beliefs on the multiagent machine maintenance problem]

  30. Approximating I-POMDPs • Q. Does it save on computational costs? [Table: reduction in the # of POMDPs that need to be solved; runtimes on a Pentium IV 2.0 GHz, 2 GB RAM, Linux; * = out of memory]

  31. Approximating I-POMDPs Reducing the impact of the curse of history • Sample observations while building the look ahead reachability tree • Consider only the likely future beliefs
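A minimal Python sketch of the idea, assuming a generic observation-probability function: instead of expanding every observation branch of the look ahead reachability tree, each node expands only a few observations drawn in proportion to their likelihood. Names and numbers here are illustrative, not the thesis implementation.

```python
import random
from typing import Callable, List, Sequence

def sample_observation_branches(observations: Sequence[str],
                                obs_probability: Callable[[str], float],
                                num_samples: int) -> List[str]:
    """Choose which observation branches to expand at one node of the look
    ahead reachability tree: draw num_samples observations in proportion to
    their likelihood, so only the likely future beliefs are considered."""
    weights = [obs_probability(o) for o in observations]
    return random.choices(list(observations), weights=weights, k=num_samples)

# Illustrative use in the tiger problem: with a belief of 0.85 on "tiger left"
# and 85%-accurate listening, Pr(GL) = 0.85*0.85 + 0.15*0.15 = 0.745, so GL
# branches are expanded far more often than GR branches.
print(sample_observation_branches(["GL", "GR"],
                                  lambda o: 0.745 if o == "GL" else 0.255,
                                  num_samples=2))
```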

  32. Approximating I-POMDPs Empirical Results [Plots: performance profiles at horizons 3 and 4 on the multiagent tiger problem. Table: computational savings; runtimes on a Pentium IV 2.0 GHz, 2 GB RAM, Linux; * = out of memory]

  33. Roadmap • Interactive POMDPs • Approximating I-POMDPs • Subjective Equilibrium in I-POMDPs • Convergence of Bayesian Learning • Subjective Equilibrium • Computational Limitations • Conclusion • Summary • Future Work

  34. Subjective Equilibrium in I-POMDPs Theoretical Analysis • Joint observation histories in the multiagent tiger problem • Absolute Continuity Condition (ACC): • The agent’s initial belief, induced over the future observation paths, should not rule out the ones considered possible by the true distribution • Cautious beliefs → “grain of truth” assumption

  35. Subjective Equilibrium in I-POMDPs • Theorem 1: Under ACC, an agent’s belief over the other’s models, updated using the I-POMDP belief update, converges with probability 1 • Proof sketch: Show that Bayesian learning is a martingale; apply the Martingale Convergence Theorem (Doob53) • Subjective ε-Equilibrium (Kalai&Lehrer93): A profile of strategies of agents, each of which is an exact best response to a belief that is ε-close to the true distribution over the observation history • Subjective equilibrium is stable under learning and optimization
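The martingale step rests on a standard property of Bayesian updating, stated here in generic notation rather than the slide’s: the posterior probability assigned to any fixed model of j is a martingale with respect to agent i’s own predictive distribution over observation histories,

$$\mathbb{E}\big[\, b^{t+1}(m_j) \mid o_i^{1:t} \,\big] \;=\; b^t(m_j),$$

and since these posteriors are bounded, the Martingale Convergence Theorem gives convergence with probability 1.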

  36. Subjective Equilibrium in I-POMDPs • Corollary: If agents’ beliefs within the I-POMDP framework satisfy the ACC, then after finite time T, their strategies are in subjective ε-equilibrium, where ε is a function of T • When ε = 0, subjective equilibrium obtains • Proof follows from the convergence of the I-POMDP belief update • ACC is a sufficient condition, but not a necessary one

  37. Subjective Equilibrium in I-POMDPs Computational Limitations • There exist computable strategies that admit no computable exact best responses (Nachbar&Zame96) • If possible strategies are assumed computable, then i’s best response may not be computable; therefore, j’s cautious beliefs need not contain a grain of truth • Subtle tension between prediction and optimization • Strictness of ACC • Theorem 2: Within the finitely nested I-POMDP framework, all the agents’ beliefs will never simultaneously satisfy the grain of truth assumption

  38. Roadmap • Interactive POMDPs • Approximating I-POMDPs • Subjective Equilibrium in I-POMDPs • Conclusion • Summary • Future Work

  39. Summary • I-POMDP: A novel framework for planning in complex multiagent settings • Combines concepts from decision theory and game theory • Allows strategic as well as long-term planning • Applicable to cooperative and non-cooperative settings • Solution is complete and unique (up to plans of equal expected utility) • Online anytime approximation technique • Interactive Particle Filter: Addresses the curse of dimensionality • Reachability Tree Sampling: Reduces the effect of the curse of history • Equilibria in I-POMDPs • Theoretical convergence to subjective equilibrium given ACC • Computational obstacles to satisfying ACC • Applications • Agent based simulation of social behaviors • Robotics, defense, healthcare, economics, and networking

  40. Future Work • Other approximation methods • Tighter error bounds • Multiagent planning with bounded rational agents • Models for describing bounded rational agents • Communication between agents • Cost & optimality profile for plans as a function of levels of nesting • Other applications

  41. Thank You Questions

  42. Selected Publications. Full publication list at: http://dali.ai.uic.edu/pdoshi Selected Journals • Piotr Gmytrasiewicz, Prashant Doshi, “A Framework for Sequential Planning in Multiagent Settings”, Journal of AI Research (JAIR), Vol 23, 2005 • Prashant Doshi, Richard Goodwin, Rama Akkiraju, Kunal Verma, “Dynamic Workflow Composition using Markov Decision Processes”, Journal of Web Services Research (JWSR), 2(1):1-17, 2005 Selected Conferences • Prashant Doshi, Piotr Gmytrasiewicz, “A Particle Filtering Based Approach to Approximating Interactive POMDPs”, National Conference on AI (AAAI), pp. 969-974, July, 2005 • Prashant Doshi, Piotr Gmytrasiewicz, “Approximating State Estimation in Multiagent Settings using Particle Filters”, Autonomous Agents and Multiagent Systems Conference (AAMAS), July, 2005 • Piotr Gmytrasiewicz, Prashant Doshi, “Interactive POMDPs: Properties and Preliminary Results”, Autonomous Agents and Multiagent Systems Conference (AAMAS), pp. 1374-1375, July, 2004 • Prashant Doshi, Richard Goodwin, Rama Akkiraju, Kunal Verma, “Dynamic Workflow Composition using Markov Decision Processes”, International Conference on Web Services (ICWS), pp. 576-582, July, 2004 • Piotr Gmytrasiewicz, Prashant Doshi, “A Framework for Sequential Planning in Multiagent Settings”, International Symposium on AI & Math (AMAI), Jan, 2004

  43. Interactive POMDPs • Finitely nested I-POMDP: I-POMDPi,l • Computable approximations of I-POMDPs, constructed bottom up (a sketch of the construction follows) • A 0th level type is a POMDP
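A hedged sketch of the bottom-up construction, in the notation of the JAIR 2005 paper listed earlier (frames and other details suppressed): level 0 interactive states are just the physical states, and each higher level adds types of j whose beliefs range over the level below,

$$
IS_{i,0} = S, \qquad \Theta_{j,l} = \big\{ \langle b_{j,l}, \hat{\theta}_j \rangle : b_{j,l} \in \Delta(IS_{j,l}) \big\}, \qquad IS_{i,l+1} = S \times \Theta_{j,l},
$$

so a 0th level type, which models no other agent, reduces to a POMDP.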

  44. Interactive POMDPs • Solutions to the enemy version of the multiagent tiger problem Agent i believes that j is likely to be uninformed

  45. Interactive POMDPs Agent i believes that j is likely to be almost informed

  46. Interactive POMDPs [Plots: value of the interaction when i believes j is likely almost informed vs. when i believes j is likely uninformed] The value of an interaction for an agent is greater when its enemy is uninformed than when it is informed

  47. Background: POMDPs Policy Computation: trace of policy computation [Figure: look ahead reachability tree; with 2 steps to go, b = (0.5, 0.5) has V2(b) = -2 and OPT(b) = L; the branches SE(b, L, GL), SE(b, L, GR), SE(b, OL, *), SE(b, OR, *) lead to the 1-step-to-go beliefs (0.85, 0.15), (0.15, 0.85), (0.5, 0.5), (0.5, 0.5), each with V1(b) = -1 and OPT(b) = L]

  48. Background: POMDPs Value function and policy over all beliefs [Plots: value function and induced policy with 1 step to go and 2 steps to go] • Properties of the value function: • The value function is piecewise linear and convex (the standard form is recalled below) • The value function converges asymptotically
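The piecewise linearity and convexity noted here is the standard POMDP result: the finite-horizon value function is a maximum over a finite set of linear functions of the belief (α-vectors),

$$V_n(b) \;=\; \max_{\alpha \in \Gamma_n} \sum_{s \in S} \alpha(s)\, b(s),$$

which Theorem 2 on slide 17 extends to finitely nested I-POMDPs.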

  49. Approximating I-POMDPs Performance Profiles [Plots: level 1 and level 2 beliefs, multiagent tiger problem]
