Generalizing Plans to New Environments in Multiagent Relational MDPs


1. Generalizing Plans to New Environments in Multiagent Relational MDPs Carlos Guestrin Daphne Koller Stanford University

2. Multiagent Coordination Examples • Search and rescue • Factory management • Supply chain • Firefighting • Network routing • Air traffic control • Multiple, simultaneous decisions • Exponentially-large spaces • Limited observability • Limited communication

3. Real-time Strategy Game • Peasants collect resources and build • Footmen attack enemies • Buildings train peasants and footmen

4. Scaling up by Generalization • Exploit similarities between world elements • Generalize plans: from a set of worlds to a new, unseen world • Avoid need to replan • Tackle larger problems • Formalize notion of “similar” elements • Compute generalizable plans

5. Relational Models and MDPs • Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy… • Relations: Collects, Builds, Trains, Attacks… • Instances: Peasant1, Peasant2, Footman1, Enemy1… • Value functions at the class level: objects of the same class make the same contribution to the value function • Factored MDP equivalents of PRMs [Koller, Pfeffer ‘98]

6. Relational MDPs • Class-level transition probabilities depend on: attributes; actions; attributes of related objects (e.g., a Peasant’s next state P’ depends on its action AP and the Gold G it Collects) • Class-level reward function • Instantiation (world): number of objects; relations • Well-defined MDP
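The class/instance split on this slide can be sketched in code. This is a hypothetical data-structure sketch, not the authors’ implementation: `RMDPClass`, `World`, and all object names are illustrative, and the point is only that dynamics are declared once per class while the state space grows with the instantiated world.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an RMDP "schema": class-level dynamics are
# declared once, then instantiated for a particular world.
@dataclass
class RMDPClass:
    name: str                                      # e.g. "Peasant"
    attributes: list                               # class-level state attributes
    related: list = field(default_factory=list)    # classes the dynamics depend on

@dataclass
class World:
    instances: dict                                # class name -> list of object names

    def num_states(self, attr_values=2):
        # the joint state space is exponential in the number of objects
        n_objects = sum(len(v) for v in self.instances.values())
        return attr_values ** n_objects

peasant = RMDPClass("Peasant", ["Task"], related=["Gold"])
world = World({"Peasant": ["Peasant1", "Peasant2"], "Gold": ["Gold1"]})
print(world.num_states())  # 2^3 = 8 joint states for 3 binary objects
```

The class definition never mentions Peasant1 or Peasant2; only the `World` does, which is what lets one class-level solution be reused across instantiations.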

7. Planning in a World • Long-term planning by solving MDP • # states exponential in number of objects • # actions exponential • Efficient approximation by exploiting structure! • RMDP world is a factored MDP

8. Roadmap to Generalization • Solve 1 world • Compute generalizable value function • Tackle a new world

9. World is a Factored MDP • State dynamics, decisions, and rewards form a dynamic Bayesian network over state variables (P, G, F, H, E), actions (AP, AF), and reward R • e.g., P(F’|F,G,H,AF)

10. Long-term Utility = Value of MDP [Manne ‘60] • Value computed by linear programming: • One variable V(x) for each state x • One constraint for each state x and action a • Number of states and actions exponential!
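The exact LP on this slide has one variable V(x) per state and one constraint V(x) ≥ R(x,a) + γ Σ_x' P(x'|x,a) V(x') per state-action pair. A minimal sketch for a made-up 2-state, 2-action MDP: rather than calling an LP solver, we obtain the optimal value function by value iteration and verify that it satisfies every LP constraint, with one tight constraint per state.

```python
# Tiny 2-state, 2-action MDP with made-up rewards and transitions.
gamma = 0.9
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}   # R(x, a)
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],               # P(x' | x, a)
     (1, 0): [0.7, 0.3], (1, 1): [0.1, 0.9]}

V = [0.0, 0.0]
for _ in range(2000):  # value iteration to (numerical) convergence
    V = [max(R[x, a] + gamma * sum(p * v for p, v in zip(P[x, a], V))
             for a in (0, 1)) for x in (0, 1)]

# Every LP constraint V(x) >= R(x,a) + gamma * E[V(x')] holds, and for
# each state the constraint of the optimal action is tight.
for x in (0, 1):
    backups = [R[x, a] + gamma * sum(p * v for p, v in zip(P[x, a], V))
               for a in (0, 1)]
    assert V[x] >= max(backups) - 1e-6
    assert abs(V[x] - max(backups)) < 1e-6
```

With 2 states this enumeration is trivial; the slide’s point is that the number of variables and constraints blows up exponentially in the number of objects, which is what the factored LP below avoids.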

11. Approximate Value Functions • Linear combination of restricted-domain functions [Bellman et al. ‘63] [Tsitsiklis & Van Roy ‘96] [Koller & Parr ‘99,‘00] [Guestrin et al. ‘01] • Each Vo depends on the state of one object and its related objects: state of footman; status of barracks • Must find Vo giving a good approximate value function

12. Single LP Solution for Factored MDPs [Schweitzer and Seidmann ‘85] • Variables for each Vo, for each object ⇒ polynomially many LP variables • One constraint for every state and action ⇒ exponentially many LP constraints • Vo, Qo depend on small sets of variables/actions ⇒ exploit structure as in variable elimination [Guestrin, Koller, Parr ‘01]

13. Representing Exponentially Many Constraints • Exponentially many linear constraints = one nonlinear constraint

14. Variable Elimination • Can use variable elimination to maximize over the state space [Bertele & Brioschi ‘72] (here over a graph with variables A, B, C, D):
max_{A,B,C,D} [ f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D) ]
= max_{A,B,C} [ f1(A,B) + f2(A,C) + max_D [ f3(C,D) + f4(B,D) ] ]
= max_{A,B,C} [ f1(A,B) + f2(A,C) + g1(B,C) ]
• Here we need only 23, instead of 63, sum operations • As in Bayes nets, maximization is exponential in the tree-width
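The elimination step above can be run directly. A sketch with made-up binary factors f1…f4 matching the slide’s structure: eliminating D produces the smaller factor g1(B,C), and the result agrees with brute-force enumeration over all assignments.

```python
import itertools

# Made-up values for the slide's four pairwise factors over binary A,B,C,D.
f1 = {(a, b): a + 2 * b for a in (0, 1) for b in (0, 1)}   # f1(A,B)
f2 = {(a, c): 3 * a * c for a in (0, 1) for c in (0, 1)}   # f2(A,C)
f3 = {(c, d): c - d     for c in (0, 1) for d in (0, 1)}   # f3(C,D)
f4 = {(b, d): 2 * b * d for b in (0, 1) for d in (0, 1)}   # f4(B,D)

# Eliminate D: g1(B,C) = max_D [ f3(C,D) + f4(B,D) ]
g1 = {(b, c): max(f3[c, d] + f4[b, d] for d in (0, 1))
      for b in (0, 1) for c in (0, 1)}

# Remaining maximization over A, B, C only
ve_max = max(f1[a, b] + f2[a, c] + g1[b, c]
             for a in (0, 1) for b in (0, 1) for c in (0, 1))

# Brute force over all 2^4 assignments agrees
brute = max(f1[a, b] + f2[a, c] + f3[c, d] + f4[b, d]
            for a, b, c, d in itertools.product((0, 1), repeat=4))
assert ve_max == brute
print(ve_max)  # → 8
```

Only factors that mention D enter the inner max, which is exactly why the cost is exponential in the tree-width rather than in the total number of variables.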

15. Representing the Constraints • Functions are factored, use Variable Elimination to represent constraints: Number of constraints exponentially smaller

16. Roadmap to Generalization • Solve 1 world • Compute generalizable value function • Tackle a new world

17. Generalization • Sample a set of worlds • Solve a linear program for these worlds: • Obtain class value functions • When faced with new problem: • Use class value function • No re-planning needed

18. Worlds and RMDPs • Meta-level MDP: • Meta-level LP:

19. Class-level Value Functions • Approximate solution to meta-level MDP • Linear approximation • Value function defined at the class level • All instances use the same local value function

20. Class-level LP • Constraints for each world represented by factored LP • Number of worlds exponential or infinite • Sample worlds from the distribution over worlds P(ω)

21. Theorem • Exponentially (infinitely) many worlds ⇒ need exponentially many samples? NO! • With the stated number of sampled worlds, the value function is within ε, with probability at least 1−δ • Rmax is the maximum class reward • Proof method related to [de Farias, Van Roy ‘02]

22. LP with sampled worlds • Solve LP for sampled worlds • Use Factored LP for each world • Obtain class-level value function • New world: instantiate value function and act
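“Instantiate value function and act” can be made concrete. A minimal sketch, with hypothetical class names, local states, and learned values: every object contributes its class’s local value function, so the same class-level weights apply no matter how many objects the new world contains.

```python
# Learned class-level local value functions: class -> {local state: value}.
# All names and numbers are made up for illustration.
class_V = {
    "Footman": {"healthy": 5.0, "wounded": 1.0},
    "Peasant": {"mining": 3.0, "idle": 0.0},
}

def instantiate_value(world_state):
    # world_state: list of (class, local state) pairs, one per object.
    # The world's value is the sum of per-object class contributions.
    return sum(class_V[cls][s] for cls, s in world_state)

# A new, larger world reuses the same class value functions: no replanning.
new_world = [("Footman", "healthy")] * 3 + [("Peasant", "mining")] * 9
print(instantiate_value(new_world))  # 3*5.0 + 9*3.0 = 42.0
```

Note the value function was never re-solved for the 12-object world; instantiation is just a sum over objects.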

23. Learning Classes of Objects • Which classes of objects have same value function? • Plan for sampled worlds individually • Use value function as “training data” • Find objects with similar values • Include features of world • Used decision tree regression in experiments
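The class-discovery idea on this slide can be caricatured in a few lines. The slide uses decision-tree regression over world features; for illustration we swap in a much cruder rule, grouping objects whose planned values fall within a tolerance of each other. All object names and values below are made up.

```python
# Per-object values obtained by planning for sampled worlds individually
# (hypothetical "training data").
object_values = {
    "Server1": 10.1, "Server2": 9.9,
    "Leaf1": 1.0, "Leaf2": 1.2, "Leaf3": 0.9,
    "Mid1": 5.0, "Mid2": 5.3,
}

def discover_classes(values, tol=1.0):
    # Sort objects by value and start a new class whenever the value
    # jumps by more than tol from the current class's first member.
    classes = []
    for name, v in sorted(values.items(), key=lambda kv: kv[1]):
        if classes and abs(v - classes[-1]["value"]) < tol:
            classes[-1]["members"].append(name)
        else:
            classes.append({"value": v, "members": [name]})
    return [c["members"] for c in classes]

print(len(discover_classes(object_values)))  # → 3 (leaves, intermediates, servers)
```

A decision tree over world features does the same job more robustly, since it can split on structural features (e.g., a node’s position in the network) rather than raw values alone.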

24. Summary of Generalization Algorithm • Model domain as Relational MDPs • Pick local object value functions Vo • Learn classes by solving some instances • Sample set of worlds • Factored LP computes class-level value function

25. A New World • When faced with a new world ω, value function is: • Q function becomes: • At each state, choose action maximizing Q(x,a) • Number of actions is exponential! ⇒ But each QC depends only on a few objects!

26. Local Q Function Approximation
Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4,X1,X4) + Q2(A1,A2,X1,X2) + Q3(A2,A3,X2,X3) + Q4(A3,A4,X3,X4)
• Q3 is associated with Agent 3, which observes only X2 and X3 • Limited observability: agent i only observes the variables in Qi • Must choose action to maximize Σi Qi

27. Maximizing Σi Qi: Coordination Graph • Use variable elimination for maximization [Bertele & Brioschi ‘72] (coordination graph over agents A1,…,A4):
max_{A1,A2,A3,A4} [ Q1(A1,A2) + Q2(A1,A3) + Q3(A3,A4) + Q4(A2,A4) ]
= max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + max_{A4} [ Q3(A3,A4) + Q4(A2,A4) ] ]
= max_{A1,A2,A3} [ Q1(A1,A2) + Q2(A1,A3) + g1(A2,A3) ]
• e.g., if A2 attacks and A3 defends, then A4 gets $10 • Limited communication for optimal action choice • Communication bandwidth = induced width of the coordination graph
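Action selection on this coordination graph can be sketched directly: eliminate agent A4 first, recording its best response to each (A2, A3), then enumerate the remaining agents and back-substitute. The Q values below are made up; actions are 0/1 (e.g., defend/attack).

```python
# Made-up pairwise Q functions matching the slide's graph structure.
Q1 = {(a1, a2): 2 * a1 - a2 for a1 in (0, 1) for a2 in (0, 1)}  # Q1(A1,A2)
Q2 = {(a1, a3): a1 + a3     for a1 in (0, 1) for a3 in (0, 1)}  # Q2(A1,A3)
Q3 = {(a3, a4): 3 * a3 * a4 for a3 in (0, 1) for a4 in (0, 1)}  # Q3(A3,A4)
Q4 = {(a2, a4): a2 - 2 * a4 for a2 in (0, 1) for a4 in (0, 1)}  # Q4(A2,A4)

# Eliminate A4: g(A2,A3) = max_{A4} [Q3 + Q4], remembering A4's best response.
g, best_a4 = {}, {}
for a2 in (0, 1):
    for a3 in (0, 1):
        vals = {a4: Q3[a3, a4] + Q4[a2, a4] for a4 in (0, 1)}
        best_a4[a2, a3] = max(vals, key=vals.get)
        g[a2, a3] = vals[best_a4[a2, a3]]

# Maximize over the remaining agents, then back-substitute A4's choice.
best, joint = float("-inf"), None
for a1 in (0, 1):
    for a2 in (0, 1):
        for a3 in (0, 1):
            v = Q1[a1, a2] + Q2[a1, a3] + g[a2, a3]
            if v > best:
                best, joint = v, (a1, a2, a3, best_a4[a2, a3])
print(best, joint)
```

Only agents whose Q functions share variables ever need to exchange messages, which is the slide’s point that communication bandwidth is bounded by the induced width of the graph.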

28. Summary of Algorithm • Model domain as Relational MDPs • Factored LP computes class-level value function • Reuse class-level value function in new world

29. Experimental Results • SysAdmin problem • Network topologies: Star (with Server), Unidirectional Ring, Ring of Rings

30. Generalizing to New Problems

31. Generalizing to New Problems

32. Generalizing to New Problems

33. Classes of Objects Discovered • Learned 3 classes: Server, Intermediate, Leaf

34. Learning Classes of Objects

35. Learning Classes of Objects

36. Results [with Gearhart and Kanodia] • 2 Peasants, Gold, Wood, Barracks, 2 Footmen, Enemy • Reward for dead enemy • About 1 million state/action pairs • Solve with Factored LP • Some factors are exponential • Coordination graph for action selection

37. Generalization • 9 Peasants, Gold, Wood, Barracks, 3 Footmen, Enemy • Reward for dead enemy • About 3 trillion state/action pairs • Instantiate generalizable value function • At run-time, factors are polynomial • Coordination graph for action selection

38. The 3 aspects of this talk • Scaling up collaborative multiagent planning • Exploiting structure • Generalization • Factored representation and algorithms • Relational MDP, Factored LP, coordination graph • Freecraft as a benchmark domain

39. Conclusions • RMDP • Compact representation for set of similar planning problems • Solve single instance with factored MDP algorithms • Tackle sets of problems with class-level value functions • Efficient sampling of worlds • Learn classes of value functions • Generalization to new domains • Avoid replanning • Solve larger, more complex MDPs