Distributed Planning in Hierarchical Factored MDPs

##### Presentation Transcript

1. Distributed Planning in Hierarchical Factored MDPs
    - Carlos Guestrin, Stanford University
    - Geoffrey Gordon, Carnegie Mellon University

2. Multiagent Coordination Examples
    - Search and rescue
    - Factory management
    - Supply chain
    - Firefighting
    - Network routing
    - Air traffic control
    - Access only local information
    - Distributed control
    - Distributed planning

3. Hierarchical Decomposition
    - Subsystems can share variables
    - Each subsystem only observes its local variables
    - Parallel decomposition ⇒ exponential state space
    - [Figure: part-of hierarchy over the subsystems Chassis, Steering, Engine, Cylinders, Injection, Exhaust]

4. Outline
    - Object-based representation
    - Hierarchical factored MDPs
    - Distributed planning
    - Message passing algorithm based on LP decomposition
    - Hierarchical action selection mechanism
    - Limited observability and communication
    - Reusing plans and computation
    - Exploit classes of objects

5. Basic Subsystem MDP
    - Subsystem $j$ is decomposed into:
        - Internal variables $X_j$
        - External variables $Y_j$
        - Actions $A_j$
    - Subsystem model:
        - Rewards: $R_j(X_j, Y_j, A_j)$
        - Transitions: $P_j(X_j' \mid X_j, Y_j, A_j)$
    - Subsystem can be modeled with any representation
    - [Figure: speed-control subsystem over the variables G, I, I', R, S, with its actions, external variables, and internal variables marked]

6. Hierarchical Subsystem Tree
    - Subsystem tree:
        - Nodes are subsystems
        - Hierarchical decomposition
        - Tree reward = sum of subsystem rewards
    - Consistent subsystem tree:
        - Running intersection property
        - Consistent dynamics
    - Lemma: a consistent subsystem tree yields a well-defined global MDP
    - [Figure: subsystem tree over M1 (Transmission), M2 (Speed control), and M3 (Cooling), annotated with SepSet[M2] and SepSet[M3]; SepSet[M2] contains G]
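The running intersection property on this slide can be checked mechanically: every variable must appear in a connected part of the tree. The following is a minimal sketch, assuming each subsystem is described by the set of variable names it mentions and the tree by a parent map. The variable names and the helper names (`has_running_intersection`, `sepset`) are hypothetical, and the sketch assumes that the SepSet of a node is the set of variables it shares with its parent.

```python
def has_running_intersection(scopes, parent):
    """scopes: node -> set of variable names; parent: node -> parent node (root maps to None)."""
    for var in set().union(*scopes.values()):
        holders = {n for n, s in scopes.items() if var in s}
        # In a rooted tree, a node set is connected exactly when one member
        # (and only one) has no parent inside the set.
        tops = [n for n in holders if parent[n] not in holders]
        if len(tops) != 1:
            return False
    return True


def sepset(node, scopes, parent):
    """Assumed definition: variables shared between a node and its parent."""
    return scopes[node] & scopes[parent[node]]


# Hypothetical scopes for a three-node tree rooted at M1.
parent = {"M1": None, "M2": "M1", "M3": "M1"}
good = {"M1": {"T", "G", "C"}, "M2": {"G", "S", "I"}, "M3": {"C", "F"}}
# "S" appears in M2 and M3 but not in M1, which lies on the path between them.
bad = {"M1": {"T", "G", "C"}, "M2": {"G", "S", "I"}, "M3": {"C", "S"}}

ok_good = has_running_intersection(good, parent)
ok_bad = has_running_intersection(bad, parent)
sep_m2 = sepset("M2", good, parent)
```

In the `bad` tree, two subsystems would each carry their own copy of `S` with nothing in between to keep the copies consistent, which is the situation the property rules out.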

7. Relationship to Factored MDPs
    - Hierarchical factored MDP vs. multiagent factored MDP [Guestrin et al. ’01]
    - Representational power is equivalent:
        - A hierarchical factored MDP corresponds to a multiagent factored MDP with a particular choice of basis functions
    - New capabilities:
        - Fully distributed planning algorithm
        - Reuse for knowledge representation
        - Reuse of computation
    - MDP counterpart to Object-Oriented Bayes Nets (OOBNs) [Koller and Pfeffer ’97]
    - [Figure: a hierarchical factored MDP (subsystems M1 and M2 joined through SepSet[M2]) next to a multiagent factored MDP with basis functions h1 and h2]

8. Planning for Hierarchical Factored MDPs
    - Action space: joint action $a = \{a_1, \ldots, a_n\}$ for all subsystems
    - State space: joint state $x$ of the entire system
    - Reward function: total reward $r$
    - Action and state spaces are exponential in the number of subsystems
    - Exploit hierarchical structure:
        - Efficient, distributed approximate planning algorithm
        - Simple message passing approach
        - Each subsystem accesses only its local model
        - Each local model solved by any standard MDP algorithm

9. Solving MDPs as LPs
    - Bellman constraint: if $x \xrightarrow{a} y$ with reward $r$, then $V(x) \ge V(y) + r = Q(a, x)$
    - Similarly for stochastic transitions
    - Optimal $V^*$ satisfies all Bellman constraints and is componentwise smallest
    - Example:

      $$\begin{aligned} \min\ & V(x) + V(y) + V(z) + V(g) \\ \text{s.t.}\ & V(x) \ge V(y) + 1 \\ & V(y) \ge V(g) + 3 \\ & V(x) \ge V(z) + 2 \\ & V(z) \ge V(g) + 1 \end{aligned}$$
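The four-state example on this slide can be worked through by hand or in a few lines of code. The sketch below does not call an LP solver. It finds the componentwise smallest value function satisfying the four constraints by repeated Bellman backups, which reaches the same point as the LP for this small acyclic example. One assumption is not on the slide: the state `g` is treated as a terminal state with $V(g) = 0$, since otherwise the LP as written has no lower bound.

```python
# Deterministic transitions (source, target, reward) read off the slide's constraints.
edges = [("x", "y", 1.0), ("y", "g", 3.0), ("x", "z", 2.0), ("z", "g", 1.0)]

# Assumption: g is terminal, so V(g) stays at 0. Other values start at 0 and only grow.
V = {s: 0.0 for s in "xyzg"}

# The graph is acyclic, so len(V) sweeps are enough to propagate every value.
for _ in range(len(V)):
    for source, target, reward in edges:
        V[source] = max(V[source], V[target] + reward)

# Every Bellman constraint V(source) >= V(target) + reward now holds.
constraints_hold = all(V[s] >= V[t] + r for s, t, r in edges)
objective = sum(V.values())
```

Under these assumptions the result is $V(g) = 0$, $V(z) = 1$, $V(y) = 3$, and $V(x) = 4$: the constraint through `y` is tight at `x`, while the one through `z` is slack, so the optimal first move from `x` is toward `y`.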

10. Decomposable Value Functions
    - Linear combination of restricted-domain functions, $V(x) \approx \sum_i w_i h_i(x)$ [Bellman et al. ’63] [Schweitzer & Seidmann ’85] [Tsitsiklis & Van Roy ’96] [Koller & Parr ’99, ’00] [Guestrin et al. ’01]
    - Each $h_i$ is the status of small part(s) of a complex system:
        - Status of a machine and its neighbors
        - Load on a machine
    - Must find $w$ giving a good approximate value function
    - Well-designed $h_i$ ⇒ exponentially fewer parameters
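A linear combination of restricted-domain functions can be illustrated in a few lines. This is a minimal sketch: the machine names, the three basis functions, and the weights are all hypothetical, chosen only to show that each basis function reads a small part of the state.

```python
def make_value_function(weights, basis):
    """Return V(state) = sum_i w_i * h_i(state)."""
    def value(state):
        return sum(w * h(state) for w, h in zip(weights, basis))
    return value


# Hypothetical basis functions, each looking at only a few state variables.
basis = [
    lambda s: 1.0,                            # constant offset
    lambda s: float(s["m1"]),                 # status of machine m1 alone
    lambda s: float(s["m2"] and s["m3"]),     # machine m2 together with neighbor m3
]
weights = [0.5, 2.0, 3.0]

V = make_value_function(weights, basis)
state = {"m1": 1, "m2": 0, "m3": 1}
v = V(state)  # 0.5 * 1 + 2.0 * 1 + 3.0 * 0
```

With three binary machines there are 8 joint states but only 3 weights to fit, and the gap widens exponentially as machines are added.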

11. Approximate Linear Programming
    - To solve the subsystem tree MDP as an LP:
        - Overall state is the cross-product of subsystem states
        - Bellman LP has exponentially many constraints and variables ⇒ we need to approximate
    - Write $V(x) = V_1(X_1) + V_2(X_2) + \cdots$ and solve:

      $$\begin{aligned} \min\ & V_1(X_1) + V_2(X_2) + \cdots \\ \text{s.t.}\ & V_1(X_1) + V_2(X_2) + \cdots \ge V_1(Y_1) + V_2(Y_2) + \cdots + R_1 + R_2 + \cdots \end{aligned}$$
    - One variable $V_i(X_i)$ for each state of each subsystem
    - One constraint for every state and action
    - $V_i$, $Q_i$ depend on small sets of variables/actions
    - Generates polynomially-sized LPs for factored MDPs [Guestrin et al. ’01]
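The saving in LP variables from the additive form $V(x) = V_1(X_1) + V_2(X_2) + \cdots$ is simple arithmetic. The sketch below assumes, purely for illustration, that every subsystem has the same number of local states and that subsystems do not share variables. The function name `lp_variable_counts` is hypothetical.

```python
def lp_variable_counts(num_subsystems, states_per_subsystem):
    """Return (exact, decomposed) counts of value-function variables."""
    # Exact LP: one variable per joint state of the whole system.
    exact = states_per_subsystem ** num_subsystems
    # Additive approximation: one variable per local state of each subsystem.
    decomposed = num_subsystems * states_per_subsystem
    return exact, decomposed


exact, decomposed = lp_variable_counts(num_subsystems=10, states_per_subsystem=4)
```

For 10 subsystems with 4 local states each, the exact LP has 1,048,576 variables while the decomposed one has 40. The constraints still range over joint states and actions, which is why the restricted scopes of $V_i$ and $Q_i$ are needed to keep the constraint set compact.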

12. Overview of Algorithm
    - Each subsystem solves a local (stand-alone) MDP
    - Each subsystem computes messages by solving a simple local LP:
        - Sends a “constraint message” to its parent
        - Sends “reward messages” to its children
    - Repeat until convergence
    - [Figure: subsystem Mj with parent Ml and child Mk; a reward message and a constraint message are exchanged along each tree edge]

13. Stand-alone MDPs and Reward Messages
    - Subsystem MDP:
        - State: $(X_j, Y_j)$
        - Actions: $A_j$
        - Rewards: $R_j(X_j, Y_j, A_j)$
        - Transitions: $P_j(X_j' \mid X_j, Y_j, A_j)$
    - Reward messages:
        - $S_j$ from parent
        - $S_k$ to children
    - Stand-alone MDP:
        - State: $X_j$
        - Actions: $(A_j, Y_j)$
        - Rewards: $R_j(X_j, Y_j, A_j) - S_j + \sum_k S_k$
        - Transitions: $P_j(X_j' \mid X_j, Y_j, A_j)$
    - Reward messages are over SepSets
    - Solve stand-alone MDP using any algorithm
    - Obtain visitation frequencies of the resulting policy:
        - $\phi_j$ = discounted frequency of visits to each state-action pair
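The stand-alone reward, the local reward minus the parent's message plus the children's messages, can be sketched as a table lookup. This is an illustration under stated assumptions rather than the authors' implementation: each message is assumed to be a table indexed by an assignment to its SepSet variables, and the variable names, table values, and helper names (`project`, `standalone_reward`) are hypothetical.

```python
def project(assignment, variables):
    """Restrict a full assignment to the given variables, in a fixed order."""
    return tuple(assignment[v] for v in sorted(variables))


def standalone_reward(assignment, local_reward, parent_msg, child_msgs):
    """local reward - message from parent + sum of messages to children.

    parent_msg is None for the root, otherwise a (sepset_variables, table) pair;
    child_msgs is a list of such pairs, one per child.
    """
    total = local_reward(assignment)
    if parent_msg is not None:
        variables, table = parent_msg
        total -= table[project(assignment, variables)]
    for variables, table in child_msgs:
        total += table[project(assignment, variables)]
    return total


# Hypothetical subsystem sharing G with its parent and I with one child.
assignment = {"G": 1, "S": 0, "I": 1}
local_reward = lambda a: 2.0 if a["S"] == 0 else 0.0
parent_msg = ({"G"}, {(0,): 0.0, (1,): 0.5})
child_msgs = [({"I"}, {(0,): 0.0, (1,): 1.5})]

r = standalone_reward(assignment, local_reward, parent_msg, child_msgs)  # 2.0 - 0.5 + 1.5
```

Because each message only depends on SepSet variables, a subsystem never needs to see variables outside its own scope to evaluate its stand-alone reward.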

14. Visitation Frequencies
    - Dual LP variables: discounted frequency of visits to each state-action pair
    - Subsystems must agree on the frequency for shared variables ⇒ reward messages
    - Approximation ⇒ relaxed enforcement of constraints
    - [Figure: subsystem M2 (Speed control) with variables S, G, I]
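Discounted visitation frequencies are easy to compute for a fixed policy on a small chain. The sketch below assumes a hypothetical two-state chain, an initial state distribution, and a discount factor of 0.9. It tracks state frequencies only, since under a fixed deterministic policy the state-action frequency of a state is carried by the action the policy picks there.

```python
def visitation_frequencies(P, start, gamma, sweeps=500):
    """Iterate phi(t) = start(t) + gamma * sum_s phi(s) * P[s][t] to its fixed point.

    P[s][t] is the probability of moving from s to t under the fixed policy.
    """
    n = len(start)
    phi = list(start)
    for _ in range(sweeps):
        phi = [
            start[t] + gamma * sum(phi[s] * P[s][t] for s in range(n))
            for t in range(n)
        ]
    return phi


# Hypothetical chain: state 0 stays or moves to state 1; state 1 is absorbing.
P = [[0.5, 0.5], [0.0, 1.0]]
start = [1.0, 0.0]
gamma = 0.9
phi = visitation_frequencies(P, start, gamma)

# The value of the policy is the frequency-weighted sum of rewards.
rewards = [1.0, 0.0]
policy_value = sum(f * r for f, r in zip(phi, rewards))
```

The frequencies always sum to $1 / (1 - \gamma)$ when the start distribution sums to 1, which gives a quick sanity check. Agreement between subsystems then means that their frequencies, marginalized onto the shared variables, coincide.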

15. Overview of Algorithm: Detailed
    - Each subsystem solves a local (stand-alone) MDP:
        - Compute local visitation frequencies $\phi_j$
        - Add constraint to reward message LP
    - Each subsystem computes messages by solving a simple local LP:
        - Sends a “constraint message” to its parent: visitation frequencies for SepSet variables
        - Sends “reward messages” to its children
    - Repeat until convergence
    - [Figure: subsystem Mj with parent Ml and child Mk]

16. Reward Message LP
    - LP yields reward messages $S_k$ for children
    - Dual yields mixing weights $p_j$, $p_k$ ⇒ enforce consistent frequencies

17. Computing Reward Messages
    - Rows of $\phi_{jj}$ and $L_j$ correspond to the visitation frequencies and value of each policy visited by $M_j$
    - Rows of $\phi_{jk}$ are frequencies marginalized to SepSet[$M_k$]
    - Messages:
        - Dual of the reward message LP generates mixed policies
        - $p_j$ and $p_k$ are mixing parameters; they force parents and children to agree on visitation of the SepSet

18. Convergence Result
    - In a finite number of iterations, the algorithm produces the best possible value function (i.e., the same as a centralized planner)
    - Planning algorithm is a special case of nested Benders decomposition:
        - One Benders split for each internal node N of the subsystem tree
        - One subproblem is N itself
        - Remaining subproblems are subtrees for N’s children (decompose these recursively)
        - Master problem is to determine reward messages
    - Result follows from correctness of Benders decomposition
    - [Figure: Ml and Mj exchanging a reward message and a constraint message]

19. Hierarchical Action Selection
    - Distributed planning obtains the value function
    - Distributed message passing obtains the action choice (policy):
        - Each subsystem sends a conditional value to its parent
        - Each subsystem sends an action choice to its children
    - Limited observability
    - Limited communication
    - [Figure: subsystem Mj with parent Ml and child Mk; action choices flow down the tree and values of conditional policies flow up]
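The two message directions on this slide, conditional values going up and action choices going down, follow the pattern of dynamic programming on a tree. The sketch below is a generic illustration of that pattern and is not taken from the talk: it assumes each subsystem has a small table of local values indexed by its parent's choice and its own choice, and all names and numbers are hypothetical.

```python
def best_choice(node, parent_choice, q, children, choices):
    """Return (value, plan) for the subtree at `node`, given the parent's choice.

    The value is what a node would report upward (its conditional value);
    the plan records the choice each node in the subtree ends up making.
    """
    best = None
    for choice in choices[node]:
        value = q[node][(parent_choice, choice)]
        plan = {node: choice}
        for child in children.get(node, []):
            child_value, child_plan = best_choice(child, choice, q, children, choices)
            value += child_value
            plan.update(child_plan)
        if best is None or value > best[0]:
            best = (value, plan)
    return best


# Hypothetical two-node tree: M1 is the root, M2 its only child.
children = {"M1": ["M2"]}
choices = {"M1": ["low", "high"], "M2": ["brake", "coast"]}
q = {
    "M1": {(None, "low"): 1.0, (None, "high"): 2.0},
    "M2": {
        ("low", "brake"): 0.0, ("low", "coast"): 3.0,
        ("high", "brake"): 1.0, ("high", "coast"): 0.5,
    },
}

value, plan = best_choice("M1", None, q, children, choices)
```

Here `M1` alone would prefer `high`, but the conditional values reported by `M2` make `low` the better joint choice. Each node only consults its own table and its children's replies, which matches the limited observability and communication listed on the slide.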

20. Reusing Models and Computation
    - Classes of objects:
        - Basic subsystems with the same rewards and transitions
    - Reuse in knowledge representation:
        - Library of subsystems
    - Reusing computation:
        - Compute policy (visitation frequencies) for one subsystem, use it in all subsystems of the same class
        - Compute messages for one subtree, use them in all equivalent subtrees
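Reusing computation across subsystems of the same class amounts to caching the solver's result by class. The following is a minimal sketch: the class names, the subsystem names, and the placeholder standing in for a real MDP solver are all hypothetical.

```python
from functools import lru_cache

solve_calls = []  # records which classes were actually solved


@lru_cache(maxsize=None)
def policy_for_class(class_name):
    """Solve the stand-alone MDP once per class (placeholder for a real solver)."""
    solve_calls.append(class_name)
    return f"policy-for-{class_name}"


# Hypothetical engine with three cylinders of the same class and one exhaust.
subsystems = {"cyl1": "Cylinder", "cyl2": "Cylinder", "cyl3": "Cylinder", "exhaust": "Exhaust"}
policies = {name: policy_for_class(cls) for name, cls in subsystems.items()}
```

Four subsystems trigger only two solves. In the full algorithm a cached result is reusable only while the subsystems also receive the same reward messages, which is the case for equivalent subtrees.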

21. Related Work
    - Serial decompositions:
        - One subsystem “active” at a time
        - Kushner & Chen ’74 (rooms in a maze)
        - Dean & Lin, IJCAI-95 (combines with abstraction)
        - Hierarchical RL is similar (MAXQ, HAM, etc.)
    - Parallel decompositions:
        - More expressive (exponentially larger state space)
        - Singh & Cohn, NIPS-98 (enumerates states)
        - Meuleau et al., AAAI-98 (heuristic for resources)

22. Related Work
    - Dantzig-Wolfe or Benders decomposition:
        - Dantzig ’65
        - First used for MDPs in Kushner & Chen ’74
        - We are the first to apply it to parallel subsystems
    - Variable elimination:
        - Well-known from Bayes nets
        - Guestrin, Koller & Parr, NIPS-01

23. Summary: Hierarchical Factored MDPs
    - Parallel decomposition ⇒ exponential state space
    - Efficient distributed planning algorithm:
        - Solve local stand-alone MDPs with any algorithm
        - Reward sharing coordinates subsystem plans
        - Simple message passing algorithm computes rewards
    - Hierarchical action selection:
        - Limited communication
        - Limited observability
    - Reuse for knowledge representation and computation
    - General approach for modeling and planning in large stochastic systems