## Distributed Planning in Hierarchical Factored MDPs


Carlos Guestrin, Stanford University
Geoffrey Gordon, Carnegie Mellon University

## Multiagent Coordination Examples

- Search and rescue
- Factory management
- Supply chain
- Firefighting
- Network routing
- Air traffic control

Agents access only local information, which calls for distributed control and distributed planning.

## Hierarchical Decomposition

[Figure: part-of hierarchy for a car, with subsystems such as chassis, steering, engine, cylinders, injection, and exhaust]

- Subsystems can share variables
- Each subsystem only observes its local variables
- Parallel decomposition ⇒ exponential state space

## Outline

- Object-based representation
  - Hierarchical factored MDPs
- Distributed planning
  - Message passing algorithm based on LP decomposition
  - Hierarchical action selection mechanism
  - Limited observability and communication
- Reusing plans and computation
  - Exploit classes of objects

## Basic Subsystem MDP

[Figure: two-slice DBN for a speed-control subsystem, distinguishing action, external, and internal variables]

- Subsystem j is decomposed into:
  - Internal variables X_j
  - External variables Y_j
  - Actions A_j
- Subsystem model:
  - Rewards: R_j(X_j, Y_j, A_j)
  - Transitions: P_j(X_j' | X_j, Y_j, A_j)
- A subsystem can be modeled with any representation

## Hierarchical Subsystem Tree

[Figure: subsystem tree with M1 (transmission) at the root and children M2 (speed control) and M3 (cooling); each edge is labeled with its SepSet]

- Subsystem tree:
  - Nodes are subsystems
  - Hierarchical decomposition
  - Tree reward = sum of the subsystem rewards
- Consistent subsystem tree:
  - Running intersection property
  - Consistent dynamics
- Lemma: a consistent subsystem tree yields a well-defined global MDP

## Relationship to Factored MDPs

[Figure: side-by-side DBNs of a hierarchical factored MDP and the corresponding multiagent factored MDP]

- Representational power is equivalent: a hierarchical factored MDP ≡ a multiagent factored MDP [Guestrin et al. '01] with a particular choice of basis functions
- New capabilities:
  - Fully distributed planning algorithm
  - Reuse for knowledge representation
  - Reuse of computation
- MDP counterpart to Object-Oriented Bayes Nets (OOBNs) [Koller and Pfeffer '97]

## Planning for Hierarchical Factored MDPs

- Action space: joint action a = {a_1, ..., a_n} over all subsystems
- State space: joint state x of the entire system
- Reward function: total reward r
- Action and state spaces are exponential in the number of subsystems
- Exploit the hierarchical structure:
  - Efficient, distributed approximate planning algorithm
  - Simple message passing approach
  - Each subsystem accesses only its local model
  - Each local model is solved by any standard MDP algorithm

## Solving MDPs as LPs

- Bellman constraint: if action a leads from x to y with reward r, then V(x) ≥ V(y) + r = Q(a, x); similarly for stochastic transitions
- The optimal V* satisfies all Bellman constraints and is componentwise smallest, e.g. (solved numerically in the first sketch below):

  min V(x) + V(y) + V(z) + V(g)
  s.t. V(x) ≥ V(y) + 1, V(y) ≥ V(g) + 3, V(x) ≥ V(z) + 2, V(z) ≥ V(g) + 1

## Decomposable Value Functions

Linear combination of restricted-domain functions [Bellman et al. '63; Schweitzer & Seidmann '85; Tsitsiklis & Van Roy '96; Koller & Parr '99, '00; Guestrin et al. '01]:

  V(x) ≈ Σ_i w_i h_i(x)

- Each h_i is the status of a small part of a complex system, e.g.:
  - Status of a machine and its neighbors
  - Load on a machine
- Must find weights w giving a good approximate value function
- Well-designed h_i ⇒ exponentially fewer parameters

## Approximate Linear Programming

- To solve the subsystem tree MDP as an LP:
  - The overall state is the cross-product of the subsystem states
  - The Bellman LP has exponentially many constraints and variables, so we need to approximate
- Write V(x) = V_1(X_1) + V_2(X_2) + ...:

  min V_1(X_1) + V_2(X_2) + ...
  s.t. V_1(X_1) + V_2(X_2) + ... ≥ V_1(Y_1) + V_2(Y_2) + ... + R_1 + R_2 + ...

- One variable V_i(X_i) for each state of each subsystem
- One constraint for every state and action
- V_i, Q_i depend on small sets of variables/actions ⇒ generates polynomially-sized LPs for factored MDPs [Guestrin et al. '01] (see the sketches below)
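As a concrete check of the small Bellman LP two slides above, here is a minimal sketch that solves it with an off-the-shelf solver. The encoding (variable order, the use of scipy, grounding the goal state g through the nonnegativity bounds) is our illustration, not part of the talk.

```python
# Exact Bellman LP for the four-state example: minimize the sum of values
# subject to one constraint per transition. Rewritten for linprog's
# A_ub @ v <= b_ub form: V(s) >= V(s') + r  becomes  V(s') - V(s) <= -r.
import numpy as np
from scipy.optimize import linprog

# Variable order: [V(x), V(y), V(z), V(g)].
c = np.ones(4)  # objective: V(x) + V(y) + V(z) + V(g)
A_ub = np.array([
    [-1.0,  1.0,  0.0,  0.0],  # V(x) >= V(y) + 1
    [ 0.0, -1.0,  0.0,  1.0],  # V(y) >= V(g) + 3
    [-1.0,  0.0,  1.0,  0.0],  # V(x) >= V(z) + 2
    [ 0.0,  0.0, -1.0,  1.0],  # V(z) >= V(g) + 1
])
b_ub = np.array([-1.0, -3.0, -2.0, -1.0])

# bounds=(0, None) grounds the goal at V(g) = 0 and keeps the LP bounded.
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
print(res.x)  # [4. 3. 1. 0.]: the componentwise-smallest feasible V
```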
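The approximate LP can likewise be written down directly for any small MDP we can enumerate. The sketch below uses the generic basis-function form V(x) ≈ Σ_i w_i h_i(x) with uniform state-relevance weights; it illustrates only the LP itself, not the factored construction from the talk, which never enumerates the joint state space. The function name and the random test instance are our own.

```python
# Approximate linear programming: optimize basis weights w so that
# H @ w satisfies every Bellman constraint, minimizing sum_x V(x).
import numpy as np
from scipy.optimize import linprog

def approx_lp(P, R, H, gamma=0.95):
    """P: (A, S, S) transitions, R: (S, A) rewards, H: (S, K) basis matrix."""
    S, A = R.shape
    c = H.sum(axis=0)  # objective: sum over states of H[x] @ w
    rows, rhs = [], []
    for a in range(A):
        PH = P[a] @ H  # expected next-step basis values under action a
        for s in range(S):
            # H[s] @ w >= R[s, a] + gamma * PH[s] @ w, in A_ub @ w <= b form
            rows.append(-(H[s] - gamma * PH[s]))
            rhs.append(-R[s, a])
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=(None, None))  # weights are free variables
    return res.x

# Tiny random instance: 6 states, 2 actions, 3 basis functions
# (including a constant basis, which keeps the LP feasible).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(6), size=(2, 6))
R = rng.random((6, 2))
H = np.hstack([np.ones((6, 1)), rng.random((6, 2))])
print(approx_lp(P, R, H))
```

One variable per basis weight and one constraint per state-action pair; the factored construction in the talk avoids even this enumeration by exploiting the restricted scopes of the V_i and Q_i.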
## Overview of Algorithm

[Figure: subsystem tree in which each node M_j exchanges messages with its parent M_l and its children M_k: reward messages flow down, constraint messages flow up]

- Each subsystem solves a local (stand-alone) MDP
- Each subsystem computes messages by solving a simple local LP:
  - Sends a "constraint message" to its parent
  - Sends "reward messages" to its children
- Repeat until convergence

## Stand-alone MDPs and Reward Messages

Reward messages: S_j is received from the parent; S_k is sent to each child k.

| | Subsystem MDP | Stand-alone MDP |
| --- | --- | --- |
| State | (X_j, Y_j) | X_j |
| Actions | A_j | (A_j, Y_j) |
| Rewards | R_j(X_j, Y_j, A_j) | R_j(X_j, Y_j, A_j) − S_j + Σ_k S_k |
| Transitions | P_j(X_j' \| X_j, Y_j, A_j) | P_j(X_j' \| X_j, Y_j, A_j) |

- Reward messages are defined over the SepSets
- Solve the stand-alone MDP using any algorithm
- Obtain the visitation frequencies of the resulting policy: μ_j = discounted frequency of visits to each state-action pair (computed explicitly in the first sketch below)

## Visitation Frequencies

[Figure: dual LP over the speed-control subsystem M2]

- Discounted frequency of visits to each state-action pair
- Subsystems must agree on the frequencies of shared variables ⇒ reward messages
- Approximation ⇒ relaxed enforcement of these consistency constraints

## Overview of Algorithm: Detailed

- Each subsystem solves a local (stand-alone) MDP:
  - Compute the local visitation frequencies μ_j
  - Add a constraint to the reward message LP
- Each subsystem computes messages by solving a simple local LP:
  - Sends a "constraint message" to its parent: visitation frequencies for the SepSet variables
  - Sends "reward messages" to its children
- Repeat until convergence

## Reward Message LP

- The LP yields the reward messages S_k for the children
- Its dual yields mixing weights p_j, p_k that enforce consistent frequencies

## Computing Reward Messages

- Each row of the reward message LP records the visitation frequencies and the value of one policy visited by M_j; the rows for a child M_k are those frequencies marginalized onto SepSet[M_k]
- The dual of the reward message LP generates mixed policies
- p_j and p_k are mixing parameters that force parents and children to agree on the visitation of the SepSet

## Convergence Result

In a finite number of iterations, the algorithm produces the best possible value function (i.e., the same as a centralized planner).

- The planning algorithm is a special case of nested Benders decomposition:
  - One Benders split for each internal node N of the subsystem tree
  - One subproblem is N itself
  - The remaining subproblems are the subtrees of N's children (decomposed recursively)
  - The master problem determines the reward messages
- The result follows from the correctness of Benders decomposition

## Hierarchical Action Selection

[Figure: subsystem tree in which each node sends the value of its conditional policy up to its parent and an action choice down to its children]

- Distributed planning obtains the value function
- Distributed message passing obtains the action choice (policy); each subsystem:
  - Sends the value of its conditional policy to its parent
  - Sends the action choice to its children
- Limited observability
- Limited communication

## Reusing Models and Computation

- Classes of objects: basic subsystems with the same rewards and transitions
- Reuse in knowledge representation: a library of subsystems
- Reusing computation:
  - Compute the policy (visitation frequencies) for one subsystem and use it in all subsystems of the same class
  - Compute the messages for one subtree and use them in all equivalent subtrees
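For concreteness, the visitation frequencies exchanged in the messages above can be computed in closed form for a fixed policy: μ_π = (I − γ P_πᵀ)⁻¹ α, where α is the initial state distribution. This is standard MDP algebra rather than anything specific to the talk; the function name and test instance below are our own.

```python
# Discounted visitation frequencies mu(s) = sum_t gamma^t Pr(s_t = s)
# of a fixed deterministic policy, spread onto state-action pairs.
import numpy as np

def visitation_frequencies(P, policy, alpha, gamma=0.95):
    """P: (A, S, S) transitions, policy: (S,) action index per state,
    alpha: (S,) initial state distribution."""
    S = alpha.shape[0]
    P_pi = P[policy, np.arange(S), :]  # row s follows action policy[s]
    # mu solves the flow equation  mu = alpha + gamma * P_pi^T @ mu.
    mu_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, alpha)
    mu_sa = np.zeros((S, P.shape[0]))  # one column per action
    mu_sa[np.arange(S), policy] = mu_s  # deterministic policy
    return mu_sa

# Example: 3 states, 2 actions, uniform initial distribution.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(3), size=(2, 3))
policy = np.array([0, 1, 0])
print(visitation_frequencies(P, policy, np.full(3, 1 / 3)))
```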
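Finally, a toy sketch of how reward messages reshape a subsystem's stand-alone MDP. The sign convention follows the table above (R_j − S_j + Σ_k S_k); plain value iteration stands in for the "any MDP algorithm" step, and the random model and message vectors are hypothetical placeholders.

```python
# Reshape local rewards with hypothetical reward messages, then solve
# the resulting stand-alone MDP with value iteration.
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """P: (A, S, S), R: (S, A); returns the value function and greedy policy."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)  # one-step lookahead
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

rng = np.random.default_rng(2)
n_states, n_actions = 4, 2
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R_local = rng.random((n_states, n_actions))

# Hypothetical reward messages, here simply indexed by local state:
# S_parent is received from the parent, S_child is owed to one child.
S_parent = rng.random(n_states)
S_child = rng.random(n_states)
R_standalone = R_local - S_parent[:, None] + S_child[:, None]

V, policy = value_iteration(P, R_standalone)
print(V, policy)
```

In the real algorithm the messages are functions of SepSet variables rather than raw local states, and solving this reshaped MDP yields the visitation frequencies that feed the next round of messages.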
## Related Work

- Serial decompositions: one subsystem "active" at a time
  - Kushner & Chen '74 (rooms in a maze)
  - Dean & Lin, IJCAI-95 (combines decomposition with abstraction)
  - Hierarchical RL is similar (MAXQ, HAM, etc.)
- Parallel decompositions: more expressive (exponentially larger state space)
  - Singh & Cohn, NIPS-98 (enumerates states)
  - Meuleau et al., AAAI-98 (heuristic for resources)

## Related Work (cont.)

- Dantzig-Wolfe and Benders decomposition
  - Dantzig '65
  - First used for MDPs by Kushner & Chen '74
  - We are the first to apply it to parallel subsystems
- Variable elimination
  - Well known from Bayes nets
  - Guestrin, Koller & Parr, NIPS-01

## Summary: Hierarchical Factored MDPs

- Parallel decomposition ⇒ exponential state space
- Efficient distributed planning algorithm:
  - Solve local stand-alone MDPs with any algorithm
  - Reward sharing coordinates the subsystem plans
  - A simple message passing algorithm computes the rewards
- Hierarchical action selection:
  - Limited communication
  - Limited observability
- Reuse for knowledge representation and computation
- A general approach for modeling and planning in large stochastic systems