
Concurrent Markov Decision Processes



Presentation Transcript


  1. Concurrent Markov Decision Processes Mausam, Daniel S. Weld University of Washington Seattle

  2. What action next? [Diagram: the planner repeatedly chooses actions to execute in the environment and receives percepts back.]

  3. Motivation • Two features of real-world planning domains: • Concurrency (widely studied in the classical planning literature) • Some instruments may warm up • Others may perform their tasks • Others may shut down to save power. • Uncertainty (widely studied in the MDP literature) • All actions (pick up the rock, send data, etc.) have a probability of failure. • Need both!

  4. Probabilistic Planning • Probabilistic planning is typically modeled as a Markov Decision Process. • Traditional MDPs assume a “single action per decision epoch”. • Solving Concurrent MDPs naïvely incurs an exponential blowup in running time.

  5. Outline of the talk • MDPs • Concurrent MDPs • Present sound pruning rules to reduce the blowup. • Present sampling techniques to obtain orders of magnitude speedups. • Experiments • Conclusions and Future Work

  6. Markov Decision Process • S : a set of states, factored into Boolean variables. • A : a set of actions • Pr : S × A × S → [0,1] : the transition model • C : A → ℝ : the cost model • γ : the discount factor (γ ∈ (0,1]) • s0 : the start state • G : a set of absorbing goals

  7. GOAL of an MDP • Find a policy π : S → A which: • minimises the expected discounted cost of reaching a goal • for an infinite horizon • for a fully observable Markov decision process.

  8. Bellman Backup • Define J*(s) (the optimal cost) as the minimum expected cost to reach a goal from state s. • Given an estimate of the J* function (say Jn) • Back up the Jn function at state s to calculate a new estimate (Jn+1) as follows (a reconstruction of the formula is given below). Value Iteration: perform Bellman updates at all states in each iteration; stop when the costs have converged at all states.
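
A standard reconstruction of the backup formula (which appears only as an image on the original slide), using the S, A, Pr, C, γ notation of slide 6; the slide's exact form may differ slightly:

```latex
Q_{n+1}(s,a) \;=\; C(a) \;+\; \gamma \sum_{s' \in S} \Pr(s' \mid s, a)\, J_n(s'),
\qquad
J_{n+1}(s) \;=\; \min_{a \in Ap(s)} Q_{n+1}(s,a)
```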

  9. Bellman Backup [Diagram: for each action a1, a2, a3 in Ap(s), Qn+1(s,a) is computed from the Jn values of its successor states; Jn+1(s) is the minimum of these Q-values.]

  10. RTDP Trial [Diagram: a Bellman backup at s over a1, a2, a3 yields the greedy action amin = a2; the trial simulates a2, moves to a sampled successor state, and repeats until the goal is reached.]

  11. Real Time Dynamic Programming (Barto, Bradtke and Singh’95) • Trial: simulate the greedy policy, performing a Bellman backup on each visited state • Repeat RTDP trials until the cost function converges • Anytime behaviour • Only expands the reachable state space • Complete convergence is slow • Labeled RTDP (Bonet & Geffner’03) • Admissible, if started with an admissible cost function • Monotonic; converges quickly • (A sketch of one trial follows below.)
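
A minimal sketch of one RTDP trial as described above, assuming hypothetical helpers applicable(s), transition(s, a) (returning a dict of successor probabilities), cost(a), and is_goal(s); the dictionary J plays the role of Jn and should be initialised with an admissible cost function. This is illustrative, not the authors' implementation.

```python
import random

def q_value(J, s, a, gamma, cost, transition):
    """One-step lookahead: expected cost-to-go of applying action a in state s."""
    return cost(a) + gamma * sum(p * J.get(s2, 0.0) for s2, p in transition(s, a).items())

def rtdp_trial(J, s0, gamma, applicable, transition, cost, is_goal, max_steps=1000):
    """Simulate the greedy policy from s0, performing a Bellman backup at each visited state."""
    s = s0
    for _ in range(max_steps):
        if is_goal(s):
            break
        # Bellman backup at s: evaluate every applicable action, keep the minimum
        qs = {a: q_value(J, s, a, gamma, cost, transition) for a in applicable(s)}
        a_min = min(qs, key=qs.get)
        J[s] = qs[a_min]
        # Follow the greedy action to a successor sampled from the transition model
        succ = transition(s, a_min)
        s = random.choices(list(succ.keys()), weights=list(succ.values()))[0]
    return J
```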

  12. Concurrent MDPs • Redefining the applicability function: Ap : S → P(P(A)) • Inheriting mutex definitions from classical planning: • Conflicting preconditions, e.g. a1 : if p1 set x1 ; a2 : if ¬p1 set x1 • Conflicting effects, e.g. a1 : set x1 (pr=0.5) ; a2 : toggle x1 (pr=0.5) • Interfering preconditions and effects, e.g. a1 : if p1 set x1 ; a2 : toggle p1 (pr=0.5)

  13. Concurrent MDPs (contd) • Ap(s) = { Ac ⊆ A | all actions in Ac are individually applicable in s, and no two actions in Ac are mutex } • ⇒ The actions in Ac don’t interact with each other. Hence, the joint transition model follows from the individual ones (see the reconstruction below).
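
The formula following "Hence" appears only as an image in the original slide. A plausible reconstruction, under the stated assumption that the actions in Ac = {a1, …, ak} do not interact (so their effects commute and can be applied in any order), is that the joint transition model is the composition of the individual ones:

```latex
\Pr(s' \mid s, A_c) \;=\; \sum_{s_1, \dots, s_{k-1}} \Pr(s_1 \mid s, a_1)\,\Pr(s_2 \mid s_1, a_2)\cdots\Pr(s' \mid s_{k-1}, a_k)
```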

  14. Concurrent MDPs (contd) • Cost model: C : P(A) → ℝ • Typically, C(Ac) < Σa∈Ac C({a}) • Time component • Resource component • (if C(Ac) = … then the optimal sequential policy is also optimal for the concurrent MDP)
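
The slide leaves the exact combined cost elided. Purely as an illustration of a sub-additive cost model built from the time and resource components mentioned here (reusing the 0.2/0.8 weights that appear on the experiments slide), one might write something like the following; this is an assumption for illustration, not the paper's definition:

```latex
C(A_c) \;=\; w_t \cdot \max_{a \in A_c} \mathit{time}(a) \;+\; w_r \cdot \sum_{a \in A_c} \mathit{res}(a),
\qquad \text{e.g. } w_t = 0.2,\ w_r = 0.8
```

Because the time component charges only the longest action rather than the sum of durations, such a C(Ac) is typically smaller than Σa∈Ac C({a}), which is exactly what makes concurrency worthwhile.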

  15. Bellman Backup (Concurrent MDP) [Diagram: the backup at s now minimises over every action combination in Ap(s): {a1}, {a2}, {a3}, {a1,a2}, {a1,a3}, {a2,a3}, {a1,a2,a3}.] Exponential blowup to calculate a Bellman backup!
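
To make the blowup concrete, here is a brute-force sketch that materialises Ap(s) exactly as defined on slide 13, assuming hypothetical applicable(a, s) and mutex(a1, a2) predicates; with n individually applicable, pairwise non-mutex actions it returns 2^n − 1 combinations, which is why a naïve backup is exponential.

```python
from itertools import combinations

def applicability_set(s, actions, applicable, mutex):
    """Ap(s): every non-empty subset of actions that are individually applicable
    in s and pairwise non-mutex (brute force, for illustration only)."""
    candidates = [a for a in actions if applicable(a, s)]
    ap = []
    for k in range(1, len(candidates) + 1):
        for combo in combinations(candidates, k):
            if all(not mutex(a1, a2) for a1, a2 in combinations(combo, 2)):
                ap.append(frozenset(combo))
    return ap
```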

  16. Outline of the talk • MDPs • Concurrent MDPs • Present sound pruning rules to reduce the blowup. • Present sampling techniques to obtain orders of magnitude speedups. • Experiments • Conclusions and Future Work

  17. Combo skipping (proven sound pruning rule) If ⌈Jn(s)⌉ < γ^(1−k) Qn(s,{a1}) + func(Ac, γ), then prune Ac for state s in this backup. Choose a1 as the action with the maximum Qn(s,{a1}) to obtain maximum pruning. Use Qn(s,Aprev) as an upper bound on Jn(s). Skips a combination only for the current iteration.

  18. Combo elimination (proven sound pruning rule) If ⌊Q*(s,Ac)⌋ > ⌈J*(s)⌉, then eliminate Ac from the applicability set of state s. Use J*sing(s) (the optimal cost for the single-action MDP) as an upper bound on J*(s). Use Qn(s,Ac) as a lower bound on Q*(s,Ac). Eliminates the combination Ac from the applicable list of s for all subsequent iterations.
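
A small sketch of the combo-elimination test as stated above; q_lower and j_upper are assumed to be the bounds named on the slide, i.e. Qn(s, Ac) computed from an admissible cost function and J*sing(s) from the single-action MDP. The helper names are illustrative.

```python
def combo_eliminated(q_lower, j_upper):
    """Sound elimination: if even a lower bound on Q*(s, Ac) exceeds an upper
    bound on J*(s), the combination Ac can never be optimal in s."""
    return q_lower > j_upper

def prune_applicability(s, ap_s, q_lower_fn, j_upper_fn):
    """Permanently drop eliminated combinations from Ap(s)."""
    return [ac for ac in ap_s if not combo_eliminated(q_lower_fn(s, ac), j_upper_fn(s))]
```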

  19. Pruned RTDP • RTDP with modified Bellman Backups. • Combo-skipping • Combo-elimination • Guarantees: • Convergence • Optimality

  20. Experiments • Domains • NASA Rover Domain • Factory Domain • Switchboard domain • Cost function • Time Component 0.2 • Resource Component 0.8 • State variables : 20-30 • Avg(Ap(s)) : 170 - 12287

  21. Speedups in Rover domain

  22. Stochastic Bellman Backups • Sample a subset of combinations for each Bellman backup. • Intuition: actions with low Q-values are likely to appear in the optimal combination. • Sampling distribution: (i) calculate all single-action Q-values; (ii) bias towards combinations containing actions with low Q-values. • Also keep the best combinations for this state from the previous iteration (memoization). • (A sketch of the sampling step follows below.)
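
A sketch of the biased sampling step; the exponential weighting, the temperature parameter, and the helper names are illustrative choices, not necessarily the exact distribution used by the authors.

```python
import math
import random

def sample_combinations(single_q, ap_s, num_samples, best_prev=(), temperature=1.0):
    """Pick a subset of Ap(s) for a stochastic Bellman backup.

    single_q : dict mapping each action to its single-action Q-value (step (i)).
    ap_s     : list of applicable, mutex-free combinations (frozensets) at s.
    best_prev: best combinations for this state from the previous iteration
               (memoization); always kept.
    """
    # Step (ii): combinations whose actions have low Q-values get high weight.
    def weight(combo):
        avg_q = sum(single_q[a] for a in combo) / len(combo)
        return math.exp(-avg_q / temperature)  # lower expected cost => higher weight

    weights = [weight(c) for c in ap_s]
    sampled = random.choices(ap_s, weights=weights, k=num_samples)
    return list(set(sampled) | set(best_prev))
```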

  23. Sampled RTDP • Non-monotonic • Inadmissible • ⇒ Convergence and optimality are not proven. • Heuristics: • Complete backup phase (labeling). • Run Pruned RTDP with the value function from Sampled RTDP (after scaling).

  24. Speedup in the Rover domain

  25. Close to optimal solutions

  26. Speedup vs. Concurrency

  27. Varying num_samples [Plots: the trade-off between optimality and efficiency as num_samples changes.]

  28. Contributions • Modeled Concurrent MDPs • Sound, optimal pruning methods • Combo-skipping • Combo-elimination • Fast sampling approaches • Close to optimal solution • Heuristics to improve optimality • Our techniques are general and can be applied to any algorithm – VI, LAO*, etc.

  29. Related Work • Factorial MDPs (Meuleau et al.’98, Singh & Cohn’98) • Multiagent planning (Guestrin, Koller, Parr’01) • Concurrent Markov Options (Rohanimanesh & Mahadevan’01) • Generate, test and debug paradigm (Younes & Simmons’04) • Parallelization of sequential plans (Edelkamp’03, Nigenda & Kambhampati’03)

  30. Future Work • Find error bounds, prove convergence for Sampled RTDP • Concurrent Reinforcement Learning • Modeling durative actions (Concurrent Probabilistic Temporal Planning) • Initial results: Mausam & Weld’04 (AAAI Workshop on MDPs)

  31. Concurrent Probabilistic Temporal Planning (CPTP) • Concurrent MDP • CPTP • Our solution (AAAI Workshop on MDPs) • Model CPTP as a Concurrent MDP in an augmented state space. • Present admissible heuristics to speed up the search and manage the state space blowup.
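
As a rough, hedged sketch of what the augmented state space mentioned above could look like (based on the general idea of interleaving durative actions, not on the workshop paper's exact formulation): each search state records the underlying MDP state together with the set of actions still executing and how long each has been running, e.g.

```latex
\tilde{s} \;=\; \big\langle\, s,\ \{(a_1, \delta_1), \dots, (a_m, \delta_m)\} \,\big\rangle
```

where s is the world state, the a_i are the currently executing actions, and δ_i is the time for which a_i has already been executing.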
