
Probabilistic Planning via Determinization in Hindsight FF-Hindsight


Presentation Transcript


  1. Probabilistic Planning via Determinization in Hindsight (FF-Hindsight). Sungwook Yoon, joint work with Alan Fern, Bob Givan, and Rao Kambhampati.

  2. Probabilistic Planning Competition. Client: the participants, who send actions. Server: the competition host, which simulates the actions.

  3. The Winner was … FF-Replan • A replanner that uses FF • The probabilistic domain is determinized • An interesting contrast: many probabilistic planning techniques work in theory but do not work in practice • FF-Replan has no theory, yet works in practice

  4. The Paper’s Objective • A better determinization approach (determinization in hindsight) • Theoretical consideration of the new determinization • A new view of FF-Replan • Experimental studies with determinization in hindsight (FF-Hindsight)

  5. Probabilistic Planning (goal-oriented) [Figure: a two-step outcome tree from initial state I with actions A1 and A2; left outcomes are more likely; some leaves are goal states and one is a dead end. Objective: maximize goal achievement.]

  6. All Outcome Replanning (FFRa), ICAPS-07 [Figure: a probabilistic action with two outcomes (Effect 1 with Probability1, Effect 2 with Probability2) is split into two deterministic actions, Action1 producing Effect 1 and Action2 producing Effect 2.]
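To make the determinization concrete, here is a minimal sketch in Python. The ProbAction/DetAction classes and their field names are hypothetical stand-ins for whatever PPDDL representation FF-Replan actually parses; the point is only that each probabilistic outcome becomes its own deterministic action and the probabilities are discarded.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProbAction:
    """Toy probabilistic action: one precondition, several weighted outcomes."""
    name: str
    precond: frozenset
    outcomes: tuple  # tuple of (probability, add_effects, delete_effects)

@dataclass(frozen=True)
class DetAction:
    """Deterministic action generated for a single outcome."""
    name: str
    precond: frozenset
    add: frozenset
    delete: frozenset

def all_outcome_determinize(action: ProbAction) -> list:
    """Split a probabilistic action into one deterministic action per outcome.

    The outcome probabilities are simply dropped, which is exactly why this
    static determinization can ignore how unlikely an outcome is (slide 9).
    """
    return [
        DetAction(f"{action.name}-{i}", action.precond, frozenset(add), frozenset(delete))
        for i, (_prob, add, delete) in enumerate(action.outcomes, start=1)
    ]

# Hypothetical usage: a probabilistic "A1" with two outcomes becomes A1-1 and A1-2.
move = ProbAction("A1", frozenset({"at-start"}),
                  ((0.7, {"at-left"}, {"at-start"}), (0.3, {"at-right"}, {"at-start"})))
print([a.name for a in all_outcome_determinize(move)])  # ['A1-1', 'A1-2']
```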

  7-8. Probabilistic Planning: All Outcome Determinization [Figure, repeated across two slides: the outcome tree from slide 5 with each probabilistic action A1, A2 replaced by its deterministic outcome actions A1-1, A1-2, A2-1, A2-2 at every level; the planner searches this determinized tree to find the goal.]

  9. The Problem with FF-Replan, and a Better Alternative: Sampling. FF-Replan’s static determinization does not respect outcome probabilities. We need a probabilistic and dynamic determinization: sample future outcomes and determinize in hindsight. Each sampled future becomes a known-future deterministic problem (a concrete sketch follows below).
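A minimal sketch of what "sampling a future" can look like, assuming each action's outcome distribution is available as a list of (probability, successor) pairs; the names states, actions, and outcome_dist are hypothetical, and a real implementation would sample outcomes lazily rather than enumerating all states up front. The key property is that every (state, action, time) triple gets its outcome fixed independently, which is what makes each sampled future a deterministic planning problem.

```python
import random

def sample_future(states, actions, horizon, outcome_dist):
    """Draw one "known future": a fixed successor for every (state, action, t).

    outcome_dist(state, action) -> list of (probability, successor_state).
    The returned dict behaves as a deterministic transition function, so any
    deterministic planner can be run against it.
    """
    future = {}
    for t in range(horizon):
        for s in states:
            for a in actions:
                probs, succs = zip(*outcome_dist(s, a))
                # each (state, action, time) triple is sampled independently
                future[(s, a, t)] = random.choices(succs, weights=probs, k=1)[0]
    return future
```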

  10. Probabilistic Planning (goal-oriented) [Figure: the same outcome tree as slide 5, repeated to set up the sampling example.]

  11. Start Sampling. Note: sampling will reveal which action, A1 or A2, is better at state I.

  12. Hindsight Sample 1 [Figure: one sampled future over the outcome tree; in this future A1 reaches the goal and A2 does not. Running tally: A1: 1, A2: 0.]

  13. Hindsight Sample 2 [Figure: a second sampled future; both actions reach the goal. Running tally: A1: 2, A2: 1.]

  14. Hindsight Sample 3 [Figure: a third sampled future; the tally is unchanged. Running tally: A1: 2, A2: 1.]

  15. Hindsight Sample 4 [Figure: a fourth sampled future; A1 reaches the goal again. Running tally: A1: 3, A2: 1.]

  16. Summary of the Idea: The Decision Process (Estimating the Q-Value Q(s,a)). S: current state; A(S) → S’. 1. For each action A, draw future samples; each sample is a deterministic planning problem. 2. Solve the deterministic problems; for goal-oriented problems the solution length gives Q(s,A). 3. Aggregate the solutions for each action. 4. Select the action with the best aggregation, max_A Q(s,A) (a sketch follows below).
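The four steps can be written down in a few lines. This is only a sketch under stated assumptions: sample_future(horizon) draws one known future as in the sketch above (with the domain arguments bound), and solve_deterministic stands in for the deterministic planner (FF in the paper), returning a plan length when the goal is reachable in the given future and None otherwise. The aggregation here counts goal-reaching futures, matching the tallies on slides 12-15; average solution length, as this slide suggests, is an alternative.

```python
import random

def hindsight_choose_action(state, actions, horizon, num_samples,
                            sample_future, solve_deterministic):
    """Estimate Q(s, a) over sampled futures and pick the best action."""
    score = {a: 0 for a in actions}
    for a in actions:
        for _ in range(num_samples):
            future = sample_future(horizon)               # fresh, independent future
            next_state = future[(state, a, 0)]            # outcome of a in this future
            # is the goal still reachable from the resulting state in this future?
            if solve_deterministic(next_state, future, horizon - 1) is not None:
                score[a] += 1
    best = max(score.values())
    # random tie breaking (slide 28): never commit to a fixed tie order
    return random.choice([a for a in actions if score[a] == best])
```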

  17. Mathematical Summary of the Algorithm • An H-horizon future F_H for M = [S, A, T, R] is a mapping of state, action, and time (h < H) to a state: S × A × h → S • Value of a policy π under F_H: R(s, F_H, π) • V_HS(s,H) = E_{F_H}[ max_π R(s, F_H, π) ] • Compare this with the real value V*(s,H) = max_π E_{F_H}[ R(s, F_H, π) ] • V_FFRa(s) = max_F V(s,F) ≥ V_HS(s,H) ≥ V*(s,H) • Q(s,a,H) = R(a) + E_{F_{H-1}}[ max_π R(a(s), F_{H-1}, π) ] • In our proposal, computing max_π R(s, F_{H-1}, π) is done approximately by FF [Hoffmann and Nebel ’01]; each future is a deterministic problem, solved by FF (restated below).
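The same quantities restated in standard notation (the subscripts were flattened on the slide; nothing new is added):

```latex
\begin{align*}
V_{HS}(s,H)  &= \mathbb{E}_{F_H}\!\left[\, \max_{\pi} R(s, F_H, \pi) \,\right] \\
V^{*}(s,H)   &= \max_{\pi} \; \mathbb{E}_{F_H}\!\left[\, R(s, F_H, \pi) \,\right] \\
V_{FFRa}(s)  &= \max_{F} V(s, F) \;\ge\; V_{HS}(s,H) \;\ge\; V^{*}(s,H) \\
Q(s, a, H)   &= R(a) + \mathbb{E}_{F_{H-1}}\!\left[\, \max_{\pi} R(a(s), F_{H-1}, \pi) \,\right]
\end{align*}
```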

  18. Key Technical Results • The importance of sampling independently across states, actions, and time • The necessity of random tie breaking in decision making • We characterize FF-Replan in terms of hindsight decision making: V_FFRa(s) = max_F V(s,F) • Theorem 1: when there is a policy that achieves the goal with probability 1 within the horizon, the hindsight decision-making algorithm reaches the goal with probability 1 • Theorem 2: a polynomial number of samples suffices with respect to the horizon, the number of actions, and the minimum Q-value advantage

  19. Empirical Results: IPPC-04 problems. Numbers are solved trials. For ZenoTravel, using importance sampling improved the solved trials to 26. [Results table not reproduced in this transcript.]

  20. Empirical Results: these domains were developed specifically to beat FF-Replan. As expected, FF-Replan did not do well, but FF-Hindsight did very well, showing probabilistic reasoning ability while retaining scalability.

  21. Conclusion [Figure: two columns, deterministic planning (classical planning, machine learning for planning, net-benefit optimization, temporal planning) and probabilistic planning (Markov decision processes, machine learning for MDPs, temporal MDPs), connected by determinization to transfer scalability from one side to the other.]

  22. Conclusion • Devised an algorithm that exploits the significant advances in deterministic planning in the context of probabilistic planning • Made many deterministic planning techniques available to probabilistic planning • Most learning-to-plan techniques were developed solely for deterministic planning; now these techniques are relevant to probabilistic planning too • Advanced net-benefit planners can be used for reward-maximization probabilistic planning problems

  23. Discussion • Mercier and Van Hentenryck analyzed the difference between V*(s,H) = max_π E_{F_H}[ R(s, F_H, π) ] and V_HS(s,H) = E_{F_H}[ max_π R(s, F_H, π) ] • Ng and Jordan analyzed the difference between V*(s,H) = max_π E_{F_H}[ R(s, F_H, π) ] and the sample-average value V^(s,H) = max_π (1/m) Σ R(s, F_H, π), where m is the number of sampled futures

  24. IPPC-2004 Results [Table not reproduced: numbers are successful runs; annotations mark the winner of IPPC-04 (FFRs), entries using human control knowledge, entries using learned knowledge, and the second-place winners.]

  25. IPPC-2006 Results [Table not reproduced: numbers are the percentage of successful runs; FFRa is marked as the unofficial winner of IPPC-06.]

  26. Sampling Problem: Time Dependency Issue [Figure: a small MDP with states Start, S1, S2, S3, a Goal, and a Dead End; from Start, actions A and B lead to different branches, and actions C and D reach the Goal or the Dead End with complementary probabilities p and 1-p that differ between the branches.]

  27. Sampling Problem: Time Dependency Issue [Figure: the same MDP as slide 26.] S3 is a worse state than S1, but under correlated sampling it looks as if there is always a path to the Goal; we need to sample independently across actions.

  28. Action Selection Problem: Random Tie Breaking Is Essential [Figure: a small MDP where action A always stays in Start, while actions B and C move toward S1 and the Goal with probabilities p and 1-p.] In the Start state, action C is definitely better, but A can be used to wait until C’s goal-reaching effect is realized in the sampled futures; this is why random tie breaking is essential (a minimal sketch follows).
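A minimal sketch of the tie-breaking issue, using a hypothetical scores dictionary standing in for the hindsight tally from the decision procedure above: deterministic tie breaking can keep choosing the "wait" action A forever, while random tie breaking eventually selects C.

```python
import random

def pick_first_max(scores):
    """Deterministic tie breaking: always the first maximizer (can pick the
    'wait' action A forever if it ties with C)."""
    best = max(scores.values())
    return next(a for a in scores if scores[a] == best)

def pick_random_max(scores):
    """Random tie breaking: each maximizer is eventually chosen, so the agent
    cannot get stuck waiting indefinitely."""
    best = max(scores.values())
    return random.choice([a for a, v in scores.items() if v == best])

# Hypothetical tie between the 'wait' action A and the better action C:
scores = {"A": 3, "C": 3}
print(pick_first_max(scores))   # always 'A' (dict insertion order)
print(pick_random_max(scores))  # 'A' or 'C' with equal probability
```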

  29. Sampling Problem: Importance Sampling (IS) [Figure: from Start, action B leads to S1 with very high probability and to the Goal with extremely low probability.] With unbiased sampling the rare goal-reaching outcome is almost never drawn, so the problem looks unsolvable. Use importance sampling instead. Identifying the region that needs importance sampling is left for further study. In the benchmarks, ZenoTravel needs the IS idea.
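A hedged sketch of the importance-sampling fix, assuming outcomes are given as (probability, successor) pairs. The proposal here simply spreads mass evenly over the outcomes so the rare goal-reaching effect gets sampled; the returned weight (true probability over proposal probability) is what keeps the aggregated hindsight estimate unbiased. Which parts of the problem to bias this way is, as the slide says, an open question.

```python
import random

def sample_outcome_importance(outcomes):
    """outcomes: list of (true_probability, successor_state).

    Draw a successor from a uniform proposal over the outcomes and return it
    together with its importance weight true_prob / proposal_prob. Sampled
    futures are then weighted by the product of these weights when hindsight
    scores are aggregated.
    """
    proposal_prob = 1.0 / len(outcomes)
    true_prob, successor = random.choice(outcomes)   # uniform proposal
    return successor, true_prob / proposal_prob
```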

  30. Theoretical Results • Theorem 1: For goal-achieving probabilistic planning problems, if there is a policy that solves the problem with probability 1 within a bounded horizon, then hindsight planning solves the problem with probability 1. If there is no such policy, hindsight planning returns a success ratio below 1, because a future in which no plan achieves the goal can be sampled. • Theorem 2: The number of future samples w needed to correctly identify the best action satisfies w > 4Δ⁻²T ln(|A|H / δ), where Δ is the minimum Q-advantage of the best action over the other actions and δ is the confidence parameter; the bound follows from the Chernoff bound.

  31. Probabilistic Planning: Expectimax Solution [Figure: the outcome tree drawn as an expectimax tree, alternating Max nodes over actions and Expectation nodes over probabilistic outcomes down to the goal states.]

  32-35. Hindsight Samples 1-4 [Figures: repeats of slides 12-15, with the same running tallies (A1: 1 / A2: 0, then A1: 2 / A2: 1, A1: 2 / A2: 1, and A1: 3 / A2: 1).]
