
Knowledge Representation Meets Stochastic Planning


Presentation Transcript


  1. Knowledge Representation Meets Stochastic Planning. Bob Givan, joint work with Alan Fern and SungWook Yoon. Electrical and Computer Engineering, Purdue University

  2. Overview • We present a form of approximate policy iteration specifically designed for large relational MDPs. • We describe a novel application viewing entire planning domains as MDPs • we automatically induce domain-specific planners • Induced planners are state-of-the-art on: • deterministic planning benchmarks • stochastic variants of planning benchmarks

  3. Ideas from Two Communities • Traditional planning: induction of control knowledge, planning heuristics • Decision-theoretic planning: approximate policy iteration (API), policy rollout • Two views of the new technique: iterative improvement of control knowledge, and API with a policy-space bias

  4. Planning Problems • States: first-order interpretations of a particular language • A planning problem gives: • a current state • a goal state (or goal region) • a list of actions and their semantics (may be stochastic) • [Figure: a blocks-world current state and goal state/region, with available actions Pickup(x) and PutDown(y)]
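
Not from the slides: a minimal sketch of how one such planning problem could be encoded, assuming a flat relational state made of ground facts; the fact and action names (on, clear, Pickup, PutDown) follow the blocks-world example above.

```python
# Hypothetical sketch (not from the talk): one blocks-world planning problem,
# encoded as a relational state plus a goal region and the shared action names.
from dataclasses import dataclass

@dataclass
class PlanningProblem:
    state: set                      # current state: ground facts, e.g. ("on", "B", "A")
    goal: set                       # goal region: facts that must hold, e.g. ("clear", "A")
    objects: list                   # the blocks that action parameters range over
    actions: tuple = ("Pickup", "PutDown")

# A tiny two-block problem: B sits on A, and the goal is to make A clear.
problem = PlanningProblem(
    state={("on", "B", "A"), ("ontable", "A"), ("clear", "B"), ("handempty",)},
    goal={("clear", "A")},
    objects=["A", "B"],
)
```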

  5. Planning Domains • Distributions over problems sharing one set of actions (but with different domains and sizes) • [Figure: several problems drawn from the Blocks World domain, with available actions Pickup(x) and PutDown(y)]

  6. Control Knowledge • Traditional planners solve problems, not domains • little or no generalization between problems in a domain • Planning domains are “solved” by control knowledge • pruning some actions, typically eliminating search, e.g. “don’t pick up a solved block” • [Figure: blocks-world states with pruned pickup actions marked]

  7. Recent Control Knowledge Research • Human-written control knowledge often eliminates search • [Bacchus & Kabanza, 1996] TL-Plan • Helpful control knowledge can be learned from “small problems” • [Khardon, 1996 & 1999] Learning Horn-clause action strategies • [Huang, Selman & Kautz, 2000] Learning action selection & action rejection rules • [Martin & Geffner, 2000] Learning generalized policies in concept languages • [Yoon, Fern & Givan, 2002] Inductive policy selection for stochastic planning domains

  8. Unsolved Problems • Finding control knowledge without immediate access to small problems • Can we learn directly in a large domain? • Improving buggy control knowledge • All previous techniques produce unreliable control knowledge, with occasional fatal flaws • Our approach: view control knowledge as an MDP policy and apply policy improvement (a policy is a choice of action for each MDP state)

  9. Planning Domains as MDPs • View the domain as one big state space, each state a planning problem • This view facilitates generalization between problems • [Figure: blocks-world problems as MDP states, with available actions Pickup(x) and PutDown(y); taking Pickup(Purple) moves between states]

  10. Ideas from Two Communities • Traditional planning: induction of control knowledge, planning heuristics • Decision-theoretic planning: approximate policy iteration (API), policy rollout • Two views of the new technique: iterative improvement of control knowledge, and API with a policy-space bias

  11. Policy Iteration • Given a policy π and a state s, can we improve π(s)? • The value of following π from s is Vπ(s) = Qπ(s, π(s)) = R_π(s) + γ E_{s' ∈ {s1…sk}}[Vπ(s')] • The value of taking an alternative action b first is Qπ(s, b) = R_b + γ E_{s' ∈ {t1…tn}}[Vπ(s')] • If Vπ(s) < Qπ(s, b), then π(s) can be improved by switching to b • Such improvements can be made at all states at once, turning the base policy into an improved policy (policy improvement)
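
A minimal sketch of this improvement step on a small tabular MDP, assuming explicit transition lists P[s][a] of (next_state, probability) pairs; exactly the representation that becomes infeasible for the large relational MDPs targeted here.

```python
# Hypothetical sketch: one exact policy-improvement step on a small tabular MDP.
# P[s][a] is a list of (next_state, probability) pairs, R[s][a] the reward for
# taking a in s, V_pi[s] the value of the current policy, gamma the discount.
def q_value(s, a, P, R, V_pi, gamma):
    return R[s][a] + gamma * sum(p * V_pi[s2] for s2, p in P[s][a])

def improve_policy(states, actions, P, R, V_pi, gamma):
    # At every state, switch to an action maximizing the Q-value; since
    # V_pi(s) = Q(s, pi(s)), the argmax can only match or improve the base policy.
    return {s: max(actions, key=lambda a: q_value(s, a, P, R, V_pi, gamma))
            for s in states}
```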

  12. Flowchart View of Policy Iteration • Start with the current policy π • Compute Vπ at all states • Compute Qπ for each action at all states • Choose the best action at each state, giving the improved policy π' • Problem: too many states

  13. Flowchart View of Policy Rollout • The same loop, but only at the current state s: • For each action a available at s, sample next states s' of taking a in s • Estimate Vπ(s') for each sample by drawing trajectories under π from s' • Combine these into Qπ(s, a) for each action a at s • Choose the best action at s as the improved choice π'(s), with no computation over the full state space
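
A sketch of the rollout computation at a single state, assuming access to a simulator simulate(s, a) that returns (reward, next_state); the horizon and sampling width are illustrative parameters, not values from the talk.

```python
# Hypothetical sketch of policy rollout: estimate Q^pi(s, a) at one state s by
# sampling outcomes of a from a simulator, then following pi for a finite horizon.
def rollout_value(s, pi, simulate, gamma, horizon):
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        r, s = simulate(s, pi(s))       # follow the base policy pi
        total += discount * r
        discount *= gamma
    return total

def rollout_q(s, a, pi, simulate, gamma, horizon=50, width=20):
    samples = []
    for _ in range(width):
        r, s2 = simulate(s, a)          # one sampled outcome of action a
        samples.append(r + gamma * rollout_value(s2, pi, simulate, gamma, horizon))
    return sum(samples) / width

def rollout_choice(s, actions, pi, simulate, gamma):
    # The improved action at s: best sampled Q-estimate over the available actions.
    return max(actions, key=lambda a: rollout_q(s, a, pi, simulate, gamma))
```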

  14. Approximate Policy Iteration • Idea: use machine learning to control the number of samples needed • Compute Qπ(s, ·) by rollout at sampled states s and choose the best action at each, drawing a training set of pairs (s, π'(s)) • Learn a policy from the training set, then repeat • Refinement: use pairs (s, Qπ(s, ·)) to define misclassification costs
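
A sketch of one API iteration with a learned stand-in policy. The talk learns policies in a relational policy language; here a generic scikit-learn classifier over hand-supplied state features is used purely as an illustrative stand-in, and sample_states, rollout_q, and featurize are assumed helpers.

```python
# Hypothetical sketch of one approximate-policy-iteration step: label sampled
# states with the rollout-preferred action, then fit a classifier as the new policy.
from sklearn.tree import DecisionTreeClassifier

def api_iteration(pi, actions, sample_states, rollout_q, featurize, n_states=500):
    X, y = [], []
    for s in sample_states(n_states):
        q = {a: rollout_q(s, a, pi) for a in actions}   # sampled Q^pi(s, .) values
        X.append(featurize(s))
        y.append(max(q, key=q.get))                     # training target pi'(s)
    clf = DecisionTreeClassifier().fit(X, y)
    return lambda s: clf.predict([featurize(s)])[0]     # the learned policy pi'
```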

  15. A Challenge Problem • Consider the following stochastic blocks-world problem: Goal: Clear(A) • Assume: block color affects Pickup() success • The optimal policy is compact, but the value function is not: state value depends on the set of colors above A • [Figure: a tower of colored blocks stacked above block A]

  16. A Policy for the Example Problem • A compact policy for this problem: 1. If holding a block, put it down on the table, else… 2. Pick up a clear block above A • How can we formalize this policy? • [Figure: the two rules illustrated on the blocks-world tower]
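
A sketch of the two informal rules as executable code, over the ground-fact state representation assumed earlier; the fact names (holding, on, clear) are illustrative, and only blocks directly on the goal block are considered to keep the sketch short.

```python
# Hypothetical sketch: the compact two-rule policy for the Clear(A) goal.
def clear_goal_policy(state, goal_block="A"):
    # Rule 1: if holding a block, put it down on the table.
    for fact in state:
        if fact[0] == "holding":
            return ("PutDown", fact[1])
    # Rule 2: otherwise, pick up a clear block above the goal block.
    # (A full version would use the transitive closure of "on"; one level is
    # enough to show the shape of the rule.)
    for fact in state:
        if fact[0] == "on" and fact[2] == goal_block and ("clear", fact[1]) in state:
            return ("Pickup", fact[1])
    return None  # nothing to do: the goal block is already clear
```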

  17. Action Selection Rules [Martin & Geffner, KR 2000] • “Pick up a clear block above block A…” • Action selection rules are based on classes of objects: • apply action a to an object in class C (if possible), abbreviated C:a • How can we describe the object classes? • [Figure: the clear blocks above A highlighted as the relevant object class]

  18. Formal Policy for the Example Problem • [Figure: the two rules of the example policy written as C:a action selection rules] • We find this policy with a heuristic search guided by the training data

  19. Ideas from Two Communities • Traditional planning: induction of control knowledge, planning heuristics • Decision-theoretic planning: approximate policy iteration (API), policy rollout • Two views of the new technique: iterative improvement of control knowledge, and API with a policy-space bias

  20. API with a Policy Language Bias • As before: compute Qπ(s, ·) for each action at sampled states s and choose the best action, giving targets π'(s) • New step: train a new policy π' in the policy language from these targets

  21. Incorporating Value Estimates • What happens if the rollout trajectories under π can’t find reward? • Use a value estimate at the states where trajectories are cut off • For learning control knowledge, we use the FF-plan plangraph heuristic
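
A sketch of this refinement, extending the earlier rollout estimate: when a trajectory is cut off before reaching reward, fall back on a heuristic value estimate. The heuristic(s) callable standing in for the FF-plan plangraph estimate is an assumption, as is the goal test.

```python
# Hypothetical sketch: finite-horizon rollout that bootstraps with a heuristic
# value estimate when the trajectory is cut off before any reward is found.
def rollout_value_with_estimate(s, pi, simulate, heuristic, gamma, horizon,
                                is_goal=lambda s: False):
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        if is_goal(s):
            return total
        r, s = simulate(s, pi(s))
        total += discount * r
        discount *= gamma
    # Horizon exhausted: substitute the heuristic estimate for the unseen tail.
    return total + discount * heuristic(s)
```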

  22. Initial Policy Choice • Policy iteration requires an initial base policy • Options include: • random policy • greedy policy with respect to a planning heuristic • policy learned from small problems

  23. Experimental Domains • SBW(n): (Stochastic) Blocks World • SPW(n): (Stochastic) Painted Blocks World • SLW(t,p,c): (Stochastic) Logistics World

  24. API Results • Starting with flawed policies learned from small problems • [Figure: success rate over API iterations, two plots]

  25. API Results • Starting with a policy that is greedy with respect to a domain-independent heuristic • We used the heuristic of FF-plan (Hoffmann and Nebel, JAIR 2001)

  26. How Good is the Induced Planner?

  27. Conclusions • Using a policy-space bias, we can learn good policies for extremely large structured MDPs • We can automatically learn domain-specific planners that compete favorably with state-of-the-art domain-independent planners

  28. Approximate Policy Iteration • Sample states s, and compute Q-values at each: form a training set of tuples (s, b, Qπ(s, b)), then learn a new policy from this training set • Computing Qπ(s, b): estimate R_b + γ E_{s' ∈ {t1…tn}}[Vπ(s')] by • sampling states t_i from t1…tn • drawing trajectories under π from each t_i to estimate Vπ

  29. Markov Decision Process (MDP) • Ingredients: • System state x in state space X • Control action a in A(x) • Reward R(x,a) • State-transition probability P(x,y,a) • Find a control policy that maximizes the objective function
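
A sketch of the slide’s ingredients as a small container type for a finite MDP; the discount factor and the (next_state, probability) encoding of P are assumptions for illustration.

```python
# Hypothetical sketch: the MDP ingredients from the slide, for a finite MDP.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                            # state space X
    actions: Callable[[str], List[str]]          # admissible actions A(x)
    reward: Callable[[str, str], float]          # reward R(x, a)
    # state-transition probability P(x, y, a), stored as (y, prob) lists per (x, a)
    transitions: Dict[Tuple[str, str], List[Tuple[str, float]]]
    gamma: float = 0.95                          # discount used in the objective
```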

  30. Control Knowledge vs. Policy • Perhaps the biggest difference between the communities: • deterministic planning works with action sequences • decision-theoretic planning works with policies • Policies are needed because uncertainty may carry you to any state • compare: control knowledge also handles every state • Good control knowledge eliminates search • it defines a policy over the possible state/goal pairs
