
Hierarchical Lookahead Agents: A Preliminary Report



Presentation Transcript


  1. Hierarchical Lookahead Agents: A Preliminary Report — Bhaskara Marthi, Stuart Russell, Jason Wolfe

  2. High-level actions
  • For our purposes, a high-level action (HLA):
    • has a set of possible immediate refinements into sequences of actions, primitive or high-level
    • each refinement may have a precondition on its use
  • Many of the actions we perform are high-level:
    • Go to work
    • Write a NIPS paper
  • This definition generalizes others we’ve seen:
    • Options don’t provide a choice of refinements
    • MAXQ tasks can’t represent sequencing constraints
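As a minimal sketch of this definition, an HLA can be encoded as a name plus a set of (precondition, refinement) pairs. The class names, the dict-based states, and the `go_to_work` example below are illustrative assumptions, not from the paper:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple, Union

# A plan step is either a primitive action name or another HLA.
Step = Union[str, "HLA"]

@dataclass
class HLA:
    """A high-level action: a set of possible immediate refinements into
    sequences of steps, each guarded by a precondition on its use."""
    name: str
    # Each refinement: (precondition on the state, sequence of steps).
    refinements: List[Tuple[Callable[[dict], bool], List[Step]]] = field(default_factory=list)

    def applicable_refinements(self, state: dict) -> List[List[Step]]:
        """Return the refinements whose precondition holds in `state`."""
        return [seq for pre, seq in self.refinements if pre(state)]

# Hypothetical example: "go to work" refines into two alternative sequences,
# one of which is only usable when the agent has a car.
go_to_work = HLA("GoToWork", [
    (lambda s: s.get("has_car", False), ["drive"]),
    (lambda s: True, ["walk_to_stop", "ride_bus"]),
])
```

Unlike an option, the HLA offers the agent a *choice* among refinements, and each refinement is an explicit sequence, so sequencing constraints are representable.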

  3. Abstract Lookahead
  [Figure: chess position with moves Rxa1, c2, Kxa1, Qa4, Kc3, and a two-level lookahead tree over refinements ‘a’, ‘b’ → aa, ab, ba, bb]
  • k-step lookahead >> 1-step lookahead
    • e.g., chess
  • k-step lookahead is no use if the steps are too small
    • e.g., the first k characters of a NIPS paper
    • this is one small part of a human life ≈ 20,000,000,000,000 primitive actions
  • Abstract plans with HLAs are shorter
    • Much shorter plans for given goals => exponential savings
    • Can look ahead much further into the future
  • Requires a model (i.e., transition and reward functions) for HLAs
    • No suitable models to be found in the HRL or HTN literature
    • We proposed angelic semantics [MRW 07], which we extend here

  4. Angelic semantics for HLAs
  [Figure: a state space with reachable sets of actions a1–a4 from s0; the sequence h1h2 reaches the goal, so h1h2 is a solution]
  • Models HLAs in deterministic domains without rewards
  • Central idea is the reachable set of an HLA from some state
  • When extended to sequences of actions, allows proving that a plan can or cannot possibly reach the goal
  • May seem related to nondeterminism
    • But the nondeterminism is angelic: the “uncertainty” will be resolved by the agent, not an adversary or nature

  5. Contributions
  • Extended angelic semantics to account for rewards
  • Developed novel algorithms that do lookahead:
    • Hierarchical Forward Search
      • An optimal, offline algorithm
      • Can prove optimality of plans entirely at the high level
    • Hierarchical Real-Time Search
      • An online algorithm
      • Requires only bounded computation per time step
      • Significantly outperforms previous “flat” lookahead algorithms
  • Both algorithms require three inputs:
    • a deterministic MDP model
    • an action hierarchy (HLAs)
    • models for HLAs (based on angelic semantics)

  6. Deterministic MDPs
  [Figure: transition & reward functions for action a1 on a small state space]
  • A deterministic MDP consists of:
    • State space S
    • Initial state s0 ∈ S and, w.l.o.g., a single terminal state t ∈ S
    • Primitive action set A
    • Transition function f : S × A → S
    • Reward function r : S × A → ℝ
  • Undiscounted; assume no non-negative-reward cycles
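This definition translates directly into code. The sketch below is a hypothetical encoding (the class name, the `run` helper, and the toy corridor domain are mine, not the paper's):

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = str

@dataclass
class DeterministicMDP:
    """Deterministic, undiscounted MDP as on the slide: state space S,
    initial state s0, single terminal state t, primitive actions A,
    transition f : S x A -> S, and reward r : S x A -> R."""
    states: Set[State]
    s0: State
    t: State
    actions: Set[Action]
    f: Callable[[State, Action], State]
    r: Callable[[State, Action], float]

    def run(self, plan):
        """Execute a primitive plan from s0; return (final state, total reward)."""
        s, total = self.s0, 0.0
        for a in plan:
            total += self.r(s, a)
            s = self.f(s, a)
        return s, total

# Toy 1-D corridor: states 0..3, "Right" moves toward terminal 3, reward -1 per step.
mdp = DeterministicMDP(
    states={0, 1, 2, 3}, s0=0, t=3, actions={"Right"},
    f=lambda s, a: min(s + 1, 3), r=lambda s, a: -1.0,
)
```

With all rewards negative there are no non-negative-reward cycles, as the slide assumes.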

  7. Example – Warehouse World
  • Has similarities to the blocks and taxi domains, but more choices and constraints:
    • Gripper must stay in bounds
    • Can’t pass through blocks
    • Can only turn around at the top row
  • All actions have reward -1
  • Goal: have C on T4
    • Can’t just move it directly
    • Final plan has 22 steps: Left, Down, Pickup, Up, Turn, Down, Putdown, Right, Right, Down, Pickup, Left, Putdown, Up, Left, Pickup, Up, Turn, Right, Down, Down, Putdown

  8. Some HLAs for the warehouse world
  • Each HLA has a set of immediate refinements into sequences of actions
  • All plans of interest are refinements of the special HLA Act
  • HLAs define a context-free grammar over primitive plans (Act is the start symbol):
    [Act] → … → [Move(B,C) Act] → … → [Get(B) PutOn(C) …] → … → [GetL(B) …] or [GetR(B) …] → …

  9. Abstract Lookahead Trees
  [Figure: an ALT rooted at s0 with edges labeled Act, Move(C,B), Move(B,C), Get(C), Move(C,A), Move(A,C), PutOn(A), each annotated with reward bounds]
  • ALTs are the primary data structures
  • Represent a set of potential plans
  • Edges are (possibly abstract) actions
  • Basic operation: refine a plan (replace it with all refinements at some HLA)
  • Nodes represent potential situations (pairs of upper & lower valuations)
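The basic refinement operation can be sketched in a few lines. Here plans are tuples of action names, and any name with an entry in a refinement table counts as an HLA; this flat encoding and the toy table `refs` are illustrative assumptions:

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: names with an entry in `Refinements` are HLAs,
# everything else is a primitive action.
Refinements = Dict[str, List[List[str]]]

def refine_first_hla(plan: Tuple[str, ...], refinements: Refinements) -> List[Tuple[str, ...]]:
    """The ALT's basic operation: replace the plan's first HLA with each
    of its immediate refinements, producing one child plan per refinement."""
    for i, a in enumerate(plan):
        if a in refinements:  # found the first high-level action
            return [plan[:i] + tuple(seq) + plan[i + 1:] for seq in refinements[a]]
    return []  # plan is fully primitive; nothing to refine

# Toy table: Act terminates or does a Move and recurses; Move refines once.
refs = {"Act": [[], ["Move", "Act"]], "Move": [["Get", "PutOn"]]}
```

Repeatedly applying this to the frontier plans is what grows the ALT; which HLA to refine (here, always the first) is a search-control choice.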

  10. Modeling HLAs
  [Figure: valuations of an HLA h from s0, tabulated as a reward per reachable state]
  • An HLA is fully characterized by the primitive MDP + hierarchy
    • But without abstraction, we lose the benefits of hierarchy
  • Extension of an idea from “Angelic Semantics for HLAs” [MRW ’07]:
    • Valuation of HLA h from state s: for each s′, the best reward of any primitive refinement of h that takes s to s′
    • Exact description of h = valuation of h from each s
    • But this description has no compact, efficient representation in general
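The exact valuation can be computed by brute force when the set of primitive refinements is small enough to enumerate; the function below is an illustrative sketch (its name and signature are mine), and the enumeration itself is exactly why this has no compact representation in general:

```python
from typing import Callable, Dict, Hashable, Iterable, List

State = Hashable
Action = str

def exact_valuation(
    s: State,
    primitive_refinements: Iterable[List[Action]],
    f: Callable[[State, Action], State],
    r: Callable[[State, Action], float],
) -> Dict[State, float]:
    """Valuation of an HLA from state s: map each reachable s' to the best
    total reward of any primitive refinement that takes s to s'."""
    best: Dict[State, float] = {}
    for plan in primitive_refinements:
        s2, total = s, 0.0
        for a in plan:          # simulate the primitive plan
            total += r(s2, a)
            s2 = f(s2, a)
        if s2 not in best or total > best[s2]:
            best[s2] = total    # keep the best reward per final state
    return best
```

Upper and lower valuations (next slide) are precisely sound over- and under-approximations of this map.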

  11. Upper and lower valuations
  [Figure: exact, upper, and lower valuations over the state space, as rewards per state]
  • Instead, use approximate valuations
  • Upper valuations never underestimate the best achievable reward
  • Lower valuations never overestimate the best achievable reward
  • We choose a simple form: reachable set + reward bound on the set

  12. Representing descriptions: NCSTRIPS
  • Assume S = set of truth assignments to propositions
  • Descriptions specify propositions (possibly) added/deleted by the HLA
  • Also include a bound on rewards
  • An efficient algorithm exists to progress valuations (represented as DNF formulae + a numeric bound) through descriptions
  • Example: Navigate(xt,yt) (Pre: At(xs,ys))
    • Upper: -At(xs,ys), +At(xt,yt), -FacingRight, +FacingRight, +R(-|xs-xt| - |ys-yt|)
    • Lower: IF (Free(xt,yt) ∧ Free(x,ymax)): -At(xs,ys), +At(xt,yt), -FacingRight, +FacingRight, +R(-|xs-xt| - ys - yt + 1) ELSE: nil
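To see how progression works, the toy sketch below pushes one DNF clause (a conjunction of true propositions) through a simplified NCSTRIPS-style description. It handles only definite adds/deletes and the "possibly add" pattern (the paired -p, +p on this slide); the function and its encoding are illustrative, not the paper's algorithm:

```python
from typing import FrozenSet, Set

# One DNF clause: the set of propositions known true (all others false).
Clause = FrozenSet[str]

def progress_clause(
    clause: Clause,
    adds: Set[str],
    deletes: Set[str],
    possibly_adds: Set[str],
) -> Set[Clause]:
    """Progress a single DNF clause through a simplified description.
    Definite adds/deletes rewrite the clause in place; each "possibly
    added" proposition splits the clause into two successors, one per
    outcome, so the result stays a (larger) DNF formula."""
    base = (clause - deletes) | adds
    results = {frozenset(base)}
    for p in possibly_adds:
        results = {c | {p} for c in results} | {c - {p} for c in results}
    return results
```

Progressing a whole valuation then means progressing every clause and combining the results, carrying the numeric reward bound alongside.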

  13. Back to ALT nodes
  [Figure: valuations at ALT nodes after Move(A,C) and Act, with reward bounds 0, -5, -3, -9]
  • Initially, we know we are at s0 with reward 0
  • Each ALT node is created from its parent & the descriptions of the action from the parent
  • Valuations give provable bounds on reachable states & rewards
  • The lower valuation after Act is usually vacuous; the upper valuation gives an admissible heuristic to t

  14. Pruning
  [Figure: one plan’s lower valuation (reward bound -5) dominates another plan’s upper valuation (reward bound -7)]
  • Prune whenever a node’s lower valuation dominates another node’s upper valuation
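With the simple "reachable set + reward bound" form of valuation from slide 11, the domination test is a sketch like the following (the class and function names are mine; this assumes a single reward bound shared by the whole reachable set):

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class Valuation:
    """Reachable set of states plus a single reward bound over that set."""
    reachable: FrozenSet[str]
    reward_bound: float

def prunes(lower_of_a: Valuation, upper_of_b: Valuation) -> bool:
    """Plan B can be discarded when plan A's *lower* valuation dominates
    B's *upper* valuation: every state B might reach, A provably reaches
    with at least as much reward. Sound because lower valuations never
    overestimate and upper valuations never underestimate."""
    return (upper_of_b.reachable <= lower_of_a.reachable
            and lower_of_a.reward_bound >= upper_of_b.reward_bound)
```

Using the slide's numbers: a plan guaranteed at least -5 prunes one that can achieve at most -7.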

  15. Hierarchical Forward Search
  • Construct the initial ALT
  • Loop:
    • Select the plan with the highest upper reward
    • If this plan is primitive, return it
    • Otherwise, refine one of its HLAs
  • Related to A*
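The loop above can be sketched with a priority queue ordered by upper reward, which makes the relation to A* explicit. Everything here is a simplified stand-in: plans are flat tuples, the `REFINEMENTS` table is a toy, and `upper_reward` abstracts the upper-valuation machinery into a single admissible bound supplied by the caller:

```python
import heapq
from typing import Dict, List, Tuple

# Toy refinement table; names with entries are HLAs, the rest are primitive.
REFINEMENTS: Dict[str, List[List[str]]] = {
    "Act": [["Move"], ["Move", "Act"]],
    "Move": [["Pickup", "Putdown"]],
}

def hierarchical_forward_search(upper_reward) -> Tuple[str, ...]:
    """Repeatedly take the plan with the highest upper reward; if it is
    fully primitive, return it (its bound proves no other plan can beat
    it); otherwise refine its first HLA and push the children."""
    frontier = [(-upper_reward(("Act",)), ("Act",))]  # max-heap via negation
    while frontier:
        _, plan = heapq.heappop(frontier)
        i = next((j for j, a in enumerate(plan) if a in REFINEMENTS), None)
        if i is None:
            return plan  # primitive plan with provably maximal upper reward
        for seq in REFINEMENTS[plan[i]]:
            child = plan[:i] + tuple(seq) + plan[i + 1:]
            heapq.heappush(frontier, (-upper_reward(child), child))
    return ()
```

As with A*, correctness rests on the bound being admissible: `upper_reward` must never underestimate the best reward achievable by any primitive refinement of the plan.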

  16.–23. Intuitive Picture
  [Figure sequence: starting from the single highest-level plan Act, the tree of plans from s0 to t is progressively refined from high-level actions down to primitive actions]

  24. Analysis
  • Optimality follows from the guarantees offered by upper descriptions
  • More accurate upper descriptions → more directed search
  • More accurate lower descriptions → more pruning

  25. An online version?
  • Hierarchical Forward Search plans all the way to t before starting to act
    • Can’t act in real time
    • Not really useful for RL
    • In the stochastic case, must plan for all contingencies
  • How can we choose a good first action with bounded computation?

  26. Hierarchical Real-Time Search
  • Loop until at t:
    • Run Hierarchical Forward Search for (at most) k steps
    • Choose a plan that:
      • begins with a primitive action
      • has total upper reward ≥ that of the best unrefined plan
      • has minimal upper reward up to but not including Act
    • Do the first action of this plan in the world
    • Do a Bellman backup for the current state (prevents looping)
  • Related to LRTA*
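To show just the act/backup structure of the loop and its relation to LRTA*, the sketch below collapses the bounded hierarchical planning step into a one-step greedy choice over an optimistic value table. It is a heavy simplification for illustration only, not the algorithm above:

```python
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = str

def real_time_loop(
    s0: State, t: State,
    actions: List[Action],
    f: Callable[[State, Action], State],
    r: Callable[[State, Action], float],
    H: Dict[State, float],          # optimistic value estimates, default 0
    max_steps: int = 1000,
) -> Tuple[List[Action], State]:
    """LRTA*-style skeleton: at each state pick the action maximizing
    r(s,a) + H[f(s,a)], perform it, and Bellman-back-up H[s]. The backup
    lowers the estimate of revisited states, which prevents looping."""
    s, trace = s0, []
    for _ in range(max_steps):
        if s == t:
            break
        best_a = max(actions, key=lambda a: r(s, a) + H.get(f(s, a), 0.0))
        H[s] = r(s, best_a) + H.get(f(s, best_a), 0.0)  # Bellman backup
        trace.append(best_a)
        s = f(s, best_a)
    return trace, s
```

In the full algorithm, the greedy choice is replaced by k steps of Hierarchical Forward Search plus the plan-selection criteria on this slide, so each decision uses bounded computation but far deeper (abstract) lookahead.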

  27. Preliminary Results
  [Figure: results plots for Instance 1 (small) and Instance 2 (larger)]

  28. Discussion
  • We see a definite benefit to high-level lookahead
  • Some subtleties still to be worked out:
    • Each HLA description has different biases, making direct comparisons between plans difficult
    • The current scheme is overly optimistic
  • Other directions for future work:
    • Extensions to probabilistic / partially observable environments
    • Better meta-level control
    • Inferring/learning HLA models
    • Integration into RL algorithms
    • Abstract-level DP
    • …
  Questions?
