
Automatic Induction of MAXQ Hierarchies


Presentation Transcript


  1. Automatic Induction of MAXQ Hierarchies
  Neville Mehta, Michael Wynkoop, Soumya Ray, Prasad Tadepalli, Tom Dietterich
  School of EECS, Oregon State University
  Funded by the DARPA Transfer Learning Program

  2. Hierarchical Reinforcement Learning
  • Exploits domain structure to facilitate learning
    • Policy constraints
    • State abstraction
  • Paradigms: Options, HAMs, MAXQ
  • MAXQ task hierarchy
    • Directed acyclic graph of subtasks
    • Leaves are the primitive MDP actions
  • Traditionally, the task structure is provided as prior knowledge to the learning agent

  3. Model Representation
  • Dynamic Bayesian Networks (DBNs) for the transition and reward models
  • Symbolic representation of the conditional probabilities/reward values as decision trees
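  As a purely illustrative picture of this representation, the Python sketch below stores one decision tree per predicted variable for a single action; the Node encoding, the Deposit action, and the variable names agent.resource / req.gold are assumptions made for the example, not the authors' actual models.

```python
# Illustrative sketch: a factored (DBN-style) action model in which each
# next-state variable has its own decision tree over the current state.
# Leaves hold either a distribution over outcomes (transition model) or a
# scalar (reward model).
from dataclasses import dataclass
from typing import Dict, Optional, Union

@dataclass
class Node:
    # Internal node: test one current-state variable and branch on its value.
    var: Optional[str] = None
    children: Optional[Dict[object, "Node"]] = None
    # Leaf: distribution over outcomes for the predicted variable (or a reward value).
    leaf: Optional[Union[Dict[object, float], float]] = None

def evaluate(tree: Node, state: Dict[str, object]):
    """Walk the decision tree with the current state and return its leaf."""
    while tree.leaf is None:
        tree = tree.children[state[tree.var]]
    return tree.leaf

# Hypothetical model for a "Deposit" action: agent.resource determines whether
# the gold quota req.gold increases; other variables would get their own trees.
deposit_model = {
    "req.gold": Node(var="agent.resource", children={
        "gold": Node(leaf={"+1": 1.0}),   # carrying gold -> quota increases
        "none": Node(leaf={"+0": 1.0}),
    }),
}

print(evaluate(deposit_model["req.gold"], {"agent.resource": "gold"}))
```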

  4. Goal: Learn Task Hierarchies
  • Avoid the significant manual engineering of task decomposition
    • Decomposition requires a deep understanding of the purpose and function of subroutines, as in computer science
  • Frameworks for learning exit-option hierarchies:
    • HEXQ: determine exit states through random exploration
    • VISA: determine exit states by analyzing DBN action models

  5. Focused Creation of Subtasks
  • HEXQ & VISA: create a separate subtask for each possible exit state
    • This can generate a large number of subtasks
  • Claim: defining good subtasks requires maximizing state abstraction while identifying “useful” subgoals
  • Our approach: selectively define subtasks with single abstract exit states

  6. Transfer Learning Scenario
  • Working hypothesis:
    • MAXQ value-function learning is much quicker than non-hierarchical (flat) Q-learning
    • Hierarchical structure is more amenable to transfer from source tasks to the target than value functions
  • Transfer scenario:
    • Solve a “source problem” (no CPU time limit)
      • Learn DBN models
      • Learn the MAXQ hierarchy
    • Solve a “target problem” under the assumption that the same hierarchical structure applies
      • Will relax this constraint in future work

  7. MaxNode State Abstraction
  [DBN figure for one action: Xt, Yt, At → Xt+1, Yt+1, Rt+1]
  • Y is irrelevant within this action
    • It affects the dynamics but not the reward function
  • In HEXQ, VISA, and our work, we assume there is only one terminal abstract state, hence no pseudo-reward is needed
  • As a side effect, this enables “funnel” abstractions in parent tasks

  8. Our Approach: AI-MAXQ
  • Learn DBN action models via random exploration (other work)
  • Apply Q-learning to solve the source problem
  • Generate a good trajectory from the learned Q function
  • Analyze the trajectory to produce a CAT (this talk)
  • Analyze the CAT to define the MAXQ hierarchy (this talk)

  9. Wargus Resource-Gathering Domain

  10. Causally Annotated Trajectory (CAT)
  [CAT figure: Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End, with arcs labeled by variables such as req.gold, req.wood, reg.*, a.l, a.r]
  • A variable v is relevant to an action if the DBN for that action tests or changes that variable (this includes both the variable nodes and the reward nodes)
  • Create an arc from A to B labeled with variable v iff v is relevant to A and B but not to any intermediate actions
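  A minimal runnable sketch of the arc-creation rule just stated. The action names and relevant-variable sets below are illustrative assumptions; in the actual approach the relevance sets come from the learned DBNs.

```python
# Build CAT arcs from an ordered demonstration: draw an arc A -> B labeled v
# whenever v is relevant to A and B but to no action in between.
def build_cat(actions, relevant):
    """actions: list of action names in trajectory order.
    relevant: dict mapping action name -> set of relevant variables.
    Returns a list of (i, j, v) arcs between trajectory positions."""
    arcs = []
    for j, b in enumerate(actions):
        for v in relevant[b]:
            # scan backwards for the closest earlier action relevant to v
            for i in range(j - 1, -1, -1):
                if v in relevant[actions[i]]:
                    arcs.append((i, j, v))
                    break
    return arcs

# Toy example with hypothetical relevance sets (Start/End are dummy actions
# relevant to the initial-state and goal variables):
actions = ["Start", "Goto", "MineGold", "Goto", "Deposit", "End"]
relevant = {
    "Start":    {"agent.loc", "agent.resource", "req.gold"},
    "Goto":     {"agent.loc"},
    "MineGold": {"agent.loc", "agent.resource"},
    "Deposit":  {"agent.loc", "agent.resource", "req.gold"},
    "End":      {"req.gold"},
}
for i, j, v in build_cat(actions, relevant):
    print(f"{actions[i]} --[{v}]--> {actions[j]}")
```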

  11. CAT Scan
  [CAT figure: Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End]
  • An action is absorbed regressively as long as:
    • It does not have an effect beyond the trajectory segment (preventing exogenous effects)
    • It does not increase the state abstraction
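  To make the first absorption condition concrete, here is a small sketch assuming arcs in the (i, j, v) form produced by a build_cat-style analysis like the previous sketch. The second condition (not increasing the state abstraction) is only noted in a comment, since it depends on how the segment's abstraction is computed.

```python
# Sketch of the first absorption test behind the CAT scan: an action at
# trajectory position k may be absorbed into the segment [lo, hi] only if
# none of its outgoing arcs leaves the segment (no effects beyond it).
# The second condition on the slide would additionally compare the segment's
# relevant-variable set before and after absorbing the action.
def can_absorb(k, lo, hi, arcs):
    """arcs: list of (i, j, v) arcs between trajectory positions."""
    for i, j, v in arcs:
        if i == k and not (lo <= j <= hi):
            return False   # the action's effect is consumed outside the segment
    return True

# Usage: can_absorb(2, 1, 4, build_cat(actions, relevant))
```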

  12. CAT Scan
  [CAT figure: Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End]

  13. CAT Scan
  [CAT figure: the same trajectory, now grouped under a Root task]

  14. CAT Scan
  [CAT figure: the Root task split into Harvest Gold (Goto, MG, Goto, Dep) and Harvest Wood (Goto, CW, Goto, Dep) segments]

  15. Induced Wargus Hierarchy
  [Task-hierarchy figure:]
  Root
    Harvest Gold
      Get Gold → GGoto(goldmine), Mine Gold
      Put Gold → GGoto(townhall), GDeposit
    Harvest Wood
      Get Wood → WGoto(forest), Chop Wood
      Put Wood → WGoto(townhall), WDeposit
  (The GGoto/WGoto subtasks invoke the primitive Goto(loc).)

  16. Induced Abstraction & Termination
  Note that because each subtask has a unique terminal state, Result Distribution Irrelevance applies.

  17. Claims
  • The resulting hierarchy is unique: it does not depend on the order in which goals and trajectory sequences are analyzed
  • All state abstractions are safe
    • There exists a hierarchical policy within the induced hierarchy that will reproduce the observed trajectory
    • Extends MaxQ Node Irrelevance to the induced structure
  • The learned hierarchical structure is “locally optimal”
    • No local change in the trajectory segmentation can improve the state abstractions (a very weak guarantee)

  18. Experimental Setup
  • Randomly generate pairs of source–target resource-gathering maps in Wargus
  • Learn the optimal policy in the source
  • Induce the task hierarchy from a single (near-)optimal trajectory
  • Transfer this hierarchical structure to the MAXQ value-function learner for the target
  • Compare to direct Q-learning, and to MAXQ learning on a manually engineered hierarchy, within the target

  19. Hand-Built Wargus Hierarchy
  [Task-hierarchy figure:]
  Root
    Get Gold → Goto(loc), Mine Gold
    Get Wood → Goto(loc), Chop Wood
    GWDeposit → Goto(loc), Deposit

  20. Hand-Built Abstractions & Terminations

  21. Results: Wargus

  22. Need for Demonstrations
  • VISA only uses DBNs for causal information
    • Globally applicable across the state space, without focusing on the pertinent subspace
  • Problems:
    • Global variable coupling might prevent concise abstraction
    • Exit states can grow exponentially: one for each path in the decision-tree encoding
  • The modified bitflip domain exposes these shortcomings

  23. Modified Bitflip Domain
  • State space: b0, …, bn-1
  • Action space:
    • Flip(i), 0 < i < n-1:
      • If b0 ∧ … ∧ bi-1 = 1 then bi ← ~bi
      • Else b0 ← 0, …, bi ← 0
    • Flip(n-1):
      • If parity(b0, …, bn-2) ∧ bn-2 = 1 then bn-1 ← ~bn-1
      • Else b0 ← 0, …, bn-1 ← 0
      • parity(…) = even if n-1 is even, odd otherwise
  • Reward: -1 for all actions
  • Terminal/goal state: b0 ∧ … ∧ bn-1 = 1
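  The dynamics above translate directly into code; below is a runnable sketch of the domain. Treating Flip(0) as an unconditional toggle of b0 (a conjunction over an empty prefix) is an assumption, since the slide only spells out Flip(i) for 0 < i < n-1 and Flip(n-1).

```python
# Runnable sketch of the modified bitflip domain: n bits, deterministic
# dynamics, reward -1 per action, goal = all bits set to 1.
def parity_ok(bits):
    # "even if n-1 is even, odd otherwise": required parity of b0..b(n-2)
    n = len(bits)
    want_even = (n - 1) % 2 == 0
    return (sum(bits[:-1]) % 2 == 0) == want_even

def flip(bits, i):
    """Apply Flip(i) to a list of bits, returning (next_bits, reward)."""
    b = list(bits)
    n = len(b)
    if i == n - 1:
        if parity_ok(b) and b[n - 2] == 1:
            b[n - 1] ^= 1
        else:
            b[:] = [0] * n                   # reset every bit
    else:
        if all(b[:i]):                       # b0 ∧ ... ∧ b(i-1) = 1 (vacuously true for i = 0)
            b[i] ^= 1
        else:
            b[: i + 1] = [0] * (i + 1)       # reset b0..bi
    return b, -1

def is_goal(bits):
    return all(bits)

# Example from the next slide (n = 7): 1110000 -Flip(3)-> 1111000
# -Flip(1)-> 1011000 -Flip(4)-> 0000000
s = [1, 1, 1, 0, 0, 0, 0]
for a in (3, 1, 4):
    s, _ = flip(s, a)
    print(a, s, "goal" if is_goal(s) else "")
```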

  24. Modified Bitflip Domain
  [Example trajectory with n = 7: 1110000 → Flip(3) → 1111000 → Flip(1) → 1011000 → Flip(4) → 0000000]

  25. VISA’s Causal Graph
  [Causal-graph figure: variable nodes b0, b1, b2, …, bn-2, bn-1 and the reward node R, with edges labeled by the Flip actions]
  • Variables are grouped into two strongly connected components (dashed ellipses)
  • Both components affect the reward node
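  For intuition, here is a hedged sketch of how such a causal graph could be read off decision-tree DBN models: add an edge u → v whenever some action's tree for v (or for the reward) tests u. The tiny models dictionary is a stand-in, not the real bitflip DBNs.

```python
# Derive a VISA-style causal graph from per-action, per-target decision trees,
# represented here only by the set of variables each tree tests.
from collections import defaultdict

def causal_graph(models):
    """models[action][target] = set of variables tested by that target's tree."""
    edges = defaultdict(set)
    for targets in models.values():
        for target, tested in targets.items():
            for u in tested:
                edges[u].add(target)
    return edges

# Toy stand-in for a couple of bitflip actions (R is the reward node):
models = {
    "Flip(2)":   {"b2": {"b0", "b1"}, "b0": {"b0", "b1"}, "b1": {"b0", "b1"}},
    "Flip(n-1)": {"bn-1": {"b0", "b1", "bn-2"}, "R": {"b0", "b1", "bn-2", "bn-1"}},
}
for u, targets in sorted(causal_graph(models).items()):
    print(u, "->", sorted(targets))
```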

  26. VISA Task Hierarchy
  [Task-hierarchy figure: Root over 2^(n-3) exit options for Flip(n-1), each with exit condition parity(b0, …, bn-2) ∧ bn-2 = 1, bottoming out in Flip(0), Flip(1), …]

  27. Bitflip CAT
  [CAT figure: Start → Flip(0) → Flip(1) → … → Flip(n-2) → Flip(n-1) → End, with arcs labeled by the individual bits b0, b1, …, bn-2, bn-1 and by the prefixes b0, …, bn-2 and b0, …, bn-1]

  28. Induced MAXQ Task Hierarchy
  [Task-hierarchy figure:]
  Root
    Flip(n-1)
    subtask with goal b0 ∧ … ∧ bn-2 = 1
      Flip(n-2)
      subtask with goal b0 ∧ … ∧ bn-3 = 1
        Flip(n-3)
        … (continuing down to a subtask with goal b0 ∧ b1 = 1, whose children are Flip(0) and Flip(1))
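  As a data-structure view of this chain-shaped hierarchy, the sketch below builds the nested subtasks recursively; the Task class and the AllOnes naming are illustrative assumptions, not the induced hierarchy's actual representation.

```python
# Each composite task terminates when its prefix of bits is all ones; its
# children are the composite task for the shorter prefix plus the next
# primitive Flip action, matching the structure on this slide.
from dataclasses import dataclass, field
from typing import Callable, List, Union

@dataclass
class Task:
    name: str
    terminates: Callable[[List[int]], bool]
    children: List[Union["Task", str]] = field(default_factory=list)  # str = primitive action

def make_hierarchy(i):
    """Composite task whose goal is b0 ∧ ... ∧ b(i-1) = 1, for i >= 2."""
    node = Task(name=f"AllOnes(0..{i-1})",
                terminates=lambda s, i=i: all(s[:i]))
    if i == 2:
        node.children = ["Flip(0)", "Flip(1)"]
    else:
        node.children = [make_hierarchy(i - 1), f"Flip({i-1})"]
    return node

def root(n):
    return Task(name="Root",
                terminates=lambda s: all(s),
                children=[make_hierarchy(n - 1), f"Flip({n-1})"])

h = root(7)
print(h.name, "->", [c if isinstance(c, str) else c.name for c in h.children])
```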

  29. Results: Bitflip

  30. Conclusion
  • Causality analysis is the key to our approach
    • It enables us to find concise subtask definitions from a demonstration
    • The CAT scan is easy to perform
  • Need to extend to learning from multiple demonstrations
    • Disjunctive goals
