
Multi-Agent Shared Hierarchy Reinforcement Learning



Presentation Transcript


  1. Multi-Agent Shared Hierarchy Reinforcement Learning Neville Mehta Prasad Tadepalli School of Electrical Engineering and Computer Science Oregon State University

  2. Highlights • Sharing value functions • Coordination • Framework to express sharing & coordination with hierarchies • RTS domain

  3. Previous Work • MAXQ, Options, ALisp • Coordination in the hierarchical setting (Makar, Mahadevan) • Sharing flat value functions (Tan) • Concurrent reinforcement learning for multiple effectors (Marthi, Russell, …)

  4. Outline • Average Reward Learning • RTS domain • Hierarchical ARL • MASH framework • Experimental results • Conclusion & future work

  5. SMDP • Semi-Markov Decision Process (SMDP) extends MDPs by allowing for temporally extended actions • States S • Actions A • Transition function P(s', N | s, a) • Reward function R(s' | s, a) • Time function T(s' | s, a) • Given an SMDP, an agent in state s following policy π has gain $\rho^{\pi}(s) = \lim_{N \to \infty} \frac{E\left[\sum_{i=0}^{N-1} r_i\right]}{E\left[\sum_{i=0}^{N-1} t_i\right]}$
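The gain ρ^π is the long-run reward earned per unit of time. As a concrete illustration, here is a minimal Python sketch (not from the paper) that estimates the gain of a fixed policy by simulation; the `env.reset()` / `env.step(a)` interface returning (next_state, reward, duration) is an assumed one.

```python
def estimate_gain(env, policy, num_steps=100_000):
    """Estimate rho^pi = E[sum of rewards] / E[sum of durations] along one long run."""
    total_reward, total_time = 0.0, 0.0
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s)                  # fixed policy being evaluated
        s, r, t = env.step(a)          # reward r and duration t of the SMDP action
        total_reward += r
        total_time += t
    return total_reward / max(total_time, 1e-9)
```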

  6. Average Reward Learning • Taking action a in state s • Immediate reward r(s, a) • Action duration t(s, a) • Average-adjusted reward = $r(s, a) - \rho \cdot t(s, a)$ • Bias of state s under policy π: $h^{\pi}(s) = E^{\pi}\left[\sum_{i \geq 0} \big(r(s_i, a_i) - \rho^{\pi}\, t(s_i, a_i)\big) \mid s_0 = s\right]$ • The optimal policy $\pi^{*}$ maximizes the RHS and achieves the optimal gain $\rho^{*} \geq \rho^{\pi}$
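To make the average-adjusted reward concrete, the sketch below shows a tabular average-reward learner for an SMDP in the style of R-learning (Schwartz), not the authors' H-learning: it bootstraps on r − ρ·t and keeps a running estimate of the gain ρ. The `env.step(a)` interface returning (next_state, reward, duration) is an assumption of the sketch.

```python
import random
from collections import defaultdict

def smdp_r_learning(env, actions, steps=10_000, alpha=0.1, beta=0.01, eps=0.1):
    """Tabular average-reward learning on an SMDP using average-adjusted rewards."""
    q = defaultdict(float)            # q[(s, a)]: average-adjusted action values
    rho = 0.0                         # running estimate of the gain
    s = env.reset()
    for _ in range(steps):
        # epsilon-greedy exploration over the current value estimates
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: q[(s, a_)])
        greedy = q[(s, a)] >= max(q[(s, a_)] for a_ in actions)
        s2, r, t = env.step(a)        # reward and duration of the temporally extended action
        best_next = max(q[(s2, a_)] for a_ in actions)
        td_error = (r - rho * t) + best_next - q[(s, a)]
        q[(s, a)] += alpha * td_error
        if greedy:
            rho += beta * td_error / t   # adjust the gain estimate on greedy steps only
        s = s2
    return q, rho
```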

  7. RTS Domain • Grid world domain • Multiple peasants mine resources (wood, gold) to replenish the home stock • Avoid collisions with one another • Attack the enemy’s base

  8. RTS Domain Task Hierarchy [Diagram: Root at the top; composite tasks Harvest(l), Deposit, Offense(e), Goto(k); primitive tasks Pick, Put, Attack, Idle, North, South, East, West] • MAXQ task hierarchy • Original SMDP is split into sub-SMDPs (subtasks) • Solving the Root task solves the entire SMDP • Each subtask Mi is defined by <Bi, Ai, Gi> • State abstraction Bi • Actions Ai • Termination (goal) predicate Gi
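A subtask M_i = <B_i, A_i, G_i> can be captured directly in code. The following is a hedged sketch with illustrative names (Subtask, make_goto, and state.pos are not the paper's code):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Subtask:
    name: str
    abstraction: Callable[[object], tuple]   # B_i: maps a world state to abstract features
    children: List["Subtask"]                # A_i: child subtasks or primitive actions
    is_goal: Callable[[object], bool]        # G_i: termination (goal) predicate
    primitive: bool = False                  # primitive tasks execute a single SMDP action

# Example: a Goto(k) subtask whose abstract state is the agent's offset from
# location k and which terminates once the agent reaches k.
def make_goto(k, move_actions):
    return Subtask(
        name=f"Goto({k})",
        abstraction=lambda state: (state.pos[0] - k[0], state.pos[1] - k[1]),
        children=move_actions,
        is_goal=lambda state: tuple(state.pos) == tuple(k),
    )
```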

  9. Hierarchical Average Reward Learning • Value function decomposition for a recursively gain-optimal policy in Hierarchical H learning: $h_i(s) = r(s, i) - \rho \cdot t(s, i)$ if i is a primitive subtask; $h_i(s) = 0$ if s is a terminal (goal) state of i; otherwise $h_i(s) = \max_{a \in A_i(s)} \left[ h_a(B_a(s)) + \sum_{s' \in S} P(s' \mid s, a)\, h_i(s') \right]$ • If the state abstractions $B_i$ are sound, the recursion for the Root task reduces to the Bellman equation of the original SMDP
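The decomposition above can be evaluated recursively. Below is a rough sketch assuming the hypothetical Subtask structure from the previous sketch, per-subtask value tables `h_tables`, a transition model `model(s, a)` yielding (probability, next_state) pairs, primitive `reward`/`duration` models, and a gain estimate `rho`; none of these names come from the paper.

```python
def h_value(task, s, h_tables, model, rho, reward, duration):
    """Recursive evaluation of h_i(s) following the decomposition on the slide."""
    if task.primitive:
        return reward(s, task) - rho * duration(s, task)   # average-adjusted primitive reward
    if task.is_goal(s):
        return 0.0                                         # the subtask has terminated
    best = float("-inf")
    for child in task.children:
        # value of invoking the child, looked up on the child's abstracted state ...
        val = h_tables[child.name].get(child.abstraction(s), 0.0)
        # ... plus the expected value of continuing this task afterwards
        val += sum(p * h_tables[task.name].get(task.abstraction(s2), 0.0)
                   for p, s2 in model(s, child))
        best = max(best, val)
    return best
```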

  10. Hierarchical Average Reward Learning • No pseudo rewards • No completion function • Scheduling is a learned behavior

  11. Hierarchical Average Reward Learning • Sharing requires coordination • Coordination is encoded in the state, not the action (Mahadevan) • No need for each subtask to see the reward

  12. Single Hierarchical Agent [Diagram: one agent with the full task hierarchy (Root; Harvest(l), Deposit, Offense(e), Goto(k); Pick, Put, Attack, Idle, North, South, East, West) and its current task stack Root → Harvest(W1) → Goto(W1) → North]

  13. Simple Multi-Agent Setup [Diagram: each agent has its own complete copy of the task hierarchy and its own task stack, e.g. Root → Harvest(W1) → Goto(W1) → North for one agent and Root → Offense(E1) → Attack for the other]

  14. MASH Setup [Diagram: a single task hierarchy shared by all agents; each agent keeps only its own task stack over the shared hierarchy, e.g. Root → Harvest(W1) → Goto(W1) → North and Root → Offense(E1) → Attack]
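The essential difference from slide 13 is that all agents read and write one shared set of subtask value tables while keeping their own task stacks. A self-contained toy illustration (all names are invented for this sketch):

```python
from collections import defaultdict

shared_h_tables = defaultdict(dict)      # one value table per subtask, shared by all agents

class MashAgent:
    """Each agent runs its own task stack over the single shared hierarchy."""
    def __init__(self, tables):
        self.stack = ["Root"]            # the agent's private call stack of subtasks
        self.h = tables                  # the shared value tables

    def update(self, subtask, abstract_state, new_value):
        # an update made by any one agent is immediately visible to the others
        self.h[subtask][abstract_state] = new_value

agents = [MashAgent(shared_h_tables) for _ in range(4)]
agents[0].update("Goto(W1)", (2, -1), 3.5)
assert agents[3].h["Goto(W1)"][(2, -1)] == 3.5   # sharing: learned once, used by all
```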

  15. Experimental Results • 2 agents in a 15 × 15 grid, Pr(Resource Regeneration) = 5%, Pr(Enemy) = 1%, Rewards = (-1, 100, -5, 50); 30 runs • 4 agents in a 25 × 25 grid, Pr(Resource Regeneration) = 7.5%, Pr(Enemy) = 1%, Rewards = (0, 100, -5, 50); 30 runs • The separate-agent coordination setup could not be run for 4 agents in the 25 × 25 grid

  16. Experimental Results

  17. Experimental Results (2)

  18. Conclusion • Sharing value functions • Coordination • Framework to express sharing & coordination with hierarchies

  19. Future Work • Non-Markovian & non-stationary • Learning the task hierarchy • Task – subtask relationships • State abstractions • Termination conditions • Combining MASH framework with factored action models • Recognizing opportunities for sharing & coordination

  20. Current Work • Marthi, Russell features
