Multi-Agent Shared Hierarchy Reinforcement Learning
Neville Mehta, Prasad Tadepalli
School of Electrical Engineering and Computer Science, Oregon State University
Highlights
• Sharing value functions
• Coordination
• Framework to express sharing & coordination with hierarchies
• RTS domain
Previous Work
• MAXQ, Options, ALisp
• Coordination in the hierarchical setting (Makar, Mahadevan)
• Sharing flat value functions (Tan)
• Concurrent reinforcement learning for multiple effectors (Marthi, Russell, …)
Outline
• Average Reward Learning
• RTS domain
• Hierarchical ARL
• MASH framework
• Experimental results
• Conclusion & future work
SMDP
• A Semi-Markov Decision Process (SMDP) extends MDPs by allowing temporally extended actions
  • States S
  • Actions A
  • Transition function P(s', N | s, a)
  • Reward function R(s' | s, a)
  • Time function T(s' | s, a)
• Given an SMDP, the gain of an agent in state s following policy \pi is
  \rho^{\pi}(s) = \lim_{N \to \infty} \frac{E\left[\sum_{i=0}^{N-1} r_i\right]}{E\left[\sum_{i=0}^{N-1} t_i\right]}
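Because the gain is just total reward over total time, it can be estimated directly from a sampled trajectory of (reward, duration) pairs. A minimal Python sketch (not from the slides; the function and variable names are illustrative):

from typing import Iterable, Tuple

def estimate_gain(trajectory: Iterable[Tuple[float, float]]) -> float:
    """Estimate the gain (average reward per unit time) of a policy
    from a sampled SMDP trajectory of (reward, duration) pairs."""
    pairs = list(trajectory)
    total_reward = sum(r for r, _ in pairs)
    total_time = sum(t for _, t in pairs)
    return total_reward / total_time

# Example: three temporally extended actions with rewards 0, 0, 100
# lasting 5, 3, and 2 time units respectively.
print(estimate_gain([(0, 5), (0, 3), (100, 2)]))  # 10.0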
Average Reward Learning
• Taking action a in state s:
  • Immediate reward r(s, a)
  • Action duration t(s, a)
• Average-adjusted reward = r(s, a) - \rho \, t(s, a)
• Bias of a state under policy \pi:
  h^{\pi}(s_0) = E^{\pi}\left[ \bigl(r(s_0, a_0) - \rho^{\pi} t(s_0, a_0)\bigr) + \bigl(r(s_1, a_1) - \rho^{\pi} t(s_1, a_1)\bigr) + \cdots \right]
• The optimal policy \pi^* maximizes the RHS and leads to the optimal gain \rho^* \ge \rho^{\pi}
[Diagram: parent task / child task]
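A generic tabular update in this average-adjusted spirit looks like the sketch below. This is an illustrative R-learning-style rule for SMDPs, not the authors' exact H-learning update; the learning rate, the table layout, and the greedy-step gain estimate are assumptions.

from collections import defaultdict

h = defaultdict(float)                   # action-value (bias) table h(s, a)
alpha = 0.1                              # learning rate (assumed)
rho, total_r, total_t = 0.0, 0.0, 0.0    # gain estimate and its accumulators

def update(s, a, r, t, s_next, actions, greedy):
    """Average-adjusted backup: target = (r - rho * t) + max_a' h(s', a')."""
    global rho, total_r, total_t
    target = (r - rho * t) + max(h[(s_next, a2)] for a2 in actions)
    h[(s, a)] += alpha * (target - h[(s, a)])
    if greedy:                           # estimate the gain from greedy steps only
        total_r += r
        total_t += t
        rho = total_r / total_t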
RTS Domain
• Grid-world domain
• Multiple peasants mine resources (wood, gold) to replenish the home stock
• Avoid collisions with one another
• Attack the enemy's base
RTS Domain Task Hierarchy
[Figure: MAXQ task hierarchy — composite tasks: Root, Harvest(l), Deposit, Offense(e), Goto(k); primitive tasks: Pick, Put, Attack, Idle, North, South, East, West]
• MAXQ task hierarchy
• The original SMDP is split into sub-SMDPs (subtasks)
• Solving the Root task solves the entire SMDP
• Each subtask M_i is defined by <B_i, A_i, G_i>
  • State abstraction B_i
  • Actions A_i
  • Termination (goal) predicate G_i
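Each subtask <B_i, A_i, G_i> can be captured by a small record. The sketch below is hypothetical (class and field names are mine, not the authors' code) and only illustrates how the hierarchy's building blocks fit together.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Subtask:
    name: str
    abstraction: Callable[[dict], tuple]                      # B_i: world state -> abstracted state
    children: List["Subtask"] = field(default_factory=list)   # A_i: child subtasks / primitive actions
    is_goal: Callable[[dict], bool] = lambda s: False          # G_i: termination (goal) predicate

    @property
    def primitive(self):
        return not self.children

# Example fragment of the RTS hierarchy: Goto(k) over a primitive move.
north = Subtask("North", abstraction=lambda s: ())
goto = Subtask("Goto(k)",
               abstraction=lambda s: (s["x"], s["y"], s["k"]),
               children=[north],                               # plus South, East, West
               is_goal=lambda s: (s["x"], s["y"]) == s["k"])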
Hierarchical Average Reward Learning
• Value function decomposition for a recursively gain-optimal policy in Hierarchical H-learning:
  h_i(s) =
  \begin{cases}
  r(s, i) - \rho \, t(s, i) & \text{if } i \text{ is a primitive subtask} \\
  0 & \text{if } s \text{ satisfies the termination (goal) predicate } G_i \\
  \max_{a \in A_i(s)} \left[ h_a(s) + \sum_{s' \in S} P(s' \mid s, a) \, h_i(s') \right] & \text{otherwise}
  \end{cases}
• If the state abstractions are sound, h_a(s) = h_a(B_a(s))
• For the Root task this reduces to the Bellman equation
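The three cases of the decomposition can be expressed as a single backup over the current value tables. This is an illustrative sketch, not the authors' implementation: it reuses the hypothetical Subtask record above, and `model` is an assumed object supplying r(s, a), t(s, a), and the transition distribution.

def backup(task, s, h, rho, model):
    """One backup of the decomposed value h[task.name][s], following the
    three cases of the decomposition (illustrative sketch)."""
    if task.primitive:
        return model.r(s, task) - rho * model.t(s, task)
    if task.is_goal(s):
        return 0.0
    # Best child value plus the expected value of continuing this task.
    return max(
        h[child.name][s]
        + sum(p * h[task.name][s2] for s2, p in model.transitions(s, child))
        for child in task.children
    )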
Hierarchical Average Reward Learning
• No pseudo-rewards
• No completion function
• Scheduling is a learned behavior
Hierarchical Average Reward Learning
• Sharing requires coordination
• Coordination information is part of the state, not the action (Mahadevan)
• No need for each subtask to see the reward
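One way to read "coordination is part of the state" is that an agent's abstracted state is simply extended with the relevant information about the other agents (e.g., their positions for collision avoidance), rather than learning over joint actions. A hypothetical sketch, with names and state fields of my own choosing:

def abstract_state_with_coordination(own_state, other_agents):
    """Extend a navigation subtask's abstracted state with the other
    agents' positions so collisions can be avoided (illustrative only)."""
    own = (own_state["x"], own_state["y"], own_state["k"])
    others = tuple(sorted((a["x"], a["y"]) for a in other_agents))
    return own + others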
Single Hierarchical Agent
[Figure: one agent with the full task hierarchy (Root, Harvest(l), Deposit, Offense(e), Goto(k), Pick, Put, Attack, Idle, North, South, East, West); example task stack: Root → Harvest(W1) → Goto(W1) → North]
Simple Multi-Agent Setup
[Figure: each agent has its own copy of the full task hierarchy; example task stacks — agent 1: Root → Harvest(W1) → Goto(W1) → North; agent 2: Root → Offense(E1) → Attack]
MASH Setup
[Figure: all agents share a single task hierarchy; example task stacks into the shared hierarchy — Root → Harvest(W1) → Goto(W1) → North and Root → Offense(E1) → Attack]
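The essence of the MASH setup is that every agent indexes into one shared set of subtask value tables while keeping only its own task stack. A minimal sketch of that arrangement, assuming nothing beyond what the figure shows (class and attribute names are illustrative):

from collections import defaultdict

class SharedHierarchy:
    """One value table per subtask, shared by every agent."""
    def __init__(self, subtask_names):
        self.h = {name: defaultdict(float) for name in subtask_names}

class Agent:
    """Each agent keeps only its own task stack; values live in the shared hierarchy."""
    def __init__(self, shared):
        self.shared = shared
        self.stack = ["Root"]      # e.g. Root -> Harvest(W1) -> Goto(W1) -> North

    def value(self, subtask, abstract_state):
        return self.shared.h[subtask][abstract_state]

# Two agents reading (and updating) the same tables:
shared = SharedHierarchy(["Root", "Harvest", "Deposit", "Offense", "Goto"])
a1, a2 = Agent(shared), Agent(shared)
shared.h["Goto"][(3, 4)] = 1.5
assert a1.value("Goto", (3, 4)) == a2.value("Goto", (3, 4))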
Experimental Results
• 2 agents in a 15 × 15 grid; Pr(resource regeneration) = 5%; Pr(enemy) = 1%; rewards = (-1, 100, -5, 50); 30 runs
• 4 agents in a 25 × 25 grid; Pr(resource regeneration) = 7.5%; Pr(enemy) = 1%; rewards = (0, 100, -5, 50); 30 runs
• The separate-agents-with-coordination configuration could not be run for 4 agents in the 25 × 25 grid
Conclusion
• Sharing value functions
• Coordination
• Framework to express sharing & coordination with hierarchies
Future Work
• Non-Markovian & non-stationary domains
• Learning the task hierarchy
  • Task–subtask relationships
  • State abstractions
  • Termination conditions
• Combining the MASH framework with factored action models
• Recognizing opportunities for sharing & coordination
Current Work
• Incorporating features from Marthi & Russell's concurrent reinforcement learning approach