Multi-Agent Shared Hierarchy Reinforcement Learning
Neville Mehta, Prasad Tadepalli
School of Electrical Engineering and Computer Science, Oregon State University
Highlights
• Sharing value functions
• Coordination
• Framework to express sharing & coordination with hierarchies
• RTS domain
Previous Work
• MAXQ, Options, ALisp
• Coordination in the hierarchical setting (Makar, Mahadevan)
• Sharing flat value functions (Tan)
• Concurrent reinforcement learning for multiple effectors (Marthi, Russell, …)
Outline
• Average Reward Learning
• RTS domain
• Hierarchical ARL
• MASH framework
• Experimental results
• Conclusion & future work
SMDP
• A Semi-Markov Decision Process (SMDP) extends MDPs by allowing temporally extended actions
  • States S
  • Actions A
  • Transition function P(s', N | s, a)
  • Reward function R(s' | s, a)
  • Time function T(s' | s, a)
• Given an SMDP, the gain of an agent in state s following policy \pi is
  \rho^{\pi}(s) = \lim_{N \to \infty} \frac{E\left[\sum_{i=0}^{N-1} r_i\right]}{E\left[\sum_{i=0}^{N-1} t_i\right]}
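Because the gain is just total reward over total time, it can be estimated directly from a sampled trajectory of (reward, duration) pairs. A minimal Python sketch (not from the slides; the function and variable names are illustrative):

from typing import Iterable, Tuple

def estimate_gain(trajectory: Iterable[Tuple[float, float]]) -> float:
    """Estimate the gain (average reward per unit time) of a policy
    from a sampled SMDP trajectory of (reward, duration) pairs."""
    pairs = list(trajectory)
    total_reward = sum(r for r, _ in pairs)
    total_time = sum(t for _, t in pairs)
    return total_reward / total_time

# Example: three temporally extended actions with rewards 0, 0, 100
# lasting 5, 3, and 2 time units respectively.
print(estimate_gain([(0, 5), (0, 3), (100, 2)]))  # 10.0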
Average Reward Learning
• Taking action a in state s:
  • Immediate reward r(s, a)
  • Action duration t(s, a)
• Average-adjusted reward = r(s, a) - \rho \, t(s, a)
• Bias of a state under policy \pi:
  h^{\pi}(s_0) = E^{\pi}\left[ \bigl(r(s_0, a_0) - \rho^{\pi} t(s_0, a_0)\bigr) + \bigl(r(s_1, a_1) - \rho^{\pi} t(s_1, a_1)\bigr) + \cdots \right]
• The optimal policy \pi^* maximizes the RHS and leads to the optimal gain \rho^* \ge \rho^{\pi}
[Diagram: parent task / child task]
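A generic tabular update in this average-adjusted spirit looks like the sketch below. This is an illustrative R-learning-style rule for SMDPs, not the authors' exact H-learning update; the learning rate, the table layout, and the greedy-step gain estimate are assumptions.

from collections import defaultdict

h = defaultdict(float)                   # action-value (bias) table h(s, a)
alpha = 0.1                              # learning rate (assumed)
rho, total_r, total_t = 0.0, 0.0, 0.0    # gain estimate and its accumulators

def update(s, a, r, t, s_next, actions, greedy):
    """Average-adjusted backup: target = (r - rho * t) + max_a' h(s', a')."""
    global rho, total_r, total_t
    target = (r - rho * t) + max(h[(s_next, a2)] for a2 in actions)
    h[(s, a)] += alpha * (target - h[(s, a)])
    if greedy:                           # estimate the gain from greedy steps only
        total_r += r
        total_t += t
        rho = total_r / total_t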
RTS Domain
• Grid-world domain
• Multiple peasants mine resources (wood, gold) to replenish the home stock
• Avoid collisions with one another
• Attack the enemy's base
RTS Domain Task Hierarchy
[Figure: MAXQ task hierarchy — composite tasks: Root, Harvest(l), Deposit, Offense(e), Goto(k); primitive tasks: Pick, Put, Attack, Idle, North, South, East, West]
• MAXQ task hierarchy
• The original SMDP is split into sub-SMDPs (subtasks)
• Solving the Root task solves the entire SMDP
• Each subtask M_i is defined by <B_i, A_i, G_i>
  • State abstraction B_i
  • Actions A_i
  • Termination (goal) predicate G_i
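Each subtask <B_i, A_i, G_i> can be captured by a small record. The sketch below is hypothetical (class and field names are mine, not the authors' code) and only illustrates how the hierarchy's building blocks fit together.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Subtask:
    name: str
    abstraction: Callable[[dict], tuple]                      # B_i: world state -> abstracted state
    children: List["Subtask"] = field(default_factory=list)   # A_i: child subtasks / primitive actions
    is_goal: Callable[[dict], bool] = lambda s: False          # G_i: termination (goal) predicate

    @property
    def primitive(self):
        return not self.children

# Example fragment of the RTS hierarchy: Goto(k) over a primitive move.
north = Subtask("North", abstraction=lambda s: ())
goto = Subtask("Goto(k)",
               abstraction=lambda s: (s["x"], s["y"], s["k"]),
               children=[north],                               # plus South, East, West
               is_goal=lambda s: (s["x"], s["y"]) == s["k"])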
Hierarchical Average Reward Learning
• Value function decomposition for a recursively gain-optimal policy in Hierarchical H-learning:
  h_i(s) =
  \begin{cases}
  r(s, i) - \rho \, t(s, i) & \text{if } i \text{ is a primitive subtask} \\
  0 & \text{if } s \text{ satisfies the termination (goal) predicate } G_i \\
  \max_{a \in A_i(s)} \left[ h_a(s) + \sum_{s' \in S} P(s' \mid s, a) \, h_i(s') \right] & \text{otherwise}
  \end{cases}
• If the state abstractions are sound, h_a(s) = h_a(B_a(s))
• For the Root task this reduces to the Bellman equation
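The three cases of the decomposition can be expressed as a single backup over the current value tables. This is an illustrative sketch, not the authors' implementation: it reuses the hypothetical Subtask record above, and `model` is an assumed object supplying r(s, a), t(s, a), and the transition distribution.

def backup(task, s, h, rho, model):
    """One backup of the decomposed value h[task.name][s], following the
    three cases of the decomposition (illustrative sketch)."""
    if task.primitive:
        return model.r(s, task) - rho * model.t(s, task)
    if task.is_goal(s):
        return 0.0
    # Best child value plus the expected value of continuing this task.
    return max(
        h[child.name][s]
        + sum(p * h[task.name][s2] for s2, p in model.transitions(s, child))
        for child in task.children
    )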
Hierarchical Average Reward Learning
• No pseudo-rewards
• No completion function
• Scheduling is a learned behavior
Hierarchical Average Reward Learning
• Sharing requires coordination
• Coordination information is part of the state, not the action (Mahadevan)
• No need for each subtask to see the reward
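One way to read "coordination is part of the state" is that an agent's abstracted state is simply extended with the relevant information about the other agents (e.g., their positions for collision avoidance), rather than learning over joint actions. A hypothetical sketch, with names and state fields of my own choosing:

def abstract_state_with_coordination(own_state, other_agents):
    """Extend a navigation subtask's abstracted state with the other
    agents' positions so collisions can be avoided (illustrative only)."""
    own = (own_state["x"], own_state["y"], own_state["k"])
    others = tuple(sorted((a["x"], a["y"]) for a in other_agents))
    return own + others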
Single Hierarchical Agent
[Figure: one agent with the full task hierarchy (Root, Harvest(l), Deposit, Offense(e), Goto(k), Pick, Put, Attack, Idle, North, South, East, West); example task stack: Root → Harvest(W1) → Goto(W1) → North]
Simple Multi-Agent Setup
[Figure: each agent has its own copy of the full task hierarchy; example task stacks — agent 1: Root → Harvest(W1) → Goto(W1) → North; agent 2: Root → Offense(E1) → Attack]
MASH Setup
[Figure: all agents share a single task hierarchy; example task stacks into the shared hierarchy — Root → Harvest(W1) → Goto(W1) → North and Root → Offense(E1) → Attack]
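The essence of the MASH setup is that every agent indexes into one shared set of subtask value tables while keeping only its own task stack. A minimal sketch of that arrangement, assuming nothing beyond what the figure shows (class and attribute names are illustrative):

from collections import defaultdict

class SharedHierarchy:
    """One value table per subtask, shared by every agent."""
    def __init__(self, subtask_names):
        self.h = {name: defaultdict(float) for name in subtask_names}

class Agent:
    """Each agent keeps only its own task stack; values live in the shared hierarchy."""
    def __init__(self, shared):
        self.shared = shared
        self.stack = ["Root"]      # e.g. Root -> Harvest(W1) -> Goto(W1) -> North

    def value(self, subtask, abstract_state):
        return self.shared.h[subtask][abstract_state]

# Two agents reading (and updating) the same tables:
shared = SharedHierarchy(["Root", "Harvest", "Deposit", "Offense", "Goto"])
a1, a2 = Agent(shared), Agent(shared)
shared.h["Goto"][(3, 4)] = 1.5
assert a1.value("Goto", (3, 4)) == a2.value("Goto", (3, 4))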
Experimental Results
• 2 agents in a 15 × 15 grid; Pr(resource regeneration) = 5%; Pr(enemy) = 1%; rewards = (-1, 100, -5, 50); 30 runs
• 4 agents in a 25 × 25 grid; Pr(resource regeneration) = 7.5%; Pr(enemy) = 1%; rewards = (0, 100, -5, 50); 30 runs
• The separate-agents-with-coordination configuration could not be run for 4 agents in the 25 × 25 grid
Conclusion
• Sharing value functions
• Coordination
• Framework to express sharing & coordination with hierarchies
Future Work
• Non-Markovian & non-stationary domains
• Learning the task hierarchy
  • Task–subtask relationships
  • State abstractions
  • Termination conditions
• Combining the MASH framework with factored action models
• Recognizing opportunities for sharing & coordination
Current Work
• Incorporating features from Marthi & Russell's concurrent reinforcement learning approach