
Mutually-guided Multi-agent Learning


Presentation Transcript


  1. Mutually-guided Multi-agent Learning Raghav Aras Alain Dutech François Charpillet (MAIA) June 2004

  2. Outline • A review of some Multiagent Q-learning approaches • Our approach for Multiagent learning in a stochastic game • Some preliminary results

  3. Multiagent Q-Learning (1)
  Q-Learning (single-agent learning):
  • Q(st, at) ← (1 − α) Q(st, at) + α [Rt + γ maxa Q(st+1, a)]
  • Known to converge to optimal values
  minimax-Q (for zero-sum, 2-player games):
  • V1(s) ← maxP1 ∈ Π(A1) mina2 ∈ A2 Σa1 ∈ A1 P1(a1) Q1(s, (a1, a2))
  • Known to converge to optimal values
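A minimal Python sketch of the single-agent Q-learning update above; the constants and the defaultdict table are illustrative choices, not taken from the talk.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.95        # learning rate and discount factor (illustrative values)
Q = defaultdict(float)          # Q[(state, action)] -> estimated value, defaults to 0.0

def q_update(s, a, r, s_next, actions):
    """Tabular update: Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
```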

  4. Multiagent Q-Learning (2)
  Nash-Q learning (for n-agent, general-sum SGs):
  • Qi(s, a1,..,an) ← (1 − α) Qi(s, a1,..,an) + α [Ri + γ NashQi(s’)]
    where NashQi(s’) = Qi(s’, π1(s’) π2(s’) … πn(s’))
  • Converges under strict conditions (existence, uniqueness of Nash equilibria)
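A hedged sketch of the Nash-Q backup for agent i. Computing the stage-game equilibrium is the expensive part and is hidden behind nash_value, a hypothetical caller-supplied function, so only the update structure from the slide is shown.

```python
def nashq_update(Q_i, s, joint_a, r_i, s_next, nash_value, alpha=0.1, gamma=0.95):
    """One Nash-Q backup for agent i on its joint-action Q-table.

    nash_value(s_next) is assumed to return agent i's payoff under a Nash
    equilibrium of the stage game at s_next (the NashQ_i(s') term above).
    """
    key = (s, joint_a)
    Q_i[key] = (1 - alpha) * Q_i[key] + alpha * (r_i + gamma * nash_value(s_next))
    return Q_i
```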

  5. Drawbacks of Nash-Q learning • Coordination in choice of Nash equilibrium • Observability of all actions and all rewards • Space complexity (each agent): n · |S| · |A|^n

  6. The problem that we treat… n-agent SG <S, A1..An, R1..Rn, P>:
  • Ri: S × (A1 × … × An) → ℝ
  • P: S × (A1 × … × An) × S → {0, 1} (deterministic)
  • Ωi ⊆ S (set of equally good goal states)
  • Ω = Ω1 ∩ Ω2 ∩ … ∩ Ωn
  • |Ω| ≥ 1 (at least one common goal state)
  • An agent’s payoff is the same in all its goal states
  • Agents’ payoffs may be different in the common goal state
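For concreteness, one way to hold such a game in code; this is a sketch under my own naming, not a structure from the paper.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

@dataclass
class DeterministicSG:
    """n-agent stochastic game with deterministic transitions and goal sets."""
    states: FrozenSet
    actions: Tuple[FrozenSet, ...]              # A1 .. An
    reward: Callable[[object, tuple], tuple]    # R(s, joint_action) -> (R1, .., Rn)
    step: Callable[[object, tuple], object]     # deterministic P: (s, joint_action) -> s'
    goals: Tuple[FrozenSet, ...]                # Omega_1 .. Omega_n (equally good goal states)

    def common_goals(self) -> FrozenSet:
        """Omega = intersection of all Omega_i; assumed non-empty (|Omega| >= 1)."""
        common = self.goals[0]
        for g in self.goals[1:]:
            common = common & g
        return common
```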

  7. Our Interest • A more realistic assumption for SGs (actions, rewards of other agents hidden) • Investigating « independent » learning in SGs (leading also to scalability) • Using communication to forge cooperation
  A single-agent learning algorithm giving maximum payoff to a maximum number of agents

  8. Communication in Our Approach
  [Figure: example of a ping-message exchange (Agent 1 sends; Agents 2 and 3 get the message)]
  • Agents send and receive ‘ping messages’
  • Sending a message is an action
  A ping message,
  • …is an n−1 sized array of 0s and 1s
  • …has no content

  9. Communication based Q-Values
  • Agent state = <game state, message received>
  • Agent action = <basic action, message to send>
  • 2^(n−1) possible messages
  • Mi, agent i’s message set
  • State set = S × Mi
  • Action set = Ai × Mi
  • Size of Q-value set: |S × Mi × Ai × Mi|
  • Agent policy πi: S × Mi → Ai × Mi
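A small sketch of the augmented spaces, reading each ping message as a tuple of n−1 bits; the encoding and function names are mine.

```python
from itertools import product

def message_set(n_agents):
    """All (n-1)-bit ping messages an agent can send: 2^(n-1) tuples of 0s and 1s."""
    return list(product((0, 1), repeat=n_agents - 1))

def augmented_spaces(game_states, basic_actions, n_agents):
    """Agent-level state set S x Mi and action set Ai x Mi."""
    M = message_set(n_agents)
    agent_states = [(s, m) for s in game_states for m in M]      # <game state, message received>
    agent_actions = [(a, m) for a in basic_actions for m in M]   # <basic action, message to send>
    return agent_states, agent_actions
```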

  10. What do we envisage the messages doing? • Alert others of proximity to a goal state • Discover the common goal state • Enforce preference for the common goal state
  Main principle of our algorithm: • Play safe by inverting actual rewards • Create artificial rewards based on messages

  11. The Q-comm Learning algorithm
  Agent i initial state σi ← <S, ∅>
  • Loop (each agent)
    • Select αi = <ai, messsend> (Boltzmann, ε-greedy)
    • Execute αi, observe reward Ri
    • σ’i ← <S’, messrecd> (next state)
    • RMi ← (Ri · messsend) + (Ri · messrecd)
    • Invert reward: Ri ← −1 · Ri
    • Qi(σi, αi) ← (1 − α) Qi(σi, αi) + α [Ri + RMi + γ maxα’ Qi(σ’i, α’)]
    • σi ← σ’i, S ← S’
  • Until S ∈ Ω (a goal state)
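A hedged Python sketch of one Q-comm episode for a single agent, following the loop above. Only the reward inversion, the message reward RMi, and the Q-update come from the slide; the env object, its step() signature, and the reading of (Ri · mess) as Ri times the number of set bits in the message are assumptions, and ε-greedy is used here as the simpler of the two selection rules mentioned.

```python
import random

def q_comm_episode(env, Q, actions, n_agents, alpha=0.1, gamma=0.95, eps=0.1):
    """One episode of Q-comm learning for one agent.

    Q is a mapping such as collections.defaultdict(float); 'actions' enumerates
    the pairs <basic action, message to send>, with messages as bit tuples.
    env.step(basic_action, mess_send) -> (next_state, reward, mess_recd, done).
    """
    S = env.reset()
    sigma = (S, (0,) * (n_agents - 1))          # <game state, message received>
    done = False
    while not done:
        if random.random() < eps:               # epsilon-greedy selection
            act = random.choice(actions)
        else:
            act = max(actions, key=lambda x: Q[(sigma, x)])
        basic_action, mess_send = act
        S_next, R, mess_recd, done = env.step(basic_action, mess_send)
        sigma_next = (S_next, mess_recd)
        # artificial message reward: RM = (R . mess_send) + (R . mess_recd)
        RM = R * sum(mess_send) + R * sum(mess_recd)
        R = -1 * R                              # "play safe" by inverting the actual reward
        best_next = max(Q[(sigma_next, x)] for x in actions)
        Q[(sigma, act)] = (1 - alpha) * Q[(sigma, act)] + alpha * (R + RM + gamma * best_next)
        sigma = sigma_next
    return Q
```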

  12. Test Problem: Find the Winning Number (FWN)
  [Figure: example digit array 9 3 5 8]
  An n-digit array (number) controlled by n agents
  • Each agent controls a digit
  • Actions: +1, −1, 0
  • Ωi, list of « winning » numbers for agent i (unknown)
  • Each num ∈ Ωi gives equal payoff to i
  • Ω = Ω1 ∩ Ω2 ∩ … ∩ Ωn
  • Ω contains a common « winning » number
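A minimal sketch of the FWN environment as I read this slide; the wrap-around of digits modulo 10, the all-zero starting number, and the payoff value of 1 are my assumptions.

```python
class FWN:
    """'Find the Winning Number': n agents, each controlling one digit."""

    def __init__(self, n_digits, winning_sets, start=None):
        self.n = n_digits
        self.winning_sets = winning_sets         # one set Omega_i of winning numbers per agent
        self.digits = list(start) if start else [0] * n_digits

    def number(self):
        return int("".join(str(d) for d in self.digits))

    def step(self, joint_action):
        """joint_action: per-agent moves in {+1, -1, 0}; digits wrap modulo 10 (assumption)."""
        self.digits = [(d + a) % 10 for d, a in zip(self.digits, joint_action)]
        num = self.number()
        rewards = [1 if num in omega else 0 for omega in self.winning_sets]
        return num, rewards
```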

  13. Results (1): 3 agent FWN Ω1 = {2, 16, 119}, Ω2 = {68, 102, 119}, Ω3 = {37, 86, 119}

  14. Results (2): 3 agent FWN Ω1 = {2, 16, 119}, Ω2 = {68, 102, 119}, Ω3 = {37, 86, 119}

  15. Results (3): Multiple Common Goals Agents select one common goal

  16. Results (4): 4 agent FWN Not all agents satisfied!

  17. Summary of Results: • Empirically, Q-comm learning finds the common goal • Works with multiple common goals • Agents coordinate equilibrium choice • Works with up to 3 agents • Doesn’t always work for 4 or more agents

  18. Future work: • Increase scalability by localising communication • Investigate how it can work for n ≥ 4 • Analyse convergence

  19. Thank you! Your questions…
