
Software Multiagent Systems: CS543


Presentation Transcript


  1. Software Multiagent Systems: CS543 • Milind Tambe, University of Southern California, tambe@usc.edu

  2. Dimensions of Multiagent Learning • Ignore others’ learning vs Model others’ learning • Cooperative vs Competitive • Cooperative • Learn to coordinate with others • Learning organizational roles • Competitive (conflicting learning goals) • Learning to play better against adversary • Opponent modeling • We will focus on reinforcement learning: Q-learning methods

  3. Some Terminology • Q-learning • Model-free vs Model-based

  4. Q-learning • Q-values: Q(s,a) • Related to utility values: U(s) = max_a Q(s,a) • The following equation must hold at equilibrium: Q(i,a) = R(i) + Σ_j P(j|i,a) · max_{a'} Q(j,a') • Requires learning a model!
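As a rough illustration of why this equation "requires learning a model", here is a minimal Python sketch (not from the slides) that solves the equilibrium equation by repeated sweeps once the transition probabilities P and rewards R have been estimated; the data layout and function name are assumptions.

```python
def q_value_iteration(states, actions, P, R, n_sweeps=100):
    """P[(i, a)][j]: estimated probability of reaching state j from i under a.
    R[i]: estimated reward of state i. (Hypothetical data layout.)"""
    Q = {(i, a): 0.0 for i in states for a in actions}
    for _ in range(n_sweeps):
        new_Q = {}
        for i in states:
            for a in actions:
                expected_best = sum(
                    p * max(Q[(j, a2)] for a2 in actions)
                    for j, p in P[(i, a)].items()
                )
                # The slide's equation has no discount factor; a discounted
                # variant would multiply expected_best by a gamma < 1.
                new_Q[(i, a)] = R[i] + expected_best
        Q = new_Q
    return Q
```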

  5. TD Q-learning • The update equation for TD Q-learning is: Q(i,a) ← Q(i,a) + α (R(i) + max_{a'} Q(j,a') − Q(i,a)) • What if α = 0? • What if α = 1?

  6. Q-Learning Agent • Q-LEARNING-AGENT(e) returns an action • e is the percept • Q: table of action values • N: table of state-action frequencies • a: the last action taken • i: the previous state • 1. j ← STATE[e] • 2. N[i,a] ← N[i,a] + 1 • 3. Q[i,a] ← Q[i,a] + α (R(i) + max_{a'} Q[j,a'] − Q[i,a]) • 4. i ← j • 5. Return the action a' that maximizes f(Q[j,a'], N[j,a'])
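A minimal Python sketch of the agent above, under assumptions: the learning rate is written as alpha, and action selection is plain greedy (slides 7-9 discuss replacing it with the exploration function f(Q, N)).

```python
from collections import defaultdict

class QLearningAgent:
    """Model-free TD Q-learning agent following the pseudocode above."""

    def __init__(self, actions, alpha=0.1):
        self.actions = actions
        self.alpha = alpha
        self.Q = defaultdict(float)   # Q[(state, action)] action values
        self.N = defaultdict(int)     # N[(state, action)] visit counts
        self.prev_state = None
        self.prev_action = None

    def step(self, state, prev_reward):
        """Process one percept: update Q for the previous state-action pair,
        then return the next action."""
        if self.prev_state is not None:
            i, a = self.prev_state, self.prev_action
            best_next = max(self.Q[(state, a2)] for a2 in self.actions)
            self.N[(i, a)] += 1
            # Q[i,a] <- Q[i,a] + alpha * (R(i) + max_a' Q[j,a'] - Q[i,a])
            self.Q[(i, a)] += self.alpha * (prev_reward + best_next - self.Q[(i, a)])
        # Plain greedy choice; slides 7-9 replace this with f(Q, N) to force exploration.
        action = max(self.actions, key=lambda a2: self.Q[(state, a2)])
        self.prev_state, self.prev_action = state, action
        return action
```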

  7. Choosing an Action… • Step 5: choosing the best action to take in state j (a' is the action chosen using f(Q(a', j), N[a', j])) • Suppose all Q-values are initially zero, and f(Q(a', j), N[a', j]) simply chooses max Q(a', j) • Suppose after the first exploration: Q[j, A1] = 10, Q[j, A2] = 0, Q[j, A3] = 0, Q[j, A4] = 0 • What will happen? Is this a problem?

  8. Exploration vs Exploitation • Tradeoff: immediate good (exploit) vs long-term good (explore) • Continuous exploration vs sticking to a well-known path • Key question: how to balance the two? • One approach: • Give some weight to actions not tried often • Avoid actions that are of low utility

  9. Exploration • Giving "weight" to actions not tried very often: f(Q(a', j), N[a', j]) = argmax_{a'} G(Q(a', j), N[a', j]) • G returns: • a very high "R" if N(a', j) < N-VISITS • otherwise Q(a', j) • What will be the result of such a function G?
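A small sketch of such a function G, with assumed constant names R_PLUS (the "very high R") and N_VISITS:

```python
R_PLUS = 100.0    # optimistic stand-in for the best possible reward
N_VISITS = 5      # try every action at least this many times

def G(q_value, visit_count):
    """Return an optimistic value for rarely tried actions, else the learned Q."""
    return R_PLUS if visit_count < N_VISITS else q_value

def choose_action(state, actions, Q, N):
    # argmax over a' of G(Q(a', j), N[a', j]): under-explored actions win
    # until they have been tried N_VISITS times, then exploitation takes over.
    return max(actions, key=lambda a: G(Q[(state, a)], N[(state, a)]))
```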

  10. Two Frameworks for Multiagent "Learning" • DCOP: Exploration + Exploitation (paper to be posted on the web site) [Jain et al., IJCAI'09] • Stochastic games: Multiagent learning to reach a Nash equilibrium (in our readings)

  11. DCOP Framework • (figure: constraint graph over variables a1, a2, a3) • Assign values to distributed variables • Optimize total reward • No central control

  12. DCOPs for Mobile Sensor Networks (with Lockheed ATL)

  13. New Challenges • Reward matrices unknown • Algorithms must explore the environment • Maximize total cumulative signal strength • Changes how DCOP algorithms are evaluated • Limited time horizon • Cannot explore everything • Horizon-aware DCOPs

  14. DCOP Framework: Reward Matrix Unknown • (figure: constraint graph over variables a1, a2, a3) • Assigning values to variables = exploration • Exploration takes time (physical movement) • Limited time; full exploration is impossible

  15. Three New Algorithms • Based on MGM (maximum gain message) • Hill climbing • Communicate possible gain to neighbors • Agent with max gain "moves" • Proposed new algorithms: • SE-optimistic: unexplored domain values yield the 'maximum' reward • Optimistic: maximal potential gain messaging • Exploration maximized: look for the max value • SE-mean: unexplored domain values yield the 'mean' reward • "Realistic": limit exploration, satisfied by the mean • BE-backtrack: lookahead given a reward function distribution • Intelligent: decision-theoretic limit on exploration • (figure: example with variables a1, a2, a3 and gains of 15 and 20)
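For intuition only, a sketch of how the two static estimators might value an unexplored domain value when computing MGM gains; this is not the authors' implementation, all names are hypothetical, and the actual algorithms are in [Jain et al., IJCAI'09].

```python
# explored_rewards maps a domain value to its observed reward; values missing
# from the map are unexplored.

def estimated_reward_se_optimistic(value, explored_rewards, max_reward):
    # SE-optimistic: assume every unexplored value yields the maximum reward,
    # so the gain computation keeps favoring exploration.
    return explored_rewards.get(value, max_reward)

def estimated_reward_se_mean(value, explored_rewards, mean_reward):
    # SE-mean: assume unexplored values yield only the mean reward, so agents
    # stop exploring once they find something better than average.
    return explored_rewards.get(value, mean_reward)
```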

  16. DCOP Framework: Reward Matrix Unknown • (figure: constraint graph over variables a1, a2, a3) • What if 20 is the maximum reward? • SE-optimistic: how will it work?

  17. Lookahead • The agent decides whether to 'explore' or to 'backtrack' to an already explored state • Let Rb be the best reward among explored states • The agent will explore for T units only if EU(explore) > EU(backtrack) • Expected utility of backtrack: EU(backtrack) = Rb × T

  18. Lookahead • The expected utility of exploring is computed from P(x, n, t_e), the first-order statistic giving the probability that the maximum reward found in t_e trials is x • EU(explore) is the sum of three terms: • the utility of exploring • the utility of finding a better reward than the current Rb • the utility of failing to find a better reward than the current Rb
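A tiny sketch of the resulting decision rule: the backtrack side follows the slide (EU(backtrack) = Rb × T), while the explore side is passed in as a caller-supplied estimate because its exact formula (built from P(x, n, t_e)) is not reproduced in this transcript.

```python
def should_explore(best_reward_so_far, remaining_steps, eu_explore):
    # EU(backtrack) = Rb * T: exploit the best known reward for the rest of the horizon.
    eu_backtrack = best_reward_so_far * remaining_steps
    # Explore only if the (caller-estimated) expected utility of exploring is higher.
    return eu_explore > eu_backtrack
```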

  19. Sample Results (Jain et al., IJCAI'09) • Decision-theoretic approach to exploration • Interleave with DCOPs

  20. Towards Multiagent Learning: Stochastic Games • Generalize distributed POMDPs • Different payoffs for each player, not a common payoff • Focus on two-person stochastic games • Learning algorithms for stochastic games

  21. Stochastic 2-player Game • States: S • Action sets for each player: A1, A2 • Transition probabilities: P(s' | s, a1, a2) • Rewards: two separate reward functions: • R1(s, a1, a2), depends on the actions of both agents • R2(s, a1, a2), depends on the actions of both agents • If R1(s, a1, a2) + R2(s, a1, a2) = 0 everywhere, the game is zero-sum • State observable (MDP-like) • Each player maximizes its own (discounted) sum of rewards
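A minimal Python sketch of this definition as a data structure; all field names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class TwoPlayerStochasticGame:
    states: List[State]
    actions1: List[Action]   # A1
    actions2: List[Action]   # A2
    # P[(s, a1, a2)][s_next] = P(s_next | s, a1, a2)
    P: Dict[Tuple[State, Action, Action], Dict[State, float]]
    R1: Dict[Tuple[State, Action, Action], float]   # reward to player 1
    R2: Dict[Tuple[State, Action, Action], float]   # reward to player 2

    def is_zero_sum(self, eps: float = 1e-9) -> bool:
        # Zero-sum: R1(s, a1, a2) + R2(s, a1, a2) = 0 for every entry.
        return all(abs(self.R1[k] + self.R2[k]) < eps for k in self.R1)
```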

  22. Stochastic Game • (figure: transition diagram with states s0, s1, s2, transition probabilities P(s1|s0,a1,a2) and P(s2|s0,a1,a2), and per-state rewards R1(s), R2(s)) • The reward function depends on the state!

  23. Stochastic Game • How are repeated games related to stochastic games?

  24. Stochastic Game • Strategies = policies • Since rewards differ for each agent, expected values differ as well • v1(s, π1, π2) gives the expected value for agent 1 in state s, given that the agents follow policies π1, π2 • Nash equilibrium in a stochastic game: a pair of strategies (π1*, π2*) such that for all states s and all alternative policies π1, π2: v1(s, π1*, π2*) ≥ v1(s, π1, π2*) and v2(s, π1*, π2*) ≥ v2(s, π1*, π2)

  25. Nash Equilibrium Policies • In stochastic games, we focus on policies that attain a Nash equilibrium • If we don't find a Nash equilibrium, players may have an incentive to deviate • The search for stability is critical • Policies may be randomized; they need not be deterministic

  26. Example Stochastic Game • The goalie can move or stay • The shooter can move or shoot • Zero-sum game: a goal is worth 10 points to the shooter • A block is worth 5 points to the goalie

  27. Work out example

  28. Q-learning in Stochastic Games • Nash-Q algorithm: • Q1(s, a1, a2): Q-value of agent 1 for state s • Q2(s, a1, a2): Q-value of agent 2 for state s • Optimal Q-values: • Q1*(s, a1, a2) = R1(s, a1, a2) + λ Σ_{s'} P(s'|s, a1, a2) · V1(s', π1*, π2*) • Q2*(s, a1, a2) = R2(s, a1, a2) + λ Σ_{s'} P(s'|s, a1, a2) · V2(s', π1*, π2*)

  29. Example

  30. Algorithm • Consider two agents: • Each agent maintains m Q-tables, where m = number of states • Each state's Q-table has |A1| × |A2| entries: • |A1| for my actions • |A2| for the other agent's actions • Q-tables are kept both for me and for the other agent

  31. Key Observation • State s' • The bimatrix representation (Q1[s'], Q2[s']) defines a game • We can find a mixed-strategy Nash equilibrium for this game • A mixed-strategy Nash equilibrium provides a probability distribution over which action to execute
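For intuition, a sketch of how a mixed-strategy equilibrium of a small stage game can be computed. This only handles nondegenerate, fully mixed 2x2 bimatrix games via the standard indifference conditions; a full Nash-Q implementation would use a general bimatrix solver (e.g. Lemke-Howson).

```python
def mixed_nash_2x2(A, B):
    """A[i][j], B[i][j]: payoffs to players 1 and 2 for row i, column j.
    Returns (p, q): probability player 1 plays row 0 and player 2 plays column 0."""
    # Player 1 mixes so that player 2 is indifferent between its two columns.
    denom_p = B[0][0] - B[0][1] - B[1][0] + B[1][1]
    p = (B[1][1] - B[1][0]) / denom_p
    # Player 2 mixes so that player 1 is indifferent between its two rows.
    denom_q = A[0][0] - A[0][1] - A[1][0] + A[1][1]
    q = (A[1][1] - A[0][1]) / denom_q
    return p, q

# Example: matching pennies (zero-sum) has the mixed equilibrium p = q = 0.5.
A = [[1, -1], [-1, 1]]
B = [[-1, 1], [1, -1]]
print(mixed_nash_2x2(A, B))   # (0.5, 0.5)
```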

  32. Multiagent Q-Learning • Initialize Q-tables • Loop: • Choose action a1 based on π1(s), a mixed-strategy Nash equilibrium of the game defined by (Q1(s), Q2(s)) • Observe r1, r2, a2, s' • Update Q1(s) and Q2(s) using the equation below • Q1(s, a1, a2) ← Q1(s, a1, a2) + α (r1 + λ Z1 − Q1(s, a1, a2)) • Z1 = expected reward to agent 1 under the Nash equilibrium in state s', due to the game (Q1(s'), Q2(s'))
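A sketch of one step of this loop from agent 1's point of view, assuming a helper solve_stage_game that returns a mixed equilibrium and its expected values for the bimatrix game (Q1[s'], Q2[s']); alpha and lam (the discount λ) are assumed parameter names.

```python
import random

def nash_q_step(Q1, Q2, s, a1, a2, r1, r2, s_next, solve_stage_game,
                alpha=0.1, lam=0.9):
    # The mixed equilibrium of the next state's stage game gives the
    # continuation values Z1, Z2 used in the update above.
    pi1, pi2, z1, z2 = solve_stage_game(Q1[s_next], Q2[s_next])
    Q1[s][(a1, a2)] += alpha * (r1 + lam * z1 - Q1[s][(a1, a2)])
    Q2[s][(a1, a2)] += alpha * (r2 + lam * z2 - Q2[s][(a1, a2)])
    return pi1   # mixed strategy for agent 1; sample the next action from it

def sample_action(actions, probs):
    # Draw an action according to the mixed-strategy probabilities.
    return random.choices(actions, weights=probs, k=1)[0]
```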

  33. What do we end up with? • Agents converge to the Nash equilibrium

  34. Towards Multiagent Learning • Learning as a "single agent" in a multiagent setting • Ignore other agents except for some property such as location • Ignore that other agents act intentionally and adapt • Advantages: • Simpler • Converges more easily

  35. Single Agent in Multiagent Setting • RoboCup Soccer Simulation League • Players use model-free reinforcement learning to intercept the ball • Learn “on line” during the game

  36. Finding #1: Online Learning Specialized by Opponent • Same player position against two different RoboCup teams: • Player 1 (forward) against CMUnited and Andhill • Against CMUnited, player turns more aggressively

  37. Finding #2: Online Learning Specialized by Role • Different players on the same team against the same opponent • Player 1 (forward) and Player 10 (fullback) against CMUnited

  38. Lessons Learned • Surprise in tests against opponent teams: • Significant specialization of intercept with both role & opponent • Lesson: Transfer of experience or cross-training may be detrimental
