
Software Multiagent Systems: CS543

Milind Tambe

University of Southern California

tambe@usc.edu


Dimensions of Multiagent Learning

  • Ignore others’ learning vs Model others’ learning

  • Cooperative vs Competitive

    • Cooperative

      • Learn to coordinate with others

      • Learning organizational roles

    • Competitive (conflicting learning goals)

      • Learning to play better against adversary

      • Opponent modeling

  • We will focus on reinforcement learning: Q-learning methods


Some Terminology

  • Q-learning

  • Model-free vs Model-based


Q-learning

  • Q-values: Q(s,a)

  • Related to utility values:

    • U(s) = max_a Q(s,a)

  • Following equation must hold at equilibrium:

    Q(i,a) = R(i) + Σj P(j|i,a) * max_a' Q(j,a')

  • Requires learning a model!


TD Q-learning

  • Update equation for TD Q-learning is:

    Q(i,a) ← Q(i,a) + α (R(i) + max_a' Q(j,a') – Q(i,a))

  • What if α = 0?

  • What if α = 1?
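To make the update concrete, here is a minimal sketch in Python, assuming a tabular Q stored in a dictionary; the function name td_q_update and the toy states are illustrative only:

```python
from collections import defaultdict

# Tabular Q-values: Q[(state, action)] -> current estimate
Q = defaultdict(float)

def td_q_update(Q, i, a, r_i, j, actions, alpha=0.1):
    """Apply one TD Q-learning update after taking action a in state i,
    observing reward R(i) = r_i and ending up in state j."""
    best_next = max(Q[(j, a_next)] for a_next in actions)  # max_a' Q(j, a')
    Q[(i, a)] += alpha * (r_i + best_next - Q[(i, a)])      # alpha is the learning rate
    return Q[(i, a)]

# Example: with alpha = 0 the estimate never changes;
# with alpha = 1 the old estimate is completely overwritten.
td_q_update(Q, i="s0", a="right", r_i=1.0, j="s1", actions=["left", "right"])
```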


Q Learning Agent

  • Q-learning-agent(e) returns an action

    • e is the percept

    • Q: table of action values

    • N: table of state-action frequencies

    • a: the last action taken

    • I: the previous state

    • J ← STATE[e]

    • N[I,a] ← N[I,a] + 1

    • Q[I,a] ← Q[I,a] + α (R(I) + max_a' Q[J,a'] – Q[I,a])

    • I ← J

    • Return the action a' that maximizes f(Q[J,a'], N[J,a'])
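A minimal Python sketch of this agent loop, with a purely greedy choice standing in for step 5 (the exploration function f is discussed on the following slides); the class and attribute names are illustrative, not from the slides:

```python
from collections import defaultdict

class QLearningAgent:
    def __init__(self, actions, alpha=0.1):
        self.actions = actions
        self.alpha = alpha
        self.Q = defaultdict(float)   # Q: table of action values, Q[(state, action)]
        self.N = defaultdict(int)     # N: table of state-action frequencies
        self.prev_state = None        # I, the previous state
        self.prev_action = None       # a, the last action
        self.prev_reward = 0.0        # R(I), reward observed in the previous state

    def step(self, state, reward):
        """Process a percept e = (state, reward) and return the next action."""
        J = state
        if self.prev_state is not None:
            I, a = self.prev_state, self.prev_action
            self.N[(I, a)] += 1
            best_next = max(self.Q[(J, a2)] for a2 in self.actions)
            self.Q[(I, a)] += self.alpha * (self.prev_reward + best_next - self.Q[(I, a)])
        # Step 5 (greedy placeholder for f): pick the action with the highest Q in J
        action = max(self.actions, key=lambda a2: self.Q[(J, a2)])
        self.prev_state, self.prev_action, self.prev_reward = J, action, reward
        return action
```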


Choosing an Action….

  • Step 5: choosing the best action to take in state J

    (a’ is the action chosen using f(Q(a’, j), N[a’, j]))

  • Suppose all Q values are initially zero, and f(Q(a', j), N[a', j]) simply picks the action with the highest Q(a', j)

  • Suppose after first exploration:

    • Q[J,A1] = 10, Q[J,A2] = 0, Q[J,A3] = 0, Q[J,A4] = 0

What will happen? Is this a problem?


Exploration vs Exploitation

  • Tradeoff: immediate good (exploit) vs long-term benefit (explore)

    • Continuous exploration vs sticking to a well-known path

  • Key question: How to balance the two?

  • One approach:

    • Give some weight to actions not tried often

    • Avoid actions that are of low utility


Exploration

  • Give “weight” to actions not tried very often:

    f(Q(a', j), N[a', j]) = argmax_a' G(Q(a', j), N[a', j])

  • G returns:

    • very high reward “R” if N[a', j] < N-VISITS

    • otherwise Q(a', j)

  • What will be the result of such a function G?
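A small Python sketch of such an exploration scheme, with assumed constants R_PLUS (an optimistic bound standing in for the "very high R") and N_VISITS; plugged into the agent sketch above, it replaces the greedy choice in step 5:

```python
R_PLUS = 10.0   # assumed optimistic estimate of the best achievable reward ("R")
N_VISITS = 5    # assumed threshold: try each (state, action) at least this often

def G(q_value, n_visits):
    """Return a very high value until the pair has been tried N_VISITS times."""
    return R_PLUS if n_visits < N_VISITS else q_value

def choose_action(Q, N, j, actions):
    """argmax over a' of G(Q(a', j), N[a', j]) in state j."""
    return max(actions, key=lambda a: G(Q[(j, a)], N[(j, a)]))
```

Under this scheme, each action in a visited state is tried at least N_VISITS times before its learned Q-value alone decides the choice.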



Two Frameworks for Multiagent “Learning”

DCOP: Exploration + Exploitation (paper to be posted on the web site) [Jain et al IJCAI’09]

Stochastic games: Multiagent learning to reach N.E. (in our readings)


DCOP Framework

[Figure: DCOP constraint graph over agents a1, a2, a3]

  • Assign values to distributed variables

  • Optimize total reward

  • No central control



New Challenges

  • Reward matrices unknown

    • Algorithms explore environment

  • Maximize total cumulative signal strength

    • Changes how the performance of DCOP algorithms is measured

  • Limited time horizon

    • Cannot explore everything

    • Horizon-aware DCOPs


DCOP Framework: Reward Matrix Unknown

[Figure: DCOP constraint graph over agents a1, a2, a3 with unknown reward matrices]

Assigning values to variables = Exploration

Exploration takes time (physical movement)

Limited time; full exploration impossible


Three New Algorithms

  • Based on MGM (maximum gain message)

    • Hill climbing

    • Communicate possible gain to neighbors

    • Agent with max gain “moves”

  • Proposed new algorithms:

    • SE-optimistic: unexplored domain values are assumed to yield the ‘maximum’ reward (see the sketch after this slide)

      • Optimistic: Maximal Potential Gain Messaging

      • Exploration maximized: always look for the max value

    • SE-mean: unexplored domain values are assumed to yield the ‘mean’ reward

      • “Realistic”: limits exploration, satisfied by the mean

    • BE-backtrack: lookahead given a distribution over reward functions

      • Intelligent: decision-theoretic limit on exploration

[Figure: MGM on agents a1, a2, a3: neighbors exchange gain messages (Gain = 15, Gain = 20) and the agent with the maximum gain moves]
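The following rough Python sketch shows one synchronous round of MGM with the SE-optimistic treatment of unexplored values. It assumes binary constraints, a known upper bound MAX_REWARD, and a dictionary `explored` of rewards observed so far; these names and simplifications are mine, not from the paper:

```python
MAX_REWARD = 100.0   # assumed known upper bound on any constraint reward

def pair_reward(explored, a, va, b, vb):
    """Observed reward for a joint value choice, or the optimistic bound if unexplored."""
    return explored.get((a, va, b, vb), MAX_REWARD)

def local_reward(agent, value, assignment, neighbors, explored):
    return sum(pair_reward(explored, agent, value, n, assignment[n])
               for n in neighbors[agent])

def mgm_se_optimistic_round(assignment, domains, neighbors, explored):
    """One MGM round: every agent computes its best possible gain,
    and only local winners (gain larger than all neighbors') move."""
    gains, best_values = {}, {}
    for agent, current in assignment.items():
        current_r = local_reward(agent, current, assignment, neighbors, explored)
        best_v = max(domains[agent],
                     key=lambda v: local_reward(agent, v, assignment, neighbors, explored))
        best_values[agent] = best_v
        gains[agent] = local_reward(agent, best_v, assignment, neighbors, explored) - current_r
    for agent in assignment:
        if gains[agent] > 0 and all(gains[agent] > gains[n] for n in neighbors[agent]):
            assignment[agent] = best_values[agent]
    return assignment
```

Because unexplored joint values look maximally rewarding, agents keep being pulled toward value combinations they have not measured yet, which is the exploration-maximizing behavior described above.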


DCOP Framework: Reward Matrix Unknown

[Figure: DCOP constraint graph over agents a1, a2, a3 with partially explored reward matrices]

What if 20 is max reward?

SE-optimistic: how will it work?


Lookahead

  • Agent decides whether to ‘explore’ or to ‘backtrack’ to an already explored state

  • Let Rb be the best reward among explored states

  • The agent will explore for T units only if

  • EU(Explore) > EU(backtrack)

  • Expected Utility of Backtrack:

  • EU(backtrack) = Rb*T


Lookahead

  • Expected utility of explore is calculated as:

    • P(x, n, te) is the first-order statistic: the probability that the maximum reward found in te trials is x

    • EU(explore) is the sum of three terms:

      • utility of exploring

      • utility of finding a better reward than current Rb

      • utility of failing to find a better reward than current Rb
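The sketch below illustrates this decision rule under an extra assumption the slides do not make: unexplored rewards are i.i.d. with a known CDF and mean. The three terms of EU(explore) appear explicitly; the function names and the discretized integration grid are illustrative:

```python
def eu_backtrack(r_b, T):
    """Backtrack: exploit the best known reward R_b for all T remaining steps."""
    return r_b * T

def eu_explore(r_b, T, t_e, cdf, mean, grid):
    """Explore for t_e steps, then exploit the best reward found for T - t_e steps."""
    max_cdf = lambda x: cdf(x) ** t_e          # first-order statistic over t_e trials
    exploring = mean * t_e                     # term 1: reward gathered while exploring
    better, prev = 0.0, r_b
    for x in (v for v in grid if v > r_b):     # term 2: a reward better than R_b is found
        better += x * (max_cdf(x) - max_cdf(prev))
        prev = x
    better *= (T - t_e)
    worse = r_b * max_cdf(r_b) * (T - t_e)     # term 3: nothing better than R_b is found
    return exploring + better + worse

# Example: rewards assumed uniform on [0, 100]
grid = list(range(101))
cdf = lambda x: min(max(x / 100.0, 0.0), 1.0)
explore = eu_explore(60, T=20, t_e=5, cdf=cdf, mean=50.0, grid=grid) > eu_backtrack(60, 20)
```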


Sample Results (Jain et al, IJCAI’09)

  • Decision theoretic approach to exploration

  • Interleave with DCOPs



Towards Multiagent Learning: Stochastic Game

Generalize distributed POMDPs

Different payoffs for each player, not a common payoff

Focus on two-person stochastic games

Learning algorithms for stochastic games


Stochastic 2-player Game

  • States: S

  • Action sets for each player: A1, A2

  • P: transition probabilities P(s'|s, a1, a2)

  • R or Reward: two separate rewards:

    • R1(s, a1,a2), depends on actions of all agents

    • R2(s, a1, a2), depends on actions of all agents

    • If R1(s,a1,a2) + R2(s,a1,a2) = 0, then it is a zero-sum game

  • State is observable (MDP-like)

  • Each player: maximize its own (discounted) sum of rewards
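A minimal Python representation of such a two-player stochastic game, purely for illustration (the field names are assumptions, not from the slides):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class TwoPlayerStochasticGame:
    states: List[State]
    actions1: List[Action]                     # A1, player 1's action set
    actions2: List[Action]                     # A2, player 2's action set
    # P(s' | s, a1, a2): transition probabilities
    transitions: Dict[Tuple[State, Action, Action], Dict[State, float]]
    # R1(s, a1, a2) and R2(s, a1, a2): separate rewards per player
    reward1: Dict[Tuple[State, Action, Action], float]
    reward2: Dict[Tuple[State, Action, Action], float]

    def is_zero_sum(self) -> bool:
        """Zero-sum if R1 + R2 = 0 for every (s, a1, a2)."""
        return all(abs(self.reward1[k] + self.reward2[k]) < 1e-9 for k in self.reward1)
```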


Stochastic Game

[Figure: from state s0, the joint action (a1, a2) leads to s1 with probability P(s1|s0,a1,a2) or to s2 with probability P(s2|s0,a1,a2); each state si carries separate rewards R1(si) and R2(si)]

Reward function depends on the state!


Stochastic Game

  • How are repeated games related to stochastic games?


Stochastic Game

  • Strategies = Policies

  • Since rewards differ for each agent

    • hence expected values differ as well

  • v1(s, π1, π2) gives the expected value for agent 1 in state s, given that the agents pursue policies π1 and π2

  • Nash equilibrium in stochastic game:

    • pair of strategies (π1*, π2*) such that, for all states s and all policies π1, π2:

      v1(s, π1*, π2*) >= v1(s, π1, π2*)

      and

      v2(s, π1*, π2*) >= v2(s, π1*, π2)


Nash Equilibrium Policies

  • In Stochastic games, we focus on policies that attain Nash equilibrium

    • If we don’t find a Nash equilibrium, players may have an incentive to deviate

    • Search for stability is critical

  • Equilibrium policies may be randomized (mixed); they may not be deterministic


Example Stochastic Game

  • Goalie can move or stay

  • Shooter can move or shoot

  • Zero-sum game; a goal is worth 10 points to the shooter

  • Blocking is worth 5 points to the goalie



Q-learning in Stochastic Games

Nash-Q algorithm:

  • Q1(s, a1,a2) - Q value of agent1 for state S

  • Q2(s, a1, a2)  Q value of agent2 for state S

  • Optimal Q values:

  • Q1*(s, a1, a2) = R1(s, a1, a2) + λΣs’ P(s’|s,a1,a2)* V1(s’, π1*, π2*)

  • Q2*(s, a1, a2) = R2(s, a1, a2) + λΣs’ P(s’|s,a1,a2)* V2(s’, π1*, π2*)



Algorithm

Consider two agents:

  • Each agent maintains m Q-tables, where m = number of states

  • For each state, the Q-table has |A1| * |A2| entries

    • |A1| for my actions

    • |A2| for the other agent's actions

  • Each agent keeps Q-tables both for itself and for the other agent


Key Observation

  • State s’

  • Bimatrix representation: Q1[s’], Q2[s’]

    • Defines a game

    • Can find a mixed-strategy Nash equilibrium for this game

  • Mixed strategy Nash equilibrium:

    • Provides probability distribution for what action to execute


Multiagent Q-Learning

  • Initialize Q tables

  • Loop:

    • Choose action a1 based on π1(s), which is a mixed strategy Nash equilibrium of the game defined by (Q1(s), Q2(s))

    • Observe r1, r2, a2, s’

    • Update Q1(s) and Q2(s) using the equations defined below

  • Q1(s,a1,a2) ← Q1(s,a1,a2) + α (R(s) + λ Z1 – Q1(s,a1,a2))

  • Z1 = expected reward given N.E. in state s’

    • due to game Q1(s’),Q2(s’)
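A skeletal Python version of this loop, with the equilibrium computation abstracted behind a hypothetical helper solve_bimatrix_nash (any bimatrix game solver, e.g. Lemke-Howson, could fill this role); alpha, lam, and the data layout are assumptions for illustration:

```python
import random

def nash_q_update(Q1, Q2, s, a1, a2, r1, r2, s_next,
                  solve_bimatrix_nash, alpha=0.1, lam=0.9):
    """One Nash-Q style update of both Q-tables, from agent 1's point of view.

    Q1[s][a1][a2], Q2[s][a1][a2]: per-state bimatrix payoff tables (nested dicts)
    solve_bimatrix_nash(M1, M2) -> (pi1, pi2, v1, v2): a mixed-strategy N.E. of the
        bimatrix game (M1, M2) and its expected values (hypothetical helper)
    """
    # Z1, Z2: expected values under the N.E. of the game defined by Q1(s'), Q2(s')
    _, _, z1, z2 = solve_bimatrix_nash(Q1[s_next], Q2[s_next])
    Q1[s][a1][a2] += alpha * (r1 + lam * z1 - Q1[s][a1][a2])
    Q2[s][a1][a2] += alpha * (r2 + lam * z2 - Q2[s][a1][a2])

def choose_action(Q1, Q2, s, actions1, solve_bimatrix_nash):
    """Sample agent 1's action from its N.E. mixed strategy pi1(s)."""
    pi1, _, _, _ = solve_bimatrix_nash(Q1[s], Q2[s])
    return random.choices(actions1, weights=[pi1[a] for a in actions1], k=1)[0]
```

The slides only require that some mixed-strategy equilibrium of the bimatrix game (Q1(s'), Q2(s')) is computed at each step; the particular solver is left open.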



What Do We End Up With?

Agents converge to a Nash equilibrium


Towards Multiagent Learning

  • Learning as a “single agent” in a multiagent setting

    • Ignore other agents except for some property, such as their location

    • Ignore that other agents act intentionally and adapt

  • Advantages:

    • Simpler

    • Easily converges


Single Agent in Multiagent Setting

  • RoboCup Soccer Simulation League

  • Players use model-free reinforcement learning to intercept the ball

  • Learn “on line” during the game


Finding #1: Online Learning Specialized by Opponent

  • Same player position against two different RoboCup teams:

  • Player 1 (forward) against CMUnited and Andhill

  • Against CMUnited, player turns more aggressively


Finding #2: Online Learning Specialized by Role

Different players on the same team against the same opponent

Player 1 (forward) and Player 10 (fullback) against CMUnited


Lessons Learned

  • Surprise in tests against opponent teams:

  • Significant specialization of intercept with both role & opponent

  • Lesson: Transfer of experience or cross-training may be detrimental

