
Software Multiagent Systems: CS543

Milind Tambe

University of Southern California

[email protected]

Dimensions of Multiagent Learning
  • Ignore others’ learning vs Model others’ learning
  • Cooperative vs Competitive
    • Cooperative
      • Learn to coordinate with others
      • Learning organizational roles
    • Competitive (conflicting learning goals)
      • Learning to play better against an adversary
      • Opponent modeling
  • We will focus on reinforcement learning: Q-learning methods
Some Terminology
  • Q-learning
  • Model-free vs Model-based
Q-learning
  • Q-values: Q(s,a)
  • Related to utility values:
    • U(s) = max_a Q(s,a)
  • Following equation must hold at equilibrium:

Q(i,a) = R(i) + Σ_j P(j|i,a) · max_{a’} Q(j,a’)

  • Requires learning a model!
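
To make the equilibrium relation concrete, here is a minimal Python sketch (not from the slides) that computes the fixed point of this equation for a small MDP whose model is known. The state/action sizes, the random transition model, and the discount factor gamma are assumptions; the slide's equation omits a discount.

```python
import numpy as np

# Q-value iteration: sweep the backup
#   Q(i,a) = R(i) + gamma * sum_j P(j|i,a) * max_a' Q(j,a')
# until it (approximately) stops changing. Requires a known model P, R.

n_states, n_actions = 3, 2
R = np.array([0.0, 1.0, -1.0])                   # R(i): reward per state (assumed)
P = np.random.dirichlet(np.ones(n_states),       # P(j | i, a): assumed known model
                        size=(n_states, n_actions))
gamma = 0.9                                      # added assumption (not on the slide)

Q = np.zeros((n_states, n_actions))
for _ in range(200):
    Q = R[:, None] + gamma * P @ Q.max(axis=1)   # Bellman backup over all (i, a)
```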

TD Q-learning
  • Update equation for TD Q-learning is:

Q(i,a) ← Q(i,a) + α (R(i) + max_{a’} Q(j,a’) − Q(i,a))

  • What if α = 0?
  • What if α = 1?
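
A minimal sketch of this update as code, assuming a tabular Q array indexed by (state, action). Following the slide, no discount factor is applied to max_{a'} Q(j,a'); many formulations include one.

```python
import numpy as np

def td_q_update(Q, i, a, r, j, alpha):
    """One tabular TD Q-learning backup:
    Q(i,a) <- Q(i,a) + alpha * (R(i) + max_a' Q(j,a') - Q(i,a))."""
    td_error = r + Q[j].max() - Q[i, a]
    Q[i, a] += alpha * td_error
    return Q

# Usage: Q is an (n_states, n_actions) array of Q-values.
Q = np.zeros((5, 2))
td_q_update(Q, i=0, a=1, r=1.0, j=3, alpha=0.5)
```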

Q Learning Agent
  • Q-Learning-Agent(e) returns an action
    • e is the percept
    • Q: table of action values
    • N: table of state-action frequencies
    • a: the last action taken
    • I: the previous state
  • Steps:
    1. J ← STATE[e]
    2. N[I,a] ← N[I,a] + 1
    3. Q[I,a] ← Q[I,a] + α (R(I) + max_{a’} Q[J,a’] − Q[I,a])
    4. I ← J
    5. Return the action a’ that maximizes f(Q[J,a’], N[J,a’])
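
Below is a hedged, runnable Python sketch of this agent. The environment interface (a percept delivering the current state together with its reward), the learning rate, and the exploration constants R_plus and N_visits are assumptions; f is the optimistic exploration function discussed on the later "Exploration" slide.

```python
import numpy as np

class QLearningAgent:
    """Tabular Q-learning agent sketch (integer states and actions assumed)."""

    def __init__(self, n_states, n_actions, alpha=0.5, R_plus=10.0, N_visits=5):
        self.Q = np.zeros((n_states, n_actions))   # Q: table of action values
        self.N = np.zeros((n_states, n_actions))   # N: state-action frequencies
        self.alpha = alpha
        self.R_plus = R_plus                       # optimistic reward estimate (assumed)
        self.N_visits = N_visits                   # exploration threshold (assumed)
        self.prev_state = None                     # I, the previous state
        self.prev_action = None                    # a, the last action taken
        self.prev_reward = 0.0                     # R(I)

    def f(self, q, n):
        """Exploration function: optimistic value for rarely tried actions."""
        return self.R_plus if n < self.N_visits else q

    def step(self, state, reward):
        """Process percept (state J, reward R(J)); return the next action."""
        J = state
        if self.prev_state is not None:
            I, a = self.prev_state, self.prev_action
            self.N[I, a] += 1
            self.Q[I, a] += self.alpha * (
                self.prev_reward + self.Q[J].max() - self.Q[I, a])
        # step 5: choose the action a' maximizing f(Q[J,a'], N[J,a'])
        values = [self.f(self.Q[J, ap], self.N[J, ap])
                  for ap in range(self.Q.shape[1])]
        action = int(np.argmax(values))
        self.prev_state, self.prev_action, self.prev_reward = J, action, reward
        return action
```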

Choosing an Action….
  • Step 5: choosing the best action to take in state J

(a’ is the action chosen using f(Q(a’, j), N[a’, j]))

  • Suppose all Q values initially zero, and f(Q(a’, j), N[a’, j]) chooses max Q(a’, j)
  • Suppose after first exploration:
    • Q[J,A1] = 10, Q[J,A2] = 0, Q[J,A3] = 0, Q[J,A4] = 0

What will happen? Is this a problem?

Exploration vs Exploitation
  • Tradeoff: immediate good (exploit) vs long-term good (explore)
    • Keep exploring continuously vs stick to a well-known path
  • Key question: how to balance the two?
  • One approach:
    • Give some weight to actions not tried often
    • Avoid actions believed to be of low utility
Exploration
  • Giving “weight” to actions not tried very often:

f(Q(a’, j), N[a’, j]) chooses a’ = argmax_{a’} G(Q(a’, j), N[a’, j])

  • G returns:
    • a very high value “R” if N(a’, j) < N-VISITS
    • otherwise Q(a’, j)
  • What will be the result of such a function G?
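
A minimal sketch of such an exploration function; N_VISITS and the "very high R" value R_PLUS are assumed constants.

```python
N_VISITS = 5       # try every action at least this many times (assumed threshold)
R_PLUS = 1e6       # optimistic stand-in for the best possible reward (assumed)

def G(q_value, n_tried):
    """Pretend rarely tried actions are highly rewarding."""
    return R_PLUS if n_tried < N_VISITS else q_value

def choose_action(Q, N, j, actions):
    """f: pick the a' in state j that maximizes G(Q(a', j), N[a', j])."""
    return max(actions, key=lambda a_prime: G(Q[(j, a_prime)], N[(j, a_prime)]))
```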


Two Frameworks for Multiagent “Learning”

DCOP: Exploration + Exploitation (paper to be posted on the web site) [Jain et al IJCAI’09]

Stochastic games: Multiagent learning to reach N.E. (in our readings)

DCOP Framework

[Figure: constraint graph with agents a1, a2, a3]

  • Assign values to distributed variables
  • Optimize total reward
  • No central control
New Challenges
  • Reward matrices unknown
    • Algorithms must explore the environment
  • Maximize total cumulative signal strength
    • Changes how DCOP algorithms are measured
  • Limited time horizon
    • Cannot explore everything
    • Horizon-aware DCOPs
DCOP Framework: Reward Matrix Unknown

[Figure: constraint graph with agents a1, a2, a3; reward matrices unknown]

Assigning values to variables = Exploration

Exploration takes time (physical movement)

Limited time; full exploration impossible

Three New Algorithms
  • Based on MGM (maximum gain message)
    • Hill climbing
    • Communicate possible gain to neighbors
    • Agent with max gain “moves”
  • Proposed new algorithms:
    • SE-optimistic: Unexplored domain values yield ‘maximum’
      • Optimistic: Maximal Potential Gain Messaging
      • Exploration maximized: look for max value
    • SE-mean: Unexplored domain values yield ‘mean’ reward
      • “Realistic”: Limit exploration, satisfied by mean
    • BE-backtrack: Lookahead given a distribution over reward functions
      • Intelligent: decision-theoretic limit on exploration

[Figure: example constraint graph with agents a1, a2, a3 and possible gains of 15 and 20]
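
For concreteness, here is a hedged sketch of one synchronous MGM round on a binary-constraint DCOP. The reward-table representation and the tie-break by agent id are assumptions; in the SE-optimistic / SE-mean variants, unknown reward entries would simply be filled with the maximum or mean estimate before gains are computed.

```python
def local_utility(agent, value, assignment, neighbors, rewards):
    """Sum of pairwise rewards between `agent` (taking `value`) and its neighbors.
    `rewards` must contain a table for both orderings of each constrained pair."""
    return sum(rewards[(agent, nbr)][(value, assignment[nbr])]
               for nbr in neighbors[agent])

def mgm_round(assignment, domains, neighbors, rewards):
    """One MGM step: each agent computes its best possible gain, tells its
    neighbors, and only a local winner actually changes its value."""
    proposals, gains = {}, {}
    for agent in assignment:
        current = local_utility(agent, assignment[agent], assignment,
                                neighbors, rewards)
        best_value, best_util = max(
            ((v, local_utility(agent, v, assignment, neighbors, rewards))
             for v in domains[agent]),
            key=lambda vu: vu[1])
        proposals[agent], gains[agent] = best_value, best_util - current
    new_assignment = dict(assignment)
    for agent in assignment:
        # move only if this agent's gain beats every neighbor's (ties broken by id)
        if gains[agent] > 0 and all((gains[agent], str(agent)) > (gains[nbr], str(nbr))
                                    for nbr in neighbors[agent]):
            new_assignment[agent] = proposals[agent]
    return new_assignment

# Tiny usage example with assumed values (three agents in a chain).
neighbors = {"a1": ["a2"], "a2": ["a1", "a3"], "a3": ["a2"]}
domains = {a: [0, 1] for a in neighbors}
rewards = {("a1", "a2"): {(0, 0): 5, (0, 1): 0, (1, 0): 0, (1, 1): 20},
           ("a2", "a3"): {(0, 0): 15, (0, 1): 0, (1, 0): 0, (1, 1): 10}}
rewards[("a2", "a1")] = {(v2, v1): r for (v1, v2), r in rewards[("a1", "a2")].items()}
rewards[("a3", "a2")] = {(v3, v2): r for (v2, v3), r in rewards[("a2", "a3")].items()}
assignment = mgm_round({"a1": 1, "a2": 0, "a3": 0}, domains, neighbors, rewards)
```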

DCOP Framework: Reward Matrix Unknown

[Figure: constraint graph with agents a1, a2, a3]

What if 20 is max reward?

SE-optimistic: how will it work?

Lookahead
  • Agent decides: ‘explore’ or ‘backtrack’ to an already explored state
  • Let Rb be the best reward among explored states
  • The agent will explore only if EU(explore) > EU(backtrack)
  • Expected utility of backtrack (exploit Rb for the remaining T time units):
    • EU(backtrack) = Rb * T
Lookahead
  • Expected utility of explore is calculated as follows:
    • P(x, n, te) is the first-order statistic: the probability that the maximum reward found in te exploration trials is x
    • EU(explore) is the sum of three terms:
      • utility accumulated while exploring
      • utility if a reward better than the current Rb is found
      • utility if no reward better than the current Rb is found
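
To make the shape of this decision rule concrete, here is a heavily simplified sketch, not the exact expression from Jain et al.: it assumes unexplored rewards are i.i.d. uniform on [0, r_max] and estimates EU(explore) by Monte Carlo (explore for te steps, then exploit the better of Rb and the best reward found for the remaining time).

```python
import random

def eu_backtrack(Rb, T):
    """Backtrack now and exploit the best known reward Rb for all T remaining steps."""
    return Rb * T

def eu_explore(Rb, T, te, r_max, samples=10_000):
    """Monte Carlo estimate under assumed uniform rewards: explore for te steps,
    then exploit max(Rb, best reward found) for the remaining T - te steps."""
    mean_reward = r_max / 2.0          # expected reward per step while exploring
    total = 0.0
    for _ in range(samples):
        best_found = max(random.uniform(0, r_max) for _ in range(te))
        total += mean_reward * te + max(Rb, best_found) * (T - te)
    return total / samples

# Explore only if it looks better in expectation (numbers are assumed).
Rb, T, te, r_max = 12.0, 20, 3, 20.0
decision = "explore" if eu_explore(Rb, T, te, r_max) > eu_backtrack(Rb, T) else "backtrack"
```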
Sample Results (Jain et al., IJCAI’09)
  • Decision theoretic approach to exploration
  • Interleave with DCOPs

Towards Multiagent Learning: Stochastic Game

Generalize distributed POMDPs

Different payoffs for each player, not a common payoff

Focus on two person stochastic game

Learning algorithms for stochastic games

Stochastic 2-player Game
  • States: S
  • Action sets for each player: A1, A2
  • P transition probabilities: P(s’| s, a1, a2)
  • R or Reward: two separate rewards:
    • R1(s, a1,a2), depends on actions of all agents
    • R2(s, a1, a2), depends on actions of all agents
    • If R1(s,a1,a2) + R2(s,a1,a2) = 0, then zero sum game
  • State observable (MDP like)
  • Each player: maximize its own (discounted) sum of rewards
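
A minimal Python container for these ingredients (illustrative names and layout, not an API from the readings):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class StochasticGame:
    """Two-player stochastic game: states, action sets, transitions, and
    one reward function per player."""
    states: List[State]
    actions1: List[Action]                                   # A1
    actions2: List[Action]                                   # A2
    P: Dict[Tuple[State, Action, Action], Dict[State, float]]  # P(s' | s, a1, a2)
    R1: Dict[Tuple[State, Action, Action], float]            # R1(s, a1, a2)
    R2: Dict[Tuple[State, Action, Action], float]            # R2(s, a1, a2)

    def is_zero_sum(self, eps: float = 1e-9) -> bool:
        """Zero-sum iff R1 + R2 = 0 for every (s, a1, a2)."""
        return all(abs(self.R1[k] + self.R2[k]) < eps for k in self.R1)
```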
Stochastic Game

[Figure: transition diagram — from s0, the joint action (a1, a2) leads to s1 with probability P(s1|s0,a1,a2) and to s2 with probability P(s2|s0,a1,a2); each state s carries its own rewards R1(s), R2(s)]

Reward function depends on the state!

Stochastic Game
  • How are repeated games related to stochastic games?
Stochastic Game
  • Strategies = Policies
  • Since rewards differ for each agent, expected values differ as well
  • v1(s, π1, π2) gives the expected value for agent 1 in state s, given that the agents pursue policies π1, π2
  • Nash equilibrium in a stochastic game:
    • a pair of strategies (π1*, π2*) such that for all states s and all alternative policies π1, π2:

v1(s, π1*, π2*) >= v1(s, π1, π2*)

and

v2(s, π1*, π2*) >= v2(s, π1*, π2)

Nash Equilibrium Policies
  • In Stochastic games, we focus on policies that attain Nash equilibrium
    • If we don’t find Nash equilibrium, then players may have an incentive to deviate
    • Search for stability is critical
  • Policies may be randomized; may not be deterministic
Example Stochastic Game
  • Goalie can move or stay
  • Shooter can move or shoot
  • Zero-sum game; a goal is worth 10 points to the shooter
  • Blocking is worth 5 points to the goalie
Q-learning in Stochastic Games

Nash-Q algorithm:

  • Q1(s, a1, a2): Q-value of agent 1 for state s
  • Q2(s, a1, a2): Q-value of agent 2 for state s
  • Optimal Q-values:
  • Q1*(s, a1, a2) = R1(s, a1, a2) + λ Σ_{s’} P(s’|s, a1, a2) · V1(s’, π1*, π2*)
  • Q2*(s, a1, a2) = R2(s, a1, a2) + λ Σ_{s’} P(s’|s, a1, a2) · V2(s’, π1*, π2*)
Algorithm

Consider two agents:

  • Each agent maintains m Q-tables, m = number of states
  • For each state, maintain |A1|*|A2| number of entries in the Q-table
    • |A1| for my actions
    • |A2| for other agents’ actions
  • Q-tables for me and for the other agent
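
A small sketch of this bookkeeping in the tabular case, with assumed sizes:

```python
import numpy as np

# Each agent keeps a Q-table for itself *and* one for the other agent,
# with one |A1| x |A2| matrix per state.
m, n_a1, n_a2 = 4, 2, 3          # number of states, |A1|, |A2| (assumed sizes)

Q1 = np.zeros((m, n_a1, n_a2))   # my Q-values:           Q1[s, a1, a2]
Q2 = np.zeros((m, n_a1, n_a2))   # other agent's Q-values: Q2[s, a1, a2]

# Q1[s] and Q2[s] together form the bimatrix stage game at state s.
```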
Key Observation
  • At state s’
  • Bimatrix representation: Q1[s’], Q2[s’]
    • Defines a (stage) game
    • Can find a mixed-strategy Nash equilibrium for this game
  • Mixed-strategy Nash equilibrium:
    • Provides a probability distribution over which action to execute
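
As a concrete but deliberately narrow illustration, the sketch below finds the mixed-strategy equilibrium of a 2x2 zero-sum stage game (where agent 2's matrix is the negative of agent 1's), which covers two-action cases like the shooter/goalie example; general-sum bimatrix games need a general solver such as Lemke-Howson, not shown here. The example payoffs are assumptions, not numbers from the lecture.

```python
def solve_zero_sum_2x2(A):
    """A[i][j]: payoff to agent 1 (the maximizer) when it plays i and agent 2
    plays j. Returns (p1, p2, value) with p1, p2 mixed strategies."""
    # 1) pure saddle point: an entry that is the minimum of its row and the
    #    maximum of its column is a pure-strategy equilibrium.
    for i in range(2):
        for j in range(2):
            if A[i][j] == min(A[i]) and A[i][j] == max(A[0][j], A[1][j]):
                p1 = [1.0 if k == i else 0.0 for k in range(2)]
                p2 = [1.0 if k == j else 0.0 for k in range(2)]
                return p1, p2, A[i][j]
    # 2) otherwise the equilibrium is fully mixed; use the 2x2 closed form.
    denom = A[0][0] - A[0][1] - A[1][0] + A[1][1]
    p = (A[1][1] - A[1][0]) / denom      # prob. agent 1 plays action 0
    q = (A[1][1] - A[0][1]) / denom      # prob. agent 2 plays action 0
    value = (A[0][0] * A[1][1] - A[0][1] * A[1][0]) / denom
    return [p, 1 - p], [q, 1 - q], value

# Example with assumed payoffs (matching-pennies-like): both players mix 50/50.
p1, p2, v = solve_zero_sum_2x2([[1.0, -1.0], [-1.0, 1.0]])
```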
Multiagent Q-Learning
  • Initialize Q-tables
  • Loop:
    • Choose action a1 based on π1(s), which is a mixed-strategy Nash equilibrium of the game defined by (Q1(s), Q2(s))
    • Observe r1, r2, a2, s’
    • Update Q1(s) and Q2(s) using the update below (Q2 analogously)
  • Q1(s, a1, a2) ← Q1(s, a1, a2) + α (r1 + λ Z1 − Q1(s, a1, a2))
  • Z1 = expected reward for agent 1 under the Nash equilibrium at state s’
    • i.e., of the stage game defined by Q1(s’), Q2(s’)
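
A hedged sketch of this update step for agent 1 (agent 2 is analogous). The equilibrium strategies pi1_next, pi2_next for state s’ are assumed to come from a stage-game solver, e.g. the 2x2 zero-sum sketch above; alpha and lam (the λ on the slide) are assumed parameters.

```python
import numpy as np

def nash_q_update(Q1, s, a1, a2, r1, s_next, pi1_next, pi2_next,
                  alpha=0.1, lam=0.9):
    """Q1 has shape (n_states, |A1|, |A2|); pi1_next, pi2_next are the mixed
    equilibrium strategies of the stage game (Q1[s_next], Q2[s_next])."""
    # Z1: agent 1's expected value of the stage game at s' under the
    # equilibrium strategies, i.e. pi1' . Q1[s'] . pi2'
    Z1 = pi1_next @ Q1[s_next] @ pi2_next
    Q1[s, a1, a2] += alpha * (r1 + lam * Z1 - Q1[s, a1, a2])
    return Q1

# Usage with assumed sizes: 4 states, 2 actions per agent.
Q1 = np.zeros((4, 2, 2))
nash_q_update(Q1, s=0, a1=1, a2=0, r1=5.0, s_next=2,
              pi1_next=np.array([0.5, 0.5]), pi2_next=np.array([0.5, 0.5]))
```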

What do we end up with?

Agents converging to the Nash equilibrium

Towards Multiagent Learning
  • Learning as a “single agent” in a multiagent setting
    • Ignore other agents except for some property, like location
    • Ignore that other agents act intentionally and adapt
  • Advantages:
    • Simpler
    • Converges more easily
Single Agent in Multiagent Setting
  • RoboCup Soccer Simulation League
  • Players use model-free reinforcement learning to intercept the ball
  • Learn “on line” during the game
Finding #1: Online Learning Specialized by Opponent
  • Same player position against two different RoboCup teams:
  • Player 1 (forward) against CMUnited and Andhill
  • Against CMUnited, player turns more aggressively
Finding #2: Online Learning Specialized by Role

Same team against different players

Player 1 (forward) and Player 10 (fullback) against CMUnited

Lessons Learned
  • Surprise in tests against opponent teams:
  • Significant specialization of intercept with both role & opponent
  • Lesson: Transfer of experience or cross-training may be detrimental