
### Software Multiagent Systems: CS543

Milind Tambe

University of Southern California

tambe@usc.edu

### Dimensions of Multiagent Learning

• Ignore others’ learning vs Model others’ learning

• Cooperative vs Competitive

• Cooperative

• Learn to coordinate with others

• Learning organizational roles

• Competitive (conflicting learning goals)

• Learning to play better against adversary

• Opponent modeling

• We will focus on reinforcement learning: Q-learning methods

• Q-learning

• Model-free vs Model-based

• Q-values: Q(s,a)

• Related to utility values:

U(s) = max_a Q(s, a)

• Following equation must hold at equilibrium:

Q(i, a) = R(i) + Σ_j P(j | i, a) · max_a′ Q(j, a′)

• Requires learning a model!

• Update equation for TD Q-learning is:

Q(i, a) ← Q(i, a) + α (R(i) + max_a′ Q(j, a′) − Q(i, a))

• What if α = 0?

• What if α = 1?

• Q-learning-agent(e) returns an action

• e is the percept

• Q: table of action values

• N: table of state-action frequencies

• a: the last action

• I: the previous state

1. J ← state[e]

2. N[I, a] ← N[I, a] + 1

3. Q[I, a] ← Q[I, a] + α (R(I) + max_a′ Q[J, a′] − Q[I, a])

4. I ← J

5. Return the action a′ that maximizes f(Q[J, a′], N[J, a′])

• Step 5: choosing the best action to take in state J

(a′ is the action chosen using f(Q[J, a′], N[J, a′]))
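Below is a minimal Python sketch of this agent loop. It is an illustration, not code from the course: the learning rate alpha, a discount gamma (the slide's update is undiscounted, i.e. gamma = 1), and the exploration function f are all assumed parameters supplied by the caller.

```python
from collections import defaultdict

class QLearningAgent:
    """Sketch of the TD Q-learning agent described above (illustrative, not the slides' code)."""

    def __init__(self, actions, explore_f, alpha=0.1, gamma=1.0):
        self.actions = actions        # available actions
        self.explore_f = explore_f    # f(Q[J, a'], N[J, a']) -> exploration value
        self.alpha = alpha            # learning rate (the slide's alpha)
        self.gamma = gamma            # discount; gamma = 1 matches the undiscounted update above
        self.Q = defaultdict(float)   # Q: table of action values, keyed by (state, action)
        self.N = defaultdict(int)     # N: table of state-action frequencies
        self.prev_state = None        # I, the previous state
        self.prev_action = None       # a, the last action

    def step(self, state, reward):
        """One percept: the new state J and the reward earned in the previous state I."""
        j = state                                           # Step 1: J <- state[e]
        if self.prev_state is not None:
            i, a = self.prev_state, self.prev_action
            self.N[(i, a)] += 1                             # Step 2: N[I, a] <- N[I, a] + 1
            best_next = max(self.Q[(j, ap)] for ap in self.actions)
            self.Q[(i, a)] += self.alpha * (                # Step 3: TD update
                reward + self.gamma * best_next - self.Q[(i, a)])
        self.prev_state = j                                 # Step 4: I <- J
        # Step 5: pick the action a' maximizing f(Q[J, a'], N[J, a'])
        action = max(self.actions,
                     key=lambda ap: self.explore_f(self.Q[(j, ap)], self.N[(j, ap)]))
        self.prev_action = action
        return action
```

An exploration function of the kind discussed next can be passed in as explore_f.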

• Suppose all Q-values are initially zero, and f(Q[J, a′], N[J, a′]) simply picks the action with the maximum Q[J, a′]

• Suppose after first exploration:

• Q[J,A1] = 10, Q[J,A2] = 0, Q[J,A3] = 0, Q[J,A4] = 0

What will happen? Is this a problem?

• Tradeoff: immediate good (exploit) vs long-term (explore)

• Continuous exploration vs. sticking to a well-known path

• Key question: How to balance the two?

• One approach:

• Give some weight to actions not tried often

• Avoid actions that are of low utility

• Giving “weight” to actions not tried very often:

f(Q[J, a′], N[J, a′]) = argmax_a′ G(Q[J, a′], N[J, a′])

• G returns:

• a very high optimistic reward “R” if N[J, a′] < N-VISITS

• otherwise Q[J, a′]

• What will be the result of such a function G?
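A sketch of one such G in Python; the constants R_PLUS and N_VISITS are illustrative, not values from the slides. It can be passed as explore_f to the agent sketch above.

```python
R_PLUS = 100.0   # assumed optimistic estimate of the best reachable reward ("very high R")
N_VISITS = 5     # assumed threshold: try each action at least this many times

def G(q_value, n):
    """Optimistic exploration value: looks maximally attractive until tried N_VISITS times."""
    return R_PLUS if n < N_VISITS else q_value
```

The effect is that every action in every state gets tried at least N_VISITS times (exploration), after which the agent falls back on the learned Q-values (exploitation), so it does not get stuck on the first action that happened to yield Q = 10 in the example above.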

### Two Frameworks for Multiagent “Learning”

DCOP: Exploration + Exploitation (paper to be posted on the web site) [Jain et al IJCAI’09]

Stochastic games: Multiagent learning to reach N.E. (in our readings)

[Figure: DCOP constraint graph over variables a1, a2, a3]

• Assign values to distributed variables

• Optimize total reward

• No central control

(with Lockheed ATL)

• Reward matrices unknown

• Algorithms explore environment

• Maximize total cumulative signal strength

• Changes how DCOP algorithms are evaluated

• Limited time horizon

• Cannot explore everything

• Horizon-aware DCOPs
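As a rough illustration (field names assumed, not from the paper), such a horizon-limited DCOP instance might be represented like this, with pairwise rewards filled in only as joint assignments are explored:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ExplorationDCOP:
    variables: List[str]            # e.g. agents/variables a1, a2, a3
    domains: Dict[str, List[int]]   # possible values (e.g. sensor positions)
    neighbors: Dict[str, List[str]] # constraint graph between variables
    horizon: int                    # limited number of exploration rounds
    # Pairwise rewards are unknown up front; entries appear only after exploration:
    explored: Dict[Tuple[str, str, int, int], float] = field(default_factory=dict)
```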


Assigning values to variables = Exploration

Exploration takes time (physical movement)

Limited time; full exploration impossible

• Based on MGM (maximum gain message)

• Hill climbing

• Communicate possible gain to neighbors

• Agent with max gain “moves”

• Proposed new algorithms:

• SE-optimistic: Unexplored domain values assumed to yield the ‘maximum’ reward

• Optimistic: Maximal Potential Gain Messaging

• Exploration maximized: look for the max value

• SE-mean: Unexplored domain values assumed to yield the ‘mean’ reward

• “Realistic”: Limits exploration, satisfied by the mean

• BE-backtrack: Lookahead given a distribution over reward functions

• Intelligent: Decision theoretic limit on exploration

[Figure: MGM example over variables a1, a2, a3, with possible gains of 15 and 20]

What if 20 is max reward?

SE-optimistic: how will it work?
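A rough sketch of one MGM round with SE-optimistic handling of unexplored values: each agent scores its domain values against its neighbors' current values, treating any unexplored pair as worth an assumed maximum reward MAX_REWARD (e.g. 20 in the figure). This is an illustration of the idea, not the paper's implementation, and it simplifies MGM by letting a single globally best agent move rather than comparing gains only among neighbors.

```python
MAX_REWARD = 20.0   # assumed upper bound on a single pairwise reward (SE-optimistic)

def local_utility(agent, value, neighbor_values, explored):
    """Sum of pairwise rewards with neighbors; unexplored pairs get the optimistic maximum."""
    return sum(explored.get((agent, nbr, value, nbr_val), MAX_REWARD)
               for nbr, nbr_val in neighbor_values.items())

def mgm_round(assignment, domains, neighbors, explored):
    """One simplified MGM step: compute each agent's best gain, let the max-gain agent move."""
    gains = {}
    for agent, current_value in assignment.items():
        nbr_vals = {n: assignment[n] for n in neighbors[agent]}
        current = local_utility(agent, current_value, nbr_vals, explored)
        best_val = max(domains[agent],
                       key=lambda v: local_utility(agent, v, nbr_vals, explored))
        best = local_utility(agent, best_val, nbr_vals, explored)
        gains[agent] = (best - current, best_val)       # gain communicated to neighbors
    mover = max(gains, key=lambda ag: gains[ag][0])     # agent with max gain "moves"
    if gains[mover][0] > 0:
        assignment[mover] = gains[mover][1]
    return assignment
```

Because every unexplored value scores MAX_REWARD, SE-optimistic agents keep preferring untried values until they have been explored, which maximizes exploration; once a value's true reward is observed it competes on its measured merit.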

• Agent decides: ‘explore’ or ‘backtrack’ to explored state

• Let Rb be the best reward among explored states

• The agent will explore for T units only if

• EU(Explore) > EU(backtrack)

• Expected Utility of Backtrack:

• EU(backtrack) = Rb*T

• Expected utility of explore is calculated as:

• P(x, n, t_e) is the first-order statistic: the probability that the maximum reward observed in t_e exploration trials is ‘x’

• E.U.(explore) is sum of three terms:

• utility of exploring

• utility of finding a better reward than current Rb

• utility of failing to find a better reward than current Rb
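A sketch of that decision rule, assuming integer-valued rewards drawn i.i.d. from a known distribution (given by its CDF) so the first-order statistic P(x, n, t_e) can be computed directly, and assuming the reward gathered while exploring equals the distribution's mean; the paper's exact formulation may differ.

```python
def prob_max_equals(x, cdf, te):
    """First-order statistic: probability that the best of te i.i.d. draws equals x."""
    return cdf(x) ** te - cdf(x - 1) ** te      # assumes integer-valued rewards

def eu_backtrack(rb, T):
    """Backtrack: exploit the best known reward Rb for all remaining T time units."""
    return rb * T

def eu_explore(rb, T, te, reward_values, cdf, mean_reward):
    """Explore for te units, then exploit the better of Rb and the best reward found."""
    eu = te * mean_reward                        # term 1: utility accumulated while exploring
    for x in reward_values:
        p = prob_max_equals(x, cdf, te)
        eu += p * max(rb, x) * (T - te)          # terms 2 and 3: found (x > Rb) or failed (x <= Rb)
    return eu

def should_explore(rb, T, te, reward_values, cdf, mean_reward):
    """BE-backtrack condition: explore only if EU(explore) > EU(backtrack)."""
    return eu_explore(rb, T, te, reward_values, cdf, mean_reward) > eu_backtrack(rb, T)
```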

Sample Results (Jain et al., IJCAI’09)

• Decision theoretic approach to exploration

• Interleave with DCOPs

### Towards Multiagent Learning: Stochastic Game

Generalize distributed POMDPs

Different payoffs for each player, not a common payoff

Focus on two person stochastic game

Learning algorithms for stochastic games

• States: S

• Action sets for each player: A1, A2

• P: transition probabilities P(s′ | s, a1, a2)

• R or Reward: two separate rewards:

• R1(s, a1,a2), depends on actions of all agents

• R2(s, a1, a2), depends on actions of all agents

• If R1(s,a1,a2) + R2(s,a1,a2) = 0, then zero sum game

• State observable (MDP like)

• Each player: maximize its own (discounted) sum of rewards

[Figure: transition diagram: from state s0, a joint action (a1, a2) leads to s1 or s2 with probabilities P(s1 | s0, a1, a2) and P(s2 | s0, a1, a2); each state carries separate rewards R1(s) and R2(s) for the two players]

Reward function depends on the state!
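A minimal sketch of this model as a data structure, following the components listed above (names are illustrative):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action = str, str

@dataclass
class TwoPlayerStochasticGame:
    states: List[State]
    actions1: List[Action]   # A1
    actions2: List[Action]   # A2
    # P(s' | s, a1, a2): distribution over next states for each joint action
    transition: Dict[Tuple[State, Action, Action], Dict[State, float]]
    # R1(s, a1, a2) and R2(s, a1, a2): a separate reward for each player
    reward1: Dict[Tuple[State, Action, Action], float]
    reward2: Dict[Tuple[State, Action, Action], float]

    def is_zero_sum(self, tol: float = 1e-9) -> bool:
        """Zero-sum iff R1(s, a1, a2) + R2(s, a1, a2) = 0 everywhere."""
        return all(abs(self.reward1[k] + self.reward2[k]) <= tol for k in self.reward1)
```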

• How are repeated games related to stochastic games?

• Strategies = Policies

• Since rewards differ for each agent, their expected values differ as well

• v1(s, π1, π2) gives us the expected value for agent1 in state s, given that agents pursued policies π1, π2

• Nash equilibrium in stochastic game:

• pair of strategies (π1*, π2*) such that for all states s

v1(s, π1*, π2*) >= v1(s, π1, π2*)

And

v2(s, π1*, π2*) >= v2(s, π1*, π2)

• In Stochastic games, we focus on policies that attain Nash equilibrium

• If we don’t find Nash equilibrium, then players may have an incentive to deviate

• Search for stability is critical

• Policies may be randomized; may not be deterministic

• Goalie can move or stay

• Shooter can move or shoot

• Zero-sum game; a goal is worth 10 points to the shooter

• Blocking is worth 5 points to the goalie

### Work out example

Nash-Q algorithm:

• Q1(s, a1, a2): Q-value of agent 1 for state s

• Q2(s, a1, a2): Q-value of agent 2 for state s

• Optimal Q values:

• Q1*(s, a1, a2) = R1(s, a1, a2) + λ Σ_s′ P(s′ | s, a1, a2) · V1(s′, π1*, π2*)

• Q2*(s, a1, a2) = R2(s, a1, a2) + λ Σ_s′ P(s′ | s, a1, a2) · V2(s′, π1*, π2*)

### Example

Consider two agents:

• Each agent maintains m Q-tables, m = number of states

• For each state, maintain |A1| × |A2| entries in the Q-table

• |A1| for my actions

• |A2| for the other agent’s actions

• Q-tables for me and for the other agent

• State s’

• Bimatrix representation: Q1[s’], Q2[s’]

• Defines a game

• Can find a mixed-strategy Nash equilibrium for this game

• Mixed strategy Nash equilibrium:

• Provides probability distribution for what action to execute

• Initialize Q tables

• Loop:

• Choose action a1 based on π1(s), which is a mixed strategy Nash equilibrium of the game defined by (Q1(s), Q2(s))

• Observe r1, r2, a2, s’

• Update Q1(s) and Q2(s) using the equations defined below

• Q1(s, a1, a2) ← Q1(s, a1, a2) + α (r1 + λ·Z1 − Q1(s, a1, a2))

• Z1 = agent 1’s expected reward under the Nash equilibrium of the game (Q1(s′), Q2(s′)) at state s′
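A rough sketch of that update in Python. It assumes the caller supplies a bimatrix-game solver solve_nash(M1, M2) returning mixed strategies (pi1, pi2) for the stage game; such a solver is not shown here, and the names and default constants are illustrative.

```python
import numpy as np

def nash_q_update(Q1, Q2, s, a1, a2, r1, r2, s_next, solve_nash, alpha=0.1, lam=0.9):
    """One Nash-Q step. Q1[s] and Q2[s] are |A1| x |A2| arrays: the bimatrix game at state s."""
    # Mixed-strategy Nash equilibrium of the stage game (Q1[s'], Q2[s']) at the next state.
    pi1, pi2 = solve_nash(Q1[s_next], Q2[s_next])
    # Z1, Z2: each agent's expected reward under that equilibrium.
    z1 = pi1 @ Q1[s_next] @ pi2
    z2 = pi1 @ Q2[s_next] @ pi2
    # Q1(s,a1,a2) <- Q1(s,a1,a2) + alpha * (r1 + lambda * Z1 - Q1(s,a1,a2)), likewise for Q2.
    Q1[s][a1, a2] += alpha * (r1 + lam * z1 - Q1[s][a1, a2])
    Q2[s][a1, a2] += alpha * (r2 + lam * z2 - Q2[s][a1, a2])
    return pi1, pi2   # can be used to sample the next actions

# Q-tables: one |A1| x |A2| bimatrix per state, as in the Example above, e.g.
# Q1 = {s: np.zeros((len(A1), len(A2))) for s in states}  (and similarly for Q2).
```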

### What do we end up with:

Agents converging to a Nash equilibrium

• Learning “single agent” in a multiagent setting

• Ignore other agents except for some property like location

• Ignore that other agents act intentionally and adapt

• Simpler

• Easily converges

• RoboCup Soccer Simulation League

• Players use model-free reinforcement learning to intercept the ball

• Learn “on line” during the game

• Same player position against two different RoboCup teams:

• Player 1 (forward) against CMUnited and Andhill

• Against CMUnited, player turns more aggressively

Finding #2: Online Learning Specialized by Role

Same team against different players

Player 1 (forward) and Player 10 (fullback) against CMUnited

• Surprise in tests against opponent teams:

• Significant specialization of intercept with both role & opponent

• Lesson: Transfer of experience or cross-training may be detrimental