Software Multiagent Systems: CS543


# Software Multiagent Systems: CS543


Presentation Transcript

### Software Multiagent Systems: CS543

Milind Tambe

University of Southern California

[email protected]

Dimensions of Multiagent Learning
• Ignore others’ learning vs Model others’ learning
• Cooperative vs Competitive
• Cooperative
• Learn to coordinate with others
• Learning organizational roles
• Competitive (conflicting learning goals)
• Learning to play better against adversary
• Opponent modeling
• We will focus on reinforcement learning: Q-learning methods
Some Terminology
• Q-learning
• Model-free vs Model-based
Q-learning
• Q-values: Q(s,a)
• Related to utility values:
• $U(s) = \max_{a} Q(s,a)$
• Following equation must hold at equilibrium:

$$Q(i,a) = R(i) + \sum_{j} P(j \mid i,a)\, \max_{a'} Q(j,a')$$

• Requires learning a model!
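As a rough illustration of why the equilibrium equation needs a model, the sketch below iterates it directly when the transition probabilities P(j|i,a) and rewards R(i) are already known. The function name, the dictionary layout, and the tiny two-state example are assumptions made for illustration; they are not from the slides.

```python
def q_value_iteration(states, actions, R, P, sweeps=100):
    """Repeatedly apply Q(i,a) = R(i) + sum_j P(j|i,a) * max_a' Q(j,a').

    R maps state -> reward. P maps (state, action) -> {next_state: probability}.
    A missing (state, action) entry is treated as terminal (no successors).
    """
    Q = {(i, a): 0.0 for i in states for a in actions}
    for _ in range(sweeps):
        Q = {
            (i, a): R[i] + sum(
                p * max(Q[(j, b)] for b in actions)
                for j, p in P.get((i, a), {}).items()
            )
            for i in states
            for a in actions
        }
    return Q


# Hypothetical model: "go" reaches the terminal goal, "wait" stays in start.
states = ["start", "goal"]
actions = ["go", "wait"]
R = {"start": -1, "goal": 10}
P = {("start", "go"): {"goal": 1.0}, ("start", "wait"): {"start": 1.0}}

Q = q_value_iteration(states, actions, R, P)
print(Q[("start", "go")], Q[("start", "wait")])
```

Without P there is nothing to sum over, which is the gap that the model-free TD update is meant to close.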

TD Q-learning
• Update equation for TD Q-learning is:

$$Q(i,a) \leftarrow Q(i,a) + \alpha \left( R(i) + \max_{a'} Q(j,a') - Q(i,a) \right)$$

• What if $\alpha = 0$?
• What if $\alpha = 1$?
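The TD update can be tried out on a small table to see what the two extreme learning rates do. This is a minimal sketch; the function name `td_q_update` and the example values are assumptions, and no discount factor is used because the slide's equation has none.

```python
def td_q_update(Q, i, a, reward, j, actions, alpha):
    """One TD step: Q(i,a) <- Q(i,a) + alpha * (R(i) + max_a' Q(j,a') - Q(i,a))."""
    target = reward + max(Q[(j, b)] for b in actions)
    Q[(i, a)] += alpha * (target - Q[(i, a)])
    return Q[(i, a)]


actions = ["a"]
for alpha in (0.0, 0.5, 1.0):
    Q = {("s", "a"): 2.0, ("t", "a"): 5.0}  # hypothetical starting values
    new_value = td_q_update(Q, "s", "a", reward=1.0, j="t", actions=actions, alpha=alpha)
    print(alpha, new_value)
```

With alpha = 0 the old estimate is kept unchanged, so nothing is learned; with alpha = 1 the old estimate is discarded and replaced by the latest sample target.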

Q Learning Agent
• Q-learning-agent (e) returns an action
• e is the percept
• Q: table of action values
• N: table of state-action frequencies
• a, the last action
• I, previous state
• J ← state[e]
• N[I,a] ← N[I,a] + 1
• Q[I,a] ← Q[I,a] + α (R(I) + max_{a'} Q[J,a'] - Q[I,a])
• I ← J
• Return (action a' that maximizes f(Q[J,a'], N[J,a']))
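The agent's steps can be sketched as a small class. This is one possible reading of the slide, not a definitive implementation: the class name, the `step(state, reward)` interface, storing the previous state's reward as `self.r`, and the default greedy `f` are all assumptions.

```python
from collections import defaultdict


class QLearningAgent:
    """Tabular agent following the slide's steps (names are illustrative)."""

    def __init__(self, actions, alpha=0.5, f=None):
        self.actions = list(actions)
        self.alpha = alpha               # learning rate (assumed value)
        self.Q = defaultdict(float)      # Q[(state, action)], starts at 0
        self.N = defaultdict(int)        # N[(state, action)] visit counts
        self.I = None                    # previous state
        self.a = None                    # last action
        self.r = 0.0                     # reward seen in the previous state, R(I)
        self.f = f or (lambda q, n: q)   # default f ignores N: pure greedy

    def step(self, state, reward):
        J = state                                        # J <- state[e]
        if self.I is not None:
            key = (self.I, self.a)
            self.N[key] += 1                             # N[I,a] <- N[I,a] + 1
            best_next = max(self.Q[(J, b)] for b in self.actions)
            self.Q[key] += self.alpha * (self.r + best_next - self.Q[key])
        self.I, self.r = J, reward                       # I <- J
        # Return the action a' that maximizes f(Q[J,a'], N[J,a']).
        self.a = max(self.actions,
                     key=lambda b: self.f(self.Q[(J, b)], self.N[(J, b)]))
        return self.a


agent = QLearningAgent(["left", "right"], alpha=1.0)
for state, reward in [("s0", 0.0), ("s1", 10.0), ("s2", 0.0)]:
    agent.step(state, reward)
print(agent.Q[("s1", "left")], agent.N[("s0", "left")])
```

Passing a different `f` is where an exploration policy would plug in.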

Choosing an Action….
• Step 5: choosing the best action to take in state J

(a' is the action chosen using f(Q(j, a'), N[j, a']))

• Suppose all Q values are initially zero, and f(Q(j, a'), N[j, a']) chooses max Q(j, a')
• Suppose after first exploration:
• Q[J,A1] = 10, Q[J,A2] = 0, Q[J,A3] = 0, Q[J,A4] = 0

What will happen? Is this a problem?

Exploration vs Exploitation
• Tradeoff: immediate good (exploit) vs long-term (explore)
• Continuous exploration vs stuck to well-known path
• Key question: How to balance the two?
• One approach:
• Give some weight to actions not tried often
• Avoid actions that are of low utility
Exploration
• Giving “weight” to actions not tried very often

$$f(Q(j,a'), N[j,a']) = \arg\max_{a'} G(Q(j,a'), N[j,a'])$$

• G returns:
• very high “R” if N[j,a'] < N-VISITS
• otherwise Q(j,a')
• What will be the result of such a function G?
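A minimal sketch of the exploration function G follows, reusing the Q values from the earlier A1 to A4 example. The threshold `n_visits` (the slide's N-VISITS) and the optimistic value `optimistic_reward` (the slide's very high “R”) are given arbitrary numbers here, which are assumptions.

```python
def exploration_value(q, n, n_visits=5, optimistic_reward=100.0):
    """G: pretend an under-tried action is very good, else trust its Q value."""
    return optimistic_reward if n < n_visits else q


def choose_action(Q, N, j, actions, n_visits=5, optimistic_reward=100.0):
    """f: pick the action in state j with the largest G(Q(j,a'), N[j,a'])."""
    return max(
        actions,
        key=lambda a: exploration_value(
            Q.get((j, a), 0.0), N.get((j, a), 0), n_visits, optimistic_reward
        ),
    )


actions = ["A1", "A2", "A3", "A4"]
Q = {("J", "A1"): 10.0}            # the other Q values are still 0
N = {("J", "A1"): 5}               # A1 has been tried 5 times, the rest never

print(choose_action(Q, N, "J", actions))   # an untried action wins over A1

N = {("J", a): 5 for a in actions}         # once every action is tried enough
print(choose_action(Q, N, "J", actions))   # the choice falls back to max Q
```

Under these assumptions each action is tried at least `n_visits` times in a state before the agent settles on the action with the highest Q value.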

### Two Frameworks for Multiagent “Learning”

DCOP: Exploration + Exploitation (paper to be posted on the web site) [Jain et al IJCAI’09]

Stochastic games: Multiagent learning to reach N.E. (in our readings)

DCOP Framework

[Figure: agents a1, a2, a3]

• Assign values to distributed variables
• Optimize total reward
• No central control
New Challenges
• Reward matrices unknown
• Algorithms explore environment
• Maximize total cumulative signal strength
• Changes measuring of DCOP algorithms
• Limited time horizon
• Not explore everything
• Horizon-aware DCOPs
DCOP Framework: Reward Matrix Unknown

[Figure: agents a1, a2, a3]

Assigning values to variables = Exploration

Exploration takes time (physical movement)

Limited time; full exploration impossible

Three New Algorithms
• Based on MGM (maximum gain message)
• Hill climbing
• Communicate possible gain to neighbors
• Agent with max gain “moves”
• Proposed new algorithms:
• SE-optimistic: Unexplored domain values yield ‘maximum’
• Optimistic: Maximal Potential Gain Messaging
• Exploration maximized: look for max value
• SE-mean: Unexplored domain values yield ‘mean’ reward
• “Realistic”: Limit exploration, satisfied by mean
• BE-backtrack : Lookahead given reward function distribution
• Intelligent: Decision theoretic limit on exploration

[Figure: agents a1, a2, a3, with gain messages “Gain = 15” and “Gain = 20”]
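The MGM step that the new algorithms build on (each agent reports its possible gain, and the agent with the maximum gain in its neighborhood moves) can be sketched as below. This is the basic known-reward version only; it does not implement SE-optimistic, SE-mean, or BE-backtrack. The function name, the data layout, and the reward tables are hypothetical, chosen so that two neighboring agents report gains of 15 and 20.

```python
def mgm_round(values, domains, neighbors, reward):
    """One synchronous MGM round; returns the new assignment.

    values: {agent: current value}; domains: {agent: list of values};
    neighbors: {agent: list of agents}; reward(a, va, b, vb): pairwise reward.
    """
    def local_reward(agent, value):
        return sum(reward(agent, value, nb, values[nb]) for nb in neighbors[agent])

    # Each agent finds its best unilateral change and the gain it would bring.
    proposals = {}
    for agent in values:
        best = max(domains[agent], key=lambda v: local_reward(agent, v))
        gain = local_reward(agent, best) - local_reward(agent, values[agent])
        proposals[agent] = (gain, best)

    # An agent moves only if its gain is positive and beats every neighbor's gain.
    new_values = dict(values)
    for agent, (gain, best) in proposals.items():
        if gain > 0 and all(gain > proposals[nb][0] for nb in neighbors[agent]):
            new_values[agent] = best
    return new_values


# Hypothetical chain a1 - a2 - a3 with binary domains.
tables = {
    ("a1", "a2"): {(0, 0): 5, (0, 1): 10, (1, 0): 0, (1, 1): 0},
    ("a2", "a3"): {(0, 0): 5, (0, 1): 25, (1, 0): 15, (1, 1): 0},
}

def reward(a, va, b, vb):
    if (a, b) in tables:
        return tables[(a, b)][(va, vb)]
    return tables[(b, a)][(vb, va)]

values = {"a1": 0, "a2": 0, "a3": 0}
domains = {agent: [0, 1] for agent in values}
neighbors = {"a1": ["a2"], "a2": ["a1", "a3"], "a3": ["a2"]}

print(mgm_round(values, domains, neighbors, reward))
```

In this example a2 could gain 15 and a3 could gain 20, so only a3 changes its value in this round.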

DCOP Framework: Reward Matrix Unknown

[Figure: agents a1, a2, a3]

What if 20 is max reward?

SE-optimistic: how will it work?

• Agent decides: ‘explore’ or ‘backtrack’ to explored state
• Let Rb be the best reward among explored states
• The agent will explore for T units only if
• EU(Explore) > EU(backtrack)
• Expected Utility of Backtrack:
• EU(backtrack) = Rb*T
• Expected utility of explore is calculated as:
• P(x,n,te) is the first order statistic of choosing the maximum as ‘x’ in te trials
• E.U.(explore) is sum of three terms:
• utility of exploring
• utility of finding a better reward than current Rb
• utility of failing to find a better reward than current Rb
Sample Results (Jain et al., IJCAI’09)
• Decision theoretic approach to exploration
• Interleave with DCOPs

### Towards Multiagent Learning: Stochastic Game

Generalize distributed POMDPs

Different payoffs for each player, not a common payoff

Focus on two person stochastic game

Learning algorithms for stochastic games

Stochastic 2-player Game
• States: S
• Action sets for each player: A1, A2
• P transition probabilities: P(s’| s, a1, a2)
• R or Reward: two separate rewards:
• R1(s, a1,a2), depends on actions of all agents
• R2(s, a1, a2), depends on actions of all agents
• If R1(s,a1,a2) + R2(s,a1,a2) = 0, then zero sum game
• State observable (MDP like)
• Each player: maximize its own (discounted) sum of rewards
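The components listed above map naturally onto a small data structure. The sketch below is only one possible encoding; the class name, field names, dictionary layout, and the one-state matching-pennies example are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

JointKey = Tuple[str, str, str]  # (state, a1, a2)


@dataclass
class StochasticGame:
    states: List[str]
    actions1: List[str]
    actions2: List[str]
    P: Dict[JointKey, Dict[str, float]]  # P[(s, a1, a2)][s'] = probability
    R1: Dict[JointKey, float]            # reward to player 1
    R2: Dict[JointKey, float]            # reward to player 2

    def is_zero_sum(self) -> bool:
        """True if R1 + R2 = 0 for every state and joint action."""
        return all(self.R1[k] + self.R2[k] == 0 for k in self.R1)

    def transitions_are_valid(self, tol: float = 1e-9) -> bool:
        """True if every P(. | s, a1, a2) sums to 1."""
        return all(abs(sum(d.values()) - 1.0) <= tol for d in self.P.values())


# Hypothetical one-state game (matching pennies played forever in state "s").
keys = [("s", a1, a2) for a1 in ("H", "T") for a2 in ("H", "T")]
R1 = {k: (1.0 if k[1] == k[2] else -1.0) for k in keys}
game = StochasticGame(
    states=["s"],
    actions1=["H", "T"],
    actions2=["H", "T"],
    P={k: {"s": 1.0} for k in keys},
    R1=R1,
    R2={k: -v for k, v in R1.items()},
)
print(game.is_zero_sum(), game.transitions_are_valid())
```

A game with a single state that always transitions back to itself is one simple way to view a repeated game inside this structure.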
Stochastic Game

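[Figure: from state s0, joint action (a1, a2) leads to s1 with probability P(s1|s0,a1,a2) or to s2 with probability P(s2|s0,a1,a2); each state s carries rewards R1(s) and R2(s)]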

Reward function depends on the state!

Stochastic Game
• How are repeated games related to stochastic games?
Stochastic Game
• Strategies = Policies
• Since rewards differ for each agent
• hence expected values differ as well
• v1(s, π1, π2) gives us the expected value for agent1 in state s, given that agents pursued policies π1, π2
• Nash equilibrium in stochastic game:
• pair of strategies (π1*, π2*) such that for all states s

v1(s, π1*, π2*) >= v1(s, π1, π2*)

And

v2(s, π1*, π2*) >= v2(s, π1*, π2)
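For intuition, the two inequalities can be checked numerically in the simplest case, a game with a single state and a single stage, where v1 and v2 reduce to expected payoffs of a bimatrix game. That restriction, the function names, and the matching-pennies payoffs are assumptions of this sketch; a full stochastic game would need the check in every state with long-run values.

```python
def expected_payoff(M, p, q):
    """Expected payoff when the row player mixes with p and the column player with q."""
    return sum(p[i] * q[j] * M[i][j] for i in range(len(p)) for j in range(len(q)))


def is_nash(M1, M2, p, q, tol=1e-9):
    """Check v1(p*, q*) >= v1(p, q*) and v2(p*, q*) >= v2(p*, q) for all p, q.

    Checking pure-strategy deviations is enough, because a mixed deviation
    is an average of pure ones.
    """
    v1 = expected_payoff(M1, p, q)
    v2 = expected_payoff(M2, p, q)
    best1 = max(sum(q[j] * M1[i][j] for j in range(len(q))) for i in range(len(p)))
    best2 = max(sum(p[i] * M2[i][j] for i in range(len(p))) for j in range(len(q)))
    return v1 >= best1 - tol and v2 >= best2 - tol


# Matching pennies: player 1 wants to match, player 2 wants to mismatch.
M1 = [[1, -1], [-1, 1]]
M2 = [[-1, 1], [1, -1]]

print(is_nash(M1, M2, [0.5, 0.5], [0.5, 0.5]))  # both players randomize evenly
print(is_nash(M1, M2, [1.0, 0.0], [0.5, 0.5]))  # player 1 always plays row 0
```

In the second call player 2 can profit by deviating to the column that mismatches row 0, so the pair is not an equilibrium.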

Nash Equilibrium Policies
• In Stochastic games, we focus on policies that attain Nash equilibrium
• If we don’t find Nash equilibrium, then players may have an incentive to deviate
• Search for stability is critical
• Policies may be randomized; may not be deterministic
Example Stochastic game
• Goalie can move or stay
• Shooter can move or shoot
• Zero sum game, goal worth 10 points to shooter
• Blocking worth 5 points to goalie

### Work out example

Q-learning in Stochastic Games

Nash-Q algorithm:

• Q1(s, a1, a2): Q-value of agent 1 in state s for joint action (a1, a2)
• Q2(s, a1, a2): Q-value of agent 2 in state s for joint action (a1, a2)
• Optimal Q values:
• $Q_1^*(s,a_1,a_2) = R_1(s,a_1,a_2) + \lambda \sum_{s'} P(s' \mid s,a_1,a_2)\, v_1(s', \pi_1^*, \pi_2^*)$
• $Q_2^*(s,a_1,a_2) = R_2(s,a_1,a_2) + \lambda \sum_{s'} P(s' \mid s,a_1,a_2)\, v_2(s', \pi_1^*, \pi_2^*)$
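The optimal Q-value equation can be evaluated directly once the equilibrium values of the successor states are known. The sketch below reads λ as the discount factor and assumes the values v1(s', π1*, π2*) are supplied in a dictionary `V`; the function name and all numbers are hypothetical.

```python
def q_from_values(R, P, V, discount):
    """Q*(s,a1,a2) = R(s,a1,a2) + discount * sum_s' P(s'|s,a1,a2) * V(s').

    R: {(s, a1, a2): reward}; P: {(s, a1, a2): {s': probability}};
    V: {s': equilibrium value of s' for this agent}.
    """
    return {
        key: R[key] + discount * sum(p * V[s2] for s2, p in P[key].items())
        for key in R
    }


# Hypothetical numbers for a single joint action in state s0.
R1 = {("s0", "a", "b"): 1.0}
P = {("s0", "a", "b"): {"s0": 0.5, "s1": 0.5}}
V1 = {"s0": 2.0, "s1": 4.0}

Q1 = q_from_values(R1, P, V1, discount=0.5)
print(Q1[("s0", "a", "b")])
```

The same function gives agent 2's values when called with R2 and agent 2's equilibrium values.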

### Example

Algorithm

Consider two agents:

• Each agent maintains m Q-tables, m = number of states
• For each state, maintain |A1|*|A2| number of entries in the Q-table
• |A1| for my actions
• |A2| for other agents’ actions
• Q-tables for me and for the other agent
Key Observation
• State s’
• Bimatrix representation: Q1[s’], Q2[s’]
• Defines a game
• Can find mixed strategy nash equilibrium for this game
• Mixed strategy Nash equilibrium:
• Provides probability distribution for what action to execute
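For a 2x2 bimatrix game, a fully mixed Nash equilibrium can be found from the indifference conditions: each player mixes so that the other player is indifferent between its two actions. The sketch below covers only that special case and returns `None` when no such equilibrium exists; the function name and the example payoffs are assumptions, and larger games need a general solver.

```python
def mixed_nash_2x2(M1, M2):
    """Return (p, q): probability of row 0 and of column 0 in a mixed equilibrium.

    M1[i][j] and M2[i][j] are the payoffs to the row and column player.
    Returns None if the indifference conditions have no valid solution.
    """
    # Row player's p must make the column player indifferent between columns.
    denom_p = M2[0][0] - M2[1][0] - M2[0][1] + M2[1][1]
    # Column player's q must make the row player indifferent between rows.
    denom_q = M1[0][0] - M1[0][1] - M1[1][0] + M1[1][1]
    if denom_p == 0 or denom_q == 0:
        return None
    p = (M2[1][1] - M2[1][0]) / denom_p
    q = (M1[1][1] - M1[0][1]) / denom_q
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        return None
    return p, q


# Matching pennies as a stand-in for the game (Q1[s'], Q2[s']).
Q1_s = [[1, -1], [-1, 1]]
Q2_s = [[-1, 1], [1, -1]]
print(mixed_nash_2x2(Q1_s, Q2_s))
```

The returned pair is the probability distribution each agent would sample its action from in state s'.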
Multiagent Q-Learning
• Initialize Q tables
• Loop:
• Choose action a1 based on π1(s), which is a mixed strategy Nash equilibrium of the game defined by (Q1(s), Q2(s))
• Observe r1, r2, a2, s’
• Update Q1(s) and Q2(s) using the equations defined below
• $Q_1(s,a_1,a_2) \leftarrow Q_1(s,a_1,a_2) + \alpha \left( R_1(s) + \lambda Z_1 - Q_1(s,a_1,a_2) \right)$
• Z1 = expected reward given N.E. in state s’
• due to game Q1(s’),Q2(s’)
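One update of agent 1's table might look like the sketch below. It assumes the mixed-strategy equilibrium (pi1, pi2) of the stage game at s' has already been computed by some solver and is passed in, and it reads λ as the discount factor and α as the learning rate. The function name and the numbers are hypothetical.

```python
def nash_q_update(Q1, s, a1, a2, r1, s_next, pi1, pi2, alpha, discount):
    """Q1(s,a1,a2) <- Q1(s,a1,a2) + alpha * (r1 + discount * Z1 - Q1(s,a1,a2)).

    Z1 is agent 1's expected Q value in s_next when both agents play the
    equilibrium strategies pi1, pi2 (dicts mapping action -> probability).
    """
    z1 = sum(
        pi1[b1] * pi2[b2] * Q1[(s_next, b1, b2)]
        for b1 in pi1
        for b2 in pi2
    )
    Q1[(s, a1, a2)] += alpha * (r1 + discount * z1 - Q1[(s, a1, a2)])
    return Q1[(s, a1, a2)]


# Hypothetical table: agent 1 scores 4 in s' when the actions match (Q2 = -Q1),
# so mixing evenly is the equilibrium of that stage game.
Q1 = {("s", "x", "y"): 0.0}
for b1 in ("x", "y"):
    for b2 in ("x", "y"):
        Q1[("s_next", b1, b2)] = 4.0 if b1 == b2 else 0.0

uniform = {"x": 0.5, "y": 0.5}
new_q = nash_q_update(Q1, "s", "x", "y", r1=1.0, s_next="s_next",
                      pi1=uniform, pi2=uniform, alpha=0.5, discount=0.5)
print(new_q)
```

Agent 1 keeps a second table for agent 2 and updates it the same way with r2, which is why it must observe r2 and a2 as well as its own reward.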

### What do we end up with:

Agents converging to the Nash equilibrium

Towards Multiagent Learning
• Learning “single agent” in a multiagent setting
• Ignore other agents except for some property like location
• Ignore that other agents act intentionally, adapt
• Simpler
• Easily converges
Single Agent in Multiagent Setting
• RoboCup Soccer Simulation League
• Players use model-free reinforcement learning to intercept the ball
• Learn “on line” during the game
Finding #1: Online Learning Specialized by Opponent
• Same player position against two different RoboCup teams:
• Player 1 (forward) against CMUnited and Andhill
• Against CMUnited, player turns more aggressively
Finding #2: Online Learning Specialized by Role

Same team against different players

Player 1 (forward) and Player 10 (fullback) against CMUnited

Lessons Learned
• Surprise in tests against opponent teams:
• Significant specialization of intercept with both role & opponent
• Lesson: Transfer of experience or cross-training may be detrimental