## Software Multiagent Systems: CS543

Presentation Transcript

### Two Frameworks for Multiagent “Learning”

### Towards Multiagent Learning: Stochastic Games

### What do we end up with:

Dimensions of Multiagent Learning

- Ignore others’ learning vs Model others’ learning
- Cooperative vs Competitive
- Cooperative
- Learn to coordinate with others
- Learning organizational roles
- Competitive (conflicting learning goals)
- Learning to play better against adversary
- Opponent modeling
- We will focus on reinforcement learning: Q-learning methods

Some Terminology

- Q-learning
- Model-free vs Model-based

Q-learning

- Q-values: Q(s,a)
- Related to utility values:
- U(s) = max_a Q(s,a)
- Following equation must hold at equilibrium:

Q(i,a) = R(i) + Σ_j P(j|i,a) · max_a' Q(j,a')

- Requires learning a model! (see the sketch below)
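A minimal sketch of this equilibrium equation in code, on a hypothetical 2-state, 2-action MDP (the rewards and transition probabilities below are made up for illustration). A discount factor gamma is added here so the fixed-point iteration converges; the slide's version instead relies on terminal states.

```python
import numpy as np

R = np.array([0.0, 1.0])                  # R[i]: reward in state i (made up)
P = np.array([                            # P[i, a, j] = P(j | i, a) (made up)
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.2, 0.8], [0.0, 1.0]],
])
gamma = 0.9                               # added so the iteration converges

Q = np.zeros((2, 2))                      # Q[i, a]
for _ in range(500):
    # Q(i,a) = R(i) + gamma * sum_j P(j|i,a) * max_a' Q(j,a')
    Q = R[:, None] + gamma * np.einsum("iaj,j->ia", P, Q.max(axis=1))

print(Q)  # Q now satisfies the equilibrium equation to numerical tolerance
```

Note that the iteration needs the transition model P, which is exactly why this formulation "requires learning a model".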

TD Q-learning

- Update equation for TD Q-learning is:

Q(i,a) ← Q(i,a) + α (R(i) + max_a' Q(j,a') − Q(i,a))

- What if α = 0?
- What if α = 1?

Q Learning Agent

- Q-Learning-Agent(e) returns an action
- e is the percept
- Q: table of action values
- N: table of state-action frequencies
- a: the last action
- i: the previous state
- j ← STATE[e]
- N[i,a] ← N[i,a] + 1
- Q[i,a] ← Q[i,a] + α (R(i) + max_a' Q[j,a'] − Q[i,a])
- i ← j
- Return the action a' that maximizes f(Q[j,a'], N[j,a']) (a runnable sketch follows)
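Below is a minimal, runnable sketch of this agent. The environment interface (calling `step` with the new state and the reward for the previous state) and the ε-greedy action choice are illustrative assumptions; the slide's exploration function f(Q, N) is sketched separately after the Exploration slide.

```python
import random
from collections import defaultdict

class QLearningAgent:
    def __init__(self, actions, alpha=0.1, epsilon=0.1):
        self.actions = actions
        self.alpha = alpha            # learning rate α from the update rule
        self.epsilon = epsilon        # stand-in exploration parameter
        self.Q = defaultdict(float)   # Q[(state, action)]
        self.N = defaultdict(int)     # N[(state, action)] visit counts
        self.prev_state = None        # i, the previous state
        self.prev_action = None       # a, the last action

    def step(self, state, reward):
        """Percept: current state j and reward R(i); returns the next action."""
        if self.prev_state is not None:
            i, a, j = self.prev_state, self.prev_action, state
            self.N[(i, a)] += 1
            best_next = max(self.Q[(j, ap)] for ap in self.actions)
            # Q[i,a] <- Q[i,a] + alpha * (R(i) + max_a' Q[j,a'] - Q[i,a])
            self.Q[(i, a)] += self.alpha * (reward + best_next - self.Q[(i, a)])
        # epsilon-greedy choice, standing in for f(Q[j,a'], N[j,a'])
        if random.random() < self.epsilon:
            action = random.choice(self.actions)
        else:
            action = max(self.actions, key=lambda ap: self.Q[(state, ap)])
        self.prev_state, self.prev_action = state, action
        return action
```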

Choosing an Action….

- Step 5: choosing the best action to take in state j

(a' is the action chosen using f(Q(a', j), N(a', j)))

- Suppose all Q values are initially zero, and f(Q(a', j), N(a', j)) simply chooses the action with the maximum Q(a', j)
- Suppose after the first exploration:
- Q[j,A1] = 10, Q[j,A2] = 0, Q[j,A3] = 0, Q[j,A4] = 0

What will happen? Is this a problem?

Exploration vs Exploitation

- Tradeoff: immediate good (exploit) vs long-term (explore)
- Continuous exploration vs. getting stuck on a well-known path
- Key question: How to balance the two?

- One approach:
- Give some weight to actions not tried often
- Avoid actions that are of low utility

Exploration

- Giving “weight” to actions not tried very often

f(Q(a', j), N(a', j)) = argmax_a' G(Q(a', j), N(a', j))

- G returns:
- a very high optimistic reward “R” if N(a', j) < N-VISITS
- otherwise Q(a', j)
- What will be the result of such a function G? (see the sketch below)
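A small sketch of this exploration function. The constants R_PLUS (the "very high R") and N_VISITS, and the table layout, are illustrative assumptions; Q and N are the tables kept by the Q-learning agent sketched above.

```python
R_PLUS = 100.0   # optimistic stand-in for the "very high R"
N_VISITS = 5     # try each (state, action) at least this many times

def G(q_value, visit_count):
    """Optimistic value: pretend rarely-tried actions are excellent."""
    return R_PLUS if visit_count < N_VISITS else q_value

def f(Q, N, state, actions):
    """Choose argmax_a' G(Q(a', j), N(a', j)) in state j."""
    return max(actions, key=lambda a: G(Q[(state, a)], N[(state, a)]))
```

The effect: every action is tried at least N_VISITS times in each state before the agent commits to exploiting its learned Q-values.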

DCOP: Exploration + Exploitation (paper to be posted on the web site) [Jain et al IJCAI’09]

Stochastic games: Multiagent learning to reach N.E. (in our readings)

DCOPs for Mobile Sensor Networks

(with Lockheed ATL)

New Challenges

- Reward matrices unknown
- Algorithms explore environment
- Maximize total cumulative signal strength
- Changes how DCOP algorithms are evaluated
- Limited time horizon
- Cannot explore everything
- Horizon-aware DCOPs

DCOP Framework: Reward Matrix Unknown

[Figure: DCOP over variables a1, a2, a3; reward matrix unknown]

Assigning values to variables = Exploration

Exploration takes time (physical movement)

Limited time; full exploration impossible

Three New Algorithms

- Based on MGM (maximum gain message)
- Hill climbing
- Communicate possible gain to neighbors
- Agent with max gain “moves”

- Proposed new algorithms:
- SE-optimistic: Unexplored domain values assumed to yield the ‘maximum’ reward
- Optimistic: Maximal Potential Gain Messaging
- Exploration maximized: look for the max value
- SE-mean: Unexplored domain values assumed to yield the ‘mean’ reward
- “Realistic”: Limits exploration; satisfied by the mean
- BE-backtrack: Lookahead given the reward function distribution
- Intelligent: Decision-theoretic limit on exploration (a sketch of SE-optimistic follows this list)
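As a hedged illustration of how SE-optimistic plugs into an MGM-style round, here is a sketch of one agent's local gain computation. The reward bookkeeping (`explored_reward`) and the constant MAX_REWARD are assumptions for illustration, not the exact formulation from Jain et al., IJCAI'09.

```python
MAX_REWARD = 100.0   # assumed upper bound on any reward entry

def local_gain(current_value, candidate_values, explored_reward):
    """Return (best_gain, best_value) for one agent in an MGM round.

    explored_reward maps already-tried domain values to the reward observed
    with them; values absent from the map are unexplored and, under
    SE-optimistic, are assumed to yield MAX_REWARD.
    """
    def estimate(v):
        return explored_reward.get(v, MAX_REWARD)   # optimism for unknowns

    best_value = max(candidate_values, key=estimate)
    return estimate(best_value) - estimate(current_value), best_value

# MGM step: each agent computes its gain, exchanges it with neighbors, and
# only the agent with the maximum gain in its neighborhood "moves" (changes
# its value). With SE-optimistic, unexplored values always look like the
# best move until they have actually been tried.
```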

[Figure: MGM example over variables a1, a2, a3, with possible gains of 15 and 20]

DCOP Framework: Reward Matrix Unknown

[Figure: DCOP over variables a1, a2, a3; reward matrix unknown]

What if 20 is max reward?

SE-optimistic: how will it work?

Lookahead

- Agent decides: ‘explore’ or ‘backtrack’ to an already-explored state
- Let Rb be the best reward among explored states
- The agent will explore for T units only if
- EU(explore) > EU(backtrack)
- Expected Utility of Backtrack:
- EU(backtrack) = Rb * T

Lookahead

- Expected utility of explore is calculated as follows:
- P(x, n, te) is the first-order statistic: the probability that the maximum reward found in te exploration trials is ‘x’
- EU(explore) is the sum of three terms:
- the utility accrued while exploring
- the utility of finding a better reward than the current Rb
- the utility of failing to find a better reward than the current Rb (a simplified sketch of the comparison follows)
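A simplified, assumption-laden sketch of this explore-vs-backtrack comparison. The discrete reward distribution, the way the first-order statistic is computed, and how the three terms are combined are illustrative guesses, not the exact BE-backtrack formulation from Jain et al., IJCAI'09.

```python
def eu_backtrack(r_best, T):
    """Backtrack: collect the best known reward Rb for all T remaining steps."""
    return r_best * T

def eu_explore(r_best, T, te, rewards, probs):
    """Explore for te steps, then exploit the best reward found (or Rb).

    rewards/probs: an assumed discrete distribution over per-step rewards.
    """
    cdf = lambda x: sum(p for r, p in zip(rewards, probs) if r <= x)
    mean_r = sum(r * p for r, p in zip(rewards, probs))
    value = mean_r * te                        # term 1: utility while exploring
    for x, p in zip(rewards, probs):
        # first-order statistic: P(maximum of te samples equals x)
        p_max_is_x = cdf(x) ** te - (cdf(x) - p) ** te
        # terms 2 and 3: afterwards exploit whichever of x and Rb is better
        value += p_max_is_x * max(x, r_best) * (T - te)
    return value

# The agent keeps exploring only while eu_explore(...) > eu_backtrack(...).
```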

Sample Results(Jain et al, IJCAI’09)

- Decision theoretic approach to exploration
- Interleave with DCOPs

Generalize distributed POMDPs

Different payoffs for each player, not a common payoff

Focus on two-person stochastic games

Learning algorithms for stochastic games

Stochastic 2-player Game

- States: S
- Action sets for each player: A1, A2
- P transition probabilities: P(s’| s, a1, a2)
- R or Reward: two separate rewards:
- R1(s, a1,a2), depends on actions of all agents
- R2(s, a1, a2), depends on actions of all agents
- If R1(s,a1,a2) + R2(s,a1,a2) = 0, then zero sum game
- State observable (MDP like)
- Each player: maximize its own (discounted) sum of rewards
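The definition above maps naturally onto a small data structure; here is a minimal sketch. The class and field names are illustrative, not taken from the readings.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action = str, str
Joint = Tuple[State, Action, Action]            # (s, a1, a2)

@dataclass
class StochasticGame:
    states: List[State]
    actions1: List[Action]                      # A1
    actions2: List[Action]                      # A2
    P: Dict[Joint, Dict[State, float]]          # P[(s,a1,a2)][s'] = P(s'|s,a1,a2)
    R1: Dict[Joint, float]                      # reward to player 1
    R2: Dict[Joint, float]                      # reward to player 2

    def is_zero_sum(self) -> bool:
        """Zero-sum iff R1(s,a1,a2) + R2(s,a1,a2) = 0 everywhere."""
        return all(abs(self.R1[k] + self.R2[k]) < 1e-9 for k in self.R1)
```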

Stochastic Game

[Figure: transition diagram from state s0 to states s1 and s2 with probabilities P(s1|s0,a1,a2) and P(s2|s0,a1,a2); each state s is labeled with rewards R1(s) and R2(s)]

Reward function depends on the state!

Stochastic Game

- How are repeated games related to stochastic games?

Stochastic Game

- Strategies = Policies
- Since rewards differ for each agent, expected values differ as well
- v1(s, π1, π2) gives us the expected value for agent1 in state s, given that agents pursued policies π1, π2
- Nash equilibrium in stochastic game:
- pair of strategies (π1*, π2*) such that for all states s

v1(s, π1*, π2*) >= v1(s, π1, π2*) for all policies π1

and

v2(s, π1*, π2*) >= v2(s, π1*, π2) for all policies π2

Nash Equilibrium Policies

- In Stochastic games, we focus on policies that attain Nash equilibrium
- If we don’t find Nash equilibrium, then players may have an incentive to deviate
- Search for stability is critical
- Policies may be randomized; may not be deterministic

Example Stochastic game

- Goalie can move or stay
- Shooter can move or shoot
- Zero-sum game; a goal is worth 10 points to the shooter
- A block is worth 5 points to the goalie

Q-learning in Stochastic Games

Nash-Q algorithm:

- Q1(s, a1, a2): Q value of agent 1 for state s
- Q2(s, a1, a2): Q value of agent 2 for state s
- Optimal Q values:
- Q1*(s, a1, a2) = R1(s, a1, a2) + λ Σ_s' P(s'|s, a1, a2) · V1(s', π1*, π2*)
- Q2*(s, a1, a2) = R2(s, a1, a2) + λ Σ_s' P(s'|s, a1, a2) · V2(s', π1*, π2*)

Algorithm

Consider two agents:

- Each agent maintains m Q-tables, m = number of states
- For each state, maintain |A1|*|A2| number of entries in the Q-table
- |A1| for my actions
- |A2| for the other agent’s actions
- Q-tables for me and for the other agent

Key Observation

- State s’
- Bimatrix representation: Q1[s’], Q2[s’]
- Defines a game
- Can find a mixed-strategy Nash equilibrium for this game (a small 2x2 solver sketch follows)
- Mixed-strategy Nash equilibrium:
- Provides a probability distribution over which action to execute
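For the two-action case, the bimatrix game (Q1[s’], Q2[s’]) can be solved with the standard indifference conditions; a hedged sketch is below. General bimatrix games (more actions, degenerate payoffs) need a proper solver such as support enumeration or Lemke-Howson; the function name and layout here are purely illustrative.

```python
def mixed_nash_2x2(A, B):
    """Mixed-strategy Nash equilibrium of a nondegenerate 2x2 bimatrix game.

    A[i][j], B[i][j]: payoffs to player 1 and player 2 when they play
    actions i and j. Returns (p, q): the probabilities with which each
    player chooses their action 0.
    """
    # 1. Pure-strategy equilibria: neither player can gain by deviating.
    for i in (0, 1):
        for j in (0, 1):
            if A[i][j] >= A[1 - i][j] and B[i][j] >= B[i][1 - j]:
                return (1.0 if i == 0 else 0.0, 1.0 if j == 0 else 0.0)
    # 2. Fully mixed equilibrium: each player's mixture makes the other
    #    player indifferent between their two actions.
    p = (B[1][1] - B[1][0]) / (B[0][0] - B[0][1] - B[1][0] + B[1][1])
    q = (A[1][1] - A[0][1]) / (A[0][0] - A[0][1] - A[1][0] + A[1][1])
    return p, q

# Example: matching pennies has no pure equilibrium; both players mix 50/50.
print(mixed_nash_2x2([[1, -1], [-1, 1]], [[-1, 1], [1, -1]]))  # (0.5, 0.5)
```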

Multiagent Q-Learning

- Initialize Q tables
- Loop:
- Choose action a1 based on π1(s), which is a mixed strategy Nash equilibrium of the game defined by (Q1(s), Q2(s))
- Observe r1, r2, a2, s’
- Update Q1(s) and Q2(s) using the equation below (the Q2 update is symmetric):
- Q1(s, a1, a2) ← Q1(s, a1, a2) + α (r1 + λ [Z1] − Q1(s, a1, a2))
- Z1 = expected reward to agent 1 under the Nash equilibrium of the game defined by Q1(s’), Q2(s’) in state s’ (a sketch of this update follows)
Agents converging to the Nash equilibrium

Towards Multiagent Learning

- Learning “single agent” in a multiagent setting
- Ignore other agents except for some property like location
- Ignore that other agents act intentionally and adapt
- Advantages:
- Simpler
- Easily converges

Single Agent in Multiagent Setting

- RoboCup Soccer Simulation League
- Players use model-free reinforcement learning to intercept the ball
- Learn “on line” during the game

Finding #1: Online Learning Specialized by Opponent

- Same player position against two different RoboCup teams:
- Player 1 (forward) against CMUnited and Andhill
- Against CMUnited, player turns more aggressively

Finding #2: Online Learning Specialized by Role

Same team against different players

Player 1 (forward) and Player 10 (fullback) against CMUnited

Lessons Learned

- Surprise in tests against opponent teams:
- Significant specialization of intercept with both role & opponent
- Lesson: Transfer of experience or cross-training may be detrimental
