Multiagent learning using a variable learning rate

Igor Kiselev, University of Waterloo

M. Bowling and M. Veloso. Artificial Intelligence, Vol. 136, 2002, pp. 215-250.



Agenda

Introduction

Motivation to multi-agent learning

MDP framework

Stochastic game framework

Reinforcement Learning: single-agent, multi-agent

Related work:

Multiagent learning with a variable learning rate

Theoretical analysis of the replicator dynamics

WoLF Incremental Gradient Ascent algorithm

WoLF Policy Hill Climbing algorithm


Concluding remarks



Motivation to multi-agent learning

MAL is a Challenging and Interesting Task
  • Research goal: enable an agent to learn to act effectively (cooperate, compete) in the presence of other learning agents in complex domains.
  • Equipping a multi-agent system (MAS) with learning capabilities permits agents to deal with large, open, dynamic, and unpredictable environments.
  • Multi-agent learning (MAL) is a challenging problem for developing intelligent systems.
  • Multiagent environments are non-stationary, violating the traditional assumption underlying single-agent learning


MDP and Stochastic Game Frameworks

Single-agent Reinforcement Learning
  • Independent learners act while ignoring the existence of other agents
  • Environment is assumed stationary
  • Each learns a policy that maximizes its individual utility (“trial and error”)
  • Agents perform their actions, obtain a reward, and update their Q-values without regard to the actions performed by others

R. S. Sutton, 1997

[Figure: MDP agent-environment interaction unrolled over time steps t+1, t+2, t+3, …, t+f]

Markov Decision Processes / MDP Framework

Environment is modeled as an MDP, defined by (S, A, R, T):

S – finite set of states of the environment

A(s) – set of actions possible in state s ∈ S

T: S × A × S → [0, 1] – transition function giving the probability P(s, s′, a) of moving from state s to s′ under action a

R(s, s′, a) – expected reward on the transition from s to s′ under action a

γ – discount rate for delayed reward

At each discrete time step t = 0, 1, 2, . . . the agent:

  • observes state st ∈ S
  • chooses action at ∈ A(st)
  • receives immediate reward rt
  • state changes to st+1

T. M. Mitchell, 1997

Agent’s Learning Task – Find Optimal Action Selection Policy

Find a policy π: s ∈ S → a ∈ A(s) that maximizes the value (expected future reward) of each state s:

V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s_t = s, π }

and of each (s, a) pair:

Q^π(s,a) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s_t = s, a_t = a, π }

Execute actions in the environment, observe the results, and learn to construct an optimal action selection policy that maximizes the agent’s performance – the long-term total discounted reward.

T. M. Mitchell, 1997

Agent’s Learning Strategy – Q-Learning method
  • Q-function – iterative approximation of Q values with learning rate β: 0 ≤ β < 1
  • Q-Learning incremental process
    • Observe the current state s
    • Select an action a with probability based on the employed selection policy
    • Observe the new state s′
    • Receive a reward r from the environment
    • Update the corresponding Q-value for action a and state s: Q(s,a) ← (1−β) Q(s,a) + β (r + γ max_{a′} Q(s′,a′))
    • Terminate the current trial if the new state s′ satisfies a terminal condition; otherwise let s′ → s and go back to step 1
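The incremental process above can be sketched in Python. This is a minimal sketch: the `env` object with `reset()`, `step(a)`, and an `actions` list is a hypothetical interface, not something from the paper; only the ε-greedy selection and the Q-update rule reflect the slide.

```python
import random

def q_learning_episode(env, Q, beta=0.1, gamma=0.9, epsilon=0.1):
    """Run one trial of the Q-learning loop (sketch; `env` is hypothetical)."""
    s = env.reset()
    while True:
        # Steps 1-2: observe current state, select an action
        # (here: epsilon-greedy as the selection policy)
        if random.random() < epsilon:
            a = random.choice(env.actions)
        else:
            a = max(env.actions, key=lambda b: Q.get((s, b), 0.0))
        # Steps 3-4: observe the new state, receive a reward
        s2, r, done = env.step(a)
        # Step 5: update the Q-value for (s, a)
        target = r if done else r + gamma * max(Q.get((s2, b), 0.0) for b in env.actions)
        Q[(s, a)] = (1 - beta) * Q.get((s, a), 0.0) + beta * target
        # Step 6: terminate on a terminal state, otherwise s' -> s
        if done:
            return Q
        s = s2
```

Repeated over many trials, the Q-values converge toward the expected discounted return for each state-action pair.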
Multi-agent Framework
  • Learning in multi-agent setting
    • all agents simultaneously learning
    • environment not stationary (other agents are evolving)
    • problem of a “moving target”
Stochastic Game Framework for addressing MAL

From the perspective of sequential decision making:

  • Markov decision processes
    • one decision maker
    • multiple states
  • Repeated games
    • multiple decision makers
    • one state
  • Stochastic games (Markov games)
    • extension of MDPs to multiple decision makers
    • multiple states
[Figure: per-state payoff matrices R1(s,a), R2(s,a), … — each state of a stochastic game is an n-agent stage game]

Stochastic Game / Notation

S: set of states (n-agent stage games)

Ri(s,a): reward to player i in state s under joint action a

T(s,a,s′): probability of transition from state s to state s′ under joint action a

From the dynamic programming approach:

Qi(s,a): long-run payoff to player i from state s under joint action a, assuming equilibrium play thereafter
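As a concrete special case, a repeated matrix game is a stochastic game with a single state. For example, matching pennies (one of the testbeds later in the talk) has zero-sum payoffs R1 = −R2; a minimal Python sketch of the payoff matrices and expected payoff under mixed strategies:

```python
# Matching pennies as a one-state stochastic game: two players,
# two actions (heads, tails), zero-sum rewards R1 = -R2.
R1 = [[ 1, -1],   # row: our action, column: opponent's action
      [-1,  1]]
R2 = [[-1,  1],
      [ 1, -1]]

def expected_payoff(R, x, y):
    """Expected reward under mixed strategies x (row player) and y (column player)."""
    return sum(x[i] * y[j] * R[i][j] for i in range(2) for j in range(2))
```

At the Nash equilibrium (both players uniform), each player's expected payoff is 0.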



Multiagent learning using a variable learning rate

Evaluation criteria for multi-agent learning
  • Use of convergence to a Nash equilibrium (NE) as the criterion is problematic:
    • Terminating criterion: equilibrium only identifies conditions under which learning can or should stop
    • It is easier to play an equilibrium than to continue computation
    • A Nash equilibrium strategy has no “prescriptive force”: it says nothing about play prior to termination
    • Multiple potential equilibria
    • The opponent may not wish to play an equilibrium
    • Calculating a Nash equilibrium can be intractable for large games
  • New criteria: rationality and convergence in self-play
    • Converge to a stationary policy: not necessarily Nash
    • Learning terminates only once a best response to the play of the other agents is found
    • In self-play, learning can therefore terminate only at a stationary NE
Contributions and Assumptions
  • Contributions:
    • Criterion for multi-agent learning algorithms
    • A simple Q-learning algorithm that can play mixed strategies
    • The WoLF-PHC (Win or Learn Fast Policy Hill Climbing) algorithm
  • Assumptions – both properties are obtained given that:
    • The game is two-player, two-action
    • Players can observe each other’s mixed strategies (not just the played actions)
    • Players can use infinitesimally small step sizes
Opponent Modeling or Joint-Action Learners

C. Claus, C. Boutilier, 1998

Joint-Action Learners Method
  • Maintains an explicit model of the opponents for each state
  • Q-values are maintained for all possible joint actions at a given state
  • The key assumption is that the opponent is stationary
  • Thus, the model of the opponent is simply the frequencies of actions played in the past
  • Probability that the opponent plays action a−i: Pr(a−i) = C(a−i) / n(s), where C(a−i) is the number of times the opponent has played action a−i and n(s) is the number of times state s has been visited
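The frequency model can be sketched as follows (the class name and method names are illustrative; only the counting rule Pr(a−i) = C(a−i)/n(s) comes from the slide):

```python
from collections import defaultdict

class OpponentModel:
    """Frequency model of a (presumed stationary) opponent, kept per state."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # C(a_-i) per state
        self.visits = defaultdict(int)                       # n(s)

    def observe(self, s, a_opp):
        """Record that the opponent played a_opp in state s."""
        self.counts[s][a_opp] += 1
        self.visits[s] += 1

    def prob(self, s, a_opp):
        """Pr(a_-i | s) = C(a_-i) / n(s); 0 before s is ever visited."""
        if self.visits[s] == 0:
            return 0.0
        return self.counts[s][a_opp] / self.visits[s]
```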
WoLF Principles
  • The idea is to use two different strategy update step sizes, one for winning and one for losing situations
  • “Win or Learn Fast”: the agent reduces its learning rate when performing well and increases it when doing badly; this improves convergence of IGA and policy hill-climbing
  • To distinguish between those situations, the player keeps track of two policies
  • The player is considered to be winning if the expected utility of its current policy is greater than the expected utility of the equilibrium (or average) policy
  • If winning, the smaller of the two strategy update steps is chosen
Incremental Gradient Ascent Learners (IGA)
  • IGA:
    • incrementally climbs in the mixed strategy space
    • for two-player, two-action general-sum games
    • guarantees convergence to a Nash equilibrium, or convergence to an average payoff that is sustained by some Nash equilibrium
  • WoLF-IGA:
    • based on the WoLF principle
    • guaranteed to converge to a Nash equilibrium in all two-player, two-action general-sum games
Simple Q-Learner that plays mixed strategies


  • guarantees rationality against stationary opponents
  • does not converge in self-play

Updating a mixed strategy by giving more weight to the action that Q-learning believes is the best
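The update can be sketched as a policy hill-climbing step. This is a sketch under the assumption that `pi` and `Q` are dictionaries keyed by (state, action); the names and the renormalization guard are illustrative, not the paper's exact pseudocode.

```python
def phc_update(pi, Q, s, actions, delta=0.1):
    """Shift probability mass delta toward the action Q currently
    believes is best, keeping pi(s, .) a probability distribution."""
    best = max(actions, key=lambda a: Q.get((s, a), 0.0))
    n = len(actions)
    for a in actions:
        p = pi.get((s, a), 1.0 / n)  # start from the uniform policy
        if a == best:
            pi[(s, a)] = min(1.0, p + delta)
        else:
            pi[(s, a)] = max(0.0, p - delta / (n - 1))
    # renormalize to guard against clipping drift
    z = sum(pi[(s, a)] for a in actions)
    for a in actions:
        pi[(s, a)] /= z
    return pi
```

Against a stationary opponent this behaves rationally (it converges to a best response), but in self-play it can cycle, which is what motivates the variable learning rate.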

WoLF Policy Hill Climbing algorithm
  • the agent only needs to see its own payoffs
  • converges in self-play for two-player, two-action stochastic games

Maintaining average policy

Probability of playing action

Determination of “W” and “L”: by comparing the expected value of the current policy to that of the average policy
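The “W”/“L” determination can be sketched as follows. Here `pi` and `pi_avg` are assumed to map (state, action) pairs to probabilities, and the two step sizes δ_win < δ_lose are illustrative values, not the paper's experimental settings.

```python
def wolf_delta(pi, pi_avg, Q, s, actions, delta_win=0.01, delta_lose=0.04):
    """WoLF rule: compare the expected value of the current policy with
    that of the average policy; learn cautiously when winning ("W"),
    fast when losing ("L")."""
    v_current = sum(pi[(s, a)] * Q.get((s, a), 0.0) for a in actions)
    v_average = sum(pi_avg[(s, a)] * Q.get((s, a), 0.0) for a in actions)
    return delta_win if v_current > v_average else delta_lose
```

The returned step size would then be fed into the hill-climbing update of the current policy.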


Theoretical analysis

Analysis of the replicator dynamics

Replicator Dynamics – Simplification Case

Best response dynamics for Paper-Rock-Scissors

Circular shift from one agent’s policy to the other’s average reward
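The cycling behaviour in rock-paper-scissors can be reproduced with a small Euler step of the standard replicator dynamics, ẋᵢ = xᵢ((Ay)ᵢ − xᵀAy). This is a generic sketch of that textbook equation, not the paper's exact formulation:

```python
def replicator_step(x, A, y, dt=0.01):
    """One Euler step of the replicator dynamics for payoff matrix A,
    playing mixed strategy x against an opponent's mixed strategy y."""
    n = len(x)
    Ay = [sum(A[i][j] * y[j] for j in range(n)) for i in range(n)]   # (Ay)_i
    avg = sum(x[i] * Ay[i] for i in range(n))                        # x . Ay
    return [x[i] + dt * x[i] * (Ay[i] - avg) for i in range(n)]

# Rock-paper-scissors payoffs for the row player
RPS = [[ 0, -1,  1],
       [ 1,  0, -1],
       [-1,  1,  0]]
```

Iterating this in self-play (y tracking the other player's x) orbits the uniform mixed equilibrium rather than converging to it, which is the cycling the analysis addresses.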

A winning strategy against PHC


If winning: play probability 1 for the current preferred action, in order to maximize rewards while winning.

If losing: play a deceiving policy until we are ready to take advantage of the opponent again.

[Figure: trajectory in the joint strategy space; axes are the probability we play heads vs. the probability the opponent plays heads]

Convergence dynamics of strategies
  • Iterated Gradient Ascent again performs a myopic adaptation to the other players’ current strategies
    • Either converges to a Nash fixed point on the boundary (at least one pure strategy), or falls into limit cycles
    • Varying the learning rate allows the learner to remain optimal while satisfying both properties
Experimental testbeds
  • Matrix Games
    • Matching pennies
    • Three-player matching pennies
    • Rock-paper-scissors
  • Gridworld
  • Soccer
Summary and Conclusion
  • Criterion for multi-agent learning algorithms: rationality and convergence
  • A simple Q-learning algorithm that can play mixed strategies
  • The WoLF-PHC (Win or Learn Fast Policy Hill Climbing) algorithm, satisfying rationality and convergence
  • Analysis for two-player, two-action games: pseudoconvergence
  • Avoidance of exploitation
    • guaranteeing that the learner cannot be deceptively exploited by another agent
    • Chang and Kaelbling (2001) demonstrated that the best-response learner PHC (Bowling & Veloso, 2002) could be exploited by a particular dynamic strategy
Future Work by Authors
  • Exploring learning outside of self-play: whether WoLF techniques can be exploited by a malicious (not rational) “learner”
  • Scaling to large problems: combining single-agent scaling solutions (function approximators and parameterized policies) with the concepts of a variable learning rate and WoLF
  • Online learning
  • Other algorithms by the authors:
    • GIGA-WoLF, normal-form games
Discussion / Open Questions
  • Investigating other evaluation criteria:
    • No-regret criteria
    • Negative non-convergence regret (NNR)
    • Fast reaction (tracking) [Jensen]
    • Performance: maximum time for reaching a desired performance level
  • Incorporating more algorithms into testing: deeper comparison with both simpler and more complex algorithms (e.g. AWESOME [Conitzer and Sandholm 2003])
  • Classification of situations (games) by the values of the delta and alpha variables: which values are good in which situations
  • Extending the work to more players
  • Online learning and the exploration policy in stochastic games (trade-off)
  • The formalism is currently presented in a two-dimensional state space: is there a possibility for extending the formal model (geometrically?)
  • What makes Minimax-Q irrational?
  • Application of WoLF to multi-agent evolutionary algorithms (e.g. to control the mutation rate) or to the learning of neural networks (e.g. to determine a winner neuron)?
  • Connection with control theory and the learning of complex adaptive systems: manifold-adaptive learning?


Thank you