Multiagent learning using a variable learning rate

Igor Kiselev, University of Waterloo

M. Bowling and M. Veloso. Artificial Intelligence, Vol. 136, 2002, pp. 215-250.


Agenda

Introduction

Motivation for multi-agent learning

MDP framework

Stochastic game framework

Reinforcement Learning: single-agent, multi-agent

Related work:

Multiagent learning with a variable learning rate

Theoretical analysis of the replicator dynamics

WoLF Incremental Gradient Ascent algorithm

WoLF Policy Hill Climbing algorithm

Results

Concluding remarks


Introduction

Motivation for multi-agent learning


MAL is a Challenging and Interesting Task

  • The research goal is to enable an agent to learn effectively how to act (cooperate, compete) in the presence of other learning agents in complex domains.

  • Equipping a multi-agent system (MAS) with learning capabilities permits agents to deal with large, open, dynamic, and unpredictable environments.

  • Multi-agent learning (MAL) is a challenging problem for developing intelligent systems.

  • Multiagent environments are non-stationary, violating the traditional assumption underlying single-agent learning




Preliminaries

MDP and Stochastic Game Frameworks


Single-agent Reinforcement Learning

[Figure: the agent-environment loop - the learning algorithm receives observations/sensations and rewards from the world (its state) and selects actions according to its policy.]

  • Independent learners act ignoring the existence of others

  • Stationary environment

  • Learn a policy that maximizes individual utility ("trial and error")

  • Perform their actions, obtain a reward and update their Q-values without regard to the actions performed by others

R. S. Sutton, 1997


Markov Decision Processes / MDP Framework

[Figure: MDP trajectory - starting in state s_t, the agent takes actions a_t, a_{t+1}, ..., receives rewards r_{t+1}, r_{t+2}, ..., and moves through states s_{t+1}, s_{t+2}, ... until the final step t+f.]

The environment is modeled as an MDP, defined by (S, A, R, T):

S – finite set of states of the environment

A(s) – set of actions possible in state s ∈ S

T: S×A → Π(S) – transition function mapping each state-action pair to a probability distribution over states

R(s, s', a) – expected reward on the transition from s to s' under action a

P(s, s', a) – probability of the transition from s to s' under action a

γ – discount rate for delayed reward

At each discrete time step t = 0, 1, 2, ..., the agent:

  • observes state s_t ∈ S

  • chooses action a_t ∈ A(s_t)

  • receives immediate reward r_t

  • the state changes to s_{t+1}

T. M. Mitchell, 1997
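To make the tuple above concrete, the following is a minimal container for a tabular MDP; the class and field names are illustrative assumptions, not notation from the paper or the slides.

```python
# A minimal sketch of the MDP tuple (S, A, T, R, gamma) defined above.
# Tabular dictionaries are an assumed representation, for illustration only.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                               # S: finite set of states
    actions: Dict[str, List[str]]                   # A(s): actions available in state s
    transition: Dict[Tuple[str, str, str], float]   # P(s, s', a): transition probability
    reward: Dict[Tuple[str, str, str], float]       # R(s, s', a): expected reward on s -> s'
    gamma: float = 0.9                              # discount rate for delayed reward
```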


Agent’s learning task – find optimal action selection policy

Find a policy π: s ∈ S → a ∈ A(s) that maximizes the value (expected future reward) of each state s:

V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … | s_t = s, π }

and of each state-action pair (s, a):

Q^π(s, a) = E{ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … | s_t = s, a_t = a, π }

Execute actions in the environment, observe the results, and learn to construct an optimal action selection policy that maximizes the agent's performance: the long-term total discounted reward.

T. M. Mitchell, 1997


Agent’s Learning Strategy – Q-Learning method

  • Q-function: iterative approximation of Q-values with learning rate β, 0 ≤ β < 1

  • Q-Learning incremental process

    • Observe the current state s

    • Select an action a with probability given by the employed selection policy

    • Observe the new state s′

    • Receive a reward r from the environment

    • Update the corresponding Q-value for action a and state s

    • Terminate the current trial if the new state s′ satisfies a terminal condition; otherwise let s′ → s and go back to step 1 (see the sketch below)
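A minimal tabular sketch of the loop above, assuming an ε-greedy selection policy (the slides do not fix one); the variable names and constants are illustrative.

```python
import random
from collections import defaultdict

Q = defaultdict(float)                  # Q[(s, a)] -> current estimate of long-term value
beta, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount rate, exploration rate

def choose_action(s, actions):
    # Epsilon-greedy selection policy (an assumption; any selection policy could be used).
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(s, a, r, s_next, actions_next):
    # Q(s,a) <- (1 - beta) * Q(s,a) + beta * (r + gamma * max_a' Q(s',a'))
    best_next = max(Q[(s_next, a2)] for a2 in actions_next)
    Q[(s, a)] = (1 - beta) * Q[(s, a)] + beta * (r + gamma * best_next)
```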


Multi-agent Framework

  • Learning in multi-agent setting

    • all agents simultaneously learning

    • environment not stationary (other agents are evolving)

    • problem of a “moving target”


Stochastic Game Framework for addressing MAL

From the perspective of sequential decision making:

  • Markov decision processes

    • one decision maker

    • multiple states

  • Repeated games

    • multiple decision makers

    • one state

  • Stochastic games (Markov games)

    • extension of MDPs to multiple decision makers

    • multiple states


Stochastic Game / Notation

[Figure: a stochastic game as a collection of matrix-game states s, with joint actions (a1, a2), per-player reward matrices R1(s,a), R2(s,a), ..., and transition probabilities T(s,a,s').]

S: Set of states (n-agent stage games)

Ri(s,a): Reward to player i in state s under joint action a

T(s,a,s): Probability of transition from s to state s on a

From dynamic programming approach:

Qi(s,a): Long-run payoff to i from s on athen equilibrium
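A minimal sketch of the notation above for a two-player stochastic game, parallel to the MDP container shown earlier; the dictionary layout is an illustrative assumption rather than the paper's formalism.

```python
# A minimal container for a stochastic game: an MDP with one action per player
# and one reward function per player.
from dataclasses import dataclass
from typing import Dict, List, Tuple

JointAction = Tuple[str, str]   # (a1, a2): one action per player

@dataclass
class StochasticGame:
    states: List[str]                                           # S: set of stage games
    rewards: Dict[Tuple[str, JointAction], Tuple[float, ...]]   # R_i(s, a) for each player i
    transitions: Dict[Tuple[str, JointAction, str], float]      # T(s, a, s'): transition probability
    gamma: float = 0.9                                          # discount for the long-run payoff Q_i
```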


Approach

Multiagent learning using a variable learning rate


Evaluation criteria for multi-agent learning

  • Use of convergence to NE is problematic:

    • Terminating criterion: Equilibrium identifies conditions under which learning can or should stop

    • It is easier to keep playing the equilibrium than to continue computation

    • A Nash equilibrium strategy has no “prescriptive force”: it does not say anything about how to play prior to termination

    • Multiple potential equilibria

    • The opponent may not wish to play an equilibrium

    • Calculating a Nash Equilibrium can be intractable for large games

  • New criteria: rationality and convergence in self-play

    • Converge to stationary policy: not necessarily Nash

    • Learning only terminates once a best response to the play of the other agents is found

    • In self-play, learning terminates only at a stationary Nash equilibrium


Contributions and Assumptions

  • Contributions:

    • Criterion for multi-agent learning algorithms

    • A simple Q-learning algorithm that can play mixed strategies

    • The WoLF-PHC algorithm (Win or Learn Fast Policy Hill Climbing)

  • Assumptions - both properties are obtained given that:

    • The game is two-player, two-action

    • Players can observe each other’s mixed strategies (not just the played action)

    • Can use infinitesimally small step sizes


Opponent Modeling or Joint-Action Learners

C. Claus, C. Boutilier, 1998


Joint-Action Learners Method

  • Maintains an explicit model of the opponents for each state.

  • Q-values are maintained for all possible joint actions at a given state

  • The key assumption is that the opponent is stationary

  • Thus, the model of the opponent is simply frequencies of actions played in the past

  • The probability of the opponent playing action a−i is estimated as C(a−i) / n(s),

  • where C(a−i) is the number of times the opponent has played action a−i,

  • and n(s) is the number of times state s has been visited (a small sketch follows below).
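A small sketch of the opponent model described above: empirical action frequencies give Pr(a−i) = C(a−i)/n(s), which can then weight the joint-action Q-values. Function and variable names are illustrative.

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[s][a_opp]: C(a-i) per state

def observe(s, a_opponent):
    counts[s][a_opponent] += 1                   # record the opponent's played action

def opponent_probability(s, a_opponent):
    n_s = sum(counts[s].values())                # n(s): number of visits to state s
    return counts[s][a_opponent] / n_s if n_s else 0.0

def expected_value(Q, s, a_own, opponent_actions):
    # EV of our action = sum over opponent actions of Q(s, (a_own, a_opp)) * Pr(a_opp)
    return sum(Q[(s, (a_own, a_opp))] * opponent_probability(s, a_opp)
               for a_opp in opponent_actions)
```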



WoLF Principles

  • The idea is to use two different strategy update steps, one for winning and another for losing situations

  • “Win or Learn Fast”: the agent reduces its learning rate when performing well and increases it when doing badly. This improves the convergence of IGA and policy hill climbing

  • To distinguish between those situations, the player keeps track of two policies (the current policy and an average policy)

  • The agent is considered to be winning if the expected utility of its current policy is greater than the expected utility of the equilibrium (or average) policy

  • If winning, the agent chooses the smaller of the two strategy update steps (see the sketch below)
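The core WoLF rule from the last two bullets can be written as a one-line choice between two step sizes; the concrete values of δ_win and δ_lose here are illustrative assumptions.

```python
def wolf_step_size(expected_current, expected_average,
                   delta_win=0.01, delta_lose=0.04):
    # Winning: the current policy is doing better than the average (or equilibrium)
    # policy, so learn cautiously; otherwise learn fast (delta_win < delta_lose).
    winning = expected_current > expected_average
    return delta_win if winning else delta_lose
```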


Incremental Gradient Ascent Learners (IGA)

  • IGA:

    • incrementally climbs on the mixed strategy space

    • for 2-player 2-action general sum games

    • guarantees convergence either to a Nash equilibrium or to an average payoff that is sustained by some Nash equilibrium

  • WoLF IGA:

    • based on WoLF principle

    • is guaranteed to converge to a Nash equilibrium in all two-player, two-action general-sum games (a single gradient step is sketched below)
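A sketch of one WoLF-IGA step for a two-player, two-action game: gradient ascent on the probability p of playing the first action, with the step size chosen by comparing the current strategy's expected payoff to that of an equilibrium strategy p_eq (assumed known, as in the paper's analysis). The payoff-matrix layout and the step sizes are illustrative assumptions.

```python
def wolf_iga_step(p, q, r, p_eq, eta_win=0.001, eta_lose=0.004):
    """One gradient step on p = Pr(we play action 0); q = Pr(opponent plays action 0).
    r[i][j] is our payoff when we play i and the opponent plays j."""
    def value(x):
        # Expected payoff of playing action 0 with probability x against opponent mix q.
        return (x * q * r[0][0] + x * (1 - q) * r[0][1]
                + (1 - x) * q * r[1][0] + (1 - x) * (1 - q) * r[1][1])

    # Gradient of the expected payoff with respect to p.
    grad = q * (r[0][0] - r[1][0]) + (1 - q) * (r[0][1] - r[1][1])
    # WoLF: small step when winning (doing better than the equilibrium strategy), large when losing.
    eta = eta_win if value(p) > value(p_eq) else eta_lose
    return min(1.0, max(0.0, p + eta * grad))    # keep p a valid probability
```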



Simple Q-Learner that plays mixed strategies

Properties:

  • guarantees rationality against stationary opponents

  • but does not converge in self-play (the main problem)

The mixed strategy is updated by giving more weight to the action that Q-learning currently believes is best (a minimal sketch follows below).
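A minimal sketch of that hill-climbing update for one state, assuming a tabular policy and Q-table; the fixed step size δ and the simple mass-shifting scheme are illustrative.

```python
def phc_policy_update(pi, Q_s, delta=0.05):
    """pi: action -> probability for the current state; Q_s: action -> Q-value."""
    best = max(Q_s, key=Q_s.get)                      # action Q currently believes is best
    for a in list(pi):
        if a != best:
            step = min(pi[a], delta / (len(pi) - 1))  # never remove more mass than is there
            pi[a] -= step
            pi[best] += step                          # shift the mass toward the greedy action
    return pi
```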


WoLF Policy Hill Climbing algorithm

  • the agent only needs to observe its own payoff

  • converges for two-player, two-action stochastic games in self-play

[Algorithm figure: maintaining the average policy; the probability of playing each action; determination of “W” (winning) and “L” (losing) by comparing the expected value of the current policy to that of the average policy. A full update sketch follows below.]
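A compact sketch of one WoLF-PHC update for a single state, combining the elements above (Q update, average-policy update, win/lose test, constrained hill climb); the data layout and step sizes are illustrative assumptions rather than the paper's exact pseudocode.

```python
def wolf_phc_update(Q_s, pi, pi_avg, visit_count, a, r, max_Q_next,
                    beta=0.1, gamma=0.9, delta_w=0.01, delta_l=0.04):
    """Q_s, pi, pi_avg: per-state dicts action -> Q-value / probability."""
    # 1. Q-learning update for the action a actually taken.
    Q_s[a] = (1 - beta) * Q_s[a] + beta * (r + gamma * max_Q_next)

    # 2. Move the average policy a little toward the current policy.
    visit_count += 1
    for a2 in pi:
        pi_avg[a2] += (pi[a2] - pi_avg[a2]) / visit_count

    # 3. Winning iff the current policy's expected value beats the average policy's.
    winning = (sum(pi[a2] * Q_s[a2] for a2 in pi)
               > sum(pi_avg[a2] * Q_s[a2] for a2 in pi))
    delta = delta_w if winning else delta_l   # learn slowly when winning, fast when losing

    # 4. Shift probability mass from the other actions toward the greedy action,
    #    never removing more mass than an action has (keeps pi a distribution).
    best = max(Q_s, key=Q_s.get)
    for a2 in list(pi):
        if a2 != best:
            step = min(pi[a2], delta / (len(pi) - 1))
            pi[a2] -= step
            pi[best] += step
    return visit_count
```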


Theoretical analysis

Analysis of the replicator dynamics


Replicator Dynamics – Simplification Case

Best response dynamics for Paper-Rock-Scissors

The dynamics produce a circular shift: each agent's policy cycles in response to the other's, and the average reward cycles with it rather than converging.
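As a quick way to reproduce the cycling behaviour discussed here, the following is a small numerical sketch (not from the paper) of replicator dynamics for rock-paper-scissors with a standard zero-sum payoff matrix.

```python
import numpy as np

# Zero-sum payoff matrix for rock-paper-scissors (rows: our action, columns: opponent's).
A = np.array([[ 0., -1.,  1.],   # rock
              [ 1.,  0., -1.],   # paper
              [-1.,  1.,  0.]])  # scissors

def replicator_step(x, dt=0.01):
    fitness = A @ x                          # payoff of each pure strategy against mix x
    avg = x @ fitness                        # population-average payoff
    return x + dt * x * (fitness - avg)      # dx_i/dt = x_i * (f_i - f_avg)

x = np.array([0.5, 0.3, 0.2])                # initial mixed strategy
for _ in range(2000):
    x = replicator_step(x)
    x = np.clip(x, 0.0, None); x /= x.sum()  # numerical safeguard: keep x a distribution
# x keeps cycling around the mixed equilibrium (1/3, 1/3, 1/3) instead of converging.
```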


A winning strategy against PHC

[Figure: trajectory in the joint strategy space; x-axis: probability we play heads, y-axis: probability the opponent plays heads. If winning, play probability 1 for the currently preferred action in order to maximize rewards while winning. If losing, play a deceiving policy until we are ready to take advantage of the opponent again.]


Ideally we’d like to see this:

[Figure: the desired trajectory through the winning and losing regions of the joint strategy space.]



Convergence dynamics of strategies

  • Incremental Gradient Ascent (IGA):

  • It again performs a myopic adaptation to the other players’ current strategies.

    • It either converges to a Nash fixed point on the boundary (at least one pure strategy), or exhibits limit cycles

    • Varying the learning rate allows play to remain optimal while satisfying both properties (rationality and convergence)



Experimental testbeds

  • Matrix Games

    • Matching pennies

    • Three-player matching pennies

    • Rock-paper-scissors

  • Gridworld

  • Soccer
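For concreteness, the matching-pennies stage game used as the simplest matrix-game testbed can be written as a joint-action payoff table; the dictionary layout below is an illustrative choice.

```python
# Matching pennies: the row player wins +1 when the pennies match and loses 1 otherwise.
MATCHING_PENNIES = {
    ("heads", "heads"): (+1, -1),
    ("heads", "tails"): (-1, +1),
    ("tails", "heads"): (-1, +1),
    ("tails", "tails"): (+1, -1),
}
# The only Nash equilibrium is mixed: both players play heads and tails with probability 0.5.
```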





Summary and Conclusion

  • Criterion for multi-agent learning algorithms: rationality and convergence

  • A simple Q-learning algorithm that can play mixed strategies

  • The WoLF-PHC algorithm (Win or Learn Fast Policy Hill Climbing), which satisfies rationality and convergence


Disadvantages

  • The formal analysis covers only two-player, two-action games; there is also the issue of pseudoconvergence

  • No guaranteed avoidance of exploitation:

    • there is no guarantee that the learner cannot be deceptively exploited by another agent

    • Chang and Kaelbling (2001) demonstrated that the best-response learner PHC (Bowling & Veloso, 2002) could be exploited by a particular dynamic strategy.



Future Work by Authors

  • Exploring learning outside of self-play: whether WoLF techniques can be exploited by a malicious (not rational) “learner”.

  • Scaling to large problems: combining single-agent scaling solutions (function approximators and parameterized policies) with the concepts of a variable learning rate and WoLF.

  • Online learning

  • Other algorithms by the authors:

    • GIGA-WoLF, normal form games


Discussion / Open Questions

  • Investigating other evaluation criteria:

    • No-regret criteria

    • Negative non-convergence regret (NNR)

    • Fast reaction (tracking) [Jensen]

    • Performance: maximum time for reaching a desired performance level

  • Incorporating more algorithms into testing: a deeper comparison with both simpler and more complex algorithms (e.g. AWESOME [Conitzer and Sandholm 2003])

  • Classification of situations (games) with various values of the delta and alpha variables: what values are good in what situations.

  • Extending work to have more players.

  • Online learning and exploration policy in stochastic games (trade-off)

  • Currently the formalism is presented in a two-dimensional state space: is there a possibility of extending the formal model (geometrically?)?

  • What makes Minimax-Q irrational?

  • Application of WoLF to multi-agent evolutionary algorithms (e.g. to control the mutation rate) or to learning of neural networks (e.g. to determine a winner neuron)?

  • Connection with control theory and learning of Complex Adaptive Systems: manifold-adaptive learning?


Questions

Thank you

