
Multiagent learning using a variable learning rate


### Multiagent learning using a variable learning rate

### Introduction

### Preliminaries

### Approach

### Theoretical analysis

### Questions

Igor Kiselev, University of Waterloo

M. Bowling and M. Veloso. Artificial Intelligence, Vol. 136, 2002, pp. 215–250.

Agenda

- Introduction
  - Motivation to multi-agent learning
  - MDP framework
  - Stochastic game framework
  - Reinforcement learning: single-agent, multi-agent
  - Related work
- Multiagent learning with a variable learning rate
  - Theoretical analysis of the replicator dynamics
  - WoLF Incremental Gradient Ascent algorithm
  - WoLF Policy Hill-Climbing algorithm
- Results
- Concluding remarks

Motivation to multi-agent learning

MAL is a Challenging and Interesting Task

- The research goal is to enable an agent to learn effectively to act (cooperate, compete) in the presence of other learning agents in complex domains.
- Equipping a multi-agent system (MAS) with learning capabilities lets agents deal with large, open, dynamic, and unpredictable environments.
- Multi-agent learning (MAL) is a challenging problem for developing intelligent systems.
- Multiagent environments are non-stationary, violating the traditional assumption underlying single-agent learning.

Reinforcement Learning Papers: Statistics

[Figure: publication statistics for reinforcement-learning papers; source: Google Scholar.]

Various Approaches to Learning / Related Work

Y. Shoham et al., 2003

MDP and Stochastic Game Frameworks

[Diagram: the learning algorithm (agent) receives observations/sensations of the world state and selects actions according to its policy.]

Single-agent Reinforcement Learning

- Independent learners act ignoring the existence of others
- Stationary environment
- Learn a policy that maximizes individual utility ("trial and error")
- Agents perform their actions, obtain a reward, and update their Q-values without regard to the actions performed by others

R. S. Sutton, 1997

[Diagram: agent–environment interaction over time — at each step the agent in state s_t takes action a_t, receives reward r_{t+1}, and the environment moves to state s_{t+1}; the episode continues until a terminal time t+f.]

Markov Decision Processes / MDP Framework

The environment is modeled as an MDP, defined by (S, A, R, T):

S – finite set of states of the environment

A(s) – set of actions possible in state s ∈ S

T: S × A → P – transition function mapping state–action pairs to a probability distribution over next states

R(s, s′, a) – expected reward on the transition from s to s′ under action a

P(s, s′, a) – probability of the transition from s to s′ under action a

γ – discount rate for delayed reward

At each discrete time t = 0, 1, 2, . . . the agent:

- observes state s_t ∈ S
- chooses action a_t ∈ A(s_t)
- receives immediate reward r_{t+1}
- the state changes to s_{t+1}

T. M. Mitchell, 1997
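For concreteness only, the (S, A, R, T) tuple above could be held in a small tabular container like the following sketch; the field names and types are my own illustration, not part of the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = int
Action = int

@dataclass
class TabularMDP:
    """Minimal container for the (S, A, R, T, gamma) tuple; names are illustrative."""
    states: List[State]                                    # S
    actions: Dict[State, List[Action]]                     # A(s)
    reward: Dict[Tuple[State, State, Action], float]       # R(s, s', a)
    transition: Dict[Tuple[State, State, Action], float]   # P(s, s', a)
    gamma: float = 0.9                                      # discount rate
```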

Find a policy π : s ∈ S → a ∈ A(s) that maximizes the value (expected future discounted reward) of each state s:

$$V^{\pi}(s) = E\{\, r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \mid s_t = s, \pi \,\}$$

and of each state–action pair (s, a):

$$Q^{\pi}(s,a) = E\{\, r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \mid s_t = s, a_t = a, \pi \,\}$$

Agent's learning task – find an optimal action-selection policy

Execute actions in the environment, observe the results, and learn to construct an optimal action-selection policy that maximizes the agent's performance: the long-term total discounted reward.

T. M. Mitchell, 1997

Agent’s Learning Strategy – Q-Learning method

- Q-function: iterative approximation of Q-values with learning rate β, 0 ≤ β < 1
- Q-learning incremental process (a minimal sketch follows this list):
  1. Observe the current state s
  2. Select an action a with probability based on the employed selection policy
  3. Observe the new state s′
  4. Receive a reward r from the environment
  5. Update the corresponding Q-value for action a in state s
  6. Terminate the current trial if the new state s′ satisfies a terminal condition; otherwise let s′ → s and go back to step 1
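A minimal sketch of one Q-learning trial following the steps above, assuming a tabular Q stored in a dictionary and a hypothetical environment object exposing reset(), actions(s), and step(a); ε-greedy selection is one possible choice for step 2 and is not prescribed by the slides.

```python
import random
from collections import defaultdict

def q_learning_trial(env, Q, beta=0.1, gamma=0.9, epsilon=0.1):
    """Run one trial (episode) of tabular Q-learning.
    `env` is an assumed helper exposing reset(), actions(s), and step(a) -> (s', r, done);
    Q can be a defaultdict(float) keyed by (state, action)."""
    s = env.reset()                                   # step 1: observe the current state
    while True:
        if random.random() < epsilon:                 # step 2: epsilon-greedy selection
            a = random.choice(env.actions(s))
        else:
            a = max(env.actions(s), key=lambda act: Q[(s, act)])
        s_next, r, done = env.step(a)                 # steps 3-4: observe s', receive r
        best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions(s_next))
        # step 5: Q-value update with learning rate beta
        Q[(s, a)] += beta * (r + gamma * best_next - Q[(s, a)])
        if done:                                      # step 6: terminal condition reached
            return Q
        s = s_next                                    # otherwise let s' -> s and repeat

# Example usage (my_env is a hypothetical environment object):
# Q = q_learning_trial(my_env, defaultdict(float))
```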

Multi-agent Framework

- Learning in a multi-agent setting:
  - all agents learn simultaneously
  - the environment is not stationary (the other agents are evolving)
  - problem of a "moving target"

Stochastic Game Framework for addressing MAL

From the perspective of sequential decision making:

- Markov decision processes
- one decision maker
- multiple states

- Repeated games
- multiple decision makers
- one state

- Stochastic games (Markov games)
- extension of MDPs to multiple decision makers
- multiple states

[Diagram: a stochastic-game transition — in state s the players choose a joint action (a1, a2), each player i receives reward Ri(s, a), and the game moves to the next state s′ with probability T(s, a, s′).]

Stochastic Game / Notation

S: set of states (n-agent stage games)

Ri(s, a): reward to player i in state s under joint action a

T(s, a, s′): probability of transition from state s to state s′ under joint action a

From the dynamic-programming approach:

Qi(s, a): long-run payoff to player i for joint action a in state s, followed by equilibrium play

Multiagent learning using a variable learning rate

Evaluation criteria for multi-agent learning

- Use of convergence to a Nash equilibrium (NE) as a criterion is problematic:
  - Terminating criterion: equilibrium identifies conditions under which learning can or should stop
  - It is easier to play an equilibrium than to continue computation
  - A Nash equilibrium strategy has no "prescriptive force": it says nothing about how to play prior to termination
  - There may be multiple potential equilibria
  - The opponent may not wish to play an equilibrium
  - Calculating a Nash equilibrium can be intractable for large games

- New criteria: rationality and convergence in self-play
  - Converge to a stationary policy: not necessarily Nash
  - Learning only terminates once a best response to the play of the other agents is found
  - In self-play, learning therefore only terminates at a stationary Nash equilibrium

Contributions and Assumptions

- Contributions:
  - A criterion for multi-agent learning algorithms
  - A simple Q-learning algorithm that can play mixed strategies
  - WoLF-PHC (Win or Learn Fast Policy Hill-Climbing)

- Assumptions – the algorithm achieves both properties given that:
  - The game is two-player, two-action
  - Players can observe each other's mixed strategies (not just the actions played)
  - Infinitesimally small step sizes can be used

Opponent Modeling or Joint-Action Learners

C. Claus, C. Boutilier, 1998

Joint-Action Learners Method

- Maintains an explicit model of the opponents for each state
- Q-values are maintained for all possible joint actions at a given state
- The key assumption is that the opponent is stationary
- Thus, the model of the opponent is simply the frequencies of actions played in the past
- Probability that the opponent plays action a−i: Pr(a−i) = C(a−i) / n(s) (a rough sketch follows this list)
  - where C(a−i) is the number of times the opponent has played action a−i
  - and n(s) is the number of times state s has been visited
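A rough sketch of this frequency-based opponent model and of the expected value of an action under a joint-action Q-table; the class and function names are illustrative and not from Claus and Boutilier.

```python
from collections import defaultdict

class OpponentModel:
    """Empirical frequency model of the opponent's play in each state."""
    def __init__(self):
        self.counts = defaultdict(int)   # C(s, a_-i): times the opponent played a_-i in s
        self.visits = defaultdict(int)   # n(s): times state s has been visited

    def observe(self, s, a_opp):
        self.counts[(s, a_opp)] += 1
        self.visits[s] += 1

    def prob(self, s, a_opp):
        """Pr(a_-i | s) = C(a_-i) / n(s)."""
        return self.counts[(s, a_opp)] / self.visits[s] if self.visits[s] else 0.0

def expected_value(Q, model, s, own_action, opp_actions):
    """Expected payoff of own_action against the modelled opponent:
    EV(s, a_i) = sum over a_-i of Pr(a_-i | s) * Q(s, a_i, a_-i)."""
    return sum(model.prob(s, a_opp) * Q[(s, own_action, a_opp)]
               for a_opp in opp_actions)
```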

WoLF Principles

- The idea is to use two different strategy-update step sizes, one for winning and one for losing situations
- "Win or Learn Fast": the agent reduces its learning rate when performing well and increases it when doing badly; this improves the convergence of IGA and policy hill-climbing
- To distinguish between these situations, the player keeps track of two policies
- The agent is considered to be winning if the expected utility of its actual policy is greater than the expected utility of the equilibrium (or average) policy
- If winning, the agent chooses the smaller of the two strategy-update steps

Incremental Gradient Ascent Learners (IGA)

- IGA:
  - incrementally climbs in the mixed-strategy space
  - for two-player, two-action general-sum games
  - guarantees convergence either to a Nash equilibrium or to an average payoff that is sustained by some Nash equilibrium

- WoLF-IGA (a sketch of the update follows this list):
  - based on the WoLF principle
  - guarantees convergence to a Nash equilibrium for all two-player, two-action general-sum games
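A minimal sketch of the variable-learning-rate gradient update for player 1 in a two-player, two-action game. The convergence result assumes infinitesimally small step sizes, so the concrete eta and learning-rate values here are illustrative only.

```python
import numpy as np

def wolf_iga_step(p, q, R1, p_eq, eta=0.001, l_win=0.5, l_lose=2.0):
    """One WoLF-IGA update of player 1's mixed strategy in a 2x2 game.
    p, q: probabilities that player 1 / player 2 play their first action;
    R1: 2x2 payoff matrix for player 1; p_eq: player 1's equilibrium strategy."""
    def V(p_):  # expected payoff to player 1 when playing the first action w.p. p_
        return np.array([p_, 1.0 - p_]) @ R1 @ np.array([q, 1.0 - q])
    # gradient of V with respect to player 1's own probability p
    grad = (R1[0, 0] - R1[1, 0]) * q + (R1[0, 1] - R1[1, 1]) * (1.0 - q)
    # Win or Learn Fast: small step when winning, large step when losing
    lr = l_win if V(p) > V(p_eq) else l_lose
    return float(np.clip(p + eta * lr * grad, 0.0, 1.0))
```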

Simple Q-Learner that plays mixed strategies

The algorithm updates a mixed strategy by giving more weight to the action that Q-learning currently believes is best (a minimal sketch of this step follows below).

Problems:

- it guarantees rationality against stationary opponents, but
- it does not converge in self-play
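A minimal sketch of such a hill-climbing policy update; the dictionary-based policy and Q-table are my own choice of data structure.

```python
def phc_policy_update(pi, Q, s, actions, delta=0.1):
    """Hill-climb the mixed policy pi toward the action Q believes is best.
    pi and Q are dicts keyed by (state, action); delta is the policy step size."""
    best = max(actions, key=lambda a: Q[(s, a)])
    for a in actions:
        if a == best:
            continue
        # move at most delta/(|A|-1) probability mass from each non-best action
        step = min(pi[(s, a)], delta / (len(actions) - 1))
        pi[(s, a)] -= step
        pi[(s, best)] += step
    return pi
```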

WoLF Policy Hill-Climbing algorithm

- The agent only needs to see its own payoffs
- Converges in self-play for two-player, two-action stochastic games
- Maintains an average policy in addition to the current policy
- The current policy gives the probability of playing each action
- Determination of "W" (win) and "L" (lose): compare the expected value of the current policy with that of the average policy (a sketch follows)
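Putting the pieces together, a sketch of the WoLF-PHC policy step: update the average policy, decide winning or losing by comparing expected values, then hill-climb with the corresponding step size. The data structures and parameter values are illustrative, not the paper's notation.

```python
def wolf_phc_update(pi, pi_avg, Q, counts, s, actions,
                    delta_win=0.05, delta_lose=0.2):
    """One WoLF-PHC policy step for state s.
    pi, pi_avg, Q are dicts keyed by (state, action); counts[s] tracks visits."""
    counts[s] += 1
    for a in actions:                       # incremental update of the average policy
        pi_avg[(s, a)] += (pi[(s, a)] - pi_avg[(s, a)]) / counts[s]
    v_current = sum(pi[(s, a)] * Q[(s, a)] for a in actions)
    v_average = sum(pi_avg[(s, a)] * Q[(s, a)] for a in actions)
    # Win or Learn Fast: small step when the current policy outperforms the average
    delta = delta_win if v_current > v_average else delta_lose
    best = max(actions, key=lambda a: Q[(s, a)])
    for a in actions:                       # same hill-climbing step as PHC
        if a == best:
            continue
        step = min(pi[(s, a)], delta / (len(actions) - 1))
        pi[(s, a)] -= step
        pi[(s, best)] += step
    return pi, pi_avg
```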

Analysis of the replicator dynamics

Replicator Dynamics – Simplification Case

Best response dynamics for Paper-Rock-Scissors

Circular shift from one agent’s policy to the other’s average reward
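As a numerical companion to this analysis, a small Euler-integration sketch of replicator dynamics in rock–paper–scissors; the payoff matrix, step size, and starting point are my own assumptions, whereas the paper analyses the continuous-time dynamics analytically.

```python
import numpy as np

# Standard rock-paper-scissors payoff matrix for the row player (an assumption).
A = np.array([[ 0.0, -1.0,  1.0],
              [ 1.0,  0.0, -1.0],
              [-1.0,  1.0,  0.0]])

def replicator_step(x, dt=0.01):
    """One Euler step of the replicator equation dx_i/dt = x_i [(Ax)_i - x.A.x]."""
    fitness = A @ x              # payoff of each pure strategy against the mix x
    average = x @ fitness        # average payoff of the population
    return x + dt * x * (fitness - average)

# A trajectory started away from (1/3, 1/3, 1/3) cycles around that mixed equilibrium.
x = np.array([0.5, 0.3, 0.2])
for _ in range(2000):
    x = replicator_step(x)
```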

A winning strategy against PHC

- If winning: play probability 1 on the currently preferred action, in order to maximize reward while winning
- If losing: play a deceiving policy until we are ready to take advantage of the opponent again

[Figure: trajectory in the matching-pennies strategy space — probability we play heads (horizontal axis, 0 to 1) against probability the opponent plays heads (vertical axis, 0 to 1).]

Convergence dynamics of strategies

- Iterated Gradient Ascent:
  - again performs a myopic adaptation to the other players' current strategies
  - either converges to a Nash fixed point on the boundary (with at least one pure strategy) or falls into a limit cycle
- Varying the learning rate allows the learner to be optimal while satisfying both properties

Experimental testbeds

- Matrix Games
- Matching pennies
- Three-player matching pennies
- Rock-paper-scissors

- Gridworld
- Soccer

Summary and Conclusion

- Criteria for multi-agent learning algorithms: rationality and convergence
- A simple Q-learning algorithm that can play mixed strategies
- WoLF-PHC (Win or Learn Fast Policy Hill-Climbing), which satisfies rationality and convergence

Disadvantages

- The analysis is limited to two-player, two-action games: pseudoconvergence
- Avoidance of exploitation:
  - guaranteeing that the learner cannot be deceptively exploited by another agent
  - Chang and Kaelbling (2001) demonstrated that the best-response learner PHC (Bowling & Veloso, 2002) could be exploited by a particular dynamic strategy

Future Work by Authors

- Exploring learning outside of self-play: whether WoLF techniques can be exploited by a malicious (not rational) "learner"
- Scaling to large problems: combining single-agent scaling solutions (function approximators and parameterized policies) with the concepts of a variable learning rate and WoLF
- Online learning
- Other algorithms by the authors:
  - GIGA-WoLF, for normal-form games

Discussion / Open Questions

- Investigating other evaluation criteria:
  - No-regret criteria
  - Negative non-convergence regret (NNR)
  - Fast reaction (tracking) [Jensen]
  - Performance: maximum time to reach a desired performance level

- Incorporating more algorithms into testing: a deeper comparison with both simpler and more complex algorithms (e.g. AWESOME [Conitzer and Sandholm 2003])
- Classification of situations (games) by the values of the delta and alpha parameters: which values are good in which situations
- Extending the work to more players
- Online learning and the exploration policy in stochastic games (trade-off)
- The formalism is currently presented for a two-dimensional state space: is there a possibility of extending the formal model (geometrically?)
- What makes Minimax-Q irrational?
- Application of WoLF to multi-agent evolutionary algorithms (e.g. to control the mutation rate) or to the learning of neural networks (e.g. to determine a winner neuron)?
- Connection with control theory and the learning of complex adaptive systems: manifold-adaptive learning?

Thank you
