
Regret Minimization in Stochastic Games

Shie Mannor and Nahum Shimkin

Technion, Israel Institute of Technology

Dept. of Electrical Engineering

UAI 2000

Introduction

- Modeling a dynamic decision process as a stochastic game:
- Non-stationarity of the environment
- Environments are not (necessarily) hostile
- We look for the best possible strategy in light of the environment’s actions.


Repeated Matrix Games

- The sets of single-stage (mixed) strategies P and Q are simplices.
- Rewards are defined by a reward matrix G: r(p,q) = pGq
- Reward criterion: the average reward, which need not converge since stationarity is not assumed.
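As a concrete illustration (not part of the original slides), the single-stage reward r(p,q) = pGq can be sketched in Python; the matrix below is an arbitrary example, not one from the talk:

```python
def single_stage_reward(p, q, G):
    """Expected reward r(p, q) = p G q when P1 plays mixture p and P2 plays q."""
    return sum(p[i] * G[i][j] * q[j]
               for i in range(len(p))
               for j in range(len(q)))

# Matching pennies as an illustrative reward matrix (an assumption):
G = [[1, -1],
     [-1, 1]]

print(single_stage_reward([0.5, 0.5], [0.5, 0.5], G))  # 0.0
```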


Regret for Repeated Matrix Games

- Suppose that by time t the average reward is r̄_t and the opponent’s empirical strategy is q_t.
- The regret is defined as:

R_t = r*(q_t) − r̄_t , where r*(q) = max_{p∈P} r(p,q) is the Bayes reward against q.

- A policy is called regret minimizing if:

lim sup_{t→∞} R_t ≤ 0 almost surely, for every strategy of the opponent.
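A minimal sketch of the regret quantity (function names and the example matrix are assumptions): the Bayes reward against the opponent’s empirical mixture is compared with the realized average reward.

```python
def best_response_value(q, G):
    """Bayes reward r*(q) = max_p p G q; the max is attained at a pure row of G."""
    return max(sum(G[i][j] * q[j] for j in range(len(q)))
               for i in range(len(G)))

def regret(avg_reward, q_emp, G):
    """Regret by time t: Bayes reward against the empirical q minus the average reward."""
    return best_response_value(q_emp, G) - avg_reward

G = [[1, -1], [-1, 1]]              # illustrative matrix, not from the talk
print(regret(0.0, [0.5, 0.5], G))   # 0.0: no pure row beats 0 against uniform q
```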


Regret minimization for repeated matrix games

- Such policies do exist (Hannan, 1956)
- A proof using approachability theory (Blackwell, 1956)
- Also for games with partial observation (Auer et al., 1995; Rustichini, 1999)


Stochastic Games

- Formal Model:

S = {1,…,s} - state space

A = A(s) - actions of the regret-minimizing player, P1

B = B(s) - actions of the “environment”, P2

r - reward function, r(s,a,b)

P - transition kernel, P(s'|s,a,b)

- The expected average reward for p∈P, q∈Q is r(p,q)
- A single-state recurrence assumption is made.
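A sketch of a single stage of this model (the dictionary layout and the tiny example are assumptions): r maps (s,a,b) to a reward and P maps (s,a,b) to a next-state distribution.

```python
import random

def step(s, a, b, r, P, rng=random.random):
    """Play (a, b) in state s; return (reward, next_state) sampled from P(.|s,a,b)."""
    reward = r[(s, a, b)]
    dist = P[(s, a, b)]                 # dict: next state -> probability
    u, acc = rng(), 0.0
    for s_next, prob in dist.items():
        acc += prob
        if u <= acc:
            return reward, s_next
    return reward, s_next               # guard against floating-point rounding

# Tiny two-state example (illustrative only):
r = {(0, 0, 0): 1.0}
P = {(0, 0, 0): {0: 0.3, 1: 0.7}}
print(step(0, 0, 0, r, P, rng=lambda: 0.1))  # (1.0, 0)
```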


Bayes Reward in Strategy Space

- For every stationary strategy q∈Q, the Bayes reward is defined as r*(q) = max_{p∈P} r(p,q).
- Problems:
- P2’s strategy is not completely observed
- P1’s observations may depend on the strategies of both players


Bayes Reward in State-Action Space

- Let p_sb be the observed frequency of P2’s action b in state s.
- A natural estimate of q is the empirical conditional distribution: q̂(b|s) = p_sb / Σ_{b'} p_sb'.

The associated Bayes envelope BE is the set of state-action-frequency / reward pairs whose reward component is the Bayes reward against the corresponding estimate q̂.
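The estimate of q from observed counts can be sketched as follows (the dictionary layout is an assumption):

```python
def estimate_q(counts):
    """counts[s][b] = number of times P2 played action b in state s.

    Returns the empirical conditional distribution q_hat(b|s)."""
    q_hat = {}
    for s, row in counts.items():
        total = sum(row.values())
        q_hat[s] = {b: n / total for b, n in row.items()}
    return q_hat

counts = {0: {"left": 3, "right": 1}}   # illustrative observations
print(estimate_q(counts))               # {0: {'left': 0.75, 'right': 0.25}}
```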


Approachability Theory

- A standard tool in the theory of repeated matrix games (Blackwell, 1956)
- Consider a game with vector-valued rewards, evaluated by the average reward.
- A set C is approachable by P1 with policy σ if the average reward vector converges to C almost surely, for every strategy of P2.
- Extended to recurrent stochastic games (Shimkin and Shwartz, 1993)


The Convex Bayes Envelope

- In general, BE is not approachable.
- Define CBE = co(BE), that is, the envelope obtained by replacing r* with co(r*), where co(·) denotes the lower convex hull of r* as a function of q.

Theorem: CBE is approachable.

(val denotes the value of the game)


Single Controller Games

Theorem: Assume that P2 alone controls the transitions, i.e.

P(s'|s,a,b) = P(s'|s,b) for all a∈A(s);

then BE itself is approachable.


An Application to Prediction with Expert Advice

- Given a channel and a set of experts.
- At each time epoch, each expert states his prediction of the next symbol, and P1 has to choose his own prediction a.
- Then a letter b appears in the channel and P1 receives the prediction reward r(a,b).
- The problem can be formulated as a stochastic game in which P2 stands for all the experts and the channel.
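As one simple illustration of a regret-flavored prediction rule (this is not the strategy from the talk, just a follow-the-leader sketch over the experts’ cumulative rewards; all names are assumptions):

```python
def follow_the_leader(expert_predictions, outcomes, reward):
    """expert_predictions[t][i] = expert i's prediction at time t.

    At each step, predict as the expert with the best cumulative reward so far."""
    n = len(expert_predictions[0])
    scores = [0.0] * n
    my_rewards = []
    for preds, y in zip(expert_predictions, outcomes):
        leader = max(range(n), key=lambda i: scores[i])   # ties: lowest index
        my_rewards.append(reward(preds[leader], y))
        for i, p in enumerate(preds):
            scores[i] += reward(p, y)
    return my_rewards

reward = lambda a, b: 1.0 if a == b else 0.0   # 0/1 prediction reward
preds = [[0, 1], [0, 1], [1, 1]]               # two experts, three rounds
print(follow_the_leader(preds, [1, 1, 1], reward))  # [0.0, 1.0, 1.0]
```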


Prediction Example (cont’)

[Figure: transition diagram of the prediction game, with states (k-1,k,k) and (k,k,k), rewards r(a,b) and r=0, and edges labeled by the experts’ recommendations.]

Theorem: P1 has a zero-regret strategy.


An example in which BE is not approachable

[Figure: a two-state game with states S0 and S1, reward r=b in both states, action sets B(0)=B(1)={-1,1}, action a=0, and transitions occurring with probability P=0.99.]

It can be proved that BE for the above game is not approachable.


Open questions

- Characterization of minimal approachable sets in the reward-state-action space
- On-line learning schemes for stochastic games with unknown parameters
- Other ways of formulating optimality with respect to observed state-action frequencies


Conclusions

- The problem of regret minimization for stochastic games was considered.
- The proposed solution concept, CBE, is based on convexification of the Bayes envelope in the natural state-action space.
- The CBE concept ensures an average reward higher than the value of the game when the opponent is suboptimal.



Approachability Theory

- Let m(p,q) be the average vector-valued reward in the game when P1 and P2 play p and q.
- Define m̄_t as the average reward vector obtained by time t.
- Theorem [Blackwell, 1956]: A convex set C is approachable if and only if for every q∈Q there exists p∈P such that m(p,q) ∈ C.
- Extended to stochastic games (Shimkin and Shwartz, 1993)
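For intuition, Blackwell’s condition can be checked numerically in the scalar special case where the target set is a half-line C = [v, ∞); there, the condition reduces to max_p pGq ≥ v for every q. The sketch below checks this on a grid of 2-action mixtures (all names and the matrix are assumptions):

```python
def condition_holds(G, v, grid=101):
    """Check max_p p G q >= v for every q on a grid over the 2-action simplex.

    Since p G q is linear in p, the max over p is attained at a pure row."""
    for k in range(grid):
        q = [k / (grid - 1), 1 - k / (grid - 1)]
        best = max(sum(G[i][j] * q[j] for j in range(2)) for i in range(2))
        if best < v:
            return False
    return True

G = [[1, -1], [-1, 1]]              # illustrative 2x2 matrix
print(condition_holds(G, 0.0))      # True: the value of this game is 0
```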


A related Vector Valued Game

- Define the following vector-valued game:
- If in state s action b is played by P2 and a reward r is gained, then the stage vector reward m_t pairs r with the indicator of the state-action pair (s,b), so that its time average records both the average reward and the empirical state-action frequencies.
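A sketch of one such encoding (an assumption, not the talk’s exact formula): the stage vector concatenates the scalar reward with a one-hot indicator of the realized state-action pair (s,b).

```python
def stage_vector(s, b, r, states, actions):
    """Return [r] + e_{s,b}, where e_{s,b} is one-hot over state-action pairs."""
    indicator = [1.0 if (s_, b_) == (s, b) else 0.0
                 for s_ in states for b_ in actions]
    return [r] + indicator

print(stage_vector(1, 0, 0.5, states=[0, 1], actions=[0, 1]))
# [0.5, 0.0, 0.0, 1.0, 0.0]
```

Averaging these vectors over time yields the pair (average reward, empirical state-action frequencies) to which approachability is applied.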

