
Response Regret

Martin Zinkevich

AAAI Fall Symposium

November 5th, 2005

This work was supported by NSF Career Grant #IIS-0133689.


Outline

  • Introduction

  • Repeated Prisoners’ Dilemma

    • Tit-for-Tat

    • Grim Trigger

  • Traditional Regret

  • Response Regret

  • Conclusion


The Prisoner’s Dilemma

  • Two prisoners (Alice and Bob) are caught for a small crime. They make a deal not to squeal on each other for a large crime.

  • Then the authorities meet with each prisoner separately and offer a pardon for the small crime if the prisoner turns his or her partner in for the large crime.

  • Each has two options:

    • Cooperate with his or her fellow prisoner, or

    • Defect from the deal.


Bimatrix Game

                    Bob Cooperates                 Bob Defects
Alice Cooperates    Alice: 1 year,  Bob: 1 year    Alice: 6 years, Bob: 0 years
Alice Defects       Alice: 0 years, Bob: 6 years   Alice: 5 years, Bob: 5 years




The Problem

  • Each player, by acting to slightly improve his or her own circumstances, hurts the other, so that if both acted "irrationally" they would both do better.


A Better Model for Real Life

  • In real life there are consequences for misbehavior

  • These consequences improve life

  • A better model: Infinitely repeated games


The Goal

  • Can we come up with algorithms whose performance guarantees, in the presence of other intelligent agents, take the delayed consequences of actions into account?

  • Side effect: a goal for reinforcement learning in infinite POMDPs.


Regret Versus Standard RL

  • Guarantees of performance during learning.

  • No guarantee for the “final” policy…

    …for now. 


A New Measure of Regret

  • Traditional Regret

    • measures immediate consequences

  • Response Regret

    • measures delayed effects


Outline

  • Introduction

  • Repeated Prisoners’ Dilemma

    • Tit-for-Tat

    • Grim Trigger

  • Traditional Regret

  • Response Regret

  • Conclusion


Repeated Bimatrix Game

                    Bob Cooperates   Bob Defects
Alice Cooperates    -1, -1           -6, 0
Alice Defects       0, -6            -5, -5


Finite State Machine (for Bob)

[Diagram: a finite state machine for Bob. Each state is labeled with Bob's action (cooperate or defect), and the transitions between states are labeled with Alice's actions.]


Grim Trigger

[Diagram: two states. Bob starts in the "Bob cooperates" state and stays there while Alice cooperates; if Alice ever defects, Bob moves to the "Bob defects" state and stays there regardless of what Alice does.]


Always Cooperate

[Diagram: a single state, "Bob cooperates", with a self-loop on every action of Alice.]


Always Defect

[Diagram: a single state, "Bob defects", with a self-loop on every action of Alice.]


Tit-for-Tat

[Diagram: two states, "Bob cooperates" and "Bob defects". When Alice cooperates, Bob moves to (or stays in) the cooperate state; when Alice defects, Bob moves to (or stays in) the defect state.]
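These strategies can be written down directly. Below is a minimal Python sketch (mine, not from the slides) of tit-for-tat and grim trigger playing the repeated game, using the payoffs from the Repeated Bimatrix Game slide; the class and function names are illustrative.

```python
# Hypothetical sketch of the finite-state strategies above (not from the slides).
# Payoffs are the (Alice, Bob) entries of the repeated bimatrix game.
PAYOFF = {('C', 'C'): (-1, -1), ('C', 'D'): (-6, 0),
          ('D', 'C'): (0, -6), ('D', 'D'): (-5, -5)}

class TitForTat:
    """Cooperate first; then repeat the opponent's last action."""
    def __init__(self):
        self.next_action = 'C'
    def act(self):
        return self.next_action
    def observe(self, opponent_action):
        self.next_action = opponent_action

class GrimTrigger:
    """Cooperate until the opponent defects once; defect forever after."""
    def __init__(self):
        self.triggered = False
    def act(self):
        return 'D' if self.triggered else 'C'
    def observe(self, opponent_action):
        if opponent_action == 'D':
            self.triggered = True

def play(alice, bob, rounds=10):
    """Play the repeated game and return the cumulative (Alice, Bob) payoff."""
    total = [0, 0]
    for _ in range(rounds):
        a, b = alice.act(), bob.act()
        pa, pb = PAYOFF[(a, b)]
        total[0] += pa
        total[1] += pb
        alice.observe(b)
        bob.observe(a)
    return total

# Example: two tit-for-tat players cooperate forever.
print(play(TitForTat(), TitForTat()))  # [-10, -10] after 10 rounds
```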


Discounted Utility

[Diagram: after every round, play continues (GO) with probability 2/3 and stops (STOP) with probability 1/3. Sample traces against tit-for-tat are shown, e.g. the action/payoff sequence C -1, C -1, D 0, C -6, D 0, C -6, each trace ending at a random STOP.]


Discounted Utility

  • The discounted utility is the expected value of that process:

    Σ_{t=1}^{∞} γ^{t-1} u_t

    (computed in the sketch below)
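As a small illustration (my own sketch, not from the talk), the discounted sum can be computed directly, or estimated by simulating the GO/STOP process: continue with probability γ after each step and sum the undiscounted payoffs. The two agree in expectation.

```python
import random

def discounted_utility(utilities, gamma):
    """Sum of gamma^(t-1) * u_t over the given (finite) payoff sequence."""
    return sum(gamma ** t * u for t, u in enumerate(utilities))

def sampled_utility(utilities, gamma, rng=None):
    """Simulate the GO/STOP process: after each payoff, continue with probability gamma.
    The expected value of the (undiscounted) total equals the discounted utility."""
    rng = rng or random.Random()
    total = 0.0
    for u in utilities:
        total += u
        if rng.random() >= gamma:   # STOP with probability 1 - gamma
            break
    return total

# The alternating trace against tit-for-tat from the slide above: C, C, D, C, D, C.
print(discounted_utility([-1, -1, 0, -6, 0, -6], 2/3))
```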


Optimal Value Functions for FSMs

  • V_γ*(s): discounted utility of the OPTIMAL policy from state s

  • V_0*(s): immediate maximum utility at state s

  • V_γ*(B): discounted utility of the OPTIMAL policy given belief B over states

  • V_0*(B): immediate maximum utility given belief B over states

[Diagram: after each step, play continues (GO) with probability γ and stops (STOP) with probability 1 − γ.]
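For a known FSM, V_γ* can be computed by simple value iteration. A sketch (mine, not from the slides), using Bob's grim trigger machine and the repeated-game payoffs as the assumed example:

```python
# Value iteration for V_gamma*(s) = max_a [u(s,a) + gamma * V_gamma*(T(s,a))].
# Illustration: Alice's value against Bob playing grim trigger (states = Bob's FSM states).
GAMMA = 2 / 3
STATES = ['coop', 'defect']             # Bob cooperates / Bob defects
ACTIONS = ['C', 'D']                    # Alice's actions

def u(s, a):
    """Alice's one-step utility (negative years) when Bob is in state s and Alice plays a."""
    bob = 'C' if s == 'coop' else 'D'
    return {('C', 'C'): -1, ('C', 'D'): -6, ('D', 'C'): 0, ('D', 'D'): -5}[(a, bob)]

def T(s, a):
    """Grim trigger: once Alice defects, Bob defects forever."""
    return 'defect' if (s == 'defect' or a == 'D') else 'coop'

V = {s: 0.0 for s in STATES}
for _ in range(1000):                   # iterate until (numerically) converged
    V = {s: max(u(s, a) + GAMMA * V[T(s, a)] for a in ACTIONS) for s in STATES}

print(V)   # approximately V['coop'] = -3 (always cooperate), V['defect'] = -15 (always defect)
```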


Best Responses, Discounted Utility

  • If γ > 1/5, a policy is a best response to grim trigger iff it always cooperates when playing grim trigger.

  • (Why 1/5: defecting yields 0 now and at best −5γ/(1−γ) thereafter, while cooperating forever yields −1/(1−γ); the latter is larger exactly when γ > 1/5.)

[Diagram: the grim trigger state machine.]


Best Responses, Discounted Utility

  • Similarly, if γ > 1/5, a policy is a best response to tit-for-tat iff it always cooperates when playing tit-for-tat.

[Diagram: the tit-for-tat state machine.]


Knowing Versus Learning

  • Given a known FSM for the opponent, we can determine the optimal policy (for some γ) from an initial state.

  • However, if it is an unknown FSM, by the time we learn what it is, it will be too late to act optimally.


Grim Trigger or Always Cooperate?

[Diagram: the grim trigger machine and the always-cooperate machine side by side. As long as Alice cooperates, the two are indistinguishable.]

For learning, optimality from the initial state is a bad goal.


Deterministic Infinite SMs

  • can represent any deterministic policy

  • de-randomization

[Diagram: a tree of states, each labeled with an action (C or D).]


New Goal

  • Can a measure of regret allow us to play like tit-for-tat in the Infinitely Repeated Prisoner’s Dilemma?

  • In addition, it should be possible for one algorithm to minimize regret against all possible opponents (finite and infinite SMs).


Outline

  • Introduction

  • Repeated Prisoners’ Dilemma

    • Tit-for-Tat

    • Grim Trigger

  • Traditional Regret

  • Response Regret

  • Conclusion


Traditional Regret: Rock-Paper-Scissors

Rock-Paper-Scissors: Bob Plays BR to Alice's Last

[Figures: several rounds of Rock-Paper-Scissors in which Bob plays the best response (BR) to Alice's last move.]


Utility of the Algorithm

  • Define u_t to be the utility of ALG at time t.

  • Define u_0^ALG to be:

    u_0^ALG = (1/T) Σ_{t=1}^{T} u_t

  • Here:

    u_0^ALG = (1/5)(0 + 1 + (−1) + 1 + 0) = 1/5


Rock-Paper-Scissors: Visit Counts for Bob's Internal States

[Figure: Bob's three internal states were visited 3 times, 1 time, and 1 time; u_0^ALG = 1/5.]


Rock-Paper-Scissors: Frequencies

[Figure: the empirical frequencies of Bob's internal states are 3/5, 1/5, and 1/5; u_0^ALG = 1/5.]


Rock-Paper-Scissors: Dropped According to Frequencies

[Figure: against the empirical distribution over Bob's states (3/5, 1/5, 1/5), Alice's three actions have expected one-step utilities 2/5, 0, and −2/5; u_0^ALG = 1/5.]


Traditional Regret

  • Let B be the empirical frequency with which states were visited.

  • Define u_0^ALG to be the average utility of the algorithm.

  • The traditional regret of ALG is:

    R_0 = V_0*(B) − u_0^ALG

    Here: R_0 = (2/5) − (1/5) = 1/5
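A sketch of this definition as code (mine, not the talk's); the `utility(state, action)` function and the trace format are illustrative assumptions.

```python
from collections import Counter

def traditional_regret(trace, utility, actions):
    """Traditional regret R_0 = V_0*(B) - u_0^ALG (a sketch of the definition above).

    trace:   list of (opponent_state, my_action) pairs actually played
    utility: utility(state, action) -> one-step payoff
    actions: the set of actions available to the learner
    B is taken to be the empirical distribution over opponent states in the trace.
    """
    T = len(trace)
    u0_alg = sum(utility(s, a) for s, a in trace) / T            # average utility of ALG
    B = Counter(s for s, _ in trace)                             # empirical state counts
    v0_star = max(sum(n * utility(s, a) for s, n in B.items()) / T
                  for a in actions)                              # best single response to B
    return v0_star - u0_alg
```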


Traditional Regret

  • Goal: regret approaches zero almost surely (a.s.).

  • There exists an algorithm that achieves this against all opponents.


What Algorithm?

  • Gradient Ascent with Euclidean Projection (Zinkevich, 2003)

  • (when the probabilities p_i are strictly positive)

What Algorithm?

  • Exponentially Weighted Experts (Littlestone and Warmuth, 1994)

  • And a close relative:
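The update itself is an image in the original slides. As a reminder of the general shape (a hedged sketch, not necessarily the exact variant on the slide), an exponentially weighted experts update multiplies each expert's weight by an exponential in its utility and renormalizes:

```python
import math

def exponential_weights_update(weights, utilities, eta=0.1):
    """One exponentially weighted update: multiply each expert's weight by exp(eta * utility),
    then renormalize to get the new probability distribution over experts."""
    new = [w * math.exp(eta * u) for w, u in zip(weights, utilities)]
    total = sum(new)
    return [w / total for w in new]

# Example: three experts, start uniform, the second expert did best this round.
p = exponential_weights_update([1/3, 1/3, 1/3], [0.0, 1.0, -1.0])
```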


What Algorithm?

  • Regret Matching:
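The regret-matching rule is also an image in the original slides. Its standard form (Hart and Mas-Colell's rule; a sketch, not necessarily the slide's exact notation) plays each action with probability proportional to its positive cumulative regret:

```python
def regret_matching_policy(cumulative_regret):
    """Play each action with probability proportional to its positive cumulative regret;
    if no action has positive regret, play uniformly."""
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    n = len(cumulative_regret)
    if total == 0:
        return [1.0 / n] * n
    return [r / total for r in positive]

def update_regrets(cumulative_regret, action_utilities, played_utility):
    """Add this round's regret (each action's utility minus the realized utility)."""
    return [r + (u - played_utility)
            for r, u in zip(cumulative_regret, action_utilities)]
```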


What Algorithm?

  • Lots of them!


Extensions to Traditional Regret

(Foster and Vohra, 1997)

  • Into the past…

    • Have a short history

    • Optimal against BR to Alice’s Last.


Extensions to Traditional Regret

  • (Auer et al.)

  • We only see u_t, not u_{i,t} for every action i.

  • Use an unbiased estimator of u_{i,t}:
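The estimator itself is an image in the original slides. The usual importance-weighted construction in this bandit setting (a sketch of the standard technique, not necessarily the slide's exact formula) divides the observed utility by the probability of the action actually played and assigns zero to the rest:

```python
def importance_weighted_estimate(chosen, observed_utility, probs):
    """Unbiased estimate of the full utility vector from bandit feedback:
    u_hat[i] = observed_utility / probs[i] if i was played, else 0.
    E[u_hat[i]] = probs[i] * (u[i] / probs[i]) = u[i], so the estimate is unbiased."""
    return [observed_utility / probs[i] if i == chosen else 0.0
            for i in range(len(probs))]
```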


Outline

  • Introduction

  • Repeated Prisoners’ Dilemma

    • Tit-for-Tat

    • Grim Trigger

  • Traditional Regret

  • Response Regret

  • Conclusion


This Talk

  • Do you want to?

  • Even then, is it possible?


Traditional Regret: Prisoner's Dilemma

[Figure: Bob plays tit-for-tat. Alice's play so far is C C D C D D, followed by defection in every subsequent round.]


Traditional Regret: Prisoner's Dilemma

[Figure: Bob plays tit-for-tat; the empirical belief puts probability 0.2 on the "Bob cooperates" state and 0.8 on the "Bob defects" state.]

Against this belief, the immediate expected utilities are:

  • Alice defects: −4

  • Alice cooperates: −5



The New Dilemma

  • Traditional regret forces greedy, short-sighted behavior.

  • A new concept is needed.


A New Measurement of Regret

Use V_γ*(B) instead of V_0*(B).

[Figure: the same tit-for-tat belief as on the previous slide, with probability 0.2 on "Bob cooperates" and 0.8 on "Bob defects".]


Response Regret

  • Let B be the empirical distribution over the states visited.

  • Define u_0^ALG to be the average utility of the algorithm.

  • Traditional regret is:

    R_0 = V_0*(B) − u_0^ALG

  • Response regret is:

    R_γ = V_γ*(B) − ?


Averaged Discounted Utility

Utility of the algorithm at time t':  u_{t'}

Discounted utility from time t:  Σ_{t'=t}^{∞} γ^{t'−t} u_{t'}

Averaged discounted utility from 1 to T:

    u_γ^ALG = (1/T) Σ_{t=1}^{T} Σ_{t'=t}^{∞} γ^{t'−t} u_{t'}

Dropped in at random but playing optimally:  V_γ*(B)

Response regret:

    R_γ = V_γ*(B) − u_γ^ALG


Response Regret

  • Let B be the empirical distribution over the states visited.

  • Traditional regret is:

    R_0 = V_0*(B) − u_0^ALG

  • Response regret is:

    R_γ = V_γ*(B) − u_γ^ALG


Comparing Regret Measures: when Bob Plays Tit-for-Tat

[Figure: Bob plays tit-for-tat; Alice's play is C C D C D D followed by all defections, so the empirical belief is 0.2 on "Bob cooperates" and 0.8 on "Bob defects".]

R_0 = 1/10 (defect)

R_{1/5} = 0 (any policy)

R_{2/3} = 203/30 ≈ 6.76 (always cooperate)


Comparing Regret Measures: when Bob Plays Tit-for-Tat

[Figure: Bob plays tit-for-tat; Alice cooperates every round, so the empirical belief is 1.0 on "Bob cooperates" and 0.0 on "Bob defects".]

R_0 = 1 (defect)

R_{1/5} = 0 (any policy)

R_{2/3} = 0 (always cooperate / tit-for-tat / grim trigger)


Comparing Regret Measures: when Bob Plays Grim Trigger

[Figure: Bob plays grim trigger; Alice's play is C C D C D D followed by all defections, so the empirical belief is 0.2 on "Bob cooperates" and 0.8 on "Bob defects".]

R_0 = 1/10 (defect)

R_{1/5} = 0 (grim trigger / tit-for-tat / always defect)

R_{2/3} = 11/30 (grim trigger / tit-for-tat)


Comparing Regret Measures: when Bob Plays Grim Trigger

[Figure: Bob plays grim trigger; Alice cooperates every round, so the empirical belief is 1.0 on "Bob cooperates" and 0.0 on "Bob defects".]

R_0 = 1 (defect)

R_{1/5} = 0 (always cooperate / always defect / tit-for-tat / grim trigger)

R_{2/3} = 0 (always cooperate / tit-for-tat / grim trigger)



What it Measures

  • constant (recurring) missed opportunities

    • high response regret

  • a few drastic, unrepeatable mistakes

    • low response regret

  • convergence (to zero response regret) implies a Nash equilibrium of the repeated game


Philosophy

  • Response regret cannot be known without knowing the opponent.

  • Response regret can be estimated while playing the opponent, so that the estimate in the limit will be exact a.s.


Determining Utility of a Policy in a State

  • If I want to know the discounted utility of using a policy P from the third state visited…

  • Use the policy P from the third time step ad infinitum, and take the discounted reward.

[Timeline: states S1, S2, S3, S4, S5, …]


Determining Utility of a Policy in a State in Finite Time

  • Start using the policy P from the third time step; after each step, continue using P with probability γ. Take the total (undiscounted) reward over the time steps P was used.

  • In EXPECTATION, this is the same as before.

[Timeline: states S1, S2, S3, S4, S5, …]


Determining Utility of a Policy in a State in Finite Time, Without ALWAYS Using It

  • With some probability p, start using the policy P from the third time step; after each step, continue using P with probability γ. Take the total reward over the time steps P was used and multiply it by 1/p.

  • In EXPECTATION, this is the same as before.

  • Any finite number of policies can be estimated at the same time this way.

[Timeline: states S1, S2, S3, S4, S5, …]
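A sketch of this estimation idea (my own rendering; the exploration probability p and the env_step interface are illustrative assumptions): with probability p, start following P at the chosen step; keep following it with probability γ per step; sum the rewards collected while following it; and rescale by 1/p so the estimate is unbiased.

```python
import random

def estimate_policy_value(env_step, policy, start_state, p, gamma, rng=None):
    """Finite-time, importance-weighted estimate of the discounted utility of `policy`
    from `start_state`.  With probability p we follow the policy at all; at each step we
    keep following it with probability gamma.  Rescaling the collected reward by 1/p
    makes the estimate equal the discounted utility in expectation.

    env_step(state, action) -> (reward, next_state) is an assumed environment interface.
    """
    rng = rng or random.Random()
    if rng.random() >= p:
        return 0.0                       # did not explore this time
    total, state = 0.0, start_state
    while True:
        reward, state = env_step(state, policy(state))
        total += reward
        if rng.random() >= gamma:        # stop following the policy
            break
    return total / p
```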


Traditional Regret

  • Goal: regret approaches zero a.s.

  • There exists an algorithm that achieves this for all opponents.


Response Regret

  • Goal: regret approaches zero a.s.

  • There exists an algorithm that achieves this for all opponents.


A Hard Environment: The Combination Lock Problem

[Diagram: a combination-lock finite state machine built from Alice's cooperate/defect actions, with Bob defecting in every state but the last.]


SPEED!

  • Response regret takes time to minimize (the combination lock problem).

  • Current work: restricting the adversary's choice of policies. In particular, if the number of policies is N, then the regret is linear in N and polynomial in 1/(1−γ).


Related Work

  • Other work

    • de Farias and Megiddo, 2004

    • Browning, Bowling, and Veloso, 2004

    • Bowling and McCracken, 2005

  • Episodic solutions face problems similar to the Finitely Repeated Prisoner's Dilemma.


What is in a Name?

  • Why not Consequence Regret?


Questions?

Thanks to:

Avrim Blum (CMU)

Michael Bowling (U Alberta)

Amy Greenwald (Brown)

Michael Littman (Rutgers)

Rich Sutton (U Alberta)


Always Cooperate

[Figure: Bob always cooperates; Alice's play is C C D C D D followed by all defections.]

R_0 = 1/10

R_{1/5} = 1/10

R_{2/3} = 1/10


Practice

  • Using these estimation techniques, it is possible to minimize response regret (make it approach zero almost surely in the limit in an ARBITRARY environment).

  • Similar to the Folk Theorems, it is also possible to converge to the socially optimal behavior if γ is close enough to 1. (???)


Traditional Regret: Prisoner's Dilemma

[Figure: the tit-for-tat state machine for Bob.]


Possible Outcomes

  • Alice cooperates, Bob cooperates:

    • Alice: 1 year

    • Bob: 1 year

  • Alice defects, Bob cooperates:

    • Alice: 0 years

    • Bob: 6 years

  • Alice cooperates, Bob defects:

    • Alice: 6 years

    • Bob: 0 years

  • Alice defects, Bob defects:

    • Alice: 5 years

    • Bob: 5 years



Repeated Bimatrix Game

  • The same one-shot game is played repeatedly.

  • Either average reward or discounted reward is considered.



One Slide Summary

  • Problem: Prisoner’s Dilemma

  • Solution: Infinitely Repeated Prisoner’s Dilemma

  • Same Problem: Traditional Regret

  • Solution: Response Regret


Formalism for FSMs (S, A, Ω, O, u, T)

  • States S

  • Finite actions A

  • Finite observations Ω

  • Observation function O: S → Ω

  • Utility function u: S × A → R

    • (or u: S × O → R)

  • Transition function T: S × A → S

  • V*(s) = max_{a∈A} [u(s, a) + γ V*(T(s, a))]


Beliefs

Suppose S is a set of states:

  • T(s, a): state

  • O(s): observation

  • u(s, a): value

  • V*(s) = max_{a∈A} [u(s, a) + γ V*(T(s, a))]

Suppose B is a distribution over states:

  • T(B, a, o): belief

  • O(B, o): probability

  • u(B, a): expected value

  • V*(B) = max_{a∈A} [u(B, a) + γ Σ_{o∈Ω} O(B, o) V*(T(B, a, o))]
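One consistent reading of the belief operators above, sketched in Python (my reading, assuming a deterministic observation of the opponent's current state; names are illustrative):

```python
def observation_prob(B, o, O):
    """O(B, o): probability of observing o when the state is distributed according to B."""
    return sum(p for s, p in B.items() if O(s) == o)

def belief_update(B, a, o, T, O):
    """T(B, a, o): condition the belief B on having observed o, then push it through
    the transition function on action a.  B maps state -> probability;
    O(s) and T(s, a) are the (deterministic) observation and transition functions."""
    prob_o = observation_prob(B, o, O)
    if prob_o == 0:
        return {}
    new = {}
    for s, p in B.items():
        if O(s) == o:
            s2 = T(s, a)
            new[s2] = new.get(s2, 0.0) + p / prob_o
    return new
```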

