
### Response Regret

Martin Zinkevich

AAAI Fall Symposium

November 5th, 2005

This work was supported by NSF Career Grant #IIS-0133689.

Outline

- Introduction
- Repeated Prisoner’s Dilemma
  - Tit-for-Tat
  - Grim Trigger
- Traditional Regret
- Response Regret
- Conclusion

The Prisoner’s Dilemma

- Two prisoners (Alice and Bob) are caught for a small crime. They make a deal not to squeal on each other for a large crime.
- Then, the authorities meet with each prisoner separately, and offer a pardon for the small crime if the prisoner turns his or her partner in for the large crime.
- Each has two options:
  - Cooperate with the fellow prisoner, or
  - Defect from the deal.

Bimatrix Game

|                      | Bob Cooperates               | Bob Defects                  |
|----------------------|------------------------------|------------------------------|
| **Alice Cooperates** | Alice: 1 year, Bob: 1 year   | Alice: 6 years, Bob: 0 years |
| **Alice Defects**    | Alice: 0 years, Bob: 6 years | Alice: 5 years, Bob: 5 years |

The Problem

- Each prisoner, acting to slightly improve his or her own circumstances, hurts the other, such that if both acted “irrationally” and cooperated, both would do better.

A Better Model for Real Life

- In real life, there are consequences for misbehavior.
- These consequences improve life.
- A better model: infinitely repeated games.

The Goal

- Can we come up with algorithms with performance guarantees in the presence of other intelligent agents, guarantees that take delayed consequences into account?
- Side effect: a goal for reinforcement learning in infinite POMDPs.

Regret Versus Standard RL

- Guarantees of performance during learning.
- No guarantee for the “final” policy…
…for now.

A New Measure of Regret

- Traditional Regret
  - measures immediate consequences
- Response Regret
  - measures delayed effects

Outline

- Introduction
- Repeated Prisoner’s Dilemma
  - Tit-for-Tat
  - Grim Trigger
- Traditional Regret
- Response Regret
- Conclusion


Repeated Bimatrix Game

|                      | Bob Cooperates | Bob Defects |
|----------------------|----------------|-------------|
| **Alice Cooperates** | -1, -1         | -6, 0       |
| **Alice Defects**    | 0, -6          | -5, -5      |

Finite State Machine (for Bob)

[Figure: a two-state machine with states “Bob cooperates” and “Bob defects”. While Alice cooperates, Bob stays in the cooperating state; when Alice defects, Bob moves to the defecting state and stays there on any action by Alice (*).]

Tit-for-Tat

[Figure: a two-state machine for Bob. From “Bob cooperates”, Bob stays while Alice cooperates and moves to “Bob defects” when Alice defects; from “Bob defects”, Bob returns to “Bob cooperates” when Alice cooperates and stays when Alice defects. Bob simply repeats Alice’s last action.]
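To make the two machines concrete, here is a minimal sketch in Python (the payoff table is the repeated bimatrix game above; everything else is illustrative, not from the talk):

```python
# Minimal sketch: tit-for-tat and grim trigger as finite state machines,
# playing the repeated Prisoner's Dilemma. Payoffs (utilities for Alice
# and Bob) follow the repeated bimatrix game above.

PAYOFF = {('C', 'C'): (-1, -1), ('C', 'D'): (-6, 0),
          ('D', 'C'): (0, -6), ('D', 'D'): (-5, -5)}

def tit_for_tat():
    """Cooperate first, then repeat the opponent's last action."""
    action = 'C'
    while True:
        opponent = yield action
        action = opponent

def grim_trigger():
    """Cooperate until the opponent defects once, then defect forever."""
    action = 'C'
    while True:
        opponent = yield action
        if opponent == 'D':
            action = 'D'

# Five rounds of tit-for-tat (Alice) vs. grim trigger (Bob):
alice, bob = tit_for_tat(), grim_trigger()
a, b = next(alice), next(bob)
for _ in range(5):
    print(a, b, PAYOFF[a, b])            # both cooperate throughout
    a, b = alice.send(b), bob.send(a)
```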

Discounted Utility, Pr[STOP] = 1/3

[Figure: the repeated game as a stopping process. After each round, the game either continues (GO) or stops (STOP), here stopping with probability 1/3. Sample traces of play are shown with their per-round utilities, such as C: -1, D: 0, C: -6.]

Discounted Utility

- The expected value of that process:
  $\sum_{t=1}^{\infty} u_t\,\gamma^{t-1}$
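As a sanity check on the equivalence between the stopping process and geometric discounting, here is a small simulation sketch (the utility sequence is arbitrary and purely illustrative):

```python
import random

# Sketch: the expected total utility of a process that stops with
# probability (1 - gamma) after each round equals the discounted sum
# sum_{t>=1} u_t * gamma**(t-1).

gamma = 2.0 / 3.0              # Pr[GO] = gamma, Pr[STOP] = 1/3
u = [-1, -1, 0, -6, 0, -6]     # an arbitrary finite utility sequence

def run_once():
    total = 0.0
    for ut in u:
        total += ut
        if random.random() > gamma:   # STOP with probability 1 - gamma
            break
    return total

mc = sum(run_once() for _ in range(200000)) / 200000
exact = sum(ut * gamma**t for t, ut in enumerate(u))
print(mc, exact)               # the two values should agree closely
```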

Optimal Value Functions for FSMs

- $V^*_\gamma(s)$: discounted utility of the OPTIMAL policy from state s
- $V^*_0(s)$: immediate maximum utility at state s
- $V^*_\gamma(B)$: discounted utility of the OPTIMAL policy given belief B over states
- $V^*_0(B)$: immediate maximum utility given belief B over states

Pr[STOP] = 1 - γ, Pr[GO] = γ

Best Responses, Discounted Utility

- If γ > 1/5, a policy is a best response to grim trigger iff it always cooperates when playing grim trigger.
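Why 1/5? A worked check using the payoffs above (not from the slides): against a grim trigger that is still cooperating,

$$\text{cooperate forever: } \sum_{t=1}^{\infty} (-1)\,\gamma^{t-1} = \frac{-1}{1-\gamma}, \qquad \text{defect once: } 0 + \sum_{t=2}^{\infty} (-5)\,\gamma^{t-1} = \frac{-5\gamma}{1-\gamma},$$

so cooperating is strictly better iff $-1 > -5\gamma$, i.e. iff $\gamma > 1/5$.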

[Figures: Bob’s grim trigger machine (once Alice defects, Bob defects forever, on any action by Alice) and Bob’s tit-for-tat machine, as above.]

Best Responses, Discounted Utility

- Similarly, if γ > 1/5, a policy is a best response to tit-for-tat iff it always cooperates when playing tit-for-tat.

Knowing Versus Learning

- Given a known FSM for the opponent, we can determine the optimal policy (for some γ) from an initial state.
- However, if it is an unknown FSM, by the time we learn what it is, it will be too late to act optimally.

Grim Trigger or Always Cooperate?

[Figure: two machines for Bob that behave identically as long as Alice cooperates: grim trigger, which defects forever once Alice defects, and always cooperate, which stays in the cooperating state on any action by Alice.]

For learning, optimality from the initial state is a bad goal.

New Goal

- Can a measure of regret allow us to play like tit-for-tat in the Infinitely Repeated Prisoner’s Dilemma?
- In addition, it should be possible for one algorithm to minimize regret against all possible opponents (finite and infinite state machines).

Outline

- Introduction
- Repeated Prisoner’s Dilemma
  - Tit-for-Tat
  - Grim Trigger
- Traditional Regret
- Response Regret
- Conclusion

Traditional Regret: Rock-Paper-Scissors

[Figure: several rounds of Rock-Paper-Scissors in which Bob plays the best response to Alice’s last action.]

Utility of the Algorithm

- Define $u_t$ to be the utility of ALG at time t.
- Define $u^{\mathrm{ALG}}_0$ to be:
  $u^{\mathrm{ALG}}_0 = \frac{1}{T}\sum_{t=1}^{T} u_t$
- Here:
  $u^{\mathrm{ALG}}_0 = \frac{1}{5}\,(0 + 1 + (-1) + 1 + 0) = \frac{1}{5}$

Rock-Paper-Scissors, Dropped According to Frequencies

[Figure: Bob’s states weighted by empirical visit frequencies 3/5, 1/5, and 1/5, annotated with expected utilities 2/5, 0, and -2/5; the algorithm earned $u^{\mathrm{ALG}}_0 = 1/5$.]

Traditional Regret

- Consider B to be the empirical frequency with which states were visited.
- Define $u^{\mathrm{ALG}}_0$ to be the average utility of the algorithm.
- Traditional regret of ALG is:
  $R_0 = V^*_0(B) - u^{\mathrm{ALG}}_0$
- Here: $R_0 = (2/5) - (1/5) = 1/5$

Traditional Regret

- Goal: regret approaches zero almost surely.
- There exists an algorithm that achieves this against all opponents.

What Algorithm?

- Gradient Ascent with Euclidean Projection (Zinkevich, 2003)
- (when $p_i$ is strictly positive)

What Algorithm?

- Exponential Weighted Experts (Littlestone and Warmuth, 1994), and a close relative.

What Algorithm?

- Regret Matching.
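To illustrate the flavor of these algorithms, here is a minimal sketch of regret matching for Rock-Paper-Scissors (an illustration written for this transcript, not the specific presentation from the talk):

```python
import random

# Sketch of regret matching (Hart and Mas-Colell): play each action with
# probability proportional to its positive cumulative regret, then update
# the regrets with the counterfactual utilities of all actions.

ACTIONS = ['rock', 'paper', 'scissors']
BEATS = {'rock': 'scissors', 'paper': 'rock', 'scissors': 'paper'}

def utility(mine, theirs):
    """+1 for a win, -1 for a loss, 0 for a tie."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

regret = {a: 0.0 for a in ACTIONS}

def choose():
    weights = [max(regret[a], 0.0) for a in ACTIONS]
    if sum(weights) == 0:              # no positive regret yet: play uniformly
        return random.choice(ACTIONS)
    return random.choices(ACTIONS, weights=weights)[0]

for t in range(10000):
    mine = choose()
    theirs = random.choice(ACTIONS)    # a stand-in opponent
    for a in ACTIONS:                  # what a would have earned, extra
        regret[a] += utility(a, theirs) - utility(mine, theirs)
```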

What Algorithm?

- Lots of them!

Extensions to Traditional Regret

(Foster and Vohra, 1997)

- Into the past…
- Keep a short history.
- Optimal against “BR to Alice’s Last”.

Extensions to Traditional Regret

- (Auer et al.)
- Only see $u_t$, not $u_{i,t}$.
- Use an unbiased estimator of $u_{i,t}$.
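A minimal sketch of the standard importance-weighted estimator, in the spirit of Auer et al.’s bandit setting (the function and its names are illustrative):

```python
# When only the utility u_t of the action actually played is observed,
# an unbiased estimate of every action's utility u_{i,t} is:
#   u_hat[i] = u_t / p[i] if i was played, else 0,
# where p[i] is the probability that action i was chosen, so that
# E[u_hat[i]] = p[i] * (u_{i,t} / p[i]) = u_{i,t}.

def estimate(played, u_t, p):
    """Importance-weighted utility estimates for all actions."""
    return {i: (u_t / p[i] if i == played else 0.0) for i in p}

print(estimate('rock', 1.0, {'rock': 0.5, 'paper': 0.25, 'scissors': 0.25}))
```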

Outline

- Introduction
- Repeated Prisoner’s Dilemma
  - Tit-for-Tat
  - Grim Trigger
- Traditional Regret
- Response Regret
- Conclusion

This Talk

- Do you want to?
- Even then, is it possible?

Traditional Regret: Prisoner’s Dilemma

[Figure: a play trace against Bob’s tit-for-tat machine: CC, DC, DD, DD, DD, …; once Alice defects, play locks into mutual defection.]

Traditional Regret: Prisoner’s Dilemma

[Figure: Bob’s tit-for-tat machine with empirical state frequencies 0.2 (“Bob cooperates”) and 0.8 (“Bob defects”).]

Against this empirical distribution:

- Alice defects: -4
- Alice cooperates: -5

The New Dilemma

- Traditional regret forces greedy, short-sighted behavior.
- A new concept is needed.

[Figure: Bob’s tit-for-tat machine with state frequencies 0.2 and 0.8, as above.]

A New Measurement of Regret

Use $V^*_\gamma(B)$ instead of $V^*_0(B)$.

Response Regret

- Consider B to be the empirical distribution over states visited.
- Define $u^{\mathrm{ALG}}_0$ to be the average utility of the algorithm.
- Traditional regret is:
  $R_0 = V^*_0(B) - u^{\mathrm{ALG}}_0$
- Response regret is:
  $R_\gamma = V^*_\gamma(B) - \,?$

Averaged Discounted Utility

- Utility of the algorithm at time t′: $u_{t'}$
- Discounted utility from time t: $\sum_{t'=t}^{\infty} u_{t'}\,\gamma^{t'-t}$
- Averaged discounted utility from 1 to T:
  $u^{\mathrm{ALG}}_\gamma = \frac{1}{T}\sum_{t=1}^{T}\,\sum_{t'=t}^{\infty} u_{t'}\,\gamma^{t'-t}$
- Dropped in at random but play optimally: $V^*_\gamma(B)$
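A small sketch of computing this quantity from a finite utility trace (the true inner sum runs to infinity; this version truncates it at the end of the trace, so it is only an approximation):

```python
# Sketch: averaged discounted utility of a finite utility trace.
#   u_alg = (1/T) * sum_{t=1..T} sum_{t'=t..end} u[t'] * gamma**(t'-t)
# (the definition sums t' to infinity; here the tail is truncated).

def averaged_discounted_utility(u, gamma):
    T = len(u)
    total = 0.0
    for t in range(T):
        total += sum(u[tp] * gamma ** (tp - t) for tp in range(t, T))
    return total / T

# Alice's utilities for the trace CC, DC, DD, DD, ... vs tit-for-tat:
print(averaged_discounted_utility([-1, 0, -5, -5, -5, -5], 2/3))
```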

Response Regret

$R_\gamma = V^*_\gamma(B) - u^{\mathrm{ALG}}_\gamma$

Response Regret

- Consider B to be the empirical distribution over states visited.
- Traditional regret is:
  $R_0 = V^*_0(B) - u^{\mathrm{ALG}}_0$
- Response regret is:
  $R_\gamma = V^*_\gamma(B) - u^{\mathrm{ALG}}_\gamma$

[Figure: Bob’s tit-for-tat machine with state frequencies 0.2 and 0.8, as above.]

Comparing Regret Measures: when Bob Plays Tit-for-Tat

[Figure: play trace CC, DC, DD, DD, DD, … over ten rounds.]

- $R_0 = 1/10$ (defect)
- $R_{1/5} = 0$ (any policy)
- $R_{2/3} = 203/30 \approx 6.76$ (always cooperate)
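To see where $R_0 = 1/10$ comes from, read the ten-round trace off the figure (a worked check, not from the slides):

$$u^{\mathrm{ALG}}_0 = \tfrac{1}{10}\left(-1 + 0 + 8\cdot(-5)\right) = -\tfrac{41}{10}, \qquad V^*_0(B) = 0.2\cdot 0 + 0.8\cdot(-5) = -4,$$

so $R_0 = -4 - \left(-\tfrac{41}{10}\right) = \tfrac{1}{10}$.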

[Figure: Bob’s tit-for-tat machine with state frequencies 1.0 (“Bob cooperates”) and 0.0 (“Bob defects”).]

Comparing Regret Measures: when Bob Plays Tit-for-Tat

[Figure: play trace CC, CC, CC, …; Alice always cooperates.]

- $R_0 = 1$ (defect)
- $R_{1/5} = 0$ (any policy)
- $R_{2/3} = 0$ (always cooperate / tit-for-tat / grim trigger)

[Figure: Bob’s grim trigger machine with state frequencies 0.2 (“Bob cooperates”) and 0.8 (“Bob defects”).]

Comparing Regret Measures: when Bob Plays Grim Trigger

[Figure: play trace CC, DC, DD, DD, DD, … over ten rounds.]

- $R_0 = 1/10$ (defect)
- $R_{1/5} = 0$ (grim trigger / tit-for-tat / always defect)
- $R_{2/3} = 11/30$ (grim trigger / tit-for-tat)

[Figure: Bob’s grim trigger machine with state frequencies 1.0 (“Bob cooperates”) and 0.0 (“Bob defects”).]

Comparing Regret Measures: when Bob Plays Grim Trigger

[Figure: play trace CC, CC, CC, …; Alice always cooperates.]

- $R_0 = 1$ (defect)
- $R_{1/5} = 0$ (always cooperate / always defect / tit-for-tat / grim trigger)
- $R_{2/3} = 0$ (always cooperate / tit-for-tat / grim trigger)

What it Measures:

- Constant missed opportunities: high response regret.
- A few drastic mistakes: low response regret.
- Convergence implies a Nash equilibrium of the repeated game.

Philosophy

- Response regret cannot be known without knowing the opponent.
- Response regret can be estimated while playing the opponent, so that the estimate in the limit will be exact a.s.

Determining Utility of a Policy in a State

- If I want to know the discounted utility of using a policy P from the third state visited…
- Use the policy P from the third time step ad infinitum, and take the discounted reward.

[Figure: state sequence S1, S2, S3, S4, S5; P is followed from S3 onward.]

Determining Utility of a Policy in a State in Finite Time

- Start using the policy P from the third time step; after each step, with probability γ, continue using P. Take the total reward over the time steps P was used.
- In EXPECTATION, the same as before.

[Figure: state sequence S1, S2, S3, S4, S5; P is followed from S3 for a geometric number of steps.]

Determining Utility of a Policy in a State in Finite Time, Without ALWAYS Using It

- With some probability β, start using the policy P from the third time step; with probability γ, continue using P. Take the total reward over the time steps P was used and multiply it by 1/β.
- In EXPECTATION, the same as before.
- Can estimate any finite number of policies at the same time this way.

[Figure: state sequence S1, S2, S3, S4, S5.]
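A sketch of this estimator, with β and γ as above (the environment interface here, `env.step` and `env.observe`, is a placeholder invented for illustration):

```python
import random

# Sketch: estimate the discounted utility of a policy P from the current
# state while only occasionally using P. With probability beta, run a
# rollout of P; each rollout step continues with probability gamma; the
# accumulated reward is divided by beta so the estimate is unbiased:
# E[estimate] = beta * E[total] / beta = discounted utility of P.

def rollout_estimate(env, policy, beta, gamma):
    if random.random() > beta:
        return 0.0                       # no rollout this round
    total = 0.0
    while True:
        total += env.step(policy(env.observe()))
        if random.random() > gamma:      # STOP with probability 1 - gamma
            break
    return total / beta                  # importance weighting by 1/beta
```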

Traditional Regret

- Goal: regret approaches zero almost surely.
- There exists an algorithm that achieves this against all opponents.

Response Regret

- Goal: regret approaches zero almost surely.
- There exists an algorithm that achieves this against all opponents.

SPEED!

- Response regret takes time to minimize (the combination lock problem).
- Current work: restricting the adversary’s choice of policies. In particular, if the number of policies is N, then the regret is linear in N and polynomial in 1/(1-γ).

Related Work

- Other work
  - de Farias and Megiddo 2004
  - Browning, Bowling, and Veloso 2004
  - Bowling and McCracken 2005
- Episodic solutions: similar problems to the Finitely Repeated Prisoner’s Dilemma.

What is in a Name?

- Why not Consequence Regret?

Questions?

Thanks to:

Avrim Blum (CMU)

Michael Bowling (U Alberta)

Amy Greenwald (Brown)

Michael Littman (Rutgers)

Rich Sutton (U Alberta)

Practice

- Using these estimation techniques, it is possible to minimize response regret (make it approach zero almost surely in the limit in an ARBITRARY environment).
- Similar to the Folk Theorems, it is also possible to converge to the socially optimal behavior if γ is close enough to 1.

[Figure: Bob’s tit-for-tat machine, as above.]

Traditional Regret: Prisoner’s Dilemma, Possible Outcomes

- Alice cooperates, Bob cooperates:
- Alice: 1 year
- Bob: 1 year

- Alice defects, Bob cooperates:
- Alice: 0 years
- Bob: 6 years

- Alice cooperates, Bob defects:
- Alice: 6 years
- Bob: 0 years

- Alice defects, Bob defects:
- Alice: 5 years
- Bob: 5 years

Repeated Bimatrix Game

- The same one-shot game is played repeatedly.
- Either average reward or discounted reward is considered.

Rock-Paper-Scissors: Bob Plays BR to Alice’s Last

[Figure: successive rounds of the Rock-Paper-Scissors example, one frame per slide.]

One Slide Summary

- Problem: Prisoner’s Dilemma
- Solution: Infinitely Repeated Prisoner’s Dilemma
- Same Problem: Traditional Regret
- Solution: Response Regret

Formalism for FSMs (S, A, Ω, O, u, T)

- States S
- Finite actions A
- Finite observations Ω
- Observation function O: S → Ω
- Utility function u: S × A → ℝ
  (or u: S × Ω → ℝ)
- Transition function T: S × A → S
- $V^*_\gamma(s) = \max_{a \in A}\left[u(s,a) + \gamma\, V^*_\gamma(T(s,a))\right]$

[Figure: the formalism annotated: T(s,a) is the next state, O(s) the observation, u(s,a) the value, and $V^*_\gamma(s) = \max_{a \in A}\left[u(s,a) + \gamma\, V^*_\gamma(T(s,a))\right]$.]

Beliefs

Suppose B is a distribution over states.

- T(B, a, o): updated belief
- O(B, o): probability of observation o
- u(B, a): expected value
- $V^*_\gamma(B) = \max_{a \in A}\left[u(B,a) + \gamma \sum_{o \in \Omega} O(B,o)\, V^*_\gamma(T(B,a,o))\right]$
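To make the state-based recursion concrete, here is a minimal value iteration sketch for Bob’s grim trigger machine (payoffs come from the repeated bimatrix game above; the code itself is illustrative):

```python
# Sketch: value iteration for V*_gamma(s) on Bob's grim trigger machine.
# States: 'coop' (Bob cooperates) and 'defect' (Bob defects).
# Alice's utility u(s, a) and transition T(s, a) follow the slides above.

GAMMA = 2 / 3
STATES = ['coop', 'defect']
ACTIONS = ['C', 'D']

U = {('coop', 'C'): -1, ('coop', 'D'): 0,        # vs a cooperating Bob
     ('defect', 'C'): -6, ('defect', 'D'): -5}   # vs a defecting Bob
T = {('coop', 'C'): 'coop', ('coop', 'D'): 'defect',
     ('defect', 'C'): 'defect', ('defect', 'D'): 'defect'}

V = {s: 0.0 for s in STATES}
for _ in range(1000):  # iterate the Bellman operator to convergence
    V = {s: max(U[s, a] + GAMMA * V[T[s, a]] for a in ACTIONS)
         for s in STATES}

print(V)  # with gamma > 1/5, cooperating from 'coop' is optimal: V = -3
```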