## Response Regret


**Response Regret**
Martin Zinkevich, AAAI Fall Symposium, November 5th, 2005.
This work was supported by NSF Career Grant #IIS-0133689.

**Outline**
• Introduction
• Repeated Prisoners' Dilemma
  • Tit-for-Tat
  • Grim Trigger
• Traditional Regret
• Response Regret
• Conclusion

**The Prisoner's Dilemma**
• Two prisoners (Alice and Bob) are caught for a small crime. They make a deal not to squeal on each other for a large crime.
• The authorities then meet with each prisoner separately and offer a pardon for the small crime if the prisoner turns his or her partner in for the large crime.
• Each has two options:
  • Cooperate with the fellow prisoner, or
  • Defect from the deal.

**Bimatrix Game**

|                  | Bob Cooperates               | Bob Defects                  |
|------------------|------------------------------|------------------------------|
| Alice Cooperates | Alice: 1 year, Bob: 1 year   | Alice: 6 years, Bob: 0 years |
| Alice Defects    | Alice: 0 years, Bob: 6 years | Alice: 5 years, Bob: 5 years |

**The Problem**
• Each player, acting to slightly improve his or her own circumstances, hurts the other player — so much so that if they both acted "irrationally", they would both do better.

**A Better Model for Real Life**
• In real life there are consequences for misbehavior, and these consequences improve life.
• A better model: infinitely repeated games.

**The Goal**
• Can we come up with algorithms whose performance guarantees, in the presence of other intelligent agents, take the delayed consequences of actions into account?
• Side effect: a goal for reinforcement learning in infinite POMDPs.

**Regret Versus Standard RL**
• Regret gives guarantees of performance during learning.
• It gives no guarantee for the "final" policy... for now.
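The one-shot dilemma in the bimatrix above can be checked mechanically. A minimal sketch (the code and names are mine, not from the talk), using the negated prison years as utilities, confirms that defection is a dominant strategy for Alice even though mutual cooperation is better for both:

```python
# payoff[(alice_action, bob_action)] = (alice_utility, bob_utility),
# where utility = -(years in prison), so larger is better.
PAYOFF = {
    ("C", "C"): (-1, -1),
    ("C", "D"): (-6, 0),
    ("D", "C"): (0, -6),
    ("D", "D"): (-5, -5),
}

def alice_best_reply(bob_action):
    """Alice's utility-maximizing action against a fixed action of Bob."""
    return max("CD", key=lambda a: PAYOFF[(a, bob_action)][0])

# Defection is a best reply to both of Bob's actions, i.e. dominant...
assert alice_best_reply("C") == "D"
assert alice_best_reply("D") == "D"

# ...yet mutual cooperation beats mutual defection for both players.
assert PAYOFF[("C", "C")][0] > PAYOFF[("D", "D")][0]
assert PAYOFF[("C", "C")][1] > PAYOFF[("D", "D")][1]
```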
**A New Measure of Regret**
• Traditional regret measures immediate consequences.
• Response regret measures delayed effects.

**Repeated Bimatrix Game**

|                  | Bob Cooperates | Bob Defects |
|------------------|----------------|-------------|
| Alice Cooperates | -1, -1         | -6, 0       |
| Alice Defects    | 0, -6          | -5, -5      |

**Finite State Machines (for Bob)**
• Grim Trigger: Bob starts in a cooperate state and stays there as long as Alice cooperates; the first time Alice defects, he moves to a defect state and defects forever, whatever Alice does afterwards.
• Always Cooperate: a single state in which Bob cooperates, whatever Alice does.
• Always Defect: a single state in which Bob defects, whatever Alice does.
• Tit-for-Tat: Bob starts by cooperating; thereafter he plays whatever Alice played on the previous round, cooperating after she cooperates and defecting after she defects.

**Discounted Utility**
• Model the repeated game as a stopping process: after each round, play continues (GO) with probability γ and ends (STOP) with probability 1 − γ. (The figure illustrates Pr[GO] = 2/3, Pr[STOP] = 1/3.)
• The discounted utility is the expected value of that process:
  Σ_{t=1}^∞ u_t γ^(t−1)

**Optimal Value Functions for FSMs**
• V*(s): discounted utility of the optimal policy from state s.
• V0*(s): immediate maximum utility at state s.
• V*(B): discounted utility of the optimal policy given a belief B over states.
• V0*(B): immediate maximum utility given a belief B over states.

**Best Responses, Discounted Utility**
• If γ > 1/5, a policy is a best response to grim trigger iff it always cooperates when playing grim trigger.
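The γ > 1/5 threshold can be checked numerically. A minimal sketch (the simulation code, helper names, and truncation horizon are my own, not from the talk) plays fixed policies against a grim-trigger Bob and sums Σ u_t γ^(t−1):

```python
# Alice's utility for each joint action in the repeated game.
PAYOFF = {("C", "C"): -1, ("C", "D"): -6, ("D", "C"): 0, ("D", "D"): -5}

def grim_trigger():
    """Bob's grim trigger: cooperate until Alice defects once, then defect forever."""
    triggered = False
    def play(alice_last):
        nonlocal triggered
        if alice_last == "D":
            triggered = True
        return "D" if triggered else "C"
    return play

def discounted_utility(alice_policy, gamma, horizon=2000):
    """Truncated sum of u_t * gamma^(t-1) against a fresh grim-trigger Bob."""
    bob = grim_trigger()
    total, alice_last = 0.0, None
    for t in range(horizon):
        bob_action = bob(alice_last)      # Bob reacts to Alice's previous move
        alice_action = alice_policy(t)
        total += PAYOFF[(alice_action, bob_action)] * gamma**t
        alice_last = alice_action
    return total

always_cooperate = lambda t: "C"
always_defect = lambda t: "D"

# Above the threshold (gamma = 1/2 > 1/5), cooperating beats defecting;
# below it (gamma = 1/10 < 1/5), the ordering flips.
assert discounted_utility(always_cooperate, 0.5) > discounted_utility(always_defect, 0.5)
assert discounted_utility(always_cooperate, 0.1) < discounted_utility(always_defect, 0.1)
```

The threshold also falls out analytically: cooperating forever is worth −1/(1 − γ), while the best deviation (defect forever) is worth −5γ/(1 − γ), and −1 > −5γ exactly when γ > 1/5.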
**Best Responses, Discounted Utility**
• Similarly, if γ > 1/5, a policy is a best response to tit-for-tat iff it always cooperates when playing tit-for-tat.

**Knowing Versus Learning**
• Given a known FSM for the opponent, we can determine the optimal policy (for some γ) from an initial state.
• However, if it is an unknown FSM, by the time we learn what it is, it will be too late to act optimally.

**Grim Trigger or Always Cooperate?**
• Grim trigger and always cooperate behave identically until Alice first defects, so Alice cannot tell them apart without defecting — and if Bob is playing grim trigger, that one defection is permanently costly.
• For learning, optimality from the initial state is a bad goal.

**Deterministic Infinite SMs**
• Infinite state machines can represent any deterministic policy.
• De-randomization: a sampled sequence of actions (e.g. D D D C D C C) defines one such deterministic policy.

**New Goal**
• Can a measure of regret allow us to play like tit-for-tat in the Infinitely Repeated Prisoner's Dilemma?
• In addition, it should be possible for one algorithm to minimize regret against all possible opponents (finite and infinite SMs).

**Utility of the Algorithm**
• Define u_t to be the utility of ALG at time t.
• Define u0_ALG to be: u0_ALG = (1/T) Σ_{t=1}^T u_t
• Here: u0_ALG = (1/5)(0 + 1 + (-1) + 1 + 0) = 1/5

**Rock-Paper-Scissors**
• Count the visits to Bob's internal states: 3 visits to one state and 1 visit to each of the other two.
• As empirical frequencies: 3/5, 1/5, and 1/5.
• Weighting Bob's states by these frequencies gives expected utilities of 0, 2/5, and −2/5 for Alice's three fixed actions.

**Traditional Regret**
• Let B be the empirical frequency with which states were visited.
• Let u0_ALG be the average utility of the algorithm.
• The traditional regret of ALG is: R = V0*(B) − u0_ALG
• In the example: R = (2/5) − (1/5) = 1/5
• Goal: regret should approach zero almost surely.
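The rock-paper-scissors regret computation can be reproduced in code. A sketch follows; the concrete history of Bob's moves is an illustrative assumption of mine, but the frequencies (3/5, 1/5, 1/5), the fixed-action values (0, 2/5, −2/5), and the regret R = 1/5 match the numbers above:

```python
ACTIONS = ["R", "P", "S"]
BEATS = {"R": "S", "P": "R", "S": "P"}  # key beats value

def utility(alice, bob):
    """Alice's utility for one round: win +1, tie 0, loss -1."""
    if alice == bob:
        return 0
    return 1 if BEATS[alice] == bob else -1

# Assumed empirical play for Bob: rock three times, paper once, scissors once.
bob_history = ["R", "R", "R", "P", "S"]
freq = {a: bob_history.count(a) / len(bob_history) for a in ACTIONS}

# Expected utility of each fixed action against the empirical frequencies B.
fixed = {a: sum(freq[b] * utility(a, b) for b in ACTIONS) for a in ACTIONS}
best_fixed = max(fixed.values())   # V0*(B) = 2/5, achieved by always playing paper

u_alg = 1 / 5                      # average utility of ALG, as in the slides
regret = best_fixed - u_alg        # R = V0*(B) - u0_ALG

assert abs(fixed["P"] - 2 / 5) < 1e-9 and abs(fixed["R"]) < 1e-9
assert abs(regret - 1 / 5) < 1e-9
```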
• There exists an algorithm that will do this against all opponents.

**What Algorithm?**
• Gradient ascent with Euclidean projection (Zinkevich, 2003).
• Exponentially weighted experts (Littlestone and Warmuth, 1994), and a close relative (when the p_i are strictly positive).
• Regret matching.
• Lots of them!

**Extensions to Traditional Regret**
• (Foster and Vohra, 1997) Into the past: condition on a short history, so as to be optimal against a best response to Alice's last action.
• (Auer et al.) The bandit setting: we only see u_t, not u_{i,t} for every action i; use an unbiased estimator of u_{i,t}.

**This Talk**
• Do you want to minimize traditional regret?
• Even then, is it possible?

**Traditional Regret: Prisoner's Dilemma**
• Suppose Alice plays C C D C D D D D ... against tit-for-tat; Bob responds C C C D C D D D ..., so over twenty rounds Bob's cooperate state is visited with frequency 0.2 and his defect state with frequency 0.8.
• Against that empirical belief over Bob's states, always defecting is worth 0.2(0) + 0.8(−5) = −4, while always cooperating is worth 0.2(−1) + 0.8(−6) = −5.
• Traditional regret therefore favors defection: it ignores the delayed consequences of Alice's defections.
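One of the listed algorithms, exponentially weighted experts, can be sketched in a few lines. The learning rate, horizon, and toy opponent below are my illustrative choices, not from the talk; the point is only that the per-round traditional regret shrinks over time:

```python
import math
import random

def hedge(utilities_fn, n_actions, T, eta=0.1, seed=0):
    """Exponentially weighted experts: returns average (traditional) regret.

    utilities_fn(t) must return the full utility vector for round t
    (full-information feedback, utilities in [-1, 1]).
    """
    rng = random.Random(seed)
    weights = [1.0] * n_actions
    total, best_fixed = 0.0, [0.0] * n_actions
    for t in range(T):
        z = sum(weights)
        probs = [w / z for w in weights]
        action = rng.choices(range(n_actions), probs)[0]
        u = utilities_fn(t)
        total += u[action]
        for i in range(n_actions):
            best_fixed[i] += u[i]
            weights[i] *= math.exp(eta * u[i])  # multiplicative update
    # average regret against the best fixed action in hindsight
    return (max(best_fixed) - total) / T

# Toy opponent: rock-paper-scissors against a Bob who cycles R,R,R,P,S,
# so his empirical frequencies are rock 3/5, paper 1/5, scissors 1/5.
def rps_round(t):
    bob = "RRRPS"[t % 5]
    # Bob's move -> Alice's utilities for (R, P, S)
    table = {"R": [0, 1, -1], "P": [-1, 0, 1], "S": [1, -1, 0]}
    return table[bob]

avg_regret = hedge(rps_round, 3, T=5000)
assert avg_regret < 0.1  # shrinks toward zero as T grows
```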