
Online Learning + Game Theory


Presentation Transcript


  1. Online Learning + Game Theory Your guide: Avrim Blum Carnegie Mellon University Indo-US Lectures Week in Machine Learning, Game Theory, and Optimization

  2. Theme Simple but powerful algorithms for online decision-making when you don’t have a lot of knowledge. Nice connections between these algorithms and game-theoretic notions of optimality and equilibrium. Start with something called the problem of “combining expert advice”.

  3. Using “expert” advice Say we want to predict the stock market. • We solicit n “experts” for their advice. (Will the market go up or down?) • We then want to use their advice somehow to make our prediction. Basic question: Is there a strategy that allows us to do nearly as well as the best of these in hindsight? [“expert” = someone with an opinion. Not necessarily someone who knows anything.]

  4. Simpler question • We have n “experts”. • One of these is perfect (never makes a mistake). We just don’t know which one. • Can we find a strategy that makes no more than lg(n) mistakes? • Answer: sure. Just take majority vote over all experts that have been correct so far. • Each mistake cuts # available by factor of 2. • Note: this means ok for n to be very large.
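A minimal Python sketch of this majority-vote (“halving”) strategy; the function name and the “U”/“D” input encoding are illustrative assumptions, not from the slides.

```python
def halving_predict(expert_preds, truths):
    """Majority vote over the experts that have been correct so far.

    expert_preds[t][i] = prediction of expert i at time t (e.g. "U" or "D")
    truths[t]          = actual outcome at time t
    Assumes some expert is perfect, so the `alive` set never empties;
    then the number of mistakes is at most lg(n).
    """
    n = len(expert_preds[0])
    alive = set(range(n))                 # experts consistent with all outcomes so far
    mistakes = 0
    for preds, truth in zip(expert_preds, truths):
        votes = [preds[i] for i in alive]
        guess = max(set(votes), key=votes.count)   # majority vote of surviving experts
        if guess != truth:
            mistakes += 1                 # each mistake halves the alive set
        alive = {i for i in alive if preds[i] == truth}   # cross off wrong experts
    return mistakes
```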

  5. What if no expert is perfect? One idea: just run above protocol until all experts are crossed off, then repeat. Makes at most log(n) mistakes per mistake of the best expert (plus initial log(n)). Seems wasteful. Constantly forgetting what we've “learned”. Can we do better?

  6. What if no expert is perfect? Intuition: Making a mistake doesn't completely disqualify an expert. So, instead of crossing off, just lower its weight. Weighted Majority Alg: • Start with all experts having weight 1. • Predict based on weighted majority vote. • Penalize mistakes by cutting weight in half. Example: weights 1 1 1 1; predictions U U U D; truth D. We predict U (wrong), so the new weights are ½ ½ ½ 1.
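A short Python sketch of the Weighted Majority algorithm as just described (same illustrative interface as the halving sketch above):

```python
def weighted_majority(expert_preds, truths, options=("U", "D")):
    """Deterministic Weighted Majority: halve the weight of each wrong expert.

    Makes at most roughly 2.4*(m + lg n) mistakes, where m is the number of
    mistakes of the best expert in hindsight.
    """
    n = len(expert_preds[0])
    w = [1.0] * n
    mistakes = 0
    for preds, truth in zip(expert_preds, truths):
        # weighted vote for each option
        totals = {o: sum(w[i] for i in range(n) if preds[i] == o) for o in options}
        guess = max(totals, key=totals.get)
        if guess != truth:
            mistakes += 1
        for i in range(n):
            if preds[i] != truth:
                w[i] *= 0.5               # penalize mistakes by halving the weight
    return mistakes
```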

  7. Analysis: do nearly as well as best expert in hindsight • M = # mistakes we've made so far. • m = # mistakes best expert has made so far. • W = total weight (starts at n). • After each mistake, W drops by at least 25%. So, after M mistakes, W is at most n(3/4)^M. • Weight of best expert is (1/2)^m. • So, (1/2)^m ≤ n(3/4)^M, which gives M ≤ 2.4(m + lg n): a constant ratio. So, if m is small, then M is pretty small too.
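Written out, the "constant ratio" step is just:

```latex
\left(\tfrac12\right)^m \;\le\; W \;\le\; n\left(\tfrac34\right)^M
\quad\Longrightarrow\quad
M \;\le\; \frac{m + \lg n}{\lg(4/3)} \;\approx\; 2.4\,(m + \lg n).
```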

  8. Randomized Weighted Majority M ≤ 2.4(m + lg n) is not so good if the best expert makes a mistake 20% of the time. Can we do better? Yes. • Instead of taking a majority vote, use the weights as probabilities. (E.g., if 70% of the weight is on up and 30% on down, then predict up with probability 0.7.) Idea: smooth out the worst case. • Also, generalize the penalty ½ to (1-ε). With M = expected # mistakes, this gives roughly M ≤ (1+ε)m + ε⁻¹ ln(n); unlike most worst-case bounds, the numbers are pretty good.
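A possible Python sketch of Randomized Weighted Majority for this prediction setting (parameter names and the "U"/"D" encoding are assumptions for illustration):

```python
import random

def rwm_predict(expert_preds, truths, eps=0.1, options=("U", "D")):
    """Randomized Weighted Majority: use the weights as probabilities
    instead of a deterministic majority vote, and penalize by (1 - eps).

    Expected mistakes M satisfy roughly M <= (1 + eps)*m + ln(n)/eps,
    where m = mistakes of the best expert in hindsight.
    """
    n = len(expert_preds[0])
    w = [1.0] * n
    mistakes = 0
    for preds, truth in zip(expert_preds, truths):
        total = sum(w)
        # probability of predicting each option = fraction of weight on it
        probs = {o: sum(w[i] for i in range(n) if preds[i] == o) / total
                 for o in options}
        guess = random.choices(list(probs), weights=list(probs.values()))[0]
        if guess != truth:
            mistakes += 1
        for i in range(n):
            if preds[i] != truth:
                w[i] *= (1.0 - eps)       # generalize the 1/2 penalty to (1 - eps)
    return mistakes
```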

  9. Analysis • Say at time t we have fraction F_t of the weight on experts that made a mistake. • So, we have probability F_t of making a mistake, and we remove an εF_t fraction of the total weight. • W_final = n(1-εF_1)(1-εF_2)… • ln(W_final) = ln(n) + Σ_t ln(1-εF_t) ≤ ln(n) - εΣ_t F_t (using ln(1-x) < -x) = ln(n) - εM. (Σ_t F_t = E[# mistakes] = M.) • If the best expert makes m mistakes, then ln(W_final) ≥ ln((1-ε)^m) = m ln(1-ε). • Now solve: ln(n) - εM ≥ m ln(1-ε), i.e., M ≤ (1/ε)(ln(n) - m ln(1-ε)), which is at most (1+ε)m + ε⁻¹ ln(n) for ε ≤ ½.

  10. Additive regret • So, we have M ≤ OPT + εOPT + ε⁻¹ log(n), where OPT = # mistakes of the best expert and T = # time steps. • If we set ε = (log(n)/OPT)^{1/2} to balance the two terms out (or use guess-and-double), we get the bound M ≤ OPT + 2(OPT·log n)^{1/2} ≤ OPT + 2(T·log n)^{1/2}. • These are called “additive regret” bounds: M/T ≤ OPT/T + 2(log(n)/T)^{1/2}, so the per-step gap vanishes as T grows (“no regret”).

  11. Extensions • What if experts are actions? (Rows in a matrix game, ways to drive to work, …) • At each time t, each action has a loss (cost) in {0,1}. • Can still run the algorithm: rather than viewing it as “pick a prediction with prob proportional to its weight”, view it as “pick an expert with probability proportional to its weight”. • Alg pays expected cost Σ_i p_i c_i. • Same analysis applies. Do nearly as well as the best action in hindsight!

  12. Extensions • What if losses (costs) are in [0,1]? • Just modify the update rule: w_i ← w_i(1 - εc_i). • Fraction of weight removed from the system is ε·Σ_i p_i c_i = ε·(alg's expected cost). • Analysis very similar to the {0,1} case.
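A sketch of this action/cost version with costs in [0,1], tracking the algorithm's expected cost directly rather than sampling (the interface is illustrative):

```python
def rwm_actions(cost_vectors, eps=0.1):
    """RWM over n actions with costs in [0,1]: play action i with probability
    proportional to its weight, then update w_i <- w_i * (1 - eps * c_i).

    cost_vectors[t][i] = cost of action i at time t.
    Returns (expected cost of the algorithm, cost of the best fixed action).
    """
    n = len(cost_vectors[0])
    w = [1.0] * n
    exp_cost = 0.0
    cum = [0.0] * n                       # cumulative cost of each fixed action
    for c in cost_vectors:
        total = sum(w)
        p = [wi / total for wi in w]
        exp_cost += sum(p[i] * c[i] for i in range(n))   # expected cost this round
        for i in range(n):
            w[i] *= (1.0 - eps * c[i])    # fractional penalty for fractional cost
            cum[i] += c[i]
    return exp_cost, min(cum)
```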

  13. RWM (multiplicative weights alg) Picture: the world / life / opponent chooses cost columns c^1, c^2, … (scaled so costs are in [0,1]). Every row starts with weight 1; after cost vector c^t arrives, row i's weight is multiplied by (1 - εc_i^t), so after two rounds it is (1-εc_i^1)(1-εc_i^2), and so on. Guarantee: do nearly as well as the best fixed row in hindsight. Which implies doing nearly as well (or better) than minimax optimal.

  14. Connections to minimax optimality (Same setup: the world / life / opponent picks cost columns, costs scaled to [0,1], and RWM multiplies row i's weight by (1 - εc_i^t) each round.) If we play RWM against a best-response oracle, play will approach minimax optimality (most rounds will be close). (If it didn't, we wouldn't be getting the promised guarantee.)

  15. Connections to minimax optimality (Same setup as the previous slide.) If we play two copies of RWM against each other, then the empirical distributions of play must be near-minimax-optimal. (Else, one or the other could and would take advantage.)
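To see this claim empirically, one can run two copies of RWM against each other on a small zero-sum game; the sketch below (with made-up parameter defaults) returns the time-averaged mixed strategies, which should be close to minimax optimal.

```python
def rwm_vs_rwm(M, T=2000, eps=0.05):
    """Two copies of RWM playing a zero-sum game with row-player costs
    M[i][j] in [0,1] (the column player wants these costs to be high).
    Returns the empirical (time-averaged) mixed strategies of both players."""
    n, m = len(M), len(M[0])
    wr, wc = [1.0] * n, [1.0] * m
    avg_r, avg_c = [0.0] * n, [0.0] * m
    for _ in range(T):
        sr, sc = sum(wr), sum(wc)
        p = [w / sr for w in wr]
        q = [w / sc for w in wc]
        for i in range(n):
            avg_r[i] += p[i] / T
        for j in range(m):
            avg_c[j] += q[j] / T
        # expected cost of each pure row against q, and expected row cost
        # produced by each pure column against p
        row_cost = [sum(M[i][j] * q[j] for j in range(m)) for i in range(n)]
        col_value = [sum(M[i][j] * p[i] for i in range(n)) for j in range(m)]
        wr = [wr[i] * (1 - eps * row_cost[i]) for i in range(n)]
        wc = [wc[j] * (1 - eps * (1 - col_value[j])) for j in range(m)]
    return avg_r, avg_c
```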

  16. A natural generalization • A natural generalization of our regret goal (thinking of driving) is: what if we also want that on rainy days, we do nearly as well as the best route for rainy days? • And on Mondays, do nearly as well as the best route for Mondays. • More generally, we have N “rules” (e.g., “on Monday, use path P”). Goal: simultaneously, for each rule i, guarantee to do nearly as well as it on the time steps in which it fires. • For all i, we want E[cost_i(alg)] ≤ (1+ε)·cost_i(i) + O(ε⁻¹ log N), where cost_i(X) = cost of X on the time steps where rule i fires. • Can we get this?

  17. A natural generalization • This generalization is especially natural in machine learning for combining multiple if-then rules. • E.g., document classification. Rule: “if <word-X> appears then predict <Y>”. E.g., if the document contains football then classify it as sports. • So, if 90% of documents with football are about sports, we should have error ≤ 11% on them. This is the “specialists” or “sleeping experts” problem. • Assume we have N rules. • For all i, we want E[cost_i(alg)] ≤ (1+ε)·cost_i(i) + O(ε⁻¹ log N), where cost_i(X) = cost of X on the time steps where rule i fires.

  18. A simple algorithm and analysis (all on one slide) • Start with all rules at weight 1. • At each time step, among the rules i that fire, select one with probability p_i ∝ w_i. • Update weights: • If a rule didn't fire, leave its weight alone. • If it did fire, raise or lower it depending on performance compared to the weighted average: • r_i = [Σ_j p_j·cost(j)]/(1+ε) − cost(i) • w_i ← w_i·(1+ε)^{r_i} • So, if rule i does exactly as well as the weighted average, its weight drops a little. Its weight increases if it does better than the weighted average by more than a (1+ε) factor. This ensures the sum of weights doesn't increase. • Final w_i = (1+ε)^{E[cost_i(alg)]/(1+ε) − cost_i(i)}. So, the exponent is ≤ ε⁻¹ log N. • So, E[cost_i(alg)] ≤ (1+ε)·cost_i(i) + O(ε⁻¹ log N).
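One round of this specialists update might look like the following Python sketch (the dict-based interface is an assumption, not from the slides):

```python
def sleeping_experts_step(w, fired, costs, eps):
    """One round of the "sleeping experts" update described above.

    w      : dict rule -> current weight (mutated in place)
    fired  : iterable of rules that fire on this time step
    costs  : dict rule -> cost in [0,1] of that rule's advice this step
    Returns the distribution p over fired rules that the algorithm plays.
    """
    fired = list(fired)
    total = sum(w[i] for i in fired)
    p = {i: w[i] / total for i in fired}          # play rule i with prob ~ w_i
    avg = sum(p[j] * costs[j] for j in fired)     # weighted-average cost
    for i in fired:
        r_i = avg / (1 + eps) - costs[i]          # better than average -> r_i > 0
        w[i] *= (1 + eps) ** r_i                  # non-firing rules stay untouched
    return p
```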

  19. Application: adapting to change • What if we want to adapt to change, i.e., do nearly as well as the best recent expert? • For each expert, instantiate a copy that wakes up on day t, for each 0 ≤ t ≤ T-1. • Our cost over the previous t days is at most (1+ε)·(cost of the best expert over the last t days) + O(ε⁻¹ log(NT)). • (Not the best possible bound, since there is an extra log(T) factor, but not bad.)

  20. [ACFS02]: applying RWM to the bandit setting • What if we only get our own cost/benefit as feedback? • Use RWM as a subroutine to get an algorithm with cumulative regret O((TN log N)^{1/2}) [average regret O(((N log N)/T)^{1/2})]. • We will do a somewhat weaker version of their analysis (same algorithm but not as tight a bound). • For fun, let's talk about it in the context of online pricing…

  21. Online pricing View each possible price as a different row/expert. • Say you are selling lemonade (or a cool new software tool, or bottles of water at the world cup). • For t = 1, 2, …, T: • Seller sets price p_t. • Buyer arrives with valuation v_t. • If v_t ≥ p_t, the buyer purchases and pays p_t; else doesn't. • Repeat. • Assume all valuations ≤ h. • Goal: do nearly as well as the best fixed price in hindsight. • If v_t is revealed, run RWM: E[gain] ≥ OPT(1-ε) − O(ε⁻¹ h log n).

  22. Multi-armed bandit problem • [Auer, Cesa-Bianchi, Freund, Schapire]: Exponential Weights for Exploration and Exploitation (Exp3). n = # experts. Run RWM as a subroutine: RWM outputs distribution p^t; Exp3 plays expert i drawn from q^t = (1-γ)p^t + γ·uniform, observes gain g_i^t, and feeds RWM the estimated gain vector ĝ^t = (0, …, 0, g_i^t/q_i^t, 0, …, 0), whose nonzero entry is ≤ nh/γ. 1. RWM believes its gain is p^t·ĝ^t = p_i^t(g_i^t/q_i^t) ≡ g^t_RWM. 2. So Σ_t g^t_RWM ≥ (1-ε)·max_j Σ_t ĝ_j^t − O(ε⁻¹(nh/γ) log n). 3. The actual gain is g_i^t = g^t_RWM·(q_i^t/p_i^t) ≥ g^t_RWM·(1-γ). 4. Because E[ĝ_j^t] = (1-q_j^t)·0 + q_j^t·(g_j^t/q_j^t) = g_j^t, we have E[max_j Σ_t ĝ_j^t] ≥ max_j E[Σ_t ĝ_j^t] = OPT.

  23. Multi-armed bandit problem • [Auer, Cesa-Bianchi, Freund, Schapire]: Exp3, continued. Conclusion (setting γ = ε): E[Exp3] ≥ OPT(1-ε)² − O(ε⁻²·nh·log(n)). Balancing the terms would give O((OPT·nh·log n)^{2/3}) in the bound because of the ε⁻². But with more care in the analysis, the ε⁻² can be reduced to ε⁻¹, giving O((OPT·nh·log n)^{1/2}).
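A rough Python sketch of the Exp3 loop under the scaling above; the RWM update here uses a (1+ε)-power rule on the rescaled estimated gain, and get_gain together with the parameter defaults are illustrative assumptions.

```python
import random

def exp3(n, T, get_gain, h=1.0, gamma=0.1, eps=0.1):
    """Exp3 sketch: RWM run on estimated gains, mixed with gamma-uniform
    exploration. get_gain(t, i) returns the gain in [0, h] of arm i at time t;
    only the pulled arm's gain is ever used (bandit feedback)."""
    w = [1.0] * n
    total_gain = 0.0
    for t in range(T):
        sw = sum(w)
        p = [wi / sw for wi in w]                                   # RWM distribution
        q = [(1 - gamma) * p[i] + gamma / n for i in range(n)]      # add exploration
        i = random.choices(range(n), weights=q)[0]
        g = get_gain(t, i)
        total_gain += g
        ghat = g / q[i]                       # unbiased gain estimate for arm i
        # RWM update on the estimated gain vector (0,...,0, ghat, 0,...,0),
        # rescaled by its maximum possible value nh/gamma so the exponent is <= 1
        w[i] *= (1 + eps) ** (ghat / (n * h / gamma))
    return total_gain
```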

  24. Summary (of results so far) Algorithms for online decision-making with strong guarantees on performance compared to best fixed choice. • Application: play repeated game against adversary. Perform nearly as well as fixed strategy in hindsight. Can apply even with very limited feedback. • Application: which way to drive to work, with only feedback about your own paths; online pricing, even if only have buy/no buy feedback.

  25. Internal/Swap Regret and Correlated Equilibria

  26. What if all players minimize regret? • In zero-sum games, empirical frequencies quickly approach minimax optimality. • In general-sum games, does behavior quickly (or at all) approach a Nash equilibrium? • After all, a Nash equilibrium is exactly a set of distributions that are no-regret wrt each other. So if the distributions stabilize, they must converge to a Nash equilibrium. • Well, unfortunately, they might not stabilize.

  27. An interesting bad example • [Balcan-Constantin-Mehta12]: • Failure to converge even in Rank-1 games (games where R+C has rank 1). • Interesting because one can find equilibria efficiently in such games.

  28. What can we say? If algorithms minimize “internal” or “swap” regret, then empirical distribution of play approaches correlated equilibrium. • Foster & Vohra, Hart & Mas-Colell,… • Though doesn’t imply play is stabilizing. What are internal/swap regret and correlated equilibria?

  29. More general forms of regret • “best expert” or “external” regret: • Given n strategies. Compete with best of them in hindsight. • “sleeping expert” or “regret with time-intervals”: • Given n strategies, k properties. Let Si be set of days satisfying property i (might overlap). Want to simultaneously achieve low regret over each Si. • “internal” or “swap” regret: like (2), except that Si = set of days in which we chose strategy i.

  30. Internal/swap-regret • E.g., each day we pick one stock to buy shares in. • Don’t want to have regret of the form “every time I bought IBM, I should have bought Microsoft instead”. • Formally, swap regret is wrt the optimal function f:{1,…,n} → {1,…,n} such that every time you played action j, it plays f(j).
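Concretely, the swap regret of a fixed sequence of plays can be computed as in this small Python sketch (a hypothetical helper, not from the slides):

```python
def swap_regret(actions, cost_vectors):
    """Swap regret of a play sequence: how much we could have saved by
    replacing every play of action j with the best alternative f(j).

    actions[t]         = index of the action we played at time t
    cost_vectors[t][i] = cost of action i at time t
    """
    n = len(cost_vectors[0])
    # cost_if[j][i]: total cost of action i over the time steps where we played j
    cost_if = [[0.0] * n for _ in range(n)]
    actual = 0.0
    for a, c in zip(actions, cost_vectors):
        actual += c[a]
        for i in range(n):
            cost_if[a][i] += c[i]
    best_after_swap = sum(min(row) for row in cost_if)   # best f(j) for each j
    return actual - best_after_swap
```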

  31. Weird… why care? “Correlated equilibrium” • Distribution over entries in the matrix, such that if a trusted party chooses one at random and tells you your part, you have no incentive to deviate. • E.g., the Shapley game:

          R       P       S
    R  -1,-1    -1,1    1,-1
    P   1,-1   -1,-1    -1,1
    S   -1,1    1,-1   -1,-1

In general-sum games, if all players have low swap regret, then the empirical distribution of play is an approximate correlated equilibrium.

  32. Connection • If all parties run a low-swap-regret algorithm, then the empirical distribution of play is an approximate correlated equilibrium. • Correlator chooses a random time t ∈ {1, 2, …, T}. Tells each player to play the action j they played at time t (but does not reveal the value of t). • Expected incentive to deviate: Σ_j Pr(j)·(Regret | played j) = swap regret of the algorithm. • So, this suggests correlated equilibria may be natural things to see in multi-agent systems where individuals are optimizing for themselves.

  33. Correlated vs coarse-correlated equilibria In both cases: a distribution over entries in the matrix. Think of a third party choosing from this distribution and telling you your part as “advice”. “Correlated equilibrium” • You have no incentive to deviate, even after seeing what the advice is. “Coarse-correlated equilibrium” • If your only choice is to see and follow the advice, or not to see it at all, you would prefer the former. Low external regret ⇒ approximate coarse-correlated equilibrium.

  34. Internal/swap-regret, contd Algorithms for achieving low regret of this form: • Foster & Vohra, Hart & Mas-Colell, Fudenberg & Levine. • Will present method of [BM05] showing how to convert any “best expert” algorithm into one achieving low swap regret. • Unfortunately, #steps to achieve low swap regret is O(n log n) rather than O(log n).

  35. Can convert any “best expert” algorithm A into one achieving low swap regret. Idea: • Instantiate one copy A_j responsible for our expected regret over the times we play j. Each copy A_j outputs a distribution q_j; stack these as the rows of a matrix Q and play the distribution p = pQ. When cost vector c arrives, give copy A_j the scaled feedback p_j·c. • This allows us to view p_j as the prob we play action j, or as the prob we play algorithm A_j. • A_j guarantees Σ_t (p_j^t c^t)·q_j^t ≤ min_i Σ_t p_j^t c_i^t + [regret term]. • Write this as: Σ_t p_j^t (q_j^t·c^t) ≤ min_i Σ_t p_j^t c_i^t + [regret term].

  36. Can convert any “best expert” algorithm A into one achieving low swap regret, continued (same setup: copies A_1, …, A_n, play p = pQ, give A_j feedback p_j·c). • Write as: Σ_t p_j^t (q_j^t·c^t) ≤ min_i Σ_t p_j^t c_i^t + [regret term]. • Sum over j to get: Σ_t p^t Q^t c^t ≤ Σ_j min_i Σ_t p_j^t c_i^t + n·[regret term]. The left side is our total cost; on the right, for each j we may move our probability to its own best i = f(j).

  37. Can convert any “best expert” algorithm A into one achieving low swap regret, continued. • Sum over j to get: Σ_t p^t Q^t c^t ≤ Σ_j min_i Σ_t p_j^t c_i^t + n·[regret term]. (Left side: our total cost. Right side: for each j, move our probability to its own best i = f(j).) • So we get swap regret at most n times the original external regret.
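A Python sketch of this reduction; the sub-algorithm interface (.get_distribution() / .update(...)) and the power-iteration step for finding p = pQ are assumptions made for illustration.

```python
def combine_for_swap_regret(subalgs, cost_vector_stream, iters=100):
    """Sketch of the [BM05]-style reduction: run one external-regret algorithm
    A_j per action j; each A_j proposes a distribution q_j (row j of Q); play
    the stationary distribution p with p = pQ, and feed A_j the scaled cost p_j*c.

    subalgs[j] is assumed to expose .get_distribution() -> list q_j and
    .update(cost_vector); cost_vector_stream yields cost vectors c in [0,1]^n.
    """
    n = len(subalgs)
    for c in cost_vector_stream:
        Q = [subalgs[j].get_distribution() for j in range(n)]   # row j = q_j
        # find p with p = pQ by power iteration (Q is row-stochastic; assumes
        # the iteration converges, which holds for fully-mixed rows)
        p = [1.0 / n] * n
        for _ in range(iters):
            p = [sum(p[j] * Q[j][i] for j in range(n)) for i in range(n)]
        # play action i with probability p_i (sampling not shown); then give
        # A_j the cost vector scaled by p_j, so A_j "owns" the rounds where we play j
        for j in range(n):
            subalgs[j].update([p[j] * ci for ci in c])
    return subalgs
```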

  38. Summary (game theory connection) Zero-sum game: two players minimizing external regret → minimax optimality. General-sum game: players minimizing external regret → coarse-correlated equilibrium. General-sum game: players minimizing internal regret → correlated equilibrium. These algorithms can also be useful for fast approximately optimal solutions to optimization problems.
