
Connections between Learning Theory, Game Theory, and Optimization




  1. Connections between Learning Theory, Game Theory, and Optimization Lecture 1, August 24th, 2010 Maria Florina (Nina) Balcan

  2. Big Picture Over the past decades, many important and deep connections have emerged between: • machine learning theory • algorithmic game theory • combinatorial optimization We will explore these connections, discussing: • fundamental topics in each area • how ideas from each area can shed light on the others.

  3. Outline Online learning. Combining expert advice. Regret minimization (no external regret and no internal regret). Bandit algorithms. Zero-sum games. Nash equilibria. Experts learning & the minimax theorem. Nash equilibria and approximate Nash equilibria in general-sum bimatrix games.

  4. Outline Learning in a distributional setting. Sample complexity results. Weak learning vs. strong learning. Boosting, with connections to game theory. Quality of equilibria (price of anarchy/stability). Games with many players. Potential games. Dynamics in games and the price of learning.

  5. Outline Mechanism design (MD). Combinatorial auctions. [Social welfare; revenue maximization] Auctions for digital goods. • Reductions from MD to algorithm design using machine learning. Algorithmic pricing problems. • Online learning for designing online pricing schemes.

  6. Outline Submodularity with connections to game theory and machine learning. • Combinatorial auctions with submodular valuations • Learning submodular functions • Other optimization problems involving submodularity (ranking, clustering, etc.)

  7. Admin • Course web page: http://www.cc.gatech.edu/~ninamf/LGO10/ • 3 homework assignments: exercises/problems (pencil-and-paper problem-solving variety). [50%] • Project: explore a theoretical question, try some experiments, or read a couple of papers and explain the idea. Write-up and class presentation. Groups OK. [50%] • "Algorithmic Game Theory", Nisan, Roughgarden, Tardos, Vazirani (eds) • Other papers, surveys, and tutorials

  8. Online learning, minimizing regret, and combining expert advice. • "The weighted majority algorithm", N. Littlestone & M. Warmuth • "Online Algorithms in Machine Learning" (survey), A. Blum • Algorithmic Game Theory, Nisan, Roughgarden, Tardos, Vazirani (eds) [Chapter 4] • Prediction, Learning, and Games, Cesa-Bianchi & Lugosi

  9. Online learning, minimizing regret, and combining expert advice. [Figure: three "experts" offering predictions]

  10. Using "expert" advice Assume we want to predict the stock market. • Will the market go up or down? • We solicit n "experts" for their advice. • We then want to use their advice somehow to make our prediction. • Goal: can we do nearly as well as the best expert in hindsight? Note: "expert" = someone with an opinion. [Not necessarily someone who knows anything.]

  11. Formal model • There are n experts. • For each round t = 1, 2, …, T: • Each expert makes a prediction in {0,1}. • The learner (using the experts' predictions) makes a prediction in {0,1}. • The learner observes the actual outcome. The learner makes a mistake if its predicted outcome differs from the actual outcome. Can we do nearly as well as the best expert in hindsight?

  12. Weighted Majority Algorithm Deterministic Weighted Majority: • Start with all experts having weight 1. • Predict based on a weighted majority vote: if the total weight of experts predicting 1 is at least the total weight predicting 0, then predict 1, else predict 0. • Penalize mistakes by cutting a wrong expert's weight in half. Randomized versions of this algorithm can provide surprisingly strong guarantees.
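A minimal sketch of the deterministic rule in Python (the function and variable names are illustrative, not from the lecture):

```python
def weighted_majority(expert_preds, outcomes):
    """Deterministic Weighted Majority: halve the weight of each
    expert that errs; predict with the weighted majority vote.

    expert_preds[t][i] is expert i's {0,1} prediction in round t;
    outcomes[t] is the true {0,1} outcome of round t.
    Returns the number of mistakes the algorithm makes.
    """
    n = len(expert_preds[0])
    weights = [1.0] * n          # start with all experts at weight 1
    mistakes = 0
    for preds, outcome in zip(expert_preds, outcomes):
        # Weighted vote: total weight behind "1" vs. behind "0".
        w1 = sum(w for w, p in zip(weights, preds) if p == 1)
        w0 = sum(weights) - w1
        prediction = 1 if w1 >= w0 else 0
        if prediction != outcome:
            mistakes += 1
        # Penalize every expert that was wrong by halving its weight.
        weights = [w / 2 if p != outcome else w
                   for w, p in zip(weights, preds)]
    return mistakes
```

The classic analysis shows this makes at most roughly 2.4·(OPT + log₂ n) mistakes, where OPT is the number of mistakes of the best expert in hindsight.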

  13. Weighted Majority Algorithm For the randomized version (follow a random expert drawn in proportion to the weights; multiply a wrong expert's weight by 1−ε): • E[# mistakes] ≤ (1+ε)·OPT + (1/ε)·log(n), where OPT is the number of mistakes of the best expert in hindsight. • If we set ε = (log(n)/OPT)^{1/2} to balance the two terms (or use guess-and-double), we get the bound • E[# mistakes] ≤ OPT + 2(OPT·log n)^{1/2} Note: of course we might not know OPT, but if running for T time steps, then since OPT ≤ T we can set ε = (log(n)/T)^{1/2} to get additive regret 2(T·log n)^{1/2}: • E[# mistakes] ≤ OPT + 2(T·log n)^{1/2} • So regret/T → 0 as T → ∞. [no-regret algorithm]
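A sketch of this randomized variant; the (1−ε) update and the choice ε = (log n / T)^{1/2} follow the slide, while the code names are illustrative:

```python
import math
import random

def randomized_weighted_majority(expert_preds, outcomes, eps=None):
    """Randomized Weighted Majority: follow an expert drawn with
    probability proportional to its weight; multiply an expert's
    weight by (1 - eps) each time it errs.
    """
    T, n = len(expert_preds), len(expert_preds[0])
    if eps is None:
        eps = math.sqrt(math.log(n) / T)   # balances the two regret terms
    weights = [1.0] * n
    mistakes = 0
    for preds, outcome in zip(expert_preds, outcomes):
        # Sample an expert i with probability weights[i] / sum(weights).
        i = random.choices(range(n), weights=weights)[0]
        if preds[i] != outcome:
            mistakes += 1
        weights = [w * (1 - eps) if p != outcome else w
                   for w, p in zip(weights, preds)]
    return mistakes
```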

  14. Many other useful extensions E.g., what if we have n options, not n predictors? • We're not combining n experts, we're choosing one. • Nice feature of RWM: it can be applied when the experts are n different options, e.g., n different ways to drive to work each day, or n different ways to invest our money. Other generalizations as well: other notions of no regret (e.g., no internal regret).

  15. Online Learning, Game Theory, and Minimax Optimality “Game Theory, On-line Prediction, and Boosting”, Freund & Schapire, GEB

  16. Zero-Sum Games A game is defined by a matrix M; assume wlog that entries are in [0,1]. Rock-paper-scissors loss matrix (rows = Mindy, columns = Max):

                Rock   Paper  Scissors
    Rock         1/2     1       0
    Paper         0     1/2      1
    Scissors      1      0      1/2

  The row player (Mindy) chooses row i; the column player (Max) chooses column j (simultaneously). Mindy's goal: minimize her loss M(i,j). Max's goal: maximize this loss (zero-sum).

  17. Randomized Play Mindy chooses a distribution P over rows; Max chooses a distribution Q over columns [simultaneously]. If i,j denote pure strategies and P,Q mixed strategies: • Mindy's expected loss: M(P,Q) = Σ_{i,j} P(i)·M(i,j)·Q(j) = P^T M Q • M(P,j): Mindy's expected loss when she plays P and Max plays pure strategy j • M(i,Q): Mindy's expected loss when she plays pure strategy i and Max plays Q
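For concreteness, a small sketch computing M(P,Q) for the rock-paper-scissors matrix above (numpy assumed; the particular P and Q are illustrative):

```python
import numpy as np

# Rock-paper-scissors loss matrix for the row player (Mindy).
M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])

P = np.array([1/3, 1/3, 1/3])   # Mindy's mixed strategy over rows
Q = np.array([1/2, 1/2, 0.0])   # Max's mixed strategy over columns

expected_loss = P @ M @ Q       # M(P,Q) = sum_{i,j} P[i] * M[i,j] * Q[j]
print(expected_loss)            # 0.5 -- uniform play is unexploitable here
```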

  18. Sequential Play Say Mindy plays before Max. If Mindy chooses P, then Max will pick Q to maximize M(P,Q), so the loss will be L(P) = max_Q M(P,Q). So Mindy should pick P to minimize L(P), and the loss will be min_P max_Q M(P,Q). Similarly, if Max plays first, the loss will be max_Q min_P M(P,Q).

  19. Minimax Theorem Playing second (with knowledge of the opponent's mixed strategy) cannot be worse than playing first: max_Q min_P M(P,Q) ≤ min_P max_Q M(P,Q). [Mindy plays second on the left-hand side, first on the right.] Von Neumann's minimax theorem: max_Q min_P M(P,Q) = min_P max_Q M(P,Q). No advantage to playing second!

  20. Optimal Play Von Neumann's minimax theorem: v = min_P max_Q M(P,Q) = max_Q min_P M(P,Q) is the value of the game. Optimal strategies: P* = argmin_P max_Q M(P,Q) is the min-max strategy; Q* = argmax_Q min_P M(P,Q) is the max-min strategy. We will show how to use WM to prove this, and also to find approximate min-max strategies quickly — see the sketch below.
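As a preview of that argument, a sketch of how no-regret play finds approximate min-max strategies: the row player runs multiplicative weights while the column player best-responds, and the time-averaged strategies approximate (P*, Q*). The function name and parameters are illustrative:

```python
import numpy as np

def approx_minmax(M, T=5000, eps=0.05):
    """Approximate the value and min-max strategies of a zero-sum game
    with loss matrix M (row player minimizes) via multiplicative weights.
    """
    n_rows, n_cols = M.shape
    w = np.ones(n_rows)
    P_sum = np.zeros(n_rows)
    Q_sum = np.zeros(n_cols)
    for _ in range(T):
        P = w / w.sum()
        j = np.argmax(P @ M)        # column player best-responds to P
        w *= (1 - eps) ** M[:, j]   # penalize rows in proportion to loss
        P_sum += P
        Q_sum[j] += 1
    P_avg, Q_avg = P_sum / T, Q_sum / T
    return P_avg, Q_avg, P_avg @ M @ Q_avg

M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
P, Q, v = approx_minmax(M)
print(np.round(P, 2), np.round(Q, 2), round(v, 2))  # strategies near uniform; v near 0.5
```

If the row player's average regret after T rounds is r, the averaged pair (P_avg, Q_avg) is an r-approximate equilibrium, which is the quick route to approximate min-max strategies mentioned above.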

  21. Optimal Play Von Neumann's minimax theorem: v = min_P max_Q M(P,Q) = max_Q min_P M(P,Q) is the value of the game. Optimal strategies: min-max strategy P*, max-min strategy Q*. (P*, Q*) is a Nash equilibrium (no player has an incentive to unilaterally deviate) — the central solution concept we will study.

  22. Games with many players with interesting structure "Potential Games", D. Monderer and L. S. Shapley, Games and Economic Behavior

  23. Fair cost-sharing Fair cost-sharing: n players in a weighted directed graph G. Player i wants to get from s_i to t_i, and players share the cost of the edges they use with one another. [Figure: example network G]

  24. Fair cost-sharing [Figure: source s and sink t joined by two parallel edges, one of cost 1 and one of cost n] • n players in a directed graph G; each edge e costs c_e. • Player i wants to get from s_i to t_i. • All players share the cost of the edges they use with others. • Each player wants to minimize his own cost. Good equilibrium: all use the edge of cost 1 (paying 1/n each). Bad equilibrium: all use the edge of cost n (paying 1 each).

  25. Inefficiency of equilibria: PoA and PoS Price of Anarchy (PoA): ratio of the cost of the worst Nash equilibrium to OPT. [Koutsoupias-Papadimitriou'99] Price of Stability (PoS): ratio of the cost of the best Nash equilibrium to OPT. [Anshelevich et al., 2004] E.g., for fair cost-sharing, the PoS is Θ(log n), whereas the PoA is n. Significant effort has been spent on understanding these in CS. "Algorithmic Game Theory", Nisan, Roughgarden, Tardos, Vazirani

  26. Congestion games • A nice general class of games with many players. • Always have a pure-strategy equilibrium. • Have a potential function s.t. whenever a player switches, the potential drops by exactly that player's improvement. • We will analyze dynamics in these games: what happens if players follow natural learning dynamics? A sketch of such a potential function follows below.
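A sketch of the standard (Rosenthal-style) potential Φ = Σ_e c_e·(1 + 1/2 + … + 1/n_e) for the fair cost-sharing example of slide 24, where n_e is the number of players on edge e; whenever a player best-responds, Φ drops by exactly that player's cost improvement, so best-response dynamics must reach a pure equilibrium. The two-edge instance and helper names are illustrative:

```python
from fractions import Fraction

def potential(edge_costs, loads):
    """Potential for fair cost-sharing:
    Phi = sum over edges e of c_e * (1 + 1/2 + ... + 1/n_e),
    where n_e = loads[e] is the number of players using edge e."""
    return sum(Fraction(c) * sum(Fraction(1, k) for k in range(1, load + 1))
               for c, load in zip(edge_costs, loads) if load > 0)

# Slide 24's game: n players choose between a cost-1 edge and a cost-n edge.
n = 4
costs = [1, n]
all_on_cheap = [n, 0]   # good equilibrium: each player pays 1/n
all_on_dear  = [0, n]   # bad equilibrium: each player pays 1
print(potential(costs, all_on_cheap))   # H_n  (= 1 + 1/2 + ... + 1/n)
print(potential(costs, all_on_dear))    # n * H_n
```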

  27. Learning in a distributional setting. [With feature information]

  28. Used all over CS and Science Image Classification Document Categorization Speech Recognition Protein Classification Spam Detection Branch Prediction Fraud Detection

  29. Example: Supervised Classification Decide which emails are spam and which are important. [Figure: example emails labeled "spam" / "not spam"] Goal: use emails seen so far to produce a good prediction rule for future data.

  30. Example: Supervised Classification Represent each message by features (e.g., keywords, spelling, etc.). [Figure: labeled examples shown as + and − points in feature space, with a linear separator] Reasonable rules: • Predict SPAM if unknown AND (money OR pills). • Predict SPAM if 2·money + 3·pills − 5·known > 0. The data are linearly separable.
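The second rule is a linear separator; a tiny sketch (the 0/1 feature encoding is an illustrative assumption):

```python
def predict_spam(features):
    """Linear rule from the slide: SPAM iff 2*money + 3*pills - 5*known > 0.
    `features` maps keyword names to 0/1 indicators."""
    score = (2 * features["money"]
             + 3 * features["pills"]
             - 5 * features["known"])
    return "SPAM" if score > 0 else "NOT SPAM"

print(predict_spam({"money": 1, "pills": 1, "known": 0}))  # SPAM
print(predict_spam({"money": 1, "pills": 0, "known": 1}))  # NOT SPAM
```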

  31. Two Main Aspects of Supervised Learning Algorithm design: how to optimize? Automatically generate rules that do well on observed data. Optimization has played a significant role in recent years. Confidence bounds, generalization guarantees, sample complexity: confidence that a rule will be effective on future data. Well understood for passive supervised learning.

  32. Standard Passive Supervised Learning • S = {(x, l)}: a set of labeled examples • X: the feature space • Examples are drawn i.i.d. from a distribution D over X and labeled by a target concept c*. • Do optimization over S, find a hypothesis h ∈ C. • Goal: h has small error over D: err(h) = Pr_{x ∼ D}[h(x) ≠ c*(x)] • c* ∈ C: realizable case; else agnostic.

  33. Standard Passive Supervised Learning Classic models: PAC (Valiant), SLT (Vapnik) • Sample complexity, finite hypothesis spaces, realizable case: m ≥ (1/ε)(ln|C| + ln(1/δ)) labeled examples suffice so that, with probability ≥ 1−δ, every h ∈ C consistent with the sample has err(h) ≤ ε. • In the non-realizable case, replace ε with ε².
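A small sketch plugging numbers into this bound (the helper name is illustrative; constants are ignored in the agnostic case, as on the slide):

```python
import math

def sample_complexity(num_hypotheses, eps, delta, realizable=True):
    """Labeled examples sufficient for the finite-class PAC bound:
    m >= (1/eps)(ln|C| + ln(1/delta)) in the realizable case;
    replace eps by eps^2 in the non-realizable (agnostic) case."""
    e = eps if realizable else eps ** 2
    return math.ceil((math.log(num_hypotheses) + math.log(1 / delta)) / e)

print(sample_complexity(2**20, eps=0.1, delta=0.05))                    # 169
print(sample_complexity(2**20, eps=0.1, delta=0.05, realizable=False))  # 1687
```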

  34. Standard Passive Supervised Learning Classic models: PAC (Valiant), SLT (Vapnik) • Sample complexity, finite hypothesis spaces, realizable case. • Such ideas/techniques are useful in auction design, learning submodular functions, etc.

  35. Boosting & game theory • Suppose I have an algorithm A that, for any distribution (weighting function) over a dataset S, can produce a rule h ∈ H that gets < 40% error. • Adaboost gives a way to use such an A to drive the error → 0 at a good rate, using weighted votes of the rules produced. • We can show that this is in principle possible by using the minimax theorem! A minimal sketch of Adaboost follows below.
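A minimal Adaboost sketch, assuming a weak learner weak_learn(examples, labels, dist) that returns a ±1-valued hypothesis whose weighted error is strictly between 0 and 1/2; all names here are illustrative:

```python
import math

def adaboost(examples, labels, weak_learn, rounds):
    """AdaBoost: reweight the data toward examples the current weak
    hypotheses get wrong; output a weighted majority vote of the
    weak hypotheses. Labels are in {-1, +1}."""
    m = len(examples)
    dist = [1.0 / m] * m
    hyps, alphas = [], []
    for _ in range(rounds):
        h = weak_learn(examples, labels, dist)
        err = sum(d for d, x, y in zip(dist, examples, labels) if h(x) != y)
        alpha = 0.5 * math.log((1 - err) / err)   # weak learning: 0 < err < 1/2
        # Upweight mistakes, downweight correct examples, renormalize.
        dist = [d * math.exp(-alpha * y * h(x))
                for d, x, y in zip(dist, examples, labels)]
        z = sum(dist)
        dist = [d / z for d in dist]
        hyps.append(h)
        alphas.append(alpha)
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hyps)) > 0 else -1
```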

  36. Supermarket Pricing Problem • A supermarket is trying to decide how to price its goods. Seller's goal: set prices to maximize revenue. • Simple case: customers make separate decisions on each item. • Harder case: customers buy everything or nothing based on the sum of the prices in their list. • Or it could be even more complex.

  37. Supermarket Pricing Problem Algorithmic: • The seller knows the market well. Incentive-compatible auction: • It must be in the customers' interest (a dominant strategy) to report truthfully. Online pricing: • Customers arrive one at a time and buy what they want at the current prices; the seller modifies prices over time. • Techniques from learning will be useful here.

  38. Submodular functions V = {1, 2, …, n}, f : 2^V → R. Submodularity: f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T) for all S, T ⊆ V. Equivalently, decreasing marginal values: f(S ∪ {x}) − f(S) ≥ f(T ∪ {x}) − f(T) for all S ⊆ T ⊆ V, x ∉ T. Examples: • Concave functions: let h : R → R be concave; for each S ⊆ V, let f(S) = h(|S|). • Vector spaces: let V = {v_1, …, v_n}, each v_i ∈ R^n; for each S ⊆ V, let f(S) = rank(V[S]). A brute-force check of the marginal-values condition follows below.
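A brute-force sketch verifying the decreasing-marginal-values condition on a small ground set, using the concave-of-cardinality example above (only practical for tiny V; names are illustrative):

```python
from itertools import chain, combinations

def is_submodular(f, ground):
    """Check f(S + x) - f(S) >= f(T + x) - f(T) for all S <= T <= V, x not in T."""
    subsets = list(chain.from_iterable(
        combinations(ground, r) for r in range(len(ground) + 1)))
    for T in map(frozenset, subsets):
        for S in map(frozenset, subsets):
            if not S <= T:
                continue
            for x in ground - T:
                if f(S | {x}) - f(S) < f(T | {x}) - f(T):
                    return False
    return True

# Concave-of-cardinality example: f(S) = h(|S|) with h concave (here sqrt).
f = lambda S: len(S) ** 0.5
print(is_submodular(f, frozenset(range(5))))  # True
```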

  39. Submodular functions • Strong connection between optimization and submodularity, e.g.: minimization [C'85, GLS'87, IFF'01, S'00, …], maximization [NWF'78, V'07, …] • Algorithmic game theory: submodular utility functions. • Much interest in the machine learning community recently: tutorials at major conferences (ICML, NIPS, etc.); www.submodularity.org is a machine learning site. • Interesting to understand their learnability.
