## Machine Learning for Online Decision Making + applications to Online Pricing


Your guide: Avrim Blum, Carnegie Mellon University

## Plan for Today

- An interesting algorithm for online decision making: the problem of "combining expert advice."
- The same problem with very limited feedback: the "multi-armed bandit problem."
- An application to online pricing.

## Using "expert" advice

Say we want to predict the stock market.

- We solicit n "experts" for their advice: will the market go up or down?
- We then want to use their advice somehow to make our prediction.

Basic question: is there a strategy that allows us to do nearly as well as the best of these experts in hindsight? ["Expert" here just means someone with an opinion, not necessarily someone who knows anything.]

## Simpler question

- We have n "experts," one of whom is perfect (never makes a mistake). We just don't know which one.
- Can we find a strategy that makes no more than lg(n) mistakes?
- Answer: sure. Just take the majority vote over all experts that have been correct so far.
- Each mistake cuts the number of surviving experts by at least a factor of 2, so we make at most lg(n) mistakes.
- Note: this means it is fine for n to be very large.

## What if no expert is perfect?

One idea: run the protocol above until all experts are crossed off, then repeat. This makes at most lg(n) mistakes per mistake of the best expert (plus an initial lg(n)). Can we do better?

Intuition: making a mistake doesn't completely disqualify an expert. So instead of crossing it off, just lower its weight.

Weighted Majority algorithm:

- Start with all experts having weight 1.
- Predict based on the weighted majority vote.
- Penalize mistakes by cutting the offending experts' weights in half.

## Analysis: do nearly as well as the best expert in hindsight

- M = number of mistakes we've made so far.
- m = number of mistakes the best expert has made so far.
- W = total weight (starts at n).
- After each of our mistakes, W drops by at least 25% (at least half the weight was on the wrong side and got halved). So after M mistakes, W ≤ n(3/4)^M.
- The weight of the best expert is (1/2)^m.
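The deterministic Weighted Majority algorithm above can be sketched as follows. This is a minimal illustration, not code from the lecture: the function name, the 0/1 encoding of predictions, and the tie-breaking rule are my own choices.

```python
def weighted_majority(predictions, outcomes):
    """Deterministic Weighted Majority: predict by weighted vote and
    halve the weight of every expert that errs.

    predictions[t][i] is expert i's 0/1 prediction at time t;
    outcomes[t] is the true 0/1 label at time t.
    Returns (our mistake count, the best expert's mistake count).
    """
    n = len(predictions[0])
    w = [1.0] * n                 # all experts start with weight 1
    my_mistakes = 0
    expert_mistakes = [0] * n
    for preds, y in zip(predictions, outcomes):
        vote_1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        vote_0 = sum(wi for wi, p in zip(w, preds) if p == 0)
        guess = 1 if vote_1 >= vote_0 else 0   # weighted majority vote
        if guess != y:
            my_mistakes += 1
        for i, p in enumerate(preds):
            if p != y:
                w[i] *= 0.5       # penalize a mistake: cut weight in half
                expert_mistakes[i] += 1
    return my_mistakes, min(expert_mistakes)
```

With a perfect expert in the pool this recovers the "simpler question" guarantee as a special case, since that expert's weight never shrinks.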
Combining these, (1/2)^m ≤ n(3/4)^M, which solves to M ≤ 2.4(m + lg n). So if m is small, then M is pretty small too.

## Randomized Weighted Majority

The bound M ≤ 2.4(m + lg n) is not so good if the best expert makes a mistake 20% of the time. Can we do better? Yes.

- Instead of taking a majority vote, use the weights as probabilities (e.g., with 70% of the weight on "up" and 30% on "down," predict up with probability 0.7). Idea: smooth out the worst case.
- Also, generalize the penalty factor 1/2 to (1 − ε).
- Now let M = expected number of mistakes.

## Analysis

- Say at time t we have fraction F_t of the weight on experts that make a mistake.
- So we have probability F_t of making a mistake, and we remove an εF_t fraction of the total weight.
- W_final = n(1 − εF_1)(1 − εF_2)...
- ln(W_final) = ln(n) + Σ_t ln(1 − εF_t) ≤ ln(n) − ε Σ_t F_t (using ln(1 − x) ≤ −x). Since Σ_t F_t = E[# mistakes] = M, this gives ln(W_final) ≤ ln(n) − εM.
- If the best expert makes m mistakes, then ln(W_final) ≥ ln((1 − ε)^m) = m ln(1 − ε).
- Now solve: ln(n) − εM ≥ m ln(1 − ε).

## Additive regret

- Solving the above gives M ≤ OPT + εOPT + (1/ε) log(n), where OPT = m is the best expert's mistake count.
- Say we know we will play for T time steps. Then we can set ε = (log(n)/T)^{1/2} and get M ≤ OPT + 2(T log(n))^{1/2}.
- If we don't know T in advance, we can guess and double.
- These are called "additive regret" bounds.

## Extensions

- What if the experts are actions (rows in a matrix game, choices of deterministic algorithm to run, ...)?
- At each time t, each action has a loss (cost) in {0,1}.
- We can still run the algorithm: rather than viewing it as "pick a prediction with probability proportional to its weight," view it as "pick an expert with probability proportional to its weight."
- The same analysis applies.

## Extensions: losses in [0,1]

- What if losses (costs) are in [0,1]? Here is a simple way to extend the results.
- Given cost vector c, view c_i as the bias of a coin. Flip the coins to create a Boolean vector c′ such that E[c′_i] = c_i, and feed c′ to algorithm A. (Write cost′ for cost measured on the c′ vectors.)
- For any sequence of vectors c′, we have E_A[cost′(A)] ≤ min_i cost′(i) + [regret term].
- Taking expectation over the coin flips: E_flips[E_A[cost′(A)]] ≤ E_flips[min_i cost′(i)] + [regret term].
- The left-hand side is E_A[cost(A)]. The right-hand side is ≤ min_i E_flips[cost′(i)] + [r.t.] = min_i cost(i) + [r.t.].
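Randomized Weighted Majority with the (1 − ε·loss) multiplicative update handles losses in [0,1] directly. A minimal sketch, assuming we track the algorithm's expected loss in closed form rather than sampling an expert each round (the function name is illustrative):

```python
def rwm(loss_vectors, eps):
    """Randomized Weighted Majority: pick expert i with probability
    proportional to w_i, then multiply w_i by (1 - eps * loss_i).
    Losses may be anywhere in [0, 1].

    loss_vectors[t][i] is expert i's loss at time t.
    Returns (algorithm's total expected loss, best expert's total loss).
    """
    n = len(loss_vectors[0])
    w = [1.0] * n
    alg_loss = 0.0
    totals = [0.0] * n
    for losses in loss_vectors:
        W = sum(w)
        # Expected loss this round = sum_i (w_i / W) * loss_i.
        alg_loss += sum(wi * li for wi, li in zip(w, losses)) / W
        for i, li in enumerate(losses):
            w[i] *= (1 - eps * li)   # penalize in proportion to loss
            totals[i] += li
    return alg_loss, min(totals)
```

Per the analysis above, the returned `alg_loss` satisfies the additive-regret bound: at most (1 + ε)·OPT + (1/ε)·ln(n).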
In other words, costs between 0 and 1 just make the problem easier.

## Online pricing

Say you are selling lemonade (or a cool new software tool, or bottles of water at the world expo).

Protocol #1: for t = 1, 2, ..., T:

- Seller sets price p_t.
- Buyer arrives with valuation v_t.
- If v_t ≥ p_t, the buyer purchases and pays p_t; otherwise, the buyer doesn't.
- v_t is revealed to the algorithm.

Protocol #2: same as Protocol #1, but v_t is not revealed.

Assume all valuations are in [1, h]. Goal: do nearly as well as the best fixed price in hindsight.

## Online pricing: a bad algorithm

Under Protocol #1, consider the algorithm "post the best price seen in the past." What if the sequence of buyer valuations is 1, h, 1, ..., 1, h, 1, ..., 1, h, ...? The algorithm makes T/h while OPT makes T: a factor of h worse!

## Online pricing: a good algorithm

Randomized Weighted Majority!

- Define one expert for each price p ∈ [1, h], so #experts = h.
- The best price of this form gives profit OPT.
- Run the RWM algorithm. We get expected gain at least OPT/(1 + ε) − O(ε^{-1} h log h). [The extra factor of h comes from the range of the gains.]

## Online pricing: limited feedback

What about Protocol #2, where we just see the accept/reject decision? Now we can't run RWM directly, since we don't know how to penalize the experts! This is called the "adversarial multi-armed bandit problem." How can we solve that?
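The Protocol #1 reduction (one expert per integer price, full feedback because v_t is revealed) can be sketched as follows. This is an illustration under my own choices: gains are scaled by h into [0,1] and the weights use a (1 + ε)-power update, which is one standard way to run an RWM-style algorithm on gains; the function name is hypothetical.

```python
def price_experts_rwm(valuations, h, eps):
    """Protocol #1 pricing via exponential weights: one expert per
    integer price in {1, ..., h}. Expert p's gain on a buyer with
    valuation v is p if v >= p, else 0. Since v_t is revealed, every
    expert's gain is known each round (full feedback).
    Returns (algorithm's expected gain, best fixed price's gain).
    """
    prices = list(range(1, h + 1))
    w = [1.0] * h
    alg_gain = 0.0
    totals = [0.0] * h
    for v in valuations:
        W = sum(w)
        gains = [p if v >= p else 0 for p in prices]
        # Expected gain if we post price p with probability w_p / W.
        alg_gain += sum(wi * g for wi, g in zip(w, gains)) / W
        for i, g in enumerate(gains):
            w[i] *= (1 + eps) ** (g / h)   # multiplicative update on gain scaled to [0,1]
            totals[i] += g
    return alg_gain, max(totals)
```

The gain scaling by h is exactly where the extra factor of h in the regret term O(ε^{-1} h log h) comes from.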
## Multi-armed bandit problem

Exp3 (Exponential weights for Exploration and Exploitation) [Auer, Cesa-Bianchi, Freund, Schapire] runs RWM with n experts as a subroutine:

- RWM maintains a distribution p_t over the experts.
- Exp3 mixes in uniform exploration, q_t = (1 − γ)p_t + γ·(uniform), and picks expert i ~ q_t.
- Only the chosen expert's gain g_{it} is observed. Exp3 feeds RWM the estimated gain vector ĝ_t = (0, ..., 0, g_{it}/q_{it}, 0, ..., 0), whose entries are at most nh/γ.

Analysis:

1. RWM believes its gain is p_t · ĝ_t = p_{it}(g_{it}/q_{it}) ≡ g_t^RWM.
2. So Σ_t g_t^RWM ≥ (max_j Σ_t ĝ_{jt})/(1 + ε) − O(ε^{-1}(nh/γ) log n).
3. Exp3's actual gain is g_{it} = g_t^RWM (q_{it}/p_{it}) ≥ g_t^RWM (1 − γ).
4. Because E[ĝ_{jt}] = (1 − q_{jt})·0 + q_{jt}(g_{jt}/q_{jt}) = g_{jt}, we have E[max_j Σ_t ĝ_{jt}] ≥ max_j E[Σ_t ĝ_{jt}] = OPT.

Conclusion (γ = ε): E[Exp3] ≥ OPT/(1 + ε)² − O(ε^{-2} nh log n).

Quick improvement: choose expert i to be the price (1 + ε)^i. This gives n = log_{1+ε}(h) and only hurts OPT by at most a (1 + ε) factor. We can even reduce the ε^{-2} to ε^{-1} with more care in the analysis.

Conclusion (γ = ε and n = log_{1+ε}(h)): E[Exp3] ≥ OPT/(1 + ε)³ − O(ε^{-2} h log(h) loglog(h)).

Almost as good as Protocol #1!

## Summary

Algorithms for online decision-making with strong guarantees on performance compared to the best fixed choice.

- Application: play a repeated game against an adversary, performing nearly as well as the best fixed strategy in hindsight.
- These algorithms apply even with very limited feedback.
- Application: online pricing, even if we only get buy/no-buy feedback.
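The Exp3 scheme for the bandit setting can be sketched as follows. A minimal illustration, not the lecture's exact constants: the inner exponential update exp(η·ĝ) and the parameter names `eta`/`gamma` follow the standard Exp3 presentation, and `reward_fn` is a hypothetical callback returning the chosen arm's gain in [0, 1].

```python
import math
import random

def exp3(reward_fn, n, T, gamma, eta, rng=None):
    """Exp3 sketch: exponential weights on importance-weighted gain
    estimates. Each round, mix the weight distribution p_t with the
    uniform distribution (exploration rate gamma), pull one arm
    i ~ q_t, observe only that arm's gain, and update with the
    unbiased estimate g_i / q_i in coordinate i (0 elsewhere).
    Returns the algorithm's realized total gain.
    """
    rng = rng or random.Random(0)
    w = [1.0] * n
    total = 0.0
    for t in range(T):
        W = sum(w)
        p = [wi / W for wi in w]
        q = [(1 - gamma) * pi + gamma / n for pi in p]   # mix in exploration
        i = rng.choices(range(n), weights=q)[0]          # pull one arm
        g = reward_fn(i, t)                              # only this gain is seen
        total += g
        ghat = g / q[i]                                  # unbiased gain estimate
        w[i] *= math.exp(eta * ghat)                     # exponential-weights update
    return total
```

Because q_i ≥ γ/n, the estimate ĝ_i = g_i/q_i stays bounded by n/γ times the gain range, matching the nh/γ bound in the analysis above.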