
The K-armed Dueling Bandits Problem


Presentation Transcript


  1. The K-armed Dueling Bandits Problem COLT 2009 Cornell University

  2. Multi-armed Bandits • K bandits (arms / actions / strategies) • Each time step, algorithm chooses bandit based on prior actions and feedback. • Only observe feedback from actions taken. • Partial Feedback Online Learning Problem

  3. Optimizing Retrieval Functions • Interactively learn the best retrieval function • Search users provide implicit feedback • E.g., what they click on • Seems like a natural bandit problem • Each retrieval function is a bandit • Feedback is clicks • Find the best w.r.t. clickthrough rate • Assumes clicks → explicit absolute feedback!

  4. What Results do Users View/Click? [Joachims et al., TOIS 2007]

  5. Team-Game Interleaving (u=thorsten, q=“svm”)
  f1(u,q) → r1:
    1. Kernel Machines http://svm.first.gmd.de/
    2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm_light/
    3. Support Vector Machine and Kernel ... References http://svm.research.bell-labs.com/SVMrefs.html
    4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/SVT/SVMsvt.html
    5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk
  f2(u,q) → r2:
    1. Kernel Machines http://svm.first.gmd.de/
    2. Support Vector Machine http://jbolivar.freeservers.com/
    3. An Introduction to Support Vector Machines http://www.support-vector.net/
    4. Archives of SUPPORT-VECTOR-MACHINES ... http://www.jiscmail.ac.uk/lists/SUPPORT...
    5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm_light/
  Interleaving(r1,r2):
    1. Kernel Machines (T2) http://svm.first.gmd.de/
    2. Support Vector Machine (T1) http://jbolivar.freeservers.com/
    3. SVM-Light Support Vector Machine (T2) http://ais.gmd.de/~thorsten/svm_light/
    4. An Introduction to Support Vector Machines (T1) http://www.support-vector.net/
    5. Support Vector Machine and Kernel ... References (T2) http://svm.research.bell-labs.com/SVMrefs.html
    6. Archives of SUPPORT-VECTOR-MACHINES ... (T1) http://www.jiscmail.ac.uk/lists/SUPPORT...
    7. Lucent Technologies: SVM demo applet (T2) http://svm.research.bell-labs.com/SVT/SVMsvt.html
  • Mix results of f1 and f2 • Relative feedback • More reliable • Interpretation: (r1 > r2) ↔ clicks(T1) > clicks(T2) [Radlinski, Kurup, Joachims; CIKM 2008]
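The interleaving step and its click-based interpretation can be made concrete with a small sketch. This is a minimal, team-draft-style simplification, not the exact CIKM 2008 procedure; the function names, the length cap, and the coin-flip tie-breaking are assumptions.

```python
import random

def team_draft_interleave(r1, r2, length=10):
    """Team-draft-style interleaving (simplified sketch).
    r1, r2: ranked lists of document ids from f1 and f2.
    Returns (interleaved_list, team_labels), where team_labels[k] is 1 or 2."""
    ranks = {1: list(r1), 2: list(r2)}
    interleaved, teams, used = [], [], set()
    count = {1: 0, 2: 0}
    while len(interleaved) < length and (ranks[1] or ranks[2]):
        # The team that has contributed fewer documents picks next; ties by coin flip.
        if count[1] != count[2]:
            team = 1 if count[1] < count[2] else 2
        else:
            team = random.choice([1, 2])
        if not ranks[team]:
            team = 2 if team == 1 else 1
        # Take that team's highest-ranked document that has not been shown yet.
        doc = ranks[team].pop(0)
        if doc in used:
            continue
        used.add(doc)
        interleaved.append(doc)
        teams.append(team)
        count[team] += 1
    return interleaved, teams

def duel_outcome(teams, clicked_slots):
    """Slide's interpretation: f1 wins the duel iff team 1 collects more clicks."""
    c1 = sum(1 for s in clicked_slots if teams[s] == 1)
    c2 = sum(1 for s in clicked_slots if teams[s] == 2)
    return 1 if c1 > c2 else (2 if c2 > c1 else 0)  # 0 means a tie
```

On the slide's example, clicks(T1) > clicks(T2) would then count as one noisy comparison won by f1.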

  6. Dueling Bandits Problem • Given K bandits b1, …, bK • Each iteration: compare (duel) two bandits • E.g., interleaving two retrieval functions • Comparison is noisy • Each comparison result independent • Comparison probabilities initially unknown • Comparison probabilities fixed over time • Total preference ordering, initially unknown

  7. Dueling Bandits Problem • Want to find best (or good) bandit • Similar to finding the max w/ noisy comparisons • Ours is a regret minimization setting • Choose pair (bt, bt’) to minimize regret: the fraction of users who would prefer the best bandit over the two bandits chosen at each time step (made concrete in the sketch below)
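A tiny simulator makes the setting concrete. Following the slides, a duel between bi and bj comes up in favor of bi with probability ½ + εij, and each time step the algorithm is charged ε(b*, bt) + ε(b*, bt’), the extent to which users would have preferred the best bandit over the two shown. The environment class, the preference matrix, and all names below are assumptions for illustration, not code from the paper.

```python
import random

class DuelingBanditsEnv:
    """Toy dueling-bandits environment (sketch, not from the paper).
    eps[i][j] = P(b_i beats b_j) - 1/2; arms are 0-indexed, b_0 is the best arm."""

    def __init__(self, eps):
        self.eps = eps
        self.regret = 0.0

    def duel(self, i, j):
        """One noisy comparison; charge regret relative to the best arm b_0."""
        self.regret += self.eps[0][i] + self.eps[0][j]
        return i if random.random() < 0.5 + self.eps[i][j] else j

# Assumed example: 4 arms with linearly spaced preference gaps.
K = 4
eps = [[0.1 * (j - i) for j in range(K)] for i in range(K)]   # eps[i][j] > 0 when b_i is better
env = DuelingBanditsEnv(eps)
winner = env.duel(1, 2)
print(winner, env.regret)   # regret so far = eps[0][1] + eps[0][2] = 0.1 + 0.2
```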

  8. Related Work • Existing MAB settings • [Lai & Robbins, ’85] [Auer et al., ’02] • PAC setting [Even-Dar et al., ’06] • Partial monitoring problem • [Cesa-Bianchi et al., ’06] • Computing with noisy comparisons • Finding max [Feige et al., ’97] • Binary search [Karp & Kleinberg, ’07] [Ben-Or & Hassidim, ’08] • Continuous-armed Dueling Bandits Problem • [Yue & Joachims, ’09]

  9. Assumptions • P(bi > bj) = ½ + εij (distinguishability) • Strong Stochastic Transitivity • For three bandits bi > bj > bk : εik ≥ max{εij , εjk} • Monotonicity property • Stochastic Triangle Inequality • For three bandits bi > bj > bk : εik ≤ εij + εjk • Diminishing returns property • Satisfied by many standard models • E.g., Logistic/Bradley-Terry
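As a sanity check, the logistic (Bradley-Terry) model mentioned on the slide can be verified numerically to satisfy both assumptions; the utility values below are made up purely for illustration.

```python
import math
from itertools import combinations

def p_win(mu_i, mu_j):
    """Logistic / Bradley-Terry comparison model: P(b_i beats b_j) = sigma(mu_i - mu_j)."""
    return 1.0 / (1.0 + math.exp(-(mu_i - mu_j)))

mu = [2.0, 1.3, 0.9, 0.2, -0.5]            # assumed utilities, sorted so b_0 > b_1 > ...
eps = lambda i, j: p_win(mu[i], mu[j]) - 0.5

for i, j, k in combinations(range(len(mu)), 3):   # i < j < k, i.e. b_i > b_j > b_k
    assert eps(i, k) >= max(eps(i, j), eps(j, k)) - 1e-12   # strong stochastic transitivity
    assert eps(i, k) <= eps(i, j) + eps(j, k) + 1e-12       # stochastic triangle inequality
print("logistic model satisfies both assumptions on this instance")
```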

  10. Naïve Approach • In deterministic case, O(K) comparisons to find max • Extend to noisy case: • Repeatedly compare until confident one is better • Problem: comparing two awful (but similar) bandits • Waste comparisons to see which awful bandit is better • Incur high regret for each comparison • Also applies to elimination tournaments
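A sketch of the comparison primitive the slide criticizes: keep duelling two bandits until a confidence interval around the empirical win rate excludes ½. The names and the confidence radius are assumptions; the point is that the number of comparisons, and hence the regret, blows up as the two bandits become similar.

```python
import math

def compare_until_confident(duel, i, j, delta=0.01, max_t=10**6):
    """Repeatedly duel b_i vs b_j until a Hoeffding-style interval excludes 1/2.
    duel(i, j) returns the index of the winner of one noisy comparison."""
    wins = 0
    for t in range(1, max_t + 1):
        wins += (duel(i, j) == i)
        mean = wins / t
        radius = math.sqrt(math.log(2 / delta) / (2 * t))   # standard Hoeffding radius
        if mean - radius > 0.5:
            return i, t    # confident b_i is better, after t comparisons
        if mean + radius < 0.5:
            return j, t
    return (i if wins >= max_t / 2 else j), max_t   # undecided within the budget
```

Chaining K such comparisons to find the max then spends most of its comparisons, and most of its regret, on near-ties between bad bandits.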

  11. Interleaved Filter • Choose candidate bandit at random

  12. Interleaved Filter • Choose candidate bandit at random • Make noisy comparisons (Bernoulli trial) against all other bandits simultaneously • Maintain mean and confidence interval for each pair of bandits being compared

  13. Interleaved Filter • Choose candidate bandit at random • Make noisy comparisons (Bernoulli trial) against all other bandits simultaneously • Maintain mean and confidence interval for each pair of bandits being compared • …until another bandit is better • With confidence 1 – δ

  14. Interleaved Filter • Choose candidate bandit at random • Make noisy comparisons (Bernoulli trial) against all other bandits simultaneously • Maintain mean and confidence interval for each pair of bandits being compared • …until another bandit is better • With confidence 1 – δ • Repeat process with new candidate • Remove all empirically worse bandits

  15. Interleaved Filter • Choose candidate bandit at random • Make noisy comparisons (Bernoulli trial) against all other bandits simultaneously • Maintain mean and confidence interval for each pair of bandits being compared • …until another bandit is better • With confidence 1 – δ • Repeat process with new candidate • Remove all empirically worse bandits • Continue until 1 candidate left
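A compact sketch of the procedure described on slides 11-15, written against the toy environment sketched after slide 7. The confidence-radius constant, the step budget, and the mid-round pruning of bandits the candidate confidently beats (included so the loop terminates once the best bandit becomes the candidate) are assumptions, not the paper's exact details.

```python
import math
import random

def interleaved_filter(env, K, T):
    """Sketch of Interleaved Filter (slides 11-15); constants are assumptions.
    env.duel(i, j) returns the index of the winner of one noisy comparison."""
    delta = 1.0 / (K * K * T)                          # confidence parameter (slide 17)
    remaining = list(range(K))
    candidate = remaining.pop(random.randrange(K))     # choose candidate bandit at random
    steps = 0
    while remaining and steps < T:
        # One round: duel the candidate against all remaining bandits "simultaneously".
        wins = {b: 0 for b in remaining}               # wins of b over the current candidate
        n = 0
        challenger = None
        while challenger is None and remaining and steps < T:
            n += 1
            for b in list(remaining):
                if env.duel(candidate, b) == b:
                    wins[b] += 1
                steps += 1
            radius = math.sqrt(math.log(1.0 / delta) / (2.0 * n))   # Hoeffding-style interval
            # Prune bandits the candidate already beats with confidence 1 - delta.
            remaining = [b for b in remaining if wins[b] / n + radius >= 0.5]
            # End the round once some bandit beats the candidate with confidence 1 - delta.
            for b in remaining:
                if wins[b] / n - radius > 0.5:
                    challenger = b
                    break
        if challenger is None:
            break                                      # no confident winner: keep the candidate
        # New round: promote the challenger, drop it and every empirically worse bandit.
        remaining = [b for b in remaining if b != challenger and wins[b] / n >= 0.5]
        candidate = challenger
    return candidate                                   # with high probability the best bandit
```

Run against the DuelingBanditsEnv sketch above, interleaved_filter(env, K, T) returns the surviving candidate while env.regret accumulates the quantity analysed on the following slides.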

  16. Regret Analysis • Define a round to be all the time steps for a particular candidate bandit • Halts round when 1 - δ confidence met • Treat candidate bandits as random walk • Takes log K rounds to reach best bandit • Define a match to be all the comparisons between two bandits in a round • O(K) total matches in expectation • Constant fraction of bandits removed each round

  17. Regret Analysis • O(K) total matches • Each match incurs O((1/ε1,2) · log(1/δ)) regret • Depends on δ = 1/(K²T) • Finds best bandit w.p. 1 − 1/T • Expected regret: O((K/ε1,2) · log T)
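Putting the pieces of slides 16-17 together, as a back-of-the-envelope restatement rather than the paper's exact constants:

```latex
% Sketch of the expected-regret calculation (constants omitted, assuming K <= T):
\delta = \frac{1}{K^{2}T}
  \;\Rightarrow\;
  \text{regret per match} = O\!\left(\tfrac{1}{\epsilon_{1,2}}\log\tfrac{1}{\delta}\right)
                          = O\!\left(\tfrac{1}{\epsilon_{1,2}}\log T\right),
\qquad
\mathbb{E}[\#\text{matches}] = O(K)
  \;\Rightarrow\;
  \mathbb{E}[R_T] = O\!\left(\tfrac{K}{\epsilon_{1,2}}\log T\right).
```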

  18. Lower Bound Example • Order bandits b1 > b2 > … > bK • P(b > b’) = ½ + ε • Each match takes Ω((1/ε²) · log T) comparisons • Pay Θ(ε) regret for each comparison • Accumulated regret over all matches is at least Ω((K/ε) · log T)
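The same arithmetic, run in the other direction for the lower-bound instance on this slide:

```latex
% Lower-bound arithmetic for the instance on slide 18:
\underbrace{\Omega(K)}_{\text{matches}}
\;\times\;
\underbrace{\Omega\!\left(\tfrac{1}{\epsilon^{2}}\log T\right)}_{\text{comparisons per match}}
\;\times\;
\underbrace{\Theta(\epsilon)}_{\text{regret per comparison}}
\;=\;
\Omega\!\left(\tfrac{K}{\epsilon}\log T\right).
```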

  19. Moving Forward • Extensions • Drifting user interests • Changing user population / document collection • Contextualization / side information • Cost-sensitive active learning w/ relative feedback • Live user studies • Interactive experiments on search service • Sheds insight; guides future design

  20. Extra Slides

  21. Per-Match Regret • Number of comparisons in match bi vs bj : O((1/max{ε1i, εij}²) · log T) • ε1i > εij : round ends before concluding bi > bj • ε1i < εij : conclude bi > bj before round ends, remove bj • Pay ε1i + ε1j regret for each comparison • By triangle inequality ε1i + ε1j ≤ 2·max{ε1i , εij} • Thus by stochastic transitivity accumulated regret per match is O((1/ε1,2) · log T)
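The chain of inequalities behind this slide, spelled out (the constants are not the paper's exact ones):

```latex
% Per-match regret chain (slide 21), writing m_{ij} = \max\{\epsilon_{1i}, \epsilon_{ij}\}:
\text{comparisons} = O\!\left(\tfrac{\log T}{m_{ij}^{2}}\right),
\quad
\text{regret per comparison} \le 2\,m_{ij}
\;\Rightarrow\;
\text{match regret} = O\!\left(\tfrac{\log T}{m_{ij}}\right)
                    = O\!\left(\tfrac{\log T}{\epsilon_{1,2}}\right),
\quad
\text{since } m_{ij} \ge \epsilon_{1i} \ge \epsilon_{1,2}
\text{ by strong stochastic transitivity.}
```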

  22. Number of Rounds • Assume all superior bandits have equal prob of defeating candidate • Worst case scenario under transitivity • Model this as a random walk • rj transitions to each ri (i < j) with equal probability • Compute total number of steps before reaching r1 (i.e., r*) • Can show O(log K) w.h.p. using Chernoff bound
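A quick simulation of the random-walk model on this slide: the candidate index jumps to a uniformly random better index until it reaches the best, and the O(log K) behaviour shows up as roughly ln K steps on average. Purely illustrative; the names are made up.

```python
import random

def rounds_until_best(K):
    """Random walk from slide 22: from index j, jump to a uniformly random better index."""
    j, steps = K, 0            # start from the worst candidate (index K, best is index 1)
    while j > 1:
        j = random.randint(1, j - 1)
        steps += 1
    return steps

K = 1024
trials = [rounds_until_best(K) for _ in range(10_000)]
print(sum(trials) / len(trials))   # about 7.5 here; grows like ln K (ln 1024 is about 6.9)
```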

  23. Total Matches Played • O(K) matches played in each round • Naïve analysis yields O(K log K) total • However, all empirically worse bandits are also removed at the end of each round • Will not participate in future rounds • Assume worst case that inferior bandits have ½ chance of being empirically worse • Can show w.h.p. that O(K) total matches are played over O(log K) rounds
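An in-expectation version of the argument (the slide's w.h.p. statement adds a Chernoff bound on top of this):

```latex
% If every non-candidate bandit is removed with probability at least 1/2 after each round,
% the expected number of matches N_r in round r is at most K / 2^r, so
\mathbb{E}\Big[\textstyle\sum_{r \ge 0} N_r\Big] \;\le\; \sum_{r \ge 0} \frac{K}{2^{r}} \;=\; 2K \;=\; O(K).
```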

  24. Removing Inferior Bandits • At conclusion of each round • Remove any empirically worse bandits • Intuition: • High confidence that winner is better than incumbent candidate • Empirically worse bandits cannot be “much better” than incumbent candidate • Can show via Hoeffding bound that winner is also better than empirically worse bandits with high confidence • Preserves 1-1/T confidence overall that we’ll find the best bandit
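The Hoeffding bound the slide relies on, with one standard choice of confidence radius (the paper's exact constant may differ):

```latex
% Hoeffding bound for the empirical win rate \hat{P}_n of one pair after n comparisons,
% with one standard choice of confidence radius c_n:
\Pr\big(|\hat{P}_n - P| \ge c_n\big) \le 2e^{-2nc_n^{2}},
\qquad
c_n = \sqrt{\tfrac{1}{2n}\log\tfrac{1}{\delta}}
\;\Rightarrow\;
\Pr\big(|\hat{P}_n - P| \ge c_n\big) \le 2\delta.
```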
