
Multi-armed Bandit Problems with Dependent Arms

Sandeep Pandey ([email protected])

Deepayan Chakrabarti ([email protected])

Deepak Agarwal ([email protected])


Background: Bandits

[Figure: bandit “arms” with unknown reward probabilities μ1, μ2, μ3]

  • Pull arms sequentially so as to maximize the total expected reward

  • Show ads on a webpage to maximize clicks

  • Recommend products to maximize sales


Dependent Arms

  • Reward probabilities μi are generally assumed to be independent of each other

  • What if they are dependent?

    • E.g., ads on similar topics, using similar text/phrases, should have similar rewards

[Figure: example ads and their reward probabilities: “Skiing, snowboarding” (μ1 = 0.3), “Skiing, snowshoes” (μ2 = 0.28), “Get Vonage!” (μ3 = 10⁻⁶), “Snowshoe rental” (μ4 = 0.31)]


Dependent Arms (cont.)

  • A click on one ad → other “similar” ads may generate clicks as well

  • Can we increase total reward using this dependency?


Cluster Model of Dependence

[Figure: arms 1 and 2 grouped into Cluster 1, arms 3 and 4 into Cluster 2]

  • μi ~ f(π[i]), where π[i] is the cluster-specific parameter of arm i's cluster (unknown) and f is some known distribution

  • Successes si ~ Bin(ni, μi), where ni = # pulls of arm i
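To make the generative model concrete, here is a minimal simulation sketch. It assumes f is a Beta distribution whose mean is the cluster parameter; the choice of Beta, its concentration, and all numbers below are illustrative assumptions, not part of the original model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cluster_model(cluster_params, arms_per_cluster, pulls_per_arm):
    """Simulate the cluster model: mu_i ~ f(pi[i]), s_i ~ Bin(n_i, mu_i).

    f is taken here to be a Beta distribution with mean pi (an illustrative
    choice; the model only requires that f be known)."""
    arms = []
    for pi in cluster_params:                         # unknown cluster-specific parameter
        for _ in range(arms_per_cluster):
            mu_i = rng.beta(10 * pi, 10 * (1 - pi))   # reward probability of arm i
            n_i = pulls_per_arm                       # number of pulls of arm i
            s_i = rng.binomial(n_i, mu_i)             # observed successes on arm i
            arms.append((mu_i, n_i, s_i))
    return arms

# Two hypothetical clusters, e.g. "ski ads" vs. an unrelated topic
print(sample_cluster_model(cluster_params=[0.3, 0.05], arms_per_cluster=2, pulls_per_arm=100))
```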


Cluster Model of Dependence (cont.)

[Figure: arms 1 and 2 with μi ~ f(π1), arms 3 and 4 with μi ~ f(π2)]

  • Total reward:

    • Discounted: ∑_{t=0..∞} α^t · E[R(t)], where α is the discounting factor

    • Undiscounted: ∑_{t=0..T} E[R(t)]
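A small sketch of how the two objectives aggregate per-step expected rewards; the reward sequence and discount factor below are arbitrary illustrative values.

```python
def discounted_reward(expected_rewards, alpha):
    """Sum over t >= 0 of alpha^t * E[R(t)]."""
    return sum((alpha ** t) * r for t, r in enumerate(expected_rewards))

def undiscounted_reward(expected_rewards):
    """Sum over t = 0..T of E[R(t)] for a finite horizon T."""
    return sum(expected_rewards)

rewards = [0.3, 0.28, 0.31, 0.3]          # illustrative E[R(t)] values
print(discounted_reward(rewards, alpha=0.9), undiscounted_reward(rewards))
```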


Discounted Reward

The optimal policy can be computed using per-cluster MDPs only.

  • Optimal Policy:

    • Compute an (“index”, arm) pair for each cluster

    • Pick the cluster with the largest index, and pull the corresponding arm

[Figure: MDP for cluster 1: belief state (x1, x2) over arms 1 and 2; pulling arm 1 leads to updated states (x'1, x'2) or (x"1, x"2). MDP for cluster 2: belief state (x3, x4) over arms 3 and 4; pulling arm 3 leads to (x'3, x'4) or (x"3, x"4).]


Discounted Reward (cont.)

  • Reduces the problem to smaller state spaces

  • Reduces to Gittins' Theorem [1979] for independent bandits

  • Approximation bounds on the index for k-step lookahead (a lookahead sketch follows below)
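The slides do not spell out how the per-cluster index is computed; the following is a minimal k-step lookahead sketch for a single cluster. It assumes independent Beta-Bernoulli beliefs per arm within the cluster (so it ignores the within-cluster coupling that the full model would propagate to sibling arms) and an assumed discount factor; it illustrates the lookahead idea rather than the paper's exact index.

```python
from functools import lru_cache

ALPHA = 0.9  # discount factor (assumed value)

@lru_cache(maxsize=None)
def lookahead_index(beliefs, k):
    """k-step lookahead value of one cluster.

    beliefs: tuple of (a, b) Beta pseudo-counts, one pair per arm in the cluster.
    Returns the best expected discounted reward obtainable over the next k pulls;
    the maximizing arm at the root is the arm paired with this index."""
    if k == 0:
        return 0.0
    best = 0.0
    for i, (a, b) in enumerate(beliefs):
        p = a / (a + b)                                    # posterior mean of arm i
        succ = beliefs[:i] + ((a + 1, b),) + beliefs[i + 1:]
        fail = beliefs[:i] + ((a, b + 1),) + beliefs[i + 1:]
        value = (p * (1.0 + ALPHA * lookahead_index(succ, k - 1))
                 + (1.0 - p) * ALPHA * lookahead_index(fail, k - 1))
        best = max(best, value)
    return best

# Index of a 2-arm cluster whose first arm has one observed success
print(lookahead_index(((2, 1), (1, 1)), k=3))
```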




Undiscounted Reward

[Figure: arms 1 and 2 grouped into “cluster arm” 1, arms 3 and 4 into “cluster arm” 2]

All arms in a cluster are similar → they can be grouped into one hypothetical “cluster arm”


Undiscounted Reward (cont.)

  • Two-Level Policy. In each iteration:

    • Pick a “cluster arm” using a traditional bandit policy

    • Pick an arm within that cluster using a traditional bandit policy

Each “cluster arm” must have some estimated reward probability (see the sketch after this slide).
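The slides leave the “traditional bandit policy” unspecified; the sketch below uses UCB1 at both levels and the MEAN-style pooled success rate as the cluster-arm estimate. Those choices, and all numbers, are assumptions for illustration.

```python
import math
import random

class UCB1:
    """Standard UCB1 over a fixed set of choices."""
    def __init__(self, n_choices):
        self.pulls = [0] * n_choices
        self.successes = [0] * n_choices

    def select(self):
        total = sum(self.pulls)
        for i, n in enumerate(self.pulls):
            if n == 0:
                return i                                   # play each choice once first
        return max(range(len(self.pulls)),
                   key=lambda i: self.successes[i] / self.pulls[i]
                                 + math.sqrt(2 * math.log(total) / self.pulls[i]))

    def update(self, i, reward):
        self.pulls[i] += 1
        self.successes[i] += reward

def two_level_policy(clusters, horizon):
    """clusters: list of lists of true reward probabilities (one list per cluster)."""
    top = UCB1(len(clusters))                              # bandit over "cluster arms"
    within = [UCB1(len(c)) for c in clusters]              # one bandit per cluster
    total = 0
    for _ in range(horizon):
        c = top.select()                                   # pick a "cluster arm"
        a = within[c].select()                             # pick an arm inside it
        reward = 1 if random.random() < clusters[c][a] else 0
        within[c].update(a, reward)
        top.update(c, reward)                              # MEAN-style cluster estimate
        total += reward
    return total

print(two_level_policy([[0.30, 0.28], [1e-6, 0.31]], horizon=5000))
```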



Issues

  • What is the reward probability of a “cluster arm”?

  • How do cluster characteristics affect performance?



Reward probability of a “cluster arm”

  • What is the reward probability r of a “cluster arm”?

  • MEAN: r = ∑si / ∑ni, i.e., average success rate, summing over all arms in the cluster [Kocsis+/2006, Pandey+/2007]

    • Initially, r = μavg = average μ of arms in cluster

    • Finally, r = μmax = max μ among arms in cluster

    • “Drift” in the reward probability of the “cluster arm”


Reward probability drift causes problems

  • Drift → non-optimal clusters might temporarily look better → the optimal arm is explored only O(log T) times

[Figure: Cluster 1 (the opt cluster) contains the best (optimal) arm, with reward probability μopt; Cluster 2 holds the remaining arms]


Reward probability of a “cluster arm”

  • What is the reward probability r of a “cluster arm”?

  • MEAN: r = ∑si / ∑ni

  • MAX: r = max( E[μi] ), over all arms i in the cluster

  • PMAX: r = E[ max(μi) ], over all arms i in the cluster

  • Both MAX and PMAX aim to estimate μmax and thus reduce drift


Reward probability of a “cluster arm” (cont.)

  • MEAN: r = ∑si / ∑ni

  • MAX: r = max( E[μi] )

  • PMAX: r = E[ max(μi) ]

  • Both MAX and PMAX aim to estimate μmax and thus reduce drift, but they trade off differently:

    • MAX: high bias in estimating μmax, low variance

    • PMAX: unbiased estimate of μmax, high variance

(An estimator sketch follows below.)
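A hedged sketch of the three cluster-arm estimates, assuming a Beta(1, 1) prior per arm so that E[μi] has a closed form, and approximating E[max μi] by Monte Carlo sampling from the per-arm posteriors; the paper's exact estimators may differ in these details.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_estimate(successes, pulls):
    """MEAN: r = sum(s_i) / sum(n_i), the pooled success rate of the cluster."""
    return sum(successes) / sum(pulls)

def max_estimate(successes, pulls):
    """MAX: r = max_i E[mu_i], with E[mu_i] the posterior mean under a Beta(1, 1) prior."""
    return max((s + 1) / (n + 2) for s, n in zip(successes, pulls))

def pmax_estimate(successes, pulls, n_samples=10_000):
    """PMAX: r = E[max_i mu_i], approximated by sampling each mu_i from its Beta posterior."""
    draws = np.stack([rng.beta(s + 1, n - s + 1, size=n_samples)
                      for s, n in zip(successes, pulls)])
    return draws.max(axis=0).mean()

s, n = [30, 5], [100, 20]                 # illustrative per-arm successes and pulls
print(mean_estimate(s, n), max_estimate(s, n), pmax_estimate(s, n))
```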



Comparison of schemes

10 clusters, 11.3 arms/cluster on average

MAX performs best





Effects of cluster characteristics

  • We analytically study the effects of cluster characteristics on the “crossover-time”

    • Crossover-time: Time when the expected reward probability of the optimal cluster becomes highest among all “cluster arms”


Effects of cluster characteristics (cont.)

  • Crossover-time Tc for MEAN depends on:

    • Cluster separation Δ = μopt – (max μ outside the opt cluster): Δ increases → Tc decreases

    • Cluster size Aopt: Aopt increases → Tc increases

    • Cohesiveness of the opt cluster, 1 – avg(μopt – μi): cohesiveness increases → Tc decreases

(The helper sketch below spells out these three quantities.)
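For concreteness, a small helper that computes the three quantities exactly as defined above from the true per-arm reward probabilities; the function name and input layout are illustrative assumptions.

```python
def cluster_characteristics(clusters, opt_cluster):
    """clusters: list of lists of true reward probabilities mu_i, one list per cluster.
    Returns (separation Delta, size A_opt, cohesiveness) for the optimal cluster."""
    mus_opt = clusters[opt_cluster]
    mu_opt = max(mus_opt)                                   # best arm's reward probability
    mu_max_outside = max(m for c, mus in enumerate(clusters)
                         if c != opt_cluster for m in mus)
    delta = mu_opt - mu_max_outside                         # cluster separation
    a_opt = len(mus_opt)                                    # cluster size
    cohesiveness = 1 - sum(mu_opt - m for m in mus_opt) / a_opt
    return delta, a_opt, cohesiveness

print(cluster_characteristics([[0.30, 0.28, 0.31], [1e-6, 0.1]], opt_cluster=0))
```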


Experiments (effect of separation)

Δ increases → Tc decreases → higher reward


Experiments (effect of size)

Aopt increases → Tc increases → lower reward


Experiments (effect of cohesiveness)

Cohesiveness increases → Tc decreases → higher reward



Related Work

  • Typical multi-armed bandit problems

    • Do not consider dependencies

    • Very few arms

  • Bandits with side information

    • Cannot handle dependencies among arms

  • Active learning

    • Emphasis on #examples required to achieve a given prediction accuracy



Conclusions

  • We analyze bandits where dependencies are encapsulated within clusters

  • Discounted Reward the optimal policy is an index scheme on the clusters

  • Undiscounted Reward

    • Two-level Policy with MEAN, MAX, and PMAX

    • Analysis of the effect of cluster characteristics on performance, for MEAN


Discounted Reward

[Figure: belief-state MDP over arms 1-4. Each state holds the estimated reward probabilities (x1, x2, x3, x4). Pulling arm 1 leads, on success or failure, to (x'1, x'2, x3, x4) or (x"1, x"2, x3, x4): a change of belief for both arms 1 and 2, since they share a cluster. The other actions are to pull arm 2, 3, or 4.]

  • Create a belief-state MDP

  • Each state contains the estimated reward probabilities for all arms

  • Solve for the optimal policy (a belief-state sketch follows below)
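A minimal sketch of such a belief state and its transition when an arm is pulled, using Beta-style pseudo-counts as the estimated reward probabilities. The arm-plus-cluster count layout and the shrinkage estimate are illustrative assumptions; they only mimic the key effect shown on the slide, namely that one arm's outcome also shifts the belief about its cluster siblings.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class BeliefState:
    """One belief-MDP state: success/failure pseudo-counts per arm and per cluster."""
    arm_counts: Tuple[Tuple[int, int], ...]        # (successes, failures) for each arm
    cluster_counts: Tuple[Tuple[int, int], ...]    # pooled (successes, failures) per cluster

    def estimate(self, i, cluster_of):
        """Estimated reward probability x_i: arm evidence shrunk toward its cluster."""
        a_s, a_f = self.arm_counts[i]
        c_s, c_f = self.cluster_counts[cluster_of[i]]
        return (a_s + c_s + 1) / (a_s + a_f + c_s + c_f + 2)

def pull(state, i, cluster_of, reward):
    """Belief-MDP transition: the observed reward updates arm i's counts and its
    cluster's counts, so the estimates of all arms in that cluster move."""
    s, f = (1, 0) if reward else (0, 1)
    arm = list(state.arm_counts)
    arm[i] = (arm[i][0] + s, arm[i][1] + f)
    clu = list(state.cluster_counts)
    c = cluster_of[i]
    clu[c] = (clu[c][0] + s, clu[c][1] + f)
    return BeliefState(tuple(arm), tuple(clu))

cluster_of = [0, 0, 1, 1]                          # arms 1,2 in cluster 1; arms 3,4 in cluster 2
state = pull(BeliefState(((0, 0),) * 4, ((0, 0),) * 2), 0, cluster_of, reward=1)
print([round(state.estimate(i, cluster_of), 2) for i in range(4)])   # beliefs for arms 1 and 2 both rise
```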


Background: Bandits

[Figure: bandit “arms” with unknown payoff probabilities p1, p2, p3]

Regret = optimal payoff – actual payoff



Reward probability of a “cluster arm”

  • What is the reward probability of a “cluster arm”?

    • Eventually, every “cluster arm” must converge to the most rewarding arm μmax within that cluster

    • since a bandit policy is used within each cluster

    • However, “drift” causes problems



Experiments

  • Simulation based on one week’s worth of data from a large-scale ad-matching application

  • 10 clusters, with 11.3 arms/cluster on average



Comparison of schemes

10 clusters, 11.3 arms/cluster

  • Cluster separation Δ = 0.08

  • Cluster size Aopt = 31

  • Cohesiveness = 0.75

MAX performs best


Reward probability drift causes problems

Intuitively, to reduce regret, we must:

  • Quickly converge to the optimal “cluster arm”

  • and then to the best arm within that cluster

[Figure: Cluster 1 (the opt cluster) contains the best (optimal) arm, with reward probability μopt; Cluster 2 holds the remaining arms]

