# Multi-armed Bandit Problems with Dependent Arms

Sandeep Pandey

Deepayan Chakrabarti

Deepak Agarwal

Background: Bandits

[Figure: bandit “arms” with unknown reward probabilities μ1, μ2, μ3]

• Pull arms sequentially so as to maximize the total expected reward

• Show ads on a webpage to maximize clicks

• Product recommendation to maximize sales

• Reward probabilities μi are generally assumed to be independent of each other

• What if they are dependent?

• E.g., ads on similar topics, using similar text/phrases, should have similar rewards

[Figure: four example ads and their reward probabilities: “Skiing, snowboarding” (μ1 = 0.3), “Skiing, snowshoes” (μ2 = 0.28), “Get Vonage!” (μ3 = 10⁻⁶), “Snowshoe rental” (μ4 = 0.31)]

• A click on one ad ⇒ other “similar” ads may generate clicks as well

• Can we increase total reward using this dependency?

Cluster Model of Dependence

[Figure: four arms grouped into two clusters: Arms 1 and 2 in Cluster 1, Arms 3 and 4 in Cluster 2]

• μi ~ f(π[i]), where f is some distribution (known) and π[i] is the cluster-specific parameter (unknown) of the cluster containing arm i, i.e., μi ~ f(π1) in Cluster 1 and μi ~ f(π2) in Cluster 2

• Successes si ~ Bin(ni, μi), where ni = # pulls of arm i

• Total reward:

• Discounted: $\sum_{t=0}^{\infty} \alpha^t \, E[R(t)]$, α = discounting factor

• Undiscounted: $\sum_{t=0}^{T} E[R(t)]$
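
The cluster model and the two reward definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides leave f unspecified, so a Beta distribution with a per-cluster parameter pair is assumed here, and the function names and the example parameters are hypothetical. The discounted sum is truncated to the length of the list passed in.

```python
import random

def sample_cluster_model(cluster_params, arms_per_cluster, pulls_per_arm, rng):
    """Sample arms from the cluster model.

    cluster_params: list of (a, b) pairs, one per cluster. Using Beta(a, b)
    as the distribution f is an assumption made for this sketch.
    """
    arms = []
    for cluster_id, (a, b) in enumerate(cluster_params):
        for _ in range(arms_per_cluster):
            mu = rng.betavariate(a, b)  # mu_i ~ f(pi_[i])
            # s_i ~ Bin(n_i, mu_i): count successes over n_i Bernoulli pulls
            successes = sum(rng.random() < mu for _ in range(pulls_per_arm))
            arms.append({"cluster": cluster_id, "mu": mu,
                         "n": pulls_per_arm, "s": successes})
    return arms

def discounted_reward(expected_rewards, alpha):
    """Sum of alpha^t * E[R(t)] over the (finite) list given."""
    return sum(alpha ** t * r for t, r in enumerate(expected_rewards))

def undiscounted_reward(expected_rewards):
    """Sum of E[R(t)] for t = 0..T."""
    return sum(expected_rewards)
```

For example, `sample_cluster_model([(8, 2), (2, 8)], 2, 50, random.Random(0))` draws two arms per cluster, with the first cluster tending towards high reward probabilities and the second towards low ones.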

The optimal policy can be computed using per-cluster MDPs only.

[Figure: MDP for cluster 1 over states (x1, x2), with transitions to (x′1, x′2) and (x″1, x″2) under Pull Arm 1 or Pull Arm 2; MDP for cluster 2 over states (x3, x4), with transitions to (x′3, x′4) and (x″3, x″4) under Pull Arm 3 or Pull Arm 4]

• Optimal Policy:

• Compute an (“index”, arm) pair for each cluster

• Pick the cluster with the largest index, and pull the corresponding arm

• Reduces the problem to smaller state spaces

• Reduces to Gittins’ Theorem [1979] for independent bandits

• Approximation bounds on the index for k-step lookahead
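
The selection step of the index policy can be sketched as follows. The sketch assumes the (“index”, arm) pairs have already been obtained by solving each per-cluster MDP, which is not shown here; `ClusterChoice`, `pick_arm`, and the numeric index values are hypothetical placeholders, not values from the presentation.

```python
from typing import Dict, NamedTuple, Tuple

class ClusterChoice(NamedTuple):
    index: float  # index of the cluster in its current belief state (assumed given)
    arm: int      # arm recommended by that cluster's MDP

def pick_arm(choices: Dict[int, ClusterChoice]) -> Tuple[int, int]:
    """Return (cluster, arm) for the cluster with the largest index."""
    best_cluster = max(choices, key=lambda c: choices[c].index)
    return best_cluster, choices[best_cluster].arm
```

With `{1: ClusterChoice(0.42, 2), 2: ClusterChoice(0.57, 3)}`, cluster 2 has the larger index, so the policy pulls arm 3.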

Undiscounted Reward

All arms in a cluster are similar ⇒ they can be grouped into one hypothetical “cluster arm”

[Figure: Arms 1 and 2 grouped into “Cluster arm” 1; Arms 3 and 4 grouped into “Cluster arm” 2]

• Two-Level Policy

In each iteration:

• Pick “cluster arm” using a traditional bandit policy

• Pick an arm within that cluster using a traditional bandit policy

Each “cluster arm” must have some estimated reward probability
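
A minimal sketch of the two-level policy is given below. The slides only say "a traditional bandit policy"; UCB1 is assumed here at both levels, and the “cluster arm” statistics are the pooled successes and pulls of its arms. The function names and the simulation harness are hypothetical.

```python
import math
import random

def ucb1(successes, pulls, total_pulls):
    """UCB1 score; unpulled arms get +inf so they are tried first."""
    if pulls == 0:
        return float("inf")
    return successes / pulls + math.sqrt(2.0 * math.log(total_pulls) / pulls)

def two_level_step(clusters, stats, t):
    """clusters: cluster id -> list of arm ids; stats: arm id -> [s, n]; t: 1-based step."""
    def cluster_score(c):
        # Pool successes and pulls over all arms of the "cluster arm"
        s = sum(stats[a][0] for a in clusters[c])
        n = sum(stats[a][1] for a in clusters[c])
        return ucb1(s, n, t)

    cluster = max(clusters, key=cluster_score)               # level 1: pick cluster arm
    n_c = sum(stats[a][1] for a in clusters[cluster])
    arm = max(clusters[cluster],                             # level 2: pick arm in cluster
              key=lambda a: ucb1(stats[a][0], stats[a][1], max(n_c, 1)))
    return cluster, arm

def simulate(mu, clusters, horizon, seed=0):
    """Run the policy against Bernoulli arms with reward probabilities mu."""
    rng = random.Random(seed)
    stats = {a: [0, 0] for a in mu}
    total = 0
    for t in range(1, horizon + 1):
        _, arm = two_level_step(clusters, stats, t)
        reward = 1 if rng.random() < mu[arm] else 0
        stats[arm][0] += reward
        stats[arm][1] += 1
        total += reward
    return total, stats
```

Calling `simulate({1: 0.3, 2: 0.28, 3: 1e-6, 4: 0.31}, {1: [1, 2], 2: [3, 4]}, 500)` returns the total reward and the per-arm `[successes, pulls]` counts.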

• What is the reward probability of a “cluster arm”?

• How do cluster characteristics affect performance?

• What is the reward probability r of a “cluster arm”?

• MEAN: r = ∑si / ∑ni, i.e., average success rate, summing over all arms in the cluster [Kocsis+/2006, Pandey+/2007]

• Initially, r = μavg = average μ of arms in cluster

• Finally, r = μmax = max μ among arms in cluster

• “Drift” in the reward probability of the “cluster arm”

Reward probability drift causes problems

• Drift ⇒ non-optimal clusters might temporarily look better ⇒ optimal arm is explored only O(log T) times

[Figure: Cluster 1 and Cluster 2, one of which is the opt cluster containing the best (optimal) arm, with reward probability μopt]

• What is the reward probability r of a “cluster arm”?

• MEAN: r = ∑si / ∑ni

• MAX: r = max( E[μi] )

• PMAX: r = E[ max(μi) ]

(sums and maxima are taken over all arms i in the cluster)

• Both MAX and PMAX aim to estimate μmax and thus reduce drift

| Estimator | Bias in estimation of μmax | Variance of estimator |
|---|---|---|
| MAX | High | Low |
| PMAX | Unbiased | High |
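
The three estimators can be sketched as below. The expectations in MAX and PMAX are taken under a posterior that the slides do not spell out; this sketch assumes an independent Beta posterior per arm with a Beta(1, 1) prior, which ignores the cluster-level distribution f, and it approximates PMAX by Monte Carlo sampling. All function names are hypothetical.

```python
import random

def mean_estimate(stats):
    """MEAN: r = sum(s_i) / sum(n_i) over (s_i, n_i) pairs in the cluster."""
    return sum(s for s, n in stats) / sum(n for s, n in stats)

def max_estimate(stats, prior=(1.0, 1.0)):
    """MAX: r = max_i E[mu_i], using the Beta posterior mean of each arm."""
    a0, b0 = prior
    return max((a0 + s) / (a0 + b0 + n) for s, n in stats)

def pmax_estimate(stats, prior=(1.0, 1.0), draws=5000, seed=0):
    """PMAX: r = E[max_i mu_i], approximated by sampling the posteriors."""
    a0, b0 = prior
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        # One joint draw of all mu_i, then take the largest
        total += max(rng.betavariate(a0 + s, b0 + n - s) for s, n in stats)
    return total / draws
```

Because the expected maximum is never smaller than the maximum of the expectations, PMAX is at least as large as MAX under the same posterior, up to sampling noise.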

[Figure: results for 10 clusters, 11.3 arms/cluster]

MAX performs best

• What is the reward probability of a “cluster arm”?

• How do cluster characteristics affect performance?

• We analytically study the effects of cluster characteristics on the “crossover-time”

• Crossover-time: Time when the expected reward probability of the optimal cluster becomes highest among all “cluster arms”

• Crossover-time Tc for MEAN depends on:

• Cluster separation Δ = μopt – (μmax outside opt cluster): Δ increases ⇒ Tc decreases ⇒ higher reward

• Cluster size Aopt: Aopt increases ⇒ Tc increases ⇒ lower reward

• Cohesiveness in opt cluster, 1 – avg(μopt – μi): cohesiveness increases ⇒ Tc decreases ⇒ higher reward
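
The three cluster characteristics can be computed directly from the reward probabilities, as in this sketch. It assumes at least two clusters and assumes that the average in the cohesiveness term runs over all arms of the opt cluster, including the optimal arm itself; the slides do not specify this. The function name and the example values are hypothetical.

```python
def cluster_characteristics(mu_by_cluster):
    """mu_by_cluster: cluster id -> list of reward probabilities mu_i.

    Returns (separation, size of opt cluster, cohesiveness of opt cluster).
    """
    # The opt cluster is the one containing the arm with the largest mu
    opt = max(mu_by_cluster, key=lambda c: max(mu_by_cluster[c]))
    mu_opt = max(mu_by_cluster[opt])
    mu_max_outside = max(max(mus) for c, mus in mu_by_cluster.items() if c != opt)

    separation = mu_opt - mu_max_outside
    size = len(mu_by_cluster[opt])
    cohesiveness = 1 - sum(mu_opt - m for m in mu_by_cluster[opt]) / size
    return separation, size, cohesiveness
```

For `{"A": [0.5, 0.4], "B": [0.3, 0.2]}` the opt cluster is "A", giving separation 0.2, size 2, and cohesiveness 0.95.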

• Typical multi-armed bandit problems

• Do not consider dependencies

• Very few arms

• Bandits with side information

• Cannot handle dependencies among arms

• Active learning

• Emphasis on #examples required to achieve a given prediction accuracy

• We analyze bandits where dependencies are encapsulated within clusters

• Discounted Reward the optimal policy is an index scheme on the clusters

• Undiscounted Reward

• Two-level Policy with MEAN, MAX, and PMAX

• Analysis of the effect of cluster characteristics on performance, for MEAN

Discounted Reward

• Create a belief-state MDP

• Each state contains the estimated reward probabilities for all arms

• Solve for the optimal policy

[Figure: belief-state MDP over the estimated reward probabilities (x1, x2, x3, x4) of Arms 1 to 4, with actions Pull Arm 1, Pull Arm 2, Pull Arm 3, Pull Arm 4. Pulling Arm 1 leads to (x′1, x′2, x3, x4) on success and to (x″1, x″2, x3, x4) on failure: a change of belief for both arms 1 and 2.]
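
The "change of belief for both arms 1 and 2" can be illustrated with a small Bayesian update. The sketch assumes the unknown cluster parameter π takes one of a few discrete values, each with a known mean reward probability E[μ | π]; this discretisation, the function names, and the numbers in the usage note are assumptions for illustration. The update shown is valid for the first pull of an arm, where P(success | π) = E[μ | π].

```python
def update_cluster_belief(prior, mean_given_pi, success):
    """Posterior over pi after the first pull of one arm in the cluster.

    prior: pi value -> probability; mean_given_pi: pi value -> E[mu | pi].
    """
    likelihood = {pi: (m if success else 1.0 - m) for pi, m in mean_given_pi.items()}
    unnormalised = {pi: prior[pi] * likelihood[pi] for pi in prior}
    z = sum(unnormalised.values())
    return {pi: w / z for pi, w in unnormalised.items()}

def expected_mu_unpulled(belief, mean_given_pi):
    """Estimated reward probability of an arm in the cluster that has no pulls yet."""
    return sum(belief[pi] * mean_given_pi[pi] for pi in belief)
```

With prior `{"low": 0.5, "high": 0.5}` and means `{"low": 0.1, "high": 0.4}`, the estimate for the never-pulled arm 2 starts at 0.25, rises to 0.34 after a success on arm 1, and falls to 0.22 after a failure on arm 1.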

Background: Bandits

[Figure: bandit “arms” with reward probabilities p1, p2, p3]

Regret = optimal payoff – actual payoff
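
As a small worked example of the regret definition, the sketch below computes expected regret for a given sequence of pulls, taking the optimal payoff to be that of always pulling the arm with the highest reward probability. The function name and the example values are hypothetical.

```python
def expected_regret(mu, pulled_arms):
    """Expected regret of a pull sequence.

    mu: arm id -> reward probability; pulled_arms: arm ids in pull order.
    Each pull of arm a loses (mu_opt - mu[a]) in expectation.
    """
    mu_opt = max(mu.values())
    return sum(mu_opt - mu[a] for a in pulled_arms)
```

With `mu = {1: 0.3, 2: 0.28}` and pulls `[1, 2, 2, 1]`, the two pulls of arm 2 each lose 0.02 in expectation, for a total expected regret of 0.04.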

• What is the reward probability of a “cluster arm”?

• Eventually, every “cluster arm” must converge to the most rewarding arm μmax within that cluster

• since a bandit policy is used within each cluster

• However, “drift” causes problems

• Simulation based on one week’s worth of data from a large-scale ad-matching application

• 10 clusters, with 11.3 arms/cluster on average

• Cluster separation Δ = 0.08

• Cluster size Aopt = 31

• Cohesiveness = 0.75

MAX performs best

Reward probability drift causes problems

Intuitively, to reduce regret, we must:

• Quickly converge to the optimal “cluster arm”

• and then to the best arm within that cluster

[Figure: Cluster 1 and Cluster 2, one of which is the opt cluster containing the best (optimal) arm, with reward probability μopt]