Multi-armed Bandit Problems with Dependent Arms

Sandeep Pandey (spandey@cs.cmu.edu)

Deepayan Chakrabarti (deepay@yahoo-inc.com)

Deepak Agarwal (dagarwal@yahoo-inc.com)

[Figure: three bandit “arms” with unknown reward probabilities μ1, μ2, μ3]

- Pull arms sequentially so as to maximize the total expected reward
- Show ads on a webpage to maximize clicks
- Product recommendation to maximize sales

- Reward probabilities μi are generally assumed to be independent of each other
- What if they are dependent?
- E.g., ads on similar topics, using similar text/phrases, should have similar rewards

[Figure: four example ads with their reward probabilities: “Skiing, snowboarding” (μ1 = 0.3), “Skiing, snowshoes” (μ2 = 0.28), “Get Vonage!” (μ3 = 10^-6), “Snowshoe rental” (μ4 = 0.31)]

- A click on one ad suggests that other “similar” ads may generate clicks as well
- Can we increase total reward using this dependency?

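[Figure: Arms 1 and 2 belong to Cluster 1; Arms 3 and 4 belong to Cluster 2]

- Reward probability of arm i: μi ~ f(π[i]), where f is some known distribution and π[i] is the unknown cluster-specific parameter of the cluster containing arm i
- Successes of arm i: si ~ Bin(ni, μi), where ni is the number of pulls of arm i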
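The generative model above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides say f is "some distribution (known)", so the Beta distribution used here, the parameter values, and the helper names `sample_cluster_arms` and `sample_successes` are all hypothetical.

```python
import random

def sample_cluster_arms(a, b, n_arms, rng):
    """Draw each arm's reward probability from Beta(a, b), standing in for f(pi)."""
    return [rng.betavariate(a, b) for _ in range(n_arms)]

def sample_successes(mu, n_pulls, rng):
    """Draw s ~ Bin(n_pulls, mu) as a sum of independent Bernoulli trials."""
    return sum(1 for _ in range(n_pulls) if rng.random() < mu)

rng = random.Random(0)

# Hypothetical cluster-specific parameters (unknown to the learner in the real problem).
cluster_params = {1: (30.0, 70.0), 2: (1.0, 99.0)}

# Two arms per cluster, as in the four-arm figure.
mus = {c: sample_cluster_arms(a, b, 2, rng) for c, (a, b) in cluster_params.items()}

# Observe 100 pulls of every arm.
successes = {c: [sample_successes(mu, 100, rng) for mu in arm_mus]
             for c, arm_mus in mus.items()}
```

Arms in the same cluster share one parameter pair, so their sampled reward probabilities tend to be close to each other, which is the dependency the talk exploits.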

[Figure: Arms 1 and 2 with μi ~ f(π1); Arms 3 and 4 with μi ~ f(π2)]

- Total reward:
- Discounted: $\sum_{t=0}^{\infty} \alpha^t \, E[R(t)]$, where $\alpha$ is the discounting factor
- Undiscounted: $\sum_{t=0}^{T} E[R(t)]$
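As a small worked example, the two reward criteria can be computed for a finite sequence of expected rewards. The function names are hypothetical, and the infinite discounted sum is truncated to the length of the supplied sequence.

```python
def discounted_reward(expected_rewards, alpha):
    """Sum of alpha**t * E[R(t)] over the supplied steps (a truncation of the infinite sum)."""
    return sum((alpha ** t) * r for t, r in enumerate(expected_rewards))

def undiscounted_reward(expected_rewards):
    """Sum of E[R(t)] over t = 0..T."""
    return sum(expected_rewards)

rewards = [1.0, 1.0, 1.0]
d = discounted_reward(rewards, 0.5)   # 1 + 0.5 + 0.25
u = undiscounted_reward(rewards)      # 1 + 1 + 1
```

With α < 1 the early pulls dominate the discounted total, whereas the undiscounted total weights every step up to T equally.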

The optimal policy can be computed using per-cluster MDPs only.

[Figure: one MDP per cluster. MDP for cluster 1: states (x1, x2), (x’1, x’2), (x”1, x”2), with actions Pull Arm 1 and Pull Arm 2. MDP for cluster 2: states (x3, x4), (x’3, x’4), (x”3, x”4), with actions Pull Arm 3 and Pull Arm 4.]

- Optimal Policy:
- Compute an (“index”, arm) pair for each cluster
- Pick the cluster with the largest index, and pull the corresponding arm

- Reduces the problem to smaller state spaces
- Reduces to Gittins’ Theorem [1979] for independent bandits
- Approximation bounds on the index for k-step lookahead
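The selection step of the index policy can be sketched as follows. Computing the index itself requires solving each per-cluster MDP, which is not shown; the sketch assumes the (“index”, arm) pairs are already available, and the function name and the numbers are hypothetical.

```python
def pick_cluster_and_arm(cluster_indices):
    """cluster_indices maps cluster id -> (index, arm); return the winning (cluster, arm)."""
    best_cluster = max(cluster_indices, key=lambda c: cluster_indices[c][0])
    return best_cluster, cluster_indices[best_cluster][1]

# Hypothetical indices, each computed from one cluster's own MDP only.
choice = pick_cluster_and_arm({1: (0.42, "Arm 2"), 2: (0.57, "Arm 3")})
```

Only the pulled cluster's state changes, so only that cluster's pair needs to be recomputed before the next pull.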

[Figure: Arms 1 and 2 grouped into “cluster arm” 1; Arms 3 and 4 grouped into “cluster arm” 2]

All arms in a cluster are similar, so they can be grouped into one hypothetical “cluster arm”.

- Two-Level Policy, in each iteration:
- Pick a “cluster arm” using a traditional bandit policy
- Pick an arm within that cluster using a traditional bandit policy

Each “cluster arm” must have some estimated reward probability
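A minimal sketch of the two-level policy follows. The slides only say "a traditional bandit policy", so the use of UCB1 at both levels is an assumption, as is scoring each “cluster arm” by the pooled successes and pulls of its arms. All names and reward probabilities are hypothetical.

```python
import math
import random

def ucb1(successes, pulls, total_pulls):
    """UCB1 score: empirical mean plus exploration bonus; unpulled options come first."""
    if pulls == 0:
        return float("inf")
    return successes / pulls + math.sqrt(2.0 * math.log(total_pulls) / pulls)

def two_level_pull(clusters, stats, t):
    """clusters: cluster id -> arm ids; stats: arm id -> [successes, pulls]; t: pulls so far."""
    def cluster_score(c):
        s = sum(stats[a][0] for a in clusters[c])
        n = sum(stats[a][1] for a in clusters[c])
        return ucb1(s, n, t)
    # Level 1: choose a "cluster arm".
    chosen = max(clusters, key=cluster_score)
    n_cluster = sum(stats[a][1] for a in clusters[chosen])
    # Level 2: choose an arm inside that cluster.
    arm = max(clusters[chosen],
              key=lambda a: ucb1(stats[a][0], stats[a][1], max(n_cluster, 1)))
    return chosen, arm

def simulate(clusters, mu, horizon, seed=0):
    """Run the two-level policy against Bernoulli arms with reward probabilities mu."""
    rng = random.Random(seed)
    stats = {a: [0, 0] for arms in clusters.values() for a in arms}
    total = 0
    for t in range(horizon):
        _, arm = two_level_pull(clusters, stats, t)
        reward = 1 if rng.random() < mu[arm] else 0
        stats[arm][0] += reward
        stats[arm][1] += 1
        total += reward
    return total, stats

clusters = {1: [1, 2], 2: [3, 4]}
mu = {1: 0.30, 2: 0.28, 3: 0.05, 4: 0.04}  # hypothetical reward probabilities
total, stats = simulate(clusters, mu, horizon=500)
```

The cluster-level score here is only one possible choice of estimated reward probability for a “cluster arm”; any other estimate can be substituted inside `cluster_score`.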

- What is the reward probability of a “cluster arm”?
- How do cluster characteristics affect performance?

- What is the reward probability r of a “cluster arm”?
- MEAN: r = ∑si / ∑ni, i.e., average success rate, summing over all arms in the cluster [Kocsis+/2006, Pandey+/2007]
- Initially, r = μavg = average μ of arms in cluster
- Finally, r = μmax = max μ among arms in cluster
- “Drift” in the reward probability of the “cluster arm”
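The drift of MEAN can be illustrated with expected values: replacing each si by its expectation ni·μi gives r = ∑ ni·μi / ∑ ni. The sketch below uses hypothetical reward probabilities and pull counts and a hypothetical function name.

```python
def expected_mean_estimate(mu, pulls):
    """Expected value of MEAN when arm i has been pulled pulls[i] times."""
    return sum(m * n for m, n in zip(mu, pulls)) / sum(pulls)

mu = [0.1, 0.2, 0.6]  # hypothetical arms in one cluster

# Early on, pulls are spread evenly, so r equals the average mu of the cluster.
r_initial = expected_mean_estimate(mu, [10, 10, 10])

# Later, the within-cluster policy concentrates on the best arm, so r approaches the max mu.
r_later = expected_mean_estimate(mu, [10, 10, 980])
```

Here r moves from 0.3 (the cluster average) towards 0.6 (the cluster maximum) as pulls concentrate on the best arm.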

- Drift: non-optimal clusters might temporarily look better, so the optimal arm is explored only O(log T) times

[Figure: Cluster 1 and Cluster 2 (the opt cluster); the opt cluster contains the best (optimal) arm, with reward probability μopt]

- What is the reward probability r of a “cluster arm”?
- MEAN: r = ∑si / ∑ni
- MAX: r = max( E[μi] ), for all arms i in cluster
- PMAX: r = E[ max(μi) ], for all arms i in cluster
- Both MAX and PMAX aim to estimate μmax and thus reduce drift
- MAX: high bias in the estimation of μmax, low variance of the estimator
- PMAX: unbiased estimation of μmax, high variance of the estimator
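The three “cluster arm” estimates can be sketched as below. The slides do not say how E[μi] is obtained, so this sketch assumes a uniform Beta(1, 1) prior per arm, giving a Beta(1 + si, 1 + ni − si) posterior, and approximates PMAX by Monte Carlo sampling. The function names are hypothetical.

```python
import random

def mean_estimate(stats):
    """MEAN: pooled success rate sum(s_i) / sum(n_i); stats is a list of (s_i, n_i)."""
    total_n = sum(n for _, n in stats)
    return sum(s for s, _ in stats) / total_n if total_n else 0.0

def max_estimate(stats):
    """MAX: max over arms of E[mu_i], using the Beta(1 + s, 1 + n - s) posterior mean."""
    return max((1 + s) / (2 + n) for s, n in stats)

def pmax_estimate(stats, n_samples=10000, seed=0):
    """PMAX: E[max_i mu_i], approximated by sampling every arm's posterior jointly."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += max(rng.betavariate(1 + s, 1 + n - s) for s, n in stats)
    return total / n_samples

stats = [(5, 10), (5, 10)]  # two arms, 5 successes in 10 pulls each
r_mean = mean_estimate(stats)
r_max = max_estimate(stats)
r_pmax = pmax_estimate(stats)
```

Under these assumptions PMAX is never below MAX in expectation, since the expectation of a maximum is at least the maximum of the expectations; the gap is largest when the arms' posteriors overlap heavily, as in this example.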

[Figure: results with 10 clusters, 11.3 arms/cluster]

MAX performs best

- What is the reward probability of a “cluster arm”?
- How do cluster characteristics affect performance?

- We analytically study the effects of cluster characteristics on the “crossover-time”
- Crossover-time: Time when the expected reward probability of the optimal cluster becomes highest among all “cluster arms”

- Crossover-time Tc for MEAN depends on:
- Cluster separation Δ = μopt – (μmax outside the opt cluster): as Δ increases, Tc decreases (higher reward)
- Cluster size Aopt: as Aopt increases, Tc increases (lower reward)
- Cohesiveness in the opt cluster, 1 – avg(μopt – μi): as cohesiveness increases, Tc decreases (higher reward)
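The three cluster characteristics can be computed directly from the true reward probabilities. This sketch assumes the average in the cohesiveness formula runs over all arms of the opt cluster, including the optimal arm itself, which the slides do not specify. The function name and the numbers are hypothetical.

```python
def cluster_characteristics(clusters, mu):
    """clusters: cluster id -> arm ids; mu: arm id -> reward probability."""
    best_arm = max(mu, key=mu.get)
    opt_cluster = next(c for c, arms in clusters.items() if best_arm in arms)
    mu_opt = mu[best_arm]
    # Separation: gap between the best arm and the best arm outside the opt cluster.
    outside = [mu[a] for c, arms in clusters.items() if c != opt_cluster for a in arms]
    separation = mu_opt - max(outside)
    # Size: number of arms in the opt cluster.
    size = len(clusters[opt_cluster])
    # Cohesiveness: 1 - avg(mu_opt - mu_i) over arms in the opt cluster.
    cohesiveness = 1 - sum(mu_opt - mu[a] for a in clusters[opt_cluster]) / size
    return separation, size, cohesiveness

clusters = {1: [1, 2], 2: [3, 4]}
mu = {1: 0.30, 2: 0.28, 3: 0.05, 4: 0.04}
separation, size, cohesiveness = cluster_characteristics(clusters, mu)
```

In this example the opt cluster is well separated (Δ = 0.25), small (2 arms), and cohesive (0.99), which are the conditions the slides associate with a short crossover-time.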

- Typical multi-armed bandit problems
- Do not consider dependencies
- Very few arms

- Bandits with side information
- Cannot handle dependencies among arms

- Active learning
- Emphasis on #examples required to achieve a given prediction accuracy

- We analyze bandits where dependencies are encapsulated within clusters
- Discounted reward: the optimal policy is an index scheme on the clusters
- Undiscounted reward:
- Two-level Policy with MEAN, MAX, and PMAX
- Analysis of the effect of cluster characteristics on performance, for MEAN

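[Figure: belief-state MDP over estimated reward probabilities. From state (x1, x2, x3, x4), the actions are Pull Arm 1, Pull Arm 2, Pull Arm 3, and Pull Arm 4. Pulling Arm 1 leads to (x’1, x’2, x3, x4) on success and to (x”1, x”2, x3, x4) on failure: a change of belief for both arms 1 and 2.]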

- Create a belief-state MDP
- Each state contains the estimated reward probabilities for all arms
- Solve for the optimal policy

[Figure: three bandit “arms” with unknown payoff probabilities p1, p2, p3]

Regret = optimal payoff – actual payoff
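The regret definition can be made concrete in expectation: the optimal payoff per pull is the largest payoff probability, and the actual expected payoff of a pull is the payoff probability of the arm that was chosen. The function name and the numbers below are hypothetical.

```python
def expected_regret(p, pulled_arms):
    """Expected regret of a pull sequence: sum over pulls of (p_opt - p[arm])."""
    p_opt = max(p.values())
    return sum(p_opt - p[a] for a in pulled_arms)

p = {1: 0.30, 2: 0.28, 3: 0.10}
regret = expected_regret(p, [1, 2, 2, 3])  # 0 + 0.02 + 0.02 + 0.20
```

A policy that always pulls the best arm has zero expected regret; every pull of a worse arm adds that arm's gap to the total.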

- What is the reward probability of a “cluster arm”?
- Eventually, every “cluster arm” must converge to the most rewarding arm μmax within that cluster
- since a bandit policy is used within each cluster
- However, “drift” causes problems

- Simulation based on one week’s worth of data from a large-scale ad-matching application
- 10 clusters, with 11.3 arms/cluster on average

- Cluster separation Δ = 0.08
- Cluster size Aopt = 31
- Cohesiveness = 0.75

MAX performs best

Intuitively, to reduce regret, we must:

- Quickly converge to the optimal “cluster arm”
- and then to the best arm within that cluster
