
Weakly Coupled Stochastic Decision Systems


Presentation Transcript


  1. Weakly Coupled Stochastic Decision Systems Kamesh Munagala, Duke University (joint work with Sudipto Guha, UPenn, and Peng Shi, Duke)

  2. Stochastic Decision System [Diagram: the Stochastic Model provides Guidance to the Decision Algorithm; the Decision Algorithm issues a Decision to the System; observations from the System drive Model Refinement of the Stochastic Model]

  3. Example 1: Multi-armed Bandits

  4. Multi-armed Bandits • n treatments of unknown effectiveness • Model “effectiveness” as probability p_i ∈ [0,1] • All p_i are independent and unknown a priori

  5. Multi-armed Bandits • n treatments of unknown effectiveness • Model “effectiveness” as probability p_i ∈ [0,1] • All p_i are independent and unknown a priori • At any step: • Choose a treatment i and test it on a patient

  6. Multi-armed Bandits • n treatments of unknown effectiveness • Model “effectiveness” as probability p_i ∈ [0,1] • All p_i are independent and unknown a priori • At any step: • Choose a treatment i and test it on a patient • Test either passes/fails and costs c_i

  7. Multi-armed Bandits • n treatments of unknown effectiveness • Model “effectiveness” as probability p_i ∈ [0,1] • All p_i are independent and unknown a priori • At any step: • Choose a treatment i and test it on a patient • Test either passes/fails and costs c_i • Repeat until cost budget T is exhausted

  8. Multi-armed Bandits • n treatments of unknown effectiveness • Model “effectiveness” as probability p_i ∈ [0,1] • All p_i are independent and unknown a priori • At any step: • Choose a treatment i and test it on a patient • Test either passes/fails and costs c_i • Repeat until cost budget T is exhausted • Choose best treatment/treatments for marketing
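To make the setup concrete, here is a minimal Python sketch (our own; the class name and interface are hypothetical, not from the talk) of the trial environment in slides 4-8: each treatment has a hidden effectiveness p_i, each test costs c_i, and testing stops once the budget T is spent.

import random

class BanditTrialEnvironment:
    """Hypothetical sketch of the clinical-trial bandit from slides 4-8."""
    def __init__(self, success_probs, costs, budget):
        self.p = success_probs          # hidden effectiveness p_i of each treatment
        self.c = costs                  # cost c_i of testing treatment i once
        self.budget = budget            # total cost budget T
        self.spent = 0

    def can_test(self, i):
        return self.spent + self.c[i] <= self.budget

    def test(self, i):
        """Test treatment i on one patient; returns 1 (pass) or 0 (fail)."""
        assert self.can_test(i), "budget exhausted"
        self.spent += self.c[i]
        return 1 if random.random() < self.p[i] else 0

# Example: 3 treatments, unit costs, a budget of 10 tests.
env = BanditTrialEnvironment([0.3, 0.5, 0.7], [1, 1, 1], budget=10)
outcome = env.test(2)   # test treatment 2 on one patient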

  9. Stochastic Decision System [Diagram: the Decision Algorithm (which treatment to try next?) receives Guidance from the Stochastic Model (estimates of p_i) and issues a Decision to the System (n treatments); outcomes feed Model Refinement back to the Stochastic Model]

  10. Stochastic Decision System [same diagram as slide 9]

  11. Stochastic Model? • Key hurdle for decision maker: • p_i are unknown • Stochastic assumption: • p_i are drawn from a known “prior distribution” • Contrast with adversarial assumption: • Assume nothing about p_i • Will justify stochastic assumption in a bit…

  12. Example: Beta Prior • p_i ~ Beta(a,b) • Pr[p_i = x] ∝ x^(a-1) (1-x)^(b-1)

  13. Example: Beta Prior • p_i ~ Beta(a,b) • Pr[p_i = x] ∝ x^(a-1) (1-x)^(b-1) • Intuition: • Suppose we have previously observed (a-1) 1’s and (b-1) 0’s • Beta(a,b) is the posterior distribution given these observations • Updated according to Bayes’ rule starting with: • Beta(1,1) = Uniform[0,1] • Expected Reward = E[p_i] = a/(a+b)

  14. Stochastic Decision System [Diagram: the Decision Algorithm (which treatment to try next?) receives Guidance from the Stochastic Model (p_i ~ Beta(a_i,b_i)) and issues a Decision to the System (n treatments); outcomes feed Model Refinement back to the Stochastic Model]

  15. Prior Refined using Bayes’ Rule • If p_i = x then the next trial passes with probability x • Implies that conditioned on the trial passing: • Pr[p_i = x] ∝ x × x^(a-1)(1-x)^(b-1)  (i.e., Pr[Success | Prior] × Pr[Prior])

  16. Prior Refined using Bayes’ Rule • If p_i = x then the next trial passes with probability x • Implies that conditioned on the trial passing: • Pr[p_i = x] ∝ x × x^(a-1)(1-x)^(b-1) = x^a (1-x)^(b-1), i.e., Beta(a+1,b)

  17. Prior Refined using Bayes’ Rule • If p_i = x then the next trial fails with probability 1-x • Implies that conditioned on the trial failing: • Pr[p_i = x] ∝ (1-x) × x^(a-1)(1-x)^(b-1) = x^(a-1) (1-x)^b, i.e., Beta(a,b+1)
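A minimal Python sketch of this conjugate update (our own illustration; the function names are invented): a passing trial increments a, a failing trial increments b, and the posterior mean a/(a+b) is the expected reward used later.

def update_beta_prior(a, b, outcome):
    """Bayes-rule update of a Beta(a, b) prior on p_i after one trial.

    outcome = 1 (trial passed)  ->  Beta(a+1, b)
    outcome = 0 (trial failed)  ->  Beta(a, b+1)
    """
    if outcome == 1:
        return a + 1, b
    return a, b + 1

def expected_reward(a, b):
    """Posterior mean E[p_i] under Beta(a, b)."""
    return a / (a + b)

# Starting from the uniform prior Beta(1,1), observe pass, pass, fail:
a, b = 1, 1
for outcome in (1, 1, 0):
    a, b = update_beta_prior(a, b, outcome)
print((a, b), expected_reward(a, b))   # (3, 2), 0.6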

  18. Prior Update for Arm i [State-transition diagram: states are Beta(a,b) priors; each play yields reward 1 with probability Pr[Reward = 1 | Prior] = a/(a+b), moving to Beta(a+1,b), and reward 0 with probability b/(a+b), moving to Beta(a,b+1)]
  • (1,1) → (2,1) w.p. 1/2, → (1,2) w.p. 1/2
  • (2,1) → (3,1) w.p. 2/3, → (2,2) w.p. 1/3; (1,2) → (2,2) w.p. 1/3, → (1,3) w.p. 2/3
  • (3,1) → (4,1) w.p. 3/4, → (3,2) w.p. 1/4; similarly for (2,2) and (1,3), reaching (4,1), (3,2), (2,3), (1,4)
  • E[Reward | Prior = (3,1)] = 3/4

  19. Multi-armed Bandit Lingo System: Multi-armed bandit [Wald ‘47; Arrow et al. ‘49] • Treatment: Bandit arm • Clinical Trial: Playing the arm • Outcome (1/0): Reward

  20. Convenient Abstraction • Posterior density of arm captured by: • Observed rewards from arm so far • Called the “state” of the arm

  21. Convenient Abstraction • Posterior density of arm captured by: • Observed rewards from arm so far • Called the “state” of the arm • State space of a single arm is tractable • Number of states is O(T^2) • At most T plays • Each play yields reward 0 or 1
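As a quick sanity check of the O(T^2) count, a small Python sketch (our own; it assumes every state is described by the pair of observed 1's and 0's, as in slide 20):

def single_arm_states(T):
    """All states (number of 1's, number of 0's) reachable within T plays of one arm."""
    return [(ones, total - ones) for total in range(T + 1) for ones in range(total + 1)]

T = 10
print(len(single_arm_states(T)))   # (T+1)(T+2)/2 = 66 states, i.e., O(T^2)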

  22. Stochastic Decision System [Diagram: the Decision Algorithm (which treatment to try next?) receives Guidance from the Stochastic Model (p_i ~ Beta(a_i,b_i)) and issues a Decision to the System (n treatments); Model Refinement is done via Bayes’ Rule]

  23. Decision Policy for Playing Arms • Specifies which arm to play next • Function of current states of each arm • Defines a decision tree

  24. Policy: Decision Tree [Figure: a decision tree; the root plays arm A, observing reward 1 w.p. p_A or 0 w.p. 1-p_A; depending on the observed rewards, each branch plays another arm (A, B, or C), and each leaf chooses a final arm]

  25. Goal • Find decision policy with maximum value: • Value = E [ Reward of chosen arm ] • What is expectation over?

  26. The Bayesian Objective • Find the policy maximizing expected reward of the chosen arm when p_i is drawn from prior distribution Q_i [Wald ‘47, Robbins ‘50, Gittins, Jones ‘72] • Optimize: E[Posterior mean of chosen arm] • Expectation is over paths in the decision tree

  27. The Bayesian Objective • Find the policy maximizing expected reward of the chosen arm when p_i is drawn from prior distribution Q_i [Wald ‘47, Robbins ‘50, Gittins, Jones ‘72] • Optimize: E[Posterior mean of chosen arm] • Expectation is over paths in the decision tree • Expectation over two kinds of randomness: • The underlying p_i drawn from distribution Q_i • The rewards drawn from Bernoulli(1, p_i)

  28. The Bayesian Objective • Find the policy maximizing expected reward of the chosen arm when p_i is drawn from prior distribution Q_i [Wald ‘47, Robbins ‘50, Gittins, Jones ‘72] • Optimize: E[Posterior mean of chosen arm] • Expectation is over paths in the decision tree • Expectation over two kinds of randomness: • The underlying p_i drawn from distribution Q_i • The rewards drawn from Bernoulli(1, p_i) • Expected reward of a policy is a unique number • Depends only on the known Q_i but not on the unknown p_i

  29. Multi-armed Bandits: Summary • Slot machine (bandit) with n arms • Arm = Treatment • When played, an arm yields a reward • Distribution of reward unknown a priori • Prior specified over possible distributions • Goal: • Design a policy for playing arms • Optimize: E[ Σ_t α_t R_t ], where α_t is a scaling factor and R_t is the reward at step t

  30. Types of Objectives • Discounted reward: • α_t = γ^t for γ < 1 • Finite Horizon: • α_t = 1 for t ≤ T (0 otherwise) • Budgeted Learning (B-L): • α_t = 1 only for t = T+1 • Solutions are all related to each other
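For concreteness, a small Python sketch of these three weightings applied to a reward sequence (our own illustration; the function name and sample rewards are hypothetical):

def weighted_value(rewards, objective, gamma=0.9, T=None):
    """Compute sum_t alpha_t * r_t for the three objectives on slide 30 (t is 1-indexed)."""
    total = 0.0
    for t, r in enumerate(rewards, start=1):
        if objective == "discounted":            # alpha_t = gamma^t, gamma < 1
            alpha = gamma ** t
        elif objective == "finite_horizon":      # alpha_t = 1 for t <= T, 0 afterwards
            alpha = 1.0 if t <= T else 0.0
        elif objective == "budgeted_learning":   # alpha_t = 1 only at t = T + 1
            alpha = 1.0 if t == T + 1 else 0.0
        else:
            raise ValueError(objective)
        total += alpha * r
    return total

rewards = [1, 0, 1, 1]                                      # hypothetical reward sequence
print(weighted_value(rewards, "finite_horizon", T=3))       # 2.0
print(weighted_value(rewards, "budgeted_learning", T=3))    # 1.0 (only the final step counts)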

  31. Weak Coupling [Singh & Cohn ’97; Meuleau et al. ‘98] • Arms are independent: • If played, the state evolution of an arm is not conditioned on the states of other arms • Playing an arm does not affect the states of other arms • Only a few constraints couple the arms together in the decision policy • T plays over all the arms together • At most one arm chosen finally (in B-L)

  32. Space of Decision Policies

  33. Single Trial (T = 1) Arm 1 ~ Beta(1,2), E[p_1] = 0.33 (a priori better); Arm 2 ~ Beta(1,3), E[p_2] = 0.25

  34. Single Trial (T = 1) Arm 1 ~ Beta(1,2), E[p_1] = 0.33 (a priori better); Arm 2 ~ Beta(1,3), E[p_2] = 0.25
  Policy 1 (not so good): Play Arm 1
  • Y (prob 1/3): Arm 1 → B(2,2), μ_1 = 0.5, μ_2 = 0.25
  • N (prob 2/3): Arm 1 → B(1,3), μ_1 = 0.25, μ_2 = 0.25
  • For either outcome, choose Treatment 1
  • Effectiveness of finally chosen treatment: Reward = 1/3 × 0.5 + 2/3 × 0.25 = 0.33

  35. Single Trial (T = 1) Arm 1 ~ Beta(1,2), E[p_1] = 0.33 (a priori better); Arm 2 ~ Beta(1,3), E[p_2] = 0.25
  Policy 2 (optimal): Play Arm 2; if Y then choose Arm 2, else choose Arm 1
  • Y (prob 1/4): Arm 2 → B(2,3), μ_1 = 0.33, μ_2 = 0.4, choose Arm 2
  • N (prob 3/4): Arm 2 → B(1,4), μ_1 = 0.33, μ_2 = 0.2, choose Arm 1
  • Reward = 1/4 × 2/5 + 3/4 × 1/3 = 0.35
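A short Python check of these two values (our own sketch): with a single trial, the value of a policy is the expectation of the posterior mean of whichever arm it finally chooses.

def beta_mean(a, b):
    return a / (a + b)

# Priors: Arm 1 ~ Beta(1,2), Arm 2 ~ Beta(1,3).
# Policy 1: play Arm 1, then always choose Arm 1.
p_pass = beta_mean(1, 2)                     # Pr[Y] = 1/3
policy1 = p_pass * beta_mean(2, 2) + (1 - p_pass) * beta_mean(1, 3)
print(round(policy1, 2))                     # 0.33

# Policy 2: play Arm 2; choose Arm 2 on Y, else choose Arm 1.
q_pass = beta_mean(1, 3)                     # Pr[Y] = 1/4
policy2 = q_pass * beta_mean(2, 3) + (1 - q_pass) * beta_mean(1, 2)
print(round(policy2, 2))                     # 0.35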

  36. T = 2: Adaptive Solution. Priors: p_1 ~ B(1,1), p_2 ~ B(5,2), p_3 ~ B(21,11)
  Play Arm 1:
  • Y (prob 1/2): p_1 ~ B(2,1); Play Arm 1 again:
    • Y (prob 2/3): p_1 ~ B(3,1), Choose 1
    • N (prob 1/3): p_1 ~ B(2,2), Choose 2
  • N (prob 1/2): p_1 ~ B(1,2); Play Arm 2:
    • Y (prob 5/7): p_2 ~ B(6,2), Choose 2
    • N (prob 2/7): p_2 ~ B(5,3), Choose 3

  37. Curse of Dimensionality [Bellman ‘54] • Policy specifies an action for each “joint state” • Joint state: Cartesian product of the current states of all arms • Joint state space has size O(T^(2n))

  38. Curse of Dimensionality [Bellman ‘54] • Policy specifies an action for each “joint state” • Joint state: Cartesian product of the current states of all arms • Joint state space has size O(T^(2n)) • Dynamic program on the joint state space • Exponential running time and space requirement • Approximately optimal poly-size policies?
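To illustrate why the joint-state dynamic program blows up, here is a brute-force memoized sketch in Python (our own illustration for the budgeted-learning objective, not the paper's algorithm): it recurses over every joint state, which is exactly what becomes infeasible as n grows.

from functools import lru_cache

def budgeted_learning_value(priors, T):
    """Exact DP over the joint state space for budgeted learning.

    priors: tuple of (a_i, b_i) Beta priors, one per arm.
    Value = E[posterior mean of the finally chosen arm] under the optimal policy.
    The joint state space has size O(T^(2n)), so this is only feasible for tiny n and T.
    """
    @lru_cache(maxsize=None)
    def V(state, plays_left):
        # Option 1: stop and choose the arm with the best posterior mean.
        best = max(a / (a + b) for a, b in state)
        if plays_left == 0:
            return best
        # Option 2: play some arm i and recurse on its Bayes-updated state.
        for i, (a, b) in enumerate(state):
            p = a / (a + b)                       # Pr[reward = 1 | current prior]
            up = state[:i] + ((a + 1, b),) + state[i + 1:]
            down = state[:i] + ((a, b + 1),) + state[i + 1:]
            best = max(best, p * V(up, plays_left - 1)
                             + (1 - p) * V(down, plays_left - 1))
        return best

    return V(tuple(priors), T)

# The priors and budget from slide 36:
print(budgeted_learning_value([(1, 1), (5, 2), (21, 11)], 2))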

  39. Our Results • General solution technique: • Works for weakly coupled stochastic systems • Objective needs to be reward maximization • Constant factor approximations • Technique based on: • LP duality • Markov’s inequality

  40. Connection to Existing Heuristics • Gittins and Whittle indexes: • Compute quality measure for each state of each arm • Play arm whose current quality is highest • Exploit weak coupling to separate computation • Greedy algorithm – optimal for discounted reward! • Extremely efficient to compute and execute • Our policies are subtle variants of these indexes • Just as efficient to compute and execute!

  41. Solution Overview (STOC ‘07)

  42. Solution Idea • Consider any decision policy P • Consider its behavior restricted to arm i

  43. Example: the same T = 2 adaptive policy as slide 36, with priors p_1 ~ B(1,1), p_2 ~ B(5,2), p_3 ~ B(21,11)

  44. Behavior Restricted to Arm 2. Priors: p_1 ~ B(1,1), p_2 ~ B(5,2), p_3 ~ B(21,11)
  Same tree, keeping only the parts involving Arm 2:
  • Y (prob 1/2), then N (prob 1/3): p_1 ~ B(2,2), Choose 2
  • N (prob 1/2): p_1 ~ B(1,2); Play Arm 2:
    • Y (prob 5/7): p_2 ~ B(6,2), Choose 2

  45. Behavior Restricted to Arm 2. Priors: p_1 ~ B(1,1), p_2 ~ B(5,2), p_3 ~ B(21,11)
  Viewed from Arm 2 alone, the policy becomes randomized:
  • w.p. 1/6: Choose 2 (without playing it)
  • w.p. 1/2: Play Arm 2; if Y (prob 5/7), p_2 ~ B(6,2) and Choose 2; if N (prob 2/7), do nothing
  • With the remaining probability, do nothing

  46. Behavior Restricted to Arm i • Yields a randomized policy for arm i • At each state of the arm, policy probabilistically: • Does nothing • Plays the arm • Chooses the arm and obtains posterior reward

  47. Notation • T_i = E[Number of plays made for arm i] • C_i = E[Number of times arm i is chosen] • R_i = E[Reward from events when arm i is chosen]

  48. Behavior Restricted to Arm 2 (p_2 ~ B(5,2))
  • w.p. 1/6: Choose 2; w.p. 1/2: Play Arm 2 and, if Y (prob 5/7), move to p_2 ~ B(6,2) and Choose 2; with the remaining probability, do nothing
  • T_2 = 1/2
  • C_2 = 1/6 + 1/2 × 5/7 = 11/21
  • R_2 = 1/6 × 5/7 + 1/2 × 5/7 × 3/4 = 65/168
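A small Python check of these three expectations (our own sketch of the restricted arm-2 policy above), using exact fractions:

from fractions import Fraction as F

def beta_mean(a, b):
    return F(a, a + b)

# Restricted policy for arm 2 (prior Beta(5,2)):
#   w.p. 1/6  choose arm 2 immediately;
#   w.p. 1/2  play arm 2 once, and choose it only if the play succeeds;
#   otherwise do nothing.
p_choose_now, p_play = F(1, 6), F(1, 2)
p_success = beta_mean(5, 2)                      # 5/7

T2 = p_play                                      # expected number of plays
C2 = p_choose_now + p_play * p_success           # expected number of times chosen
R2 = (p_choose_now * beta_mean(5, 2)             # chosen without playing: reward 5/7
      + p_play * p_success * beta_mean(6, 2))    # chosen after a success: reward 3/4

print(T2, C2, R2)                                # 1/2 11/21 65/168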

  49. Weak Coupling • In any decision policy: • Number of plays is at most T • Number of times some arm is chosen is at most 1 • True on all decision paths • Taking expectations over decision paths: • Σ_i T_i ≤ T • Σ_i C_i ≤ 1 • Value of decision policy = Σ_i R_i

  50. Relaxed Decision Problem • Find one randomized decision policy P_i for each arm i such that: • Σ_i T_i(P_i) ≤ T • Σ_i C_i(P_i) ≤ 1 • Maximize: Σ_i R_i(P_i) • Why is this a relaxation? • Collection of P_i need not be a feasible policy • Only enforcing coupling in expectation!
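One way to make the relaxation concrete, under our own simplifying assumption that each arm comes with a small finite menu of candidate single-arm policies, each summarized by its (T_i, C_i, R_i) triple, is the following LP sketch: mix each arm's candidates and enforce the two coupling constraints only in expectation. The function, the menus, and the numbers in the toy instance are all hypothetical.

from scipy.optimize import linprog

def solve_relaxation(candidates, T):
    """LP relaxation: pick a convex combination of candidate policies per arm.

    candidates[i] is a list of (T_i, C_i, R_i) triples for arm i (assumed given).
    Variables x[i][j] = weight on candidate j of arm i, with sum_j x[i][j] <= 1,
    sum_ij x[i][j]*T_ij <= T and sum_ij x[i][j]*C_ij <= 1; maximize sum_ij x[i][j]*R_ij.
    """
    triples = [(i, t, c, r) for i, cands in enumerate(candidates) for (t, c, r) in cands]
    m = len(triples)
    c_obj = [-r for (_, _, _, r) in triples]            # linprog minimizes, so negate rewards
    A_ub = [[t for (_, t, _, _) in triples],            # total expected plays <= T
            [c for (_, _, c, _) in triples]]            # total expected choices <= 1
    b_ub = [T, 1.0]
    for i in range(len(candidates)):                    # each arm's weights sum to at most 1
        A_ub.append([1.0 if j == i else 0.0 for (j, _, _, _) in triples])
        b_ub.append(1.0)
    res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * m, method="highs")
    return -res.fun, res.x

# Toy instance (made-up numbers): two arms, two candidate single-arm policies each, budget T = 1.
value, weights = solve_relaxation(
    [[(0.0, 1.0, 0.33), (1.0, 0.5, 0.40)],   # arm 1: "choose now" vs "play once, maybe choose"
     [(0.0, 1.0, 0.25), (1.0, 0.6, 0.35)]],  # arm 2, same two kinds of candidates
    T=1)
print(round(value, 3), weights)              # relaxation value and per-candidate weights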
