
Online Sampling for Markov Decision Processes

Presentation Transcript


  1. Online Sampling for Markov Decision Processes
Bob Givan, joint work with E. K. P. Chong, H. Chang, and G. Wu
Electrical and Computer Engineering, Purdue University

  2. Markov Decision Process (MDP)
• Ingredients:
  • System state x in state space X
  • Control action a in A(x)
  • Reward R(x,a)
  • State-transition probability P(x,y,a)
• Goal: find a control policy that maximizes the objective function
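To make these ingredients concrete, here is a minimal Python sketch of an MDP simulation interface matching the slide's notation. The class and method names are our own illustration, not code from the talk; the later sketches build on this interface.

```python
class MDP:
    """Minimal simulation interface for an MDP (illustrative names)."""

    def actions(self, x):
        """Return the admissible action set A(x) for state x."""
        raise NotImplementedError

    def reward(self, x, a):
        """Return the immediate reward R(x, a)."""
        raise NotImplementedError

    def sample_next_state(self, x, a):
        """Sample a successor state y with probability P(x, y, a)."""
        raise NotImplementedError
```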

  3. Optimal Policies
• Policy: a mapping from state and time to actions
• Stationary policy: a mapping from state to actions
• Goal: a policy maximizing the objective function
  V_H*(x0) = max Obj[R(x0,a0), ..., R(x_{H-1},a_{H-1})],
  where the max is over all policies u = u0,...,u_{H-1}
• For large H, a0 is independent of H (with an ergodicity assumption)
• The stationary optimal action a0 for H = ∞ is obtained via receding-horizon control

  4. Q Values
Fix a large H and focus on finite-horizon reward.
• Define Q(x,a) = R(x,a) + E[V_{H-1}*(y)]
  • The "utility" of action a at state x
  • Name: the Q-value of action a at state x
• Key identities (Bellman's equations):
  • V_H*(x) = max_a Q(x,a)
  • u0*(x) = argmax_a Q(x,a)
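As a sketch, the Bellman recursion translates directly into code when the state space is small enough to enumerate. The `transitions(x, a)` method, returning (y, P(x,y,a)) pairs, is a hypothetical addition to the interface above.

```python
def v_star(mdp, x, h):
    """V_h*(x) = max_a Q_h(x, a), taking V_0* = 0."""
    if h == 0:
        return 0.0
    return max(q_value(mdp, x, a, h) for a in mdp.actions(x))

def q_value(mdp, x, a, h):
    """Q_h(x, a) = R(x, a) + E[V_{h-1}*(y)], expanded exactly."""
    return mdp.reward(x, a) + sum(
        p * v_star(mdp, y, h - 1) for y, p in mdp.transitions(x, a))
```

This naive recursion visits the full expectimax tree, so its cost grows exponentially in h; that cost is what motivates the sampling methods that follow.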

  5. Solution Methods
• Recall:
  • u0*(x) = argmax_a Q(x,a)
  • Q(x,a) = R(x,a) + E[V_{H-1}*(y)]
• Problem: the Q-value depends on the optimal policy
• The state space is extremely large (often continuous)
• Two-pronged solution approach:
  • Apply a receding-horizon method
  • Estimate Q-values via simulation/sampling

  6. Methods for Q-value Estimation
Previous work by other authors:
• Unbiased sampling (exact Q value) [Kearns et al., IJCAI-99]
• Policy rollout (lower bound) [Bertsekas & Castanon, 1999]
Our techniques:
• Hindsight optimization (upper bound)
• Parallel rollout (lower bound)

  7. Expectimax Tree for V*

  8. Unbiased Sampling

  9. Unbiased Sampling (Cont'd)
• For a given desired accuracy, how large should the sampling width and depth be?
• Answered by Kearns, Mansour, and Ng (1999)
• Requires prohibitive sampling width and depth, e.g. C ≈ 10^8 and Hs > 60 to distinguish the "best" and "worst" policies in our scheduling domain
• We evaluate with smaller width and depth
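For reference, a sketch of the sparse-sampling estimator of Kearns et al. under the interface above; `width` plays the role of C and `depth` the role of Hs. This is our reconstruction, not the authors' code.

```python
def q_sparse(mdp, x, a, depth, width):
    """Sparse-sampling estimate of Q(x, a), after Kearns et al. (1999).

    Draws `width` successors per (state, action) and recurses. The
    recursion tree has on the order of (|A| * width)**depth nodes,
    which is why the width and depth the theory demands are
    prohibitive in practice.
    """
    if depth == 0:
        return 0.0
    total = 0.0
    for _ in range(width):
        y = mdp.sample_next_state(x, a)
        total += max(q_sparse(mdp, y, b, depth - 1, width)
                     for b in mdp.actions(y))
    return mdp.reward(x, a) + total / width
```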

  10. How to Look Deeper?

  11. Policy Rollout

  12. Policy Rollout in Equations
• Write V_H^u(y) for the value of following policy u
• Recall: Q(x,a) = R(x,a) + E[V_{H-1}*(y)] = R(x,a) + E[max_u V_{H-1}^u(y)]
• Given a base policy u, use R(x,a) + E[V_{H-1}^u(y)] as a lower-bound estimate of the Q-value
• The resulting policy is PI(u), one step of policy improvement on u, given infinite sampling
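A minimal Monte-Carlo sketch of this estimate, using the interface from earlier; the sample count and helper name are our own choices.

```python
def simulate(mdp, x, policy, h):
    """Return one sampled h-step total reward of `policy` from state x."""
    ret = 0.0
    for _ in range(h):
        a = policy(x)
        ret += mdp.reward(x, a)
        x = mdp.sample_next_state(x, a)
    return ret

def q_rollout(mdp, x, a, base_policy, h, num_samples=32):
    """Rollout estimate of R(x,a) + E[V_{h-1}^u(y)] for base policy u.

    Since u is generally suboptimal, this lower-bounds the true Q-value.
    """
    total = 0.0
    for _ in range(num_samples):
        y = mdp.sample_next_state(x, a)
        total += simulate(mdp, y, base_policy, h - 1)
    return mdp.reward(x, a) + total / num_samples
```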

  13. Policy Rollout (Cont'd)

  14. Parallel Policy Rollout
• A generalization of policy rollout, due to [Chang, Givan, and Chong, 2000]
• Given a set U of base policies, use R(x,a) + E[max_{u∊U} V_{H-1}^u(y)] as an estimate of the Q-value
• A more accurate estimate than policy rollout
• Still gives a lower bound on the true Q-value
• Still gives a policy no worse than any in U
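Relative to plain rollout, the only change is that the max over base policies moves inside the expectation, applied per sampled successor. A sketch reusing `simulate` from above; averaging `m` trajectories per policy before taking the max is our own stabilizing detail, not part of the talk.

```python
def q_parallel_rollout(mdp, x, a, policies, h, num_samples=32, m=8):
    """Estimate R(x,a) + E[max_{u in U} V_{h-1}^u(y)] by simulation."""
    total = 0.0
    for _ in range(num_samples):
        y = mdp.sample_next_state(x, a)
        # Estimate V_{h-1}^u(y) for each base policy u, then take the max.
        total += max(
            sum(simulate(mdp, y, u, h - 1) for _ in range(m)) / m
            for u in policies)
    return mdp.reward(x, a) + total / num_samples
```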

  15. Hindsight Optimization – Tree View

  16. Hindsight Optimization – Equations
• Swap max and expectation in the expectimax tree
• Solve each offline optimization problem
  • O(k · C' · f(H)) time, where f(H) is the offline problem complexity
• Jensen's inequality implies the resulting estimates are upper bounds
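In code, the swap looks like the sketch below: draw one complete future (all randomness fixed up front), solve the now-deterministic problem offline, and average over futures. `sample_future` and `offline_optimal_value` are hypothetical names for problem-specific routines.

```python
def q_hindsight(mdp, x, a, h, num_futures=32):
    """Hindsight estimate of Q(x, a): an expectation of maxima rather
    than a max of expectations.

    With the randomness fixed, each inner problem is deterministic and
    can be solved offline in f(h) time. By Jensen's inequality,
    E[max ...] >= max E[...], so this upper-bounds the true Q-value.
    """
    total = 0.0
    for _ in range(num_futures):
        future = mdp.sample_future(h)  # hypothetical: fix all randomness
        total += offline_optimal_value(mdp, x, a, future)  # hypothetical
    return total / num_futures
```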

  17. Hindsight Optimization (Cont'd)

  18. Application to Example Problems
Apply unbiased sampling, policy rollout, parallel rollout, and hindsight optimization to:
• Multi-class deadline scheduling
• Random early dropping
• Congestion control

  19. Basic Approach
• The traffic model provides a stochastic description of possible future outcomes
• Method:
  • Formulate network decision problems as POMDPs by incorporating the traffic model
  • Solve the belief-state MDP online using sampling (choosing a time scale that allows for computation time)

  20. Domain 1: Deadline Scheduling
Objective: minimize weighted loss

  21. Domain 2: Random Early Dropping
Objective: minimize delay without sacrificing throughput

  22. Domain 3: Congestion Control

  23. Traffic Modeling
• A Hidden Markov Model (HMM) for each source
• Note: the state is hidden, so the model is partially observed
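Because the HMM state is hidden, the controller maintains a belief state over it, updated each step by standard HMM filtering. A generic sketch (the list-plus-matrix representation below is our own; the filtering step itself is textbook):

```python
def belief_update(belief, trans, emit_prob, obs):
    """One HMM filtering step: propagate the belief through the
    transition matrix, reweight by the observation likelihood,
    and renormalize.

    belief[i]       -- current P(hidden state = i)
    trans[i][j]     -- P(next state = j | current state = i)
    emit_prob(j, o) -- P(observation o | hidden state j)
    """
    n = len(belief)
    new = [emit_prob(j, obs) * sum(belief[i] * trans[i][j] for i in range(n))
           for j in range(n)]
    z = sum(new)  # normalizing constant (probability of the observation)
    return [p / z for p in new]
```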

  24. Deadline Scheduling Results
Non-sampling policies:
• EDF: earliest deadline first
  • Deadline sensitive, class insensitive
• SP: static priority
  • Deadline insensitive, class sensitive
• CM: current minloss [Givan et al., 2000]
  • Deadline and class sensitive
  • Minimizes weighted loss for the current packets

  25. Deadline Scheduling Results
• Objective: minimize weighted loss
• Comparison:
  • Non-sampling policies
  • Unbiased sampling (Kearns et al.)
  • Hindsight optimization
  • Rollout with CM as base policy
  • Parallel rollout
• Results due to H. S. Chang

  26. Deadline Scheduling Results

  27. Deadline Scheduling Results

  28. Deadline Scheduling Results

  29. Random Early Dropping Results
• Objective: minimize delay subject to a throughput loss tolerance
• Comparison:
  • Candidate policies: RED and "buffer-k"
  • KMN sampling
  • Rollout of buffer-k
  • Parallel rollout
  • Hindsight optimization
• Results due to H. S. Chang

  30. Random Early Dropping Results

  31. Random Early Dropping Results

  32. Congestion Control Results
• MDP objective: minimize a weighted sum of throughput, delay, and loss rate
• Fairness is hard-wired
• Comparisons:
  • PD-k (proportional-derivative control with target queue length k)
  • Hindsight optimization
  • Rollout of PD-k (which coincides with parallel rollout here)
• Results due to G. Wu, in progress

  33. Congestion Control Results

  34. Congestion Control Results

  35. Congestion Control Results

  36. Congestion Control Results

  37. Results Summary
• Unbiased sampling cannot cope: the required width and depth are prohibitive
• Parallel rollout wins in two domains
  • Not always equal to simple rollout of one base policy
• Hindsight optimization wins in one domain
• Simple policy rollout is the cheapest method:
  • Poor in domain 1
  • Strong in domain 2 with the best base policy, but how do we find this policy?
  • So-so in domain 3 with any base policy

  38. Talk Summary
• A case study of MDP sampling methods
• New methods offering practical improvements:
  • Parallel policy rollout
  • Hindsight optimization
• Systematic methods for using traffic models to help make network control decisions
• Feasibility of real-time implementation depends on the problem timescale

  39. Ongoing Research
• Apply to other control problems (different timescales):
  • Admission/access control
  • QoS routing
  • Link bandwidth allotment
  • Multiclass connection management
  • Problems arising in proxy services
  • Diagnosis and recovery

  40. Ongoing Research (Cont'd)
• Alternative traffic models:
  • Multi-timescale models
  • Long-range dependent models
  • Closed-loop traffic
  • Fluid models
• Learning the traffic model online
• Adaptation to changing traffic conditions

  41. Congestion Control (Cont'd)

  42. Congestion Control Results

  43. Hindsight Optimization (Cont'd)

  44. Policy Rollout (Cont'd)
[Figure: policy performance vs. base policy]

  45. Receding-Horizon Control
• For large horizon H, the optimal policy is approximately stationary
• At each time, if the state is x, apply the action
  u*(x) = argmax_a Q(x,a) = argmax_a R(x,a) + E[V_{H-1}*(y)]
• Compute an estimate of the Q-value at each time step
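Tying the pieces together, receding-horizon control is a short loop: re-estimate the Q-values at the current state each step and act greedily. In the sketch below, `q_estimate` can be any of the estimators sketched earlier (sparse sampling, rollout, parallel rollout, or hindsight).

```python
def receding_horizon_control(mdp, x0, q_estimate, h, num_steps):
    """Greedy receding-horizon controller: at each step, pick the action
    with the highest estimated h-horizon Q-value at the current state."""
    x = x0
    for _ in range(num_steps):
        a = max(mdp.actions(x), key=lambda b: q_estimate(mdp, x, b, h))
        x = mdp.sample_next_state(x, a)
    return x
```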

  46. Congestion Control (Cont'd)

  47. . . . . . . Domain 3: Congestion Control High-priority Traffic Bottleneck Node Best-effort Traffic • Resources: Bandwidth and buffer • Objective: optimize throughput, delay, loss, and fairness • High-priority traffic: • Open-loop controlled • Low-priority traffic: • Closed-loop controlled Bob Givan Electrical and Computer Engineering Purdue University

  48. Congestion Control Results

  49. Congestion Control Results

  50. Congestion Control Results
