
Online Sampling for Markov Decision Processes

Presentation Transcript


  1. Online Sampling for Markov Decision Processes
Bob Givan, joint work with E. K. P. Chong, H. Chang, and G. Wu
Electrical and Computer Engineering, Purdue University

  2. Markov Decision Process (MDP)
• Ingredients:
  • System state x in state space X
  • Control action a in A(x)
  • Reward R(x,a)
  • State-transition probability P(x,y,a)
• Goal: find a control policy that maximizes the objective function
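To make these ingredients concrete, here is a minimal Python sketch of an MDP simulation interface matching the slide's notation. The class and method names are our own illustration, not code from the talk; the later sketches build on this interface.

```python
class MDP:
    """Minimal simulation interface for an MDP (illustrative names)."""

    def actions(self, x):
        """Return the admissible action set A(x) for state x."""
        raise NotImplementedError

    def reward(self, x, a):
        """Return the immediate reward R(x, a)."""
        raise NotImplementedError

    def sample_next_state(self, x, a):
        """Sample a successor state y with probability P(x, y, a)."""
        raise NotImplementedError
```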

  3. Optimal Policies
• Policy: a mapping from state and time to actions
• Stationary policy: a mapping from state to actions
• Goal: a policy maximizing the objective function
  V_H*(x0) = max Obj[R(x0,a0), ..., R(x_{H-1},a_{H-1})],
  where the max is over all policies u = u0,...,u_{H-1}
• For large H, a0 is independent of H (with an ergodicity assumption)
• The stationary optimal action a0 for H = ∞ is obtained via receding-horizon control

  4. Q Values
Fix a large H and focus on finite-horizon reward.
• Define Q(x,a) = R(x,a) + E[V_{H-1}*(y)]
  • The "utility" of action a at state x
  • Name: the Q-value of action a at state x
• Key identities (Bellman's equations):
  • V_H*(x) = max_a Q(x,a)
  • u0*(x) = argmax_a Q(x,a)
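As a sketch, the Bellman recursion translates directly into code when the state space is small enough to enumerate. The `transitions(x, a)` method, returning (y, P(x,y,a)) pairs, is a hypothetical addition to the interface above.

```python
def v_star(mdp, x, h):
    """V_h*(x) = max_a Q_h(x, a), taking V_0* = 0."""
    if h == 0:
        return 0.0
    return max(q_value(mdp, x, a, h) for a in mdp.actions(x))

def q_value(mdp, x, a, h):
    """Q_h(x, a) = R(x, a) + E[V_{h-1}*(y)], expanded exactly."""
    return mdp.reward(x, a) + sum(
        p * v_star(mdp, y, h - 1) for y, p in mdp.transitions(x, a))
```

This naive recursion visits the full expectimax tree, so its cost grows exponentially in h; that cost is what motivates the sampling methods that follow.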

  5. Solution Methods
• Recall:
  • u0*(x) = argmax_a Q(x,a)
  • Q(x,a) = R(x,a) + E[V_{H-1}*(y)]
• Problem: the Q-value depends on the optimal policy
• The state space is extremely large (often continuous)
• Two-pronged solution approach:
  • Apply a receding-horizon method
  • Estimate Q-values via simulation/sampling

  6. Methods for Q-value Estimation
Previous work by other authors:
• Unbiased sampling (exact Q value) [Kearns et al., IJCAI-99]
• Policy rollout (lower bound) [Bertsekas & Castanon, 1999]
Our techniques:
• Hindsight optimization (upper bound)
• Parallel rollout (lower bound)

  7. Expectimax Tree for V*

  8. Unbiased Sampling

  9. Unbiased Sampling (Cont'd)
• For a given desired accuracy, how large should the sampling width and depth be?
• Answered by Kearns, Mansour, and Ng (1999)
• Requires prohibitive sampling width and depth, e.g. C ≈ 10^8 and Hs > 60 to distinguish the "best" and "worst" policies in our scheduling domain
• We evaluate with smaller width and depth
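For reference, a sketch of the sparse-sampling estimator of Kearns et al. under the interface above; `width` plays the role of C and `depth` the role of Hs. This is our reconstruction, not the authors' code.

```python
def q_sparse(mdp, x, a, depth, width):
    """Sparse-sampling estimate of Q(x, a), after Kearns et al. (1999).

    Draws `width` successors per (state, action) and recurses. The
    recursion tree has on the order of (|A| * width)**depth nodes,
    which is why the width and depth the theory demands are
    prohibitive in practice.
    """
    if depth == 0:
        return 0.0
    total = 0.0
    for _ in range(width):
        y = mdp.sample_next_state(x, a)
        total += max(q_sparse(mdp, y, b, depth - 1, width)
                     for b in mdp.actions(y))
    return mdp.reward(x, a) + total / width
```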

  10. How to Look Deeper?

  11. Policy Rollout

  12. Policy Rollout in Equations
• Write V_H^u(y) for the value of following policy u
• Recall: Q(x,a) = R(x,a) + E[V_{H-1}*(y)] = R(x,a) + E[max_u V_{H-1}^u(y)]
• Given a base policy u, use R(x,a) + E[V_{H-1}^u(y)] as a lower-bound estimate of the Q-value
• The resulting policy is PI(u), one step of policy improvement on u, given infinite sampling
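A minimal Monte-Carlo sketch of this estimate, using the interface from earlier; the sample count and helper name are our own choices.

```python
def simulate(mdp, x, policy, h):
    """Return one sampled h-step total reward of `policy` from state x."""
    ret = 0.0
    for _ in range(h):
        a = policy(x)
        ret += mdp.reward(x, a)
        x = mdp.sample_next_state(x, a)
    return ret

def q_rollout(mdp, x, a, base_policy, h, num_samples=32):
    """Rollout estimate of R(x,a) + E[V_{h-1}^u(y)] for base policy u.

    Since u is generally suboptimal, this lower-bounds the true Q-value.
    """
    total = 0.0
    for _ in range(num_samples):
        y = mdp.sample_next_state(x, a)
        total += simulate(mdp, y, base_policy, h - 1)
    return mdp.reward(x, a) + total / num_samples
```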

  13. Policy Rollout (Cont'd)

  14. Parallel Policy Rollout
• A generalization of policy rollout, due to [Chang, Givan, and Chong, 2000]
• Given a set U of base policies, use R(x,a) + E[max_{u∊U} V_{H-1}^u(y)] as an estimate of the Q-value
• A more accurate estimate than policy rollout
• Still gives a lower bound on the true Q-value
• Still gives a policy no worse than any in U
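Relative to plain rollout, the only change is that the max over base policies moves inside the expectation, applied per sampled successor. A sketch reusing `simulate` from above; averaging `m` trajectories per policy before taking the max is our own stabilizing detail, not part of the talk.

```python
def q_parallel_rollout(mdp, x, a, policies, h, num_samples=32, m=8):
    """Estimate R(x,a) + E[max_{u in U} V_{h-1}^u(y)] by simulation."""
    total = 0.0
    for _ in range(num_samples):
        y = mdp.sample_next_state(x, a)
        # Estimate V_{h-1}^u(y) for each base policy u, then take the max.
        total += max(
            sum(simulate(mdp, y, u, h - 1) for _ in range(m)) / m
            for u in policies)
    return mdp.reward(x, a) + total / num_samples
```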

  15. Hindsight Optimization – Tree View

  16. Hindsight Optimization – Equations
• Swap max and expectation in the expectimax tree
• Solve each offline optimization problem
  • O(k · C' · f(H)) time, where f(H) is the offline problem complexity
• Jensen's inequality implies the resulting estimates are upper bounds
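In code, the swap looks like the sketch below: draw one complete future (all randomness fixed up front), solve the now-deterministic problem offline, and average over futures. `sample_future` and `offline_optimal_value` are hypothetical names for problem-specific routines.

```python
def q_hindsight(mdp, x, a, h, num_futures=32):
    """Hindsight estimate of Q(x, a): an expectation of maxima rather
    than a max of expectations.

    With the randomness fixed, each inner problem is deterministic and
    can be solved offline in f(h) time. By Jensen's inequality,
    E[max ...] >= max E[...], so this upper-bounds the true Q-value.
    """
    total = 0.0
    for _ in range(num_futures):
        future = mdp.sample_future(h)  # hypothetical: fix all randomness
        total += offline_optimal_value(mdp, x, a, future)  # hypothetical
    return total / num_futures
```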

  17. Hindsight Optimization (Cont'd)

  18. Application to Example Problems
Apply unbiased sampling, policy rollout, parallel rollout, and hindsight optimization to:
• Multi-class deadline scheduling
• Random early dropping
• Congestion control

  19. Basic Approach
• The traffic model provides a stochastic description of possible future outcomes
• Method:
  • Formulate network decision problems as POMDPs by incorporating the traffic model
  • Solve the belief-state MDP online using sampling (choosing a time scale that allows for computation time)

  20. Domain 1: Deadline Scheduling
Objective: minimize weighted loss

  21. Domain 2: Random Early Dropping
Objective: minimize delay without sacrificing throughput

  22. Domain 3: Congestion Control

  23. Traffic Modeling
• A Hidden Markov Model (HMM) for each source
• Note: the state is hidden, so the model is partially observed
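Because the HMM state is hidden, the controller maintains a belief state over it, updated each step by standard HMM filtering. A generic sketch (the list-plus-matrix representation below is our own; the filtering step itself is textbook):

```python
def belief_update(belief, trans, emit_prob, obs):
    """One HMM filtering step: propagate the belief through the
    transition matrix, reweight by the observation likelihood,
    and renormalize.

    belief[i]       -- current P(hidden state = i)
    trans[i][j]     -- P(next state = j | current state = i)
    emit_prob(j, o) -- P(observation o | hidden state j)
    """
    n = len(belief)
    new = [emit_prob(j, obs) * sum(belief[i] * trans[i][j] for i in range(n))
           for j in range(n)]
    z = sum(new)  # normalizing constant (probability of the observation)
    return [p / z for p in new]
```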

  24. Deadline Scheduling Results
Non-sampling policies:
• EDF: earliest deadline first
  • Deadline sensitive, class insensitive
• SP: static priority
  • Deadline insensitive, class sensitive
• CM: current minloss [Givan et al., 2000]
  • Deadline and class sensitive
  • Minimizes weighted loss for the current packets

  25. Deadline Scheduling Results
• Objective: minimize weighted loss
• Comparison:
  • Non-sampling policies
  • Unbiased sampling (Kearns et al.)
  • Hindsight optimization
  • Rollout with CM as base policy
  • Parallel rollout
• Results due to H. S. Chang

  26. Deadline Scheduling Results

  27. Deadline Scheduling Results

  28. Deadline Scheduling Results

  29. Random Early Dropping Results
• Objective: minimize delay subject to a throughput loss tolerance
• Comparison:
  • Candidate policies: RED and "buffer-k"
  • KMN sampling
  • Rollout of buffer-k
  • Parallel rollout
  • Hindsight optimization
• Results due to H. S. Chang

  30. Random Early Dropping Results

  31. Random Early Dropping Results

  32. Congestion Control Results
• MDP objective: minimize a weighted sum of throughput, delay, and loss rate
• Fairness is hard-wired
• Comparisons:
  • PD-k (proportional-derivative control with target queue length k)
  • Hindsight optimization
  • Rollout of PD-k (which coincides with parallel rollout here)
• Results due to G. Wu, in progress

  33. Congestion Control Results

  34. Congestion Control Results

  35. Congestion Control Results

  36. Congestion Control Results

  37. Results Summary
• Unbiased sampling cannot cope: the required width and depth are prohibitive
• Parallel rollout wins in two domains
  • Not always equal to simple rollout of one base policy
• Hindsight optimization wins in one domain
• Simple policy rollout is the cheapest method:
  • Poor in domain 1
  • Strong in domain 2 with the best base policy, but how do we find this policy?
  • So-so in domain 3 with any base policy

  38. Talk Summary
• A case study of MDP sampling methods
• New methods offering practical improvements:
  • Parallel policy rollout
  • Hindsight optimization
• Systematic methods for using traffic models to help make network control decisions
• Feasibility of real-time implementation depends on the problem timescale

  39. Ongoing Research
• Apply to other control problems (different timescales):
  • Admission/access control
  • QoS routing
  • Link bandwidth allotment
  • Multiclass connection management
  • Problems arising in proxy services
  • Diagnosis and recovery

  40. Ongoing Research (Cont'd)
• Alternative traffic models:
  • Multi-timescale models
  • Long-range dependent models
  • Closed-loop traffic
  • Fluid models
• Learning the traffic model online
• Adaptation to changing traffic conditions

  41. Congestion Control (Cont'd)

  42. Congestion Control Results

  43. Hindsight Optimization (Cont'd)

  44. Policy Rollout (Cont'd)
[Figure: policy performance vs. base policy]

  45. Receding-Horizon Control
• For large horizon H, the optimal policy is approximately stationary
• At each time, if the state is x, apply the action
  u*(x) = argmax_a Q(x,a) = argmax_a R(x,a) + E[V_{H-1}*(y)]
• Compute an estimate of the Q-value at each time step
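Tying the pieces together, receding-horizon control is a short loop: re-estimate the Q-values at the current state each step and act greedily. In the sketch below, `q_estimate` can be any of the estimators sketched earlier (sparse sampling, rollout, parallel rollout, or hindsight).

```python
def receding_horizon_control(mdp, x0, q_estimate, h, num_steps):
    """Greedy receding-horizon controller: at each step, pick the action
    with the highest estimated h-horizon Q-value at the current state."""
    x = x0
    for _ in range(num_steps):
        a = max(mdp.actions(x), key=lambda b: q_estimate(mdp, x, b, h))
        x = mdp.sample_next_state(x, a)
    return x
```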

  46. Congestion Control (Cont'd)

  47. . . . . . . Domain 3: Congestion Control High-priority Traffic Bottleneck Node Best-effort Traffic • Resources: Bandwidth and buffer • Objective: optimize throughput, delay, loss, and fairness • High-priority traffic: • Open-loop controlled • Low-priority traffic: • Closed-loop controlled Bob Givan Electrical and Computer Engineering Purdue University

  48. Congestion Control Results

  49. Congestion Control Results

  50. Congestion Control Results
