MDP Exact Solutions II and Appl.


  1. MDP Exact Solutions II and Appl. • Jeffrey Chyan • Department of Computer Science • Rice University • Slides adapted from Mausam and Andrey Kolobov

  2. Outline • Policy Iteration (3.3) • Value Iteration (3.4) • Prioritization in Value Iteration (3.5) / Partitioned Value Iteration (3.6) • Linear Programming Formulation (3.7) • Infinite-Horizon Discounted-Reward MDPs (3.8) • Finite-Horizon MDPs (3.9) • MDPs with Dead Ends (3.10)

  3. Solving MDPs • Finding the best policy for MDPs • Policy Iteration • Value Iteration • Linear Programming

  4. Recall SSP MDPs • Agent pays a cost to achieve the goal • There exists at least one proper policy • Every improper policy incurs an infinite cost from every state from which it does not reach the goal with probability 1 • IHDR and FH ⊆ SSP • For this presentation, assume SSP unless stated otherwise

  5. Recall Value and Evaluation • Value Function: maps the policy’s domain (the states, without the action component) to scalar values • Value of a policy is the expected utility of the reward sequence obtained by executing the policy • Policy Evaluation: given a policy, compute the value function at each state • Either solve a system of linear equations • Or use an iterative approach

  6. Motivation • Find the best policy • Brute-force algorithm: given all policies are proper, enumerate all policies, evaluate them, and return the best one • Exponential number of policies, computationally intractable • Need a more intelligent search for best policy

  7. The Q-Value Under a Value Function V • Q-value under a value function: the one-step lookahead computation of the value of taking action a in state s • Under the belief that the value function V is the true expected cost to reach a goal • Denoted Q^V(s,a) • Q^V(s,a) = Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V(s')]

  8. Greedy Action/Policy • Action greedy w.r.t. a value function: an action that has the lowest Q-value • a = argmin_{a'} Q^V(s,a') • Greedy Policy: a policy that picks a greedy action w.r.t. V at every state
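A minimal sketch of these two definitions (not from the slides). The data layout is an assumption used throughout these sketches: T[s][a] maps each successor state to its probability, C[s][a][s2] is the transition cost, and V is a dict from states to values.

```python
def q_value(T, C, V, s, a):
    """Q^V(s,a) = sum over s' of T(s,a,s') * (C(s,a,s') + V(s'))."""
    return sum(p * (C[s][a][s2] + V[s2]) for s2, p in T[s][a].items())

def greedy_action(T, C, V, s):
    """An action with the lowest Q-value at s under V."""
    return min(T[s], key=lambda a: q_value(T, C, V, s, a))
```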

  9. Policy Iteration • Initialize π_0 as a random proper policy • Repeat • Policy Evaluation: compute V^π_{n-1} • Policy Improvement: construct π_n greedy w.r.t. V^π_{n-1} • Until π_n == π_{n-1} • Return π_n
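A hedged sketch of the loop above under the same assumed data layout; `states` is a list, `goals` a set, and `pi0` an assumed proper initial policy given as a dict {state: action}. Exact evaluation is done by solving the linear system.

```python
import numpy as np

def evaluate(states, goals, T, C, pi):
    """Exact policy evaluation: solve the linear system for V^pi over non-goal states."""
    ng = [s for s in states if s not in goals]
    idx = {s: i for i, s in enumerate(ng)}
    A, b = np.eye(len(ng)), np.zeros(len(ng))
    for s in ng:
        for s2, p in T[s][pi[s]].items():
            b[idx[s]] += p * C[s][pi[s]][s2]
            if s2 not in goals:
                A[idx[s], idx[s2]] -= p
    x = np.linalg.solve(A, b)
    return {s: (0.0 if s in goals else x[idx[s]]) for s in states}

def policy_iteration(states, goals, T, C, pi0):
    pi = dict(pi0)
    while True:
        V = evaluate(states, goals, T, C, pi)                 # policy evaluation
        new_pi = {s: min(T[s], key=lambda a: sum(             # policy improvement
                     p * (C[s][a][s2] + V[s2]) for s2, p in T[s][a].items()))
                  for s in states if s not in goals}
        if all(new_pi[s] == pi[s] for s in new_pi):           # until pi_n == pi_{n-1}
            return pi, V
        pi.update(new_pi)
```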

  10. Policy Improvement • Computes a greedy policy under V^π_{n-1} • First compute the Q-value of each action under V^π_{n-1} in a given state • Then assign a greedy action in that state as π_n(s)

  11. Properties of Policy Iteration • Policy iteration for an SSP (initialized with a proper policy π_0) • Successively improves the policy in each iteration • V^π_n(s) ≤ V^π_{n-1}(s) • Converges to an optimal policy

  12. Modified Policy Iteration • Use the iterative procedure for policy evaluation instead of solving the system of equations • Warm-start from the final value function of the previous iteration, V^π_{n-1}, instead of an arbitrary initialization V_0^π_n

  13. Modified Policy Iteration • Initialize π_0 as a random proper policy • Repeat • Approximate Policy Evaluation: compute V^π_{n-1} by running only a few iterations of iterative policy evaluation • Policy Improvement: construct π_n greedy w.r.t. V^π_{n-1} • Until π_n == π_{n-1} • Return π_n
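A sketch of the approximate-evaluation step only (same assumed layout): run a few sweeps of iterative policy evaluation, warm-started from the previous iteration's value function, instead of solving the linear system exactly.

```python
def approx_evaluate(states, goals, T, C, pi, V_prev, sweeps=5):
    # Warm start from the previous value function, then apply a handful of
    # fixed-policy backups to every non-goal state.
    V = dict(V_prev)
    for _ in range(sweeps):
        for s in states:
            if s not in goals:
                a = pi[s]
                V[s] = sum(p * (C[s][a][s2] + V[s2]) for s2, p in T[s][a].items())
    return V
```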

  14. Limitations of Policy Iteration • Why do we need to start with a proper policy? • Otherwise the policy evaluation step may diverge • How to get a proper policy? • No domain-independent algorithm • So policy iteration for SSPs is not generically applicable

  15. From Policy Iteration To Value Iteration • Search space changes • Policy Iteration • Search over policies • Compute the resulting value • Value Iteration • Search over values • Compute the resulting policy

  16. Bellman Equations • Value Iteration based on set of Bellman equations • Bellman equations mathematically express the optimal solution of an MDP • Recursive expansion to compute optimal value function

  17. Bellman Equations • Optimal Q-value of a state-action pair: the minimum expected cost to reach a goal starting in state s if the agent’s first action is a • Denoted Q*(s,a) • V*(s) = 0 if s ∈ G; V*(s) = min_{a∈A} Q*(s,a) if s ∉ G • Q*(s,a) = Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V*(s')] • Restatement of the optimality principle for SSP MDPs

  18. Bellman Equations • Q*(s,a) = Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V*(s')] (expected cost of first executing action a in state s, then following an optimal policy) • V*(s) = 0 if s ∈ G (already at the goal, no action needed) • V*(s) = min_{a∈A} Q*(s,a) if s ∉ G (pick the best action, minimizing expected cost) • The minimization over all actions makes the equations non-linear

  19. Bellman Backup • Iterative refinement: V_n(s) ← min_{a∈A} Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V_{n-1}(s')] • Bellman Backup: computes a new value at state s by backing up the successor values V(s') using the update above

  20. Value Iteration • No restriction on the initial value function • Termination condition: ϵ-consistency
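The pseudocode on this slide is a figure in the original deck. Below is a minimal Gauss-Seidel-style sketch under the same assumed data layout, terminating once the value function is ϵ-consistent at every state.

```python
def value_iteration(states, goals, T, C, eps=1e-4, V0=None):
    V = dict(V0) if V0 is not None else {s: 0.0 for s in states}  # any init works
    for g in goals:
        V[g] = 0.0
    while True:
        max_residual = 0.0
        for s in states:
            if s in goals:
                continue
            new_v = min(sum(p * (C[s][a][s2] + V[s2]) for s2, p in T[s][a].items())
                        for a in T[s])
            max_residual = max(max_residual, abs(new_v - V[s]))
            V[s] = new_v
        if max_residual < eps:            # every state is epsilon-consistent
            return V
```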

  21. Example • All costs are 1 except for a40 = 5 and a41 = 2 • All V_0 values are initialized as the distance from the goal • The backup order is s0, s1, …, s4

  22. Example – First Iteration • V1(s0) = min{1+V0(s1), 1+V0(s2)} = 3, similar computations for s1, s2, s3 • Q1(s4,a41) = 2+0.6*0+0.4*2 = 2.8 and Q1(s4,a40) = 5 • V1(s4) = min{2.8, 5} = 2.8

  23. Example

  24. VI - Convergence and Optimality • Value iteration converges to the optimal value function in the limit, without restrictions • For an SSP MDP, ∀s ∈ S: lim_{n→∞} V_n(s) = V*(s), irrespective of the initialization

  25. VI - Termination • Residual at state s (Bellman backup): the magnitude of the change in the value of state s if a Bellman backup is applied to V at s once • Denoted Res^V(s) • The residual Res^V is the maximum residual across all states • ϵ-consistency: a state s is called ϵ-consistent w.r.t. a value function V if the residual at s w.r.t. V is less than ϵ • A value function V is ϵ-consistent if it is ϵ-consistent at all states • Terminate VI once all residuals are small
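A sketch of the residual at a single state (same assumed layout as the earlier sketches): the change a one-time Bellman backup at s would make to V(s).

```python
def residual(T, C, V, s, goals):
    if s in goals:
        return 0.0
    backed_up = min(sum(p * (C[s][a][s2] + V[s2]) for s2, p in T[s][a].items())
                    for a in T[s])
    return abs(backed_up - V[s])
```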

  26. VI - Running Time • Each Bellman backup: • Go over all actions and, for each, all successor states: O(|S||A|) • Each iteration: • Back up all states: O(|S|²|A|) • Number of iterations: • General SSPs: non-trivial bounds don’t exist

  27. Monotonicity • For all n > k • V_k ≤ V* ⇒ V_n ≤ V* (V_n monotonic from below) • V_k ≥ V* ⇒ V_n ≥ V* (V_n monotonic from above) • If a value function V_1 is componentwise greater (or less) than another value function V_2, then the same inequality holds between T(V_1) and T(V_2) • The Bellman backup operator used in VI is monotonic

  28. Value Iteration to Asynchronous Value Iteration • Value iteration requires full sweeps of the state space • It is not essential to back up all states in an iteration • Asynchronous value iteration needs an additional restriction (no state is starved) for convergence to hold • The termination condition checks whether the current value function is ϵ-consistent

  29. Asynchronous Value Iteration
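The asynchronous VI pseudocode is a figure in the original slides. A minimal sketch under the earlier assumed data layout; `pick_state` is a placeholder for any selection rule that starves no state.

```python
def bellman_backup(T, C, V, s):
    return min(sum(p * (C[s][a][s2] + V[s2]) for s2, p in T[s][a].items())
               for a in T[s])

def async_value_iteration(states, goals, T, C, pick_state, eps=1e-4):
    V = {s: 0.0 for s in states}
    while True:
        s = pick_state(V)                      # back up one state at a time
        if s not in goals:
            V[s] = bellman_backup(T, C, V, s)
        # terminate once V is epsilon-consistent at every non-goal state
        # (checked naively here; real implementations track residuals)
        if all(abs(bellman_backup(T, C, V, s2) - V[s2]) < eps
               for s2 in states if s2 not in goals):
            return V
```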

  30. Wasteful Backup

  31. Priority Backup • Wasteful backups occur in value iteration • Need an intelligent backup order, defined via a priority • Higher-priority states are backed up earlier

  32. Prioritized Value Iteration

  33. What State to Prioritize? • Avoid backing up a state when: • None of the successors of the state have had a change in value since the last backup • This means backing up the state will not change its value

  34. Prioritized Sweeping • If a state’s value changes, prioritize its predecessors • The priority estimates the expected change in the value of a state if a backup were to be performed • Converges to the optimum in the limit if all initial priorities are non-zero
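A hedged sketch of prioritized sweeping (same assumed layout, plus an assumed predecessor map preds[s] giving the states with some action that can reach s). The priority rule here, a transition-weighted share of the observed change, is one possible estimate, not necessarily the one used in the original algorithm.

```python
import heapq

def bellman_backup(T, C, V, s):
    return min(sum(p * (C[s][a][s2] + V[s2]) for s2, p in T[s][a].items())
               for a in T[s])

def prioritized_sweeping(states, goals, T, C, preds, eps=1e-4):
    V = {s: 0.0 for s in states}
    prio = {s: 1.0 for s in states if s not in goals}   # non-zero initial priorities
    heap = [(-1.0, s) for s in prio]
    heapq.heapify(heap)
    while heap:
        neg_p, s = heapq.heappop(heap)
        if -neg_p != prio.get(s, 0.0):
            continue                                    # stale queue entry
        prio[s] = 0.0
        new_v = bellman_backup(T, C, V, s)
        change, V[s] = abs(new_v - V[s]), new_v
        if change > eps:
            for sp in preds.get(s, ()):
                if sp in goals:
                    continue
                # estimated change at sp: transition-weighted share of `change`
                est = change * max(T[sp][a].get(s, 0.0) for a in T[sp])
                if est > prio.get(sp, 0.0):
                    prio[sp] = est
                    heapq.heappush(heap, (-est, sp))
    return V
```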

  35. Generalized Prioritized Sweeping • Instead of estimating the residual, compute the exact value as the priority • First back up, then push the state onto the queue

  36. Improved Prioritized Sweeping • Low-V(s) states (closer to the goal) have higher priority initially • As the residuals of states near the goal shrink, the priority of other states increases

  37. Backward Value Iteration • Prioritized value iteration without a priority queue • Back up states in reverse order starting from the goal • No priority-queue overhead and good information flow
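A hedged sketch of the backup order only (same assumed predecessor map): states are visited breadth-first backward from the goals, each sweep backs them up in this order, and sweeps repeat until the residuals are small.

```python
from collections import deque

def backward_sweep_order(goals, preds):
    order, seen, frontier = [], set(goals), deque(goals)
    while frontier:
        s = frontier.popleft()
        for sp in preds.get(s, ()):
            if sp not in seen:
                seen.add(sp)
                order.append(sp)
                frontier.append(sp)
    return order                      # goal states themselves need no backups
```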

  38. Which priority algorithm to use? • Synchronous value iteration: when states are highly interconnected • Prioritized Sweeping / Generalized Prioritized Sweeping: sequential dependencies • Improved Prioritized Sweeping: a specific way to trade off proximity to the goal against information flow • Backward Value Iteration: better for domains with fewer predecessors

  39. Partitioned Value Iteration • Partition the state space • Stabilize the mutual dependencies within a partition before focusing attention on states in other partitions

  40. Benefits of Partitioning • External-memory algorithms • PEMVI • Cache-efficient algorithms • P-EVA algorithm • Parallelized algorithms • P3VI

  41. Linear Programming for MDPs • α(s) are the state-relevance weights • For an exact solution they are unimportant and can be set to any positive number (e.g. 1)
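The LP itself appears only as a figure in the original slides; the following is a standard formulation for cost-based SSPs, stated here as an assumption consistent with the α(s) weights above and the variable/constraint counts on the next slide.

```latex
\begin{align*}
\text{maximize}\quad   & \sum_{s \in S} \alpha(s)\, V(s) \\
\text{subject to}\quad & V(s) \;\le\; \sum_{s' \in S} T(s,a,s')\,\bigl[C(s,a,s') + V(s')\bigr]
                         && \forall\, s \in S \setminus G,\ \forall\, a \in A \\
                       & V(s) = 0 && \forall\, s \in G
\end{align*}
```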

  42. Linear Programming for MDPs • |S| variables • |S||A| constraints • Computing the exact solution is slower than value iteration • Better suited to approximation schemes where the value of a state is represented as a weighted sum of basis functions

  43. Infinite-Horizon Discounted-Reward MDPs • V*(s) = max_{a∈A} Σ_{s'∈S} T(s,a,s') [R(s,a,s') + γV*(s')] • Value iteration and policy iteration work even better than in the SSP setting • Policy iteration does not require a “proper” initial policy • Convergence is stronger and the bounds are tighter • The number of iterations can be bounded
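One standard way to see the iteration bound (a sketch of the well-known contraction argument for discounted MDPs, not reproduced from the slides): the discounted Bellman backup is a γ-contraction in the max norm, so

```latex
\| V_n - V^* \|_\infty \;\le\; \gamma^{\,n} \, \| V_0 - V^* \|_\infty
\qquad\Longrightarrow\qquad
n \;\ge\; \frac{\ln\!\bigl(\| V_0 - V^* \|_\infty / \epsilon\bigr)}{\ln(1/\gamma)}
\ \text{ iterations give } \ \| V_n - V^* \|_\infty \le \epsilon .
```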

  44. Finite-Horizon MDPs • V*(s,t) = 0 if t > L; otherwise V*(s,t) = max_{a∈A} Σ_{s'∈S} T(s,a,s') [R(s,a,s') + V*(s',t+1)] • Finite-horizon MDPs are acyclic • There exists an optimal backup order: t = T_max down to 0 • Returns optimal values (not just ϵ-consistent) • Performs one backup per augmented state
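A minimal sketch of this optimal backup order for a finite-horizon MDP with horizon L (assumed layout: T[s][a] maps successors to probabilities, R[s][a][s2] is the reward). One backup per augmented state (s, t), processed from t = L down to t = 0.

```python
def backward_induction(states, T, R, L):
    V = {(s, L + 1): 0.0 for s in states}       # beyond the horizon the value is 0
    pi = {}
    for t in range(L, -1, -1):
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in T[s]:
                q = sum(p * (R[s][a][s2] + V[(s2, t + 1)])
                        for s2, p in T[s][a].items())
                if q > best_q:
                    best_a, best_q = a, q
            V[(s, t)], pi[(s, t)] = best_q, best_a
    return V, pi
```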

  45. MDPs with Dead Ends • Dead-End State: a state s ∈ S such that no policy can reach the goal from s in any number of time steps • SSP MDPs cannot model domains with dead ends • If dead-end states are allowed, V*(s) is undefined at them and value iteration diverges there

  46. Finite-Penalty SSP MDPs with Dead-Ends • fSSPDE: a tuple <S, A, T, C, G, P> • S, A, T, C, G are the same as in an SSP MDP • P ∈ ℝ+ is the penalty incurred when the agent decides to abort the process in a non-goal state, under the condition: • For every improper stationary deterministic Markovian policy π, and every s ∈ S at which π is improper, the value of π at s under expected linear additive utility (without the option of stopping the process by paying the penalty) is infinite

  47. Comparison • Policy Iteration • Convergence dependent on initial policy being proper (unless IHDR) • Value Iteration • Doesn’t require initial proper policy • For IHDR has stronger error bounds upon reaching ϵ-consistency • Linear Programming • Computing exact solution is slower than value iteration

  48. Summary • Policy Iteration • Value Iteration • Prioritizing and Partitioning Value Iteration • Linear Programming (alternative solution) • Special Cases: • IHDR MDP and FH MDP • Dead-End States
