
MDP Exact Solutions II and Appl.

  • Jeffrey Chyan

  • Department of Computer Science

  • Rice University

  • Slides adapted from Mausam and Andrey Kolobov

Outline

  • Policy Iteration (3.3)

  • Value Iteration (3.4)

  • Prioritization in Value Iteration (3.5) / Partitioned Value Iteration (3.6)

  • Linear Programming Formulation (3.7)

  • Infinite-Horizon Discounted-Reward MDPs (3.8)

  • Finite-Horizon MDPs (3.9)

  • MDPs with Dead Ends (3.10)

Solving MDPs

  • Finding the best policy for MDPs

    • Policy Iteration

    • Value Iteration

    • Linear Programming

Recall SSP MDPs

  • Agent pays a cost to achieve goal

  • Exists at least one proper policy

  • Every improper policy incurs infinite cost from every state from which it does not reach the goal with probability 1

  • IHDR and FH ⊆ SSP

  • For this presentation, assume SSP unless stated otherwise

Recall Value and Evaluation

  • Value Function: maps the domain of a policy, excluding the action set, to a scalar value (for a Markovian policy, it maps each state to a scalar)

    • Value of a policy is the expected utility of the reward sequence from executing the policy

  • Policy Evaluation: Given a policy, compute the value function for each state

    • Solving system of equations

    • Iterative Approach


  • Find the best policy

  • Brute-force algorithm: given all policies are proper, enumerate all policies, evaluate them, and return the best one

    • Exponential number of policies, computationally intractable

  • Need a more intelligent search for best policy

The Q-Value Under a Value Function V

  • Q-value under a Value Function: the one-step lookahead computation of the value of taking an action a in state s

    • Under the belief that value function V is the true expected cost to reach a goal

    • Denoted as $Q^V(s,a)$

    • $Q^V(s,a) = \sum_{s' \in S} T(s,a,s')\,[C(s,a,s') + V(s')]$

Greedy Action/Policy

  • Action Greedy w.r.t. a Value Function: an action that has the lowest Q-value

    • $a = \arg\min_{a'} Q^V(s,a')$

  • Greedy Policy: a policy with all greedy actions w.r.t. V for each state
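
The Q-value and greedy-policy definitions above can be sketched in a few lines of Python. The dictionary encoding `mdp[s][a] = [(probability, successor, cost), ...]`, the toy states `s0`, `s1`, `g` and the function names are assumptions made for illustration; they do not come from the slides.

```python
# Hypothetical encoding (not from the slides):
# mdp[s][a] is a list of (probability, successor, cost) outcomes.
mdp = {
    "s0": {"a": [(1.0, "s1", 1.0)],
           "b": [(0.5, "g", 4.0), (0.5, "s0", 4.0)]},
    "s1": {"a": [(1.0, "g", 1.0)]},
}

def q_value(mdp, V, s, a):
    """One-step lookahead: sum over s' of T(s,a,s') * (C(s,a,s') + V(s'))."""
    return sum(p * (c + V[s2]) for p, s2, c in mdp[s][a])

def greedy_action(mdp, V, s):
    """An action with the lowest Q-value under V (first one wins on ties)."""
    return min(mdp[s], key=lambda a: q_value(mdp, V, s, a))

def greedy_policy(mdp, V):
    """A greedy action w.r.t. V for every non-goal state."""
    return {s: greedy_action(mdp, V, s) for s in mdp}

V = {"s0": 2.0, "s1": 1.0, "g": 0.0}  # the goal state g has value 0
qa = q_value(mdp, V, "s0", "a")
qb = q_value(mdp, V, "s0", "b")
pi = greedy_policy(mdp, V)
```

Here `V` is only a belief about the cost to reach the goal; the greedy policy is optimal only if that belief happens to equal the optimal value function.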

Policy Iteration

  • Initialize $\pi_0$ as a random proper policy

  • Repeat

    • Policy Evaluation: Compute $V^{\pi_{n-1}}$

    • Policy Improvement: Construct $\pi_n$ greedy w.r.t. $V^{\pi_{n-1}}$

  • Until $\pi_n = \pi_{n-1}$

  • Return $\pi_n$
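
A minimal Python sketch of this loop follows. It assumes a hypothetical encoding `mdp[s][a] = [(probability, successor, cost), ...]` over non-goal states, and it evaluates each policy with the iterative approach rather than by solving the linear system. The helper names and the toy MDP are illustrative only.

```python
# Hypothetical toy SSP: mdp[s][a] = [(probability, successor, cost), ...]
mdp = {
    "s0": {"a": [(1.0, "s1", 1.0)],
           "b": [(0.5, "g", 4.0), (0.5, "s0", 4.0)]},
    "s1": {"a": [(1.0, "g", 1.0)]},
}
goals = {"g"}

def q(mdp, V, s, a):
    return sum(p * (c + V[s2]) for p, s2, c in mdp[s][a])

def evaluate(mdp, goals, pi, tol=1e-10):
    """Iterative policy evaluation; terminates only if pi is proper."""
    V = {s: 0.0 for s in list(mdp) + list(goals)}
    while True:
        delta = 0.0
        for s in mdp:
            new = q(mdp, V, s, pi[s])
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        if delta < tol:
            return V

def policy_iteration(mdp, goals, pi0):
    pi = dict(pi0)
    while True:
        V = evaluate(mdp, goals, pi)          # policy evaluation
        new_pi = {}
        for s in mdp:                         # policy improvement
            best = min(mdp[s], key=lambda a: q(mdp, V, s, a))
            # Switch only on a strict improvement, so ties cannot cause cycling.
            improves = q(mdp, V, s, best) < q(mdp, V, s, pi[s]) - 1e-9
            new_pi[s] = best if improves else pi[s]
        if new_pi == pi:
            return pi, V
        pi = new_pi

# The initial policy is proper: action b reaches g with probability 1 eventually.
pi_star, V_star = policy_iteration(mdp, goals, {"s0": "b", "s1": "a"})
```

Keeping the current action on ties is a design choice of this sketch; without it, two equally good actions could alternate and the equality test would never succeed.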

Policy Improvement

  • Computes a greedy policy under $V^{\pi_{n-1}}$

    • First compute the Q-value of each action under $V^{\pi_{n-1}}$ in a given state

    • Then set $\pi_n(s)$ to a greedy action in that state

Properties of Policy Iteration

  • Policy Iteration for an SSP (initialized with a proper policy π0)

    • Successively improves the policy in each iteration

    • $V^{\pi_n}(s) \le V^{\pi_{n-1}}(s)$

    • Converges to an optimal policy

Modified Policy Iteration

  • Use iterative procedure for policy evaluation instead of system of equations

  • Use the final value function from the previous iteration, $V^{\pi_{n-1}}$, instead of an arbitrary value initialization $V_0^{\pi_n}$

Modified Policy Iteration

  • Initialize $\pi_0$ as a random proper policy

  • Repeat

    • Approximate Policy Evaluation: Compute $V^{\pi_{n-1}}$

      • by running only a few iterations of iterative policy evaluation

    • Policy Improvement: Construct $\pi_n$ greedy w.r.t. $V^{\pi_{n-1}}$

  • Until $\pi_n = \pi_{n-1}$

  • Return $\pi_n$
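
The modified loop can be sketched as below, again under an assumed encoding `mdp[s][a] = [(probability, successor, cost), ...]`. The parameter `sweeps` (the number of evaluation sweeps per round) and the `max_rounds` safety cap are hypothetical names, not part of the slides.

```python
# Hypothetical toy SSP: mdp[s][a] = [(probability, successor, cost), ...]
mdp = {
    "s0": {"a": [(1.0, "s1", 1.0)],
           "b": [(0.5, "g", 4.0), (0.5, "s0", 4.0)]},
    "s1": {"a": [(1.0, "g", 1.0)]},
}

def q(mdp, V, s, a):
    return sum(p * (c + V[s2]) for p, s2, c in mdp[s][a])

def modified_policy_iteration(mdp, pi0, V0, sweeps=3, max_rounds=1000):
    pi, V = dict(pi0), dict(V0)
    for _ in range(max_rounds):
        # Approximate evaluation: a few in-place sweeps, warm-started from
        # the value function left over from the previous round.
        for _ in range(sweeps):
            for s in mdp:
                V[s] = q(mdp, V, s, pi[s])
        # Improvement: greedy policy w.r.t. the approximate values.
        new_pi = {s: min(mdp[s], key=lambda a: q(mdp, V, s, a)) for s in mdp}
        if new_pi == pi:
            break
        pi = new_pi
    return pi, V

V0 = {"s0": 0.0, "s1": 0.0, "g": 0.0}
pi_m, V_m = modified_policy_iteration(mdp, {"s0": "b", "s1": "a"}, V0)
```

Because the evaluation is only approximate, an unchanged policy is a weaker stopping signal here than in exact policy iteration; a more careful variant might also check the residual of `V` before stopping.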

Limitations of Policy Iteration

  • Why do we need to start with a proper policy?

    • Otherwise the policy evaluation step will diverge, since an improper policy has infinite cost at some states

  • How to get a proper policy?

    • No domain-independent algorithm

  • Policy iteration for SSPs is not generally applicable

From Policy Iteration To Value Iteration

  • Search space changes

  • Policy Iteration

    • Search over policies

    • Compute the resulting value

  • Value Iteration

    • Search over values

    • Compute the resulting policy

Bellman Equations

  • Value Iteration based on set of Bellman equations

  • Bellman equations mathematically express the optimal solution of an MDP

  • Recursive expansion to compute optimal value function

Bellman Equations

  • Optimal Q-value of a state-action pair: the minimum expected cost to reach a goal starting in state s if the agent’s first action is a

    • Denoted as $Q^*(s,a)$

  • $V^*(s) = 0$ if $s \in G$; otherwise $V^*(s) = \min_{a \in A} Q^*(s,a)$

  • $Q^*(s,a) = \sum_{s' \in S} T(s,a,s')\,[C(s,a,s') + V^*(s')]$

    • Restatement of optimality principle for SSP MDPs

Bellman Equations

  • $Q^*(s,a) = \sum_{s' \in S} T(s,a,s')\,[C(s,a,s') + V^*(s')]$

    • Expected cost to first execute action a in state s, and then follow an optimal policy

  • $V^*(s) = 0$ if $s \in G$

    • Already at goal, don't need to take action

  • $V^*(s) = \min_{a \in A} Q^*(s,a)$ if $s \notin G$

    • Pick best action, minimize expected cost

  • The minimization over all actions makes the equations non-linear

Bellman Backup

  • Iterative refinement

    • Vn(s)  mina ∈ AƩs’∈SΤ(s,a,s’) [C(s,a,s’) + Vn-1(s’)]

    • Bellman Backup: computes a new value at state s by backing up the successor values V(s’)

      • Uses Vn(s)  mina ∈ AƩs’∈SΤ(s,a,s’) [C(s,a,s’) + Vn-1(s’)]
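
As an illustration, a single Bellman backup might look like this in Python. The encoding `mdp[s][a] = [(probability, successor, cost), ...]` and the toy MDP are assumptions for the sketch, not taken from the slides.

```python
# Hypothetical toy SSP: mdp[s][a] = [(probability, successor, cost), ...]
mdp = {
    "s0": {"a": [(1.0, "s1", 1.0)],
           "b": [(0.5, "g", 4.0), (0.5, "s0", 4.0)]},
    "s1": {"a": [(1.0, "g", 1.0)]},
}

def bellman_backup(mdp, V, s):
    """New value for s: the minimum over actions of the expected
    cost-plus-successor-value under the previous value function V."""
    return min(
        sum(p * (c + V[s2]) for p, s2, c in outcomes)
        for outcomes in mdp[s].values()
    )

V_prev = {"s0": 0.0, "s1": 0.0, "g": 0.0}
V_next = dict(V_prev)
for s in mdp:                       # one sweep over the non-goal states
    V_next[s] = bellman_backup(mdp, V_prev, s)
```

Reading from `V_prev` and writing to `V_next` keeps the sweep synchronous, matching the $V_{n-1}$ to $V_n$ indexing of the update.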

Value Iteration

No restriction on initial value function

Termination condition



  • All action costs are 1, except that $a_{40}$ costs 5 and $a_{41}$ costs 2

  • All $V_0$ values initialized as the distance from the goal

  • Order of backing up states is $s_0, s_1, \dots, s_4$

Example – First Iteration

  • $V_1(s_0) = \min\{1 + V_0(s_1),\ 1 + V_0(s_2)\} = 3$, with similar computations for $s_1$, $s_2$, $s_3$

  • $Q_1(s_4, a_{41}) = 2 + 0.6 \times 0 + 0.4 \times 2 = 2.8$ and $Q_1(s_4, a_{40}) = 5$

  • $V_1(s_4) = \min\{2.8, 5\} = 2.8$

VI - Convergence and Optimality

  • Value Iteration converges to the optimal value function in the limit, with no restriction on the initial value function

  • For an SSP MDP, $\forall s \in S$

    • $\lim_{n \to \infty} V_n(s) = V^*(s)$ irrespective of the initialization

VI - Termination

  • Residual at state s (Bellman Backup): the magnitude of the change in the value of state s if a Bellman backup is applied to V at s once

    • Denoted $Res^V(s)$

    • The residual $Res^V$ is the maximum residual across all states: $Res^V = \max_{s \in S} Res^V(s)$

  • ϵ-consistency: a state s is called ϵ-consistent w.r.t. a value function V if the residual at s w.r.t. V is less than ϵ

    • A value function V is ϵ-consistent if it is ϵ-consistent at all states

  • Terminate VI once all residuals are below ϵ, i.e. V is ϵ-consistent
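
Putting the backup and the termination test together, value iteration can be sketched as follows. The encoding `mdp[s][a] = [(probability, successor, cost), ...]`, the parameter name `eps` and the toy MDP are assumptions for illustration. The sweep here is synchronous; an in-place variant is equally common.

```python
# Hypothetical toy SSP: mdp[s][a] = [(probability, successor, cost), ...]
mdp = {
    "s0": {"a": [(1.0, "s1", 1.0)],
           "b": [(0.5, "g", 4.0), (0.5, "s0", 4.0)]},
    "s1": {"a": [(1.0, "g", 1.0)]},
}
goals = {"g"}

def value_iteration(mdp, goals, V0, eps=1e-6):
    V = dict(V0)
    for g in goals:
        V[g] = 0.0                      # goal states cost nothing
    while True:
        new_V = dict(V)
        residual = 0.0                  # max residual over all states
        for s in mdp:
            new_V[s] = min(
                sum(p * (c + V[s2]) for p, s2, c in outcomes)
                for outcomes in mdp[s].values()
            )
            residual = max(residual, abs(new_V[s] - V[s]))
        V = new_V
        if residual < eps:              # V was eps-consistent before this sweep
            return V

V_vi = value_iteration(mdp, goals, {"s0": 0.0, "s1": 0.0, "g": 0.0})
```

Any initial value function can be passed as `V0`; a closer initial guess simply means fewer sweeps before the residual test succeeds.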

VI - Running Time

  • Each Bellman backup:

    • Go over all actions and all successor states: $O(|S||A|)$

  • Each Iteration:

    • Back up all states: $O(|S|^2 |A|)$

  • Number of iterations:

    • General SSPs: non-trivial bounds don’t exist


  • For all n > k

    • $V_k \le_p V^* \Rightarrow V_n \le_p V^*$ ($V_n$ monotonic from below)

    • $V_k \ge_p V^* \Rightarrow V_n \ge_p V^*$ ($V_n$ monotonic from above)

  • If a value function $V_1$ is componentwise greater (or less) than another value function $V_2$, then the same inequality holds between $T(V_1)$ and $T(V_2)$, where T here denotes the Bellman backup operator

  • Bellman backup operator in VI is monotonic

Value Iteration to Asynchronous Value Iteration

  • Value iteration requires full sweeps of state space

    • It is not essential to back up all states in an iteration

  • Asynchronous value iteration requires additional restriction so that no state is starved and convergence holds

    • Termination condition checks if current value function is ϵ-consistent

Priority Backup

  • There are wasteful backups that occur in value iteration

  • Need to choose intelligent backup order and define a priority

  • Higher priority states backed up earlier

What State to Prioritize?

  • Avoid backing up a state when:

    • None of the successors of the state have had a change in value since the last backup

    • This means backing up the state will not change its value

Prioritized Sweeping

  • If a state’s value changes, prioritize its predecessors

  • Estimate the expected change in the value of a state if a backup were to be performed

  • Converges to the optimal value function in the limit if all initial priorities are non-zero
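
A simplified sketch of the idea, using Python's `heapq` as the priority queue, is shown below. It is not the exact priority rule of any published variant: as an assumption of this sketch, a predecessor's priority is just the size of the change at the state that was backed up, not weighted by transition probability. The encoding `mdp[s][a] = [(probability, successor, cost), ...]` and all names are hypothetical.

```python
import heapq

# Hypothetical toy SSP: mdp[s][a] = [(probability, successor, cost), ...]
mdp = {
    "s0": {"a": [(1.0, "s1", 1.0)],
           "b": [(0.5, "g", 4.0), (0.5, "s0", 4.0)]},
    "s1": {"a": [(1.0, "g", 1.0)]},
}
goals = {"g"}

def prioritized_sweeping(mdp, goals, eps=1e-6, max_backups=100000):
    V = {s: 0.0 for s in list(mdp) + list(goals)}
    # Predecessor map: preds[s2] holds states that can reach s2 in one step.
    preds = {}
    for s in mdp:
        for outcomes in mdp[s].values():
            for p, s2, _ in outcomes:
                if p > 0:
                    preds.setdefault(s2, set()).add(s)
    # heapq is a min-heap, so priorities are negated. Every state starts
    # with a non-zero priority so that none is starved.
    queue = [(-1.0, s) for s in mdp]
    heapq.heapify(queue)
    for _ in range(max_backups):
        if not queue:
            break
        _, s = heapq.heappop(queue)
        new = min(sum(p * (c + V[s2]) for p, s2, c in outcomes)
                  for outcomes in mdp[s].values())
        change = abs(new - V[s])
        V[s] = new
        if change > eps:                # value moved: revisit predecessors
            for sp in preds.get(s, ()):
                heapq.heappush(queue, (-change, sp))
    return V

V_ps = prioritized_sweeping(mdp, goals)
```

Duplicate queue entries for the same state are tolerated here for brevity; a production version would more likely update priorities in place.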

Generalized Prioritized Sweeping

  • Instead of estimating the residual, compute its exact value and use that as the priority

  • First backup, then push state in queue

Improved Prioritized Sweeping

  • Low V(s) states (closer to goal) are higher priority initially

  • As the residual shrinks for states closer to the goal, the priority of other states increases

Backward Value Iteration

  • Prioritized Value Iteration without priority queue

  • Backup states in reverse order starting from goal

  • No overhead of priority queue and good information flow

Which priority algorithm to use?

  • Synchronous Value Iteration: when states highly interconnected

  • Prioritized Sweeping/Generalized Prioritized Sweeping: sequential dependencies

  • Improved Prioritized Sweeping: specific way to trade off proximity to goal against information flow

  • Backward Value Iteration: better for domains with fewer predecessors

Partitioned Value Iteration

  • Partition the state space

  • Need to stabilize mutual co-dependencies before focusing attention on states in other partitions

Benefits of Partitioning

  • External-memory algorithms

    • PEMVI

  • Cache-efficient algorithms

    • P-EVA algorithm

  • Parallelized algorithms

    • P3VI

Linear Programming for MDPs

  • α(s) are the state-relevance weights

    • For exact solution, unimportant and can be set to positive number (e.g. 1)

Linear Programming for MDPs

  • |S| variables

  • |S| |A| constraints

  • Computing exact solution is slower than value iteration

  • Better for specific kind of approximation where the value of a state is approximated with a sum of basis functions
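
The linear program itself did not survive in this transcript. A standard formulation for SSP MDPs that matches the counts above (one variable $V(s)$ per state, one inequality per state-action pair, weights $\alpha(s)$) can be sketched as follows; it is a reconstruction under that assumption and may differ in detail from the original slide.

```latex
\begin{aligned}
\text{maximize}   \quad & \sum_{s \in S} \alpha(s)\, V(s) \\
\text{subject to} \quad & V(s) = 0
                          && \forall s \in G \\
                        & V(s) \le \sum_{s' \in S} T(s,a,s')\,[C(s,a,s') + V(s')]
                          && \forall s \in S \setminus G,\ a \in A
\end{aligned}
```

Each inequality says that $V(s)$ cannot exceed the Q-value of any action, so maximizing the weighted sum pushes every $V(s)$ up to the minimum over actions, which is the Bellman equation.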

Infinite-Horizon Discounted-Reward MDPs

  • $V^*(s) = \max_{a \in A} \sum_{s' \in S} T(s,a,s')\,[R(s,a,s') + \gamma V^*(s')]$

  • Value Iteration and Policy Iteration work even better than for SSPs

    • Policy Iteration does not require a “proper” policy

    • Convergence stronger, bounds tighter

    • Can bound number of iterations
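
For comparison with the SSP case, a discounted-reward value iteration might be sketched like this. The encoding `mdp[s][a] = [(probability, successor, reward), ...]`, the toy MDP and the function name are assumptions for illustration.

```python
# Hypothetical toy IHDR MDP: mdp[s][a] = [(probability, successor, reward), ...]
mdp = {
    "s0": {"stay": [(1.0, "s0", 1.0)],
           "go":   [(1.0, "s1", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)]},
}

def discounted_value_iteration(mdp, gamma, eps=1e-6):
    V = {s: 0.0 for s in mdp}           # no goal states, no proper-policy assumption
    while True:
        new_V = {
            s: max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in mdp[s].values()
            )
            for s in mdp
        }
        residual = max(abs(new_V[s] - V[s]) for s in mdp)
        V = new_V
        if residual < eps:
            return V

V_d = discounted_value_iteration(mdp, gamma=0.9)
```

In this toy problem staying in `s1` forever is worth $2 / (1 - 0.9) = 20$, so moving from `s0` to `s1` is worth $0.9 \times 20 = 18$, which beats staying in `s0` (worth 10). The discount factor makes each sweep a contraction, which is where the iteration bounds come from.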

Finite-Horizon MDPs

  • $V^*(s,t) = 0$ if $t > L$; otherwise $V^*(s,t) = \max_{a \in A} \sum_{s' \in S} T(s,a,s')\,[R(s,a,s') + V^*(s',t+1)]$

  • Finite-Horizon MDPs are acyclic

    • There exists an optimal backup order

      • $t = T_{max}$ down to 0

    • Returns optimal values (not just ϵ-consistent)

    • Performs one backup per augmented state
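
The backward backup order can be sketched as a single pass over augmented states `(s, t)`. The encoding `mdp[s][a] = [(probability, successor, reward), ...]`, the toy MDP and the argument name `horizon` (playing the role of the last time step) are assumptions for illustration.

```python
# Hypothetical toy MDP: mdp[s][a] = [(probability, successor, reward), ...]
mdp = {
    "s0": {"stay": [(1.0, "s0", 1.0)],
           "go":   [(1.0, "s1", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)]},
}

def finite_horizon_values(mdp, horizon):
    # Values past the horizon are zero.
    V = {(s, horizon + 1): 0.0 for s in mdp}
    # Back up from the last time step to the first: each augmented state
    # (s, t) depends only on states at t + 1, so one backup each suffices.
    for t in range(horizon, -1, -1):
        for s in mdp:
            V[(s, t)] = max(
                sum(p * (r + V[(s2, t + 1)]) for p, s2, r in outcomes)
                for outcomes in mdp[s].values()
            )
    return V

V_fh = finite_horizon_values(mdp, horizon=2)
```

With three decision steps (t = 0, 1, 2), staying in `s1` collects 6, while starting in `s0` the best plan is to move to `s1` first and collect 4.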

MDPs with Dead Ends

  • Dead-End State: a state s in S such that no policy can reach the goal from s in any number of time steps

  • SSP MDPs cannot model domains with dead ends

  • If dead-end states are allowed, $V^*(s)$ is not defined (it is infinite) at a dead-end state, and value iteration diverges there

Finite-Penalty SSP MDPs with Dead-Ends

  • fSSPDE: a tuple <S, A, T, C, G, P>

    • S, A, T, C, G are same as in SSP MDP

    • P ∈ ℝ+ denotes the penalty incurred when an agent decides to abort the process in a non-goal state, under the condition:

      • For every improper stationary deterministic Markovian policy π, for every s ∈ S where π is improper, the value of π at s under the expected linear additive utility without the possibility of stopping the process by paying the penalty is infinite
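
One way to see the effect of the penalty is to cap each Bellman backup at P, i.e. to assume the update $V(s) \leftarrow \min\{P,\ \min_{a \in A} \sum_{s' \in S} T(s,a,s')\,[C(s,a,s') + V(s')]\}$ for non-goal states. That capped update is an assumption of this sketch rather than something stated on the slide, as are the encoding `mdp[s][a] = [(probability, successor, cost), ...]` and the toy MDP with dead end `d`.

```python
# Hypothetical toy problem with a dead end d (no action ever leaves it).
mdp = {
    "s0": {"safe":  [(1.0, "g", 3.0)],
           "risky": [(0.5, "g", 1.0), (0.5, "d", 1.0)]},
    "d":  {"loop":  [(1.0, "d", 1.0)]},
}
goals = {"g"}

def value_iteration_with_penalty(mdp, goals, penalty, eps=1e-6):
    V = {s: 0.0 for s in list(mdp) + list(goals)}
    while True:
        new_V = dict(V)
        for s in mdp:
            best = min(
                sum(p * (c + V[s2]) for p, s2, c in outcomes)
                for outcomes in mdp[s].values()
            )
            # Aborting and paying the penalty is always available.
            new_V[s] = min(penalty, best)
        residual = max(abs(new_V[s] - V[s]) for s in mdp)
        V = new_V
        if residual < eps:
            return V

V_p = value_iteration_with_penalty(mdp, goals, penalty=10.0)
```

Without the cap, the value of `d` would grow by 1 on every sweep and never converge; with it, `d` settles at the penalty and the risky action in `s0` is priced accordingly.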


  • Policy Iteration

    • Convergence dependent on initial policy being proper (unless IHDR)

  • Value Iteration

    • Doesn’t require initial proper policy

    • For IHDR has stronger error bounds upon reaching ϵ-consistency

  • Linear Programming

    • Computing exact solution is slower than value iteration


  • Policy Iteration

  • Value Iteration

  • Prioritizing and Partitioning Value Iteration

  • Linear Programming (alternative solution)

  • Special Cases:

    • IHDR MDP and FH MDP

  • Dead-End States