
MDP Exact Solutions II and Appl. - PowerPoint PPT Presentation





Presentation Transcript

• Jeffrey Chyan

• Department of Computer Science

• Rice University

• Slides adapted from Mausam and Andrey Kolobov

• Policy Iteration (3.3)

• Value Iteration (3.4)

• Prioritization in Value Iteration (3.5) / Partitioned Value Iteration (3.6)

• Linear Programming Formulation (3.7)

• Infinite-Horizon Discounted-Reward MDPs (3.8)

• Finite-Horizon MDPs (3.9)

• MDPs with Dead Ends (3.10)

• Finding the best policy for MDPs

• Policy Iteration

• Value Iteration

• Linear Programming

• Agent pays a cost to achieve goal

• Exists at least one proper policy

• Every improper policy incurs an infinite cost from every state from which it does not reach the goal with probability 1

• IHDR and FH ⊆ SSP

• For this presentation, assume SSP unless stated otherwise

• Value Function: maps each state in the domain of a policy to a scalar value

• Value of a policy is the expected utility of the reward sequence from executing the policy

• Policy Evaluation: Given a policy, compute the value function for each state

• Solving system of equations

• Iterative Approach
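The iterative approach to policy evaluation can be sketched as repeated sweeps of the fixed-policy backup. The two-state MDP below (state names, transition probabilities, and the unit cost) is a made-up example, not one from the slides.

```python
# Iterative policy evaluation for an SSP on a hypothetical two-state MDP.
# T[s][a] is a list of (next_state, probability); every action costs 1.
T = {
    "s0": {"a": [("s1", 0.5), ("g", 0.5)]},
    "s1": {"a": [("g", 1.0)]},
}
COST = 1.0
GOAL = "g"

def evaluate_policy(policy, n_iters=1000):
    """Repeatedly apply V(s) <- sum_s' T(s,pi(s),s') [C + V(s')]."""
    V = {s: 0.0 for s in T}
    V[GOAL] = 0.0  # goal states have zero cost-to-go
    for _ in range(n_iters):
        for s in T:
            a = policy[s]
            V[s] = sum(p * (COST + V[s2]) for s2, p in T[s][a])
    return V

V = evaluate_policy({"s0": "a", "s1": "a"})
```

Solving the linear system directly would give the same values; the iterative sweeps simply converge to that solution.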

• Find the best policy

• Brute-force algorithm: given all policies are proper, enumerate all policies, evaluate them, and return the best one

• Exponential number of policies, computationally intractable

• Need a more intelligent search for best policy

• Q-value under a Value Function: the one-step lookahead computation of the value of taking an action a in state s

• Under the belief that value function V is the true expected cost to reach a goal

• Denoted as QV(s,a)

• QV(s,a) = Σs'∈S T(s,a,s') [C(s,a,s') + V(s')]

• Action Greedy w.r.t. a Value Function: an action that has the lowest Q-value

• a = argmina' QV(s,a')

• Greedy Policy: a policy with all greedy actions w.r.t. V for each state
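The Q-value and greedy-action definitions above translate directly into code. The one-state, two-action MDP below and its costs are illustrative assumptions.

```python
# Q-values and the greedy action w.r.t. a given value function V.
# Hypothetical MDP: "safe" goes straight to the goal at cost 2,
# "risky" reaches the goal only half the time at cost 1.
T = {"s0": {"safe":  [("g", 1.0)],
            "risky": [("g", 0.5), ("s0", 0.5)]}}
C = {"safe": 2.0, "risky": 1.0}  # here the cost depends only on the action
V = {"s0": 3.0, "g": 0.0}

def q_value(s, a):
    """Q^V(s,a) = sum_s' T(s,a,s') [C(s,a,s') + V(s')]."""
    return sum(p * (C[a] + V[s2]) for s2, p in T[s][a])

def greedy_action(s):
    """An action greedy w.r.t. V: argmin_a' Q^V(s,a')."""
    return min(T[s], key=lambda a: q_value(s, a))
```

With V(s0) = 3, the risky action's lookahead value already exceeds the safe action's, so the greedy choice is "safe".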

• Initialize π0 as a random proper policy

• Repeat

• Policy Evaluation: Compute Vπn-1

• Policy Improvement: Construct πn greedy w.r.t. Vπn-1

• Until πn == πn-1

• Return πn
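The loop above can be sketched as follows; the two-state SSP, its transition numbers, and the proper initial policy are all made-up illustrations.

```python
# Policy iteration for an SSP: evaluate the current policy, improve
# greedily, stop when the policy no longer changes. Toy MDP.
T = {
    "s0": {"left":  [("s1", 1.0)],
           "right": [("g", 0.6), ("s0", 0.4)]},
    "s1": {"go":    [("g", 1.0)]},
}
COST = 1.0

def evaluate(policy, iters=2000):
    """Policy evaluation by iterative sweeps."""
    V = {"s0": 0.0, "s1": 0.0, "g": 0.0}
    for _ in range(iters):
        for s in T:
            V[s] = sum(p * (COST + V[s2]) for s2, p in T[s][policy[s]])
    return V

def policy_iteration():
    policy = {"s0": "left", "s1": "go"}  # assumed proper initial policy
    while True:
        V = evaluate(policy)             # policy evaluation
        new = {s: min(T[s], key=lambda a: sum(
            p * (COST + V[s2]) for s2, p in T[s][a])) for s in T}
        if new == policy:                # converged: greedy policy unchanged
            return policy, V
        policy = new                     # policy improvement

policy, V = policy_iteration()
```

On this toy MDP the initial "left" policy (expected cost 2 from s0) is improved to "right" (expected cost 1/0.6 ≈ 1.67).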

• Computes a greedy policy under Vπn-1

• First compute the Q-value of each action under Vπn-1 in a given state

• Then assign a greedy action in the state as πn

• Policy Iteration for an SSP (initialized with a proper policy π0)

• Successively improves the policy in each iteration

• Vπn(s) ≤ Vπn-1(s)

• Converges to an optimal policy

• Use iterative procedure for policy evaluation instead of system of equations

• Use final value function from the previous iteration Vπn-1 instead of arbitrary value initialization V0πn

• Initialize π0 as a random proper policy

• Repeat

• Approximate Policy Evaluation: Compute Vπn-1

• by running only a few iterations of iterative policy evaluation

• Policy Improvement: Construct πn greedy w.r.t. Vπn-1

• Until πn == πn-1

• Return πn
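Modified policy iteration replaces exact evaluation with a few backup sweeps warm-started from the previous value function. A sketch on a made-up two-state SSP (names and numbers are illustrative):

```python
# Modified policy iteration: V is carried across iterations and each
# evaluation runs only k sweeps instead of converging fully.
T = {"s0": {"left": [("s1", 1.0)], "right": [("g", 0.6), ("s0", 0.4)]},
     "s1": {"go": [("g", 1.0)]}}
COST = 1.0

def q(V, s, a):
    return sum(p * (COST + V[s2]) for s2, p in T[s][a])

def modified_policy_iteration(k=5, max_rounds=100):
    policy = {"s0": "left", "s1": "go"}   # assumed proper initial policy
    V = {"s0": 0.0, "s1": 0.0, "g": 0.0}  # reused, not reinitialized
    for _ in range(max_rounds):
        for _ in range(k):                # approximate policy evaluation
            for s in T:
                V[s] = q(V, s, policy[s])
        new = {s: min(T[s], key=lambda a: q(V, s, a)) for s in T}
        if new == policy:
            return policy, V
        policy = new                      # policy improvement
    return policy, V

policy, V = modified_policy_iteration()
```

Because evaluation is truncated, the returned values are only approximately V^π, but the policy still improves between rounds.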

• If the initial policy is improper, the policy evaluation step will diverge

• How to get a proper policy?

• No domain independent algorithm

• Policy iteration for SSPs is not generically applicable

• Search space changes

• Policy Iteration

• Search over policies

• Compute the resulting value

• Value Iteration

• Search over values

• Compute the resulting policy

• Value Iteration based on set of Bellman equations

• Bellman equations mathematically express the optimal solution of an MDP

• Recursive expansion to compute optimal value function

• Optimal Q-value of a state-action pair: the minimum expected cost to reach a goal starting in state s if the agent’s first action is a

• Denoted as Q*(s,a)

• V*(s) = 0 if s ∈ G; V*(s) = mina∈A Q*(s,a) if s ∉ G

• Q*(s,a) = Σs'∈S T(s,a,s') [C(s,a,s') + V*(s')]

• Restatement of optimality principle for SSP MDPs

Expected cost to first execute action in state, and then follow optimal policy

• Q*(s,a) = Σs'∈S T(s,a,s') [C(s,a,s') + V*(s')]

• V*(s) = 0 if s ∈ G; V*(s) = mina∈A Q*(s,a) if s ∉ G

• The minimization over all actions makes the equations non-linear

Already at goal, don’t need to take action

Pick best action, minimize expected cost

• Iterative refinement

• Vn(s) ← mina∈A Σs'∈S T(s,a,s') [C(s,a,s') + Vn-1(s')]

• Bellman Backup: computes a new value at state s by backing up the successor values V(s’)

• Uses Vn(s) ← mina∈A Σs'∈S T(s,a,s') [C(s,a,s') + Vn-1(s')]
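The Bellman backup and the full value iteration loop can be sketched as below; the two-state SSP and the ϵ threshold are illustrative assumptions.

```python
# Value iteration: full sweeps of Bellman backups until every state's
# residual (change under one more backup) falls below epsilon.
T = {"s0": {"a": [("s1", 0.5), ("s0", 0.5)]},
     "s1": {"b": [("g", 1.0)]}}
COST = 1.0

def value_iteration(eps=1e-6):
    V = {"s0": 0.0, "s1": 0.0, "g": 0.0}  # any initialization works
    while True:
        residual = 0.0
        for s in T:
            # Bellman backup: back up the successor values V(s')
            new = min(sum(p * (COST + V[s2]) for s2, p in T[s][a])
                      for a in T[s])
            residual = max(residual, abs(new - V[s]))
            V[s] = new
        if residual < eps:   # epsilon-consistency reached
            return V

V = value_iteration()
```

Here s0 loops back on itself with probability 0.5, so its optimal expected cost is 3 (the fixed point of V = 1.5 + 0.5 V).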

No restriction on initial value function

Termination condition

ϵ-consistency

• All costs are 1 except for a40=5 and a41=2

• All V0 values initialized as the distance from the goal

• Order of back up states is s0, s1, …, s4

• V1(s0) = min{1+V0(s1), 1+V0(s2)} = 3, similar computations for s1, s2, s3

• Q1(s4,a41) = 2+0.6*0+0.4*2 = 2.8 and Q1(s4,a40) = 5

• V1(s4) = min{2.8, 5} = 2.8
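The backup at s4 can be checked with a few lines of arithmetic. The full MDP is not specified on the slide, so the successor structure of a41 (probability 0.6 to the goal, 0.4 to a state whose V0 is 2) is inferred from the slide's arithmetic and is an assumption; likewise a40 is assumed to lead directly to the goal.

```python
# Reproducing Q1(s4, a41) = 2 + 0.6*0 + 0.4*2 and Q1(s4, a40) = 5.
V0_goal = 0.0   # goal value
V0_succ = 2.0   # assumed V0 of the non-goal successor of a41
Q_a41 = 2 + 0.6 * V0_goal + 0.4 * V0_succ   # action cost 2
Q_a40 = 5 + 1.0 * V0_goal                   # action cost 5, assumed direct to goal
V1_s4 = min(Q_a41, Q_a40)                   # Bellman backup at s4
```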

• Value Iteration converges to the optimal value in the limit without restrictions

• For an SSP MDP, ∀s ∈ S

• limn∞ Vn(s) = V*(s) irrespective of the initialization

• Residual at state s (Bellman Backup): the magnitude of the change in the value of state s if Bellman backup is applied to V at s once

• Denoted ResV(s)

• Residual, ResV, is the maximum residual across all states

• ϵ-consistency: a state s is called ϵ-consistent w.r.t. a value function V if the residual at s w.r.t. V is less than ϵ

• A value function V is ϵ-consistent if it is ϵ-consistent at all states

• Terminate VI if all residuals are small
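The residual and ϵ-consistency definitions above can be written directly as helper functions. The one-state MDP is a toy assumption.

```python
# Res_V(s): how much one Bellman backup at s would change V(s).
T = {"s0": {"a": [("g", 1.0)]}}
COST = 1.0

def backup(V, s):
    """One Bellman backup at s under value function V."""
    return min(sum(p * (COST + V[s2]) for s2, p in T[s][a]) for a in T[s])

def residual(V, s):
    """Res_V(s) = |backup(V, s) - V(s)|."""
    return abs(backup(V, s) - V[s])

def eps_consistent(V, eps):
    """V is eps-consistent if the max residual over all states is < eps."""
    return max(residual(V, s) for s in T) < eps

V_bad  = {"s0": 0.0, "g": 0.0}  # one backup would move V(s0) to 1
V_good = {"s0": 1.0, "g": 0.0}  # already a fixed point
```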

• Each Bellman backup:

• Go over all states and all successors: O(|S||A|)

• Each Iteration:

• Backup all states: O(|S|2|A|)

• Number of iterations:

• General SSPs: non-trivial bounds don’t exist

• For all n > k

• Vk ≤p V* ⇒ Vn ≤p V* (Vn monotonic from below)

• Vk ≥p V* ⇒ Vn ≥p V* (Vn monotonic from above)

• If a value function V1 is componentwise greater (or less) than another value function V2, then the same inequality holds true between T(V1) and T(V2)

• Bellman backup operator in VI is monotonic

• Value iteration requires full sweeps of state space

• It is not essential to back up all states in an iteration

• Asynchronous value iteration requires additional restriction so that no state is starved and convergence holds

• Termination condition checks if current value function is ϵ-consistent

• There are wasteful backups that occur in value iteration

• Need to choose intelligent backup order and define a priority

• Higher priority states backed up earlier

• Avoid backing up a state when:

• None of the successors of the state have had a change in value since the last backup

• This means backing up the state will not change its value

• If a state’s value changes, prioritize its predecessors

• Estimate the expected change in the value of a state if a backup were to be performed

• Converges to optimal in limit if all initial priorities are non-zero

• Instead of estimating residual, compute exact value as priority

• First backup, then push state in queue

• Low V(s) states (closer to goal) are higher priority initially

• As the residual shrinks for states closer to the goal, the priority of other states increases

• Prioritized Value Iteration without priority queue

• Backup states in reverse order starting from goal

• No overhead of priority queue and good information flow

• Synchronous Value Iteration: best when states are highly interconnected

• Prioritized Sweeping / Generalized Prioritized Sweeping: best for sequential dependencies

• Improved Prioritized Sweeping: a specific way to trade off proximity to the goal against information flow

• Backward Value Iteration: better for domains with fewer predecessors
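A minimal prioritized-sweeping-style loop: when a state's value changes, its predecessors are pushed onto a priority queue with priority equal to the size of the change. The MDP, predecessor map, and initial priorities below are toy assumptions.

```python
import heapq

T = {"s0": {"a": [("s1", 1.0)]},
     "s1": {"a": [("g", 1.0)]}}
COST = 1.0
PRED = {"s0": [], "s1": ["s0"], "g": ["s1"]}  # predecessor map

def prioritized_vi(eps=1e-6):
    V = {"s0": 0.0, "s1": 0.0, "g": 0.0}
    # heapq is a min-heap, so priorities are negated; all initial
    # priorities are non-zero, as convergence requires.
    heap = [(-1.0, s) for s in T]
    heapq.heapify(heap)
    while heap:
        _, s = heapq.heappop(heap)       # highest-priority state first
        new = min(sum(p * (COST + V[s2]) for s2, p in T[s][a])
                  for a in T[s])
        change = abs(new - V[s])
        V[s] = new
        if change > eps:                 # value changed: requeue predecessors
            for pred in PRED.get(s, []):
                heapq.heappush(heap, (-change, pred))
    return V

V = prioritized_vi()
```

States whose successors have not changed are never re-pushed, which is exactly the wasteful backup this scheme avoids.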

• Partition the state space

• Need to stabilize mutual co-dependencies before focusing attention on states in other partitions

• External-memory algorithms

• PEMVI

• Cache-efficient algorithms

• P-EVA algorithm

• Parallelized algorithms

• P3VI

• α(s) are the state-relevance weights

• For an exact solution, the weights are unimportant and can be set to any positive number (e.g., 1)

• |S| variables

• |S| |A| constraints

• Computing exact solution is slower than value iteration

• Better for specific kind of approximation where the value of a state is approximated with a sum of basis functions
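The LP above has one variable V(s) per state and one constraint V(s) ≤ Q^V(s,a) per state-action pair, with objective maximize Σ α(s) V(s); its optimum is V*. Rather than calling an LP solver, the sketch below computes V* by value iteration and checks numerically that it is feasible for every constraint and tight at some action in every state, which is why the LP recovers it. The toy MDP is an assumption.

```python
T = {"s0": {"a": [("s1", 1.0)], "b": [("g", 0.5), ("s0", 0.5)]},
     "s1": {"a": [("g", 1.0)]}}
COST = 1.0

def q(V, s, a):
    """Q^V(s,a) = sum_s' T(s,a,s') [C + V(s')]."""
    return sum(p * (COST + V[s2]) for s2, p in T[s][a])

# Compute V* by value iteration.
V = {"s0": 0.0, "s1": 0.0, "g": 0.0}
for _ in range(200):
    for s in T:
        V[s] = min(q(V, s, a) for a in T[s])

n_vars = len(T)                       # |S| variables (goal fixed at 0)
n_cons = sum(len(T[s]) for s in T)    # |S||A| constraints
feasible = all(V[s] <= q(V, s, a) + 1e-9 for s in T for a in T[s])
tight = all(any(abs(V[s] - q(V, s, a)) < 1e-6 for a in T[s]) for s in T)
```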

• V*(s) = maxa∈A Σs'∈S T(s,a,s') [R(s,a,s') + γV*(s')]

• Value Iteration and Policy Iteration work even better than for SSPs

• Policy Iteration does not require a “proper” policy

• Convergence stronger, bounds tighter

• Can bound number of iterations

• V*(s,t) = 0 if t > L; otherwise V*(s,t) = maxa∈A Σs'∈S T(s,a,s') [R(s,a,s') + V*(s',t+1)]

• Finite-Horizon MDPs are acyclic

• There exists an optimal backup order

• t = Tmax to 0

• Returns optimal values (not just ϵ-consistent)

• Performs one backup per augmented state
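Because the augmented states (s, t) are acyclic in t, sweeping t from the horizon down to 0 with one backup per augmented state yields exact optimal values. A sketch on a made-up two-state reward MDP (states, rewards, and horizon are all illustrative):

```python
# Backward induction for a finite-horizon MDP.
T = {"s0": {"stay": [("s0", 1.0)], "jump": [("s1", 1.0)]},
     "s1": {"stay": [("s1", 1.0)], "jump": [("s0", 1.0)]}}
R = {"s0": 0.0, "s1": 1.0}   # reward for landing in a state
HORIZON = 3

def backward_induction():
    # V[t][s] = optimal expected reward-to-go from s at time step t
    V = {HORIZON: {s: 0.0 for s in T}}   # nothing left to earn at the horizon
    for t in range(HORIZON - 1, -1, -1):  # sweep t from Tmax down to 0
        V[t] = {}
        for s in T:
            V[t][s] = max(sum(p * (R[s2] + V[t + 1][s2])
                              for s2, p in T[s][a]) for a in T[s])
    return V

V = backward_induction()
```

No ϵ threshold is involved: each (s, t) is backed up exactly once, in the optimal order.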

• Dead-End State: a state s in S such that no policy can reach the goal from s in any number of time steps

• SSP MDPs unable to model domains with dead ends

• If dead-end states are allowed, V*(s) is undefined for dead-end states and value iteration diverges there

• fSSPDE: a tuple <S, A, T, C, G, P>

• S, A, T, C, G are same as in SSP MDP

• P ∈ ℝ+ denotes the penalty incurred when an agent decides to abort the process in a non-goal state, under the condition:

• For every improper stationary deterministic Markovian policy π, for every s ∈ S where π is improper, the value of π at s under the expected linear additive utility without the possibility of stopping the process by paying the penalty is infinite
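One natural way to solve an fSSPDE (a sketch, not necessarily the algorithm the source has in mind) is to cap every backup at the penalty P, modeling the option to stop: V(s) ← min(P, mina Q(s,a)). At a dead end the inner minimum grows without bound, so V settles at P. The toy MDP below, with dead end "d", is an assumption.

```python
# Value iteration for an fSSPDE with stop-and-pay-penalty option.
T = {"s0": {"go":   [("g", 0.9), ("d", 0.1)]},
     "d":  {"spin": [("d", 1.0)]}}   # "d" is a dead end
COST = 1.0
P = 25.0   # penalty for aborting in a non-goal state

def fsspde_vi(iters=500):
    V = {"s0": 0.0, "d": 0.0, "g": 0.0}
    for _ in range(iters):
        for s in T:
            q_best = min(sum(p * (COST + V[s2]) for s2, p in T[s][a])
                         for a in T[s])
            V[s] = min(P, q_best)   # capping prevents divergence at dead ends
    return V

V = fsspde_vi()
```

The cap is what restores a well-defined optimal value function where plain SSP value iteration would diverge.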

• Policy Iteration

• Convergence dependent on initial policy being proper (unless IHDR)

• Value Iteration

• Doesn’t require initial proper policy

• For IHDR has stronger error bounds upon reaching ϵ-consistency

• Linear Programming

• Computing exact solution is slower than value iteration

• Policy Iteration

• Value Iteration

• Prioritizing and Partitioning Value Iteration

• Linear Programming (alternative solution)

• Special Cases:

• IHDR MDP and FH MDP