
MDP Exact Solutions II and Appl.

- Jeffrey Chyan
- Department of Computer Science
- Rice University
- Slides adapted from Mausam and Andrey Kolobov

Outline

- Policy Iteration (3.3)
- Value Iteration (3.4)
- Prioritization in Value Iteration (3.5) / Partitioned Value Iteration (3.6)
- Linear Programming Formulation (3.7)
- Infinite-Horizon Discounted-Reward MDPs (3.8)
- Finite-Horizon MDPs (3.9)
- MDPs with Dead Ends (3.10)

Solving MDPs

- Finding the best policy for MDPs
- Policy Iteration
- Value Iteration
- Linear Programming

Recall SSP MDPs

- Agent pays a cost to achieve goal
- Exists at least one proper policy
- Every improper policy incurs an infinite cost from every state from which it does not reach the goal with probability 1
- IHDR and FH ⊆ SSP
- For this presentation, assume SSP unless stated otherwise

Recall Value and Evaluation

- Value Function: maps each element of a policy's domain (excluding the action set) to a scalar value
- Value of a policy is the expected utility of the reward sequence from executing the policy

- Policy Evaluation: Given a policy, compute the value function for each state
- Solving system of equations
- Iterative Approach
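
As a concrete illustration of the iterative approach, here is a minimal Python sketch of policy evaluation. The representation is an assumption, not something given on the slides: `T[(s, a)]` is a list of `(next_state, probability)` pairs, `C[(s, a, s2)]` is the transition cost, `G` is the set of goal states, and `policy[s]` gives the policy's action in `s`.

```python
def evaluate_policy(states, policy, T, C, G, max_iters=10000, eps=1e-8):
    """Iterative policy evaluation for a fixed (proper) policy:
    repeatedly apply V(s) <- sum_{s'} T(s,pi(s),s') [C(s,pi(s),s') + V(s')]."""
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        max_change = 0.0
        for s in states:
            if s in G:                       # goal states have value 0
                continue
            a = policy[s]
            new_v = sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])
            max_change = max(max_change, abs(new_v - V[s]))
            V[s] = new_v
        if max_change < eps:                 # values have (approximately) converged
            break
    return V
```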

Motivation

- Find the best policy
- Brute-force algorithm: assuming all policies are proper, enumerate all policies, evaluate them, and return the best one
- Exponential number of policies, computationally intractable

- Need a more intelligent search for best policy

The Q-Value Under a Value Function V

- Q-value under a Value Function: the one-step lookahead computation of the value of taking an action a in state s
- Under the belief that value function V is the true expected cost to reach a goal
- Denoted Q^V(s,a)
- Q^V(s,a) = Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V(s')]

Greedy Action/Policy

- Action Greedy w.r.t. a Value Function: an action that has the lowest Q-value
- a = argmin_{a'} Q^V(s,a')

- Greedy Policy: a policy with all greedy actions w.r.t. V for each state
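
A sketch of the Q-value and greedy-policy computations, under the same assumed dictionary representation as above, plus an assumed action set `actions`:

```python
def q_value(V, s, a, T, C):
    """One-step lookahead: Q^V(s,a) = sum_{s'} T(s,a,s') [C(s,a,s') + V(s')]."""
    return sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])

def greedy_action(V, s, actions, T, C):
    """Action greedy w.r.t. V: the action with the lowest Q-value in s."""
    return min(actions, key=lambda a: q_value(V, s, a, T, C))

def greedy_policy(V, states, actions, T, C, G):
    """Greedy policy: a greedy action for every non-goal state."""
    return {s: greedy_action(V, s, actions, T, C) for s in states if s not in G}
```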

Policy Iteration

- Initialize π_0 as a random proper policy
- Repeat
- Policy Evaluation: compute V^{π_{n-1}}
- Policy Improvement: construct π_n greedy w.r.t. V^{π_{n-1}}

- Until π_n == π_{n-1}
- Return π_n
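
Putting the previous sketches together, the policy iteration loop might look like this (again a sketch, reusing the assumed `evaluate_policy` and `greedy_policy` helpers from above):

```python
def policy_iteration(states, actions, T, C, G, initial_policy):
    """Policy iteration: alternate evaluation and greedy improvement until the
    policy stops changing. For SSPs, initial_policy must be proper."""
    policy = dict(initial_policy)
    while True:
        V = evaluate_policy(states, policy, T, C, G)              # evaluate pi_{n-1}
        new_policy = greedy_policy(V, states, actions, T, C, G)   # improve to pi_n
        if new_policy == policy:                                  # pi_n == pi_{n-1}
            return policy, V
        policy = new_policy
```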

Policy Improvement

- Computes a greedy policy under V^{π_{n-1}}
- First compute the Q-value of each action under V^{π_{n-1}} in a given state
- Then assign a greedy action in the state as π_n

Properties of Policy Iteration

- Policy Iteration for an SSP (initialized with a proper policy π_0)
- Successively improves the policy in each iteration
- V^{π_n}(s) ≤ V^{π_{n-1}}(s)
- Converges to an optimal policy

Modified Policy Iteration

- Use an iterative procedure for policy evaluation instead of solving the system of equations
- Initialize it with the final value function from the previous iteration, V^{π_{n-1}}, instead of an arbitrary initialization V_0^{π_n}

Modified Policy Iteration

- Initialize π_0 as a random proper policy
- Repeat
- Approximate Policy Evaluation: compute an estimate of V^{π_{n-1}} by running only a few iterations of iterative policy evaluation
- Policy Improvement: construct π_n greedy w.r.t. the estimate of V^{π_{n-1}}
- Until π_n == π_{n-1}
- Return π_n
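
A sketch of this modification, with the same assumed representation: evaluation is truncated to a few backups (`k_eval`) and warm-started from the value function carried over from the previous iteration.

```python
def modified_policy_iteration(states, actions, T, C, G, initial_policy, k_eval=10):
    """Modified policy iteration: approximate, warm-started policy evaluation
    followed by greedy policy improvement."""
    policy = dict(initial_policy)
    V = {s: 0.0 for s in states}                  # carried across iterations
    while True:
        for _ in range(k_eval):                   # approximate policy evaluation
            for s in states:
                if s in G:
                    continue
                a = policy[s]
                V[s] = sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])
        new_policy = greedy_policy(V, states, actions, T, C, G)
        if new_policy == policy:
            return policy, V
        policy = new_policy
```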

Limitations of Policy Iteration

- Why do we need to start with a proper policy?
- If the initial policy is improper, the policy evaluation step diverges

- How do we get a proper policy?
- There is no domain-independent algorithm for finding one

- Hence, policy iteration for SSPs is not generically applicable

From Policy Iteration To Value Iteration

- Search space changes
- Policy Iteration
- Search over policies
- Compute the resulting value

- Value Iteration
- Search over values
- Compute the resulting policy

Bellman Equations

- Value Iteration based on set of Bellman equations
- Bellman equations mathematically express the optimal solution of an MDP
- Recursive expansion to compute optimal value function

Bellman Equations

- Optimal Q-value of a state-action pair: the minimum expected cost to reach a goal starting in state s if the agent’s first action is a
- Denoted as Q*(s,a)

- V*(s) = 0 if s ∈ G; otherwise V*(s) = min_{a∈A} Q*(s,a)
- Q*(s,a) = Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V*(s')]
- Restatement of optimality principle for SSP MDPs

Bellman Equations

- Q*(s,a) = Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V*(s')]  (the expected cost of first executing action a in state s, then following the optimal policy)
- V*(s) = 0 if s ∈ G  (already at the goal, no action needed)
- V*(s) = min_{a∈A} Q*(s,a) if s ∉ G  (pick the best action, minimizing expected cost)
- The minimization over all actions makes the equations non-linear

Bellman Backup

- Iterative refinement via the Bellman backup:
- V_n(s) ← min_{a∈A} Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V_{n-1}(s')]
- Bellman Backup: computes a new value at state s by backing up the successor values V(s') using the update above
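
A minimal sketch of a single Bellman backup, using the same assumed dictionary representation as the earlier sketches:

```python
def bellman_backup(V, s, actions, T, C):
    """V_n(s) <- min_a sum_{s'} T(s,a,s') [C(s,a,s') + V_{n-1}(s')]."""
    return min(sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])
               for a in actions)
```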

Example

- All costs are 1 except for a_{40} = 5 and a_{41} = 2
- All V_0 values are initialized as the distance from the goal
- The backup order is s_0, s_1, …, s_4

Example – First Iteration

- V_1(s_0) = min{1 + V_0(s_1), 1 + V_0(s_2)} = 3, with similar computations for s_1, s_2, s_3
- Q_1(s_4, a_{41}) = 2 + 0.6·0 + 0.4·2 = 2.8 and Q_1(s_4, a_{40}) = 5
- V_1(s_4) = min{2.8, 5} = 2.8

VI - Convergence and Optimality

- Value Iteration converges to the optimal value in the limit without restrictions
- For an SSP MDP, ∀s ∈ S
- lim_{n→∞} V_n(s) = V*(s), irrespective of the initialization

VI - Termination

- Residual at state s (under a Bellman backup): the magnitude of the change in the value of state s if a Bellman backup is applied to V at s once
- Denoted Res^V(s)
- The residual Res^V is the maximum residual across all states

- ϵ-consistency: a state s is called ϵ-consistent w.r.t. a value function V if the residual at s w.r.t. V is less than ϵ
- A value function V is ϵ-consistent if it is ϵ-consistent at all states

- Terminate VI if all residuals are small
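
Putting the backup and the termination test together, a sketch of value iteration that stops once the value function is ϵ-consistent (values are updated in place during each sweep, a common simplification; `bellman_backup` is the sketch from above):

```python
def value_iteration(states, actions, T, C, G, eps=1e-4):
    """Sweep over all states, applying Bellman backups, until the largest
    residual (change in any state's value) drops below eps."""
    V = {s: 0.0 for s in states}
    while True:
        max_residual = 0.0
        for s in states:
            if s in G:
                continue
            new_v = bellman_backup(V, s, actions, T, C)
            max_residual = max(max_residual, abs(new_v - V[s]))
            V[s] = new_v
        if max_residual < eps:       # V is eps-consistent at every state
            return V
```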

VI - Running Time

- Each Bellman backup:
- Go over all actions and all their possible successors: O(|S||A|)

- Each Iteration:
- Backup all states: O(|S|2|A|)

- Number of iterations:
- General SSPs: non-trivial bounds don’t exist

Monotonicity

- For all n > k:
- V_k ≤_p V* ⇒ V_n ≤_p V* (V_n monotonic from below)
- V_k ≥_p V* ⇒ V_n ≥_p V* (V_n monotonic from above)

- If a value function V_1 is componentwise greater (or less) than another value function V_2, then the same inequality holds between T(V_1) and T(V_2)
- The Bellman backup operator used in VI is monotonic

Value Iteration to Asynchronous Value Iteration

- Value iteration requires full sweeps of the state space
- It is not essential to back up every state in every iteration

- Asynchronous value iteration requires an additional restriction, that no state is starved (every state keeps getting backed up), for convergence to hold
- The termination condition checks whether the current value function is ϵ-consistent

Priority Backup

- There are wasteful backups that occur in value iteration
- Need to choose intelligent backup order and define a priority
- Higher priority states backed up earlier

What State to Prioritize?

- Avoid backing up a state when:
- None of the successors of the state have had a change in value since the last backup
- This means backing up the state will not change its value

Prioritized Sweeping

- If a state's value changes, prioritize its predecessors
- Estimate the expected change in the value of a state if a backup were to be performed
- Converges to optimal in limit if all initial priorities are non-zero
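
A simplified prioritized-sweeping-style sketch; it assumes a `predecessors[s]` mapping (not given in the slides) and uses the size of a state's value change as a crude priority for its predecessors rather than the exact expected-change estimate:

```python
import heapq
from itertools import count

def prioritized_sweeping_vi(states, actions, T, C, G, predecessors, eps=1e-4):
    """Simplified prioritized-sweeping-style VI: always back up the state with
    the highest priority, then raise the priority of its predecessors in
    proportion to how much its value just changed."""
    V = {s: 0.0 for s in states}
    best = {s: 1.0 for s in states if s not in G}   # non-zero initial priorities
    tie = count()                                    # tie-breaker for heap entries
    heap = [(-prio, next(tie), s) for s, prio in best.items()]
    heapq.heapify(heap)
    while heap:
        neg_prio, _, s = heapq.heappop(heap)
        if -neg_prio != best.get(s, 0.0):            # stale queue entry, skip it
            continue
        old_v = V[s]
        V[s] = min(sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])
                   for a in actions)                 # Bellman backup at s
        best[s] = 0.0
        change = abs(V[s] - old_v)
        if change > eps:
            for pred in predecessors[s]:             # predecessors may now be stale
                if pred not in G and change > best.get(pred, 0.0):
                    best[pred] = change              # crude priority estimate
                    heapq.heappush(heap, (-change, next(tie), pred))
    return V
```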

Generalized Prioritized Sweeping

- Instead of estimating the residual, compute it exactly and use it as the priority
- First back up the state, then push it into the queue

Improved Prioritized Sweeping

- States with low V(s) (closer to the goal) get higher priority initially
- As the residuals of states close to the goal shrink, the priority of states farther from the goal increases

Backward Value Iteration

- Prioritized value iteration without a priority queue
- Back up states in reverse order, starting from the goal
- No priority-queue overhead, with good information flow
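
A sketch of the ordering idea: derive a backup order by a reverse breadth-first traversal from the goal (again assuming a `predecessors[s]` mapping), then run ordinary VI sweeps over that order.

```python
from collections import deque

def backward_order(G, predecessors):
    """States ordered by a reverse breadth-first traversal from the goal:
    goal-adjacent states first, distant states last."""
    order, seen, queue = [], set(G), deque(G)
    while queue:
        s = queue.popleft()
        for pred in predecessors[s]:
            if pred not in seen:
                seen.add(pred)
                order.append(pred)
                queue.append(pred)
    return order
```

Each sweep then backs up states in this order (for example, by iterating over `order` instead of `states` in the value-iteration sketch above), so value information propagates outward from the goal without priority-queue overhead.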

Which priority algorithm to use?

- Synchronous Value Iteration: when states are highly interconnected
- Prioritized Sweeping / Generalized Prioritized Sweeping: sequential dependencies
- Improved Prioritized Sweeping: a specific way to trade off proximity to the goal against information flow
- Backward Value Iteration: better for domains where states have few predecessors

Partitioned Value Iteration

- Partition the state space
- Stabilize the mutual dependencies among states within a partition before focusing attention on states in other partitions

Benefits of Partitioning

- External-memory algorithms
- PEMVI

- Cache-efficient algorithms
- P-EVA algorithm

- Parallelized algorithms
- P3VI

Linear Programming for MDPs

- α(s) are the state-relevance weights
- For the exact solution they are unimportant and can be set to any positive number (e.g., 1)
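
The LP itself is not reproduced in the transcript; a standard formulation for cost-based MDPs, consistent with the variable and constraint counts on the next slide, is roughly:

- maximize Σ_{s∈S} α(s) V(s)
- subject to V(s) ≤ Σ_{s'∈S} T(s,a,s') [C(s,a,s') + V(s')] for all s ∉ G and a ∈ A
- with V(s) = 0 for all s ∈ G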

Linear Programming for MDPs

- |S| variables
- |S| |A| constraints
- Computing the exact solution is slower than value iteration
- Better suited to a specific kind of approximation, in which the value of a state is approximated by a weighted sum of basis functions
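
A sketch of the exact LP solution using `scipy.optimize.linprog`, under the same assumed dictionary representation as the earlier sketches; it follows the formulation sketched above, and goal states are pinned to value 0 by leaving them out of the variable set:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(states, actions, T, C, G, alpha=None):
    """Exact LP solution: maximize sum_s alpha(s) V(s) subject to
    V(s) <= sum_s' T(s,a,s') [C(s,a,s') + V(s')] for every non-goal s and a."""
    nongoal = [s for s in states if s not in G]
    idx = {s: i for i, s in enumerate(nongoal)}
    n = len(nongoal)
    if alpha is None:
        alpha = {s: 1.0 for s in nongoal}         # state-relevance weights
    c = np.array([-alpha[s] for s in nongoal])    # linprog minimizes, so negate
    A_ub, b_ub = [], []
    for s in nongoal:
        for a in actions:
            row = np.zeros(n)
            row[idx[s]] = 1.0                     # +V(s) on the left-hand side
            rhs = 0.0
            for s2, p in T[(s, a)]:
                rhs += p * C[(s, a, s2)]          # expected one-step cost
                if s2 not in G:
                    row[idx[s2]] -= p             # move p * V(s') to the left
            A_ub.append(row)
            b_ub.append(rhs)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n, method="highs")
    V = {s: 0.0 for s in G}
    V.update({s: float(res.x[idx[s]]) for s in nongoal})
    return V
```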

Infinite-Horizon Discounted-Reward MDPs

- V*(s) = max_{a∈A} Σ_{s'∈S} T(s,a,s') [R(s,a,s') + γ V*(s')]
- Value Iteration and Policy Iteration work even better than for SSPs
- Policy Iteration does not require a “proper” policy
- Convergence stronger, bounds tighter
- Can bound number of iterations
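
For IHDR MDPs the backup maximizes discounted reward instead of minimizing cost; a sketch, assuming a reward dictionary `R[(s, a, s2)]` analogous to the cost dictionary used earlier:

```python
def discounted_backup(V, s, actions, T, R, gamma):
    """IHDR backup: V(s) <- max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma * V(s')]."""
    return max(sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
               for a in actions)
```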

Finite-Horizon MDPs

- V*(s,t) = 0 if t > L
- V*(s,t) = max_{a∈A} Σ_{s'∈S} T(s,a,s') [R(s,a,s') + V*(s',t+1)] otherwise
- Finite-Horizon MDPs are acyclic
- There exists an optimal backup order: t = T_max down to 0

- Returns optimal values (not just ϵ-consistent)
- Performs one backup per augmented state (s,t)
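
A sketch of that optimal backup order: one pass of backward induction over augmented states (s, t). The time indexing here is an assumption (decisions at t = 0, ..., L-1, with V = 0 at the horizon t = L), and the reward dictionary `R` is the same assumed representation as above.

```python
def finite_horizon_vi(states, actions, T, R, L):
    """Backward induction: exactly one backup per augmented state (s, t)."""
    V = {L: {s: 0.0 for s in states}}          # nothing left to earn at the horizon
    pi = {}
    for t in range(L - 1, -1, -1):             # t = L-1 down to 0
        V[t], pi[t] = {}, {}
        for s in states:
            q = {a: sum(p * (R[(s, a, s2)] + V[t + 1][s2]) for s2, p in T[(s, a)])
                 for a in actions}
            best = max(q, key=q.get)           # greedy action at (s, t)
            V[t][s], pi[t][s] = q[best], best
    return V, pi
```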

MDPs with Dead Ends

- Dead-End State: a state s ∈ S from which no policy can reach the goal in any number of time steps
- SSP MDPs cannot model domains with dead ends
- If dead-end states are allowed, V*(s) is not defined for them: value iteration diverges at dead-end states

Finite-Penalty SSP MDPs with Dead-Ends

- fSSPDE: a tuple <S, A, T, C, G, P>
- S, A, T, C, G are the same as in an SSP MDP
- P ∈ ℝ+ denotes the penalty incurred when the agent decides to abort the process in a non-goal state, under the condition:
- For every improper stationary deterministic Markovian policy π and every s ∈ S at which π is improper, the value of π at s under expected linear additive utility, without the option of stopping the process by paying the penalty, is infinite
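
One way to read this definition operationally: aborting and paying P is always an alternative, so a state's backed-up value is capped at P, which keeps dead-end values finite. The sketch below is an interpretation of the definition above, not code from the slides.

```python
def finite_penalty_backup(V, s, actions, T, C, P):
    """fSSPDE-style backup: min over the usual Bellman backup and the option of
    aborting at penalty P, so dead-end values stay bounded by P."""
    q_best = min(sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])
                 for a in actions)
    return min(P, q_best)
```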

Comparison

- Policy Iteration
- Convergence depends on the initial policy being proper (not required for IHDR)

- Value Iteration
- Doesn’t require initial proper policy
- For IHDR has stronger error bounds upon reaching ϵ-consistency

- Linear Programming
- Computing exact solution is slower than value iteration

Summary

- Policy Iteration
- Value Iteration
- Prioritizing and Partitioning Value Iteration
- Linear Programming (alternative solution)
- Special Cases:
- IHDR MDP and FH MDP

- Dead-End States
