Markov Decision Processes
* Based in part on slides by Alan Fern, Craig Boutilier and Daniel Weld
Deterministic worlds + goals of attainment
Stochastic worlds + generalized rewards
An action can take you to any of a set of states with known probability
You get rewards for visiting each state
The objective is to maximize your “cumulative” reward…
What is the solution?




Non-deterministic: the next state could be any one of a set of states.
Stochastic: the next state is drawn from a probability distribution over the set of states.
How are these models related?
[Figure: agent–environment loop over time: in state St the agent takes action At and receives reward Rt, then transitions to St+1, takes At+1, receives Rt+1, reaches St+2, and so on.]
“If you are 20 and not a liberal, you are heartless;
if you are 40 and not a conservative, you are mindless.”
(attributed to Churchill)
Why not just consider sequences of actions?
Why not just replan?
Vk(s) = R(s) + max_a Σs' Pr(s'|s,a) · Vk-1(s')
(immediate reward, plus expected future payoff with k-1 stages to go)
π(s,k) = argmax_a Σs' Pr(s'|s,a) · Vk-1(s')
How can we compute optimal Vt+1(s) given optimal Vt?
[Figure: Bellman backup at state s. Action a1 reaches s1 (prob. 0.7) and s4 (prob. 0.3); action a2 reaches s2 (prob. 0.4) and s3 (prob. 0.6). ComputeExpectations over Vt, then ComputeMax to get Vt+1(s).]
Vt+1(s) = R(s) + max { 0.7·Vt(s1) + 0.3·Vt(s4),
                       0.4·Vt(s2) + 0.6·Vt(s3) }
What is the time complexity? Each backup considers every action and every successor state, so one full sweep is O(|S|²·|A|).
Bellman backup
Vk is the optimal k-stage-to-go value function
Π*(s,k) is the optimal k-stage-to-go policy
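A minimal Python sketch of one such Bellman backup, using the transition branches from the example above; the reward values and the base case V0(s) = R(s) are assumptions for illustration:

```python
# One Bellman backup at state s4, using the figure's branches.
# Rewards and V0 are made-up illustrative values (assumptions).
P = {  # P[a][s'] = Pr(s' | s4, a)
    "a1": {"s1": 0.7, "s4": 0.3},
    "a2": {"s2": 0.4, "s3": 0.6},
}
R = {"s1": 1.0, "s2": 0.0, "s3": 2.0, "s4": 0.5}
V0 = dict(R)  # assume 0-stages-to-go value V0(s) = R(s)

def backup(s, V):
    """V_k(s) = R(s) + max_a sum_{s'} Pr(s'|s,a) * V_{k-1}(s')."""
    return R[s] + max(sum(p * V[sp] for sp, p in dist.items())
                      for dist in P.values())

V1_s4 = backup("s4", V0)  # 0.5 + max(0.7*1 + 0.3*0.5, 0.4*0 + 0.6*2) = 1.7
```

One full sweep of value iteration simply applies this backup at every state before moving to the next stage.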
[Figure: value iteration unrolled over stages V0 → V1 → V2 → V3 for states s1–s4, applying the same 0.7/0.3 (a1) and 0.4/0.6 (a2) branches at each stage.]
V1(s4) = R(s4) + max { 0.7·V0(s1) + 0.3·V0(s4),
                       0.4·V0(s2) + 0.6·V0(s3) }
Optimal value depends on stages-to-go
(it is independent of stage in the infinite-horizon case)
[Figure: the same staged diagram, now used to read off the optimal policy at each stage.]
Π*(s4,t) = argmax { a1: 0.7·Vt-1(s1) + 0.3·Vt-1(s4),
                    a2: 0.4·Vt-1(s2) + 0.6·Vt-1(s3) }
1. Choose a random policy π
2. Loop:
(a) Evaluate Vπ
(b) For each s in S, set π’(s) = argmax_a Σs' Pr(s'|s,a) · Vπ(s')
(c) Replace π with π’
Until no improving action possible at any state
Policy improvement
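The loop above can be sketched on a toy two-state MDP (all states, actions, probabilities, and rewards below are illustrative); policy evaluation is done here by iterating its fixed-point equation rather than solving the linear system:

```python
# Policy iteration on a toy 2-state, 2-action MDP (illustrative numbers).
gamma = 0.9
S, A = ["s0", "s1"], ["a0", "a1"]
T = {  # T[s][a] = [(next_state, prob), ...]
    "s0": {"a0": [("s0", 1.0)], "a1": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"a0": [("s1", 1.0)], "a1": [("s0", 1.0)]},
}
R = {"s0": 0.0, "s1": 1.0}  # reward for visiting each state

def evaluate(pi, iters=500):
    # (a) Evaluate V_pi by iterating its fixed-point equation
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {s: R[s] + gamma * sum(p * V[sp] for sp, p in T[s][pi[s]])
             for s in S}
    return V

def improve(pi):
    # (b) Greedy one-step lookahead with respect to V_pi at every state
    V = evaluate(pi)
    return {s: max(A, key=lambda a: sum(p * V[sp] for sp, p in T[s][a]))
            for s in S}

pi = {"s0": "a0", "s1": "a0"}          # 1. start from an arbitrary policy
while (new_pi := improve(pi)) != pi:   # until no improving action anywhere
    pi = new_pi
```

On this toy model the loop converges to heading for the rewarding state s1 (a1 in s0) and staying there (a0 in s1).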
Goal-directed, Indefinite-Horizon, Cost-Minimization MDP
<S, A, Pr, C, G, s0>
Most often studied in planning community
Infinite Horizon, Discounted Reward Maximization MDP
<S, A, Pr, R, γ>
Most often studied in reinforcement learning
Goal-directed, Finite-Horizon, Prob.-Maximization MDP
<S, A, Pr, G, s0, T>
Also studied in planning community
Oversubscription Planning: Non-absorbing goals, Reward-Maximization MDP
<S, A, Pr, G, R, s0>
Relatively recent model
MDPs don’t have a notion of an “initial” and “goal” state. (Process orientation instead of “task” orientation)
Goals are sort of modeled by reward functions
Allows pretty expressive goals (in theory)
Normal MDP algorithms don’t use initial state information (since policy is supposed to cover the entire search space anyway).
Could consider “envelope extension” methods
Compute a “deterministic” plan (which gives the policy for some of the states); extend the policy to other states that are likely to be reached during execution
RTDP methods
SSSP are a special case of MDPs where
(a) initial state is given
(b) there are absorbing goal states
(c) actions have costs; all states have zero rewards
A proper policy for SSSP is a policy which is guaranteed to ultimately put the agent in one of the absorbing states
For SSSP, it is worth finding a partial policy that only covers the “relevant” states (states that are reachable from the initial state on some optimal policy)
Value/Policy Iteration don’t consider the notion of relevance
Consider “heuristic state search” algorithms
Heuristic can be seen as the “estimate” of the value of a state.
<S, A, Pr, C, G, s0>
Define J*(s) {optimal cost} as the minimum expected cost to reach a goal from this state.
J* should satisfy the following equation:
J*(s) = min_a Q*(s,a), where Q*(s,a) = C(s,a) + Σs' Pr(s'|s,a) · J*(s')
<S, A, Pr, R, s0, γ>
Define V*(s) {optimal value} as the maximum expected discounted reward from this state.
V* should satisfy the following equation:
V*(s) = R(s) + γ · max_a Σs' Pr(s'|s,a) · V*(s')
<S, A, Pr, G, s0, T>
Define P*(s,t) {optimal prob.} as the maximum probability of reaching a goal from this state at tth timestep.
P* should satisfy the following equation:
P*(s,t) = max_a Σs' Pr(s'|s,a) · P*(s', t-1), with P*(s,t) = 1 for goal states
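Under those boundary conditions, the finite-horizon recursion can be sketched on a toy problem (all states, actions, and probabilities below are illustrative assumptions):

```python
# Finite-horizon "probability of reaching a goal" DP (toy example).
# States, actions, and probabilities are illustrative assumptions.
T = {  # T[s][a] = [(next_state, prob), ...]; "g" is the goal
    "s0":   {"a": [("g", 0.5), ("s0", 0.5)],
             "b": [("g", 0.9), ("dead", 0.1)]},
    "dead": {"a": [("dead", 1.0)]},
}
GOALS = {"g"}

def p_star(horizon):
    # Base case: P*(s, 0) = 1 iff s is already a goal
    P = {s: 1.0 if s in GOALS else 0.0 for s in list(T) + list(GOALS)}
    for _ in range(horizon):
        P = {s: 1.0 if s in GOALS else
                max(sum(pr * P[sp] for sp, pr in succ)
                    for succ in T[s].values())
             for s in P}
    return P
```

Note how the optimal action depends on the horizon: with one step to go, the risky action b (0.9 straight to the goal) is best; with two steps, the safe action a can retry after failing, giving 0.5 + 0.5·0.9 = 0.95.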
VI and PI approaches use Dynamic Programming Update
Set the value of a state in terms of the maximum expected value achievable by doing actions from that state.
They do the update for every state in the state space
Wasteful if we know the initial state(s) that the agent is starting from
Heuristic search (e.g. A*/AO*) explores only the part of the state space that is actually reachable from the initial state
Even within the reachable space, heuristic search can avoid visiting many of the states.
Depending on the quality of the heuristic used.
But what is the heuristic?
An admissible heuristic is a lower bound on the cost to reach a goal from any given state
It is a lower bound on J*!
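On a toy SSSP this can be checked directly: iterate J(s) = min_a [C(s,a) + Σs' Pr(s'|s,a) · J(s')] to approximate J*, then verify that an admissible heuristic, here the trivial h(s) = 0, never exceeds it. All states, costs, and probabilities below are illustrative assumptions:

```python
# Toy SSSP: compute J* by iterating the Bellman (min-cost) equation,
# then check that an admissible heuristic never exceeds it.
# All states, costs, and probabilities are illustrative assumptions.
T = {  # T[s][a] = (cost, [(next_state, prob), ...]); "g" is the goal
    "s0": {"a": (1.0, [("s1", 1.0)]),
           "b": (3.0, [("g", 1.0)])},
    "s1": {"a": (1.0, [("g", 0.5), ("s0", 0.5)])},
}

def j_star(iters=200):
    J = {"s0": 0.0, "s1": 0.0, "g": 0.0}
    for _ in range(iters):
        J = {s: 0.0 if s == "g" else
                min(c + sum(p * J[sp] for sp, p in succ)
                    for c, succ in T[s].values())
             for s in J}
    return J

J = j_star()                       # here J*(s0) = 3.0, J*(s1) = 2.5
h = {s: 0.0 for s in J}            # h = 0 is trivially admissible
assert all(h[s] <= J[s] for s in J)
```

A better admissible heuristic would sit between 0 and J*; the closer it gets to J*, the fewer states a heuristic search needs to visit.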