
# Markov Decision Process (MDP)



## PowerPoint Slideshow about ' Markov Decision Process (MDP)' - zahi


S: A set of states

A: A set of actions

Pr(s’|s,a): transition model (aka M^a_{s,s’})

C(s,a,s’): cost model

G: set of goals

s0: start state

γ: discount factor

R(s,a,s’): reward model

Value function: expected long-term reward from the state

Q values: expected long-term reward of doing a in s

V(s) = max_a Q(s,a)

Greedy policy w.r.t. a value function

Value of a policy

Optimal value function
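The relationship between Q values, V(s) = max_a Q(s,a) and the greedy policy can be illustrated with a small Python sketch. The two-state MDP, the discount factor of 0.9 and every name below (`TRANSITIONS`, `q_value`, `state_value`, `greedy_policy`) are assumptions made for illustration and do not come from the slides.

```python
from typing import Dict, List, Tuple

# Hypothetical toy MDP (not from the slides):
# TRANSITIONS[s][a] is a list of (probability, next_state, reward) outcomes.
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]

TRANSITIONS: Transitions = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)]},
}
GAMMA = 0.9  # assumed discount factor


def q_value(s: str, a: str, V: Dict[str, float]) -> float:
    """Expected long-term reward of doing a in s, given state values V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in TRANSITIONS[s][a])


def state_value(s: str, V: Dict[str, float]) -> float:
    """V(s) = max_a Q(s, a)."""
    return max(q_value(s, a, V) for a in TRANSITIONS[s])


def greedy_policy(V: Dict[str, float]) -> Dict[str, str]:
    """Greedy policy w.r.t. V: pick an action maximizing Q in every state."""
    return {s: max(TRANSITIONS[s], key=lambda a: q_value(s, a, V))
            for s in TRANSITIONS}


# An arbitrary value estimate, used only to exercise the functions.
V_GUESS = {"s0": 0.0, "s1": 10.0}
POLICY = greedy_policy(V_GUESS)
```

With this value estimate the greedy policy prefers `go` in `s0`, because its Q value (reward plus discounted value of the likely successor `s1`) exceeds that of `stay`.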

Goal-directed, Indefinite Horizon, Cost Minimization MDP

<S, A, Pr, C, G, s0>

Most often studied in planning community

Infinite Horizon, Discounted Reward Maximization MDP

<S, A, Pr, R, γ>

Most often studied in reinforcement learning

Goal-directed, Finite Horizon, Prob. Maximization MDP

<S, A, Pr, G, s0, T>

Also studied in planning community

Oversubscription Planning: Non absorbing goals, Reward Max. MDP

<S, A, Pr, G, R, s0>

Relatively recent model

MDPs don’t have a notion of an “initial” and “goal” state. (Process orientation instead of “task” orientation)

Goals are sort of modeled by reward functions

Allows pretty expressive goals (in theory)

Normal MDP algorithms don’t use initial state information (since policy is supposed to cover the entire search space anyway).

Could consider “envelope extension” methods

Compute a “deterministic” plan (which gives the policy for some of the states); extend the policy to other states that are likely to happen during execution

RTDP methods

SSPPs are a special case of MDPs where

(a) initial state is given

(b) there are absorbing goal states

(c) Actions have costs. All states have zero rewards

A proper policy for an SSPP is a policy which is guaranteed to ultimately put the agent in one of the absorbing states

For an SSPP, it would be worth finding a partial policy that only covers the “relevant” states (states that are reachable from the initial state under some optimal policy)

Value/Policy Iteration don’t consider the notion of relevance

Consider “heuristic state search” algorithms

Heuristic can be seen as the “estimate” of the value of a state.

SSPP (Stochastic Shortest Path Problem): an MDP with Init and Goal states

<S, A, Pr, C, G, s0>

Define J*(s) {optimal cost} as the minimum expected cost to reach a goal from this state.

J* should satisfy the following equation:
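The equation itself is not reproduced in this transcript. A standard form consistent with the definitions above (costs C, absorbing goals G, and Q*(s,a) as the expected cost of doing a in s and acting optimally afterwards) would be the following; it is offered as a reconstruction, not as the slide's exact notation.

$$
J^*(s) =
\begin{cases}
0 & \text{if } s \in G \\
\min_{a \in A} Q^*(s,a) & \text{otherwise}
\end{cases}
\qquad
Q^*(s,a) = \sum_{s' \in S} \Pr(s' \mid s,a)\,\bigl[\,C(s,a,s') + J^*(s')\,\bigr]
$$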

Bellman Equations for Cost Minimization MDP (absorbing goals) [also called Stochastic Shortest Path]

Q*(s,a)

<S, A, Pr, R, s0, γ>

Define V*(s) {optimal value} as the maximum expected discounted reward from this state.

V* should satisfy the following equation:
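The equation is not reproduced in this transcript. Under the definitions above (reward model R and discount factor γ), the standard form would be the following, given as a reconstruction rather than the slide's exact notation.

$$
V^*(s) = \max_{a \in A} \sum_{s' \in S} \Pr(s' \mid s,a)\,\bigl[\,R(s,a,s') + \gamma\,V^*(s')\,\bigr]
$$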

Bellman Equations for infinite horizon discounted reward maximization MDP

<S, A, Pr, G, s0, T>

Define P*(s,t) {optimal prob.} as the maximum probability of reaching a goal from this state at the t-th timestep.

P* should satisfy the following equation:
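The equation is not reproduced in this transcript. One form consistent with the definition above is sketched below, assuming that t counts elapsed timesteps and that T is the horizon after which no further actions are taken; both the boundary conditions and the indexing are assumptions.

$$
P^*(s,t) =
\begin{cases}
1 & \text{if } s \in G \\
0 & \text{if } s \notin G \text{ and } t = T \\
\max_{a \in A} \sum_{s' \in S} \Pr(s' \mid s,a)\,P^*(s',t+1) & \text{otherwise}
\end{cases}
$$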

Bellman Equations for probability maximization MDP

Modeling Softgoal problems as deterministic MDPs

• Consider the net-benefit problem, where actions have costs, and goals have utilities, and we want a plan with the highest net benefit

• How do we model this as MDP?

• (wrong idea): Make every state in which any subset of goals hold into a sink state with reward equal to the cumulative sum of utilities of the goals.

• Problem: what if achieving g1 ∧ g2 necessarily leads you through a state where g1 is already true?

• (correct version): Make a new fluent called “done” and a dummy action called Done-Deal. It is applicable in any state and asserts the fluent “done”. All “done” states are sink states. Their reward is equal to the sum of the utilities of the individual goals that hold in them.

An eye for an eye only ends up making the whole world blind. -Mohandas Karamchand Gandhi, born October 2nd, 1869.

Lecture of October 2nd, 2009

LAO*, RTDP

Use execution and/or Simulation

“Actual Execution” Reinforcement learning

(Main motivation for RL is to “learn” the model)

“Simulation”: simulate the given model to sample possible futures

Policy rollout, hindsight optimization etc.

Use “factored” representations

Factored representations for Actions, Reward Functions, Values and Policies

Directly manipulating factored representations during the Bellman update

Ideas for Efficient Algorithms..

Set the value of a state in terms of the maximum expected value achievable by doing actions from that state.

Value and policy iteration do this update for every state in the state space

Wasteful if we know the initial state(s) that the agent is starting from
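To make the "update every state" point concrete, here is a minimal value iteration sketch in Python. It is written in the cost-minimization form (min over actions, absorbing goals); the reward-maximization form swaps min for max. The toy problem `TOY`, the stopping threshold and the function name `value_iteration` are assumptions for illustration only.

```python
from typing import Dict, List, Set, Tuple

Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]

# Hypothetical toy problem: TOY[s][a] = [(probability, next_state, cost), ...]
TOY: Transitions = {
    "s0": {"a": [(1.0, "s1", 1.0)],
           "b": [(0.5, "g", 3.0), (0.5, "s0", 3.0)]},
    "s1": {"a": [(0.5, "g", 1.0), (0.5, "s1", 1.0)]},
    "g": {},  # absorbing goal state
}


def value_iteration(transitions: Transitions, goals: Set[str],
                    epsilon: float = 1e-8,
                    max_sweeps: int = 10_000) -> Dict[str, float]:
    """Sweep over ALL states, backing each one up, until values stabilize."""
    J = {s: 0.0 for s in transitions}
    for _ in range(max_sweeps):
        largest_change = 0.0
        for s in transitions:          # every state, reachable or not
            if s in goals:
                continue               # goal cost stays at 0
            backed_up = min(
                sum(p * (c + J[s2]) for p, s2, c in outcomes)
                for outcomes in transitions[s].values()
            )
            largest_change = max(largest_change, abs(backed_up - J[s]))
            J[s] = backed_up
        if largest_change < epsilon:
            break
    return J


J_STAR = value_iteration(TOY, goals={"g"})
```

Every sweep touches each non-goal state regardless of whether the agent could ever reach it from its start state, which is the inefficiency that heuristic search tries to avoid.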

Heuristic search (e.g. A*/AO*) explores only the part of the state space that is actually reachable from the initial state

Even within the reachable space, heuristic search can avoid visiting many of the states.

Depending on the quality of the heuristic used..

But what is the heuristic?

An admissible heuristic is a lower bound on the cost to reach the goal from any given state

It is a lower bound on V*!

Heuristic Search vs. Dynamic Programming (Value/Policy Iteration)

Real Time Dynamic Programming [Barto, Bradtke, Singh’95]

Trial: simulate greedy policy starting from start state; perform Bellman backup on visited states

RTDP: repeat Trials until cost function converges
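The trial loop can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the toy problem `TOY`, the zero heuristic, the fixed number of trials (used here in place of a convergence test) and all function names are illustrative and not taken from the slides or from the cited paper.

```python
import random
from typing import Callable, Dict, List, Set, Tuple

Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]

# Hypothetical toy problem: TOY[s][a] = [(probability, next_state, cost), ...]
TOY: Transitions = {
    "s0": {"a": [(1.0, "s1", 1.0)],
           "b": [(0.5, "g", 3.0), (0.5, "s0", 3.0)]},
    "s1": {"a": [(0.5, "g", 1.0), (0.5, "s1", 1.0)]},
    "g": {},  # absorbing goal state
}


def q_cost(s: str, a: str, J: Dict[str, float], T: Transitions) -> float:
    """Expected cost of doing a in s and then paying J at the successor."""
    return sum(p * (c + J[s2]) for p, s2, c in T[s][a])


def rtdp(T: Transitions, goals: Set[str], s0: str,
         heuristic: Callable[[str], float],
         n_trials: int = 2000, max_steps: int = 100,
         seed: int = 0) -> Dict[str, float]:
    rng = random.Random(seed)
    J = {s: (0.0 if s in goals else heuristic(s)) for s in T}
    for _ in range(n_trials):
        s = s0                                   # each trial starts at s0
        for _ in range(max_steps):
            if s in goals:
                break
            # Greedy action under the current cost estimates.
            a = min(T[s], key=lambda act: q_cost(s, act, J, T))
            J[s] = q_cost(s, a, J, T)            # Bellman backup on visited state
            outcomes = T[s][a]
            # Simulate the action: sample a successor by its probability.
            s = rng.choices([o[1] for o in outcomes],
                            weights=[o[0] for o in outcomes])[0]
    return J


J_RTDP = rtdp(TOY, goals={"g"}, s0="s0", heuristic=lambda s: 0.0)
```

Only states actually visited by the simulated greedy trajectories are backed up, so states unreachable from `s0` keep their initial heuristic value.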

RTDP was originally introduced for Reinforcement Learning

For RL, instead of “simulate” you “execute”

You also have to do “exploration” in addition to “exploitation”

With probability p, follow the greedy policy; with probability 1-p, pick a random action
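The exploration rule can be written as a few lines of Python. The function name `choose_action` and the example Q values are hypothetical; the Q values here are utilities, so the greedy action is the one with the highest value.

```python
import random
from typing import Dict


def choose_action(q_values: Dict[str, float], p_greedy: float,
                  rng: random.Random) -> str:
    """With probability p_greedy exploit (greedy action), else explore."""
    if rng.random() < p_greedy:
        return max(q_values, key=q_values.get)   # exploitation
    return rng.choice(list(q_values))            # exploration: uniform random


# Hypothetical Q values for a single state.
Q_AT_S = {"a1": 0.2, "a2": 0.9, "a3": 0.5}
RNG = random.Random(0)
ALWAYS_GREEDY = [choose_action(Q_AT_S, 1.0, RNG) for _ in range(20)]
ALWAYS_RANDOM = [choose_action(Q_AT_S, 0.0, RNG) for _ in range(20)]
```

Setting `p_greedy` to 1 recovers the purely greedy trials used in simulation-based RTDP, while lower values trade some exploitation for exploration.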

What if we simulate the action’s effect with noise (rather than exactly wrt its transition probabilities)

Note that the value function is being updated at each level. How about waiting until you hit the goal and then updating everyone?

[Figure: RTDP trial step at s0 with actions a1, a2, a3 and successor values Jn. Labels: Qn+1(s0,a); Jn+1(s0) = Min; agreedy = a2; Goal.]

Greedy “On-Policy” RTDP without execution

Using the current utility values, select the action with the highest expected utility (greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize

Labeled RTDP [Bonet&Geffner’03]

Initialise J0 with an admissible heuristic

⇒ Jn monotonically increases

Label a state as solved if the Jn for that state has converged

Backpropagate ‘solved’ labeling

Stop trials when they reach any solved state

Terminate when s0 is solved
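The labeling idea can be sketched in Python along the lines of the published algorithm: trials stop at solved states, and after each trial a check walks the greedy graph from the visited states, labeling them solved if every residual is small. This is a simplified sketch, not the authors' code; the toy problem, the threshold `eps`, the `max_trials` guard and all names are assumptions, and it presumes a proper policy exists so that trials end.

```python
import random
from typing import Callable, Dict, List, Set, Tuple

Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]

# Hypothetical toy problem: TOY[s][a] = [(probability, next_state, cost), ...]
TOY: Transitions = {
    "s0": {"a": [(1.0, "s1", 1.0)],
           "b": [(0.5, "g", 3.0), (0.5, "s0", 3.0)]},
    "s1": {"a": [(0.5, "g", 1.0), (0.5, "s1", 1.0)]},
    "g": {},  # absorbing goal state
}


def q_cost(s, a, J, T):
    return sum(p * (c + J[s2]) for p, s2, c in T[s][a])


def greedy(s, J, T):
    return min(T[s], key=lambda a: q_cost(s, a, J, T))


def residual(s, J, T):
    return abs(J[s] - q_cost(s, greedy(s, J, T), J, T))


def check_solved(s, J, solved, T, eps):
    """Label s (and its greedy descendants) solved if all have converged."""
    if s in solved:
        return True
    converged, open_list, closed, seen = True, [s], [], {s}
    while open_list:
        u = open_list.pop()
        closed.append(u)
        if residual(u, J, T) > eps:
            converged = False          # do not expand unconverged states
            continue
        for p, s2, _ in T[u][greedy(u, J, T)]:
            if p > 0 and s2 not in solved and s2 not in seen:
                seen.add(s2)
                open_list.append(s2)
    if converged:
        solved.update(closed)          # back-propagate the 'solved' label
    else:
        while closed:                  # otherwise back up what was explored
            u = closed.pop()
            J[u] = q_cost(u, greedy(u, J, T), J, T)
    return converged


def lrtdp(T: Transitions, goals: Set[str], s0: str,
          heuristic: Callable[[str], float], eps: float = 1e-6,
          max_trials: int = 10_000, seed: int = 0):
    rng = random.Random(seed)
    J = {s: (0.0 if s in goals else heuristic(s)) for s in T}
    solved = set(goals)
    for _ in range(max_trials):
        if s0 in solved:               # terminate when s0 is solved
            break
        visited, s = [], s0
        while s not in solved:         # stop the trial at any solved state
            visited.append(s)
            a = greedy(s, J, T)
            J[s] = q_cost(s, a, J, T)
            outs = T[s][a]
            s = rng.choices([o[1] for o in outs],
                            weights=[o[0] for o in outs])[0]
        while visited:
            if not check_solved(visited.pop(), J, solved, T, eps):
                break
    return J, solved


J_LRTDP, SOLVED = lrtdp(TOY, goals={"g"}, s0="s0", heuristic=lambda s: 0.0)
```

The zero heuristic used here is admissible for non-negative costs, so the estimates only increase toward the optimal costs, and the run ends as soon as the start state carries the solved label.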

[Figure: states s and t and goal G. Apart from the best action, the actions at s have high Q costs ⇒ J(s) won’t change! Both s and t get solved together.]

heuristic-guided: explores a subset of reachable state space

anytime: focusses attention on more probable states

fast convergence: focusses attention on unconverged states

terminates in finite time

Ordering the Bellman backups to maximise information flow.

[Wingate & Seppi’05]

[Dai & Hansen’07]

Partition the state space and combine value iterations from different partitions.

[Wingate & Seppi’05]

[Dai & Goldsmith’07]

External memory version of value iteration

[Edelkamp, Jabbar & Bonet’07]