An Introduction to Markov Decision Processes
Sarah Hickmott


Decision Theory

Probability Theory + Utility Theory = Decision Theory

  • Probability Theory: describes what an agent should believe based on evidence.

  • Utility Theory: describes what an agent wants.

  • Decision Theory: describes what an agent should do.

Markov Assumption

Markov Assumption (kth order):

The next state’s conditional probability depends only on a finite history of the k previous states

kth order Markov Process

Markov Assumption

  • Andrei Markov (1913)

  • Markov Assumption:

    The next state’s conditional probability depends only on its immediately previous state

    1st order Markov Process

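The definitions are equivalent: a kth order process becomes a 1st order process once each state is redefined as the tuple of the last k original states.

Any algorithm that makes the 1st order Markov Assumption can therefore be applied to any Markov Process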
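
The equivalence can be illustrated with a minimal sketch. The two-state chain and its probabilities below are made up for illustration, and `to_first_order` is a hypothetical helper name, not something from the slides. It rewrites a 2nd order chain as a 1st order chain over pairs of states.

```python
from itertools import product

# Hypothetical 2nd order chain over two states: the next-state distribution
# depends on the last two states (previous, current). Numbers are illustrative.
STATES = ("A", "B")
second_order = {
    ("A", "A"): {"A": 0.9, "B": 0.1},
    ("A", "B"): {"A": 0.5, "B": 0.5},
    ("B", "A"): {"A": 0.3, "B": 0.7},
    ("B", "B"): {"A": 0.2, "B": 0.8},
}


def to_first_order(kth_order, states, k):
    """Build a 1st order chain whose states are k-tuples of original states."""
    first_order = {}
    for history in product(states, repeat=k):
        row = {}
        for nxt, prob in kth_order[history].items():
            # The successor tuple drops the oldest state and appends the new one.
            row[history[1:] + (nxt,)] = prob
        first_order[history] = row
    return first_order


augmented = to_first_order(second_order, STATES, k=2)
```

Each augmented state now determines its successor distribution on its own, which is exactly the 1st order assumption. The price is a state space of size |S|^k.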

Markov Decision Process

The specification of a sequential decision problem for a fully observable environment that satisfies the Markov Assumption and yields additive costs.

Markov Decision Process

An MDP has:

  • A set of states S = {s1, s2, …, sN}

  • A set of actions A = {a1, a2, …, aM}

  • A real valued cost function g(s, a)

  • A transition probability function p(s’ | s, a)


    We will assume the stationary Markov transition property: the effect of an action is independent of time, so p(s’ | s, a) is the same at every time step.
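
One possible way to hold these four components in code is sketched below. The class name `MDP`, the field names, and the two-state toy problem are all assumptions made for illustration. Stationarity shows up as the transition table having no time index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDP:
    states: tuple      # S
    actions: tuple     # A
    cost: dict         # cost[(s, a)] is the real valued cost g(s, a)
    transition: dict   # transition[(s, a)] maps s_next to p(s_next | s, a)

    def validate(self, tol=1e-9):
        """Return True if every p(. | s, a) is a probability distribution."""
        for s in self.states:
            for a in self.actions:
                row = self.transition[(s, a)]
                if any(p < 0 for p in row.values()):
                    return False
                if abs(sum(row.values()) - 1.0) > tol:
                    return False
        return True


# Hypothetical two-state problem; all numbers are illustrative.
toy = MDP(
    states=("s1", "s2"),
    actions=("stay", "move"),
    cost={("s1", "stay"): 2.0, ("s1", "move"): 1.0,
          ("s2", "stay"): 0.0, ("s2", "move"): 1.0},
    transition={
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "move"): {"s1": 0.2, "s2": 0.8},
        ("s2", "stay"): {"s1": 0.1, "s2": 0.9},
        ("s2", "move"): {"s1": 1.0},
    },
)
is_valid = toy.validate()
```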


k indexes discrete time

x_k is the state of the system at time k

μ_k(x_k) is the control variable to be selected given that the system is in state x_k at time k; μ_k : S_k → A_k

π is a policy; π = {μ_0, ..., μ_{N-1}}

π* is the optimal policy

N is the horizon, or the number of times the control is applied

x_{k+1} = f(x_k, μ_k(x_k)),  k = 0, …, N-1


A policy is a mapping from states to actions

Following a policy:

1. Determine current state xk

2. Execute action μk(xk)

3. Repeat 1-2
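
The loop above can be sketched as a short simulation. The transition table, the policy, and the function name `follow_policy` are hypothetical. For simplicity the sketch assumes a stationary policy (the same mapping at every time step) rather than a separate mapping for each k.

```python
import random

# Hypothetical transition table: transition[(s, a)] maps s_next to its probability.
transition = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s1": 0.1, "s2": 0.9},
    ("s2", "move"): {"s1": 1.0},
}
policy = {"s1": "move", "s2": "stay"}  # illustrative stationary policy


def follow_policy(policy, transition, start, steps, rng):
    """Return the (state, action) pairs visited while following the policy."""
    trajectory = []
    state = start                                  # 1. determine current state
    for _ in range(steps):
        action = policy[state]                     # 2. execute the policy's action
        trajectory.append((state, action))
        successors = transition[(state, action)]
        # Sample the next state from p(. | state, action).
        state = rng.choices(list(successors),
                            weights=list(successors.values()))[0]
    return trajectory                              # 3. the loop repeats 1-2


trajectory = follow_policy(policy, transition, "s1", 5, random.Random(0))
```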

Solution to an MDP

The expected cost of a policy π = {μ_0, ..., μ_{N-1}} starting at state x_0 is:
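
The formula does not appear in this transcript. A standard finite-horizon form consistent with the notation above would be the following, assuming additive stage costs g and no terminal cost (both are assumptions here, not confirmed by the slides):

```latex
J_\pi(x_0) = \mathbb{E}\left[ \sum_{k=0}^{N-1} g\bigl(x_k, \mu_k(x_k)\bigr) \right]
```

The expectation is taken over the state sequence generated by the transition probabilities p(s’ | s, a) when the actions are chosen by π.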

Goal: Find the policy π* which specifies which action to take in each state, so as to minimise the expected cost.

This is encapsulated by Bellman’s Equation:
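
The equation itself is missing from this transcript. A common stationary form, written with the cost function g and transition probabilities p defined earlier, would be the following. It is given as an assumption about what the slide showed, and the symbol J* for the optimal cost is likewise assumed:

```latex
J^*(s) = \min_{a \in A} \left[ g(s, a) + \sum_{s' \in S} p(s' \mid s, a)\, J^*(s') \right]
```

In the discounted variant the sum is multiplied by a discount factor between 0 and 1. The optimal policy π* picks, in each state, an action that attains the minimum.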

Assigning Costs to Sequences

The objective cost function maps sequences of costs, possibly infinite, to single real numbers. Options:


  • Set a finite horizon and simply add the costs

  • If the horizon is infinite, i.e. N → ∞, some possibilities are:

    • Discount to prefer earlier costs

    • Average the cost per stage
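
The three options can be compared on a short cost sequence. This is a minimal sketch: the function names, the sample costs, and the discount factor 0.5 are illustrative, and for an infinite horizon the discounted and average forms would be limits rather than finite sums.

```python
def finite_horizon_cost(costs):
    """Finite horizon: simply add the stage costs."""
    return sum(costs)


def discounted_cost(costs, gamma):
    """Discounting: the cost at stage k is weighted by gamma**k."""
    return sum(gamma ** k * c for k, c in enumerate(costs))


def average_cost_per_stage(costs):
    """Average cost per stage over the stages observed so far."""
    return sum(costs) / len(costs)


costs = [1.0, 2.0, 3.0, 4.0]           # hypothetical stage costs
total = finite_horizon_cost(costs)
discounted = discounted_cost(costs, gamma=0.5)
average = average_cost_per_stage(costs)
```

With a discount factor below 1, later costs contribute less, so the discounted total stays bounded whenever the stage costs are bounded.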

MDP Algorithms

Value Iteration

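For each state s, select any initial value J_0(s)

k = 0

while k < maximum iterations

    For each state s, find the action a that minimises the right-hand side of Bellman’s Equation evaluated with the current values J_k, and set J_{k+1}(s) to that minimum

    Then assign μ(s) = a

    k = k+1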
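
A minimal Python sketch of this loop follows. It assumes a discounted cost with factor `gamma` (the slides do not say which cost criterion the algorithm uses), adds an early stop once the values change by less than `tol`, and runs on a made-up two-state problem. All names and numbers are illustrative.

```python
STATES = ("s1", "s2")
ACTIONS = ("stay", "move")
COST = {("s1", "stay"): 2.0, ("s1", "move"): 1.0,
        ("s2", "stay"): 0.0, ("s2", "move"): 1.0}
TRANSITION = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s1": 0.1, "s2": 0.9},
    ("s2", "move"): {"s1": 1.0},
}


def value_iteration(states, actions, cost, transition,
                    gamma=0.9, max_iterations=1000, tol=1e-10):
    J = {s: 0.0 for s in states}        # any initial value J_0(s)
    policy = {}
    for _ in range(max_iterations):
        J_next = {}
        for s in states:
            # Expected cost of each action under the current estimates J_k.
            q = {a: cost[(s, a)] + gamma * sum(
                     p * J[s2] for s2, p in transition[(s, a)].items())
                 for a in actions}
            best = min(q, key=q.get)
            policy[s] = best            # assign mu(s) = a
            J_next[s] = q[best]
        change = max(abs(J_next[s] - J[s]) for s in states)
        J = J_next
        if change < tol:                # assumed stopping rule, not in the slides
            break
    return J, policy


J, policy = value_iteration(STATES, ACTIONS, COST, TRANSITION)
```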


MDP Algorithms

Policy Iteration

Start with a randomly selected initial policy, then refine it repeatedly.

Value Determination: solve the |S| simultaneous Bellman equations for the current policy

Policy Improvement: for each state, if an action exists which reduces the current estimated cost, then change it in the policy.

Each step of Policy Iteration is computationally more expensive than a step of Value Iteration. However, Policy Iteration typically needs fewer steps to converge than Value Iteration.
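
The two phases can be sketched as follows, again assuming a discounted cost with factor `gamma` and a made-up two-state problem. For reproducibility the sketch starts from a fixed initial policy (the first action in every state) instead of a random one. `solve_linear` is a small hand-written Gaussian elimination used only to keep the example free of dependencies.

```python
STATES = ("s1", "s2")
ACTIONS = ("stay", "move")
COST = {("s1", "stay"): 2.0, ("s1", "move"): 1.0,
        ("s2", "stay"): 0.0, ("s2", "move"): 1.0}
TRANSITION = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
    ("s2", "stay"): {"s1": 0.1, "s2": 0.9},
    ("s2", "move"): {"s1": 1.0},
}


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def expected_cost(s, a, J, cost, transition, gamma):
    return cost[(s, a)] + gamma * sum(
        p * J[s2] for s2, p in transition[(s, a)].items())


def policy_iteration(states, actions, cost, transition, gamma=0.9):
    policy = {s: actions[0] for s in states}   # fixed start (assumption)
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    while True:
        # Value determination: solve (I - gamma * P_policy) J = g_policy.
        A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        b = [cost[(s, policy[s])] for s in states]
        for s in states:
            for s2, p in transition[(s, policy[s])].items():
                A[index[s]][index[s2]] -= gamma * p
        values = solve_linear(A, b)
        J = {s: values[index[s]] for s in states}
        # Policy improvement: switch to any strictly cheaper action.
        changed = False
        for s in states:
            q = {a: expected_cost(s, a, J, cost, transition, gamma)
                 for a in actions}
            best = min(q, key=q.get)
            if q[best] < q[policy[s]] - 1e-12:
                policy[s] = best
                changed = True
        if not changed:
            return J, policy


J, policy = policy_iteration(STATES, ACTIONS, COST, TRANSITION)
```

On this toy problem the policy stops changing after a single improvement step, whereas value iteration would need many sweeps to reach the same values to high precision.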

MDPs and PNs

  • MDPs modeled by live Petri nets lead to Average Cost per Stage problems.

  • A policy is equivalent to a trace through the net

  • The aim is to use the finite prefix of an unfolding to derive decentralised Bellman equations, possibly associated with local configurations, together with the communication between interacting parts.

  • Initially we will assume actions and their effects are deterministic.

  • Some work has been done unfolding Petri nets such that concurrent events are statistically independent.