
An Introduction to PO-MDP

Presented by Alp Sardağ

MDP
  • Components:
    • State
    • Action
    • Transition
    • Reinforcement
  • Problem:
    • Choose the action that makes the right tradeoff between immediate rewards and future gains, yielding the best possible solution.
  • Solution:
    • Policy: value function
  • Horizon length
  • Value Iteration:
    • Temporal Difference Learning:

Q(x,a) ← Q(x,a) + α(r + γ max_b Q(y,b) − Q(x,a))

where α is the learning rate and γ is the discount rate.

  • Adding PO to a CO-MDP is not trivial:
    • The CO-MDP solution requires complete observability of the state.
    • Partial observability clouds the current state.
PO-MDP
  • Components:
    • States
    • Actions
    • Transitions
    • Reinforcement
    • Observations
Mapping in CO-MDP & PO-MDP
  • In CO-MDPs, mapping is from states to actions.
  • In PO-MDPs, mapping is from probability distributions (over states) to actions.
VI in CO-MDP & PO-MDP
  • In a CO-MDP,
    • Track our current state
    • Update it after each action
  • In a PO-MDP,
    • Probability distribution over states
    • Perform an action and make an observation, then update the distribution
Belief State and Space
  • Belief State: probability distribution over states.
  • Belief Space: the entire probability space.
  • Example:
    • Assume a two-state PO-MDP.
    • P(s1) = p and P(s2) = 1 − p.
    • The line becomes a hyperplane in higher dimensions.


Belief Transform
  • Assumptions:
    • Finite actions
    • Finite observations
    • Next belief state = T(cbf, a, o), where

cbf: current belief state, a: action, o: observation

  • Finite number of possible next belief states
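The transform T(cbf, a, o) is a Bayes update of the belief. A minimal sketch for a toy two-state model; the transition and observation probability tables are illustrative assumptions:

```python
# Bayes belief update: b'(s') is proportional to O(o|s',a) * sum_s T(s'|s,a) * b(s).
# Toy two-state model; all probability tables below are illustrative assumptions.
STATES = ["s1", "s2"]
T = {("s1", "a1"): {"s1": 0.7, "s2": 0.3},   # T[(s, a)][s2] = P(s2 | s, a)
     ("s2", "a1"): {"s1": 0.4, "s2": 0.6}}
O = {("s1", "a1"): {"z1": 0.9, "z2": 0.1},   # O[(s2, a)][o] = P(o | s2, a)
     ("s2", "a1"): {"z1": 0.2, "z2": 0.8}}

def belief_update(b, a, o):
    """Return the next belief state T(b, a, o)."""
    new_b = {}
    for s2 in STATES:
        pred = sum(T[(s, a)][s2] * b[s] for s in STATES)  # prediction step
        new_b[s2] = O[(s2, a)][o] * pred                  # correction step
    norm = sum(new_b.values())                            # equals P(o | b, a)
    return {s: p / norm for s, p in new_b.items()}

b1 = belief_update({"s1": 0.5, "s2": 0.5}, "a1", "z1")
```

Because there are finitely many (a, o) pairs, each belief state has only finitely many possible successors.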
PO-MDP into continuous CO-MDP
  • The process is Markovian: the next belief state depends only on:
    • Current belief state
    • Current action
    • Observation
  • The discrete PO-MDP problem can be converted into a continuous-space CO-MDP problem where the continuous space is the belief space.
  • VI can then be used in this continuous state space.
  • There is no nice tabular representation as before.
  • Restrictions on the form of the solutions to the continuous space CO-MDP:
    • The finite horizon value function is piecewise linear and convex (PWLC) for every horizon length.
    • The value of a belief point is simply the dot product of the belief vector and a value vector.

GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.

Steps in VI
  • Represent the value function for each horizon as a set of vectors.
    • This overcomes the problem of representing a value function over a continuous space.
  • Find the vector that has the largest dot product with the belief state.
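Picking the vector with the largest dot product can be sketched in a few lines. The value vectors below are arbitrary examples, not taken from the slides:

```python
# V(b) = max over value vectors of <vector, b>. The vectors are illustrative.
vectors = [[1.0, 0.0], [0.0, 1.5], [0.6, 0.6]]

def value(b, vectors):
    """Return the value at belief b and the index of the maximizing vector."""
    dots = [sum(v_i * b_i for v_i, b_i in zip(v, b)) for v in vectors]
    best = max(range(len(dots)), key=dots.__getitem__)
    return dots[best], best

v, i = value([0.25, 0.75], vectors)
```

Taking the maximum over a finite set of linear functions is exactly what makes the value function piecewise linear and convex.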
PO-MDP Value Iteration Example
  • Assumption:
    • Two states
    • Two actions
    • Three observations
  • Ex: horizon length is 1.

b=[0.25 0.75]

a1 a2





  • 0
  • 0 1.5

V(a1,b) = 0.25x1+0.75x0 = 0.25


PO-MDP Value Iteration Example
  • The value of a belief state for horizon length 2 given b,a1,z1:
    • The immediate reward plus the value of the next action.
    • Find best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
PO-MDP Value Iteration Example
  • Find the value for all the belief points given this fixed action and observation.
  • The transformed value function is also PWLC.
PO-MDP Value Iteration Example
  • How to compute the value of a belief state given only the action?
  • The horizon 2 value of the belief state, given that:
    • Values for each observation: z1: 0.7 z2: 0.8 z3: 1.2
    • P(z1| b,a1)=0.6; P(z2| b,a1)=0.25; P(z3| b,a1)=0.15

0.6×0.7 + 0.25×0.8 + 0.15×1.2 = 0.80
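The value given only the action is an expectation over observations: each per-observation value is weighted by the probability of seeing that observation. A quick check with the numbers listed above:

```python
# Horizon-2 value for action a1 given only the belief and action:
# sum over observations z of P(z | b, a1) * V(b, a1, z).
values = {"z1": 0.7, "z2": 0.8, "z3": 1.2}   # best value after each observation
probs = {"z1": 0.6, "z2": 0.25, "z3": 0.15}  # P(z | b, a1)

expected = sum(probs[z] * values[z] for z in values)
```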

Transformed Value Functions
  • Each of these transformed functions partitions the belief space differently.
  • Best next action to perform depends upon the initial belief state and observation.
Best Value For Belief States
  • The value of every single belief point is the sum of:
    • The immediate reward.
    • The line segments from the S() functions for each observation's future strategy.
  • Since adding lines gives you lines, the result is linear.
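Vector addition makes this concrete: a candidate vector for a fixed action is the immediate-reward vector plus one chosen S() segment per observation. All numbers below are illustrative assumptions:

```python
# Build one candidate value vector: immediate reward plus the chosen
# future-strategy segment for each observation. All vectors are toy examples.
immediate = [1.0, 0.0]                      # immediate reward per state
s_segments = {"z1": [0.2, 0.1],             # chosen S() segment per observation
              "z2": [0.0, 0.3],
              "z3": [0.1, 0.1]}

alpha = list(immediate)
for seg in s_segments.values():             # adding lines gives a line
    alpha = [a + s for a, s in zip(alpha, seg)]
# alpha is itself linear in the belief: its value at b is <alpha, b>.
```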
Best Strategy for any Belief Points
  • All the useful future strategies are easy to pick out.
Value Function and Partition
  • For the specific action a1, the value function and corresponding partitions:
Value Function and Partition
  • For the specific action a2, the value function and corresponding partitions:
Which Action to Choose?
  • Put the value functions for each action together to see where each action gives the highest value.
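Combining the per-action value functions reduces action selection to an argmax: for a given belief, score each action by its best vector and pick the winner. The vector sets below are hypothetical examples:

```python
# Each action has its own set of value vectors; the policy at belief b picks
# the action whose best vector scores highest. Vector sets are illustrative.
action_vectors = {"a1": [[1.0, 0.0], [0.8, 0.2]],
                  "a2": [[0.0, 1.5]]}

def best_action(b):
    def best_value(vecs):
        return max(sum(v_i * b_i for v_i, b_i in zip(v, b)) for v in vecs)
    return max(action_vectors, key=lambda a: best_value(action_vectors[a]))

# Beliefs concentrated on s1 favor a1; beliefs concentrated on s2 favor a2.
```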