
An Introduction to PO-MDP

Presented by Alp Sardağ


MDP

  • Components:

    • State

    • Action

    • Transition

    • Reinforcement

  • Problem:

    • choose actions that make the right tradeoff between immediate rewards and future gains, yielding the best possible solution

  • Solution:

    • A policy, derived from a value function
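To make these components concrete, here is a minimal value-iteration sketch; the two-state transition tensor T, reward matrix R, and discount factor gamma are hypothetical stand-ins, not values from the presentation.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP.
# T[a][s][s'] = P(s' | s, a); R[s][a] = immediate reinforcement.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]],     # action a1
              [[0.5, 0.5],
               [0.1, 0.9]]])    # action a2
R = np.array([[1.0, 0.0],
              [0.0, 1.5]])
gamma = 0.9

V = np.zeros(2)
for _ in range(100):
    # Bellman backup: trade immediate reward against discounted future gains.
    Q = R + gamma * np.einsum('ast,t->sa', T, V)
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)   # the solution: a mapping from states to actions
```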


Definition

  • Horizon length

  • Value Iteration:

    • Temporal Difference Learning:

      Q(x,a) ← Q(x,a) + α(r + γ max_b Q(y,b) − Q(x,a))

      where α is the learning rate and γ the discount rate.

  • Adding partial observability (PO) to a CO-MDP is not trivial:

    • Value iteration requires complete observability of the state.

    • PO clouds the current state.
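A minimal sketch of the TD update above, assuming a tabular Q and hypothetical parameter values. Note that the update indexes Q by the true states x and y, which is exactly where complete observability is required.

```python
import numpy as np

def td_update(Q, x, a, r, y, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(x,a) <- Q(x,a) + alpha*(r + gamma*max_b Q(y,b) - Q(x,a))."""
    Q[x, a] += alpha * (r + gamma * Q[y].max() - Q[x, a])
    return Q

Q = np.zeros((2, 2))                      # hypothetical 2-state, 2-action table
Q = td_update(Q, x=0, a=1, r=1.0, y=1)    # one observed transition (x, a, r, y)
```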


PO-MDP

  • Components:

    • States

    • Actions

    • Transitions

    • Reinforcement

    • Observations


Mapping in CO-MDP & PO-MDP

  • In CO-MDPs, mapping is from states to actions.

  • In PO-MDPs, mapping is from probability distributions (over states) to actions.


VI in CO-MDP & PO-MDP

  • In a CO-MDP,

    • Track our current state

    • Update it after each action

  • In a PO-MDP,

    • Probability distribution over states

    • Perform an action and make an observation, then update the distribution


Belief State and Space

  • Belief State: probability distribution over states.

  • Belief Space: the entire probability space.

  • Example:

    • Assume a two-state PO-MDP.

    • P(s1) = p and P(s2) = 1-p.

    • The line becomes a hyperplane in higher dimensions.

[Figure: the belief space of a two-state PO-MDP is the line segment from s1 to s2.]
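As a small sketch of this parameterization (purely illustrative):

```python
import numpy as np

def belief(p):
    """Belief state of a two-state PO-MDP: P(s1) = p, P(s2) = 1 - p."""
    assert 0.0 <= p <= 1.0
    return np.array([p, 1.0 - p])

# Sweeping p over [0, 1] traces out the whole belief space: a line segment.
# With n states the belief space is the (n-1)-dimensional probability simplex.
```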


Belief Transform

  • Assumptions:

    • Finite set of actions

    • Finite set of observations

    • Next belief state b' = T(b, a, o), where b is the current belief state, a the action, and o the observation

  • Finite number of possible next belief states
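A minimal sketch of the belief transform T(b, a, o), assuming array layouts T[a][s][s'] = P(s'|s,a) and O[a][s'][o] = P(o|s',a); the layouts are an assumption, not something fixed by the slides.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Next belief state T(b, a, o) via Bayes' rule:
    b'(s') ∝ P(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    unnormalized = O[a][:, o] * (b @ T[a])
    return unnormalized / unnormalized.sum()
```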


PO-MDP into continuous CO-MDP

  • The process is Markovian: the next belief state depends only on:

    • Current belief state

    • Current action

    • Observation

  • A discrete PO-MDP problem can thus be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.
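Because actions and observations are finite, only finitely many next beliefs are reachable from any b; a sketch reusing the array layouts from the belief-transform sketch above:

```python
def reachable_beliefs(b, T, O):
    """All next belief states from b: at most one per (action, observation) pair."""
    result = []
    for a in range(len(T)):
        for o in range(O[a].shape[1]):
            weights = O[a][:, o] * (b @ T[a])   # predict, then weight by observation
            if weights.sum() > 0:               # skip observations that cannot occur
                result.append((a, o, weights / weights.sum()))
    return result
```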


Problem

  • Using VI in a continuous state space.

  • There is no nice tabular representation as before.


PWLC

  • Restrictions on the form of the solutions to the continuous space CO-MDP:

    • The finite horizon value function is piecewise linear and convex (PWLC) for every horizon length.

    • The value of a belief point is simply the dot product of the belief vector and a segment's coefficient vector.

  Goal: for each iteration of value iteration, find a finite number of linear segments that make up the value function.
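A sketch of such a representation: the value function is a finite set of vectors, and evaluating it at a belief is a max over dot products (the example vectors are illustrative):

```python
import numpy as np

def pwlc_value(b, vectors):
    """Value of belief b: the largest dot product over the vector set."""
    return max(float(np.dot(v, b)) for v in vectors)

vectors = [np.array([1.0, 0.0]), np.array([0.0, 1.5])]   # two linear segments
print(pwlc_value(np.array([0.25, 0.75]), vectors))       # 1.125
```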


Steps in VI

  • Represent the value function for each horizon as a set of vectors.

    • This overcomes the problem of representing a value function over a continuous space.

  • Find the vector that has the largest dot product with the belief state.
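If each vector is tagged with the action that generated it, the same maximization also selects the action; a sketch under that assumption:

```python
import numpy as np

def best_vector(b, vectors, actions):
    """Return the action tagged to the vector with the largest dot product with b."""
    i = max(range(len(vectors)), key=lambda i: float(np.dot(vectors[i], b)))
    return actions[i]
```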


PO-MDP Value Iteration Example

  • Assumption:

    • Two states

    • Two actions

    • Three observations

  • Ex: horizon length is 1.

b = [0.25 0.75]

Rewards (rows: states, columns: actions):

        a1    a2
  s1     1     0
  s2     0   1.5

V(a1, b) = 0.25×1 + 0.75×0 = 0.25

V(a2, b) = 0.25×0 + 0.75×1.5 = 1.125, so a2 is the best action at this belief.
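The same numbers, computed directly from the reward matrix above:

```python
import numpy as np

R = np.array([[1.0, 0.0],     # rows: s1, s2; columns: a1, a2
              [0.0, 1.5]])
b = np.array([0.25, 0.75])

print(b @ R)   # [0.25, 1.125]: V(a1, b) and V(a2, b); a2 wins at this belief
```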


PO-MDP Value Iteration Example

  • The value of a belief state for horizon length 2 given b,a1,z1:

    • The immediate reward plus the value of the next action.

    • Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
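A sketch of that computation, with array layouts as in the belief-transform sketch earlier: update the belief for (a1, z1), then take the best horizon-1 value at the new belief.

```python
def value_after(b, a, o, T, O, vectors):
    """Best achievable horizon-1 value at the belief reached from b via (a, o)."""
    weights = O[a][:, o] * (b @ T[a])
    b_next = weights / weights.sum()
    return max(float(v @ b_next) for v in vectors)
```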


PO-MDP Value Iteration Example

  • Find the value for all the belief points given this fixed action and observation.

  • The transformed value function is also PWLC.


PO-MDP Value Iteration Example

  • How to compute the value of a belief state given only the action?

  • The horizon 2 value of the belief state, given that:

    • Best values for each observation: z1: 0.8, z2: 0.7, z3: 1.2

    • P(z1|b,a1) = 0.6; P(z2|b,a1) = 0.25; P(z3|b,a1) = 0.15

      0.6×0.8 + 0.25×0.7 + 0.15×1.2 = 0.835
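The same expectation in code:

```python
values = [0.8, 0.7, 1.2]    # best value after observing z1, z2, z3
probs = [0.6, 0.25, 0.15]   # P(z1|b,a1), P(z2|b,a1), P(z3|b,a1)

expected = sum(p * v for p, v in zip(probs, values))
print(expected)             # 0.835
```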


Transformed Value Functions

  • Each of these transformed functions partitions the belief space differently.

  • Best next action to perform depends upon the initial belief state and observation.


Best Value For Belief States

  • The value of every single belief point is the sum of:

    • Immediate reward.

    • The line segments from the S() functions for each observation's future strategy.

  • Since adding lines gives lines, the result is linear.
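A sketch of building one such line (an alpha vector) for a fixed action and one chosen future strategy per observation; gamma and the array layouts are assumptions carried over from the earlier sketches.

```python
def build_vector(a, strategy, R, T, O, gamma, prev_vectors):
    """Alpha vector for action a and future strategy {observation o -> prev vector index k}.

    alpha(s) = R(s,a) + gamma * sum_o sum_s' T[a][s][s'] * O[a][s'][o] * prev[strategy[o]](s')
    Each term is linear in the belief, and sums of lines are lines.
    """
    n = len(R)
    alpha = []
    for s in range(n):
        future = sum(T[a][s][s2] * O[a][s2][o] * prev_vectors[k][s2]
                     for o, k in enumerate(strategy)
                     for s2 in range(n))
        alpha.append(R[s][a] + gamma * future)
    return alpha
```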


Best Strategy for any Belief Points

  • All the useful future strategies are easy to pick out:


Value Function and Partition

  • For the specific action a1, the value function and corresponding partitions:


Value Function and Partition

  • For the specific action a2, the value function and corresponding partitions:


Which Action to Choose?

  • Put the value functions for each action together to see where each action gives the highest value.
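A sketch, assuming per-action vector sets built as in the previous slides:

```python
def best_action(b, vector_sets):
    """vector_sets[a] is the list of alpha vectors for action a.
    Pick the action whose best vector gives the highest value at belief b."""
    def action_value(vs):
        return max(sum(v_s * b_s for v_s, b_s in zip(v, b)) for v in vs)
    return max(range(len(vector_sets)), key=lambda a: action_value(vector_sets[a]))
```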


Compact Horizon 2 Value Function


Value Function for Action a1 with a Horizon of 3


Value Function for Action a2 with a Horizon of 3


Value Function for Both Actions with a Horizon of 3


Value Function for Horizon of 3