# An Introduction to PO-MDP


### An Introduction to PO-MDP

Presented by Alp Sardağ

MDP

• Components:

• State

• Action

• Transition

• Reinforcement

• Problem:

• Choose the action that makes the right tradeoff between immediate rewards and future gains, yielding the best possible solution

• Solution:

• Policy: value function

• Horizon length

• Value Iteration:

• Temporal Difference Learning:

Q(x,a) ← Q(x,a) + α(r + γ max_b Q(y,b) − Q(x,a))

where α is the learning rate and γ is the discount rate.

• Adding partial observability (PO) to a CO-MDP is not trivial:

• Value iteration and TD learning require complete observability of the state.

• PO clouds the identity of the current state.

PO-MDP

• Components:

• States

• Actions

• Transitions

• Reinforcement

• Observations

• In CO-MDPs, mapping is from states to actions.

• In PO-MDPs, mapping is from probability distributions (over states) to actions.

• In a CO-MDP,

• Track our current state

• Update it after each action

• In a PO-MDP,

• Probability distribution over states

• Perform an action and make an observation, then update the distribution

• Belief State: probability distribution over states.

• Belief Space: the entire probability space.

• Example:

• Assume a two-state PO-MDP.

• P(s1) = p and P(s2) = 1 − p.

• The belief space is a line; it becomes a hyperplane in higher dimensions.

[Figure: the belief space as a line segment between s1 and s2, parameterized by p]

• Assumption:

• Finite set of actions

• Finite set of observations

• Next belief state = T(cbf, a, o), where cbf: current belief state, a: action, o: observation (a belief-update sketch in code follows this list).

• Finite number of possible next belief states.

• The process is Markovian; the next belief state depends only on:

• Current belief state

• Current action

• Observation
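A minimal Bayes-rule sketch of T(cbf, a, o) in Python; the two-state transition and observation tables below are assumptions made up for illustration, not taken from the slides:

```python
import numpy as np

# Hypothetical two-state, two-action, two-observation model.
T = np.array([[[0.9, 0.1],    # T[a][s][s'] = P(s' | s, a), action a1
               [0.2, 0.8]],
              [[0.5, 0.5],    # ... action a2
               [0.5, 0.5]]])
O = np.array([[[0.7, 0.3],    # O[a][s'][o] = P(o | s', a), action a1
               [0.4, 0.6]],
              [[0.6, 0.4],    # ... action a2
               [0.1, 0.9]]])

def update_belief(b, a, o):
    """b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    predicted = T[a].T @ b            # sum_s b(s) * T(s' | s, a)
    unnorm = O[a][:, o] * predicted   # weight by observation likelihood
    return unnorm / unnorm.sum()      # normalize by P(o | b, a)

b = np.array([0.5, 0.5])
b = update_belief(b, a=0, o=1)        # perform a1, observe o2
```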

• Discrete PO-MDP problem can be converted into a continuous space CO-MDP problem where the continuous space is the belief space.

• Using VI in continuous state space.

• No nice tabular representation as before.

• Restrictions on the form of the solutions to the continuous space CO-MDP:

• The finite horizon value function is piecewise linear and convex (PWLC) for every horizon length.

• The value of a belief point is simply the dot product of the belief vector and the vector representing a linear segment.

GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.

• Represent the value function for each horizon as a set of vectors.

• This overcomes the problem of representing a value function over a continuous space.

• Find the vector that has the largest dot product with the belief state.

[Figure: two value vectors over the belief space; a1 is best in one region, a2 in the other]
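In code, this lookup is a max over dot products; the alpha-vectors below are made-up placeholders:

```python
import numpy as np

# A PWLC value function stored as alpha-vectors tagged with their actions
# (numbers are illustrative placeholders).
alphas = [("a1", np.array([2.0, 0.0])),
          ("a2", np.array([0.0, 2.5])),
          ("a1", np.array([1.0, 1.0]))]

def value_and_action(b):
    """Value of belief b is the largest dot product over all vectors;
    the maximizing vector's action is the best action at b."""
    action, vec = max(alphas, key=lambda av: av[1] @ b)
    return float(vec @ b), action

print(value_and_action(np.array([0.5, 0.5])))  # (1.25, 'a2')
```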

PO-MDP Value Iteration Example

• Assumption:

• Two states

• Two actions

• Three observations

• Example: horizon length is 1.

b = [0.25 0.75]

Immediate rewards R(s, a):

|      | a1 | a2  |
|------|----|-----|
| s1   | 1  | 0   |
| s2   | 0  | 1.5 |

V(a1, b) = 0.25×1 + 0.75×0 = 0.25

V(a2, b) = 0.25×0 + 0.75×1.5 = 1.125
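The same horizon-1 computation checked in Python, with the reward vectors taken from the table above:

```python
import numpy as np

b = np.array([0.25, 0.75])
R = {"a1": np.array([1.0, 0.0]),    # R(s1,a1)=1, R(s2,a1)=0
     "a2": np.array([0.0, 1.5])}    # R(s1,a2)=0, R(s2,a2)=1.5
for a, r in R.items():
    print(a, r @ b)  # a1 -> 0.25, a2 -> 1.125, so a2 is best at b
```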

• The value of a belief state for horizon length 2, given b, a1, z1, is:

• the immediate reward plus the value of the next action.

• Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.

• Find the value for all the belief points given this fixed action and observation.

• The transformed value function is also PWLC.

• How to compute the value of a belief state given only the action?

• The horizon 2 value of the belief state, given that:

• Values for each observation: z1: 0.7 z2: 0.8 z3: 1.2

• P(z1| b,a1)=0.6; P(z2| b,a1)=0.25; P(z3| b,a1)=0.15

V(b | a1) = Σz P(z | b, a1) × V(b, a1, z) = 0.6×0.7 + 0.25×0.8 + 0.15×1.2 = 0.80
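The same weighted sum in Python, pairing each observation's value with its probability as listed above:

```python
# Horizon-2 value of b under a1: each observation's best achievable value,
# weighted by the probability of seeing that observation.
values = {"z1": 0.7, "z2": 0.8, "z3": 1.2}
probs  = {"z1": 0.60, "z2": 0.25, "z3": 0.15}
v = sum(probs[z] * values[z] for z in values)
print(v)  # 0.6*0.7 + 0.25*0.8 + 0.15*1.2 = 0.80
```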

• Each of these transformed functions partitions the belief space differently.

• Best next action to perform depends upon the initial belief state and observation.

Best Value For Belief States

• The value of every single belief point is the sum of:

• Immediate reward.

• The line segments from the S() functions for each observation's future strategy.

• Since adding linear functions gives linear functions, the result is linear.
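A sketch of that construction: for one action, every way of pairing each observation with one of its future strategies yields one candidate vector (immediate reward plus the chosen S(a, z) segments). The helper name and all arrays below are illustrative assumptions:

```python
import numpy as np
from itertools import product

def vectors_for_action(r_a, S_a):
    """r_a: immediate-reward vector for action a.
    S_a[z]: candidate transformed vectors for observation z.
    Returns one candidate vector per combination of future strategies."""
    return [r_a + sum(choice) for choice in product(*S_a)]

r_a1 = np.array([1.0, 0.0])                            # illustrative rewards
S_a1 = [[np.array([0.2, 0.1])],                        # candidates for z1
        [np.array([0.0, 0.3]), np.array([0.1, 0.2])],  # candidates for z2
        [np.array([0.4, 0.0])]]                        # candidates for z3
for v in vectors_for_action(r_a1, S_a1):
    print(v)  # 1 x 2 x 1 = 2 candidate vectors; adding lines gives lines
```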

Best Strategy for Any Belief Point

• All the useful future strategies are easy to pick out:

Value Function and Partition

• For the specific action a1, the value function and corresponding partitions:

Value Function and Partition

• For the specific action a2, the value function and corresponding partitions:

• Put the value functions for each action together to see where each action gives the highest value.

Compact Horizon 2 Value Function

Value Function for Action a1 with a Horizon of 3

Value Function for Action a2 with a Horizon of 3

Value Function for Both Actions with a Horizon of 3

Value Function for Horizon of 3