## An Introduction to PO-MDP


MDP

- Components:
  - State
  - Action
  - Transition
  - Reinforcement
- Problem:
  - Choose the action that makes the right tradeoff between immediate rewards and future gains, to yield the best possible solution.
- Solution:
  - A policy, typically derived from a value function.

Definition

- Horizon length
- Value Iteration
- Temporal Difference Learning:

Q(x,a) ← Q(x,a) + α ( r + γ max_b Q(y,b) − Q(x,a) )

where α is the learning rate and γ is the discount rate.
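A minimal sketch of this tabular update in Python (the storage scheme and names are illustrative assumptions, not from the slides):

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (alpha in the update rule)
GAMMA = 0.9  # discount rate (gamma in the update rule)

Q = defaultdict(float)  # Q[(state, action)] -> estimated value, default 0.0

def td_update(x, a, r, y, actions):
    """Apply Q(x,a) <- Q(x,a) + alpha*(r + gamma*max_b Q(y,b) - Q(x,a))."""
    best_next = max(Q[(y, b)] for b in actions)
    Q[(x, a)] += ALPHA * (r + GAMMA * best_next - Q[(x, a)])
```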

- Adding partial observability (PO) to a CO-MDP is not trivial:
  - The methods above require complete observability of the state.
  - Partial observability clouds the current state.

PO-MDP

- Components:
- States
- Actions
- Transitions
- Reinforcement
- Observations

Mapping in CO-MDP & PO-MDP

- In CO-MDPs, mapping is from states to actions.
- In PO-MDPs, mapping is from probability distributions (over states) to actions.

VI in CO-MDP & PO-MDP

- In a CO-MDP:
  - Track the current state.
  - Update it after each action.
- In a PO-MDP:
  - Maintain a probability distribution over states.
  - Perform an action, make an observation, then update the distribution.

Belief State and Space

- Belief State: a probability distribution over states.
- Belief Space: the entire probability space.
- Example:
  - Assume a two-state PO-MDP.
  - P(s1) = p and P(s2) = 1 − p, so the belief space is the line segment from p = 0 to p = 1.
  - The line becomes a hyperplane in higher dimensions.


Belief Transform

- Assumptions:
  - Finite set of actions.
  - Finite set of observations.
- Next belief state = T(cbf, a, o), where cbf is the current belief state, a is the action, and o is the observation.
- For a given belief state there is therefore only a finite number of possible next belief states (one per action-observation pair); a sketch of the transform follows.
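A sketch of the belief transform T(cbf, a, o) using Bayes' rule; the array layout below (`trans`, `obs`) is an assumption for illustration, not notation from the slides:

```python
import numpy as np

def belief_transform(cbf, a, o, trans, obs):
    """Next belief state T(cbf, a, o) via Bayes' rule.

    cbf      : current belief state, a vector over states
    trans[a] : matrix with trans[a][s, s'] = P(s' | s, a)  (assumed layout)
    obs[a]   : matrix with obs[a][s', o]  = P(o | s', a)   (assumed layout)
    """
    predicted = cbf @ trans[a]          # predict: P(s') = sum_s cbf(s) P(s'|s,a)
    unnorm = predicted * obs[a][:, o]   # correct: weight each s' by P(o|s',a)
    return unnorm / unnorm.sum()        # normalize back to a distribution
```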

PO-MDP into continuous CO-MDP

- The process is Markovian: the next belief state depends only on the:
  - Current belief state
  - Current action
  - Observation
- A discrete PO-MDP problem can thus be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.

Problem

- VI must now run over a continuous state space.
- There is no nice tabular representation as before.

PWLC

- Restrictions on the form of the solutions to the continuous-space CO-MDP:
  - The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
  - The value of a belief point is simply the largest dot product between the belief vector and the value-function vectors.

GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.

Steps in VI

- Represent the value function for each horizon as a set of vectors.
  - This overcomes the problem of representing a value function over a continuous space.
- To evaluate a belief state, find the vector that has the largest dot product with it (see the sketch below).

(Figure: at the belief point shown, the vector for a1 has the largest dot product, so a1 is the best.)
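A sketch of this evaluation step; `vectors` is a hypothetical set of value-function vectors, not data from the slides:

```python
import numpy as np

def value_at(belief, vectors):
    """PWLC value of a belief: the largest dot product with any vector."""
    return max(np.dot(belief, v) for v in vectors)

def best_vector(belief, vectors):
    """Index of the maximizing vector; it identifies the best action's region."""
    return max(range(len(vectors)), key=lambda i: np.dot(belief, vectors[i]))
```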

PO-MDP Value Iteration Example

- Assumptions:
  - Two states
  - Two actions
  - Three observations
- Example: horizon length is 1.

b = [0.25 0.75]

Immediate rewards:

|    | a1 | a2  |
|----|----|-----|
| s1 | 1  | 0   |
| s2 | 0  | 1.5 |

V(a1, b) = 0.25×1 + 0.75×0 = 0.25

V(a2, b) = 0.25×0 + 0.75×1.5 = 1.125

At this belief, a2 gives the higher value.
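The horizon-1 computation above, reproduced as a quick check (values taken from the slide):

```python
import numpy as np

b = np.array([0.25, 0.75])          # belief: P(s1) = 0.25, P(s2) = 0.75
R = {"a1": np.array([1.0, 0.0]),    # immediate rewards in (s1, s2) under a1
     "a2": np.array([0.0, 1.5])}    # immediate rewards in (s1, s2) under a2

for a, r in R.items():
    print(a, float(b @ r))          # a1 -> 0.25, a2 -> 1.125 (a2 is best)
```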

PO-MDP Value Iteration Example

- The value of a belief state for horizon length 2, given b, a1, z1, is:
  - The immediate reward plus the value of the best next action.
- Find the best achievable value for the belief state that results from the initial belief state b when we perform action a1 and observe z1.

PO-MDP Value Iteration Example

- Find the value for all belief points given this fixed action and observation.
- The transformed value function is also PWLC.

PO-MDP Value Iteration Example

- How do we compute the value of a belief state given only the action?
- The horizon-2 value of the belief state is the expectation over observations, given that:
  - Values for each observation: z1: 0.7, z2: 0.8, z3: 1.2
  - P(z1|b,a1) = 0.6; P(z2|b,a1) = 0.25; P(z3|b,a1) = 0.15

0.6×0.7 + 0.25×0.8 + 0.15×1.2 = 0.8
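The same expectation in code, pairing each observation's probability with its value:

```python
p = {"z1": 0.60, "z2": 0.25, "z3": 0.15}  # P(z | b, a1)
v = {"z1": 0.70, "z2": 0.80, "z3": 1.20}  # best value after observing z

value = sum(p[z] * v[z] for z in p)
print(round(value, 2))  # 0.8
```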

Transformed Value Functions

- Each of these transformed functions partitions the belief space differently.
- Best next action to perform depends upon the initial belief state and observation.

Best Value For Belief States

- The value of every single belief point is the sum of:
  - The immediate reward.
  - The line segments from the S() functions for each observation's future strategy.
- Since adding lines gives lines, the result is linear; a sketch of this construction follows.
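A sketch of how the new linear segments arise, under assumed containers: `reward` is the immediate-reward vector for the fixed action, and `S[z]` is a hypothetical list of transformed vectors for observation z (as in the S() functions above). Every choice of one future strategy per observation sums to one candidate line:

```python
import numpy as np
from itertools import product

def candidate_vectors(reward, S):
    """Enumerate reward + one transformed vector per observation."""
    observations = sorted(S)                     # fixed observation order
    candidates = []
    for choice in product(*(S[z] for z in observations)):
        candidates.append(reward + sum(choice))  # sums of lines are lines
    return candidates
```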

Best Strategy for any Belief Points

- All the useful future strategies are easy to pick out.

Value Function and Partition

- For the specific action a1, the value function and corresponding partitions:

Value Function and Partition

- For the specific action a2, the value function and corresponding partitions:

Which Action to Choose?

- Put the value functions for each action together to see where each action gives the highest value, as in the sketch below.
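Putting the per-action value functions together (a sketch; `vector_sets` is a hypothetical mapping from each action to its list of vectors):

```python
import numpy as np

def choose_action(belief, vector_sets):
    """Pick the action whose value function is highest at this belief point."""
    return max(vector_sets,
               key=lambda a: max(np.dot(belief, v) for v in vector_sets[a]))
```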
