This presentation by Alp Sardağ provides an in-depth introduction to Partially Observable Markov Decision Processes (PO-MDPs). It covers the essential components of PO-MDPs, including states, actions, transitions, reinforcement, and observations. The presentation discusses the challenges that arise when complete observability is lost, belief states, value iteration, and temporal difference learning. With examples illustrating value functions and decision-making strategies in discrete and continuous spaces, this resource is invaluable for understanding the complexities of PO-MDPs and their applications in decision-making problems.
An Introduction to PO-MDP Presented by Alp Sardağ
MDP
• Components: state, action, transition, reinforcement.
• Problem: choose the action that makes the right tradeoff between the immediate rewards and the future gains, to yield the best possible solution.
• Solution: a policy, represented by a value function.
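To make the policy/value-function relationship concrete, here is a minimal value-iteration sketch for a fully observable MDP. The sizes, transition table, and rewards are made-up assumptions for illustration, not taken from the slides.

```python
import numpy as np

# Minimal value iteration for a toy fully observable MDP.
# T[s, a, s'] and R[s, a] are random, illustrative numbers.
n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
R = rng.random((n_states, n_actions))                             # R[s, a]

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * np.einsum("sap,p->sa", T, V)  # value of each (state, action) pair
    V_new = Q.max(axis=1)                         # Bellman backup: best action per state
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)  # the policy: the maximizing action in each state
```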
Definition
• Horizon length.
• Value Iteration: V(x) ← max_a [R(x,a) + γ Σ_y P(y|x,a) V(y)].
• Temporal Difference Learning: Q(x,a) ← Q(x,a) + α(r + γ max_b Q(y,b) − Q(x,a)), where α is the learning rate and γ is the discount rate.
• Adding PO to a CO-MDP is not trivial:
• Value iteration requires complete observability of the state.
• Partial observability clouds the current state.
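A minimal sketch of the temporal-difference update above in tabular form; the state/action types and the surrounding learning loop are assumptions for illustration.

```python
# Tabular temporal-difference (Q-learning) update:
#   Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_b Q(y,b) - Q(x,a))
alpha, gamma = 0.1, 0.95   # learning rate and discount rate
Q = {}                     # (state, action) -> estimated value, default 0.0

def q(state, action):
    return Q.get((state, action), 0.0)

def td_update(x, a, r, y, actions):
    """After taking action a in state x, receiving reward r and reaching state y."""
    best_next = max(q(y, b) for b in actions)
    Q[(x, a)] = q(x, a) + alpha * (r + gamma * best_next - q(x, a))
```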
PO-MDP • Components: • States • Actions • Transitions • Reinforcement • Observations
Mapping in CO-MDP & PO-MDP • In CO-MDPs, mapping is from states to actions. • In PO-MDPs, mapping is from probability distributions (over states) to actions.
VI in CO-MDP & PO-MDP
• In a CO-MDP, we track the current state and update it after each action.
• In a PO-MDP, we maintain a probability distribution over states; after each action and observation, we update the distribution.
Belief State and Space
• Belief state: a probability distribution over states.
• Belief space: the entire probability space, i.e. the set of all belief states.
• Example: in a two-state PO-MDP, P(s1) = p and P(s2) = 1 − p, so the belief space is a line segment.
• In higher dimensions, the line becomes a hyperplane.
Belief Transform
• Assumptions: a finite set of actions and a finite set of observations.
• Next belief state = T(cbf, a, o), where cbf is the current belief state, a the action, and o the observation.
• Therefore there is only a finite number of possible next belief states.
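A sketch of the belief transform T(cbf, a, o) as a Bayes filter; the array names and shapes (`trans[s, a, s']`, `obs[a, s', o]`) are assumptions for illustration.

```python
import numpy as np

def belief_update(b, a, o, trans, obs):
    """Next belief state T(cbf, a, o): predict with the transition model,
    then reweight by the observation likelihood and normalize."""
    predicted = b @ trans[:, a, :]           # sum_s b(s) * P(s' | s, a)
    unnormalized = predicted * obs[a, :, o]  # multiply by P(o | s', a)
    return unnormalized / unnormalized.sum()
```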
PO-MDP into continuous CO-MDP
• The process is Markovian: the next belief state depends only on the current belief state, the current action, and the observation.
• A discrete PO-MDP problem can therefore be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.
Problem • Using VI in continuous state space. • No nice tabular representation as before.
PWLC
• Restrictions on the form of the solution to the continuous-space CO-MDP:
• The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
• The value of a belief point is simply the dot product of the belief vector and a value-function vector.
• GOAL: for each iteration of value iteration, find the finite number of linear segments (vectors) that make up the value function.
Steps in VI
• Represent the value function for each horizon as a set of vectors.
• This overcomes the problem of representing a value function over a continuous space.
• To evaluate a belief state, find the vector that has the largest dot product with it.
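A sketch of these steps: the value function is stored as a set of vectors (one per linear segment), and evaluating a belief means taking the largest dot product; the names are illustrative.

```python
import numpy as np

def value_at(belief, vectors):
    # the PWLC value function is the upper surface of finitely many linear segments
    return max(np.dot(belief, v) for v in vectors)

def best_vector(belief, vectors):
    # the maximizing vector also identifies which region of the partition the belief lies in
    return max(vectors, key=lambda v: np.dot(belief, v))
```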
PO-MDP Value Iteration Example
• Assumptions: two states, two actions, three observations.
• Example: horizon length is 1, b = [0.25 0.75].
• Immediate rewards: for a1, [1 0] over (s1, s2); for a2, [0 1.5].
• V(a1,b) = 0.25x1 + 0.75x0 = 0.25
• V(a2,b) = 0.25x0 + 0.75x1.5 = 1.125
• So a2 is the best action at b; the slide's figure labels the regions of the belief space where a1 and a2 are each the best choice.
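The same horizon-1 computation spelled out; the numbers are exactly those on the slide.

```python
import numpy as np

b = np.array([0.25, 0.75])   # belief over [s1, s2]
r_a1 = np.array([1.0, 0.0])  # immediate rewards for a1 in s1 and s2
r_a2 = np.array([0.0, 1.5])  # immediate rewards for a2 in s1 and s2

print(b @ r_a1)  # 0.25  = V(a1, b)
print(b @ r_a2)  # 1.125 = V(a2, b): a2 is the better action at this belief
```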
PO-MDP Value Iteration Example
• The value of a belief state for horizon length 2, given b, a1, z1, is the immediate reward of the action plus the value of the next action.
• Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
PO-MDP Value Iteration Example • Find the value for all the belief points given this fixed action and observation. • The transformed value function is also PWLC.
PO-MDP Value Iteration Example
• How do we compute the value of a belief state given only the action?
• The horizon-2 value of the belief state for action a1 is the value for each observation weighted by the probability of that observation:
• Values for each observation: z1: 0.8, z2: 0.7, z3: 1.2
• P(z1|b,a1) = 0.6; P(z2|b,a1) = 0.25; P(z3|b,a1) = 0.15
• 0.6x0.8 + 0.25x0.7 + 0.15x1.2 = 0.835
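The same weighted sum written out; the observation values and probabilities are the slide's own numbers.

```python
p_obs = {"z1": 0.60, "z2": 0.25, "z3": 0.15}   # P(z | b, a1)
future = {"z1": 0.80, "z2": 0.70, "z3": 1.20}  # best value after updating b on each z

value_a1 = sum(p_obs[z] * future[z] for z in p_obs)
print(round(value_a1, 3))  # 0.835
```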
Transformed Value Functions • Each of these transformed functions partitions the belief space differently. • Best next action to perform depends upon the initial belief state and observation.
Best Value for Belief States
• The value of every single belief point is the sum of:
• the immediate reward, and
• the line segments from the S() functions for each observation's future strategy.
• Since adding lines gives you lines, the result is linear.
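A sketch of that construction: one candidate vector for a fixed action and a fixed choice of future strategy per observation. The array shapes (`trans[s, a, s']`, `obs[a, s', o]`, `R[s, a]`) are assumptions, matching the belief-update sketch above.

```python
import numpy as np

def backup_vector(a, chosen, trans, obs, R, gamma):
    """One candidate alpha-vector: immediate reward for action a plus, for each
    observation o, the transformed future strategy chosen[o] (its S() segment)."""
    alpha = R[:, a].astype(float)
    for o in range(obs.shape[2]):
        alpha += gamma * trans[:, a, :] @ (obs[a, :, o] * chosen[o])
    return alpha  # a sum of linear functions, hence still linear
```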
Best Strategy for Any Belief Point • All the useful future strategies are easy to pick out:
Value Function and Partition • For the specific action a1, the value function and corresponding partitions:
Value Function and Partition • For the specific action a2, the value function and corresponding partitions:
Which Action to Choose? • Put the value functions for each action together to see where each action gives the highest value.
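A sketch of this last step: combine the per-action vector sets and pick the action whose best vector scores highest at the current belief; the names are illustrative.

```python
import numpy as np

def choose_action(belief, vectors_by_action):
    # vectors_by_action maps each action to its list of alpha-vectors
    return max(vectors_by_action,
               key=lambda a: max(np.dot(belief, v) for v in vectors_by_action[a]))
```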