## Partially Observable MDP


*Automated Planning and Decision Making, 2007*

### MDP = Perfect Observation

- Basic assumption: we know the state of the world at each stage.
- In essence: we have perfect sensors.
- Typically we have imperfect sensors, so we can only have partial information about the state.
- When we have imperfect information, we sometimes take actions simply to gain information.

### POMDP

A POMDP is a tuple ⟨S, A, Tr, R, Ω, O⟩:

- S – the state space.
- A – the action set.
- Tr : S×A → Π(S), a probability distribution over S.
  - Tr(s, a, s′) = p – the probability of reaching s′ from s using a (s – the state before, a – the action, s′ – the state after).
- R : S×A → ℝ – the reward for doing a ∈ A in state s ∈ S.
- Ω – the set of possible observations.
- O : S×A → Π(Ω).
  - O(s, a, o) = p – the probability of observing o ∈ Ω after performing a in s; equivalently, the probability of observing o ∈ Ω after doing a and reaching s.

### POMDP: Value of Information

- A robot with a wall sensor starts at one of the positions marked I, each with the same probability.
- Following a move, it senses the walls around it.
- By moving up, the observed wall configuration will be the same for both starting positions.
- By moving down, we get different configurations.

### Solving a POMDP

- As in an MDP, because of uncertainty we need a policy, not a plan.
- But what does the policy depend on? We don't know the state.
- Option 1: history.
  - How much history do we need to remember?
  - How big is our policy?
  - Problem: highly non-uniform, hard to work with.

### POMDP: The History Tree

[Figure: a tree rooted at s0, branching on actions a1, a2 and observations o1, o2, o3 into states s1…s4.] A much harder tree: each node is different from the others, as it is reached through different actions and observations.
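The tuple ⟨S, A, Tr, R, Ω, O⟩ maps directly onto arrays; a minimal sketch with illustrative numbers (the two-state model below is made up for the example, not taken from the slides):

```python
import numpy as np

# A tiny POMDP <S, A, Tr, R, Omega, O> as NumPy arrays.
# Two states, two actions, two observations; all numbers are illustrative.

# Tr[s, a, s'] = the probability of reaching s' from s using a
Tr = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.2, 0.8], [0.8, 0.2]]])

# R[s, a] = the reward for doing a in s
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# O[s', a, o] = the probability of observing o after doing a and reaching s'
O = np.array([[[0.8, 0.2], [0.5, 0.5]],
              [[0.3, 0.7], [0.5, 0.5]]])

# Tr(s, a, .) and O(s', a, .) must each be probability distributions.
assert np.allclose(Tr.sum(axis=2), 1.0)
assert np.allclose(O.sum(axis=2), 1.0)
```

Indexing O by the state reached (s′) follows the second reading on the slide; some formulations condition on the state before the action instead.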
[Figure: observation branches o1…o6.] The history of observations defines a state.

### Option 2: Belief State

- What matters for the future is the current state.
- We don't know what the current state is.
- Instead, we can maintain a probability distribution over the current state.
- This distribution is called the belief state.

### POMDP: Belief-State Tree

[Figure: a tree rooted at belief b0, branching on actions a1, a2 and observations into beliefs b1…b8; each belief is evaluated using the action taken and the observation received.]

- How do we compute the next belief state?

### POMDP: Updating the Belief State

- Let b be the current belief state.
- We calculate b′, the belief state that results from b by applying a and observing o.
- b(s) = the probability of s according to b.

Using Bayes' rule and the law of total probability, Pr(x) = Σ_{y∈Y} Pr(x|y)·Pr(y):

b′(s′) = Pr(s′ | b, a, o) = O(s′, a, o) · Σ_{s∈S} Tr(s, a, s′)·b(s) / Pr(o | b, a)

Pr(o | b, a) is a normalizing factor: ignore it in the calculations and normalize to 1 later.

### POMDP → MDP

We can reduce the POMDP to an MDP over belief states:

- States = belief states.
- Actions = the same actions.
- R(b, a) = Σ_{s∈S} b(s)·R(s, a)
- τ(b, a, b′) = Pr(b′ | a, b) = Σ_{o∈Ω} Pr(b′ | a, o, b)·Pr(o | a, b)

### Belief-State MDP: Value Function

With two states and b(s1) = p, b(s2) = 1−p, the value of an action a is the average p·R(s1, a) + (1−p)·R(s2, a) – a linear function of p. [Figure: the value lines of actions a and a′ run between R(s1, a), R(s2, a) and R(s1, a′), R(s2, a′), crossing at a policy switch point, here at p = 0.3.]

- At every belief state, choose the action that maximizes the value.
- The best value is v*(b).

[Figure: with three states, Pr(s1) = p, Pr(s2) = q, Pr(s3) = 1−p−q, the value of each action is a linear function over the belief simplex.]

- For more than one action, v*(b) is the expected value of the best action.
- The n-step value of a policy tree ρ with first action a: vⁿ_ρ(b) = Σ_{s∈S} b(s)·R(s, a) + δ·Σ_{o∈Ω} Pr(o | b, a)·vⁿ⁻¹_{ρ/o}(b_a^o)

### Belief-State MDP: Policy Trees and α-Vectors

[Figure: a policy tree ρ whose root action is a(ρ).]

- α_a = [R(s1, a), R(s2, a)] in our example.
- α_ρ = a vector of size |S| where each entry holds the value obtained from that state by following ρ.
- v_a(b) = b·α_a
- v_ρ(b) = b·α_ρ
- ρ1…ρn are policies of length m.
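The belief update rule above can be written directly from Tr and O; a minimal sketch (the model numbers are made up, and the arrays follow the indexing Tr[s, a, s′], O[s′, a, o]):

```python
import numpy as np

# Bayes-rule belief update: b'(s') ∝ O(s',a,o) · Σ_s Tr(s,a,s')·b(s).
# The normalizing factor equals Pr(o | b, a).

def belief_update(b, a, o, Tr, O):
    b_pred = b @ Tr[:, a, :]      # prediction: Σ_s b(s)·Tr(s, a, s')
    unnorm = O[:, a, o] * b_pred  # weight by the observation likelihood
    p_o = unnorm.sum()            # Pr(o | b, a) -- the normalizing factor
    return unnorm / p_o, p_o

# Illustrative two-state model (numbers are made up, not from the slides).
Tr = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.2, 0.8], [0.8, 0.2]]])   # Tr[s, a, s']
O  = np.array([[[0.8, 0.2], [0.5, 0.5]],
               [[0.3, 0.7], [0.5, 0.5]]])   # O[s', a, o]

b1, p_o = belief_update(np.array([0.5, 0.5]), a=0, o=0, Tr=Tr, O=O)
# b1 ≈ [0.765, 0.235], p_o ≈ 0.575
```

Normalizing at the end, rather than tracking Pr(o | b, a) term by term, is exactly the "ignore it and normalize to 1 later" shortcut from the slide.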
- P = {α1, …, αn}
- ρ*_P(b) = argmax_{αi∈P} αi·b, i ∈ {1..n}
- v*_P(b) = max_{α∈P} α·b
- v_ρ(b) = Σ_{s∈S} b(s)·r(s, a) + δ·Σ_{o∈Ω} Pr(o | b, a)·v_{ρ/o}(b_a^o), where:
  - Σ_{s∈S} b(s)·r(s, a) is the immediate reward for applying the first action,
  - b_a^o is the belief state that results from applying a at b and observing o,
  - ρ/o is the policy at the subtree of ρ matching o.

### POMDP: Value Iteration for Belief States

- Init: ∀a ∈ A, v¹_a(b) = Σ_{s∈S} b(s)·R(s, a).
- Build a value function for k+1 steps, given the functions for k steps:
  1. Let ρ1…ρn be the depth-k policies that are not dominated by the other depth-k policies (i.e., there exists a belief state b for which they are optimal).
  2. Build all depth-(k+1) policy trees of the form: a root action a ∈ A, with a subtree ρ_i (i ∈ {1..n}) hanging under each observation branch o1…ok.
  3. Calculate v_ρ(b) = Σ_{s∈S} b(s)·r(s, a) + δ·Σ_{o∈Ω} Pr(o | b, a)·v_{ρ/o}(b_a^o) for each of the trees.

### POMDP: Point-Based Value Iteration

- Idea: maintain a fixed-size set α1, …, αn of vectors. Each vector αi matches a belief state bi and an action a(αi).
- v(b) = max_{i∈[1..n]} b·αi
- Advantage: the number of vectors is bounded by n.
- Disadvantage: only an approximation of the optimum.
- The method:
  - The same as value iteration, but initialize the α's to match the optimal actions for b1, …, bn.
  - In the iterative part, build all the trees for step k+1 given the functions α1, …, αn of the kth step:
    - use the value function only from step k;
    - keep only the new policies, and the matching αi, that are optimal for bi.
  - Number of possible trees: |A|·n^|Ω|

### POMDP: Representing the Policy as an Automaton

- An automaton plus an initial state defines a policy.
- The idea:
  - Based on the solution to m vector equations.
  - To every state of the automaton, match a value function represented by the corresponding α-vector.
  - For every state of the automaton, evaluate the best value function under the assumption that after executing the first action, we continue to one of the value functions evaluated above.
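A sketch of one point-based backup under these definitions (the model data, the discount δ = 0.95, and the function names are assumptions for the example): for each action it builds α_a = R(·, a) + δ·Σ_o (best depth-k vector for that (a, o) branch) and keeps the action maximizing b·α_a.

```python
import numpy as np

def value(b, Gamma):
    """v*(b) = max over the alpha vectors of alpha·b; also return the argmax."""
    vals = Gamma @ b
    return vals.max(), int(vals.argmax())

def pbvi_backup(b, Gamma, Tr, R, O, delta=0.95):
    """One point-based backup at belief point b, given depth-k vectors Gamma."""
    n_s, n_a, _ = Tr.shape
    n_o = O.shape[2]
    best_alpha, best_action, best_val = None, None, -np.inf
    for a in range(n_a):
        alpha = R[:, a].astype(float)
        for o in range(n_o):
            # g[s, i] = Σ_{s'} Tr(s,a,s')·O(s',a,o)·alpha_i(s')
            g = Tr[:, a, :] @ (O[:, a, o][:, None] * Gamma.T)
            alpha = alpha + delta * g[:, (b @ g).argmax()]  # best branch for (a, o)
        if b @ alpha > best_val:
            best_alpha, best_action, best_val = alpha, a, b @ alpha
    return best_alpha, best_action

# Illustrative model (made up): two states, two actions, two observations.
Tr = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.2, 0.8], [0.8, 0.2]]])   # Tr[s, a, s']
R  = np.array([[1.0, 0.0],
               [0.0, 2.0]])                 # R[s, a]
O  = np.array([[[0.8, 0.2], [0.5, 0.5]],
               [[0.3, 0.7], [0.5, 0.5]]])   # O[s', a, o]

Gamma0 = R.T.copy()     # depth-1 vectors: alpha_a = [R(s1, a), R(s2, a)]
b = np.array([0.5, 0.5])
alpha, action = pbvi_backup(b, Gamma0, Tr, R, O)
```

Note that the backup only ever evaluates candidate vectors at the fixed points b1, …, bn, which is what bounds the vector set and makes the result an approximation.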