
A Tutorial on the Partially Observable Markov Decision Process and Its Applications


Presentation Transcript


  1. A Tutorial on the Partially Observable Markov Decision Process and Its Applications Lawrence Carin June 7, 2006

  2. Outline • Overview of Markov decision processes (MDPs) • Introduction to partially observable Markov decision processes (POMDPs) • Some applications of POMDPs • Conclusions

  3. • Overview of MDPs • Introduction to the POMDP model • Some applications of POMDPs • Conclusions

  4. Markov decision processes The MDP is defined by the tuple ⟨S, A, T, R⟩ • S is a finite set of states of the world. • A is a finite set of actions. • T: S × A → Π(S) is the state-transition function, giving the probability T(s, a, s') that an action changes the world state from s to s'. • R: S × A → ℝ is the reward for the agent in a given world state after performing an action, R(s, a).
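The slides define the tuple abstractly; as a concrete sketch (illustrative only, not from the tutorial), the components of ⟨S, A, T, R⟩ can be stored as NumPy arrays, with the sizes below chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2  # illustrative sizes, not from the slides

# T[a, s, s2] = Pr(s2 | s, a): each row T[a, s, :] is a distribution over next states.
T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))

# R[s, a] = immediate reward for taking action a in state s.
R = rng.standard_normal((n_states, n_actions))

assert np.allclose(T.sum(axis=-1), 1.0)  # every transition row sums to one
```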

  5. Markov decision processes Two properties of the MDP • The action-dependent state transition is Markovian • The state is fully observable after taking action a [Figure: illustration of an MDP: the agent issues action a to the world, governed by T(s, a, s'), and observes the resulting state s]

  6. Markov decision processes Objective of MDPs • Finding the optimal policy π, mapping state s to action a, in order to maximize the value function V(s).
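The slide states the objective without giving an algorithm; for a discounted infinite-horizon MDP, one standard way to compute V(s) and the optimal policy is value iteration. A minimal sketch using the T and R arrays from the previous snippet (the discount factor gamma is my assumption):

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    # Iterate the Bellman optimality update until the value function converges.
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_s2 T(s, a, s2) V(s2)
        Q = R + gamma * np.einsum('ask,k->sa', T, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # value function and greedy policy
        V = V_new
```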

  7. • Overview of MDPs • Introduction to POMDPs • Some applications of POMDPs • Conclusions

  8. Introduction to POMDPs The POMDP is defined by the tuple ⟨S, A, T, R, Ω, O⟩ • S, A, T, and R are defined the same as in MDPs. • Ω is a finite set of observations through which the agent can experience its world. • O: S × A → Π(Ω) is the observation function, giving the probability O(s', a, o) of making observation o after performing a particular action a and landing in state s'.
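Continuing the earlier array sketch (again illustrative, with arbitrary sizes), the two new POMDP components can be represented the same way:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_obs = 3, 2, 2  # illustrative sizes

# O[a, s2, o] = Pr(o | s2, a): probability of observing o after action a lands in state s2.
O = rng.dirichlet(np.ones(n_obs), size=(n_actions, n_states))

assert np.allclose(O.sum(axis=-1), 1.0)  # every observation row sums to one
```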

  9. Introduction to POMDPs Differences between MDPs and POMDPs • The state is hidden after taking action a. • The hidden state information is inferred from the state- and action-dependent observation function O(s', a, o). [Figure: uncertainty of state s in POMDPs]

  10. Introduction to POMDPs A new concept in POMDPs: the belief state b(s), b_t(s) = Pr(s_t = s | o_1, a_1, o_2, a_2, …, o_{t-1}, a_{t-1}, o_t)

  11. Introduction to POMDPs The belief state b(s) evolves according to Bayes' rule: b'(s') = O(s', a, o) Σ_s T(s, a, s') b(s) / Pr(o | a, b) (1) [Figure: a belief b over states s1, s2, s3 with n control intervals remaining branches into updated beliefs b' = T(b | a, o1) and b' = T(b | a, o2) with n-1 control intervals remaining]
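Equation (1) translates directly into a few lines of code; a minimal sketch using the T and O arrays from the earlier snippets:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    # Bayes rule (1): b'(s') = O(s', a, o) * sum_s T(s, a, s') b(s) / Pr(o | a, b)
    b_pred = b @ T[a]                         # predictive distribution over next states s'
    unnormalized = O[a, :, o] * b_pred
    return unnormalized / unnormalized.sum()  # denominator is Pr(o | a, b)
```

For example, starting from a uniform belief, `belief_update(np.ones(3) / 3, a=0, o=1, T=T, O=O)` returns the posterior belief after taking action 0 and observing 1.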

  12. Introduction to POMDPs Illustration of POMDPs [Figure: the agent issues action a to the world, governed by T(s, a, s') and O(s', a, o), and receives observation o; a state estimator (SE) updates the belief b using (1), and the policy π maps b to the next action] SE: state estimator using (1); π: policy

  13. Introduction to POMDPs Objective of POMDPs • Finding the optimal policy π for POMDPs, mapping belief point b to action a in order to maximize the value function V(b): V(b) = max_a [ Σ_s b(s) R(s, a) + γ Σ_o Pr(o | a, b) V(b') ] (2), where the first term Σ_s b(s) R(s, a) is the expected immediate reward.

  14. Introduction to POMDPs • Piecewise linearity and convexity of the optimal value function for finite-horizon POMDPs: V(b) = max_k Σ_s α_k(s) b(s) = max_k α_k · b (3) [Figure: the optimal value function over the belief simplex, plotted against Pr(s1) from 0 to 1, as the upper surface of linear pieces p1, …, p5, each carrying an associated action a(p)]
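With the α-vectors stacked in a matrix, evaluating (3) at a belief point reduces to a matrix-vector product followed by a max; a small sketch (the array names are mine):

```python
import numpy as np

def value_of_belief(b, alphas):
    # (3): V(b) = max_k alpha_k . b, with alphas of shape (n_vectors, n_states)
    scores = alphas @ b
    k = int(scores.argmax())  # index l of the maximizing alpha-vector, as in (4)
    return scores[k], k
```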

  15. Introduction to POMDPs Substituting (3) and (1) into (2) and maximizing yields the index of the α-vector of belief point b, l = argmax_k α_k · b (4), so the optimal value of belief point b is V(b) = α_l · b.

  16. Introduction to POMDPs Approaches to solving the POMDP problem • Exact algorithms: find all α-vectors for the whole belief space, which is exact but intractable for large problems. • Approximate algorithms: find the α-vectors for a subset of the belief space, which is fast and can handle large problems.

  17. Point-Based Value Iteration Point-based value iteration (PBVI) • focuses on a finite set of belief points • maintains an α-vector for each point [Figure: sampled belief points b0, b1, b3, b4, b5 in the belief simplex]
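As a rough sketch of the PBVI step (my simplification of Pineau et al.'s backup, reusing the T, R, O arrays and discount gamma from the earlier snippets), the backup at a single belief point produces one new α-vector:

```python
import numpy as np

def pbvi_backup(b, alphas, T, R, O, gamma=0.95):
    # Point-based backup: build the best alpha-vector at belief point b.
    n_actions, n_obs = T.shape[0], O.shape[-1]
    best_alpha, best_val = None, -np.inf
    for a in range(n_actions):
        # g[k, o, s] = sum_s2 T(s, a, s2) O(s2, a, o) alpha_k(s2)
        g = np.einsum('sp,po,kp->kos', T[a], O[a], alphas)
        # For each observation, keep the alpha-vector that scores best at b.
        picks = (g @ b).argmax(axis=0)
        alpha_a = R[:, a] + gamma * sum(g[picks[o], o] for o in range(n_obs))
        if alpha_a @ b > best_val:
            best_alpha, best_val = alpha_a, alpha_a @ b
    return best_alpha  # new alpha-vector (and implicitly the action) for b
```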

  18. Region-Based Value Iteration (RBVI) • RBVI maintains an α-vector for each convex region over which the optimal value function is linear. • RBVI simultaneously determines the α-vectors for all relevant convex regions based on all available belief points.

  19. RBVI (Contd) The piecewise linear value function V(b) = max_k α_k · b can be reformulated as V(b) = α_{z(b)} · b by introducing the hidden variable z(b) = k, denoting b ∈ B_k.

  20. RBVI (Contd) The belief space is partitioned into the regions B_k using hyper-ellipsoids, giving the probability that a belief point b belongs to region B_k.

  21. RBVI (Contd) The joint distribution of V(b) and b can then be written as a mixture over the regions B_k, whose parameters are estimated by Expectation-Maximization (EM): the E step computes the posterior over the hidden region assignments z(b), and the M step updates the α-vectors and region parameters.
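The actual E and M updates are specific to RBVI's hyper-ellipsoid model; purely to illustrate the alternation, here is EM for a plain mixture of linear models V(b) ≈ α_k · b with fixed Gaussian noise. This is my stand-in for the idea of fitting one α-vector per region, not the RBVI equations themselves:

```python
import numpy as np

def em_linear_mixture(B, V, K, n_iter=50, sigma2=0.01, seed=0):
    # B: (n, d) belief points; V: (n,) value targets; K: number of linear pieces.
    rng = np.random.default_rng(seed)
    n, d = B.shape
    alphas = rng.standard_normal((K, d))
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: responsibility of piece k for each (b, V(b)) pair.
        resid = V[:, None] - B @ alphas.T        # (n, K) residuals
        logp = np.log(pi) - 0.5 * resid**2 / sigma2
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M step: weighted least squares for each alpha_k; update mixture weights.
        for k in range(K):
            W = r[:, k]
            A = (B.T * W) @ B + 1e-6 * np.eye(d)  # regularized normal equations
            alphas[k] = np.linalg.solve(A, B.T @ (W * V))
        pi = r.mean(axis=0)
    return alphas, pi
```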

  22. • Overview of MDPs • Introduction to the POMDP model • Some applications of POMDPs • Conclusions

  23. Applications of POMDPs • Application of partially observable Markov decision processes to robot navigation in a minefield • Application of partially observable Markov decision processes to feature selection • Application of partially observable Markov decision processes to sensor scheduling

  24. Applications of POMDPs Some considerations in applying POMDPs to new problems • How to define the state • How to obtain the transition and observation matrices • How to set the reward

  25. References • L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, vol. 101, 1998. • R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21:1071–1088, 1973. • J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, Aug. 2003. • D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol. 1 & 2. Athena Scientific, Belmont, Massachusetts, 2001. • R. Bellman. Dynamic Programming. Princeton University Press, 1957.
