
MDP Presentation CS594 Automated Optimal Decision Making


Presentation Transcript


  1. MDP Presentation: CS594 Automated Optimal Decision Making. Sohail M Yousof, Advanced Artificial Intelligence

  2. Topic Planning and Control in Stochastic Domains With Imperfect Information

  3. Objective • Markov Decision Processes (Sequences of decisions) • Introduction to MDPs • Computing optimal policies for MDPs

  4. Markov Decision Process (MDP) • Sequential decision problems under uncertainty • Not just the immediate utility, but the longer-term utility as well • Uncertainty in outcomes • Roots in operations research • Also used in economics, communications engineering, ecology, performance modeling and of course, AI! • Also referred to as stochastic dynamic programs

  5. Markov Decision Process (MDP) • Defined as a tuple: <S, A, P, R> • S: State • A: Action • P: Transition function • Table P(s’| s, a), prob of s’ given action “a” in state “s” • R: Reward • R(s, a) = cost or reward of taking action a in state s • Choose a sequence of actions (not just one decision or one action) • Utility based on a sequence of decisions
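To make the tuple concrete, here is a minimal sketch (not part of the original slides) of an MDP <S, A, P, R> as plain Python dictionaries; the states, actions, probabilities, and rewards are illustrative placeholders.

    # Minimal sketch of an MDP tuple <S, A, P, R> as plain Python data.
    # All states, actions, probabilities, and rewards below are placeholders.
    S = ["s0", "s1"]                      # state space
    A = ["a0", "a1"]                      # action space
    P = {                                 # P[(s, a)][s'] = Pr(s' | s, a)
        ("s0", "a0"): {"s0": 0.2, "s1": 0.8},
        ("s0", "a1"): {"s0": 1.0},
        ("s1", "a0"): {"s1": 1.0},
        ("s1", "a1"): {"s0": 0.5, "s1": 0.5},
    }
    R = {                                 # R[(s, a)] = reward of taking a in s
        ("s0", "a0"): -0.04, ("s0", "a1"): -0.04,
        ("s1", "a0"): 1.0,   ("s1", "a1"): -0.04,
    }
    mdp = (S, A, P, R)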

  6. Example: What SEQUENCE of actions should our agent take? • Each action costs -1/25 • Agent can take action N, E, S, W • Faces uncertainty in every state • [Figure: a 3-row by 4-column grid with a Start cell, a blocked cell, a cell with reward +1 and a cell with reward -1; the intended move (e.g., N) succeeds with probability 0.8 and slips to each side with probability 0.1]

  7. MDP Tuple: <S, A, P, R> • S: state of the agent on the grid, e.g., (4,3) • Note that cells are denoted by (x, y) • A: actions of the agent, i.e., N, E, S, W • P: transition function • Table P(s'| s, a), the probability of s' given action "a" in state "s" • E.g., P((4,3) | (3,3), N) = 0.1 • E.g., P((3,2) | (3,3), N) = 0.8 • (Robot movement, uncertainty of another agent's actions, …) • R: reward (more comments on the reward function later) • R((3,3), N) = -1/25 • R((4,1)) = +1
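For the grid example, the 0.8/0.1/0.1 movement model can be sketched as a transition function like the one below; the grid dimensions, the blocked-cell location, the wall handling, and the coordinate convention (which way N moves) are assumptions made for illustration and may not match the slide's figure exactly.

    # Hedged sketch of the noisy movement model: the intended direction succeeds
    # with probability 0.8 and the agent slips to each perpendicular direction
    # with probability 0.1. Grid size, blocked cell, and wall behaviour are
    # assumptions for illustration.
    MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
    PERP = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}
    BLOCKED = {(2, 2)}                    # assumed location of the blocked cell
    WIDTH, HEIGHT = 4, 3

    def next_cell(s, direction):
        """Deterministic move; bumping into a wall or the blocked cell stays put."""
        x, y = s
        dx, dy = MOVES[direction]
        nx, ny = x + dx, y + dy
        if not (1 <= nx <= WIDTH and 1 <= ny <= HEIGHT) or (nx, ny) in BLOCKED:
            return s
        return (nx, ny)

    def transition_prob(s_next, s, a):
        """P(s' | s, a) under the 0.8 / 0.1 / 0.1 model."""
        prob = 0.0
        for direction, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
            if next_cell(s, direction) == s_next:
                prob += p
        return prob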

  8. Terminology • Before describing policies, let's go through some terminology • This terminology is useful throughout this set of lectures • Policy: a complete mapping from states to actions

  9. MDP Basics and Terminology An agent must make a decision or control a probabilistic system • Goal is to choose a sequence of actions for optimality • Defined as <S, A, P, R> • MDP models: • Finite horizon: Maximize the expected reward for the next n steps • Infinite horizon: Maximize the expected discounted reward. • Transition model: Maximize average expected reward per transition. • Goal state: maximize expected reward (minimize expected cost) to some target state G.
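As a small illustrative aside (not from the slides), the finite-horizon and infinite-horizon discounted criteria differ only in how a reward sequence is aggregated; the reward values and the discount factor below are placeholders.

    # Placeholder reward sequence collected over four steps.
    rewards = [-0.04, -0.04, -0.04, 1.0]

    # Finite horizon: plain sum of the next n rewards.
    finite_horizon_return = sum(rewards)

    # Infinite horizon: geometrically discounted sum with gamma in (0, 1).
    gamma = 0.9
    discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))

    print(finite_horizon_return, discounted_return)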

  10. Reward Function • According to Chapter 2, the reward is directly associated with the state • Denoted R(I) • This simplifies the computations in the algorithms presented later • Sometimes the reward is assumed to be associated with a state-action pair • R(S, A) • We could also assume a mix of R(S, A) and R(S) • Sometimes the reward is associated with a (state, action, destination-state) triple • R(S, A, J) • R(S, A) = Σ_J R(S, A, J) * P(J | S, A)
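The last bullet is just an expectation over the destination state; a tiny sketch (with hypothetical table names) is:

    # R(S, A) as the expectation of R(S, A, J) over the destination state J:
    # R(S, A) = sum_J P(J | S, A) * R(S, A, J). Tables are placeholders.
    P_table = {("s0", "a0"): {"s0": 0.2, "s1": 0.8}}
    R_saj = {("s0", "a0"): {"s0": -1.0, "s1": 2.0}}

    def R_sa(s, a):
        return sum(P_table[(s, a)][j] * R_saj[(s, a)][j] for j in P_table[(s, a)])

    # R_sa("s0", "a0") == 0.2 * -1.0 + 0.8 * 2.0 == 1.4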

  11. Markov Assumption • Markov Assumption: transition probabilities (and rewards) from any given state depend only on that state and not on the previous history • Where you end up after an action depends only on the current state • Named after the Russian mathematician A. A. Markov (1856-1922) • (He did not come up with Markov decision processes, however) • Transitions from state (1,2) do not depend on the prior states (1,1) or (1,2)

  12. MDPs vs POMDPs • Accessibility: the agent's percepts in any given state identify the state it is in, e.g., state (4,3) vs (3,3) • Given the observations, the state is uniquely determined • Hence, we will not explicitly consider observations, only states • Inaccessibility: the agent's percepts in any given state DO NOT identify the state it is in, e.g., it may be (4,3) or (3,3) • Given the observations, the state is not uniquely determined • POMDP: Partially Observable MDP, for inaccessible environments • We will focus on MDPs in this presentation.

  13. MDP vs POMDP • [Figure: block diagrams contrasting the two loops; in the MDP the agent observes the world's states directly and issues actions, while in the POMDP the agent receives observations and maintains a belief state b via a state estimator SE before its policy P selects actions]

  14. Stationary and Deterministic Policies • Policy denoted by the symbol π

  15. Policy • A policy is like a plan, but not quite • Certainly, it is generated ahead of time, like a plan • Unlike traditional plans, it is not a sequence of actions that an agent must execute • If there are failures in execution, the agent can continue to execute the policy • It prescribes an action for all the states • It maximizes expected reward, rather than just reaching a goal state

  16. MDP Problem The MDP problem consists of: • finding the optimal control policy for all possible states; • finding the sequence of optimal control functions for a specific initial state; • finding the best control action (decision) for a specific state.

  17. Non-Optimal vs Optimal Policy • [Figure: the 3x4 grid world with the Start, +1, and -1 cells, with candidate policies drawn in different colors] • Choose the Red policy or the Yellow policy? • Choose the Red policy or the Blue policy? Which is optimal (if any)? • Value iteration: one popular algorithm to determine the optimal policy

  18. Value Iteration: Key Idea • Iterate: update the utility of state "I" using the old utility of the neighbor states "J", given actions "A" • U_t+1(I) = max_A [ R(I,A) + Σ_J P(J|I,A) * U_t(J) ] • P(J|I,A): probability of J if A is taken in state I • max_A F(A) returns the highest F(A) • Both the immediate reward and the longer-term reward are taken into account

  19. Value Iteration: Algorithm • Initialize: U_0(I) = 0 • Iterate: U_t+1(I) = max_A [ R(I,A) + Σ_J P(J|I,A) * U_t(J) ] • Until close-enough(U_t+1, U_t) • At the end of the iteration, calculate the optimal policy: Policy(I) = argmax_A [ R(I,A) + Σ_J P(J|I,A) * U_t+1(J) ]
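A minimal value-iteration sketch matching slides 18 and 19 is shown below; the dictionary format for P and R follows the earlier sketch, and the discount factor gamma is an assumption (the slides' update has no discount, which converges only for proper, terminating MDPs).

    # Value iteration as on slides 18-19. P[(s, a)][s'] and R[(s, a)] use the
    # hypothetical dictionary format sketched earlier; gamma is an assumption.
    def value_iteration(S, A, P, R, gamma=0.9, eps=1e-6):
        U = {s: 0.0 for s in S}                       # U_0(I) = 0
        while True:
            U_next = {
                s: max(R[(s, a)] + gamma * sum(p * U[j] for j, p in P[(s, a)].items())
                       for a in A)
                for s in S
            }
            close_enough = max(abs(U_next[s] - U[s]) for s in S) < eps
            U = U_next
            if close_enough:
                break
        # Optimal policy: argmax over actions of the one-step lookahead.
        policy = {
            s: max(A, key=lambda a: R[(s, a)]
                   + gamma * sum(p * U[j] for j, p in P[(s, a)].items()))
            for s in S
        }
        return U, policy

Calling value_iteration(S, A, P, R) on the toy MDP sketched after slide 5 would return the utilities and a greedy policy for that placeholder model.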

  20. Forward Method for Solving an MDP: Decision Tree

  21. Markov Chain • Given a fixed policy, you get a Markov chain from the MDP • Markov chain: the next state depends only on the previous state • Next state: not dependent on the action (there is only one action per state) • Next state: history dependence only via the previous state • P(S_t+1 | S_t, S_t-1, S_t-2, ...) = P(S_t+1 | S_t) • How do we evaluate the Markov chain? • Could we try simulations? (see the sketch below) • Are there other, more sophisticated methods around?
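To make the "could we try simulations?" bullet concrete, here is a hedged Monte Carlo sketch that evaluates the Markov chain induced by a fixed policy; the table format, function names, discount factor, and horizon are assumptions.

    import random

    # Monte Carlo evaluation of the Markov chain induced by a fixed policy.
    # P[(s, a)][s'] and R[(s, a)] use the same hypothetical format as above.
    def simulate_return(P, R, policy, start, gamma=0.9, horizon=100):
        s, total, discount = start, 0.0, 1.0
        for _ in range(horizon):
            a = policy[s]                 # the fixed policy leaves one action per state
            total += discount * R[(s, a)]
            discount *= gamma
            dist = P[(s, a)]
            s = random.choices(list(dist), weights=list(dist.values()))[0]
        return total

    def monte_carlo_value(P, R, policy, start, episodes=1000):
        """Average the sampled returns to estimate the value of the start state."""
        return sum(simulate_return(P, R, policy, start)
                   for _ in range(episodes)) / episodes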

  22. Influence Diagram

  23. Expanded Influence Diagram

  24. Relation between time & steps-to-go

  25. Decision Tree

  26. Dynamic Construction of the Decision Tree
      Incremental-expansion(MDP, γ, sI, ε, VL, VU):
          initialize tree T with sI and ubound(sI), lbound(sI) using VL, VU;
          repeat until (a single action remains for sI, or ubound(sI) - lbound(sI) <= ε):
              call Improve-tree(T, MDP, γ, VL, VU);
          return the action with the greatest lower bound as the result.
      Improve-tree(T, MDP, γ, VL, VU):
          if root(T) is a leaf then
              expand root(T);
              set bounds lbound, ubound of the new leaves using VL, VU;
          else
              for all decision subtrees T' of T do call Improve-tree(T', MDP, γ, VL, VU);
          recompute bounds lbound(root(T)), ubound(root(T)) for root(T);
          when root(T) is a decision node, prune suboptimal action branches from T;
          return.
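Below is a hedged Python skeleton of the incremental-expansion / Improve-tree scheme above; the tree layout, the Bellman-style bound backups, and all helper names are assumptions made for illustration, with V_L and V_U supplied by the caller as lower and upper bounds on the optimal value function and the MDP given in the dictionary format sketched earlier.

    # Hedged skeleton of the incremental-expansion / Improve-tree scheme above.
    class Node:
        def __init__(self, state):
            self.state = state
            self.children = {}        # action -> list of (probability, child Node)
            self.lbound = self.ubound = None

    def q_value(node, mdp, gamma, a, bound):
        """Bellman-style backup of a child bound through action a."""
        _, _, _, R = mdp
        return R[(node.state, a)] + gamma * sum(p * bound(c) for p, c in node.children[a])

    def improve_tree(node, mdp, gamma, V_L, V_U):
        _, A, P, _ = mdp
        if not node.children:                         # leaf: expand it
            for a in A:
                node.children[a] = [(p, Node(j)) for j, p in P[(node.state, a)].items()]
            for outcomes in node.children.values():   # bound the new leaves
                for _, child in outcomes:
                    child.lbound, child.ubound = V_L(child.state), V_U(child.state)
        else:                                         # recurse on all decision subtrees
            for outcomes in node.children.values():
                for _, child in outcomes:
                    improve_tree(child, mdp, gamma, V_L, V_U)
        # Recompute this decision node's bounds, then prune dominated actions.
        node.lbound = max(q_value(node, mdp, gamma, a, lambda c: c.lbound) for a in node.children)
        node.ubound = max(q_value(node, mdp, gamma, a, lambda c: c.ubound) for a in node.children)
        for a in [a for a in node.children
                  if q_value(node, mdp, gamma, a, lambda c: c.ubound) < node.lbound]:
            del node.children[a]

    def incremental_expansion(mdp, gamma, s_I, eps, V_L, V_U):
        root = Node(s_I)
        root.lbound, root.ubound = V_L(s_I), V_U(s_I)
        while True:                                   # repeat until converged
            improve_tree(root, mdp, gamma, V_L, V_U)
            if len(root.children) == 1 or root.ubound - root.lbound <= eps:
                break
        # Return the action with the greatest lower bound at the root.
        return max(root.children,
                   key=lambda a: q_value(root, mdp, gamma, a, lambda c: c.lbound))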

  27. Incremental Expansion Function: Basic Method for the Dynamic Construction of the Decision Tree • [Flowchart: start with inputs MDP, γ, sI, ε, VL, VU; initialize the leaf node of the partially built decision tree; repeatedly call Improve-tree(T, MDP, γ, ε, VL, VU) until a single action remains for sI or ubound(sI) - lbound(sI) <= ε; then return and terminate]

  28. Compute Decisions Using Bound Iteration • (This slide repeats the Incremental-expansion / Improve-tree pseudocode from slide 26.)

  29. Incremental Expansion Function: Basic Method for the Dynamic Construction of the Decision Tree • (This slide repeats the flowchart from slide 27.)

  30. Solving Large MDP Problems

  31. If You Want to Read More on MDPs • Book: Martin L. Puterman, Markov Decision Processes, Wiley Series in Probability • Available on Amazon.com
