
Chapter 17 – Making Complex Decisions


Presentation Transcript


  1. Chapter 17 – Making Complex Decisions March 30, 2004

  2. 17.1 Sequential Decision Problems • Sequential decision problems vs. episodic ones • Fully observable environment • Stochastic actions • Figure 17.1

  3. Markov Decision Process (MDP) • S0. Initial state. • T(s, a, s'). Transition model: the probability of reaching state s' when action a is taken in state s. Over all states and actions this forms a 3-D table of probabilities. Markovian: the probability depends only on s, not on earlier states. • R(s). Reward associated with state s (the short-term reward for being in s).
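
One minimal way to hold these three components in code, sketched in Python; the state names, actions, probabilities, and rewards below are made-up placeholders, not taken from the slides:

```python
# Hypothetical 3-state MDP. T maps (state, action) to a distribution over
# successor states; R maps each state to its short-term reward.
S0 = 'A'                                  # initial state
T = {
    ('A', 'move'): {'B': 0.8, 'A': 0.2},  # stochastic outcome of 'move' in A
    ('B', 'move'): {'C': 1.0},
    ('C', 'stay'): {'C': 1.0},
}
R = {'A': -0.04, 'B': -0.04, 'C': 1.0}
```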

  4. Policy, π • π(s). The action recommended in state s; a complete policy is the solution to the MDP. • π*(s). Optimal policy: the policy that yields the maximum expected utility (MEU) over environment histories. • Figure 17.2
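
In code, a policy is simply a mapping from states to actions; the entries below are illustrative only:

```python
# A hypothetical policy for the MDP sketched above: one action per state.
pi = {'A': 'move', 'B': 'move', 'C': 'stay'}

def act(state):
    """Return the action the policy recommends in the given state."""
    return pi[state]
```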

  5. Utility Function (long-term reward, in contrast to the short-term R(s)) • Additive. Uh([s0, s1, s2, …]) = R(s0) + R(s1) + R(s2) + … • Discounted. Uh([s0, s1, s2, …]) = R(s0) + γR(s1) + γ²R(s2) + …, where 0 ≤ γ ≤ 1 • γ is the discount factor (gamma)
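
A short sketch of both utility definitions applied to a finite prefix of a state sequence; the reward values and γ below are arbitrary:

```python
def additive_utility(rewards):
    """Uh([s0, s1, ...]) = R(s0) + R(s1) + ..."""
    return sum(rewards)

def discounted_utility(rewards, gamma=0.9):
    """Uh([s0, s1, ...]) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(additive_utility([-0.2, -0.2, 1.0]))               # ≈ 0.6
print(discounted_utility([-0.2, -0.2, 1.0], gamma=0.9))  # -0.2 - 0.18 + 0.81 ≈ 0.43
```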

  6. Discounted Rewards Are Better • With purely additive rewards, infinite state sequences can have utility +infinity or −infinity. • Fixes: bound rewards by Rmax and use γ < 1; or guarantee that the agent will reach a terminal (goal) state; or use average reward per step as the basis for comparison.
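
One way to see why bounding rewards by Rmax and using γ < 1 helps: the discounted utility of any infinite sequence is then at most Rmax / (1 − γ), by the geometric series. A quick numerical check with arbitrary values:

```python
Rmax, gamma = 1.0, 0.9
bound = Rmax / (1 - gamma)                          # geometric-series bound ≈ 10
prefix = sum(gamma**t * Rmax for t in range(1000))  # long but finite prefix
print(prefix, '<=', bound)                          # ≈ 9.9999 <= 10
```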

  7. 17.2 Value Iteration • MEU principle. π*(s) = argmax_a ∑_s' T(s, a, s') * U(s') • Bellman equation. U(s) = R(s) + γ max_a ∑_s' T(s, a, s') * U(s') • Bellman update. Ui+1(s) = R(s) + γ max_a ∑_s' T(s, a, s') * Ui(s')
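
A compact value-iteration sketch built on the Bellman update, using T and R dictionaries shaped like the MDP sketch earlier; the actions(s) lookup and the stopping tolerance eps are assumptions of this sketch:

```python
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Repeat Ui+1(s) = R(s) + gamma * max_a sum_s' T(s, a, s') * Ui(s')
    until the largest change falls below eps."""
    U = {s: 0.0 for s in states}
    while True:
        U_next, delta = {}, 0.0
        for s in states:
            best = max(sum(p * U[s2] for s2, p in T[(s, a)].items())
                       for a in actions(s))
            U_next[s] = R[s] + gamma * best
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < eps:
            return U
```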

  8. Example • S0 = A • T(A, move, B) = 1 • T(B, move, C) = 1 • R(A) = -0.2 • R(B) = -0.2 • R(C) = 1 • Figure 17.4
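
Running the value_iteration sketch above on this example, treating C as absorbing under a hypothetical 'stay' action (if C were instead a terminal state whose reward is collected once, the numbers would differ):

```python
T = {('A', 'move'): {'B': 1.0},
     ('B', 'move'): {'C': 1.0},
     ('C', 'stay'): {'C': 1.0}}
R = {'A': -0.2, 'B': -0.2, 'C': 1.0}
actions = lambda s: ['move'] if s in ('A', 'B') else ['stay']

U = value_iteration(['A', 'B', 'C'], actions, T, R, gamma=0.9)
# With gamma = 0.9: U(C) = 1 / (1 - 0.9) = 10, U(B) = -0.2 + 0.9 * 10 = 8.8,
# U(A) = -0.2 + 0.9 * 8.8 = 7.72
print(U)
```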

  9. Value Iteration • Convergence guaranteed: with γ < 1 the Bellman update is a contraction, so repeated updates always converge. • Unique solution guaranteed: only one set of utilities satisfies the Bellman equations.

  10. 17.3 Policy Iteration Figure 17.7 • Policy evaluation. Given a policy πi, calculate the utility of each state under that policy. • Policy improvement. Use a one-step lookahead to calculate a better policy, πi+1 (see the sketch below).
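
The improvement step picks, for each state, the action that maximises expected utility under the current utility estimates; a sketch, reusing the T dictionary layout and actions(s) lookup assumed earlier:

```python
def policy_improvement(states, actions, T, U):
    """pi_{i+1}(s) = argmax_a sum_s' T(s, a, s') * U(s'): the one-step lookahead."""
    return {s: max(actions(s),
                   key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)].items()))
            for s in states}
```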

  11. Bellman Equation (Standard Policy Iteration) • Old version. U(s) = R(s) + γ max_a ∑_s' T(s, a, s') * U(s') • New version (policy fixed, so no max). Ui(s) = R(s) + γ ∑_s' T(s, πi(s), s') * Ui(s') • These are linear equations: with n states there are n equations in n unknowns, solvable exactly in about O(n³) time.
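
Because the max is gone, policy evaluation can be done with one linear solve; a sketch using numpy, with the state ordering and dictionary layout assumed as before:

```python
import numpy as np

def policy_evaluation_exact(states, pi, T, R, gamma=0.9):
    """Solve U = R + gamma * T_pi U, i.e. (I - gamma * T_pi) U = R, for a fixed policy."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    T_pi = np.zeros((n, n))
    for s in states:
        for s2, p in T[(s, pi[s])].items():
            T_pi[idx[s], idx[s2]] = p
    R_vec = np.array([R[s] for s in states])
    U = np.linalg.solve(np.eye(n) - gamma * T_pi, R_vec)
    return {s: U[idx[s]] for s in states}
```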

  12. Bellman Update (Modified Policy Iteration) • Old method. Ui+1(s) = R(s) + γ max_a ∑_s' T(s, a, s') * Ui(s') • New method. Ui+1(s) = R(s) + γ ∑_s' T(s, πi(s), s') * Ui(s') • Run k of these simplified updates to get an estimate of the utilities; this is called modified policy iteration.
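
A sketch of that approximate evaluation step, which runs k simplified updates instead of the exact linear solve; the value of k and the starting utilities U are arbitrary choices here:

```python
def approximate_policy_evaluation(states, pi, T, R, U, gamma=0.9, k=20):
    """Run k updates Ui+1(s) = R(s) + gamma * sum_s' T(s, pi(s), s') * Ui(s')."""
    for _ in range(k):
        U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in T[(s, pi[s])].items())
             for s in states}
    return U
```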

  13. 17.4 Partially Observable MDPs (POMDPs) • Observation model. O(s, o) gives the probability of perceiving observation o in state s. • Belief state, b. A probability distribution over states, e.g. <.5, .5, 0> in a 3-state environment. A mechanism is needed to update b after each action and observation (see the sketch below). • The optimal action depends only on the agent's current belief state!
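
A sketch of such an update mechanism, implementing b'(s') ∝ O(s', o) ∑_s T(s, a, s') b(s); the dictionary layout and the final normalisation step are implementation choices of this sketch:

```python
def update_belief(b, a, o, states, T, O):
    """New belief over s' after taking action a and perceiving observation o."""
    b_new = {}
    for s2 in states:
        b_new[s2] = O[(s2, o)] * sum(T[(s, a)].get(s2, 0.0) * b[s] for s in states)
    total = sum(b_new.values())
    return {s: p / total for s, p in b_new.items()}   # normalise to sum to 1
```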

  14. Algorithm • Given b, execute action a = π*(b). • Receive observation o. • Update b; the belief state lives in a continuous, multi-dimensional space. • Repeat.
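
The loop in code, reusing the update_belief sketch above; pi_star, execute, and receive_observation are placeholder names standing in for the optimal belief-space policy and the environment interface:

```python
def pomdp_agent(b, states, T, O, pi_star, execute, receive_observation):
    """Act on the current belief state and fold each new observation back into it."""
    while True:
        a = pi_star(b)                          # a = pi*(b)
        execute(a)
        o = receive_observation()
        b = update_belief(b, a, o, states, T, O)
```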
