ECE-517: Reinforcement Learning in Artificial Intelligence, Lecture 6: Optimality Criterion in MDPs

ECE-517: Reinforcement Learning in Artificial Intelligence Lecture 6: Optimality Criterion in MDPs. September 8, 2011. Dr. Itamar Arel College of Engineering Department of Electrical Engineering and Computer Science The University of Tennessee Fall 2011. Outline.


Presentation Transcript

ECE-517: Reinforcement Learning in Artificial Intelligence

Lecture 6: Optimality Criterion in MDPs

September 8, 2011

Dr. Itamar Arel

College of Engineering

Department of Electrical Engineering and Computer Science

The University of Tennessee

Fall 2011


Outline

  • Optimal value functions (cont.)

  • Implementation considerations

  • Optimality and approximation

Recap on Value Functions

  • We define the state-value function for policy π as

  • Similarly, we define the action-value function for policy π as

  • The Bellman equation

  • The value function V^π(s) is the unique solution to its Bellman equation
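
The three quantities named in this recap can be written out as follows. This is a sketch in the notation of Sutton and Barto's textbook, which the recap appears to follow; the symbols R_t (return), γ (discount rate), π(s,a) (probability of action a in state s), and the transition and expected-reward terms are assumptions taken from that convention, not from the slide text.

```latex
\begin{align*}
V^{\pi}(s) &= E_{\pi}\{ R_t \mid s_t = s \}
            = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s \Big\} \\
Q^{\pi}(s,a) &= E_{\pi}\{ R_t \mid s_t = s,\ a_t = a \} \\
V^{\pi}(s) &= \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^{a}_{ss'}
              \big[ \mathcal{R}^{a}_{ss'} + \gamma V^{\pi}(s') \big]
\end{align*}
```

The last line is the Bellman equation for V^π: the value of a state equals the expected immediate reward plus the discounted value of the successor state, averaged over the policy's action choices and the environment's transitions.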




Optimal Value Functions

  • A policy π is defined to be better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states, i.e. π ≥ π′ if and only if V^π(s) ≥ V^π′(s) for all states s

  • There is always at least one policy (a.k.a. optimal policy) that is better than or equal to all other policies

  • Optimal policies also share the same optimal action-value function, defined as
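
In the same assumed textbook notation, the optimal value functions shared by all optimal policies can be sketched as maxima over policies:

```latex
\begin{align*}
V^{*}(s) &= \max_{\pi} V^{\pi}(s) \quad \text{for all states } s \\
Q^{*}(s,a) &= \max_{\pi} Q^{\pi}(s,a) \quad \text{for all states } s \text{ and actions } a
\end{align*}
```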

Optimal Value Functions (cont.)

  • The latter gives the expected return for taking action a in state s and thereafter following an optimal policy

  • Thus, we can write

  • Since V*(s) is the value function for a policy, it must satisfy the Bellman equation

    • This is called the Bellman optimality equation

  • Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state
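
The two relations referred to above can be sketched as follows, again assuming the textbook's notation for transition probabilities and expected rewards. The first line expresses Q* in terms of V*; the second is the Bellman optimality equation for V*.

```latex
\begin{align*}
Q^{*}(s,a) &= E\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s,\ a_t = a \} \\
V^{*}(s) &= \max_{a} Q^{*}(s,a)
          = \max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'}
            \big[ \mathcal{R}^{a}_{ss'} + \gamma V^{*}(s') \big]
\end{align*}
```

Compared with the Bellman equation for a fixed policy, the weighted sum over actions is replaced by a maximum over actions.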

Optimal Value Functions (cont.)

  • The Bellman optimality equation for Q* is

  • Backup diagrams arcs have been added at the agent's choice points to represent that the maximum over that choice is taken rather than the expected value (given some policy)



Optimal Value Functions (cont.)

  • For finite MDPs, the Bellman optimality equation has a unique solution independent of the policy

    • The Bellman optimality equation is actually a system of equations, one for each state

    • N equations (one for each state)

    • N variables – V*(s)

  • This assumes you know the dynamics of the environment

  • Once one has V*(s), it is relatively easy to determine an optimal policy …

    • For each state there will be one or more actions for which the maximum is obtained in the Bellman optimality equation

    • Any policy that assigns nonzero probability only to these actions is an optimal policy

  • This translates to a one-step search, i.e. decisions that are greedy with respect to V* will be optimal
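
The two steps above (solve the system for V*, then read off an optimal policy by a one-step search) can be sketched in Python. This is a minimal illustration, not the lecture's method: it uses value iteration, one common way to solve the system numerically when the dynamics are known, and the two-state MDP, its rewards, and the discount rate are all hypothetical.

```python
# Hypothetical two-state MDP, used only for illustration.
# P[state][action] is a list of (probability, next_state, reward) triples.
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
GAMMA = 0.9  # discount rate (assumed value)


def backup(P, V, s, a, gamma):
    """Expected one-step return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma, tol=1e-10):
    """Repeat V(s) <- max_a backup(s, a) until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(backup(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(P, V, gamma):
    """One-step search: in each state pick an action that attains the maximum."""
    return {s: max(P[s], key=lambda a: backup(P, V, s, a, gamma)) for s in P}


V_star = value_iteration(P, GAMMA)
policy = greedy_policy(P, V_star, GAMMA)
print(V_star, policy)
```

For this toy MDP the fixed point can be checked by hand: V*(B) = 2 / (1 − 0.9) = 20 and V*(A) = 1 + 0.9 · 20 = 19, so the greedy policy chooses go in A and stay in B. Note that `greedy_policy` needs the model `P` to perform its one-step search.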

Optimal Value Functions (cont.)

  • With Q*, the agent does not even have to do a one-step-ahead search

    • For any state s, the agent can simply find any action that maximizes Q*(s,a)

  • The action-value function effectively embeds the results of all one-step-ahead searches

  • It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair

    • The agent does not need to know anything about the dynamics of the environment

  • Q: What are the implementation tradeoffs here?
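
A small sketch of acting from a Q* table may help frame the tradeoff. The table below is hypothetical (its numbers are illustrative only), and the helper names are not from the lecture. Action selection is a plain table lookup with no model of the environment, at the cost of storing one entry per state-action pair rather than one per state.

```python
# Hypothetical Q* table for a two-state, two-action problem (illustrative numbers).
Q_star = {
    ("A", "stay"): 17.1, ("A", "go"): 19.0,
    ("B", "stay"): 20.0, ("B", "go"): 17.1,
}


def greedy_action(Q, state):
    """Pick an action maximizing Q(state, a); no transition model is consulted."""
    actions = [a for (s, a) in Q if s == state]
    return max(actions, key=lambda a: Q[(state, a)])


def state_value(Q, state):
    """Recover V*(state) as the maximum of Q*(state, a) over actions."""
    return max(q for (s, _), q in Q.items() if s == state)


num_states = len({s for (s, _) in Q_star})
print(greedy_action(Q_star, "A"), state_value(Q_star, "B"))
print("table entries:", len(Q_star), "vs. states:", num_states)
```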

Implementation Considerations

  • Computational Complexity

    • How complex is it to evaluate the state-value and action-value functions?

    • In software

    • In hardware

  • Data flow constraints

    • Which part of the data needs to be globally vs. locally available?

    • Impact of memory bandwidth limitations

Recycling Robot revisited

  • A transition graph is a useful way to summarize the dynamics of a finite MDP

    • State node for each possible state

    • Action node for each possible state-action pair
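
One way to hold such a transition graph in code is a nested dictionary, sketched below. The structure of the dynamics follows the recycling robot example as described in Sutton and Barto's textbook, which is an assumption here since the graph itself is not reproduced in this transcript. The numeric values chosen for `alpha`, `beta`, `r_search`, and `r_wait` are hypothetical.

```python
def recycling_robot(alpha=0.8, beta=0.6, r_search=2.0, r_wait=1.0):
    """Return graph[state][action] = list of (probability, next_state, reward).

    alpha: probability the battery stays high while searching (assumed value)
    beta:  probability the battery stays low while searching (assumed value)
    The -3 reward models the robot being rescued after depleting its battery,
    as in the textbook example.
    """
    return {
        "high": {
            "search": [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
            "wait": [(1.0, "high", r_wait)],
        },
        "low": {
            "search": [(beta, "low", r_search), (1 - beta, "high", -3.0)],
            "wait": [(1.0, "low", r_wait)],
            "recharge": [(1.0, "high", 0.0)],
        },
    }


graph = recycling_robot()
state_nodes = list(graph)                                    # one per state
action_nodes = [(s, a) for s in graph for a in graph[s]]     # one per state-action pair

# Sanity check: outgoing probabilities of every action node sum to one.
for s, a in action_nodes:
    assert abs(sum(p for p, _, _ in graph[s][a]) - 1.0) < 1e-12

print(state_nodes, action_nodes)
```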


Bellman Optimality Equations for the Recycling Robot

  • To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge respectively by h, l, s, w, and re
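
With these abbreviations, the Bellman optimality equations for the two states can be sketched as below. The parameters α and β (probabilities that the battery level stays high or low while searching), the expected rewards R^s and R^w for searching and waiting, and the −3 rescue penalty are taken from the textbook version of this example and are assumptions here.

```latex
\begin{align*}
V^{*}(h) &= \max \Big\{
    R^{s} + \gamma \big[ \alpha V^{*}(h) + (1-\alpha) V^{*}(l) \big],\;
    R^{w} + \gamma V^{*}(h) \Big\} \\
V^{*}(l) &= \max \Big\{
    \beta R^{s} - 3(1-\beta) + \gamma \big[ (1-\beta) V^{*}(h) + \beta V^{*}(l) \big],\;
    R^{w} + \gamma V^{*}(l),\;
    \gamma V^{*}(h) \Big\}
\end{align*}
```

The terms in each maximum correspond, in order, to the actions s and w in state h, and to s, w, and re in state l.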


Optimality and Approximation

  • Clearly, an agent that learns an optimal policy has done very well, but in practice this rarely happens

    • Usually involves heavy computational load

  • Typically, agents can only approximate the optimal policy

  • A critical aspect of the problem facing the agent is always the computational resources available to it

    • In particular, the amount of computation it can perform in a single time step

  • Practical considerations are thus:

    • Computational complexity

    • Memory available

      • Tabular methods apply for small state sets

    • Communication overhead (for distributed implementations)

    • Hardware vs. software

Are approximations good or bad?

  • RL typically relies on approximation mechanisms (see later)

  • This could be an opportunity

    • An efficient “feature-extraction” type of approximation may actually reduce “noise”

    • It can also make it practical for us to address large-scale problems

  • In general, making “bad” decisions in RL results in (online) learning opportunities

  • The online nature of RL encourages learning more effectively from events that occur frequently

    • Supported in nature

  • Capturing regularities is a key property of RL
