
ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 6: Optimality Criterion in MDPs

September 8, 2011

Dr. Itamar Arel

College of Engineering

Department of Electrical Engineering and Computer Science

The University of Tennessee

Fall 2011


Outline

  • Optimal value functions (cont.)

  • Implementation considerations

  • Optimality and approximation


Recap on Value Functions

  • We define the state-value function for policy π as
    Vπ(s) = Eπ{ Σ_k γ^k r_{t+k+1} | s_t = s },  k = 0, 1, 2, …

  • Similarly, we define the action-value function for policy π as
    Qπ(s,a) = Eπ{ Σ_k γ^k r_{t+k+1} | s_t = s, a_t = a }

  • The Bellman equation expresses the value of a state in terms of the values of its successor states:
    Vπ(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ Vπ(s') ]

  • The value function Vπ(s) is the unique solution to its Bellman equation

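To make the uniqueness claim concrete, here is a minimal sketch (not from the original slides; the names P_pi, r_pi and the 3-state numbers are illustrative assumptions) that solves the linear Bellman system (I − γPπ)Vπ = rπ directly:

    import numpy as np

    def evaluate_policy(P_pi, r_pi, gamma):
        """Solve (I - gamma * P_pi) V = r_pi for the state-value function V^pi.

        P_pi[s, s2] : probability of moving from s to s2 under policy pi
        r_pi[s]     : expected immediate reward from state s under pi
        """
        n = P_pi.shape[0]
        return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

    # Hypothetical 3-state example; for gamma < 1 the system has exactly one solution.
    P_pi = np.array([[0.5, 0.5, 0.0],
                     [0.0, 0.5, 0.5],
                     [0.5, 0.0, 0.5]])
    r_pi = np.array([1.0, 0.0, 2.0])
    V_pi = evaluate_policy(P_pi, r_pi, gamma=0.9)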


Optimal Value Functions

  • A policy π is defined to be better than or equal to a policy π' if its expected return is greater than or equal to that of π' for all states, i.e. π ≥ π' if and only if Vπ(s) ≥ Vπ'(s) for all s

  • There is always at least one policy (a.k.a. an optimal policy, denoted π*) that is better than or equal to all other policies; all optimal policies share the same optimal state-value function, V*(s) = max_π Vπ(s)

  • Optimal policies also share the same optimal action-value function, defined as
    Q*(s,a) = max_π Qπ(s,a) for all s and all a


Optimal Value Functions (cont.)

  • The latter gives the expected return for taking action a in state s and thereafter following an optimal policy

  • Thus, we can write
    Q*(s,a) = E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }

  • Since V*(s) is the value function for a policy, it must satisfy the Bellman equation; because it is the optimal value function, it can be written in a special form that makes no reference to any particular policy:
    V*(s) = max_a E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a } = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]

    • This is called the Bellman optimality equation

  • Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state
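In algorithmic form, this equation is typically solved by applying it repeatedly as an update (value iteration). The sketch below is a minimal illustration, not course-provided code; the tabular arrays P[s, a, s'] and R[s, a, s'] describing the known dynamics are assumptions:

    import numpy as np

    def value_iteration(P, R, gamma, tol=1e-8):
        """Repeat the Bellman optimality backup
        V(s) <- max_a sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V(s'))
        until the value function stops changing; returns an estimate of V*."""
        V = np.zeros(P.shape[0])
        while True:
            # Q[s, a]: expected return of taking a in s, then continuing with value V
            Q = np.einsum('sax,sax->sa', P, R + gamma * V[None, None, :])
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new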



Optimal Value Functions (cont.)

  • The Bellman optimality equation for Q* is
    Q*(s,a) = E{ r_{t+1} + γ max_{a'} Q*(s_{t+1},a') | s_t = s, a_t = a } = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a'} Q*(s',a') ]

  • In the corresponding backup diagrams, arcs are added at the agent's choice points to indicate that the maximum over those choices is taken, rather than the expected value under some given policy

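The same backup can be written directly in terms of Q*. A rough sketch, again assuming the tabular P and R arrays introduced above (illustrative names, not from the lecture):

    import numpy as np

    def q_value_iteration(P, R, gamma, tol=1e-8):
        """Bellman optimality backup for Q*:
        Q(s,a) <- sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * max_a' Q(s',a'))."""
        n_states, n_actions = P.shape[0], P.shape[1]
        Q = np.zeros((n_states, n_actions))
        while True:
            target = R + gamma * Q.max(axis=1)[None, None, :]   # shape (S, A, S')
            Q_new = np.einsum('sax,sax->sa', P, target)
            if np.max(np.abs(Q_new - Q)) < tol:
                return Q_new
            Q = Q_new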


Optimal Value Functions (cont.)

  • For finite MDPs, the Bellman optimality equation has a unique solution independent of the policy

    • The Bellman optimality equation is actually a system of equations, one for each state

    • N equations (one for each state)

    • N variables – V*(s)

  • This assumes you know the dynamics of the environment

  • Once one has V*(s), it is relatively easy to determine an optimal policy …

    • For each state there will be one or more actions for which the maximum is obtained in the Bellman optimality equation

    • Any policy that assigns nonzero probability only to these actions is an optimal policy

  • This translates to a one-step search: a policy that acts greedily with respect to V* is optimal (a sketch follows below)
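A minimal sketch of this one-step greedy extraction, reusing the assumed P and R arrays and a previously computed V* (illustrative names only):

    import numpy as np

    def greedy_policy_from_v(V_star, P, R, gamma):
        """One-step lookahead: for every state, pick an action that attains the
        maximum in the Bellman optimality equation."""
        Q = np.einsum('sax,sax->sa', P, R + gamma * V_star[None, None, :])
        return Q.argmax(axis=1)   # one optimal action per state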


Optimal Value Functions (cont.)

  • With Q*, the agent does not even have to do a one-step-ahead search

    • For any state s – the agent can simply find any action that maximizes Q*(s,a)

  • The action-value function effectively embeds the results of all one-step-ahead searches

  • It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair

    • Agent does not need to know anything about the dynamics of the environment

  • Q: What are the implementation tradeoffs here?
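By contrast, acting greedily from a stored Q* table needs neither P nor R at decision time. A minimal sketch (hypothetical names):

    import numpy as np

    def greedy_action(Q_star, s):
        """Model-free action selection: argmax_a Q*(s, a)."""
        return int(np.argmax(Q_star[s]))

One trade-off this suggests: the Q table holds |S|·|A| entries versus |S| for V*, but selecting actions from V* requires the one-step model of the dynamics.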


Implementation Considerations

  • Computational Complexity

    • How complex is it to evaluate the state-value and action-value functions?

    • In software

    • In hardware

  • Data flow constraints

    • Which part of the data needs to be globally vs. locally available?

    • Impact of memory bandwidth limitations
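As a back-of-the-envelope sketch of the memory question (the sizes below are arbitrary assumptions, not figures from the lecture): a tabular V function needs |S| entries, a tabular Q function |S|·|A| entries, and one full sweep of the Bellman backup touches on the order of |S|²·|A| transition terms.

    # Illustrative sizes only (assumed, not from the lecture)
    n_states, n_actions, bytes_per_entry = 10_000, 16, 8

    v_table_bytes = n_states * bytes_per_entry              # V(s) table
    q_table_bytes = n_states * n_actions * bytes_per_entry  # Q(s,a) table
    backup_terms  = n_states ** 2 * n_actions               # terms in one full sweep

    print(f"V table: {v_table_bytes / 1e6:.2f} MB, "
          f"Q table: {q_table_bytes / 1e6:.2f} MB, "
          f"terms per sweep: {backup_terms:.2e}")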


Recycling Robot Revisited

  • A transition graph is a useful way to summarize the dynamics of a finite MDP

    • State node for each possible state

    • Action node for each possible state-action pair

[Transition graph of the recycling robot: a state node for each of high and low, and an action node for each admissible state-action pair]


Bellman Optimality Equations for the Recycling Robot

  • To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge respectively by h, l, s, w, and re

    V*(h) = max { r_s + γ [ α V*(h) + (1 − α) V*(l) ] ,
                  r_w + γ V*(h) }

    V*(l) = max { β r_s − 3(1 − β) + γ [ (1 − β) V*(h) + β V*(l) ] ,
                  r_w + γ V*(l) ,
                  γ V*(h) }

    where α and β are the probabilities of remaining at high and low charge, respectively, while searching, and r_s and r_w are the expected search and wait rewards.
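These two equations can be solved numerically. The sketch below iterates them as a fixed point; the parameter values are illustrative assumptions (they are not given in the lecture), chosen only to show the mechanics:

    # Illustrative parameters for the recycling robot (assumed, not from the slides)
    alpha, beta, gamma = 0.8, 0.6, 0.9   # stay-high prob, stay-low prob, discount
    r_search, r_wait = 2.0, 1.0          # expected rewards for searching / waiting

    V = {'h': 0.0, 'l': 0.0}
    for _ in range(1000):  # value iteration on the two Bellman optimality equations
        V_h = max(r_search + gamma * (alpha * V['h'] + (1 - alpha) * V['l']),  # search
                  r_wait + gamma * V['h'])                                     # wait
        V_l = max(beta * r_search - 3 * (1 - beta)
                      + gamma * ((1 - beta) * V['h'] + beta * V['l']),         # search
                  r_wait + gamma * V['l'],                                     # wait
                  gamma * V['h'])                                              # recharge
        V['h'], V['l'] = V_h, V_l

    print(V)  # approximate V*(h) and V*(l)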


Optimality and Approximation

  • Clearly, an agent that learns an optimal policy has done very well, but in practice this rarely happens

    • Usually involves heavy computational load

  • Typically, agents can only compute approximations to the optimal policy

  • A critical aspect of the problem facing the agent is always the computational resources available to it

    • In particular, the amount of computation it can perform in a single time step

  • Practical considerations are thus:

    • Computational complexity

    • Memory available

      • Tabular methods apply for small state sets

    • Communication overhead (for distributed implementations)

    • Hardware vs. software


Are Approximations Good or Bad?

  • RL typically relies on approximation mechanisms (see later)

  • This could be an opportunity

    • Efficient “Feature-extraction” type of approximation may actually reduce “noise”

    • Make it practical for us to address large-scale problems

  • In general, making “bad” decisions in RL results in learning opportunities (online)

  • The online nature of RL encourages learning more effectively from events that occur frequently

    • Supported in nature

  • Capturing regularities is a key property of RL

