# Dr. Itamar Arel College of Engineering Department of Electrical Engineering and Computer Science - PowerPoint PPT Presentation

ECE-517: Reinforcement Learning in Artificial Intelligence Lecture 6: Optimality Criterion in MDPs. September 8, 2011. Dr. Itamar Arel College of Engineering Department of Electrical Engineering and Computer Science The University of Tennessee Fall 2011. Outline.

### ECE-517: Reinforcement Learning in Artificial Intelligence, Lecture 6: Optimality Criterion in MDPs

September 8, 2011

Dr. Itamar Arel

College of Engineering

Department of Electrical Engineering and Computer Science

The University of Tennessee

Fall 2011

• Optimal value functions (cont.)

• Implementation considerations

• Optimality and approximation

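• We define the state-value function for policy π as

$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\Big\}$$

• Similarly, we define the action-value function for policy π as

$$Q^{\pi}(s,a) = E_{\pi}\{R_t \mid s_t = s,\ a_t = a\}$$

• The Bellman equation

$$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{\pi}(s') \big]$$

• The value function Vπ(s) is the unique solution to its Bellman equation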
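Because the Bellman equation for a fixed policy is linear in the state values, its unique solution can be approached by applying the equation repeatedly as an update. The following is a minimal sketch of that idea, not the lecture's own code. The two-state MDP, its rewards, the discount factor 0.9, and the names `P`, `GAMMA`, and `evaluate_policy` are all assumptions made up for illustration.

```python
# Hypothetical two-state MDP (not from the lecture):
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    "A": {"stay": [(1.0, "A", 1.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
GAMMA = 0.9  # assumed discount factor


def evaluate_policy(P, policy, gamma, tol=1e-10):
    """Apply the Bellman equation for `policy` as an update until V stops changing.

    policy[s][a] is the probability of choosing action a in state s.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Expectation over the policy's actions, then over next states.
            v_new = sum(
                pi_sa * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a, pi_sa in policy[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


# Equiprobable random policy: each action is chosen with probability 0.5.
random_policy = {s: {"stay": 0.5, "go": 0.5} for s in P}
V_random = evaluate_policy(P, random_policy, GAMMA)
print(V_random)
```

For this toy problem the two Bellman equations can also be solved by hand, giving a value of 7.25 for state A and 7.75 for state B, which the iteration reproduces to within the tolerance.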

• A policy π is defined to be better than or equal to a policy π' if its expected return is greater than or equal to that of π' for all states, i.e.

$$\pi \ge \pi' \iff V^{\pi}(s) \ge V^{\pi'}(s) \quad \text{for all } s \in S$$

• There is always at least one policy (a.k.a. optimal policy) that is better than or equal to all other policies

• Optimal policies also share the same optimal action-value function, defined as

$$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a) \quad \text{for all } s \in S,\ a \in A(s)$$

• The latter gives the expected return for taking action a in state s and thereafter following an optimal policy

• Thus, we can write

$$Q^{*}(s,a) = E\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s,\ a_t = a \}$$

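• Since V*(s) is the value function for a policy, it must satisfy the Bellman equation

$$V^{*}(s) = \max_{a} Q^{*}(s,a) = \max_{a} \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{*}(s') \big]$$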

• This is called the Bellman optimality equation

• Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state

• The Bellman optimality equation for Q* is

$$Q^{*}(s,a) = \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s',a') \big]$$

• Backup diagrams arcs have been added at the agent's choice points to represent that the maximum over that choice is taken rather than the expected value (given some policy)

• For finite MDPs, the Bellman optimality equation has a unique solution independent of the policy

• The Bellman optimality equation is actually a system of equations, one for each state

• N equations (one for each state)

• N variables – V*(s)

• This assumes you know the dynamics of the environment
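The N equations in N unknowns are nonlinear because of the maximum, so they are usually solved iteratively rather than by direct elimination. One common way to do this, sketched below under stated assumptions, is value iteration. The toy MDP, the discount factor, and the names `P`, `GAMMA`, and `value_iteration` are hypothetical. Note that the code reads the transition probabilities and rewards directly, which is exactly the "known dynamics" assumption.

```python
# Hypothetical two-state MDP (not from the lecture):
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    "A": {"stay": [(1.0, "A", 1.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
GAMMA = 0.9  # assumed discount factor


def value_iteration(P, gamma, tol=1e-10):
    """Solve the system of Bellman optimality equations, one per state."""
    V = {s: 0.0 for s in P}  # one unknown V*(s) per state
    while True:
        delta = 0.0
        for s in P:
            # Right-hand side of the optimality equation for state s.
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


V_star = value_iteration(P, GAMMA)
print(V_star)
```

In this toy problem the best behavior is to move to B and stay there, so the result can be checked by hand: the optimal value of B is 2 / (1 - 0.9) = 20, and the optimal value of A is 0.9 × 20 = 18.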

• Once one has V*(s), it is relatively easy to determine an optimal policy …

• For each state there will be one or more actions for which the maximum is obtained in the Bellman optimality equation

• Any policy that assigns nonzero probability only to these actions is an optimal policy

• This translates to a one-step search, i.e. greedy decisions will be optimal
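The one-step search can be sketched as follows. For each state, every action is backed up once through the model, and all actions that attain the maximum are kept. This is an illustrative sketch only. The toy MDP, the discount factor, the hard-coded optimal values in `V_STAR` (which hold for this toy problem and can be checked by hand), and the function name `greedy_actions` are assumptions, not part of the lecture.

```python
# Hypothetical two-state MDP (not from the lecture):
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    "A": {"stay": [(1.0, "A", 1.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
GAMMA = 0.9  # assumed discount factor
V_STAR = {"A": 18.0, "B": 20.0}  # optimal values for this toy MDP


def greedy_actions(P, V, gamma, eps=1e-9):
    """Return, for each state, every action that attains the one-step maximum."""
    best = {}
    for s in P:
        # One-step lookahead: back up each action once through the model.
        backup = {
            a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        }
        top = max(backup.values())
        best[s] = [a for a, q in backup.items() if abs(q - top) < eps]
    return best


optimal_actions = greedy_actions(P, V_STAR, GAMMA)
print(optimal_actions)
```

Any policy that spreads its probability only over the actions listed for each state is optimal for this toy problem.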

• With Q*, the agent does not even have to do a one-step-ahead search

• For any state s, the agent can simply find any action that maximizes Q*(s,a)

• The action-value function effectively embeds the results of all one-step-ahead searches

• It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair

• Agent does not need to know anything about the dynamics of the environment
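By contrast, acting from Q* needs only a table lookup. In the minimal sketch below, no transition probabilities or rewards appear anywhere. The table `Q_STAR` holds values for a hypothetical two-state problem, and both the numbers and the function name `act` are assumptions for illustration.

```python
# Hypothetical optimal action values for a two-state toy problem.
# No model of the environment is stored, only Q*(s, a).
Q_STAR = {
    "A": {"stay": 17.2, "go": 18.0},
    "B": {"stay": 20.0, "go": 16.2},
}


def act(state):
    """Pick an action that maximizes Q*(state, a); ties go to the first one found."""
    return max(Q_STAR[state], key=Q_STAR[state].get)


chosen = {s: act(s) for s in Q_STAR}
print(chosen)
```

The price of avoiding the lookahead is storage: the table has one entry per state-action pair instead of one per state.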

• Q: What are the implementation tradeoffs here?

• Computational Complexity

• How complex is it to evaluate the state-value and action-value functions?

• In software

• In hardware

• Data flow constraints

• Which part of the data needs to be globally vs. locally available?

• Impact of memory bandwidth limitations

• A transition graph is a useful way to summarize the dynamics of a finite MDP

• State node for each possible state

• Action node for each possible state-action pair

• To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge respectively by h, l, s, w, and re
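A transition graph maps naturally onto a nested dictionary, with state nodes as outer keys and action nodes as inner keys. The sketch below assumes that the slide refers to the recycling-robot example from Sutton and Barto's textbook and follows that example's structure. The numeric values of `ALPHA`, `BETA`, and the rewards are placeholders chosen for illustration, not values from the lecture.

```python
# Placeholder parameters (assumed, not from the lecture).
ALPHA, BETA = 0.9, 0.6          # chance the battery level is unchanged by a search
R_SEARCH, R_WAIT = 2.0, 1.0     # expected rewards while searching / waiting
R_RESCUE = -3.0                 # penalty when the battery runs flat during a search

# graph[state][action] is a list of (probability, next_state, reward) arcs.
# States: h = high, l = low. Actions: s = search, w = wait, re = recharge.
graph = {
    "h": {
        "s": [(ALPHA, "h", R_SEARCH), (1 - ALPHA, "l", R_SEARCH)],
        "w": [(1.0, "h", R_WAIT)],
    },
    "l": {
        "s": [(BETA, "l", R_SEARCH), (1 - BETA, "h", R_RESCUE)],
        "w": [(1.0, "l", R_WAIT)],
        "re": [(1.0, "h", 0.0)],
    },
}

state_nodes = list(graph)                                  # one node per state
action_nodes = [(s, a) for s in graph for a in graph[s]]   # one per state-action pair

# Sanity check: the arcs leaving every action node must sum to probability 1.
for s, a in action_nodes:
    assert abs(sum(p for p, _, _ in graph[s][a]) - 1.0) < 1e-12

print(state_nodes, action_nodes)
```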

• Clearly, an agent that learns an optimal policy has done very well, but in practice this rarely happens

• Usually involves heavy computational load

• Typically agents perform approximations to the optimal policy

• A critical aspect of the problem facing the agent is always the computational resources available to it

• In particular, the amount of computation it can perform in a single time step

• Practical considerations are thus:

• Computational complexity

• Memory available

• Tabular methods apply for small state sets

• Communication overhead (for distributed implementations)

• Hardware vs. software

• RL typically relies on approximation mechanisms (see later)

• This could be an opportunity

• Efficient “Feature-extraction” type of approximation may actually reduce “noise”

• Make it practical for us to address large-scale problems
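One simple form of feature-based approximation is state aggregation, where nearby states share a single value estimate. The sketch below is an assumption-laden illustration rather than the method used in the course. The integer state space, the bucket size, the sample returns, and the names `feature` and `AggregatedValue` are all hypothetical. It shows the two effects mentioned above: the table shrinks from one entry per state to one per feature, and noisy samples from neighbouring states are averaged together.

```python
from collections import defaultdict

N_STATES = 1000     # hypothetical: states are the integers 0..999
BUCKET_SIZE = 100   # hypothetical: 100 neighbouring states share one value


def feature(state):
    """Map a raw state to the index of the bucket (feature) it falls in."""
    return state // BUCKET_SIZE


class AggregatedValue:
    """Value estimate stored per feature instead of per state."""

    def __init__(self):
        self.value = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, state, sampled_return):
        # Incremental mean of all returns observed in this state's bucket.
        k = feature(state)
        self.count[k] += 1
        self.value[k] += (sampled_return - self.value[k]) / self.count[k]

    def predict(self, state):
        return self.value[feature(state)]


v = AggregatedValue()
# Made-up noisy return samples: three from bucket 0, one from bucket 6.
for state, ret in [(3, 1.2), (57, 0.8), (99, 1.0), (640, 5.0)]:
    v.update(state, ret)

print(v.predict(10), v.predict(600))
```

State 10 was never visited, yet it receives the average of the three samples from its bucket. The cost is that states within a bucket can no longer be told apart.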

• In general, making “bad” decisions in RL results in learning opportunities (online)

• The online nature of RL encourages learning more effectively from events that occur frequently

• Supported in nature

• Capturing regularities is a key property of RL