
# ECE-517: Reinforcement Learning in Artificial Intelligence - Lecture 6: Optimality Criterion in MDPs

September 8, 2011

Dr. Itamar Arel

College of Engineering

Department of Electrical Engineering and Computer Science

The University of Tennessee

Fall 2011

### Outline

- Optimal value functions (cont.)
- Implementation considerations
- Optimality and approximation
### Recap on Value Functions

- We define the state-value function for policy π as

  $$V^{\pi}(s) = E_{\pi}\!\left[\left.\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\right|\, s_t = s\right]$$

- Similarly, we define the action-value function for policy π as

  $$Q^{\pi}(s,a) = E_{\pi}\!\left[\left.\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\right|\, s_t = s,\, a_t = a\right]$$

- The Bellman equation for V^π is

  $$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^{\pi}(s') \right]$$

- The value function V^π(s) is the unique solution to its Bellman equation
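The Bellman equation above can be turned directly into an iterative solver. Below is a minimal sketch, assuming hypothetical tabular arrays `P[s, a, s']` (transition probabilities), `R[s, a, s']` (expected rewards), and a stochastic policy `pi[s, a]`; these names are illustrative, not from the lecture:

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation by repeated Bellman backups (sketch)."""
    V = np.zeros(P.shape[0])
    while True:
        # V(s) <- sum_a pi(s,a) sum_s' P(s,a,s') [R(s,a,s') + gamma V(s')]
        V_new = np.einsum('sa,sax,sax->s', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Because the backup is a γ-contraction, the iteration converges to the unique solution noted above.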

### Optimal Value Functions

- A policy π is defined to be better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states, i.e.

  $$\pi \geq \pi' \iff V^{\pi}(s) \geq V^{\pi'}(s) \quad \text{for all } s$$

- There is always at least one policy (a.k.a. an optimal policy) that is better than or equal to all other policies; all such policies share the same optimal state-value function

  $$V^{*}(s) = \max_{\pi} V^{\pi}(s)$$

- Optimal policies also share the same optimal action-value function, defined as

  $$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$$
### Optimal Value Functions (cont.)

- The latter gives the expected return for taking action a in state s and thereafter following an optimal policy
- Thus, we can write

  $$Q^{*}(s,a) = E\!\left[\left. r_{t+1} + \gamma V^{*}(s_{t+1}) \,\right|\, s_t = s,\, a_t = a\right]$$

- Since V*(s) is the value function for a policy, it must satisfy the Bellman equation; for an optimal policy, the expectation over actions reduces to a maximum:

  $$V^{*}(s) = \max_{a} \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^{*}(s') \right]$$

- This is called the Bellman optimality equation
- Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state
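One common way to solve this equation numerically is value iteration: repeatedly apply the max-backup until the values stop changing. A minimal sketch, reusing the hypothetical tabular arrays `P` and `R` from the earlier sketch:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Fixed-point iteration on the Bellman optimality equation (sketch)."""
    V = np.zeros(P.shape[0])
    while True:
        # Q(s,a) = sum_s' P(s,a,s') [R(s,a,s') + gamma V(s')]
        Q = np.einsum('sax,sax->sa', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)  # V(s) <- max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```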

### Optimal Value Functions (cont.)

- The Bellman optimality equation for Q* is

  $$Q^{*}(s,a) = \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma \max_{a'} Q^{*}(s',a') \right]$$

- In the corresponding backup diagrams, arcs are added at the agent's choice points to indicate that the maximum over that choice is taken, rather than the expected value given some policy
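The same fixed-point idea applies directly to Q*. A minimal sketch, again with the hypothetical tabular `P` and `R`:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Fixed-point iteration on the Bellman optimality equation for Q* (sketch)."""
    Q = np.zeros(P.shape[:2])  # one entry per (state, action) pair
    while True:
        # Q(s,a) <- sum_s' P(s,a,s') [R(s,a,s') + gamma max_a' Q(s',a')]
        target = R + gamma * Q.max(axis=1)[None, None, :]
        Q_new = np.einsum('sax,sax->sa', P, target)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```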


### Optimal Value Functions (cont.)

- For finite MDPs, the Bellman optimality equation has a unique solution, independent of the policy
- The Bellman optimality equation is actually a system of equations, one for each state:
  - N equations (one for each state)
  - N unknowns – the values V*(s)
- This assumes you know the dynamics of the environment
- Once one has V*(s), it is relatively easy to determine an optimal policy …
  - For each state there will be one or more actions for which the maximum is obtained in the Bellman optimality equation
  - Any policy that assigns nonzero probability only to these actions is an optimal policy
  - This translates to a one-step search, i.e. greedy decisions will be optimal (see the sketch after this list)
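A minimal sketch of that one-step search, producing a deterministic greedy policy from V* (hypothetical tabular `P` and `R` as before):

```python
import numpy as np

def greedy_policy_from_V(P, R, V_star, gamma=0.9):
    """One-step lookahead: for each state, pick an action attaining the
    max in the Bellman optimality equation (sketch)."""
    Q = np.einsum('sax,sax->sa', P, R + gamma * V_star[None, None, :])
    return Q.argmax(axis=1)  # one optimal action index per state
```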
### Optimal Value Functions (cont.)

- With Q*, the agent does not even have to do a one-step-ahead search
- For any state s, the agent can simply find any action that maximizes Q*(s,a)
- The action-value function effectively embeds the results of all one-step-ahead searches
- It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair
- The agent does not need to know anything about the dynamics of the environment
- Q: What are the implementation tradeoffs here?
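Action selection from Q* then collapses to a table lookup, as in the sketch below. One standard reading of the tradeoff (not stated on the slide): Q* buys model-free action selection at the cost of storing |S| x |A| values instead of |S|.

```python
import numpy as np

def greedy_action(Q_star, s):
    # No dynamics model and no one-step search needed: Q* already
    # caches the optimal expected return for every (s, a) pair.
    return int(np.argmax(Q_star[s]))
```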

### Implementation Considerations

- Computational complexity
  - How complex is it to evaluate the state-value and action-value functions?
  - In software
  - In hardware
- Data flow constraints
  - Which parts of the data need to be globally vs. locally available?
  - Impact of memory bandwidth limitations

### Recycling Robot Revisited

- A transition graph is a useful way to summarize the dynamics of a finite MDP; a possible encoding is sketched below
  - A state node for each possible state
  - An action node for each possible state-action pair

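As an illustration, the transition graph can be encoded as a nested mapping from state to action to outgoing arcs. The sketch below follows the recycling-robot dynamics as usually presented (e.g., in Sutton and Barto); alpha, beta, r_search, and r_wait stand for the example's parameters and are left symbolic here:

```python
def recycling_robot(alpha, beta, r_search, r_wait):
    """Transition graph as graph[state][action] = [(prob, next_state, reward), ...]."""
    return {
        'high': {
            'search': [(alpha, 'high', r_search), (1 - alpha, 'low', r_search)],
            'wait':   [(1.0, 'high', r_wait)],
        },
        'low': {
            'search':   [(beta, 'low', r_search), (1 - beta, 'high', -3.0)],
            'wait':     [(1.0, 'low', r_wait)],
            'recharge': [(1.0, 'high', 0.0)],
        },
    }
```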

### Bellman Optimality Equations for the Recycling Robot

- To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge, by h, l, s, w, and re respectively
- The Bellman optimality equations for the two states are then

  $$V^{*}(h) = \max \left\{ R^{s} + \gamma\left[\alpha V^{*}(h) + (1-\alpha) V^{*}(l)\right],\; R^{w} + \gamma V^{*}(h) \right\}$$

  $$V^{*}(l) = \max \left\{ \beta R^{s} - 3(1-\beta) + \gamma\left[(1-\beta) V^{*}(h) + \beta V^{*}(l)\right],\; R^{w} + \gamma V^{*}(l),\; \gamma V^{*}(h) \right\}$$
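These two equations can be solved by the same fixed-point iteration as before. A minimal numeric sketch; the parameter values below are purely illustrative assumptions, not from the lecture:

```python
# Illustrative (assumed) parameters for the recycling robot.
alpha, beta, gamma = 0.8, 0.6, 0.9   # transition probabilities, discount
r_s, r_w = 2.0, 1.0                  # rewards for searching and waiting

V_h, V_l = 0.0, 0.0
for _ in range(500):  # iterate the two Bellman optimality equations
    V_h, V_l = (
        max(r_s + gamma * (alpha * V_h + (1 - alpha) * V_l),
            r_w + gamma * V_h),
        max(beta * r_s - 3 * (1 - beta)
            + gamma * ((1 - beta) * V_h + beta * V_l),
            r_w + gamma * V_l,
            gamma * V_h),
    )
print(f"V*(h) = {V_h:.3f}, V*(l) = {V_l:.3f}")
```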

### Optimality and Approximation

- Clearly, an agent that learns an optimal policy has done very well, but in practice this rarely happens
  - Computing an optimal policy usually involves a heavy computational load
  - Typically, agents can only approximate the optimal policy
- A critical aspect of the problem facing the agent is always the computational resources available to it
  - In particular, the amount of computation it can perform in a single time step
- Practical considerations are thus:
  - Computational complexity
  - Memory available (tabular methods apply only to small state sets)
  - Communication overhead (for distributed implementations)
  - Hardware vs. software
### Are Approximations Good or Bad?

- RL typically relies on approximation mechanisms (see later)
- This could be an opportunity:
  - Efficient "feature-extraction" types of approximation may actually reduce "noise"
  - They make it practical for us to address large-scale problems
- In general, making "bad" decisions in RL results in learning opportunities (online)
- The online nature of RL encourages learning more effectively from events that occur frequently
  - This is supported in nature
- Capturing regularities is a key property of RL