Dr. Itamar Arel College of Engineering Department of Electrical Engineering and Computer Science

ECE-517: Reinforcement Learning in Artificial Intelligence

Lecture 6: Optimality Criterion in MDPs

September 8, 2011

Dr. Itamar Arel

College of Engineering

Department of Electrical Engineering and Computer Science

The University of Tennessee

Fall 2011

- Optimal value functions (cont.)
- Implementation considerations
- Optimality and approximation

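- We define the state-value function for policy π as
  $V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \}$
- Similarly, we define the action-value function for policy π as
  $Q^\pi(s,a) = E_\pi\{ R_t \mid s_t = s, a_t = a \}$
- The Bellman equation
  $V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]$
  where $P^a_{ss'}$ and $R^a_{ss'}$ are the transition probabilities and expected rewards
- The value function $V^\pi(s)$ is the unique solution to its Bellman equation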
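The uniqueness claim above can be checked numerically. The following is a minimal sketch, not part of the lecture: it assumes a hypothetical two-state MDP (`toy_mdp`, with states `A`/`B` and actions `stay`/`go`), a uniform random policy, and a discount of 0.9, and it applies the Bellman equation as a repeated update (iterative policy evaluation) until the values stop changing.

```python
# Hypothetical two-state MDP, invented for illustration only.
# toy_mdp[state][action] -> list of (probability, next_state, reward)
toy_mdp = {
    "A": {"stay": [(1.0, "A", 1.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}

# Uniform random policy: policy[state][action] -> probability of choosing action
uniform_policy = {
    s: {a: 1.0 / len(acts) for a in acts} for s, acts in toy_mdp.items()
}


def evaluate_policy(mdp, policy, gamma=0.9, tol=1e-10):
    """Iterative policy evaluation: sweep the Bellman equation until convergence."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            # Expected one-step return under the policy, bootstrapping from V
            v_new = sum(
                policy[s].get(a, 0.0)
                * sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
                for a in mdp[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


V_pi = evaluate_policy(toy_mdp, uniform_policy)
print(V_pi)  # approximately {'A': 7.25, 'B': 7.75}
```

Starting the sweep from any other initial values leads to the same fixed point, which is what uniqueness of the solution implies for this toy problem.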

- A policy π is defined to be better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states, i.e.
  $\pi \ge \pi'$ if and only if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$
- There is always at least one policy (a.k.a. optimal policy) that is better than or equal to all other policies
- Optimal policies also share the same optimal action-value function, defined as
  $Q^*(s,a) = \max_\pi Q^\pi(s,a)$ for all $s \in S$ and $a \in A(s)$

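- The latter gives the expected return for taking action a in state s and thereafter following an optimal policy
- Thus, we can write
  $Q^*(s,a) = E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}$
- Since V*(s) is the value function for a policy, it must satisfy the Bellman equation
  $V^*(s) = \max_a E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \} = \max_a \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^*(s') ]$
- This is called the Bellman optimality equation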

- Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state
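To make the "best action from that state" reading concrete, here is a hedged sketch that turns the Bellman optimality equation into an update rule. This is value iteration, a dynamic-programming technique that this section does not cover and that is used here only as an illustration. The two-state `toy_mdp` and the discount of 0.9 are assumptions made up for the example.

```python
# Hypothetical two-state MDP, invented for illustration only.
# toy_mdp[state][action] -> list of (probability, next_state, reward)
toy_mdp = {
    "A": {"stay": [(1.0, "A", 1.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}


def value_iteration(mdp, gamma=0.9, tol=1e-10):
    """Repeatedly apply V(s) <- max_a sum_s' P [R + gamma V(s')] until it settles."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            # Back up every action, then keep the best one (max, not expectation)
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in mdp[s].values()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


V_star = value_iteration(toy_mdp)
print(V_star)  # approximately {'A': 18.0, 'B': 20.0}
```

In this toy problem the best choice in `B` is to stay (2 + 0.9 · 20 = 20), and the best choice in `A` is to move to `B` (0.9 · 20 = 18), so each optimal value equals the backed-up value of its best action.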

- The Bellman optimality equation for Q* is
  $Q^*(s,a) = \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma \max_{a'} Q^*(s',a') ]$
- In the backup diagrams, arcs have been added at the agent's choice points to represent that the maximum over that choice is taken rather than the expected value (given some policy)
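One way to see where the equation for Q* comes from is sketched below. The notation is an assumption carried over from the standard finite-MDP setting: $P^a_{ss'}$ is the probability of moving from s to s′ under action a, and $R^a_{ss'}$ is the corresponding expected reward. The first line restates Q* in terms of V*, the second uses the fact that the optimal value of a state is the value of its best action, and the third substitutes one into the other.

```latex
\begin{align*}
Q^*(s,a) &= \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right] \\
V^*(s')  &= \max_{a'} Q^*(s',a') \\
\Rightarrow\quad Q^*(s,a) &= \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s',a') \right]
\end{align*}
```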

- For finite MDPs, the Bellman optimality equation has a unique solution independent of the policy
- The Bellman optimality equation is actually a system of equations, one for each state
- N equations (one for each state)
- N variables – V*(s)

- Solving this system assumes you know the dynamics of the environment
- Once one has V*(s), it is relatively easy to determine an optimal policy …
- For each state there will be one or more actions for which the maximum is obtained in the Bellman optimality equation
- Any policy that assigns nonzero probability only to these actions is an optimal policy

- This translates to a one-step search, i.e. greedy decisions will be optimal
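The one-step search can be sketched as follows. This is an illustration under stated assumptions rather than the lecture's own code: `toy_mdp` is a hypothetical two-state MDP, the discount is 0.9, and `V_star` holds the optimal values for that particular toy problem (V*(B) solves V = 2 + 0.9 V, and V*(A) = 0.9 V*(B)). Note that the search needs the transition probabilities and rewards.

```python
# Hypothetical two-state MDP, invented for illustration only.
# toy_mdp[state][action] -> list of (probability, next_state, reward)
toy_mdp = {
    "A": {"stay": [(1.0, "A", 1.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
gamma = 0.9

# Optimal state values for toy_mdp at gamma = 0.9 (assumed known here)
V_star = {"A": 18.0, "B": 20.0}


def greedy_policy(mdp, V, gamma):
    """One-step lookahead: in each state pick the action with the best backed-up value."""
    policy = {}
    for s, actions in mdp.items():
        policy[s] = max(
            actions,
            key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in actions[a]),
        )
    return policy


print(greedy_policy(toy_mdp, V_star, gamma))  # {'A': 'go', 'B': 'stay'}
```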

- With Q*, the agent does not even have to do a one-step-ahead search
- For any state s – the agent can simply find any action that maximizes Q*(s,a)

- The action-value function effectively embeds the results of all one-step-ahead searches
- It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair
- The agent does not need to know anything about the dynamics of the environment
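A minimal sketch of action selection from Q*: the table `Q_star` below is an assumed example with invented values for a hypothetical two-state problem. The point of the sketch is that the selection consults only the table, never transition probabilities or rewards.

```python
# Assumed Q* table for a hypothetical two-state problem (values invented for illustration)
Q_star = {
    "A": {"stay": 17.2, "go": 18.0},
    "B": {"stay": 20.0, "go": 16.2},
}


def act_greedily(Q, state):
    """Return an action maximizing Q[state][action]; no environment model is used."""
    return max(Q[state], key=Q[state].get)


print(act_greedily(Q_star, "A"))  # go
print(act_greedily(Q_star, "B"))  # stay
```

The price of skipping the lookahead is storage: the table holds one entry per state-action pair rather than one per state.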

- Q: What are the implementation tradeoffs here?

- Computational Complexity
- How complex is it to evaluate the state-value and action-value functions?
- In software
- In hardware

- Data flow constraints
- Which part of the data needs to be globally vs. locally available?
- Impact of memory bandwidth limitations

- A transition graph is a useful way to summarize the dynamics of a finite MDP
- State node for each possible state
- Action node for each possible state-action pair

- To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge respectively by h, l, s, w, and re
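A transition graph maps naturally onto a nested dictionary, with one key per state node and one nested key per action node. The sketch below uses the abbreviations h, l, s, w, and re from above, but everything else is an assumption made for illustration: which actions are available in which state, the transition structure, and all probabilities and rewards are placeholders, not values from the lecture.

```python
# transitions[state][action] -> list of (probability, next_state, reward)
# All numbers below are placeholder assumptions, not values from the lecture.
transitions = {
    "h": {
        "s": [(0.8, "h", 2.0), (0.2, "l", 2.0)],   # searching may drain the battery
        "w": [(1.0, "h", 1.0)],                    # waiting leaves the state unchanged
    },
    "l": {
        "s": [(0.6, "l", 2.0), (0.4, "h", -3.0)],  # assumed penalty if the battery dies
        "w": [(1.0, "l", 1.0)],
        "re": [(1.0, "h", 0.0)],                   # recharging returns to high
    },
}

# State nodes are the outer keys; action nodes are (state, action) pairs.
state_nodes = list(transitions)
action_nodes = [(s, a) for s, acts in transitions.items() for a in acts]

# The arcs leaving each action node must carry probabilities that sum to 1.
for s, a in action_nodes:
    total = sum(p for p, _, _ in transitions[s][a])
    assert abs(total - 1.0) < 1e-12, (s, a, total)

print(len(state_nodes), len(action_nodes))  # 2 5
```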

- Clearly, an agent that learns an optimal policy has done very well, but in practice this rarely happens
- Usually involves heavy computational load

- Typically agents perform approximations to the optimal policy
- A critical aspect of the problem facing the agent is always the computational resources available to it
- In particular, the amount of computation it can perform in a single time step

- Practical considerations are thus:
- Computational complexity
- Memory available
- Tabular methods apply for small state sets

- Communication overhead (for distributed implementations)
- Hardware vs. software
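The memory constraint on tabular methods is easy to quantify with a back-of-the-envelope calculation. The sketch below assumes a dense table with one 8-byte entry per state-action pair and a hypothetical state space built from n binary features with 4 actions; the function name and all numbers are illustrative assumptions.

```python
def q_table_bytes(num_states, num_actions, bytes_per_entry=8):
    """Memory for a dense table with one entry per state-action pair."""
    return num_states * num_actions * bytes_per_entry


# Hypothetical state spaces built from n binary features, with 4 actions each.
for n_features in (10, 20, 40):
    size = q_table_bytes(2 ** n_features, 4)
    print(n_features, "features:", size / 2 ** 20, "MiB")
```

With 20 binary features the table needs 32 MiB, while 40 features would need 2^45 bytes (32 TiB), which is why tabular storage is practical only for small state sets.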

- RL typically relies on approximation mechanisms (see later)
- This could be an opportunity
- Efficient “Feature-extraction” type of approximation may actually reduce “noise”
- Make it practical for us to address large-scale problems

- In general, making “bad” decisions in RL results in learning opportunities (online)
- The online nature of RL encourages learning more effectively from events that occur frequently
- Supported in nature

- Capturing regularities is a key property of RL