Dr. Itamar Arel, College of Engineering, Department of Electrical Engineering and Computer Science

ECE-517: Reinforcement Learning in Artificial Intelligence

Lecture 6: Optimality Criterion in MDPs

September 8, 2011

Dr. Itamar Arel

College of Engineering

Department of Electrical Engineering and Computer Science

The University of Tennessee

Fall 2011

- Optimal value functions (cont.)
- Implementation considerations
- Optimality and approximation

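- We define the state-value function for policy $\pi$ as $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$
- Similarly, we define the action-value function for policy $\pi$ as $Q^\pi(s,a) = E_\pi\{R_t \mid s_t = s, a_t = a\}$
- The Bellman equation: $V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^\pi(s')]$, where $P^a_{ss'}$ is the probability of moving from $s$ to $s'$ under action $a$ and $R^a_{ss'}$ is the expected reward for that transition
- The value function $V^\pi(s)$ is the unique solution to its Bellman equation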
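
One way to see the Bellman equation at work is to apply it repeatedly as an update rule (iterative policy evaluation) until the values stop changing. The sketch below is only an illustration: the two-state MDP, the uniform random policy, the discount factor, and the names `P`, `pi`, `evaluate_policy`, and `bellman_residual` are all assumptions and do not come from the lecture.

```python
# Hypothetical two-state MDP, used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
# pi[s][a] is the probability of choosing action a in state s (assumed policy).
pi = {0: {"stay": 0.5, "go": 0.5}, 1: {"stay": 0.5, "go": 0.5}}
GAMMA = 0.9  # assumed discount factor


def backup(P, pi, gamma, V, s):
    """Right-hand side of the Bellman equation for state s."""
    return sum(
        pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
        for a in P[s]
    )


def evaluate_policy(P, pi, gamma, tol=1e-10):
    """Sweep over the states, applying the Bellman equation as an update."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = backup(P, pi, gamma, V, s)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


def bellman_residual(P, pi, gamma, V):
    """Largest gap between V(s) and the right-hand side of the Bellman equation."""
    return max(abs(V[s] - backup(P, pi, gamma, V, s)) for s in P)


V_pi = evaluate_policy(P, pi, GAMMA)
print(V_pi, bellman_residual(P, pi, GAMMA, V_pi))
```

A residual close to zero indicates that the returned values satisfy the Bellman equation for this policy, up to the chosen tolerance.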

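- A policy π is defined to be better than or equal to a policy π' if its expected return is greater than or equal to that of π' for all states, i.e. $\pi \geq \pi'$ if and only if $V^\pi(s) \geq V^{\pi'}(s)$ for all $s$
- There is always at least one policy (a.k.a. optimal policy, denoted π*) that is better than or equal to all other policies
- Optimal policies share the same optimal state-value function, $V^*(s) = \max_\pi V^\pi(s)$ for all $s$
- Optimal policies also share the same optimal action-value function, defined as $Q^*(s,a) = \max_\pi Q^\pi(s,a)$ for all $s$ and $a$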

- The latter gives the expected return for taking action a in state s and thereafter following an optimal policy
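- Thus, we can write $Q^*(s,a) = E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\}$
- Since V*(s) is the value function for a policy, it must satisfy the Bellman equation, which here takes the form $V^*(s) = \max_a E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\} = \max_a \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^*(s')]$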
- This is called the Bellman optimality equation

- Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state

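- The Bellman optimality equation for Q* is $Q^*(s,a) = E\{r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1},a') \mid s_t = s, a_t = a\} = \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma \max_{a'} Q^*(s',a')]$
- In the backup diagrams, arcs have been added at the agent's choice points to represent that the maximum over that choice is taken rather than the expected value (given some policy)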
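
The Bellman optimality equation for Q* can also be read as an update rule: back up each state-action pair using the maximum over next actions, and repeat until nothing changes. The following is a minimal sketch under stated assumptions. The two-state MDP, the discount factor, and the function name `q_value_iteration` are hypothetical and are not taken from the lecture.

```python
# Hypothetical two-state MDP, used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
GAMMA = 0.9  # assumed discount factor


def q_value_iteration(P, gamma, tol=1e-10):
    """Iterate the Bellman optimality backup for Q until it converges."""
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        delta = 0.0
        for s in P:
            for a in P[s]:
                # The max over next actions replaces the expectation over a policy.
                new_q = sum(
                    p * (r + gamma * max(Q[s2].values()))
                    for p, s2, r in P[s][a]
                )
                delta = max(delta, abs(new_q - Q[s][a]))
                Q[s][a] = new_q
        if delta < tol:
            return Q


Q_star = q_value_iteration(P, GAMMA)
print(Q_star)
```

Note that this backup needs the transition model `P`; it is the resulting table, not its computation, that can be used without knowledge of the dynamics.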

- For finite MDPs, the Bellman optimality equation has a unique solution independent of the policy
- The Bellman optimality equation is actually a system of equations, one for each state
- N equations (one for each state)
- N variables – V*(s)
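
Because of the max operator the N equations are nonlinear, so they are usually solved iteratively rather than by a single linear solve. Value iteration is one standard way to do this; the sketch below swaps it in as an illustration and is not necessarily the method intended on the slide. The toy MDP, the discount factor, and the name `value_iteration` are assumptions.

```python
# Hypothetical two-state MDP (N = 2), used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
GAMMA = 0.9  # assumed discount factor


def value_iteration(P, gamma, tol=1e-10):
    """Solve the N Bellman optimality equations by repeated max-backups."""
    V = {s: 0.0 for s in P}  # one unknown V*(s) per state
    while True:
        delta = 0.0
        for s in P:
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


V_star = value_iteration(P, GAMMA)
print(V_star)
```

For this toy problem the equations can be checked by hand: always staying in state 1 yields 2 / (1 - 0.9) = 20, and state 0 then satisfies V = 15.2 + 0.18 V.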

- This assumes you know the dynamics of the environment
- Once one has V*(s), it is relatively easy to determine an optimal policy …
- For each state there will be one or more actions for which the maximum is obtained in the Bellman optimality equation
- Any policy that assigns nonzero probability only to these actions is an optimal policy

- This translates to a one-step search, i.e. greedy decisions will be optimal
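
The one-step search can be sketched as follows: for each state, back up every action through the known dynamics and keep the actions that attain the maximum. The toy MDP, the values in `V_star` (assumed to have been computed beforehand for this MDP), and the name `greedy_actions` are hypothetical.

```python
# Hypothetical two-state MDP, used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
GAMMA = 0.9  # assumed discount factor
# Optimal state values for this toy MDP, assumed to be already available.
V_star = {0: 15.2 / 0.82, 1: 20.0}


def greedy_actions(P, V, gamma, eps=1e-9):
    """Return, for each state, all actions that attain the one-step maximum."""
    policy = {}
    for s in P:
        backups = {
            a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        }
        best = max(backups.values())
        # Ties are kept: any mix of these actions is optimal.
        policy[s] = [a for a, q in backups.items() if q >= best - eps]
    return policy


optimal_actions = greedy_actions(P, V_star, GAMMA)
print(optimal_actions)
```

The search uses `P` directly, which is why this route assumes the dynamics of the environment are known.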

- With Q*, the agent does not even have to do a one-step-ahead search
- For any state s – the agent can simply find any action that maximizes Q*(s,a)

- The action-value function effectively embeds the results of all one-step-ahead searches
- It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair
- The agent does not need to know anything about the dynamics of the environment
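
With a Q* table in hand, action selection reduces to a lookup and an argmax, with no transition model involved. A minimal sketch, assuming a hypothetical two-state table (the numbers in `Q_star` and the name `act` are illustrative only):

```python
# Hypothetical optimal action values, rounded, for a two-state toy problem.
Q_star = {
    0: {"stay": 16.68, "go": 18.54},
    1: {"stay": 20.0, "go": 16.68},
}


def act(Q, s):
    """Pick an action that maximizes Q(s, a); no model of the dynamics is used."""
    return max(Q[s], key=Q[s].get)


print(act(Q_star, 0), act(Q_star, 1))
```

The price is storage: the table holds one entry per state-action pair rather than one per state.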

- Q: What are the implementation tradeoffs here?

- Computational Complexity
- How complex is it to evaluate the state-value and action-value functions?
- In software
- In hardware

- Data flow constraints
- Which part of the data needs to be globally vs. locally available?
- Impact of memory bandwidth limitations

- A transition graph is a useful way to summarize the dynamics of a finite MDP
- State node for each possible state
- Action node for each possible state-action pair

- To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge respectively by h, l, s, w, and re
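
A transition graph of this kind maps naturally onto a nested dictionary: one key per state node, one nested key per action node, and a list of outgoing arcs under each action node. The sketch below is an assumption-heavy illustration. The slides give only the abbreviations, so which actions are available in which state, and all probabilities and rewards (`ALPHA`, `BETA`, `R_SEARCH`, `R_WAIT`, `R_RESCUE`), are placeholder values and not taken from the lecture.

```python
# Placeholder parameters (assumptions, not from the slides).
ALPHA, BETA = 0.7, 0.6                        # chance the energy level is unchanged after a search
R_SEARCH, R_WAIT, R_RESCUE = 2.0, 1.0, -3.0   # illustrative rewards

# graph[state][action] is a list of (probability, next_state, reward) arcs.
graph = {
    "h": {
        "s": [(ALPHA, "h", R_SEARCH), (1 - ALPHA, "l", R_SEARCH)],
        "w": [(1.0, "h", R_WAIT)],
    },
    "l": {
        "s": [(BETA, "l", R_SEARCH), (1 - BETA, "h", R_RESCUE)],
        "w": [(1.0, "l", R_WAIT)],
        "re": [(1.0, "h", 0.0)],
    },
}

# One state node per state, one action node per state-action pair.
state_nodes = list(graph)
action_nodes = [(s, a) for s in graph for a in graph[s]]


def is_valid(graph, eps=1e-12):
    """Check that the arcs leaving every action node have probabilities summing to 1."""
    return all(
        abs(sum(p for p, _, _ in arcs) - 1.0) < eps
        for actions in graph.values()
        for arcs in actions.values()
    )


print(state_nodes, action_nodes, is_valid(graph))
```

The same structure can be passed to any routine that expects a list of (probability, next state, reward) triples per state-action pair.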

- Clearly, an agent that learns an optimal policy has done very well, but in practice this rarely happens
- Finding an optimal policy usually involves a heavy computational load

- Typically agents perform approximations to the optimal policy
- A critical aspect of the problem facing the agent is always the computational resources available to it
- In particular, the amount of computation it can perform in a single time step

- Practical considerations are thus:
- Computational complexity
- Memory available
- Tabular methods apply for small state sets

- Communication overhead (for distributed implementations)
- Hardware vs. software

- RL typically relies on approximation mechanisms (see later)
- This could be an opportunity
- Efficient “feature-extraction” types of approximation may actually reduce “noise”
- Make it practical for us to address large-scale problems

- In general, making “bad” decisions in RL results in learning opportunities (online)
- The online nature of RL encourages learning more effectively from events that occur frequently
- Supported in nature

- Capturing regularities is a key property of RL