ECE-517: Reinforcement Learning in Artificial Intelligence Lecture 12: Generalization and Function Approximation

October 23, 2012

Dr. Itamar Arel

College of Engineering

Department of Electrical Engineering and Computer Science

The University of Tennessee

Fall 2012

Outline

  • Introduction
  • Value Prediction with function approximation
  • Gradient Descent framework
    • On-line gradient-descent TD(λ)
    • Linear methods
  • Control with Function Approximation
Introduction

  • We have so far assumed a tabular view of value or state-value functions
  • This inherently limits our problem space to small state/action sets
    • Space requirements – storage of the values
    • Computational complexity – sweeping/updating the values
    • Communication constraints – getting the data where it needs to go
  • Reality is very different – high-dimensional state representations are common
  • We will next look at generalization – an attempt by the agent to learn about a large state set while visiting/experiencing only a small subset of it
    • People do it – how can machines achieve the same goal?
General Approach
  • Luckily, many approximation techniques have been developed
    • e.g. multivariate function approximation schemes
  • We will utilize such techniques in a RL context
Value Prediction with FA
  • As usual, let’s start with prediction of Vπ
  • Instead of using a table for Vt, the latter will be represented in a parameterized functional form with parameter vector θt
  • We’ll assume that Vt is a sufficiently smooth differentiable function of θt, for all s
  • For example, a neural network can be trained to predict V, where θ are the connection weights
  • We will require that the number of parameters (the size of θ) be much smaller than the state set
  • When a single state is backed up, the change generalizes to affect the values of many other states
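As a minimal sketch of the idea (the feature choice and numbers below are hypothetical), a parameterized value function computes V from a small parameter vector instead of a table entry per state:

```python
# Sketch (hypothetical names/numbers): a parameterized value function.
# Instead of a table V[s], the value is computed from a small parameter
# vector theta, so updating theta generalizes across many states.

def features(s):
    # Hand-crafted features of the state (here: a scalar state)
    return [1.0, s, s * s]

def value(theta, s):
    # V_t(s) as a differentiable function of theta (linear, for simplicity)
    return sum(t * f for t, f in zip(theta, features(s)))

theta = [0.5, 0.1, -0.02]   # one parameter vector covers every state
v1 = value(theta, 1.0)      # 0.5 + 0.1 - 0.02
```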


Adapt Supervised Learning Algorithms

Training info = desired (target) outputs

Supervised Learning

Training example = {input, target output}

Error = (target output – actual output)

Performance Measures
  • Let us assume that training examples all take the form (st, Vπ(st))
  • A common performance metric is the mean-squared error (MSE) over a distribution P:

MSE(θt) = Σs P(s) [Vπ(s) – Vt(s)]²
  • Q: Why use P? Is MSE the best metric?
  • Let us assume that P is always the distribution of states at which backups are done
  • On-policy distribution: the distribution created while following the policy being evaluated
    • Stronger results are available for this distribution.
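The MSE definition above can be sketched directly; the distribution and value estimates below are hypothetical toy numbers:

```python
# Sketch: MSE over a state distribution P (hypothetical toy numbers).
# MSE(theta) = sum_s P(s) * (V_pi(s) - V_theta(s))**2

P       = {0: 0.5, 1: 0.3, 2: 0.2}   # on-policy state distribution
V_pi    = {0: 1.0, 1: 0.0, 2: -1.0}  # true values under the policy
V_theta = {0: 0.8, 1: 0.1, 2: -0.5}  # current approximation

# States visited often under P contribute more to the error.
mse = sum(P[s] * (V_pi[s] - V_theta[s]) ** 2 for s in P)
```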
Gradient Descent

We iteratively move down the gradient:

θt+1 = θt – (α/2) ∇θ [Vπ(st) – Vt(st)]²
     = θt + α [Vπ(st) – Vt(st)] ∇θ Vt(st)
Gradient Descent in RL
  • Let’s now consider the case where the target output, vt, for sample t is not the true value (which is unavailable)
  • In such cases we perform an approximate update:

θt+1 = θt + α [vt – Vt(st)] ∇θ Vt(st)

where vt is an unbiased estimate of the target output.

  • Examples of vt are:
    • Monte Carlo methods: vt = Rt
    • TD(λ): vt = Rλt
  • The general gradient-descent method is guaranteed to converge to a local minimum
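The two choices of target can be made concrete with a short sketch (the rewards, discount, and bootstrap value below are hypothetical):

```python
# Sketch: two choices of training target v_t (hypothetical episode data).
gamma = 0.5
rewards = [1.0, 0.0, 2.0]   # r_{t+1}, r_{t+2}, r_{t+3} until termination

# Monte Carlo target: the full discounted return R_t
R_t = sum(gamma ** k * r for k, r in enumerate(rewards))

# TD(0) target: one-step bootstrapped estimate r_{t+1} + gamma * V(s_{t+1})
V_next = 1.5                # current estimate of V(s_{t+1})
td_target = rewards[0] + gamma * V_next
```

The Monte Carlo target is unbiased but must wait for the episode to end; the TD target is available immediately but bootstraps on the current estimate.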
Residual Gradient Descent
  • The following statement is not completely accurate:

θt+1 = θt + α [vt – Vt(st)] ∇θ Vt(st)

since it suggests that ∇θ vt = 0, which is not true, e.g. for the TD(0) target vt = rt+1 + γVt(st+1), which itself depends on θ.

so, we should be writing (residual GD):

θt+1 = θt + α [vt – Vt(st)] (∇θ Vt(st) – γ ∇θ Vt(st+1))

  • Comment: the whole scheme is no longer supervised-learning based!
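The difference between the two updates can be seen in a tiny linear example (all numbers hypothetical): the residual-gradient step also differentiates through the bootstrapped term Vt(st+1).

```python
# Sketch (linear FA, hypothetical numbers): semi-gradient TD(0) vs the
# residual-gradient update, which also differentiates through V(s_{t+1}).

alpha, gamma = 0.1, 0.9

def value(theta, phi):
    return sum(t * f for t, f in zip(theta, phi))

def semi_gradient_step(theta, phi, phi_next, r):
    delta = r + gamma * value(theta, phi_next) - value(theta, phi)
    # Treats the target as a constant: gradient is just grad V(s_t) = phi
    return [t + alpha * delta * f for t, f in zip(theta, phi)]

def residual_gradient_step(theta, phi, phi_next, r):
    delta = r + gamma * value(theta, phi_next) - value(theta, phi)
    # Gradient of the full error includes -gamma * grad V(s_{t+1})
    return [t + alpha * delta * (f - gamma * fn)
            for t, f, fn in zip(theta, phi, phi_next)]

theta = [1.0]
sg = semi_gradient_step(theta, [1.0], [1.0], 0.0)      # larger step
rg = residual_gradient_step(theta, [1.0], [1.0], 0.0)  # smaller step
```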
Linear Methods
  • One of the most important special cases of gradient-descent FA
  • Vt becomes a linear function of the parameter vector θt
  • For every state s, there is a (real-valued) column vector of features φs
  • The features can be constructed from the states in many ways
  • The linear approximate state-value function is given by

Vt(s) = θtᵀ φs = Σi θt(i) φs(i)
Nice Properties of Linear FA Methods
  • The gradient is very simple: ∇θ Vt(s) = φs
  • For MSE, the error surface is simple: a quadratic surface with a single (global) minimum
  • Linear gradient-descent TD(λ) converges if:
    • The step size decreases appropriately
    • Sampling is on-line (states sampled from the on-policy distribution)
  • It converges to a parameter vector θ∞ with the property:

MSE(θ∞) ≤ [(1 – γλ)/(1 – γ)] MSE(θ*)

where θ* is the best parameter vector (Tsitsiklis & Van Roy, 1997)
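One step of on-line linear gradient-descent TD(λ) with an accumulating eligibility trace can be sketched as follows (feature vectors and numbers are hypothetical):

```python
# Sketch: one step of linear gradient-descent TD(lambda) with an
# accumulating eligibility trace (hypothetical numbers).

alpha, gamma, lam = 0.1, 0.9, 0.8

def value(theta, phi):
    return sum(t * f for t, f in zip(theta, phi))

def td_lambda_step(theta, e, phi, phi_next, r):
    delta = r + gamma * value(theta, phi_next) - value(theta, phi)
    # For linear FA, grad_theta V(s) is just the feature vector phi
    e = [gamma * lam * ei + fi for ei, fi in zip(e, phi)]
    theta = [t + alpha * delta * ei for t, ei in zip(theta, e)]
    return theta, e

theta, e = [0.0, 0.0], [0.0, 0.0]
# Transition from a state with features [1, 0] to one with [0, 1], reward 1
theta, e = td_lambda_step(theta, e, [1.0, 0.0], [0.0, 1.0], r=1.0)
```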

Limitations of Pure Linear Methods
  • Many applications require a mixture (e.g. product) of the different feature components
    • The linear form prohibits direct representation of interactions between features
    • Intuition: feature i is good only in the absence of feature j
  • Example: the pole-balancing task
    • High angular velocity can be good or bad…
      • If the angle is high → imminent danger of falling (a bad state)
      • If the angle is low → the pole is righting itself (a good state)
  • In such cases we need to introduce features that express a mixture of other features
Shaping Generalization in Coarse Coding
  • If we train at one point (state), X, the parameters of all circles intersecting X will be affected
  • Consequence: the value function of all points within the union of the circles will be affected
  • The effect is greater for points that have more circles “in common” with X
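A 1-D sketch of the idea (overlapping intervals play the role of the circles; all numbers hypothetical): training at one point shifts the value of every state inside the union of the receptive fields it activates.

```python
# Sketch of coarse coding in 1-D (hypothetical intervals instead of circles):
# each binary feature is 1 when the state falls inside its receptive field.

fields = [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0)]   # overlapping intervals

def phi(x):
    return [1.0 if lo <= x <= hi else 0.0 for lo, hi in fields]

# Training at x = 1.5 touches the first two fields, so it also shifts
# the value of every state inside their union [0, 3].
alpha, err = 0.5, 1.0
theta = [0.0, 0.0, 0.0]
theta = [t + alpha * err * f for t, f in zip(theta, phi(1.5))]

v_near = sum(t * f for t, f in zip(theta, phi(0.5)))  # inside union: affected
v_far  = sum(t * f for t, f in zip(theta, phi(3.5)))  # outside: unaffected
```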
Learning and Coarse Coding

All three cases have the same number of features (50); the learning rate is 0.2/m, where m is the number of features present in each example.

Tile Coding
  • Binary feature for each tile
  • The number of features present at any one time is constant
  • Binary features make the weighted sum easy to compute
  • It is easy to compute the indices of the features present
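These properties can be sketched with two offset 1-D tilings (widths, offsets, and sizes below are hypothetical):

```python
# Sketch: tile coding with two 1-D tilings (hypothetical offsets). Each
# tiling contributes exactly one active binary feature per state, so the
# number of active features is constant and indices are cheap to compute.

tile_width = 1.0
offsets = [0.0, 0.5]        # one offset per tiling
tiles_per_tiling = 10

def active_tiles(x):
    idx = []
    for k, off in enumerate(offsets):
        tile = int((x + off) // tile_width)
        idx.append(k * tiles_per_tiling + tile)   # flat feature index
    return idx

# With binary features, the value is just a sum of one weight per active tile.
weights = [0.0] * (len(offsets) * tiles_per_tiling)

def value(x):
    return sum(weights[i] for i in active_tiles(x))
```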

Tile Coding Cont.


Irregular tilings


Control with Function Approximation
  • Learning state-action values
    • Training examples of the form: (st, at, vt)
  • The general gradient-descent rule:

θt+1 = θt + α [vt – Qt(st, at)] ∇θ Qt(st, at)

  • Gradient-descent Sarsa(λ) (backward view):

θt+1 = θt + α δt et
where δt = rt+1 + γ Qt(st+1, at+1) – Qt(st, at)
and et = γλ et–1 + ∇θ Qt(st, at)
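One backward-view Sarsa(λ) step with linear FA over state-action features can be sketched as follows (feature vectors and numbers are hypothetical):

```python
# Sketch: one backward-view gradient-descent Sarsa(lambda) update with
# linear FA over state-action features (hypothetical numbers).

alpha, gamma, lam = 0.5, 0.9, 0.8

def q(theta, phi):
    return sum(t * f for t, f in zip(theta, phi))

def sarsa_lambda_step(theta, e, phi, phi_next, r):
    # delta_t = r_{t+1} + gamma * Q(s',a') - Q(s,a)
    delta = r + gamma * q(theta, phi_next) - q(theta, phi)
    # e_t = gamma * lambda * e_{t-1} + grad Q(s,a)  (= phi for linear FA)
    e = [gamma * lam * ei + fi for ei, fi in zip(e, phi)]
    theta = [t + alpha * delta * ei for t, ei in zip(theta, e)]
    return theta, e

theta, e = [0.0], [0.0]
theta, e = sarsa_lambda_step(theta, e, [1.0], [1.0], r=1.0)
```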
Mountain-Car Task Example
  • Challenge: driving an underpowered car up a steep mountain road
    • Gravity is stronger than its engine
  • Solution approach: build up enough inertia on one slope to carry the car up the opposite slope
  • An example of a task where things can get worse in a sense (farther from the goal) before they get better
    • Hard to solve using classic control schemes
  • Reward is –1 for all steps until the episode terminates
  • Actions: full throttle forward (+1), full throttle reverse (–1), and zero throttle (0)
  • Two overlapping 9x9 tilings were used to represent the continuous state space
Summary

  • Generalization is an important RL attribute
  • Adapting supervised-learning function approximation methods
    • Each backup is treated as a learning example
  • Gradient-descent methods
  • Linear gradient-descent methods
    • Radial basis functions
    • Tile coding
  • Nonlinear gradient-descent methods?
    • NN Backpropagation?
  • Subtleties involving function approximation, bootstrapping and the on-policy/off-policy distinction