ECE-517: Reinforcement Learning in Artificial Intelligence Lecture 12: Generalization and Function Approximation

October 23, 2012

Dr. Itamar Arel

College of Engineering

Department of Electrical Engineering and Computer Science

The University of Tennessee

Fall 2012

Outline
• Introduction
• Value Prediction with Function Approximation
• Linear Methods
• Control with Function Approximation
Introduction
• We have so far assumed a tabular representation of state-value or action-value functions
• Inherently limits our problem-space to small state/action sets
• Space requirements – storage of values
• Computation complexity – sweeping/updating the values
• Communication constraints – getting the data where it needs to go
• Reality is very different – high-dimensional state representations are common
• We will next look at generalization – an attempt by the agent to learn about a large state set while visiting/experiencing only a small subset of it
• People do it – how can machines achieve the same goal?
General Approach
• Luckily, many approximation techniques have been developed
• e.g. multivariate function approximation schemes
• We will utilize such techniques in a RL context
Value Prediction with FA
• Instead of using a table for Vt, the latter will be represented in a parameterized functional form with parameter vector θt
• We’ll assume that Vt is a sufficiently smooth differentiable function of θt, for all s
• For example, a neural network can be trained to predict V, where θt are the connection weights
• We will require that the number of parameters (the dimension of θt) is much smaller than the size of the state set
• When a single state is backed up, the change generalizes to affect the values of many other states
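This generalization effect can be made concrete with a toy sketch (the two-region feature map and all numbers here are invented for illustration): backing up one state moves the parameters, and every state sharing a feature with it moves too.

```python
import numpy as np

# Hypothetical coarse feature map over states s in [0, 10):
# two overlapping regions, so nearby states share features.
def features(s):
    return np.array([1.0 if 0 <= s < 5 else 0.0,
                     1.0 if 3 <= s < 8 else 0.0])

theta = np.zeros(2)            # parameter vector

def V(s):
    return features(s) @ theta

theta = theta + 0.5 * features(4)   # back up only state 4
print(V(6))   # state 6 was never visited, yet its value changed
```

State 6 shares the second feature with state 4, so its estimated value rises from 0 to 0.5, while state 9 (no shared features) is unaffected.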


[Figure: supervised learning block diagram – a System maps Inputs to Outputs. Training info = desired (target) outputs; a training example = {input, target output}; Error = (target output – actual output)]

Performance Measures
• Let us assume that training examples all take the form (s, Vπ(s)) – a state paired with its true value under the policy
• A common performance metric is the mean-squared error (MSE) over a distribution P:
MSE(θt) = Σs P(s) [Vπ(s) – Vt(s)]²
• Q: Why use P? Is MSE the best metric?
• Let us assume that P is always the distribution of states at which backups are done
• On-policy distribution: the distribution created while following the policy being evaluated
• Stronger results are available for this distribution
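Once P and the two value functions are represented as arrays, this MSE is a one-line computation; a minimal sketch with invented numbers:

```python
import numpy as np

# Sketch: MSE(theta) = sum_s P(s) * (V_pi(s) - V_theta(s))**2
# All three arrays are illustrative, not from the lecture.
P = np.array([0.5, 0.3, 0.2])        # on-policy state distribution
V_pi = np.array([1.0, 2.0, 3.0])     # true values under the policy
V_theta = np.array([1.1, 1.8, 3.0])  # approximate values

mse = np.sum(P * (V_pi - V_theta) ** 2)
print(mse)
```

Note how the weighting by P makes errors at frequently visited states count more – which is exactly why the on-policy distribution is the natural choice.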

We iteratively move down the gradient:
θt+1 = θt + α [Vπ(st) – Vt(st)] ∇θVt(st)

• Let’s now consider the case where the target output, vt, for sample t is not the true value Vπ(st) (which is unavailable)
• In such cases we perform an approximate update:
θt+1 = θt + α [vt – Vt(st)] ∇θVt(st)
where vt is an unbiased estimate of the target output.

• Examples of vt are:
• Monte Carlo methods: vt = Rt (the complete return)
• TD(λ): vt = Rtλ (the λ-return)
• The general gradient-descent method is guaranteed to converge to a local minimum
• The following TD(0) update is often written as if it were a gradient-descent step:
θt+1 = θt + α [rt+1 + γVt(st+1) – Vt(st)] ∇θVt(st)
• This is not completely accurate, since it suggests that ∇θ[rt+1 + γVt(st+1)] = 0, which is not true, e.g. the bootstrapped target γVt(st+1) itself depends on θt
• Taking the true gradient of the squared TD error, we should be writing (residual GD):
θt+1 = θt + α [rt+1 + γVt(st+1) – Vt(st)] (∇θVt(st) – γ∇θVt(st+1))
• Comment: the whole scheme is no longer purely supervised learning!
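A minimal sketch of the semi-gradient TD(0) update with linear features may make the distinction concrete (the feature vectors and numbers are invented). Note that the gradient is taken only through Vt(st), not through the bootstrapped target – exactly the point criticized above:

```python
import numpy as np

def td0_update(theta, phi_s, phi_s_next, r, alpha=0.1, gamma=0.9):
    """One semi-gradient TD(0) step with linear value function V(s) = phi_s . theta."""
    v_s = phi_s @ theta
    v_next = phi_s_next @ theta
    delta = r + gamma * v_next - v_s       # TD error
    # "semi-gradient": gradient flows only through V(s), not the target
    return theta + alpha * delta * phi_s

theta = np.zeros(3)
theta = td0_update(theta,
                   np.array([1.0, 0.0, 0.0]),   # features of s_t
                   np.array([0.0, 1.0, 0.0]),   # features of s_{t+1}
                   r=1.0)
print(theta)   # only the features active at s_t are adjusted
```

The residual-gradient variant would instead multiply delta by (phi_s - gamma * phi_s_next), updating the weights of both states.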
Linear Methods
• One of the most important special cases of gradient-descent function approximation
• Vt becomes a linear function of the parameter vector θt
• For every state s, there is a (real-valued) column vector of features φs
• The features can be constructed from the states in many ways
• The linear approximate state-value function is given by
Vt(s) = θtT φs = Σi θt(i) φs(i)
Nice Properties of Linear FA Methods
• The gradient is very simple: ∇θVt(s) = φs
• For MSE, the error surface is simple: a quadratic surface with a single (global) minimum
• Linear gradient-descent TD(λ) converges if:
• the step size decreases appropriately
• states are sampled on-line (from the on-policy distribution)
• It converges to a parameter vector θ∞ with the property:
MSE(θ∞) ≤ [(1 – γλ)/(1 – γ)] MSE(θ*)
where θ* is the best parameter vector (Tsitsiklis & Van Roy, 1997)

Limitations of Pure Linear Methods
• Many applications require a mixture (e.g. product) of the different feature components
• Linear form prohibits direct representation of the interactions between features
• Intuition: feature i is good only in the absence of feature j
• Example (pole balancing): high angular velocity can be good or bad …
• If the angle is high ⇒ imminent danger of falling (bad state)
• If the angle is low ⇒ the pole is righting itself (good state)
• In such cases we need to introduce features that express a mixture of other features
Shaping Generalization in Coarse Coding
• If we train at one point (state), X, the parameters of all circles intersecting X will be affected
• Consequence: the value function of all points within the union of the circles will be affected
• Greater effects for points that have more circles “in common” with X
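A sketch of coarse coding with circular receptive fields (the centers and radius are invented): each binary feature turns on when the state falls inside its circle, and the overlap between circles is what couples the values of nearby states.

```python
import numpy as np

# Invented circle centers and a common radius
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
radius = 1.0

def phi(x):
    # one binary feature per circle: 1 if state x lies inside it
    return (np.linalg.norm(centers - x, axis=1) < radius).astype(float)

print(phi(np.array([0.4, 0.2])))    # inside all three circles
print(phi(np.array([1.2, -0.3])))   # only one circle is active
```

Training at the first point would adjust all three weights, so the value at the second point (which shares one circle) also shifts – the generalization described above.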
Learning and Coarse Coding

All three cases have the same number of features (50); the learning rate is 0.2/m, where m is the number of features present in each example

Tile Coding

• Binary feature for each tile
• Number of features present at any one time is constant
• Binary features mean the weighted sum is easy to compute
• Easy to compute the indices of the features present

Tile Coding Cont.

• Irregular tilings are also possible
• Hashing – maps a large (virtual) set of tiles down to a much smaller set of features
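A minimal 1-D tile-coding sketch (the tiling count, tile count, and offsets are invented): each tiling is a uniform grid shifted by a fraction of a tile width, so a state activates exactly one tile per tiling and the active-feature indices are cheap to compute.

```python
# Tile coding for a 1-D state x in [0, 1): offset tilings, one active tile each.
def active_tiles(x, n_tilings=4, tiles_per_tiling=10):
    indices = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_tiling)   # shift each tiling slightly
        i = int((x + offset) * tiles_per_tiling) % tiles_per_tiling
        indices.append(t * tiles_per_tiling + i)      # global feature index
    return indices

print(active_tiles(0.37))   # exactly n_tilings features are active
```

Because the number of active features is constant, the approximate value is just a sum of n_tilings weights, and a hashed variant would simply map each global index into a smaller table.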

Control with Function Approximation
• Learning state-action values
• Training examples of the form: ⟨(st, at), vt⟩
• Example – the mountain-car task. Challenge: driving an underpowered car up a steep mountain road
• Gravity is stronger than its engine
• Solution approach: back away from the goal first, building enough inertia on the opposite slope to carry the car up the steep slope
• Example of a task where things can get worse in a sense (farther from the goal) before they get better
• Hard to solve using classic control schemes
• Reward is –1 for all steps until the episode terminates
• Actions: full throttle forward (+1), full throttle reverse (–1), and zero throttle (0)
• Two 9×9 overlapping tilings were used to represent the continuous state space
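A sketch of how the control update might be set up with one linear weight vector per action (the feature dimension and step size are invented; a real implementation would feed in the tile-coded features of the continuous state):

```python
import numpy as np

# One linear weight vector per action: reverse (-1), zero (0), forward (+1)
N_FEATURES = 8                        # invented feature dimension
ACTIONS = (-1, 0, 1)
theta = {a: np.zeros(N_FEATURES) for a in ACTIONS}

def q(phi, a):
    """Approximate action value Q(s, a) = phi(s) . theta_a."""
    return phi @ theta[a]

def sarsa_update(phi, a, r, phi_next, a_next, alpha=0.1, gamma=1.0):
    # semi-gradient Sarsa: bootstrap from Q(s', a'), gradient through Q(s, a) only
    delta = r + gamma * q(phi_next, a_next) - q(phi, a)
    theta[a] += alpha * delta * phi
```

Each step of an episode calls sarsa_update with the features of the current and next state-action pairs; with reward −1 per step, the values of visited pairs are pushed down until the goal is reached.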
Summary
• Generalization is an important RL attribute
• Adapting supervised-learning function approximation methods
• Each backup is treated as a learning example