ECE-517: Reinforcement Learning in Artificial Intelligence Lecture 12: Generalization and Function Approximation

October 23, 2012

Dr. Itamar Arel

College of Engineering

Department of Electrical Engineering and Computer Science

The University of Tennessee

Fall 2012

Outline
• Introduction
• Value Prediction with Function Approximation
• Linear Methods
• Control with Function Approximation
Introduction
• We have so far assumed a tabular representation of state-value or action-value functions
• Inherently limits our problem-space to small state/action sets
• Space requirements – storage of values
• Computation complexity – sweeping/updating the values
• Communication constraints – getting the data where it needs to go
• Reality is very different – high-dimensional state representations are common
• We will next look at generalization – an attempt by the agent to learn about a large state set while visiting/experiencing only a small subset of it
• People do it – how can machines achieve the same goal?
General Approach
• Luckily, many approximation techniques have been developed
• e.g. multivariate function approximation schemes
• We will utilize such techniques in a RL context
Value Prediction with FA
• Instead of using a table for Vt, the latter will be represented in a parameterized functional form with parameter vector θt
• We’ll assume that Vt is a sufficiently smooth differentiable function of θt, for all s
• For example, a neural network can be trained to predict V, where θt are the connection weights
• We will require that the number of parameters (the dimension of θt) is much smaller than the size of the state set
• When a single state is backed up, the change generalizes to affect the values of many other states
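This generalization effect can be made concrete with a toy sketch (the two-region feature map and all numbers here are invented for illustration): backing up one state moves the parameters, and every state sharing a feature with it moves too.

```python
import numpy as np

# Hypothetical coarse feature map over states s in [0, 10):
# two overlapping regions, so nearby states share features.
def features(s):
    return np.array([1.0 if 0 <= s < 5 else 0.0,
                     1.0 if 3 <= s < 8 else 0.0])

theta = np.zeros(2)            # parameter vector

def V(s):
    return features(s) @ theta

theta = theta + 0.5 * features(4)   # back up only state 4
print(V(6))   # state 6 was never visited, yet its value changed
```

State 6 shares the second feature with state 4, so its estimated value rises from 0 to 0.5, while state 9 (no shared features) is unaffected.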


[Figure: supervised learning block diagram – a System maps Inputs to Outputs. Training info = desired (target) outputs; a training example = {input, target output}; Error = (target output – actual output)]

Performance Measures
• Let us assume that training examples all take the form (s, Vπ(s)) – a state paired with its true value under the policy
• A common performance metric is the mean-squared error (MSE) over a distribution P:
MSE(θt) = Σs P(s) [Vπ(s) – Vt(s)]²
• Q: Why use P? Is MSE the best metric?
• Let us assume that P is always the distribution of states at which backups are done
• On-policy distribution: the distribution created while following the policy being evaluated
• Stronger results are available for this distribution
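Once P and the two value functions are represented as arrays, this MSE is a one-line computation; a minimal sketch with invented numbers:

```python
import numpy as np

# Sketch: MSE(theta) = sum_s P(s) * (V_pi(s) - V_theta(s))**2
# All three arrays are illustrative, not from the lecture.
P = np.array([0.5, 0.3, 0.2])        # on-policy state distribution
V_pi = np.array([1.0, 2.0, 3.0])     # true values under the policy
V_theta = np.array([1.1, 1.8, 3.0])  # approximate values

mse = np.sum(P * (V_pi - V_theta) ** 2)
print(mse)
```

Note how the weighting by P makes errors at frequently visited states count more – which is exactly why the on-policy distribution is the natural choice.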

We iteratively move down the gradient:
θt+1 = θt + α [Vπ(st) – Vt(st)] ∇θVt(st)

• Let’s now consider the case where the target output, vt, for sample t is not the true value Vπ(st) (which is unavailable)
• In such cases we perform an approximate update:
θt+1 = θt + α [vt – Vt(st)] ∇θVt(st)
where vt is an unbiased estimate of the target output.

• Examples of vt are:
• Monte Carlo methods: vt = Rt (the complete return)
• TD(λ): vt = Rtλ (the λ-return)
• The general gradient-descent method is guaranteed to converge to a local minimum
• The following TD(0) update is often written as if it were a gradient-descent step:
θt+1 = θt + α [rt+1 + γVt(st+1) – Vt(st)] ∇θVt(st)
• This is not completely accurate, since it suggests that ∇θ[rt+1 + γVt(st+1)] = 0, which is not true, e.g. the bootstrapped target γVt(st+1) itself depends on θt
• Taking the true gradient of the squared TD error, we should be writing (residual GD):
θt+1 = θt + α [rt+1 + γVt(st+1) – Vt(st)] (∇θVt(st) – γ∇θVt(st+1))
• Comment: the whole scheme is no longer purely supervised learning!
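A minimal sketch of the semi-gradient TD(0) update with linear features may make the distinction concrete (the feature vectors and numbers are invented). Note that the gradient is taken only through Vt(st), not through the bootstrapped target – exactly the point criticized above:

```python
import numpy as np

def td0_update(theta, phi_s, phi_s_next, r, alpha=0.1, gamma=0.9):
    """One semi-gradient TD(0) step with linear value function V(s) = phi_s . theta."""
    v_s = phi_s @ theta
    v_next = phi_s_next @ theta
    delta = r + gamma * v_next - v_s       # TD error
    # "semi-gradient": gradient flows only through V(s), not the target
    return theta + alpha * delta * phi_s

theta = np.zeros(3)
theta = td0_update(theta,
                   np.array([1.0, 0.0, 0.0]),   # features of s_t
                   np.array([0.0, 1.0, 0.0]),   # features of s_{t+1}
                   r=1.0)
print(theta)   # only the features active at s_t are adjusted
```

The residual-gradient variant would instead multiply delta by (phi_s - gamma * phi_s_next), updating the weights of both states.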
Linear Methods
• One of the most important special cases of gradient-descent function approximation
• Vt becomes a linear function of the parameter vector θt
• For every state s, there is a (real-valued) column vector of features φs
• The features can be constructed from the states in many ways
• The linear approximate state-value function is given by
Vt(s) = θtT φs = Σi θt(i) φs(i)
Nice Properties of Linear FA Methods
• The gradient is very simple: ∇θVt(s) = φs
• For MSE, the error surface is simple: a quadratic surface with a single (global) minimum
• Linear gradient-descent TD(λ) converges if:
• the step size decreases appropriately
• states are sampled on-line (from the on-policy distribution)
• It converges to a parameter vector θ∞ with the property:
MSE(θ∞) ≤ [(1 – γλ)/(1 – γ)] MSE(θ*)
where θ* is the best parameter vector (Tsitsiklis & Van Roy, 1997)

Limitations of Pure Linear Methods
• Many applications require a mixture (e.g. product) of the different feature components
• Linear form prohibits direct representation of the interactions between features
• Intuition: feature i is good only in the absence of feature j
• Example (pole balancing): high angular velocity can be good or bad …
• If the angle is high ⇒ imminent danger of falling (bad state)
• If the angle is low ⇒ the pole is righting itself (good state)
• In such cases we need to introduce features that express a mixture of other features
Shaping Generalization in Coarse Coding
• If we train at one point (state), X, the parameters of all circles intersecting X will be affected
• Consequence: the value function of all points within the union of the circles will be affected
• Greater effects for points that have more circles “in common” with X
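A sketch of coarse coding with circular receptive fields (the centers and radius are invented): each binary feature turns on when the state falls inside its circle, and the overlap between circles is what couples the values of nearby states.

```python
import numpy as np

# Invented circle centers and a common radius
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
radius = 1.0

def phi(x):
    # one binary feature per circle: 1 if state x lies inside it
    return (np.linalg.norm(centers - x, axis=1) < radius).astype(float)

print(phi(np.array([0.4, 0.2])))    # inside all three circles
print(phi(np.array([1.2, -0.3])))   # only one circle is active
```

Training at the first point would adjust all three weights, so the value at the second point (which shares one circle) also shifts – the generalization described above.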
Learning and Coarse Coding

All three cases have the same number of features (50); the learning rate is 0.2/m, where m is the number of features present in each example

Tile Coding

• Binary feature for each tile
• Number of features present at any one time is constant
• Binary features mean the weighted sum is easy to compute
• Easy to compute the indices of the features present

Tile Coding Cont.

• Irregular tilings are also possible
• Hashing – maps a large (virtual) set of tiles down to a much smaller set of features
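A minimal 1-D tile-coding sketch (the tiling count, tile count, and offsets are invented): each tiling is a uniform grid shifted by a fraction of a tile width, so a state activates exactly one tile per tiling and the active-feature indices are cheap to compute.

```python
# Tile coding for a 1-D state x in [0, 1): offset tilings, one active tile each.
def active_tiles(x, n_tilings=4, tiles_per_tiling=10):
    indices = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_tiling)   # shift each tiling slightly
        i = int((x + offset) * tiles_per_tiling) % tiles_per_tiling
        indices.append(t * tiles_per_tiling + i)      # global feature index
    return indices

print(active_tiles(0.37))   # exactly n_tilings features are active
```

Because the number of active features is constant, the approximate value is just a sum of n_tilings weights, and a hashed variant would simply map each global index into a smaller table.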

Control with Function Approximation
• Learning state-action values
• Training examples of the form: ⟨(st, at), vt⟩
• Example – the mountain-car task. Challenge: driving an underpowered car up a steep mountain road
• Gravity is stronger than its engine
• Solution approach: back away from the goal first, building enough inertia on the opposite slope to carry the car up the steep slope
• Example of a task where things can get worse in a sense (farther from the goal) before they get better
• Hard to solve using classic control schemes
• Reward is –1 for all steps until the episode terminates
• Actions: full throttle forward (+1), full throttle reverse (–1), and zero throttle (0)
• Two 9×9 overlapping tilings were used to represent the continuous state space
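A sketch of how the control update might be set up with one linear weight vector per action (the feature dimension and step size are invented; a real implementation would feed in the tile-coded features of the continuous state):

```python
import numpy as np

# One linear weight vector per action: reverse (-1), zero (0), forward (+1)
N_FEATURES = 8                        # invented feature dimension
ACTIONS = (-1, 0, 1)
theta = {a: np.zeros(N_FEATURES) for a in ACTIONS}

def q(phi, a):
    """Approximate action value Q(s, a) = phi(s) . theta_a."""
    return phi @ theta[a]

def sarsa_update(phi, a, r, phi_next, a_next, alpha=0.1, gamma=1.0):
    # semi-gradient Sarsa: bootstrap from Q(s', a'), gradient through Q(s, a) only
    delta = r + gamma * q(phi_next, a_next) - q(phi, a)
    theta[a] += alpha * delta * phi
```

Each step of an episode calls sarsa_update with the features of the current and next state-action pairs; with reward −1 per step, the values of visited pairs are pushed down until the goal is reached.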
Summary
• Generalization is an important RL attribute
• Adapting supervised-learning function approximation methods
• Each backup is treated as a learning example