Applying Online Search Techniques to Reinforcement Learning

Scott Davies, Andrew Ng, and Andrew Moore

Carnegie Mellon University

The Agony of Continuous State Spaces

  • Learning useful value functions for continuous-state optimal control problems can be difficult

    • Small inaccuracies/inconsistencies in approximated value functions can cause simple controllers to fail miserably

    • Accurate value functions can be very expensive to compute even in relatively low-dimensional spaces with perfectly accurate state transition models

Combining Value Functions With Online Search

  • Instead of modeling the value function accurately everywhere, we can perform online searches for good trajectories from the agent’s current position to compensate for value function inaccuracies

  • We examine two different types of search:

    • “Local” searches in which the agent performs a finite-depth look-ahead search

    • “Global” searches in which the agent searches for trajectories all the way to goal states

Typical One-Step “Search”

Given a value function V(x) over the state space, an agent typically uses a model to predict where each possible one-step trajectory T takes it, then chooses the trajectory that maximizes

$R_T + \gamma V(x_T)$

where $R_T$ is the reward accumulated along T, $\gamma$ is the discount factor, and $x_T$ is the state at the end of T.

This takes O(|A|) time, where A is the set of possible actions.

Given a perfect V(x), this would lead to optimal behavior.
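The one-step rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `one_step_greedy`, `toy_model`, and `toy_V` are hypothetical, the model is assumed to be deterministic and to return a `(next_state, reward)` pair, and the toy problem (a walk on a line with the goal at x = 3 and a reward of -1 per step) is invented for the example.

```python
def one_step_greedy(x, actions, model, V, gamma):
    """Return (action, score) maximizing r + gamma * V(x') over one-step trajectories.

    `model(x, a)` is an assumed deterministic model returning (next_state, reward).
    """
    best_action, best_score = None, float("-inf")
    for a in actions:                      # O(|A|) model evaluations
        x_next, r = model(x, a)
        score = r + gamma * V(x_next)
        if score > best_score:
            best_action, best_score = a, score
    return best_action, best_score


# Hypothetical toy problem: walk on a line, goal at x = 3, reward -1 per step.
def toy_model(x, a):
    return x + a, -1.0


def toy_V(x):
    # Negative distance to the goal, used here as an approximate value function.
    return -abs(3 - x)


action, score = one_step_greedy(0, (-1, 1), toy_model, toy_V, gamma=1.0)
```

Here the agent at x = 0 compares moving left (score -5) with moving right (score -3) and picks the step toward the goal.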

Local Search

  • An obvious possible extension: consider all possible d-step trajectories T, selecting the one that maximizes $R_T + \gamma^d V(x_T)$.

    • Computational expense: $O(|A|^d)$.

  • To make deeper searches more computationally tractable, we can limit the agent to considering only trajectories in which the action is switched at most s times.

    • Computational expense: $O(d^s |A|^{s+1})$, counting the choices of switch points and actions

      (considerably cheaper than full d-step search if s << d)
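To get a feel for the savings, the sketch below enumerates every d-step action sequence with at most s switches and checks the count against a closed form. Counting directly, there are $\sum_{k=0}^{s} \binom{d-1}{k} |A| (|A|-1)^k$ such sequences: choose k switch points among the d-1 step boundaries, a first action, and a different action at each switch. The function names are hypothetical and chosen only for this illustration.

```python
from math import comb


def switch_limited_sequences(actions, d, s):
    """Yield every length-d action sequence that changes action at most s times."""
    def extend(prefix, switches_left):
        if len(prefix) == d:
            yield tuple(prefix)
            return
        for a in actions:
            if not prefix or a == prefix[-1]:
                # Repeating the previous action (or choosing the first one) is free.
                yield from extend(prefix + [a], switches_left)
            elif switches_left > 0:
                # Changing the action uses up one of the allowed switches.
                yield from extend(prefix + [a], switches_left - 1)
    yield from extend([], s)


def count_switch_limited(n_actions, d, s):
    """Closed-form count of the sequences generated above."""
    return sum(comb(d - 1, k) * n_actions * (n_actions - 1) ** k
               for k in range(min(s, d - 1) + 1))


# Two actions, depth 20, at most one switch: 40 sequences instead of 2**20.
n_limited = len(list(switch_limited_sequences((0, 1), 20, 1)))
n_full = 2 ** 20
```

With s = d-1 the restriction disappears and the count equals $|A|^d$ again, which is a quick sanity check on the formula.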

Local Search: Example

  • Two-dimensional state space (position + velocity)

  • Car must back up to take a “running start” to make it up the hill

Search over 20-step trajectories with at most one switch in actions

Using Local Search Online


  • From the current state, consider all possible d-step trajectories T in which the action is changed at most s times

  • Perform the first action in the trajectory that maximizes $R_T + \gamma^d V(x_T)$.

    Let B denote the “parallel backup operator” such that

    $(BV)(x) = \max_{a \in A} \left[ R(x, a) + \gamma V(f(x, a)) \right]$

    where R(x, a) is the one-step reward and f(x, a) is the successor state predicted by the model.

If s = (d-1), Local Search is formally equivalent to behaving greedily with respect to the new value function $B^{d-1}V$. Since V is typically arrived at through iterations of a much cruder backup operator, this value function is often much more accurate than V.
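A minimal sketch of the online procedure follows, assuming a deterministic model that returns `(next_state, reward)`. All names (`local_search`, `backup`, `toy_model`, `toy_V`) are hypothetical, and `backup` implements the standard one-step max backup, which is this sketch's reading of the operator B. The last lines check numerically that an unrestricted d-step search (s = d-1) scores the same as applying the backup d times, which is the equivalence stated above.

```python
def local_search(x, actions, model, V, d, s, gamma):
    """Return (best_score, first_action) over d-step trajectories with <= s switches."""
    best = (float("-inf"), None)

    def dfs(state, depth, last, switches_left, ret, disc, first):
        nonlocal best
        if depth == d:
            score = ret + disc * V(state)      # R_T + gamma^d * V(x_T)
            if score > best[0]:
                best = (score, first)
            return
        for a in actions:
            cost = 0 if (last is None or a == last) else 1
            if cost > switches_left:
                continue                        # would exceed the switch budget
            nxt, r = model(state, a)
            dfs(nxt, depth + 1, a, switches_left - cost,
                ret + disc * r, disc * gamma,
                a if first is None else first)

    dfs(x, 0, None, s, 0.0, 1.0, None)
    return best


def backup(V, actions, model, gamma):
    """One application of the one-step max backup (assumed form of B)."""
    return lambda x: max(r + gamma * V(nx)
                         for nx, r in (model(x, a) for a in actions))


# Hypothetical toy problem: walk on a line, goal at x = 3, reward -1 per step.
def toy_model(x, a):
    return x + a, -1.0


def toy_V(x):
    return -abs(3 - x)


ACTIONS = (-1, 1)
score_s0, first_s0 = local_search(0, ACTIONS, toy_model, toy_V, d=3, s=0, gamma=0.9)

# Unrestricted search (s = d - 1) versus d applications of the backup.
score_full, _ = local_search(0, ACTIONS, toy_model, toy_V, d=3, s=2, gamma=0.9)
b3v = toy_V
for _ in range(3):
    b3v = backup(b3v, ACTIONS, toy_model, 0.9)
```

In online use the agent would execute only `first_s0`, observe the new state, and search again from there.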

Uninformed Global Search

  • Suppose we have a minimum-cost-to-goal problem in a continuous state space with nonnegative costs. Why not forget about explicitly calculating V and just extend the search from the current position all the way to the goal?

  • Problem: combinatorial explosion.

  • Possible solution:

    • Break state space into partitions, e.g. a uniform grid. (Can be represented sparsely.)

    • Use previously discussed local search procedure to find trajectories between partitions

    • Prune all but least-cost trajectory entering any given partition
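The partition-and-prune idea can be sketched with a dictionary keyed by grid cell, so that only cells actually reached take up memory. This is an illustrative sketch under stated assumptions: the class name `SparseGrid`, its methods, and the choice to clamp out-of-range states to the border cells are all hypothetical rather than taken from the presentation.

```python
class SparseGrid:
    """Uniform grid over a box-shaped state space, stored sparsely."""

    def __init__(self, lows, highs, resolution):
        self.lows, self.highs, self.resolution = lows, highs, resolution
        self.best = {}   # cell index -> (cost, trajectory); only reached cells appear

    def cell_of(self, x):
        """Map a continuous state to the tuple index of its grid cell."""
        idx = []
        for xi, lo, hi in zip(x, self.lows, self.highs):
            frac = (xi - lo) / (hi - lo)
            # Clamp so that states on or beyond the boundary fall in a border cell.
            idx.append(min(self.resolution - 1, max(0, int(frac * self.resolution))))
        return tuple(idx)

    def offer(self, x, cost, trajectory):
        """Keep only the least-cost trajectory entering x's cell.

        Returns True if the trajectory was kept, False if it was pruned.
        """
        cell = self.cell_of(x)
        if cell not in self.best or cost < self.best[cell][0]:
            self.best[cell] = (cost, trajectory)
            return True
        return False


grid = SparseGrid(lows=(-1.0, -1.0), highs=(1.0, 1.0), resolution=50)
kept_first = grid.offer((0.0, 0.0), 5.0, "trajectory a")
kept_worse = grid.offer((0.01, 0.01), 7.0, "trajectory b")   # same cell, higher cost
kept_better = grid.offer((0.01, 0.01), 3.0, "trajectory c")  # same cell, lower cost
```

The pruning rule is also where the search can go wrong: the cheapest trajectory into a cell is not necessarily the one whose entry state leads most cheaply to the goal.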

Uninformed Global Search, cont’d

  • Problems:

    • Still computationally expensive

    • Even with fine partitioning of state space, pruning the wrong trajectories can cause search to fail

Informed Global Search

  • Use approximate value function V to guide the selection of which points to search from next

  • Reasonably accurate V will cause search to stay along optimal path to goal: dramatic reduction in search time

  • V can help choose effective points within each partition from which to search, thereby improving solution quality

  • Uninformed Global Search is the same as “Informed” Global Search with V(x) = 0

Informed Global Search Algorithm

  • Let x0 be current state, and g(x0) be the grid element containing x0

  • Set g(x0)’s “representative state” to x0, and add g(x0) to priority queue P with priority V(x0)

  • Until goal state found or P empty:

    • Remove grid element g from top of P. Let x denote g’s “representative state.”

    • SEARCH-FROM(g, x)

  • If goal found, execute trajectory; otherwise signal failure

Informed Global Search Algorithm, cont’d

SEARCH-FROM(g, x):

  • Starting from x, perform “local search” as described earlier, but prune the search wherever it reaches a different grid element g′ ≠ g.

  • Each time another grid element g′ is reached at state x′:

    • If g′ was previously SEARCHED-FROM, do nothing.

    • If g′ was never previously reached, add g′ to P with priority $R_T(x_0 \ldots x') + \gamma^{|T|} V(x')$, where T is the trajectory from x0 to x′. Set g′’s “representative state” to x′. Record the trajectory from x to x′.

    • If g′ was previously reached but its previous priority is lower than $R_T(x_0 \ldots x') + \gamma^{|T|} V(x')$, update g′’s priority to $R_T(x_0 \ldots x') + \gamma^{|T|} V(x')$ and set its “representative state” to x′. Record the trajectory from x to x′.

Informed Global Search Examples

[Figure: hill-car search trees, shown for a 7×7 and a 13×13 simplex-interpolated V]

Informed Global Search as A*

  • Informed Global Search is essentially an A* search using the value function V as a search heuristic

  • Using A* with an optimistic heuristic function normally guarantees an optimal path to the goal.

  • Uninformed global search effectively uses the trivially optimistic heuristic V(x) = 0. Might we expect better solution quality with uninformed search than with a crude, non-optimistic approximate value function V?

  • Not necessarily! A crude, non-optimistic approximate value function can improve solution quality by helping the algorithm avoid pruning the wrong parts of the search tree

Hill-Car

  • Car on steep hill

  • State variables: position and velocity (2-d)

  • Actions: accelerate forward or backward

  • Goal: park near top

  • Random start states

  • Cost: total time to goal


Acrobot

  • Two-link planar robot acting in vertical plane under gravity

  • Actuated joint at elbow; unactuated shoulder

  • Two angular positions & their velocities (4-d)

  • Goal: raise tip at least one link’s height above shoulder

  • Two actions: full torque clockwise / counterclockwise

  • Random starting positions

  • Cost: total time to goal




Move-Cart-Pole

  • Upright pole attached to cart by unactuated joint

  • State: horizontal position of cart, angle of pole, and associated velocities (4-d)

  • Actions: accelerate left or right

  • Goal configuration: cart moved, pole balanced

  • Start with random x; θ = 0

  • Per-step cost quadratic in distance from goal configuration

  • Big penalty if pole falls over

[Figure: goal configuration]


Planar Slider

  • Puck sliding on bumpy 2-d surface

  • Two spatial variables & their velocities (4-d)

  • Actions: accelerate NW, NE, SW, or SE

  • Goal in NW corner

  • Random start states

  • Cost: total time to goal

Local Search Experiments


  • CPU Time and Solution cost vs. search depth d

  • No limits imposed on number of action switches (s=d)

  • Value function: $13^4$ simplex-interpolation grid

Local Search Experiments, cont’d


  • CPU Time and Solution cost vs. search depth d

  • Max. number of action switches fixed at 2 (s = 2)

  • Value function: $7^2$ simplex-interpolated grid

Comparative Experiments: Hill-Car

  • Local search: d=6, s=2

  • Global searches:

    • Local search between grid elements: d=20, s=1

    • $50^2$ search grid resolution

  • $7^2$ simplex-interpolated value function

Hill-Car Results, cont’d

  • Uninformed Global Search prunes wrong trajectories

  • Increase search grid to $100^2$ so this doesn’t happen:

    • Uninformed search does near-optimally

    • Informed search doesn’t: the crude value function is not optimistic

[Figure: failed search trajectory]


Comparative Results: Four-d Domains

All value functions: $13^4$ simplex interpolations

All local searches between global search elements: depth 20, with at most 1 action switch (d=20, s=1)

  • Acrobot:

    • Local Search: depth 4; no action switch restriction (d=4,s=4)

    • Global: $50^4$ search grid

  • Move-Cart-Pole: same as Acrobot

  • Slider:

    • Local Search: depth 10; max. 1 action switch (d=10,s=1)

    • Global: $20^4$ search grid


Acrobot

  • Local search significantly improves solution quality, but increases CPU time by an order of magnitude

  • Uninformed global search takes even more time; poor solution quality indicates suboptimal trajectory pruning

  • Informed global search finds much better solutions in relatively little time. Value function drastically reduces search, and better pruning leads to better solutions

#LS: number of local searches performed to find paths between elements of global search grid


Move-Cart-Pole

  • No search: pole often falls, incurring large penalties; overall poor solution quality

  • Local search improves things a bit

  • Uninformed search finds better solutions than informed

    • Few grid cells in which pruning is required

    • Value function not optimistic, so informed search solutions suboptimal

  • Informed search reduces costs by an order of magnitude with no increase in required CPU time

Planar Slider

  • Local search almost useless, and incurs massive CPU expense

  • Uninformed search decreases solution cost by 50%, but at even greater CPU expense

  • Informed search decreases solution cost by a factor of 4, at no increase in CPU time

Using Search with Learned Models

  • Toy Example: Hill-Car

    • $7^2$ simplex-interpolated value function

    • One nearest-neighbor function approximator per possible action used to learn dx/dt

    • States sufficiently far away from nearest neighbor optimistically assumed to be absorbing to encourage exploration

  • Average costs over first few hundred trials:

    • No search: 212

    • Local search: 127

    • Informed global search: 155
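The learned-model setup in the toy example can be sketched as one nearest-neighbor predictor of dx/dt per action. Everything specific here is an assumption made for illustration: the class and function names, the Euclidean distance, the `max_distance` threshold, the Euler integration step, and the interpretation of "optimistically absorbing" as "stop predicting and stay put" are not taken from the presentation.

```python
import math


class NearestNeighborModel:
    """Predicts dx/dt for one action from the nearest stored sample."""

    def __init__(self, max_distance):
        self.samples = []                 # list of (state, dxdt) pairs
        self.max_distance = max_distance  # beyond this, the state counts as unexplored

    def add(self, state, dxdt):
        self.samples.append((tuple(state), tuple(dxdt)))

    def predict(self, state):
        """Return (dxdt, known); known is False when no sample is close enough."""
        if not self.samples:
            return None, False
        near_state, near_dxdt = min(self.samples,
                                    key=lambda s: math.dist(s[0], state))
        if math.dist(near_state, state) > self.max_distance:
            return None, False
        return near_dxdt, True


def predict_next(models, state, action, dt):
    """Return (next_state, absorbing) using the model learned for `action`.

    Unexplored states are reported as absorbing, so a planner can treat them
    optimistically and be drawn toward regions with no data yet.
    """
    dxdt, known = models[action].predict(state)
    if not known:
        return tuple(state), True
    # Simple Euler step (an assumption of this sketch).
    return tuple(x + dt * v for x, v in zip(state, dxdt)), False


models = {"forward": NearestNeighborModel(max_distance=1.0)}
models["forward"].add(state=(1.0, 0.0), dxdt=(2.0, 4.0))

near = predict_next(models, (1.0, 0.0), "forward", dt=0.5)
far = predict_next(models, (5.0, 5.0), "forward", dt=0.5)
```

Each newly observed transition would be added with `add`, which gradually shrinks the region reported as absorbing.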

Using Search with Learned Models, cont’d

  • Problems do arise when using learned models:

    • Inaccuracies in models may cause global searches to fail. Not clear then if failure should be blamed on model inaccuracies or on insufficiently fine state space partitioning

    • Trajectories found will be inaccurate

      • Need adaptive closed-loop controller

      • Fortunately, we will get new data with which to increase the accuracy of our model

    • Model approximators must be fast and accurate

Avenues for Future Research

  • Extensions to nondeterministic systems?

  • Higher-dimensional problems

  • Better function approximators for model learning

  • Variable-resolution search grids

  • Optimistic value function generation?