Applying online search techniques to reinforcement learning

Scott Davies, Andrew Ng, and Andrew Moore

Carnegie Mellon University

The Agony of Continuous State Spaces

  • Learning useful value functions for continuous-state optimal control problems can be difficult

    • Small inaccuracies/inconsistencies in approximated value functions can cause simple controllers to fail miserably

    • Accurate value functions can be very expensive to compute even in relatively low-dimensional spaces with perfectly accurate state transition models

Combining Value Functions With Online Search

  • Instead of modeling the value function accurately everywhere, we can perform online searches for good trajectories from the agent’s current position to compensate for value function inaccuracies

  • We examine two different types of search:

    • “Local” searches in which the agent performs a finite-depth look-ahead search

    • “Global” searches in which the agent searches for trajectories all the way to goal states

Typical One-Step “Search”

Given a value function V(x) over the state space, an agent typically uses a model to predict where each possible one-step trajectory T takes it, then chooses the trajectory that maximizes

R_T + γ V(x_T)

where R_T is the reward accumulated along T, γ is the discount factor, and x_T is the state at the end of T.

This takes O(|A|) time, where A is the set of possible actions.

Given a perfect V(x), this would lead to optimal behavior.
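The one-step rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names one_step_greedy, toy_step, and toy_V, and the 1-D toy problem itself, are assumptions made for the example.

```python
def one_step_greedy(x, actions, step, V, gamma):
    """Return the action whose one-step trajectory maximizes R_T + gamma * V(x_T).

    step(x, a) is an assumed model interface returning (next_state, reward).
    """
    def score(a):
        x_next, reward = step(x, a)
        return reward + gamma * V(x_next)
    return max(actions, key=score)


# Hypothetical toy problem: a point on a line, goal at x = 3, every step costs 1.
def toy_step(x, a):
    return x + a, -1.0


def toy_V(x):
    # Crude value estimate: negative distance to the goal.
    return -abs(3 - x)


best = one_step_greedy(0, [-1, +1], toy_step, toy_V, gamma=0.9)
```

The loop evaluates the model once per action, which is where the O(|A|) cost comes from.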

Local Search

  • An obvious possible extension: consider all possible d-step trajectories T, selecting the one that maximizes R_T + γ^d V(x_T).

    • Computational expense: O(|A|^d).

  • To make deeper searches more computationally tractable, we can limit the agent to considering only trajectories in which the action is switched at most s times.

    • Computational expense: considerably cheaper than full d-step search if s << d.
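To get a feel for the savings, the switch-limited trajectories can be enumerated directly. The sketch below is an illustration under assumptions: the function names are hypothetical, and the closed-form count comes from a simple counting argument (choose the k switch times among d-1 positions, the first action, then a different action at each switch), not from the slides. That count grows roughly like d^s |A|^(s+1), against |A|^d for the full search.

```python
from itertools import combinations, product
from math import comb


def limited_switch_sequences(actions, d, s):
    """Yield every length-d action sequence that changes action at most s times."""
    actions = list(actions)
    for k in range(min(s, d - 1) + 1):                 # k = exact number of switches
        for switch_points in combinations(range(1, d), k):
            for segs in product(actions, repeat=k + 1):
                # Adjacent segments must use different actions, or it is not a switch.
                if any(segs[i] == segs[i + 1] for i in range(k)):
                    continue
                seq, seg = [], 0
                for t in range(d):
                    if seg < k and t == switch_points[seg]:
                        seg += 1
                    seq.append(segs[seg])
                yield tuple(seq)


def count_limited(n_actions, d, s):
    """Closed-form count of the sequences generated above."""
    return sum(comb(d - 1, k) * n_actions * (n_actions - 1) ** k
               for k in range(min(s, d - 1) + 1))


n_limited = count_limited(2, 20, 1)   # two actions, d = 20, s = 1
n_full = 2 ** 20                      # unrestricted 20-step search
```

With two actions, d = 20, and s = 1, the restricted search examines 40 trajectories instead of 1,048,576.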



Local Search: Example

  • Two-dimensional state space (position + velocity)

  • Car must back up to take “running start” to make it

Search over 20-step trajectories with at most one switch in actions

Using Local Search Online


  • From current state, consider all possible d-step trajectories T in which the action is changed at most s times

  • Perform the first action in the trajectory that maximizes R_T + γ^d V(x_T).

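    Let B denote the “parallel backup operator” such that

    (BV)(x) = max_a [ R(x, a) + γ V(x′) ]

    where x′ is the state the model predicts after taking action a in state x, and R(x, a) is the reward for that step.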

If s = (d-1), Local Search is formally equivalent to behaving greedily with respect to the new value function B^(d-1) V.

Since V is typically arrived at through iterations of a much cruder backup operator, this value function is often much more accurate than V.
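The equivalence can be checked numerically on a small deterministic problem. The sketch below assumes a hypothetical six-state chain and an arbitrary crude V; it compares the best unrestricted d-step trajectory value against d applications of a one-step max backup. Agreement at depth d is the same statement as the first action being greedy with respect to d-1 backups of V.

```python
from itertools import product

STATES = range(6)
ACTIONS = (-1, +1)
GAMMA = 0.9


def step(x, a):
    """Hypothetical deterministic chain: move left or right within [0, 5]; each step costs 1."""
    return min(max(x + a, 0), 5), -1.0


def backup(V):
    """One parallel backup: max over actions of reward + GAMMA * V(next state), for every state."""
    return [max(r + GAMMA * V[x2] for x2, r in (step(x, a) for a in ACTIONS))
            for x in STATES]


def trajectory_value(x, seq, V):
    """Discounted reward along the action sequence plus discounted V at the end state."""
    total, disc = 0.0, 1.0
    for a in seq:
        x, r = step(x, a)
        total += disc * r
        disc *= GAMMA
    return total + disc * V[x]


def local_search_value(x, d, V):
    """Best value over all d-step trajectories (no limit on action switches)."""
    return max(trajectory_value(x, seq, V) for seq in product(ACTIONS, repeat=d))


crude_V = [0.0, -3.0, 1.0, -2.0, 0.0, 0.5]   # arbitrary, deliberately inconsistent

backed_up = crude_V
for _ in range(3):
    backed_up = backup(backed_up)             # three backups of crude_V

gaps = [abs(local_search_value(x, 3, crude_V) - backed_up[x]) for x in STATES]
```

Every entry of gaps is zero up to floating-point rounding, because the model is deterministic and the search is exhaustive.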

Uninformed Global Search

  • Suppose we have a minimum-cost-to-goal problem in a continuous state space with nonnegative costs. Why not forget about explicitly calculating V and just extend the search from the current position all the way to the goal?

  • Problem: combinatorial explosion.

  • Possible solution:

    • Break state space into partitions, e.g. a uniform grid. (Can be represented sparsely.)

    • Use previously discussed local search procedure to find trajectories between partitions

    • Prune all but least-cost trajectory entering any given partition

Uninformed Global Search

  • Problems:

    • Still computationally expensive

    • Even with fine partitioning of state space, pruning the wrong trajectories can cause search to fail

Informed Global Search

  • Use approximate value function V to guide the selection of which points to search from next

  • Reasonably accurate V will cause search to stay along optimal path to goal: dramatic reduction in search time

  • V can help choose effective points within each partition from which to search, thereby improving solution quality

  • Uninformed Global Search is the same as “Informed” Global Search with V(x) = 0

Informed Global Search Algorithm

  • Let x0 be current state, and g(x0) be the grid element containing x0

  • Set g(x0)’s “representative state” to x0, and add g(x0) to priority queue P with priority V(x0)

  • Until goal state found or P empty:

    • Remove grid element g from top of P. Let x denote g’s “representative state.”

    • SEARCH-FROM(g, x)

  • If goal found, execute trajectory; otherwise signal failure

Informed Global Search Algorithm, cont’d


  • SEARCH-FROM(g, x): starting from x, perform “local search” as described earlier, but prune the search wherever it reaches a different grid element g′ ≠ g.

  • Each time another grid element g′ is reached at state x′:

    • If g′ was previously SEARCHED-FROM, do nothing.

    • If g′ was never previously reached, add g′ to P with priority R_T(x0…x′) + γ^|T| V(x′), where T is the trajectory from x0 to x′. Set g′’s “representative state” to x′. Record the trajectory from x to x′.

    • If g′ was previously reached but its previous priority is lower than R_T(x0…x′) + γ^|T| V(x′), update g′’s priority to R_T(x0…x′) + γ^|T| V(x′) and set its “representative state” to x′. Record the trajectory from x to x′.

Informed Global Search Examples

7*7 simplex-interpolated V

13*13 simplex-interpolated V

Hill-car Search Trees

Informed Global Search as A*

  • Informed Global Search is essentially an A* search using the value function V as a search heuristic

  • Using A* with an optimistic (admissible) heuristic function normally guarantees an optimal path to the goal.

  • Uninformed global search effectively uses the trivially optimistic heuristic V(x) = 0. Might we expect better solution quality with uninformed search than with a crude, non-optimistic approximate value function V?

  • Not necessarily! A crude, non-optimistic approximate value function can improve solution quality by helping the algorithm avoid pruning the wrong parts of the search tree


Hill-Car

  • Car on steep hill

  • State variables: position and velocity (2-d)

  • Actions: accelerate forward or backward

  • Goal: park near top

  • Random start states

  • Cost: total time to goal


Acrobot

  • Two-link planar robot acting in vertical plane under gravity

  • Underactuated joint at elbow; unactuated shoulder

  • Two angular positions & their velocities (4-d)

  • Goal: raise tip at least one link’s height above shoulder

  • Two actions: full torque clockwise / counterclockwise

  • Random starting positions

  • Cost: total time to goal





Move-Cart-Pole

  • Upright pole attached to cart by unactuated joint

  • State: horizontal position of cart, angle of pole, and associated velocities (4-d)

  • Actions: accelerate left or right

  • Goal configuration: cart moved, pole balanced

  • Start with random x; θ = 0

  • Per-step cost quadratic in distance from goal configuration

  • Big penalty if pole falls over



Planar Slider

  • Puck sliding on bumpy 2-d surface

  • Two spatial variables & their velocities (4-d)

  • Actions: accelerate NW, NE, SW, or SE

  • Goal in NW corner

  • Random start states

  • Cost: total time to goal

Local Search Experiments


  • CPU Time and Solution cost vs. search depth d

  • No limits imposed on number of action switches (s=d)

  • Value function: 13^4 simplex-interpolated grid

Local Search Experiments


  • CPU Time and Solution cost vs. search depth d

  • Max. number of action switches fixed at 2 (s = 2)

  • Value function: 7^2 simplex-interpolated grid

Comparative experiments: Hill-Car

  • Local search: d=6, s=2

  • Global searches:

    • Local search between grid elements: d=20, s=1

    • 50^2 search grid resolution

  • 7^2 simplex-interpolated value function

Hill-Car results cont’d

  • Uninformed Global Search prunes wrong trajectories

  • Increase search grid to 100^2 so this doesn’t happen:

    • Uninformed does near-optimal

    • Informed doesn’t: crude value function not optimistic

Figure: failed search trajectory


Comparative Results: Four-d domains

All value functions: 13^4 simplex interpolations

All local searches between global search elements: depth 20, with at most 1 action switch (d=20, s=1)

  • Acrobot:

    • Local Search: depth 4; no action switch restriction (d=4,s=4)

    • Global: 50^4 search grid

  • Move-Cart-Pole: same as Acrobot

  • Slider:

    • Local Search: depth 10; max. 1 action switch (d=10,s=1)

    • Global: 20^4 search grid


Acrobot

  • Local search significantly improves solution quality, but increases CPU time by an order of magnitude

  • Uninformed global search takes even more time; poor solution quality indicates suboptimal trajectory pruning

  • Informed global search finds much better solutions in relatively little time. Value function drastically reduces search, and better pruning leads to better solutions

#LS: number of local searches performed to find paths between elements of global search grid


Move-Cart-Pole

  • No search: pole often falls, incurring large penalties; overall poor solution quality

  • Local search improves things a bit

  • Uninformed search finds better solutions than informed

    • Few grid cells in which pruning is required

    • Value function not optimistic, so informed search solutions suboptimal

  • Informed search reduces costs by order of magnitude with no increase in required CPU time

Planar Slider

  • Local search almost useless, and incurs massive CPU expense

  • Uninformed search decreases solution cost by 50%, but at even greater CPU expense

  • Informed search decreases solution cost by factor of 4, at no increase in CPU time

Using Search with Learned Models

  • Toy Example: Hill-Car

    • 7^2 simplex-interpolated value function

    • One nearest-neighbor function approximator per possible action used to learn dx/dt

    • States sufficiently far away from nearest neighbor optimistically assumed to be absorbing to encourage exploration

  • Average costs over first few hundred trials:

    • No search: 212

    • Local search: 127

    • Informed global search: 155
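The learned model described above, one nearest-neighbor approximator per action with far-away states treated optimistically as absorbing, might look roughly like the following. This is a sketch under assumptions: the class and function names, the max_dist threshold, and the Euler integration step are all hypothetical and are not taken from the slides.

```python
import math


class NearestNeighborModel:
    """One instance per action: predicts dx/dt from the closest stored sample."""

    def __init__(self, max_dist):
        self.samples = []            # list of (state tuple, dx/dt tuple)
        self.max_dist = max_dist     # assumed threshold for "sufficiently far away"

    def add(self, x, dxdt):
        """Store one observed (state, derivative) pair."""
        self.samples.append((tuple(x), tuple(dxdt)))

    def predict(self, x):
        """Return (dxdt, known); known is False when no sample lies within max_dist."""
        if not self.samples:
            return None, False
        nearest_x, nearest_dxdt = min(self.samples, key=lambda s: math.dist(s[0], x))
        if math.dist(nearest_x, x) > self.max_dist:
            return None, False
        return nearest_dxdt, True


def model_step(models, x, a, dt):
    """Euler step using the model for action a; returns (next_state, absorbing)."""
    dxdt, known = models[a].predict(x)
    if not known:
        # Optimistic assumption: unexplored regions are absorbing, so a planner
        # sees no further cost there and is drawn toward them.
        return tuple(x), True
    return tuple(xi + dt * di for xi, di in zip(x, dxdt)), False


# Hypothetical usage with a single action (index 0) and one observed sample.
models = {0: NearestNeighborModel(max_dist=0.5)}
models[0].add((0.0, 0.0), (1.0, -2.0))
near = model_step(models, (0.0, 0.0), 0, dt=0.5)
far = model_step(models, (2.0, 0.0), 0, dt=0.5)
```

A linear scan over stored samples is used here for brevity; a spatial index would be the natural replacement once the sample set grows.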

Using Search with Learned Models

  • Problems do arise when using learned models:

    • Inaccuracies in models may cause global searches to fail. Not clear then if failure should be blamed on model inaccuracies or on insufficiently fine state space partitioning

    • Trajectories found will be inaccurate

      • Need adaptive closed-loop controller

      • Fortunately, we will get new data with which to increase the accuracy of our model

    • Model approximators must be fast and accurate

Avenues for Future Research

  • Extensions to nondeterministic systems?

  • Higher-dimensional problems

  • Better function approximators for model learning

  • Variable-resolution search grids

  • Optimistic value function generation?
