Applying Online Search Techniques to Reinforcement Learning


Scott Davies, Andrew Ng, and Andrew Moore

Carnegie Mellon University

- Learning useful value functions for continuous-state optimal control problems can be difficult
- Small inaccuracies/inconsistencies in approximated value functions can cause simple controllers to fail miserably
- Accurate value functions can be very expensive to compute even in relatively low-dimensional spaces with perfectly accurate state transition models

- Instead of modeling the value function accurately everywhere, we can perform online searches for good trajectories from the agent’s current position to compensate for value function inaccuracies
- We examine two different types of search:
- “Local” searches in which the agent performs a finite-depth look-ahead search
- “Global” searches in which the agent searches for trajectories all the way to goal states

Given a value function V(x) over the state space, an agent typically uses a model to predict where each possible one-step trajectory T takes it, then chooses the trajectory that maximizes

$R_T + \gamma V(x_T)$

where $R_T$ is the reward accumulated along T, $\gamma$ is the discount factor, and $x_T$ is the state at the end of T.

This takes O(|A|) time, where A is the set of possible actions.

Given a perfect V(x), this would lead to optimal behavior.
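The one-step rule above can be sketched in a few lines of Python. This is only an illustration: the `model(x, a)` signature returning `(reward, next_state)`, the toy one-dimensional dynamics, and the toy `V` are all assumptions made for the example and do not come from the presentation.

```python
def greedy_action(x, actions, model, V, gamma):
    """Pick the action whose one-step trajectory maximizes R_T + gamma * V(x_T).

    `model(x, a)` is a hypothetical interface returning (reward, next_state).
    The loop visits each action once, hence O(|A|) time.
    """
    best_action, best_score = None, float("-inf")
    for a in actions:
        reward, x_next = model(x, a)
        score = reward + gamma * V(x_next)
        if score > best_score:
            best_action, best_score = a, score
    return best_action


# Toy problem (assumed for illustration): move along a line toward x = 3.
def toy_model(x, a):
    return -1.0, x + a          # each step costs 1


def toy_V(x):
    return -abs(x - 3.0)        # crude guess at the negative cost-to-go


chosen = greedy_action(0.0, (-1.0, 1.0), toy_model, toy_V, gamma=0.9)
```

With these toy definitions the agent steps toward the goal; with a poor `V` the same rule can pick the wrong action, which is the failure mode the searches below try to compensate for.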

- An obvious possible extension: consider all possible d-step trajectories T, selecting the one that maximizes $R_T + \gamma^d V(x_T)$.
- Computational expense: $O(|A|^d)$.

- To make deeper searches more computationally tractable, we can limit the agent to considering only trajectories in which the action is switched at most s times.
- Computational expense: considerably cheaper than a full d-step search if s << d.
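To get a feel for the savings, one can count the action sequences directly. The sketch below is a counting argument added for illustration, not an expression taken from the slides: a sequence with exactly k switches is fixed by choosing k of the d-1 gaps between steps, a first action, and one of the |A|-1 other actions at each switch. The function names are hypothetical.

```python
from itertools import product
from math import comb


def count_limited_switch(num_actions, d, s):
    """Closed-form count of d-step action sequences with at most s switches."""
    return sum(
        comb(d - 1, k) * num_actions * (num_actions - 1) ** k
        for k in range(min(s, d - 1) + 1)
    )


def count_by_enumeration(num_actions, d, s):
    """Brute-force check of the closed form (only feasible for small d)."""
    total = 0
    for seq in product(range(num_actions), repeat=d):
        switches = sum(a != b for a, b in zip(seq, seq[1:]))
        total += switches <= s
    return total


full_search = 2 ** 20                           # all 20-step sequences, 2 actions
limited_search = count_limited_switch(2, 20, 1) # at most one switch
```

For two actions, depth 20, and at most one switch, the restricted search considers 40 sequences instead of 1,048,576.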

- Two-dimensional state space (position and velocity)
- Car must back up to take a “running start” to make it up the hill

[Figure: search over 20-step trajectories with at most one switch in actions; axes: position and velocity]

Repeat:

- From the current state, consider all possible d-step trajectories T in which the action is changed at most s times
- Perform the first action in the trajectory that maximizes $R_T + \gamma^d V(x_T)$.
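One decision step of this procedure might look like the following sketch. It assumes a deterministic `model(x, a)` returning `(reward, next_state)` and assumes that the accumulated reward R_T is discounted step by step; both are choices made for the example, as are all function names and the toy problem.

```python
def local_search(x, actions, model, V, gamma, d, s):
    """Return (first_action, score) of the best d-step trajectory from x
    that changes action at most s times, scored by R_T + gamma**d * V(x_T)."""
    best = {"score": float("-inf"), "first": None}

    def expand(state, depth, ret, last, switches, first):
        if depth == d:
            score = ret + gamma ** d * V(state)
            if score > best["score"]:
                best["score"], best["first"] = score, first
            return
        for a in actions:
            used = switches + (last is not None and a != last)
            if used > s:
                continue                      # would exceed the switch budget
            r, nxt = model(state, a)
            expand(nxt, depth + 1, ret + gamma ** depth * r, a, used,
                   a if first is None else first)

    expand(x, 0, 0.0, None, 0, None)
    return best["first"], best["score"]


# Toy problem (assumed): unit-cost steps on a line, crude V pointing at x = 3.
def toy_model(x, a):
    return -1.0, x + a


def toy_V(x):
    return -abs(x - 3.0)


first_action, score = local_search(0.0, (-1.0, 1.0), toy_model, toy_V,
                                   gamma=0.9, d=2, s=0)
```

The agent would execute `first_action`, observe the new state, and repeat the search from there. Setting d=1 reduces this to the one-step greedy rule.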
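Let B denote the “parallel backup operator” such that

$(BV)(x) = \max_{a \in A} \left[ R(x, a) + \gamma V(x') \right]$

where $R(x, a)$ is the reward for taking action a in state x and $x'$ is the state the model predicts will follow.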

If s = d-1, Local Search is formally equivalent to behaving greedily with respect to the new value function $B^{d-1}V$. Since V is typically arrived at through iterations of a much cruder backup operator, this value function is often much more accurate than V.
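The equivalence can be checked numerically on a small deterministic problem. The sketch below assumes the standard one-step form of the backup, (BV)(x) = max over a of [r(x, a) + γ V(x')], and a toy integer-state model; neither is taken from the slides.

```python
from itertools import product


def backup(V, actions, model, gamma):
    """Return the function BV, assuming (BV)(x) = max_a [r + gamma * V(x')]."""
    def BV(x):
        return max(r + gamma * V(nx)
                   for r, nx in (model(x, a) for a in actions))
    return BV


def best_d_step_score(x, first, actions, model, V, gamma, d):
    """Best R_T + gamma**d * V(x_T) over all d-step trajectories starting with `first`."""
    best = float("-inf")
    for rest in product(actions, repeat=d - 1):
        state, total, disc = x, 0.0, 1.0
        for a in (first, *rest):
            r, state = model(state, a)
            total += disc * r
            disc *= gamma
        best = max(best, total + disc * V(state))
    return best


# Toy deterministic model (assumed): integer line clipped to [-5, 5], goal at 3.
def model(x, a):
    nx = max(-5, min(5, x + a))
    return (0.0 if nx == 3 else -1.0), nx


def V(x):
    return -float(abs(x))       # deliberately crude value function


actions, gamma, d = (-1, 1), 0.9, 4
Bk = V
for _ in range(d - 1):          # build B^(d-1) V
    Bk = backup(Bk, actions, model, gamma)

comparison = {}
for a in actions:
    r, nx = model(0, a)
    comparison[a] = (r + gamma * Bk(nx),
                     best_d_step_score(0, a, actions, model, V, gamma, d))
```

For each first action, the greedy score under B^(d-1)V and the best unrestricted d-step search score agree up to floating-point rounding, so both rules pick the same action.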

- Suppose we have a minimum-cost-to-goal problem in a continuous state space with nonnegative costs. Why not forget about explicitly calculating V and just extend the search from the current position all the way to the goal?
- Problem: combinatorial explosion.
- Possible solution:
- Break state space into partitions, e.g. a uniform grid. (Can be represented sparsely.)
- Use previously discussed local search procedure to find trajectories between partitions
- Prune all but least-cost trajectory entering any given partition
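A sparse uniform grid with least-cost pruning can be sketched as a dictionary keyed by cell index. The class and function names, the grid origin, and the cell widths below are assumptions made for the example.

```python
import math


def cell_of(x, lower, cell_width):
    """Map a continuous state to the integer index tuple of its grid cell."""
    return tuple(math.floor((xi - lo) / w)
                 for xi, lo, w in zip(x, lower, cell_width))


class SparseGrid:
    """Stores only cells that have been reached, keeping the cheapest entry."""

    def __init__(self, lower, cell_width):
        self.lower = lower
        self.cell_width = cell_width
        self.best = {}                       # cell -> (cost, state)

    def offer(self, state, cost):
        """Record a trajectory ending at `state`; return False if it is pruned."""
        cell = cell_of(state, self.lower, self.cell_width)
        old = self.best.get(cell)
        if old is None or cost < old[0]:
            self.best[cell] = (cost, state)
            return True
        return False


grid = SparseGrid(lower=(0.0, 0.0), cell_width=(0.5, 0.5))
kept_first = grid.offer((0.2, 0.7), cost=5.0)    # first entry into cell (0, 1)
kept_second = grid.offer((0.4, 0.9), cost=7.0)   # same cell, more expensive
kept_third = grid.offer((1.2, 0.1), cost=9.0)    # different cell (2, 0)
```

The second trajectory is discarded because a cheaper one already entered the same cell. That is exactly the pruning step that can go wrong when the cheaper entry point turns out to be a worse place to continue from.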

- Problems:
- Still computationally expensive
- Even with fine partitioning of state space, pruning the wrong trajectories can cause search to fail

- Use approximate value function V to guide the selection of which points to search from next
- Reasonably accurate V will cause search to stay along optimal path to goal: dramatic reduction in search time
- V can help choose effective points within each partition from which to search, thereby improving solution quality
- Uninformed Global Search is the same as “Informed” Global Search with V(x) = 0

- Let x0 be current state, and g(x0) be the grid element containing x0
- Set g(x0)’s “representative state” to x0, and add g(x0) to priority queue P with priority V(x0)
- Until goal state found or P empty:
- Remove grid element g from top of P. Let x denote g’s “representative state.”
- SEARCH-FROM(g, x)

- If goal found, execute trajectory; otherwise signal failure

SEARCH-FROM(g, x):

- Starting from x, perform “local search” as described earlier, but prune the search wherever it reaches a different grid element g' ≠ g.
- Each time another grid element g' is reached at state x':
- If g' was previously SEARCHED-FROM, do nothing.
- If g' was never previously reached, add g' to P with priority $R_T(x_0 \ldots x') + \gamma^{|T|} V(x')$, where T is the trajectory from $x_0$ to x'. Set the “representative state” of g' to x'. Record the trajectory from x to x'.
- If g' was previously reached but its previous priority is lower than $R_T(x_0 \ldots x') + \gamma^{|T|} V(x')$, update the priority of g' to $R_T(x_0 \ldots x') + \gamma^{|T|} V(x')$ and set its “representative state” to x'. Record the trajectory from x to x'.
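The main loop and SEARCH-FROM can be combined into one compact sketch. It is a simplification under stated assumptions, not the authors' implementation: the model is deterministic and returns `(reward, next_state)`, costs are expressed as negative rewards, the search stops at the first goal state it touches, each cell stores its full action plan rather than a trajectory segment, and every name below is hypothetical.

```python
import heapq
from itertools import count
from math import floor


def informed_global_search(x0, actions, model, V, is_goal, cell_of, d, s, gamma=1.0):
    """Return a list of actions from x0 to a goal state, or None on failure."""
    tie = count()                        # tie-breaker so the heap never compares cells
    g0 = cell_of(x0)
    # Per reached cell: priority, representative state, return, step count, plan.
    info = {g0: dict(prio=V(x0), x=x0, ret=0.0, steps=0, plan=[])}
    searched = set()
    heap = [(-V(x0), next(tie), g0)]     # heapq is a min-heap, so negate priorities

    while heap:
        neg_prio, _, g = heapq.heappop(heap)
        if g in searched or -neg_prio != info[g]["prio"]:
            continue                     # stale entry left by a priority update
        searched.add(g)
        rep = info[g]
        # SEARCH-FROM(g, x): local search, pruned wherever it leaves g.
        stack = [(rep["x"], rep["ret"], rep["steps"], rep["plan"], None, 0, 0)]
        while stack:
            x, ret, steps, plan, last, switches, depth = stack.pop()
            if depth == d:
                continue
            for a in actions:
                used = switches + (last is not None and a != last)
                if used > s:
                    continue
                r, nx = model(x, a)
                nret, nsteps, nplan = ret + gamma ** steps * r, steps + 1, plan + [a]
                if is_goal(nx):
                    return nplan
                g2 = cell_of(nx)
                if g2 == g:
                    stack.append((nx, nret, nsteps, nplan, a, used, depth + 1))
                elif g2 not in searched:
                    prio = nret + gamma ** nsteps * V(nx)
                    if g2 not in info or info[g2]["prio"] < prio:
                        info[g2] = dict(prio=prio, x=nx, ret=nret,
                                        steps=nsteps, plan=nplan)
                        heapq.heappush(heap, (-prio, next(tie), g2))
    return None


# Toy problem (assumed): half-unit steps on a line, unit cost, goal at x >= 3.
plan = informed_global_search(
    x0=0.0, actions=(-1, 1),
    model=lambda x, a: (-1.0, x + 0.5 * a),
    V=lambda x: -(3.0 - x) / 0.5,
    is_goal=lambda x: x >= 3.0,
    cell_of=lambda x: floor(x),
    d=4, s=1,
)
```

Passing `V=lambda x: 0.0` turns the same routine into the uninformed variant, at the price of expanding cells in both directions.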

Hill-car Search Trees

[Figure: search trees with a 7×7 and a 13×13 simplex-interpolated V]

- Informed Global Search is essentially an A* search using the value function V as a search heuristic
- Using A* with an optimistic heuristic function normally guarantees an optimal path to the goal.
- Uninformed global search effectively uses the trivially optimistic heuristic V(x) = 0. Might we expect better solution quality with uninformed search than with a crude, non-optimistic approximate value function V?
- Not necessarily! A crude, non-optimistic approximate value function can improve solution quality by helping the algorithm avoid pruning the wrong parts of the search tree
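The standard A* guarantee mentioned above can be seen on a four-node graph. This sketch works in costs rather than rewards, so “optimistic” means the heuristic never overestimates the remaining cost. The graph, the heuristic values, and the function name are made up for illustration, and the sketch does not model the grid pruning effect described in the last bullet.

```python
import heapq


def a_star_cost(graph, h, start, goal):
    """Return the path cost of the first goal entry popped from the queue.

    graph: node -> list of (neighbor, edge_cost); h: node -> heuristic estimate.
    """
    heap = [(h[start], 0.0, start)]
    closed = set()
    while heap:
        _, g, node = heapq.heappop(heap)
        if node == goal:
            return g
        if node in closed:
            continue
        closed.add(node)
        for nb, cost in graph[node]:
            if nb not in closed:
                heapq.heappush(heap, (g + cost + h[nb], g + cost, nb))
    return None


graph = {
    "S": [("A", 1.0), ("B", 2.0)],
    "A": [("G", 3.0)],          # S-A-G costs 4
    "B": [("G", 1.0)],          # S-B-G costs 3 (optimal)
    "G": [],
}
optimistic = {"S": 0.0, "A": 0.0, "B": 0.0, "G": 0.0}
pessimistic = {"S": 0.0, "A": 0.0, "B": 5.0, "G": 0.0}   # overestimates B's cost-to-go

cost_optimistic = a_star_cost(graph, optimistic, "S", "G")
cost_pessimistic = a_star_cost(graph, pessimistic, "S", "G")
```

With the zero heuristic the search returns the optimal cost of 3; with the heuristic that overestimates at B it returns the path through A at cost 4.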

- Car on steep hill
- State variables: position and velocity (2-d)
- Actions: accelerate forward or backward
- Goal: park near top
- Random start states
- Cost: total time to goal

- Two-link planar robot acting in vertical plane under gravity
- Underactuated: actuated joint at elbow; unactuated shoulder
- Two angular positions & their velocities (4-d)
- Goal: raise tip at least one link’s height above shoulder
- Two actions: full torque clockwise / counterclockwise
- Random starting positions
- Cost: total time to goal

[Figure: two-link robot; labels: goal height and the two joints, 1 and 2]

- Upright pole attached to cart by unactuated joint
- State: horizontal position of cart, angle of pole, and associated velocities (4-d)
- Actions: accelerate left or right
- Goal configuration: cart moved, pole balanced
- Start with random x; θ = 0
- Per-step cost quadratic in distance from goal configuration
- Big penalty if pole falls over

[Figure: cart and pole; labels: goal configuration, x]

- Puck sliding on bumpy 2-d surface
- Two spatial variables & their velocities (4-d)
- Actions: accelerate NW, NE, SW, or SE
- Goal in NW corner
- Random start states
- Cost: total time to goal

Move-Cart-Pole

- CPU Time and Solution cost vs. search depth d
- No limits imposed on number of action switches (s=d)
- Value function: $13^4$ simplex-interpolation grid

Hill-car

- CPU Time and Solution cost vs. search depth d
- Max. number of action switches fixed at 2 (s = 2)
- Value function: $7^2$ simplex-interpolation grid

- Local search: d=6, s=2
- Global searches:
- Local search between grid elements: d=20, s=1
- $50^2$ search grid resolution

- $7^2$ simplex-interpolated value function

- Uninformed Global Search prunes wrong trajectories
- Increase search grid to $100^2$ so this doesn’t happen:
- Uninformed does near-optimal
- Informed doesn’t: crude value function not optimistic

[Figure: failed search trajectory]

All value functions: $13^4$ simplex interpolations

All local searches between global search elements: depth 20, with at most 1 action switch (d=20, s=1)

- Acrobot:
- Local Search: depth 4; no action switch restriction (d=4,s=4)
- Global: $50^4$ search grid

- Move-Cart-Pole: same as Acrobot
- Slider:
- Local Search: depth 10; max. 1 action switch (d=10,s=1)
- Global: $20^4$ search grid

- Local search significantly improves solution quality, but increases CPU time by order of magnitude
- Uninformed global search takes even more time; poor solution quality indicates suboptimal trajectory pruning
- Informed global search finds much better solutions in relatively little time. Value function drastically reduces search, and better pruning leads to better solutions

#LS: number of local searches performed to find paths between elements of the global search grid

Move-Cart-Pole

- No search: pole often falls, incurring large penalties; overall poor solution quality
- Local search improves things a bit
- Uninformed search finds better solutions than informed
- Few grid cells in which pruning is required
- Value function not optimistic, so informed search solutions suboptimal

- Informed search reduces costs by order of magnitude with no increase in required CPU time

Planar Slider

- Local search almost useless, and incurs massive CPU expense
- Uninformed search decreases solution cost by 50%, but at even greater CPU expense
- Informed search decreases solution cost by factor of 4, at no increase in CPU time

- Toy Example: Hill-Car
- $7^2$ simplex-interpolated value function
- One nearest-neighbor function approximator per possible action used to learn dx/dt
- States sufficiently far away from nearest neighbor optimistically assumed to be absorbing to encourage exploration
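A minimal sketch of such a learned model follows. The class name, the distance threshold, the Euler integration step, and the reading of “optimistically absorbing” as “stop here with no further cost” are all assumptions for illustration; the slides do not specify them.

```python
import math


class NearestNeighborModel:
    """One instance per action: stores observed (state, dx/dt) pairs."""

    def __init__(self, max_distance):
        self.samples = []
        self.max_distance = max_distance     # hypothetical "too far away" threshold

    def add(self, x, dxdt):
        self.samples.append((tuple(x), tuple(dxdt)))

    def predict(self, x):
        """Return the nearest neighbor's dx/dt, or None if none is close enough."""
        if not self.samples:
            return None
        nearest_x, nearest_dxdt = min(self.samples,
                                      key=lambda smp: math.dist(smp[0], x))
        if math.dist(nearest_x, x) > self.max_distance:
            return None
        return nearest_dxdt


def step(models, x, a, dt):
    """Return (next_state, absorbing) under the learned model for action a."""
    dxdt = models[a].predict(x)
    if dxdt is None:
        # Unexplored region: treated as absorbing, which makes it look
        # attractive to the planner and so encourages exploration.
        return tuple(x), True
    return tuple(xi + dt * di for xi, di in zip(x, dxdt)), False


models = {0: NearestNeighborModel(max_distance=0.5)}
models[0].add((0.0, 0.0), (1.0, 2.0))
known = step(models, (0.1, 0.0), 0, dt=0.1)
unknown = step(models, (5.0, 5.0), 0, dt=0.1)
```

A linear scan over all samples is the simplest possible lookup; since the model is queried at every node of every search, a spatial index would likely be needed in practice.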

- Average costs over first few hundred trials:
- No search: 212
- Local search: 127
- Informed global search: 155

- Problems do arise when using learned models:
- Inaccuracies in models may cause global searches to fail. Not clear then if failure should be blamed on model inaccuracies or on insufficiently fine state space partitioning
- Trajectories found will be inaccurate
- Need adaptive closed-loop controller
- Fortunately, we will get new data with which to increase the accuracy of our model

- Model approximators must be fast and accurate

- Extensions to nondeterministic systems?
- Higher-dimensional problems
- Better function approximators for model learning
- Variable-resolution search grids
- Optimistic value function generation?