Applying Online Search Techniques to Reinforcement Learning

Scott Davies, Andrew Ng, and Andrew Moore
Carnegie Mellon University

The Agony of Continuous State Spaces

Learning useful value functions for continuous-state optimal control problems can be difficult.
Typical One-Step “Search”

Given a value function V(x) over the state space, an agent typically uses a model to predict where each possible one-step trajectory T takes it, then chooses the trajectory that maximizes

    R_T + γ V(x_T)

where R_T is the reward accumulated along T, γ is the discount factor, and x_T is the state at the end of T.
This takes O(|A|) time, where A is the set of possible actions.
Given a perfect V(x), this would lead to optimal behavior.
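This one-step greedy rule can be sketched as follows. The model interface `step(x, a) -> (reward, next_state)`, the toy 1-D dynamics, and γ = 0.95 are all illustrative assumptions, not the poster's actual experimental setup.

```python
def one_step_greedy(x, actions, step, V, gamma=0.95):
    """Return the action maximizing r + gamma * V(x') -- O(|A|) model calls."""
    best_a, best_val = None, float("-inf")
    for a in actions:
        r, x_next = step(x, a)          # model predicts one step ahead
        val = r + gamma * V(x_next)     # one-step lookahead score
        if val > best_val:
            best_a, best_val = a, val
    return best_a

# Hypothetical toy task: drive a 1-D state toward the origin.
step = lambda x, a: (-abs(x + a), x + a)   # reward = negative distance from 0
V = lambda x: -abs(x)                      # crude value-function estimate
print(one_step_greedy(2.0, [-1, 0, 1], step, V))  # -> -1
```

With a perfect V this reduces exactly to greedy optimal behavior; with an approximate V, the one-step lookahead inherits all of V's errors.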
Local Search

Instead, search over all d-step trajectories containing at most s switches in actions (considerably cheaper than full d-step search if s << d).
Local Search: Example
Search over 20-step trajectories with at most one switch in actions.
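A minimal sketch of this s = 1 local search: enumerate each choice of first action, switch time, and second action, roll a model forward d steps, and score the trajectory by R_T + γ^d V(x_T). The `step`/`V` interface, the toy dynamics, and γ = 0.95 are illustrative assumptions.

```python
from itertools import product

def local_search_s1(x0, actions, step, V, d=20, gamma=0.95):
    """Best (first_action, switch_time, second_action) trajectory.

    Cost is O(|A|^2 * d) rollouts instead of O(|A|^d) for exhaustive search.
    """
    best_plan, best_val = None, float("-inf")
    for a1, t, a2 in product(actions, range(d + 1), actions):
        x, ret, disc = x0, 0.0, 1.0
        for i in range(d):
            a = a1 if i < t else a2     # switch from a1 to a2 at step t
            r, x = step(x, a)
            ret += disc * r
            disc *= gamma
        val = ret + disc * V(x)         # R_T + gamma^d V(x_T)
        if val > best_val:
            best_plan, best_val = (a1, t, a2), val
    return best_plan, best_val

# Hypothetical toy task: drive a 1-D state toward the origin.
step = lambda x, a: (-abs(x + a), x + a)
V = lambda x: -abs(x)
plan, val = local_search_s1(3.0, [-1, 0, 1], step, V, d=4)
print(plan)   # -> (-1, 3, 0): go left until the origin, then stop
```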
Let B denote the “parallel backup operator” such that

    (BV)(x) = max_a [ r(x, a) + γ V(x'(x, a)) ]

where x'(x, a) is the successor state. If s = (d-1), Local Search is formally equivalent to behaving greedily with respect to the new value function B^(d-1)V.
Since V is typically arrived at through iterations of a much cruder backup
operator, this value function is often much more accurate than V.
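One way to see B concretely is on a small tabular problem. The chain MDP below is entirely illustrative (an absorbing goal at state 0, reward -1 per step elsewhere, γ = 0.95); repeated parallel backups of a zero-initialized V converge toward the optimal values.

```python
import numpy as np

def backup(V, step, actions, states, gamma=0.95):
    """One parallel backup: (BV)(x) = max_a [ r(x,a) + gamma * V(x') ]."""
    return np.array([
        max(step(x, a)[0] + gamma * V[step(x, a)[1]] for a in actions)
        for x in states
    ])

# Hypothetical 4-state chain: goal at state 0, move left/right elsewhere.
n = 4
def step(x, a):
    if x == 0:                      # absorbing goal state, zero reward
        return 0.0, 0
    nxt = min(max(x + a, 0), n - 1)
    return -1.0, nxt                # -1 per step until the goal

V = np.zeros(n)
for _ in range(n):                  # apply B repeatedly: V <- B V
    V = backup(V, step, [-1, 1], range(n))
print(V)                            # -> [ 0. -1. -1.95 -2.8525 ]
```

Behaving greedily with respect to the backed-up function then picks the action whose one-step lookahead matches these improved values.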
[Figure panels: 7×7 simplex-interpolated V; 13×13 simplex-interpolated V]
Hill-car Search Trees
[Figure: failed search trajectory]
All value functions: 134 simplex interpolations
All local searches between global search elements: depth 20, with at most 1 action switch (d=20, s=1).

#LS: number of local searches performed to find paths between elements of the global search grid.