Applying Online Search Techniques to Reinforcement Learning
Scott Davies, Andrew Ng, and Andrew Moore
Carnegie Mellon University

The Agony of Continuous State Spaces

Learning useful value functions for continuous-state optimal control problems can be difficult.
Given a value function V(x) over the state space, an agent typically uses a model to predict where each possible one-step trajectory T takes it, then chooses the trajectory that maximizes R_T + γ V(x_T), where R_T is the reward accumulated along T, γ is the discount factor, and x_T is the state at the end of T.
This takes O(|A|) time, where A is the set of possible actions.
Given a perfect V(x), this would lead to optimal behavior.
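To make this concrete, here is a minimal Python sketch of the one-step greedy choice, using the standard hill-car (mountain-car) dynamics from Sutton and Barto as a stand-in model; the constants and the names step, greedy_action, ACTIONS, and GAMMA are illustrative assumptions, not taken from the poster.

```python
import math

GAMMA = 0.99                  # assumed discount factor
ACTIONS = (-1.0, 0.0, 1.0)    # thrust left, coast, thrust right

def step(x, v, a):
    """One simulated step of hill-car (standard mountain-car dynamics):
    returns (next_position, next_velocity, reward)."""
    v2 = v + 0.001 * a - 0.0025 * math.cos(3.0 * x)
    v2 = max(-0.07, min(0.07, v2))
    x2 = max(-1.2, min(0.6, x + v2))
    reward = 0.0 if x2 >= 0.6 else -1.0   # -1 per step until the goal
    return x2, v2, reward

def greedy_action(x, v, V):
    """O(|A|): score each one-step trajectory by R_T + GAMMA * V(x_T)
    and return the action that maximizes it."""
    best_a, best_score = None, float("-inf")
    for a in ACTIONS:
        x2, v2, r = step(x, v, a)
        score = r + GAMMA * V(x2, v2)
        if score > best_score:
            best_a, best_score = a, score
    return best_a
```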
Local Search: consider all d-step trajectories with at most s switches between actions (considerably cheaper than full d-step search if s << d). Example: search over 20-step trajectories with at most one switch in actions, as in the sketch below.
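A sketch of this restricted search, reusing step, ACTIONS, and GAMMA from the sketch above. Enumerating (first action, second action, switch point) triples is one straightforward way to realize "at most one switch"; it is an assumption about the implementation, not the poster's exact procedure.

```python
def local_search(x, v, V, d=20):
    """Search all d-step trajectories with at most one action switch
    (s = 1): action a1 for the first k steps, then a2 for the rest.
    That is O(|A|^2 * d) trajectories instead of |A|^d.  (Trajectories
    with a1 == a2 are enumerated redundantly; harmless in a sketch, and
    prefix caching would remove the repeated simulation work.)"""
    best_a, best_score = None, float("-inf")
    for a1 in ACTIONS:
        for a2 in ACTIONS:
            for k in range(d + 1):            # switch point
                xs, vs, ret, disc = x, v, 0.0, 1.0
                for t in range(d):
                    a = a1 if t < k else a2
                    xs, vs, r = step(xs, vs, a)
                    ret += disc * r
                    disc *= GAMMA
                score = ret + disc * V(xs, vs)   # R_T + GAMMA^d * V(x_T)
                if score > best_score:
                    best_score = score
                    best_a = a1 if k > 0 else a2
    return best_a   # execute only the first action, then re-search
```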
Let B denote the “parallel backup operator” such that (BV)(x) = max_a [R(x, a) + γ V(x_a)] for every state x simultaneously, where x_a is the state reached from x under action a.
If s = d − 1, Local Search is formally equivalent to behaving greedily with respect to the new value function B^{d-1}V. Since V is typically arrived at through iterations of a much cruder backup operator, B^{d-1}V is often much more accurate than V.
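As a toy illustration of B, the sketch below applies the backup to a finite tabular V. This deliberately simplifies away the simplex-interpolated representation used in the poster, and the dictionary encoding model[x][a] = (next_state, reward) is an assumption.

```python
def parallel_backup(V, model, gamma):
    """One application of B: (BV)(x) = max_a [R(x, a) + gamma * V(x_a)],
    computed for every tabulated state x at once."""
    return {x: max(r + gamma * V[x2] for (x2, r) in outcomes.values())
            for x, outcomes in model.items()}

def backed_up_value(V, model, gamma, d):
    """B^{d-1}V: acting greedily on this matches d-step Local Search
    with s = d - 1 (all action sequences considered)."""
    for _ in range(d - 1):
        V = parallel_backup(V, model, gamma)
    return V

# Example: two-state chain; state 1 is absorbing with reward 0.
model = {0: {"stay": (0, -1.0), "go": (1, -1.0)},
         1: {"stay": (1, 0.0), "go": (1, 0.0)}}
V0 = {0: 0.0, 1: 0.0}
print(backed_up_value(V0, model, 0.9, d=3))   # {0: -1.0, 1: 0.0}
```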
[Figures: 7×7 and 13×13 simplex-interpolated V]
Hill-car Search Trees
[Figure: failed search trajectory]
All value functions: 13^4 simplex interpolations.
All local searches between global search elements: depth 20, with at most one action switch (d = 20, s = 1).
#LS: number of local searches performed to find paths between elements of the global search grid.