Sample-based Planning for Continuous Action Markov Decision Processes [on robots]

Ari Weinstein

Reinforcement Learning (RL) • Agent takes an action in the world, gets information including numerical reward; how does it learn to maximize that reward? • Fundamental concept is exploration vs. exploitation. Must take actions in the world in order to learn about it, but eventually use what was learned to get high reward • Bandits (stateless), Markov Decision Processes (state)

The Goal • I want to be here: • Most RL algorithms are here [Knox Stone 09]: • Some RL is done with robots, but it's rare, partly because it's hard:

Overview • RL Basics (bandits/Markov decision processes) • Planning • Bandits • MDPs (novel) • Model Building • Exploring • Acting (novel) Composing pieces in this manner is novel

k-armed Bandits • Agent selects from k arms, each with a distribution over rewards • Call the arm pulled at step $t$ $a_t$, and the reward at step $t$ $r_t \sim R(a_t)$ • The regret is the difference in reward between the arm pulled and the optimal arm; want cumulative regret to increase sub-linearly in $t$
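As a small illustration of the regret definition above, the following sketch accumulates regret over a fixed sequence of pulls. The arm means and the helper name `cumulative_regret` are assumptions made for the example, and expected rewards are used in place of sampled ones so the result is deterministic.

```python
def cumulative_regret(arm_means, pulls):
    """Return the running cumulative (expected) regret after each pull."""
    best = max(arm_means)              # expected reward of the optimal arm
    total, history = 0.0, []
    for arm in pulls:
        total += best - arm_means[arm]  # per-step regret of this pull
        history.append(total)
    return history

# Hypothetical 3-armed bandit; the agent settles on the best arm (index 2).
means = [0.2, 0.5, 0.9]
regret = cumulative_regret(means, [0, 1, 2, 2, 2])
```

Once the agent keeps pulling the optimal arm, the cumulative regret stops growing, which is the sub-linear behavior the slide asks for.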

Hierarchical Optimistic Optimization (HOO) [Bubeck et al. 08] • Partition action space by a tree • Keep track of rewards for each subtree • Blue is the bandit, red is the decomposition of HOO tree • Thickness represents estimated reward • Tree grows deeper and builds estimates at high resolution where reward is highest

HOO continued • Exploration bonuses for number of samples and size of each subregion • Regions with large volume and few samples are unknown; small regions with many samples are well known • Pull an arm in the region whose optimistic estimate (mean reward plus both bonuses) is maximal • Has optimal regret: sqrt(t), independent of action dimension
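The two HOO slides can be sketched as follows for a one-dimensional action interval [0, 1]. This is a simplified illustration, not the reference algorithm: the names `Node` and `hoo_sketch`, the parameters `nu` and `rho` (the size bonus is taken to be `nu * rho**depth`), and the test reward function are all assumptions, and bounds are refreshed only along the path just visited rather than over the whole tree.

```python
import math
import random

class Node:
    """One cell of the binary partition of the action interval."""
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.count = 0          # pulls that landed in this cell
        self.mean = 0.0         # running mean reward in this cell
        self.b = float("inf")   # optimistic bound; unvisited cells look maximally good
        self.children = None

def hoo_sketch(reward_fn, n_pulls, nu=1.0, rho=0.5, seed=0):
    rng = random.Random(seed)
    root = Node(0.0, 1.0, 0)
    for t in range(1, n_pulls + 1):
        # Descend to a leaf, always following the child with the larger bound.
        node, path = root, [root]
        while node.children is not None:
            node = max(node.children, key=lambda c: c.b)
            path.append(node)
        # Pull an arm inside the chosen cell, then split the cell in two.
        x = rng.uniform(node.lo, node.hi)
        r = reward_fn(x)
        mid = (node.lo + node.hi) / 2.0
        node.children = (Node(node.lo, mid, node.depth + 1),
                         Node(mid, node.hi, node.depth + 1))
        # Update reward statistics for every cell on the path.
        for n in path:
            n.count += 1
            n.mean += (r - n.mean) / n.count
        # Bound = mean + sample-count bonus + cell-size bonus, capped by children.
        for n in reversed(path):
            u = n.mean + math.sqrt(2.0 * math.log(t) / n.count) + nu * rho ** n.depth
            n.b = min(u, max(c.b for c in n.children))
    # Recommend the center of the most-visited chain of cells.
    node = root
    while node.children is not None and max(c.count for c in node.children) > 0:
        node = max(node.children, key=lambda c: c.count)
    return (node.lo + node.hi) / 2.0

# Hypothetical reward with its peak at 0.7.
best = hoo_sketch(lambda x: 1.0 - abs(x - 0.7), n_pulls=2000)
```

The tree ends up deepest around the peak, matching the slide's point that resolution is spent where reward is highest.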

Markov Decision Processes • Composed of: • States S (s, s' from S) • Actions A (a from A) • Transition distribution T(s'|s,a) • Reward function R(s,a) • Discount factor 0<γ<1 • Goal is to find a policy π, a mapping from states to actions, maximizing expected long-term discounted reward $\mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t\right]$, where $r_t$ is the reward at time $t$. • Maximize long-term reward but favor immediate reward more heavily; later rewards are decayed by γ. How much long-term reward is possible is measured by the value function
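The discounted objective can be computed for a finite reward sequence as in the sketch below. The function name `discounted_return` is an assumption, and the first reward is taken to be undiscounted, matching the lookahead objective used later in the deck.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th reward (k = 0, 1, ...) is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step is one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
value = discounted_return([1.0, 1.0, 1.0], 0.5)
```

With γ close to 1 the agent is far-sighted; with γ close to 0 only the immediate reward matters.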

Value Function • Value of a state s under policy π: • Q-value of an action a under the same definition: • Optimally,
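For reference, the standard forms of these three quantities can be written with the symbols from the MDP slide. The indexing from t = 1 and the sum over next states (an integral in continuous state spaces) are assumptions of this sketch.

```latex
\begin{align}
V^{\pi}(s) &= \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_1 = s,\ \pi\right] \\
Q^{\pi}(s,a) &= R(s,a) + \gamma \sum_{s'} T(s' \mid s,a)\, V^{\pi}(s') \\
\pi^{*}(s) &= \arg\max_{a} Q^{*}(s,a), \qquad V^{*}(s) = \max_{a} Q^{*}(s,a)
\end{align}
```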

Sample-based Planning [Kearns Mansour 99] • In the simplest case, agent can query domain for any <s,a> and get <r,s'> • Flow: • Domain informs agent of current state, s • Agent queries domain for any number of <s,a,r,s'> • Agent informs domain of true action to take, a • Domain informs agent of new state, s'
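The flow above can be sketched as a loop in which queries are free and only the committed action changes the true state. The `LineDomain` class, its reward, and the one-step greedy choice among random candidates are hypothetical and chosen only to keep the example short.

```python
import random

class LineDomain:
    """Hypothetical 1-D domain: state is a position, action is a displacement."""
    def __init__(self, start):
        self.state = start

    def query(self, s, a):
        """Generative model: returns <r, s'> for any <s, a> without moving the agent."""
        s_next = s + a
        return -abs(s_next), s_next

    def act(self, a):
        """Commit the true action; this is the only call that changes the state."""
        r, self.state = self.query(self.state, a)
        return r

def run_episode(domain, steps, queries_per_step, rng):
    rewards = []
    for _ in range(steps):
        s = domain.state                                   # domain reports s
        candidates = [rng.uniform(-1.0, 1.0) for _ in range(queries_per_step)]
        best = max(candidates, key=lambda a: domain.query(s, a)[0])  # free queries
        rewards.append(domain.act(best))                   # true action, new state
    return rewards

domain = LineDomain(start=2.0)
rewards = run_episode(domain, steps=5, queries_per_step=50, rng=random.Random(0))
```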

Planning with HOO (HOLOP) • Call this approach HOLOP: Hierarchical Open Loop Optimistic Planning • Can treat the n-step planning problem as a large optimization problem • Probability of splitting for a particular value of n proportional to $\gamma^n$ • Use HOO to optimize n-step planning, and then use action recommended in the first step.

1-Step Lookahead in HOLOP • Just maximizing immediate reward, $r_1$ • 1 dimensional; horizontal axis is splitting immediate action

2-Step Lookahead in HOLOP • Maximizing $r_1 + \gamma r_2$ • 2 dimensional; horizontal axis is splitting immediate action, vertical is splitting next action

3-Step Lookahead in HOLOP • Maximizing $r_1 + \gamma r_2 + \gamma^2 r_3$ • 3 dimensional; horizontal axis is splitting immediate action, vertical is splitting next action, depth is third action
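The lookahead objective treats a whole action sequence as a single point to optimize. The sketch below scores an open-loop sequence against a generative model and returns the first action of the best sequence found. Plain random search stands in for HOO here to keep the example short, and `toy_step`, the action range [-1, 1], and the function names are assumptions.

```python
import random

def open_loop_return(step, state, actions, gamma):
    """Discounted return of a fixed action sequence: r1 + gamma*r2 + gamma^2*r3 + ..."""
    total, discount = 0.0, 1.0
    for a in actions:
        r, state = step(state, a)   # generative model: <s, a> -> <r, s'>
        total += discount * r
        discount *= gamma
    return total

def plan_first_action(step, state, n, gamma, n_rollouts, rng):
    """Search over n-step sequences (random search, not HOO); return the first action."""
    best_ret, best_seq = float("-inf"), None
    for _ in range(n_rollouts):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        ret = open_loop_return(step, state, seq, gamma)
        if ret > best_ret:
            best_ret, best_seq = ret, seq
    return best_seq[0]

def toy_step(s, a):
    """Hypothetical domain: move along a line, penalized by squared distance from 0."""
    s_next = s + a
    return -(s_next ** 2), s_next

action = plan_first_action(toy_step, 0.8, n=3, gamma=0.9,
                           n_rollouts=500, rng=random.Random(0))
```

Only the first action is executed; planning is repeated from the next state, so later actions in the sequence serve only to evaluate the first.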

Properties of HOLOP • Planning regret of HOO/HOLOP grows as sqrt(t), independent of n • Cost independent of |S| • Open-loop control means planning is agnostic to state • Anytime planner • Open-loop control also means guarantees don't exist in noisy domains

Learning System Update • If generative model is available, can use HOLOP directly

HOLOP in Practice: Double Integrator Domain [Santamaría et al. 98] • Object with position (p) and velocity (v). Control acceleration (a). $R((p,v), a) = -(p^2 + a^2)$ • Stochasticity introduced with noise added to action command • Planning done to 50 steps • As an anytime planner, can stop and ask for an action anytime (left) • Performance degrades similarly to optimal as noise varies (right) • Action corrupted by +/- amount on x-axis, uniformly distributed. Action range is [-1.5, 1.5]

Building a Model: KD Trees [Moore 91] • HOLOP needs a model; where does it come from? • KD Tree is a simple method of partitioning a space • At each node, a split is made in some dimension in the region that node represents • Various rules for deciding when and where to split • To make an estimate, find the leaf containing the query point and fit an estimator to the samples in that leaf • Commonly the mean is used; I used linear regression • This is used to build models of reward and transitions
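A minimal sketch of such a tree is given below. It uses the leaf mean as the estimator (the simpler of the two options named above, not the linear regression used in this work), and the splitting rule (split at the median once a leaf exceeds `capacity`, cycling through dimensions by depth) is an assumption, as are all names.

```python
import random

class KDNode:
    def __init__(self, depth=0, capacity=8):
        self.depth, self.capacity = depth, capacity
        self.points = []                 # (x, y) samples stored while this is a leaf
        self.dim = self.threshold = None
        self.left = self.right = None

    def _child(self, x):
        return self.left if x[self.dim] < self.threshold else self.right

    def insert(self, x, y):
        if self.left is not None:        # internal node: route to a child
            self._child(x).insert(x, y)
            return
        self.points.append((x, y))
        if len(self.points) > self.capacity:
            self._split()

    def _split(self):
        dim = self.depth % len(self.points[0][0])
        values = sorted(x[dim] for x, _ in self.points)
        threshold = values[len(values) // 2]      # median along this dimension
        if values[0] == threshold:
            return                                # cannot separate; remain a leaf
        self.dim, self.threshold = dim, threshold
        self.left = KDNode(self.depth + 1, self.capacity)
        self.right = KDNode(self.depth + 1, self.capacity)
        points, self.points = self.points, []
        for x, y in points:
            self._child(x).insert(x, y)

    def predict(self, x):
        if self.left is not None:
            return self._child(x).predict(x)
        if not self.points:
            return 0.0                            # empty tree: no information yet
        return sum(y for _, y in self.points) / len(self.points)

# Hypothetical use: learn a reward model over (state, action) pairs.
rng = random.Random(0)
model = KDNode()
for _ in range(500):
    s, a = rng.uniform(-1, 1), rng.uniform(-1, 1)
    model.insert((s, a), -(s ** 2 + a ** 2))
estimate = model.predict((0.0, 0.0))
```

A transition model can be built the same way, with one tree per output dimension.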

KD Trees Approximating Gaussian • Samples drawn iid from Gaussian, labeled with pdf of Gaussian at point • Piecewise linear fit of function

Learning System Update • Model and Environment now 2 pieces • Generative model not required • Model learns from environment when true <s,a,r,s’> samples available • HOLOP uses learned model to plan

Efficient Exploration • Multi-resolution Exploration (MRE) [Nouri Littman 08] is a method that allows any RL algorithm to have efficient exploration • Partitions space by a tree with each node representing knownness based on sample density in leaf • When using samples, treat a sample as having a transition to Smax with probability calculated by tree. Smax is a state with a self transition and maximum possible reward • While doing rollouts in HOLOP, MRE perturbs results in order to drive the agent to explore

Learning System Update • When HOLOP queries the model for <s,a,r,s'>, MRE can step in and lie, saying a transition to Smax occurs instead • Happens with high probability if <s,a> is not well known
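The interception step can be sketched as a wrapper around the learned model. The knownness formula below (a saturating fraction of a target count `k`) is a simplified stand-in and not the formula from MRE, and `leaf_count` is a hypothetical callable standing in for the tree lookup of how many samples lie near <s,a>.

```python
import random

S_MAX = "S_max"   # fictitious absorbing state with maximum reward

def knownness(count, k=10):
    """Assumed stand-in: 0 when no samples are nearby, 1 once k or more are."""
    return min(1.0, count / k)

def optimistic_query(model, leaf_count, s, a, r_max, rng):
    """Answer a planner query, sometimes replacing the outcome with a jump to S_max."""
    if s == S_MAX:
        return r_max, S_MAX                       # self-transition, maximum reward
    if rng.random() >= knownness(leaf_count(s, a)):
        return r_max, S_MAX                       # poorly known: report the optimistic lie
    return model(s, a)                            # well known: report the learned model

# Hypothetical learned model and sample counts.
rng = random.Random(0)
learned = lambda s, a: (-abs(s + a), s + a)
unknown = optimistic_query(learned, lambda s, a: 0, 0.0, 0.5, 1.0, rng)
known = optimistic_query(learned, lambda s, a: 25, 0.0, 0.5, 1.0, rng)
```

Because rollouts through unfamiliar regions look highly rewarding, the planner is drawn toward them until enough samples accumulate.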

Double Integrator +/- 0.05 units uniformly distributed noise on actions • Explosion from discretization causes slow learning • Near-optimal behavior with 10 trajectories, 2000 total samples • Discrete algorithms have fast convergence to poor results, or slow convergence to good results

3 Link Swimmer [Tassa et al.] • Big domain: 2 action dimensions, 9 state dimensions • Model building needs 11 input dimensions, 10 output dimensions • Tested a number of algorithms in this domain; HOLOP has best performance • Rmax still worse than HOLOP after 120 trials

Next Step: Doing it all Quickly • In simulations, can replan with HOLOP at every step • In real time robotics control, can’t stop world while the algorithm plans. • Need a method of caching planning online. This is tricky as model is updated online – when is policy updated? • As with the other algorithms discussed (HOO, KD Trees, MRE) trees are used here. • Adaptively partitioning a space based on sample concentration is both very powerful and very efficient

[Preliminary] TCP: Tree-based Cached Planning • Algorithm: • Start with a root that covers the entire state space, initialized with an exploratory action • As samples are experienced, add them to the tree and partition nodes based on some splitting rule • Child nodes inherit their action from the parent on creation • Running it: • Can ask the tree for an action to take from a state; the leaf returns its cached action. • If the planner is not busy, request it to plan, in a different thread, from the center of the region the leaf represents
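A single-threaded sketch of the cache for a one-dimensional state space is given below. The splitting rule (halve a leaf once it holds more than `max_samples` samples), the class and method names, and the toy planner are assumptions; the background planning thread is replaced by a direct call to `replan`.

```python
class CacheNode:
    def __init__(self, lo, hi, action):
        self.lo, self.hi, self.action = lo, hi, action
        self.samples = []
        self.children = None

    def leaf(self, s):
        """Find the leaf whose region contains state s."""
        node = self
        while node.children is not None:
            mid = (node.lo + node.hi) / 2.0
            node = node.children[0] if s < mid else node.children[1]
        return node

    def add_sample(self, s, max_samples=4):
        """Record an experienced state; split the leaf when it becomes crowded."""
        leaf = self.leaf(s)
        leaf.samples.append(s)
        if len(leaf.samples) > max_samples:
            mid = (leaf.lo + leaf.hi) / 2.0
            left = CacheNode(leaf.lo, mid, leaf.action)    # children inherit the action
            right = CacheNode(mid, leaf.hi, leaf.action)
            for x in leaf.samples:
                (left if x < mid else right).samples.append(x)
            leaf.children, leaf.samples = (left, right), []

    def action_for(self, s):
        """Real-time lookup: return the cached action without planning."""
        return self.leaf(s).action

    def replan(self, s, planner):
        """Refresh the cached action by planning from the center of the leaf's region."""
        leaf = self.leaf(s)
        leaf.action = planner((leaf.lo + leaf.hi) / 2.0)

# Hypothetical use: exploratory action 0.0 at the root, planner pushes toward 0.
tree = CacheNode(-1.0, 1.0, action=0.0)
for s in [-0.9, -0.5, 0.1, 0.4, 0.8]:
    tree.add_sample(s)
tree.replan(0.5, planner=lambda center: -center)
```

Lookups stay cheap no matter how long planning takes, which is what allows the policy to be polled in real time.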

Cached Policies: Double Integrator with Generative Model • Shade indicates action cached (black -1.5, white 1.5) • Rewards: [-44.3, -5.5, -2.2, -4.6, -21.1, -2.6] (-1.3 optimal)

Close Up • Order is red, green, blue, yellow, magenta, cyan • In policy error, black indicates 0 error, white indicates maximum possible error in policy • Minimal error along optimal trajectory; errors off the optimal trajectory are acceptable

Learning System Update: Our Happy Forest • Agent acts based on policy cached in TCP • TCP sends request for updated policy, for state s

Conclusions • There are many existing classes of RL algorithms, but almost all fail at least one requirement of real time robotics control. My approach addresses all requirements • Hierarchical Open Loop Optimistic Planning is introduced: • Operates directly in continuous action spaces, and is agnostic to state. No need for tuning • No function approximation of value function • Tree-based Cached Planning is introduced: • Develops policies for regions when enough data is available to accurately determine policy • Opportunistic updating of policy allows for real-time polling, with policy updated frequently

Citations • Bubeck et al. 08: Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. Online Optimization in X-Armed Bandits. NIPS, 2008. • Kearns Mansour 99: Kearns, M., Mansour, Y., and Ng, A. A Sparse Sampling Algorithm for Near-Optimal Planning in Large MDPs. IJCAI, 1999. • Knox Stone 09: Knox, W. B. and Stone, P. Interactively Shaping Agents via Human Reinforcement: The TAMER Framework. K-CAP, 2009. • Moore 91: Moore, A. W. Efficient Memory-based Learning for Robot Control. PhD thesis, University of Cambridge, 1991. • Nouri Littman 08: Nouri, A. and Littman, M. L. Multi-resolution Exploration in Continuous Spaces. NIPS, 2008. • Santamaría et al. 98: Santamaría, J. C., Sutton, R., and Ram, A. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior 6, 1998. • Tassa et al.: Tassa, Y., Erez, T., and Smart, W. D. Receding horizon differential dynamic programming. In Advances in Neural Information Processing Systems 21. 2007.
