Sample-based Planning for Continuous Action Markov Decision Processes on Robots


Sample-based Planning for Continuous Action Markov Decision Processes [on robots]

Ari Weinstein

Reinforcement Learning (RL)

  • Agent takes an action in the world, gets information including numerical reward; how does it learn to maximize that reward?

  • Fundamental concept is exploration vs. exploitation. Must take actions in the world in order to learn about it, but eventually use what was learned to get high reward

  • Bandits (stateless), Markov Decision Processes (state)

The Goal

  • I want to be here:

  • Most RL algorithms are here [Knox Stone 09]:

  • Some RL is done with robots, but it's rare, partly because it's hard:


  • RL Basics (bandits/Markov decision processes)

  • Planning

    • Bandits

    • MDPs (novel)

  • Model Building

  • Exploring

  • Acting (novel)

Composing pieces in this manner is novel

k-armed Bandits

  • Agent selects from k arms, each with a distribution over rewards

  • Call the arm pulled at step $t$ $a_t$; the reward at step $t$ is $r_t \sim R(a_t)$

  • The regret is the difference in reward between the arm pulled and optimal arm; want cumulative regret to increase sub-linearly in t
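
The regret definition above can be made concrete with a small simulation. This is a minimal sketch, not something from the slides: it assumes Bernoulli arms with known means, measures expected regret (the gap in mean reward rather than in realized reward), and uses UCB1 as a stand-in bandit algorithm. The names `cumulative_regret` and `ucb1` are hypothetical.

```python
import math
import random


def cumulative_regret(arm_means, pulls):
    """Cumulative expected regret for a sequence of pulled arm indices."""
    best = max(arm_means)
    total, history = 0.0, []
    for arm in pulls:
        total += best - arm_means[arm]  # gap between optimal arm and pulled arm
        history.append(total)
    return history


def ucb1(arm_means, steps, seed=0):
    """Run UCB1 on Bernoulli arms (an assumed algorithm); return arms pulled."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts, sums, pulls = [0] * k, [0.0] * k, []
    for t in range(1, steps + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        pulls.append(arm)
    return pulls


if __name__ == "__main__":
    means = [0.2, 0.5, 0.8]
    regret = cumulative_regret(means, ucb1(means, 2000))
    print("regret after 2000 pulls:", round(regret[-1], 1))
```

An agent that always pulled the worst arm would accumulate regret linearly (0.6 per step with these means); the aim is a curve that flattens over time.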

Hierarchical Optimistic Optimization (HOO) [Bubeck et al. 08]

  • Partition action space by a tree

    • Keep track of rewards for each subtree

  • Blue is the bandit, red is the decomposition of HOO tree

    • Thickness represents estimated reward

  • Tree grows deeper and builds estimates at high resolution where reward is highest

HOO continued

  • Exploration bonuses for number of samples and size of each subregion

    • Regions with large volume and few samples are poorly known; small, heavily sampled regions are well known

  • Pull arm in region according to maximal B-value (the optimistic estimate of reward, including the bonuses above)

  • Has optimal regret: sqrt(t), independent of action dimension
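
A compact sketch of HOO on a one-dimensional action space may help make the tree and its bonuses concrete. The B-value rule follows Bubeck et al.: U = mean + sqrt(2 ln n / N) + ν·ρ^depth, and B = min(U, max of the children's B), with unvisited cells at +infinity. Everything else here is an assumption for illustration: the action space [0, 1], the parameter values `nu=1.0` and `rho=0.5`, uniform sampling inside the chosen cell, the full recomputation of B-values every round, and the final recommendation rule.

```python
import math
import random


class Node:
    """One cell of the HOO tree: an interval [lo, hi] of the action space."""

    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.count = 0        # samples taken inside this cell
        self.mean = 0.0       # running mean reward of those samples
        self.b = math.inf     # optimistic B-value; unvisited cells are +inf
        self.children = None  # (left, right), created on the first visit


def update_b(node, n, nu, rho):
    """Recompute B-values bottom-up after round n."""
    if node.count == 0:
        node.b = math.inf
        return
    for child in node.children:
        update_b(child, n, nu, rho)
    # Mean, plus a bonus for few samples, plus a bonus for cell size.
    u = (node.mean + math.sqrt(2.0 * math.log(n) / node.count)
         + nu * rho ** node.depth)
    node.b = min(u, max(child.b for child in node.children))


def hoo(reward_fn, rounds, nu=1.0, rho=0.5, seed=0):
    rng = random.Random(seed)
    root = Node(0.0, 1.0, 0)
    for n in range(1, rounds + 1):
        # Follow the larger B-value down to the first unvisited cell.
        node, path = root, [root]
        while node.count > 0:
            left, right = node.children
            node = left if left.b >= right.b else right
            path.append(node)
        reward = reward_fn(rng.uniform(node.lo, node.hi))
        mid = 0.5 * (node.lo + node.hi)
        node.children = (Node(node.lo, mid, node.depth + 1),
                         Node(mid, node.hi, node.depth + 1))
        for visited in path:  # update statistics along the path
            visited.count += 1
            visited.mean += (reward - visited.mean) / visited.count
        update_b(root, n, nu, rho)
    # Assumed recommendation rule: centre of the most-visited chain of cells.
    node = root
    while node.children is not None and max(c.count for c in node.children) > 0:
        node = max(node.children, key=lambda c: c.count)
    return 0.5 * (node.lo + node.hi), root


if __name__ == "__main__":
    best, _ = hoo(lambda x: 1.0 - abs(x - 0.7), rounds=500)
    print("recommended action:", best)
```

Recomputing every B-value each round costs time quadratic in the number of rounds; that keeps the sketch short but is not how one would run it at scale.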

Markov Decision Processes

  • Composed of:

    • States S (s, s’ from S)

    • Actions A (a from A)

    • Transition distribution T(s’|s,a)

    • Reward function R(s,a)

    • Discount factor 0<γ<1

  • The goal is to find a policy π, a mapping from states to actions, maximizing expected long-term discounted reward $E\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t\right]$, where $r_t$ is the reward at time t.

  • Maximize long-term reward but favor immediate reward more heavily: future rewards are decayed by γ. How much long-term reward is possible is measured by the value function

Value Function

  • Value of a state s under policy π:

  • Q-value of an action a under the same definition:

  • Optimally,
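
The equations for this slide did not survive in the transcript. As a hedged reconstruction, the standard Bellman forms in the notation of the previous slide (S, A, T, R, γ, π) are given below; the original slide may have written them differently, and for continuous state spaces the sums become integrals.

```latex
% Value of a state s under policy \pi
V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s' \mid s, \pi(s)) \, V^{\pi}(s')

% Q-value of an action a in state s, following \pi afterwards
Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) \, V^{\pi}(s')

% Optimal value and policy
V^{*}(s) = \max_{a} Q^{*}(s, a), \qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
```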

Sample-based Planning [Kearns Mansour 99]

  • In simplest case, agent can query domain for any:

    <s,a>, get <r,s’>

  • Flow:

    • Domain informs agent of current state, s

    • Agent queries domain for any number of <s,a,r,s’>

    • Agent informs domain of true action to take, a

    • Domain informs agent of new state, s’
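
The flow above can be sketched as a loop around a generative model. This is an illustration under stated assumptions: `ToyDomain` is a hypothetical 1-D domain invented for the example, and `plan` uses one-step random search purely as a placeholder for a real planner.

```python
import random


class ToyDomain:
    """Hypothetical 1-D domain: the state is a position, the action a displacement."""

    def __init__(self, start=2.0):
        self.state = start

    def sample(self, s, a):
        """Generative query <s,a> -> <r,s'>; does not change the true state."""
        s_next = s + a
        return -(s_next ** 2), s_next

    def step(self, a):
        """Take the true action; the domain moves to the new state."""
        r, self.state = self.sample(self.state, a)
        return r, self.state


def plan(domain, s, n_queries, rng):
    """Placeholder planner: query random actions, keep the best one-step reward."""
    best_a, best_r = 0.0, float("-inf")
    for _ in range(n_queries):
        a = rng.uniform(-1.0, 1.0)
        r, _ = domain.sample(s, a)
        if r > best_r:
            best_a, best_r = a, r
    return best_a


def run_episode(domain, steps=10, n_queries=50, seed=0):
    rng = random.Random(seed)
    s = domain.state                         # domain informs agent of current state
    total = 0.0
    for _ in range(steps):
        a = plan(domain, s, n_queries, rng)  # agent queries domain for samples
        r, s = domain.step(a)                # agent commits; domain reports new state
        total += r
    return total, s


if __name__ == "__main__":
    print(run_episode(ToyDomain()))
```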

Planning with HOO (HOLOP)

  • Call this approach HOLOP – Hierarchical Open Loop Optimistic Planning

  • Can treat the n-step planning problem as a large optimization problem

  • Probability of splitting for a particular value of n is proportional to $\gamma^n$

  • Use HOO to optimize n-step planning, and then use action recommended in the first step.

1-Step Lookahead in HOLOP

  • Just maximizing immediate reward, $r_1$

  • 1 dimensional; horizontal axis is splitting immediate action

2-Step Lookahead in HOLOP

  • Maximizing $r_1 + \gamma r_2$

  • 2 dimensional; horizontal axis is splitting immediate action, vertical is splitting next action

3-Step Lookahead in HOLOP

  • Maximizing $r_1 + \gamma r_2 + \gamma^2 r_3$

  • 3 dimensional; horizontal axis is splitting immediate action, vertical is splitting next action, depth is third action

Properties of HOLOP

  • Planning regret of HOO/HOLOP grows as $\sqrt{t}$ (so average regret shrinks as $1/\sqrt{t}$), independent of n

  • Cost independent of |S|

    • Open loop control means agnostic to state

  • Anytime planner

  • Open loop control means guarantees don’t exist in noisy domains

Learning System Update

  • If generative model is available, can use HOLOP directly

HOLOP in Practice: Double Integrator Domain [Santamaría et al. 98]

  • Object with position (p) and velocity (v). Control acceleration (a).

    $R((p,v), a) = -(p^2 + a^2)$

    • Stochasticity introduced with noise added to action command

  • Planning done to 50 steps

  • As an anytime planner, can stop and ask for an action anytime (left)

  • Performance degrades similarly to optimal as noise varies (right)

    • Action corrupted by +/- amount on x-axis, uniformly distributed. Action range is [-1.5, 1.5]
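
To show what HOLOP optimizes in this domain, here is a minimal sketch of the discounted return of one open-loop action sequence, $r_1 + \gamma r_2 + \gamma^2 r_3 + \dots$. The reward function and the action range [-1.5, 1.5] come from the slides. The Euler update, the time step `DT`, the default `gamma`, clipping of out-of-range commands, and charging the reward on the commanded (not the noisy) action are all assumptions made for the example.

```python
import random

A_MIN, A_MAX = -1.5, 1.5  # action range from the slide
DT = 0.05                 # assumed integration time step


def step(state, a, noise=0.0, rng=random):
    """One transition of the double integrator; returns (reward, next_state)."""
    p, v = state
    a = max(A_MIN, min(A_MAX, a))               # clip the commanded action
    reward = -(p * p + a * a)                   # R((p,v), a) = -(p^2 + a^2)
    a_applied = a + rng.uniform(-noise, noise)  # uniform noise on the action
    return reward, (p + v * DT, v + a_applied * DT)


def open_loop_return(state, actions, gamma=0.95, noise=0.0, rng=random):
    """Discounted return of a fixed action sequence, ignoring later states."""
    total, discount = 0.0, 1.0
    for a in actions:
        r, state = step(state, a, noise, rng)
        total += discount * r
        discount *= gamma
    return total


if __name__ == "__main__":
    start = (1.0, 0.0)
    print("do nothing:   ", open_loop_return(start, [0.0] * 50))
    print("push to left: ", open_loop_return(start, [-1.0] * 10 + [0.0] * 40))
```

A 50-step plan is then one point in a 50-dimensional action space, and this function is the objective an optimizer such as HOO would sample.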

Building a Model: KD Trees [Moore 91]

  • HOLOP needs a model – where does it come from?

  • KD Tree is a simple method of partitioning a space

  • At each node, a split is made in some dimension in the region that node represents

    • Various rules for deciding when, where to split

  • To make an estimate, find the leaf that contains the query point and apply some estimator to the samples in that leaf

    • The mean is commonly used; I used linear regression

  • This is used to build models of reward, transitions
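
A small sketch of a KD tree used as a regressor follows. It is not the author's implementation: it predicts with the leaf mean (the slides say linear regression was used instead), and the splitting rule (split a leaf at the median of its widest dimension once it holds more than `max_points` samples) is just one assumed choice among the "various rules". The class name `KDTreeRegressor` is hypothetical.

```python
class KDTreeRegressor:
    """Leaf-mean regression over a KD tree that splits as samples arrive."""

    def __init__(self, max_points=8):
        self.max_points = max_points
        self.points = []       # (x, y) pairs held while this node is a leaf
        self.dim = None        # split dimension once this node is internal
        self.threshold = None  # split value along that dimension
        self.left = self.right = None

    def _leaf_for(self, x):
        node = self
        while node.dim is not None:
            node = node.left if x[node.dim] < node.threshold else node.right
        return node

    def add(self, x, y):
        leaf = self._leaf_for(x)
        leaf.points.append((tuple(x), y))
        if len(leaf.points) > leaf.max_points:
            leaf._split()

    def _split(self):
        n_dims = len(self.points[0][0])

        def spread(d):
            values = [x[d] for x, _ in self.points]
            return max(values) - min(values)

        dim = max(range(n_dims), key=spread)  # widest dimension
        values = sorted(x[dim] for x, _ in self.points)
        threshold = values[len(values) // 2]  # median value
        left = [p for p in self.points if p[0][dim] < threshold]
        right = [p for p in self.points if p[0][dim] >= threshold]
        if not left:  # too many duplicates at the median; stay a leaf
            return
        self.left = KDTreeRegressor(self.max_points)
        self.right = KDTreeRegressor(self.max_points)
        self.left.points, self.right.points = left, right
        self.dim, self.threshold, self.points = dim, threshold, []

    def predict(self, x):
        points = self._leaf_for(x).points
        if not points:
            return None  # no data yet
        return sum(y for _, y in points) / len(points)


if __name__ == "__main__":
    tree = KDTreeRegressor()
    for i in range(200):
        tree.add((i / 200,), (i / 200) ** 2)
    print(tree.predict((0.5,)))  # close to 0.25
```

For model building, one such tree per output dimension (reward and each next-state component) with `<s,a>` as the input point would be one way to arrange it.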

KD Trees Approximating Gaussian

  • Samples drawn iid from Gaussian, labeled with pdf of Gaussian at point

  • Piecewise linear fit of function

Learning System Update

  • Model and Environment now 2 pieces

    • Generative model not required

  • Model learns from environment when true <s,a,r,s’> samples available

  • HOLOP uses learned model to plan

Efficient Exploration

  • Multi-resolution Exploration (MRE) [Nouri Littman 08] is a method that allows any RL algorithm to have efficient exploration

  • Partitions space by a tree with each node representing knownness based on sample density in leaf

  • When using samples, treat a sample as having a transition to Smax with probability calculated by tree. Smax is a state with a self transition and maximum possible reward

  • While doing rollouts in HOLOP, MRE perturbs results in order to drive the agent to explore

Learning System Update

  • When HOLOP queries the model for <s,a,r,s’>, MRE can step in and lie, saying that a transition to Smax occurred instead

    • Happens with high probability if <s,a> is not well known
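
The substitution can be sketched as a thin wrapper around the learned model. This is only an illustration: `knownness` is passed in as a callable returning a value in [0, 1], whereas MRE computes it from its own tree; the name `optimistic_sample`, the `S_MAX` sentinel, and the choice to pay `r_max` on the substituted transition are assumptions.

```python
import random

S_MAX = "S_MAX"  # sentinel for the fictitious maximum-reward state


def optimistic_sample(model, knownness, s, a, r_max, rng=random):
    """Return (r, s') from the model, or lie with a jump to S_MAX.

    The jump happens with probability 1 - knownness(s, a).
    """
    if s == S_MAX:
        return r_max, S_MAX  # self transition with maximum reward
    if rng.random() >= knownness(s, a):
        return r_max, S_MAX  # poorly known: pretend the best case happens
    return model(s, a)       # well known: trust the learned model


if __name__ == "__main__":
    model = lambda s, a: (-(s * s), s + a)   # stand-in learned model
    half_known = lambda s, a: 0.5
    print([optimistic_sample(model, half_known, 1.0, 0.1, r_max=0.0)
           for _ in range(5)])
```

Because rollouts through poorly known regions look artificially rewarding, the planner is drawn toward them until enough real samples make them known.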

Double Integrator

+/- 0.05 units uniformly distributed noise on actions

  • Explosion from discretization causes slow learning

  • Near-optimal behavior with 10 trajectories, 2000 total samples

  • Discrete algorithms have fast convergence to poor results, or slow convergence to good results

3 Link Swimmer [Tassa et al.]

  • Big domain: 2 action dimensions, 9 state dimensions

    • Model building needs 11 input dimensions, 10 output dimensions

  • Tested a number of algorithms in this domain, HOLOP has best performance

  • Rmax still worse than HOLOP after 120 trials

Next Step: Doing it all Quickly

  • In simulations, can replan with HOLOP at every step

  • In real time robotics control, can’t stop world while the algorithm plans.

  • Need a method of caching planning online. This is tricky as model is updated online – when is policy updated?

  • As with the other algorithms discussed (HOO, KD Trees, MRE), trees are used here.

    • Adaptively partitioning a space based on sample concentration is both very powerful and very efficient

[Preliminary] TCP: Tree-based Cached Planning


  • Start with root that covers entire state space, initialize with exploratory action

  • As samples are experienced, add them to the tree, partition nodes based on some splitting rule

  • Child nodes inherit action from parent on creation

    Running it:

  • Can ask tree for an action to take from a state; leaf returns its cached action.

    • If the planner is not busy, request that it plan, in a different thread, from the center of the area the leaf represents
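
A rough sketch of the cache, under several assumptions: the state space is one-dimensional, a leaf splits at its midpoint once it holds more than `max_samples` samples (the slides only say "some splitting rule"), and the planner is called synchronously rather than in a separate thread. `TCPNode` and its method names are hypothetical.

```python
class TCPNode:
    """Node of a tree over a 1-D state interval [lo, hi) caching one action."""

    def __init__(self, lo, hi, action, max_samples=4):
        self.lo, self.hi = lo, hi
        self.action = action  # cached action for this region
        self.max_samples = max_samples
        self.samples = []
        self.children = None

    def leaf(self, s):
        node = self
        while node.children is not None:
            left, right = node.children
            node = left if s < left.hi else right
        return node

    def add_sample(self, s):
        leaf = self.leaf(s)
        leaf.samples.append(s)
        if len(leaf.samples) > leaf.max_samples:
            mid = 0.5 * (leaf.lo + leaf.hi)
            # Children inherit the parent's cached action on creation.
            left = TCPNode(leaf.lo, mid, leaf.action, leaf.max_samples)
            right = TCPNode(mid, leaf.hi, leaf.action, leaf.max_samples)
            left.samples = [x for x in leaf.samples if x < mid]
            right.samples = [x for x in leaf.samples if x >= mid]
            leaf.children, leaf.samples = (left, right), []

    def cached_action(self, s):
        """Constant-depth lookup: the leaf returns its cached action."""
        return self.leaf(s).action

    def replan(self, s, planner):
        """Ask the planner for an action at the centre of the leaf holding s."""
        leaf = self.leaf(s)
        leaf.action = planner(0.5 * (leaf.lo + leaf.hi))
        return leaf


if __name__ == "__main__":
    root = TCPNode(-1.0, 1.0, action=0.0)  # root starts with an exploratory action
    for s in [-0.9, -0.5, 0.1, 0.5, 0.9]:
        root.add_sample(s)
    root.replan(0.5, planner=lambda center: -center)
    print(root.cached_action(0.5), root.cached_action(-0.5))
```

In a real-time setting the `replan` call would be handed to a background worker, so that `cached_action` never blocks on planning.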

Cached Policies: Double Integrator with Generative Model

  • Shade indicates action cached (black -1.5, white 1.5)

  • Rewards: [-44.3, -5.5, -2.2, -4.6, -21.1, -2.6] (-1.3 optimal)

Close Up

Order is red, green, blue, yellow, magenta, cyan

  • In policy error, black indicates 0 error, white indicates maximum possible error in policy

  • Minimal error along optimal trajectory – errors off optimal trajectory are acceptable

Learning System Update: Our Happy Forest

  • Agent acts based on policy cached in TCP

  • TCP sends request for updated policy, for state s


  • There are many existing classes of RL algorithms, but almost all fail at least one requirement of real time robotics control. My approach addresses all requirements

  • Hierarchical Open Loop Optimistic Planning is introduced:

    • Operates directly in continuous action spaces, and is agnostic to state. No need for tuning

    • No function approximation of value function

  • Tree-based Cached Planning is introduced:

    • Develops policies for regions when enough data is available to accurately determine policy

    • Opportunistic updating of policy allows for real-time polling, with policy updated frequently


  • Bubeck et al. 08: Bubeck, S., Munos, R., Stoltz, G., Szepesvari, C. Online Optimization in X-Armed Bandits. NIPS 2008.

  • Kearns Mansour 99: Kearns, M., Mansour, Y., Ng, A. A Sparse Sampling Algorithm for Near-Optimal Planning in Large MDPs. IJCAI 1999.

  • Knox Stone 09: Knox, W. B., Stone, P. Interactively Shaping Agents via Human Reinforcement: The TAMER Framework. K-CAP 2009.

  • Moore 91: Moore, A. W. Efficient Memory-based Learning for Robot Control. PhD thesis, University of Cambridge, 1991.

  • Nouri Littman 08: Nouri, A., Littman, M. L. Multi-resolution Exploration in Continuous Spaces. NIPS 2008.

  • Santamaría et al. 98: Santamaría, J. C., Sutton, R., Ram, A. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior 6, 1998.

  • Tassa et al.: Tassa, Yuval, Erez, Tom, and Smart, William D. Receding horizon differential dynamic programming. In Advances in Neural Information Processing Systems 21. 2007.
