Sample-based Planning for Continuous Action Markov Decision Processes [on robots]


Ari Weinstein

- Agent takes an action in the world, gets information including numerical reward; how does it learn to maximize that reward?

- Fundamental concept is exploration vs. exploitation: the agent must take actions in the world in order to learn about it, but must eventually use what was learned to get high reward
- Bandits (stateless), Markov Decision Processes (state)

- I want to be here:
- Most RL algorithms are here [Knox Stone 09]:
- Some RL is done with robots, but it's rare, partly because it's hard:

- RL Basics (bandits/Markov decision processes)
- Planning
- Bandits
- MDPs (novel)

- Model Building
- Exploring
- Acting (novel)

Composing pieces in this manner is novel

- Agent selects from k-arms, each with a distribution over rewards
- If the arm pulled at step t is $a_t$, the reward at step t is $r_t \sim R(a_t)$
- The regret is the difference in reward between the arm pulled and optimal arm; want cumulative regret to increase sub-linearly in t
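To make the regret definition concrete, here is a minimal sketch that tracks cumulative expected regret on a k-armed Bernoulli bandit. UCB1 is used only as a convenient, well-known arm-selection rule; it is an assumption for illustration and not an algorithm from these slides, and the arm means are made up.

```python
import math
import random

def ucb1_regret(means, steps, seed=0):
    """Run UCB1 on Bernoulli arms; return cumulative expected regret after each pull."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k       # pulls per arm
    totals = [0.0] * k     # summed reward per arm
    best = max(means)      # expected reward of the optimal arm
    regret, history = 0.0, []
    for t in range(steps):
        if t < k:
            arm = t  # pull every arm once to initialize the estimates
        else:
            arm = max(range(k), key=lambda i: totals[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0  # r_t ~ R(a_t)
        counts[arm] += 1
        totals[arm] += reward
        regret += best - means[arm]  # gap between optimal arm and arm pulled
        history.append(regret)
    return history
```

Plotting `ucb1_regret([0.2, 0.5, 0.8], 5000)` against t is one way to see whether the curve flattens, which is what sub-linear growth looks like.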

- Partition action space by a tree
- Keep track of rewards for each subtree

- Blue is the bandit, red is the decomposition of HOO tree
- Thickness represents estimated reward

- Tree grows deeper and builds estimates at high resolution where reward is highest

- Exploration bonuses for number of samples and size of each subregion
- Regions with large volume and few samples are unknown, vice versa

- Pull an arm in the region with the maximal B-value, the optimistic estimate that combines the mean reward with the two exploration bonuses
- Has optimal regret: sqrt(t), independent of action dimension
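The tree search described above can be sketched for a one-dimensional action space as follows. This is a simplified illustration and not the reference implementation: the bound follows the usual HOO form (mean, plus a bonus for few samples, plus a bonus for region size), the parameters `nu` and `rho` and their defaults are assumptions, bounds are recomputed recursively rather than cached, and `recommend` is a hypothetical helper for reading out an answer.

```python
import math
import random

class Node:
    """A region [lo, hi] of the action space with running reward statistics."""
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.count = 0        # samples taken inside this region
        self.mean = 0.0       # average reward of those samples
        self.children = None  # created once the region has been sampled

def b_value(node, t, nu, rho):
    """Optimistic bound; unsampled regions are maximally attractive."""
    if node.count == 0:
        return math.inf
    u = (node.mean
         + math.sqrt(2.0 * math.log(t) / node.count)  # bonus for few samples
         + nu * rho ** node.depth)                    # bonus for region size
    return min(u, max(b_value(c, t, nu, rho) for c in node.children))

def hoo(reward_fn, lo, hi, steps, nu=1.0, rho=0.5, seed=0):
    rng = random.Random(seed)
    root = Node(lo, hi, 0)
    for t in range(1, steps + 1):
        path, node = [root], root
        while node.children is not None:   # descend along maximal B-values
            node = max(node.children, key=lambda c: b_value(c, t, nu, rho))
            path.append(node)
        reward = reward_fn(rng.uniform(node.lo, node.hi))
        for n in path:                     # update every region on the path
            n.count += 1
            n.mean += (reward - n.mean) / n.count
        mid = (node.lo + node.hi) / 2.0    # split the sampled region in two
        node.children = [Node(node.lo, mid, node.depth + 1),
                         Node(mid, node.hi, node.depth + 1)]
    return root

def recommend(root):
    """Follow the most-sampled child down the tree; return that region's center."""
    node = root
    while node.children is not None and max(c.count for c in node.children) > 0:
        node = max(node.children, key=lambda c: c.count)
    return (node.lo + node.hi) / 2.0
```

For example, `recommend(hoo(lambda a: 1 - (a - 0.7) ** 2, 0.0, 1.0, 500))` returns the center of the most heavily sampled region for a made-up reward function peaked at 0.7.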

- Composed of:
- States S (s, s’ from S)
- Actions A (a from A)
- Transition distribution T(s’|s,a)
- Reward function R(s,a)
- Discount factor 0<γ<1

- Goal is to find a policy π, a mapping from states to actions, maximizing expected long-term discounted reward $E[\sum_{t=1}^{\infty} \gamma^{t-1} r_t]$, where $r_t$ is the reward at time t.
- Maximize long-term reward but favor immediate reward more heavily; later reward is decayed by γ. How much long-term reward is possible is measured by the value function

- Value of a state s under policy π:
- Q-value of an action a under the same definition:
- Optimally,
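For reference, the textbook forms of these three quantities are sketched below in the notation defined above (S, A, T, R, γ, π). They are the standard Bellman definitions and are assumed here, not copied from the slide; the expectation is over the next state drawn from T.

```latex
\begin{align}
V^{\pi}(s) &= R(s, \pi(s)) + \gamma \, \mathbb{E}_{s' \sim T(\cdot \mid s, \pi(s))}\left[ V^{\pi}(s') \right] \\
Q^{\pi}(s, a) &= R(s, a) + \gamma \, \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\left[ V^{\pi}(s') \right] \\
\pi^{*}(s) &= \arg\max_{a \in A} Q^{*}(s, a)
\end{align}
```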

- In the simplest case, the agent can query the domain for any <s,a> and get <r,s’>

- Flow:
- Domain informs agent of current state, s
- Agent queries domain for any number of <s,a,r,s’>
- Agent informs domain of true action to take, a
- Domain informs agent of new state, s’
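The flow above can be sketched as a small loop. Everything here is hypothetical and chosen only to show the four steps: `ToyDomain` is a made-up one-dimensional domain, and `plan` is a placeholder that tries random actions rather than a real planner.

```python
import random

class ToyDomain:
    """Hypothetical 1-D domain: an action shifts the position; reward is -|position|."""
    def __init__(self, start=2.0):
        self.state = start

    def query(self, s, a):
        """Generative model: return <r, s'> for any <s, a> without moving the real state."""
        s_next = s + a
        return -abs(s_next), s_next

    def act(self, a):
        """Commit to a true action and advance the real state."""
        _, self.state = self.query(self.state, a)
        return self.state

def plan(domain, s, n_queries, rng):
    """Placeholder planner: keep the sampled action with the best immediate reward."""
    best_a, best_r = 0.0, float("-inf")
    for _ in range(n_queries):
        a = rng.uniform(-1.0, 1.0)
        r, _ = domain.query(s, a)
        if r > best_r:
            best_a, best_r = a, r
    return best_a

def run_episode(steps=5, n_queries=50, seed=0):
    rng = random.Random(seed)
    domain = ToyDomain()
    s = domain.state                         # domain informs agent of current state
    trajectory = [s]
    for _ in range(steps):
        a = plan(domain, s, n_queries, rng)  # agent queries domain any number of times
        s = domain.act(a)                    # agent commits; domain returns new state
        trajectory.append(s)
    return trajectory
```

Keeping `query` free of side effects is the point of the design: planning can use as many samples as it likes without disturbing the true state.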

- Call this approach HOLOP – Hierarchical Open Loop Optimistic Planning
- Can treat the n-step planning problem as a large optimization problem
- Probability of splitting for a particular value of n proportional to $\gamma^n$
- Use HOO to optimize n-step planning, and then use action recommended in the first step.
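Two pieces of this reduction are easy to sketch: the n-step objective that the optimizer scores, and the rule for choosing which step's action to split. The snippet below is an illustration under stated assumptions: `model` is any hypothetical callable returning (reward, next state), steps are indexed from 0 so the immediate action has weight 1, and the HOO search itself is left out.

```python
import random

def rollout_return(model, state, actions, gamma):
    """Discounted return r1 + gamma*r2 + ... of an open-loop action sequence."""
    total, discount = 0.0, 1.0
    for a in actions:
        reward, state = model(state, a)  # the sequence ignores the states it visits
        total += discount * reward
        discount *= gamma
    return total

def split_probabilities(horizon, gamma):
    """Probability of splitting along step n's action, proportional to gamma**n."""
    weights = [gamma ** n for n in range(horizon)]
    z = sum(weights)
    return [w / z for w in weights]

def choose_split_dimension(horizon, gamma, rng):
    """Sample which step's action to split next."""
    probs = split_probabilities(horizon, gamma)
    return rng.choices(range(horizon), weights=probs)[0]
```

Planning then amounts to searching for the action sequence that maximizes `rollout_return` from the current state and executing only its first element.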

- Just maximizing immediate reward, $r_1$
- 1 dimensional; horizontal axis is splitting immediate action

- Maximizing $r_1 + \gamma r_2$
- 2 dimensional; horizontal axis is splitting immediate action, vertical is splitting next action

- Maximizing $r_1 + \gamma r_2 + \gamma^2 r_3$
- 3 dimensional; horizontal axis is splitting immediate action, vertical is splitting next action, depth is third action

- Planning performance (regret) of HOO/HOLOP improves with the number of samples t: cumulative regret grows as sqrt(t), independent of n
- Cost independent of |S|
- Open loop control means agnostic to state

- Anytime planner
- Open loop control means guarantees don’t exist in noisy domains

- If generative model is available, can use HOLOP directly

- Object with position (p) and velocity (v). Control acceleration (a).
$R((p,v), a) = -(p^2 + a^2)$

- Stochasticity introduced with noise added to action command
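A generative model for this domain might look like the sketch below. The reward and the [-1.5, 1.5] action range come from the slides; the Euler update, the time step `dt`, clipping the command to the legal range, and computing reward from the commanded (not the noisy) action are all assumptions made for illustration.

```python
import random

A_MIN, A_MAX = -1.5, 1.5  # action range stated in the slides

def double_integrator(state, a, noise=0.0, dt=0.05, rng=random):
    """Sample <r, s'> for state (p, v) and commanded acceleration a."""
    p, v = state
    a = max(A_MIN, min(A_MAX, a))               # assumed: clip to the legal range
    reward = -(p ** 2 + a ** 2)                 # R((p, v), a) = -(p^2 + a^2)
    a_applied = a + rng.uniform(-noise, noise)  # uniform noise corrupts the command
    p_next = p + v * dt                         # assumed Euler integration
    v_next = v + a_applied * dt
    return reward, (p_next, v_next)
```

With `noise=0.0` the model is deterministic, which makes it easy to check a planner before adding stochasticity.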

- Planning done to 50 steps
- As an anytime planner, can stop and ask for an action anytime (left)
- Performance degrades similarly to optimal as noise varies (right)
- Action corrupted by +/- amount on x-axis, uniformly distributed. Action range is [-1.5, 1.5]

- HOLOP needs a model – where does it come from?
- KD Tree is a simple method of partitioning a space
- At each node, a split is made in some dimension in the region that node represents
- Various rules for deciding when, where to split

- To make an estimate, find the leaf that contains the query point, then apply some estimator to the samples in that leaf
- The mean is commonly used; I used linear regression

- This is used to build models of reward, transitions
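A minimal KD-tree regressor along these lines is sketched below. The splitting rule (split a leaf at the lower median of its widest dimension once it holds more than `max_points` samples) is an assumption, one of the many possible rules mentioned above, and leaves predict with the mean rather than the linear regression used in the talk, to keep the sketch short.

```python
class KDNode:
    def __init__(self, points=None):
        self.points = points or []  # (x, y) samples; x is a tuple of inputs
        self.dim = None             # split dimension (None while this is a leaf)
        self.threshold = None
        self.left = self.right = None

def find_leaf(node, x):
    """Walk down to the leaf whose region contains x."""
    while node.dim is not None:
        node = node.left if x[node.dim] <= node.threshold else node.right
    return node

def insert(root, x, y, max_points=4):
    leaf = find_leaf(root, x)
    leaf.points.append((x, y))
    if len(leaf.points) > max_points:
        split(leaf)

def split(leaf):
    xs = [x for x, _ in leaf.points]
    spread = lambda d: max(x[d] for x in xs) - min(x[d] for x in xs)
    dim = max(range(len(xs[0])), key=spread)    # widest dimension
    values = sorted(x[dim] for x in xs)
    threshold = values[(len(values) - 1) // 2]  # lower median
    left = [pt for pt in leaf.points if pt[0][dim] <= threshold]
    right = [pt for pt in leaf.points if pt[0][dim] > threshold]
    if not right:
        return  # too many ties along dim; stay a leaf for now
    leaf.dim, leaf.threshold = dim, threshold
    leaf.left, leaf.right = KDNode(left), KDNode(right)
    leaf.points = []

def predict(root, x):
    """Mean of the targets in x's leaf; assumes at least one sample was inserted."""
    ys = [y for _, y in find_leaf(root, x).points]
    return sum(ys) / len(ys)
```

To model a reward function, x would be the concatenated state and action and y the observed reward; a transition model could use one such tree per output dimension.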

- Samples drawn iid from Gaussian, labeled with pdf of Gaussian at point
- Piecewise linear fit of function

- Model and Environment now 2 pieces
- Generative model not required

- Model learns from environment when true <s,a,r,s’> samples available
- HOLOP uses learned model to plan

- Multi-resolution Exploration (MRE) [Nouri Littman 08] is a method that allows any RL algorithm to have efficient exploration
- Partitions space by a tree with each node representing knownness based on sample density in leaf
- When using samples, treat a sample as having a transition to Smax with probability calculated by tree. Smax is a state with a self transition and maximum possible reward
- While doing rollouts in HOLOP, MRE perturbs results in order to drive the agent to explore

- When HOLOP queries model for <s,a,r,s’>, MRE can step in and lie, say transition to Smax occurs instead
- Happens with high probability if <s,a> is not well known
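The interception step can be sketched as a thin wrapper around the model query. This is an illustration under loud assumptions: `GridKnownness` replaces the MRE tree with a fixed grid of cells and a made-up saturation count `k`, states and actions are scalars, and `model` is any hypothetical callable returning (reward, next state).

```python
import math
import random

S_MAX = "s_max"  # fictitious state: self transition, maximum reward

class GridKnownness:
    """Stand-in for the MRE tree: knownness = min(1, samples in cell / k)."""
    def __init__(self, cell_width, k):
        self.cell_width, self.k = cell_width, k
        self.counts = {}

    def _cell(self, s, a):
        return (math.floor(s / self.cell_width), math.floor(a / self.cell_width))

    def add(self, s, a):
        """Record one real sample at <s, a>."""
        cell = self._cell(s, a)
        self.counts[cell] = self.counts.get(cell, 0) + 1

    def __call__(self, s, a):
        return min(1.0, self.counts.get(self._cell(s, a), 0) / self.k)

def mre_query(model, knownness, s, a, r_max, rng):
    """Answer a planner query, substituting a jump to S_MAX where <s, a> is poorly known."""
    if s == S_MAX:
        return r_max, S_MAX              # absorbing: stays put and keeps paying r_max
    if rng.random() >= knownness(s, a):  # true with probability 1 - knownness
        return r_max, S_MAX
    return model(s, a)
```

Because unknown regions appear to lead to maximum reward, a planner that simply maximizes return is drawn toward them until enough real samples arrive.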

+/- 0.05 units uniformly distributed noise on actions

- Explosion from discretization causes slow learning
- Near-optimal behavior with 10 trajectories, 2000 total samples
- Discrete algorithms have fast convergence to poor results, or slow convergence to good results

- Big domain: 2 action dimensions, 9 state dimensions
- Model building needs 11 input dimensions, 10 output dimensions

- Tested a number of algorithms in this domain, HOLOP has best performance
- Rmax still worse than HOLOP after 120 trials

- In simulations, can replan with HOLOP at every step
- In real time robotics control, can’t stop world while the algorithm plans.
- Need a method of caching planning online. This is tricky as model is updated online – when is policy updated?
- As with the other algorithms discussed (HOO, KD Trees, MRE) trees are used here.
- Adaptively partitioning a space based on sample concentration is both very powerful and very efficient

Algorithm:

- Start with root that covers entire state space, initialize with exploratory action
- As samples are experienced, add them to the tree, partition nodes based on some splitting rule
- Child nodes inherit action from parent on creation

Running it:

- Can ask tree for an action to take from a state; leaf returns its cached action.
- If the planner is not busy, request it (in a different thread) to plan from the center of the region the leaf represents
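The caching tree described above can be sketched for a one-dimensional state space as follows. The midpoint split, the `max_samples` threshold, and the function names are assumptions for illustration, and the planner is called synchronously here instead of in a separate thread.

```python
class CacheNode:
    """A region [lo, hi) of state space with one cached action."""
    def __init__(self, lo, hi, action):
        self.lo, self.hi, self.action = lo, hi, action
        self.samples = []     # states experienced inside this leaf
        self.children = None

def find_leaf(root, s):
    node = root
    while node.children is not None:
        mid = (node.lo + node.hi) / 2.0
        node = node.children[0] if s < mid else node.children[1]
    return node

def add_sample(root, s, max_samples=3):
    """Record an experienced state; split the leaf once it holds too many."""
    leaf = find_leaf(root, s)
    leaf.samples.append(s)
    if len(leaf.samples) > max_samples:
        mid = (leaf.lo + leaf.hi) / 2.0
        left = CacheNode(leaf.lo, mid, leaf.action)   # children inherit the action
        right = CacheNode(mid, leaf.hi, leaf.action)
        left.samples = [x for x in leaf.samples if x < mid]
        right.samples = [x for x in leaf.samples if x >= mid]
        leaf.children, leaf.samples = (left, right), []

def get_action(root, s):
    """Real-time lookup: return whatever action the leaf currently caches."""
    return find_leaf(root, s).action

def refresh(root, s, planner):
    """Replan from the center of s's leaf and cache the result (synchronous here)."""
    leaf = find_leaf(root, s)
    leaf.action = planner((leaf.lo + leaf.hi) / 2.0)
```

Because `get_action` only walks the tree, it answers immediately; `refresh` can run whenever the planner is idle without blocking the control loop.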

- Shade indicates action cached (black -1.5, white 1.5)
- Rewards: [-44.3, -5.5, -2.2, -4.6, -21.1, -2.6] (-1.3 optimal)

Order is red, green, blue, yellow, magenta, cyan

- In policy error, black indicates 0 error, white indicates maximum possible error in policy
- Minimal error along optimal trajectory – errors off optimal trajectory are acceptable

- Agent acts based on policy cached in TCP (Tree-based Cached Planning)
- TCP sends request for updated policy, for state s

- There are many existing classes of RL algorithms, but almost all fail at least one requirement of real time robotics control. My approach addresses all requirements
- Hierarchical Open Loop Optimistic Planning is introduced:
- Operates directly in continuous action spaces, and is agnostic to state. No need for tuning
- No function approximation of value function

- Tree-based Cached Planning is introduced:
- Develops policies for regions when enough data is available to accurately determine policy
- Opportunistic updating of policy allows for real-time polling, with policy updated frequently

- Bubeck et al 08: Bubeck, S., Munos, R., Stoltz, G., and Szepesvari, C. Online Optimization in X-Armed Bandits. NIPS 2008.
- Kearns Mansour 99: Kearns, M., Mansour, Y., and Ng, A. A Sparse Sampling Algorithm for Near-Optimal Planning in Large MDPs. IJCAI 1999.
- Knox Stone 09: Knox, B. W. and Stone, P. Interactively Shaping Agents via Human Reinforcement: The TAMER Framework. K-CAP 2009.
- Moore 91: Moore, A. Efficient Memory-based Learning for Robot Control. PhD thesis, University of Cambridge, 1991.
- Nouri Littman 08: Nouri, A. and Littman, M. L. Multi-resolution Exploration in Continuous Spaces. NIPS 2008.
- Santamaría et al 98: Santamaría, J. C., Sutton, R., and Ram, A. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior 6, 1998.
- Tassa et al.: Tassa, Y., Erez, T., and Smart, W. D. Receding horizon differential dynamic programming. In Advances in Neural Information Processing Systems 21, 2007.