
Planning with Uncertainty in Continuous Domains



Presentation Transcript


  1. Planning with Uncertainty in Continuous Domains. Richard Dearden (no fixed abode). Joint work with: Zhengzhu Feng (U. Mass Amherst), Nicolas Meuleau and Dave Smith (NASA Ames), and Richard Washington (Google).

  2. Motivation. Problem: scientists are interested in many potential targets. How do we decide which to pursue? [Figure: candidate rover activities: Panorama, Image Rock, Dig Trench.]

  3. Motivation (continued). The targets have different values, and pursuing each raises further questions: how much time will it take? How much power? What is the likelihood of success? [Figure: the same candidate activities (Panorama, Image Rock, Dig Trench) annotated with these questions.]

  4. Outline • Introduction • Problem Definition • A Classical Planning Approach • The Markov Decision Problem approach • Final Comments

  5. Problem Definition
  • Aim: to select a "plan" that "maximises" the long-term expected reward received, given:
    • Limited resources (time, power, memory capacity).
    • Uncertainty about the resources required to carry out each action ("how long will it take to drive to that rock?").
    • Hard safety constraints on action applicability (must keep enough reserve power to maintain the rover).
    • Uncertain action outcomes (some targets may be unreachable, instruments may be impossible to place).
  • Difficulties:
    • Continuous resources.
    • Actions have uncertain continuous outcomes.
    • Goal selection and optimization.
    • Also possibly concurrency, …

  6. Possible Approaches
  • Contingency planning:
    • Generate a single plan, but with branches.
    • Branch based on the actual outcome of the actions performed so far in the plan (e.g. branch on Power > 5 Ah versus Power ≤ 5 Ah).
  • Policy-based planning:
    • A plan is now a policy: a mapping from states to actions.
    • There's something to do no matter what the outcome of the actions so far.
    • More general, but harder to compute.

  7. An Example Problem. [Figure: an example rover plan with actions Visual servo (.2, -.15), Dig(60), Drive(-2), NIR, HiRes, Lo res and Rock finder. Each action is annotated with an energy precondition (e.g. E > 3 Ah), a time window (e.g. t ∈ [10:00, 14:00]), and the mean and standard deviation of its energy and time usage; goals carry values V = 5, 10, 50 and 100.] A small sketch of such an action model follows below.
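  The annotations above suggest a simple action model: a hard energy precondition plus Gaussian uncertainty over energy and time usage. The following Python sketch is not the authors' code; the class name UncertainAction and the specific numbers are illustrative stand-ins (they echo the NIR action's annotations on the slide).

```python
from dataclasses import dataclass
import random

@dataclass
class UncertainAction:
    name: str
    min_energy: float     # hard precondition: applicable only if E > min_energy (Ah)
    energy_mu: float      # mean energy usage (Ah)
    energy_sigma: float   # std. dev. of energy usage (Ah)
    time_mu: float        # mean duration (s)
    time_sigma: float     # std. dev. of duration (s)

    def applicable(self, energy: float) -> bool:
        return energy > self.min_energy

    def sample_outcome(self):
        """Sample one (energy used, time taken) outcome, truncated at zero."""
        e = max(0.0, random.gauss(self.energy_mu, self.energy_sigma))
        t = max(0.0, random.gauss(self.time_mu, self.time_sigma))
        return e, t

# Parameters echo the NIR action's annotations on the slide (illustrative).
nir = UncertainAction("NIR", min_energy=3.0, energy_mu=2.0, energy_sigma=0.5,
                      time_mu=600.0, time_sigma=60.0)
print(nir.applicable(5.0), nir.sample_outcome())
```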

  8. Expected Value. [Figure: the value function plotted as a surface over power (0–20 Ah) and start time (13:20–14:40).]

  9. Value Function. [Figure: the example problem of slide 7 shown alongside the expected-value surface of slide 8: each action annotated with its energy precondition, time window and usage distribution, and the value function plotted over power and start time.]

  10. Plans
  • Contingency planning: the plan branches on conditions such as "Time > 13:40 or Power < 10".
  • Policy-based planning: regions of state space have corresponding actions, e.g. "Time < 13:40 and Power > 10: VisualServo", "Time > 14:15 and Time < 14:30 and Power > 10: Hi-Res", and so on. A minimal sketch of such a region-based policy follows below.
  [Figure: a contingent plan over the actions Visual servo (.2, -.15), Dig(60), Drive(-2), NIR, Lo res, Rock finder and Hi-Res.]
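  As a concrete illustration of the policy view, here is a minimal Python sketch of a region-based policy: an ordered list of (condition over time and power, action) pairs, where the first matching region determines the action. The region bounds echo the slide's example; the function name policy and the fallback action are assumptions for illustration only.

```python
# A minimal sketch of a region-based policy; first matching region wins.
def policy(time_h: float, power_ah: float) -> str:
    regions = [
        (lambda t, p: t < 13.67 and p > 10, "VisualServo"),   # before 13:40, power > 10 Ah
        (lambda t, p: 14.25 < t < 14.5 and p > 10, "HiRes"),  # 14:15-14:30, power > 10 Ah
    ]
    for condition, action in regions:
        if condition(time_h, power_ah):
            return action
    return "LoRes"  # default fallback in this sketch

print(policy(13.0, 12.0))  # -> VisualServo
```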

  11. Contingency Planning
  1. Seed plan.
  2. Identify the best branch point: construct the plangraph, back-propagate value tables, and compute the gain.
  3. Generate a contingency branch.
  4. Evaluate and integrate the branch.
  [Figure: value functions Vm (main plan) and Vb (branch) over remaining resource r, with candidate branch points marked on the seed plan.]

  12. Construct Plangraph. [Figure: a plangraph expanded from the initial state towards goals g1–g4.]

  13. Add Resource Usages and Values. [Figure: the same plangraph with each goal gi labelled with its value Vi and each action labelled with its resource usage.]

  14. Value Graphs. [Figure: each goal gi is given a value graph Vi plotted against remaining resource r.]

  15. Propagate Value Graphs. [Figure: the goal value graphs Vi(r) are propagated backwards through the plangraph, producing a value-versus-resource graph at each earlier node.]

  16. Simple Back-propagation. [Figure: back-propagating a value graph v(r) through an action with uncertain resource usage (probability distributions p over usages around 5 and 15 units in the example): the expectation over usages gives the value graph V(r) before the action.] A sketch of this step follows below.
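  A minimal Python sketch of this back-propagation step, under the assumption (consistent with the talk) that an action's resource usage is described by a discrete probability table: the value before the action is the expectation of the downstream value over the possible usages. The table values and the helper name back_propagate are illustrative.

```python
def back_propagate(value_after, usage_dist):
    """value_after: dict resource level -> value; usage_dist: dict usage -> probability."""
    value_before = {}
    for r in value_after:
        v = 0.0
        for usage, p in usage_dist.items():
            remaining = r - usage
            # Use the value of the nearest tabulated resource level at or below `remaining`.
            levels = [x for x in value_after if x <= remaining]
            v += p * (value_after[max(levels)] if levels else 0.0)
        value_before[r] = v
    return value_before

v_after = {0: 0.0, 5: 10.0, 10: 10.0, 15: 25.0}   # downstream value vs. resource
usage = {5: 0.2, 10: 0.7, 15: 0.1}                # P(resource usage)
print(back_propagate(v_after, usage))
```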

  17. Constraints. [Figure: as on the previous slide, but the action also has a resource constraint (r > 15), so the back-propagated value graph is zero below that threshold.]

  18. Conjunctions. [Figure: handling conjunctions: a node whose preconditions include several conditions ({q}, {t}) combines the value graphs of the actions supporting each condition.]

  19. Back-propagating Conditions. [Figure: conditions such as {q} and {t} are carried along with the value graphs as they are propagated backwards through the plangraph.]

  20. Back-propagating Conditions (continued). [Figure: a further step of the same example; the propagated value graphs now reflect both the resource-usage distributions and the carried conditions.]

  21. Which Orderings? [Figure: from the action set {A, B, C, D}, only some orderings are considered during back-propagation, e.g. ABCD, ACBD, ACDB, CABD, CADB, CDAB.]

  22. Combining Tables. [Figure: the value tables v1(r) and v2(r) for two alternative goals are combined by taking the pointwise maximum over the shared resource axis.]

  23. Achieving Both Goals. [Figure: when both goals are to be achieved, the combined table holds v1 + v2, at resource levels high enough to pay for both sub-plans (around 30 units in the example).] A sketch of both ways of combining tables follows below.
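  The two combination rules suggested by slides 22 and 23 can be sketched as follows (illustrative Python, not the authors' code): alternative goals combine by pointwise maximum over the shared resource axis, while achieving both goals sums the two tables over a split of the available resource.

```python
def combine_max(v1, v2):
    """Pointwise max over a shared resource grid: pursue whichever goal is better."""
    return {r: max(v1[r], v2[r]) for r in v1}

def combine_both(v1, v2, grid):
    """Best way to split resource r between two goals pursued together."""
    out = {}
    for r in grid:
        out[r] = max(v1[r1] + v2[r - r1] for r1 in grid if r - r1 in grid)
    return out

grid = [0, 10, 20, 30]
v1 = {0: 0, 10: 10, 20: 10, 30: 10}   # first goal: worth 10 from 10 units up
v2 = {0: 0, 10: 0, 20: 20, 30: 20}    # second goal: worth 20 from 20 units up
print(combine_max(v1, v2))            # either goal alone
print(combine_both(v1, v2, grid))     # both goals need about 30 units
```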

  24. Estimating Branch Value. [Figure: the value tables V1–V4 of the candidate goals reachable from the branch are combined (by pointwise max) into a single estimate V(r) of the branch value.]

  25. Estimating Branch Value (continued). [Figure: the three ingredients of the gain computation: the resource probability density P(r) at the branch point, the value function Vm(r) of the main plan, and the estimated branch value Vb(r) built from the tables V1–V4.]

  26. Expected Branch Gain. Gain = ∫₀^∞ P(r) · max{0, Vb(r) − Vm(r)} dr, where P(r) is the probability density over the remaining resource at the branch point, Vm is the value function of the main plan, and Vb is the estimated value of the branch. [Figure: P(r), Vm(r) and Vb(r) plotted over r; the gain is the expected improvement of the branch over the main plan.]
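  Since P(r), Vb(r) and Vm(r) are one-dimensional, the gain integral can be approximated numerically. The sketch below uses an illustrative Gaussian resource density and two toy step-shaped value functions; none of the numbers come from the rover model.

```python
import math

def gain(p, vb, vm, r_max=30.0, dr=0.01):
    """Numerically approximate Gain = integral of p(r) * max(0, vb(r) - vm(r)) dr."""
    g, r = 0.0, 0.0
    while r < r_max:
        g += p(r) * max(0.0, vb(r) - vm(r)) * dr
        r += dr
    return g

# Illustrative inputs: Gaussian resource density, step-shaped value functions.
p  = lambda r: math.exp(-0.5 * ((r - 12) / 3) ** 2) / (3 * math.sqrt(2 * math.pi))
vm = lambda r: 10.0 if r >= 15 else 0.0   # value of the main plan
vb = lambda r: 8.0 if r >= 5 else 0.0     # value of the candidate branch
print(gain(p, vb, vm))                    # branch helps when r is between 5 and 15
```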

  27. Heuristic Guidance
  • Plangraphs are generally used as heuristics; the plans they produce may not be executable:
    • Not all orderings are considered.
    • All the usual plangraph limitations apply:
      • Delete lists are generally not considered.
      • No mutual-exclusion representation.
    • Discrete outcomes are not (currently) handled: action uncertainty is only in resource usage, not in the resulting state.
  • The output is used as heuristic guidance for a classical planner, together with the start state and the goal(s) to achieve.
  • The result is an executable plan of high value.
  [Figure: the example plan with actions Visual servo (.2, -.15), Dig(5), Drive(-1), Hi res, Lo res, Rock finder, NIR.]

  28. Evaluating the Final Plan
  • The plangraph gives a heuristic estimate of the value of the plan.
  • A better estimate can be computed using Monte-Carlo techniques, but these are quite slow for a multi-dimensional continuous problem (see the sketch below).
  • The expected-value surface shown earlier required 500 samples per point over a 4000 x 2000 grid of (start time, power) points, i.e. roughly 4 billion simulations of each branch of the plan. Slow!
  [Figure: the expected-value surface over power (0–20 Ah) and start time (13:20–14:40).]
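  A minimal Monte-Carlo sketch of this kind of evaluation at a single (start time, power) point; the action costs, branch condition and goal values are illustrative stand-ins, not the rover model from the slides.

```python
import random

def simulate_once(power, t):
    """One rollout: drive, then take NIR (value 100) if enough power and time remain, else Lo-Res (value 5)."""
    power -= max(0.0, random.gauss(2.0, 0.5))       # drive energy (Ah)
    t     += max(0.0, random.gauss(1000.0, 500.0))  # drive time (s)
    if power > 3.0 and t < 14.0 * 3600:             # branch condition (illustrative)
        power -= max(0.0, random.gauss(2.0, 0.5))   # NIR energy
        return 100.0 if power > 0 else 0.0
    return 5.0

def expected_value(power, t, n=500):
    """Monte-Carlo estimate of the plan's expected value at one state."""
    return sum(simulate_once(power, t) for _ in range(n)) / n

print(expected_value(power=10.0, t=12.0 * 3600))    # start at 12:00 with 10 Ah
```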

  29. Outline • Introduction • Problem Definition • A Classical Planning Approach • The Markov Decision Problem approach • Final Comments

  30. MDP Approach: Motivation. The value function is constant throughout whole regions of the state space; wouldn't it be nice to compute that value only once? Approach: exploit the structure in the problem to find constant (or linear) regions. [Figure: the expected-value surface over power and start time from slide 8.]

  31. Continuous MDPs
  • States: X = {X1, X2, . . . , Xn}
  • Actions: A = {a1, a2, . . . , am}
  • Transition: Pa(X'|X)
  • Reward: Ra(X)
  • Dynamic programming (Bellman backup): Vn+1(X) = max_a [ Ra(X) + ∫ Pa(X'|X) Vn(X') dX' ]
  • This backup can't be computed in general without discretization (a discretized sketch follows below).
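  The sketch below shows the discretized form of this backup on a one-dimensional resource state, Vn+1(x) = max_a [ Ra(x) + Σ_x' Pa(x'|x) Vn(x') ]. The two actions, their rewards and their stochastic resource costs are invented for illustration and are not the rover model.

```python
def bellman_backup(V, states, actions):
    """One discretized Bellman backup over a tabulated value function V."""
    V_new = {}
    for x in states:
        best = V[x]  # the option of stopping keeps the current value
        for reward, cost_dist in actions:
            # Expected value of taking the action: immediate reward plus
            # expectation of the old value over the stochastic resource cost.
            q = reward(x) + sum(p * V[max(x - c, 0)] for c, p in cost_dist)
            best = max(best, q)
        V_new[x] = best
    return V_new

states = list(range(0, 21))                            # remaining power, 0..20 Ah
image = (lambda x: 10.0 if x >= 3 else 0.0, [(2, 0.7), (3, 0.3)])   # (reward, cost distribution)
dig   = (lambda x: 50.0 if x >= 10 else 0.0, [(5, 0.5), (8, 0.5)])

V = {x: 0.0 for x in states}
for _ in range(20):            # resources strictly decrease, so a few sweeps converge
    V = bellman_backup(V, states, [image, dig])
print(V[20])
```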

  32. Symbolic Dynamic Programming
  • Special representation of the transition, reward and value functions: MTBDDs for discrete variables, kd-trees for continuous ones.
  • The representation makes problem structure (if any) explicit.
  • Dynamic programming operates on both the value function and the structured representation.
  • The idea is to perform all operations of the Bellman equation directly in MTBDD/kd-tree form.

  33. Continuous State Abstraction
  • Requires rectangular transition and reward functions:
    • Transition probabilities remain constant (relative to the current value) over each region.
    • The transition function is discrete: continuous transition functions are approximated by discretizing them.
  • This is required so that the family of value functions is closed under the Bellman equation.

  34. Continuous State Abstraction (continued)
  • Requires rectangular transition and reward functions:
    • The reward function must be piecewise constant or linear over each region.
    • This, along with the discrete transition function, ensures that all value functions computed using the Bellman equation are also piecewise constant or linear.
  • The approach is to compute an exact solution to an approximate model.

  35. Value Iteration
  • Theorem: if Vn is rectangular PWC (PWL), then Vn+1 is rectangular PWC (PWL).
  • Rectangular partitions are represented using kd-trees; a one-dimensional sketch of a piecewise-constant representation follows below.
  [Figure: the transition function Pa mapping a partition of Vn onto a partition of Vn+1.]
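  To make the piecewise-constant idea concrete, here is a one-dimensional sketch of a rectangular PWC value function stored as breakpoints rather than a dense grid, together with the pointwise-max operation (one of the operations needed in the Bellman backup) under which PWC functions are closed. A real implementation would use kd-trees over several continuous variables, as on the slide; the class name PWC and the numbers are illustrative.

```python
import bisect

class PWC:
    """A rectangular piecewise-constant function of one continuous variable."""
    def __init__(self, breakpoints, values):
        # values[i] holds on [breakpoints[i], breakpoints[i+1]); the last value
        # extends to +infinity, and the first also covers x below breakpoints[0].
        self.bp, self.v = breakpoints, values

    def __call__(self, x):
        i = bisect.bisect_right(self.bp, x) - 1
        return self.v[max(0, min(i, len(self.v) - 1))]

    def pointwise_max(self, other):
        # The max of two PWC functions is again PWC, on the merged breakpoints.
        bps = sorted(set(self.bp) | set(other.bp))
        return PWC(bps, [max(self(b), other(b)) for b in bps])

v1 = PWC([0.0, 10.0], [0.0, 25.0])   # worth 25 once power >= 10 Ah
v2 = PWC([0.0, 5.0], [0.0, 10.0])    # worth 10 once power >= 5 Ah
v = v1.pointwise_max(v2)
print(v(3.0), v(7.0), v(12.0))       # -> 0.0 10.0 25.0
```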

  36. Partitioning

  37. Performance: 2 Continuous Variables

  38. Performance: 3 Continuous Variables
  • For the naïve algorithm, we simply discretize everything at the given input resolution.
  • For the others, we discretize the transition functions at that resolution, but the algorithm may increase the resolution where needed to represent the final value function accurately. This means the computed value function is actually more accurate than the naïve algorithm's.

  39. Final Remarks
  • Plangraph-based approach:
    • Produces "plans", which are easy for people to interpret.
    • Gives a fast heuristic estimate of the value of a plan or plan fragment.
    • Needs an effective way to evaluate actual values to really know whether a branch is worthwhile.
    • Efficient representation for problems with many goals.
    • Still missing discrete action outcomes.
  • MDP-based approach:
    • Produces optimal policies, the best you could possibly do.
    • Faster, more accurate value-function computation (if there is structure).
    • Hard to represent some problems effectively (e.g. the fact that goals are worth something only before you reach them).
    • Policies are hard for humans to interpret.
  • The two can be combined: use the MDP approach to evaluate the quality of plans and plan fragments.

  40. Future Work
  • We approximate by building an approximate model and then solving it exactly. One could also approximately solve the exact model.
  • The plangraph approach takes advantage of the current system state when planning, to narrow the search. The MDP policy probably includes value computation for many unreachable states.
  • Preference elicitation is very important here: with many goals we need good estimates of their values.
  • This is part of a greater whole, the rover planning problem:
    • Is the policy encoded efficiently enough to transmit to the rover?
    • How much more complex does the executive need to be to carry out a contingent plan?
