Combining Exploration and Exploitation in Landmine Detection
Lihan He, Shihao Ji and Lawrence Carin, ECE, Duke University

Presentation Transcript

  1. Combining Exploration and Exploitation in Landmine Detection Lihan He, Shihao Ji and Lawrence Carin ECE, Duke University

  2. Outline • Introduction • Partially observable Markov decision processes (POMDPs) • Model definition & offline learning • Lifelong-learning algorithm • Experimental results

  3. Landmine detection (1) • Landmine detection: – By robots instead of human beings – Underlying model controlling the robot: POMDP • Multiple sensors: – A single sensor is sensitive to only certain types of objects: EMI sensor (conductivity), GPR sensor (dielectric properties), seismic sensor (mechanical properties) – Multiple complementary sensors improve detection performance

  4. Landmine detection (2) • Statement of the problem: • Given a minefield where some landmines and clutter are buried underground; • Two types of sensors are available: an EMI sensor and a GPR sensor; • Sensing has a cost; • Correct / incorrect declarations have corresponding rewards / penalties; How can we develop a strategy to effectively find the landmines in this minefield with minimal cost? Key questions: • How to optimally choose sensing positions in the field, so as to use as few sensing points as possible to find the landmines? • How to optimally choose a sensor at each sensing position? • When to sense and when to declare?

  5. Landmine detection (3) • Solution sketch: A partially observable Markov decision process (POMDP) model is built to solve this problem, since it provides an approach to selecting actions (sensor deployment, sensing positions and declarations) optimally, based on maximal reward / minimal cost. • Lifelong learning: • The robot learns the model at the same time as it moves and senses in the minefield (combining exploration and exploitation); • The model is updated based on the exploration process.

  6. Outline • Introduction • Partially observable Markov decision processes (POMDPs) • Model definition & offline learning • Lifelong-learning algorithm • Experimental results

  7. POMDP (1) [Figure: agent–environment loop: the environment sends an observation to the agent, the agent's state estimation updates the belief b, and its policy outputs an action back to the environment] POMDP = HMM + controllable actions + rewards. A POMDP is a model of an agent interacting synchronously with its environment. The agent takes as input observations of the environment, estimates the state from the observed information, and then generates as output actions based on its policy. Over these repeated observation–action loops, the agent seeks maximal reward, or equivalently, minimal cost.

  8. POMDP (2) A POMDP model is defined by the tuple <S, A, T, R, Ω, O>: • S is a finite set of discrete states of the environment. • A is a finite set of discrete actions. • Ω is a finite set of discrete observations providing noisy state information. • T: S×A → Π(S) is the state transition function: T(s, a, s') is the probability of transitioning from state s to s' when taking action a. • O: S×A → Π(Ω) is the observation function: O(s', a, o) is the probability of receiving observation o after taking action a and landing in state s'. • R: S×A → ℝ, where R(s, a) is the expected reward the agent receives by taking action a in state s.
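The tuple above can be held in a small container class. This is a minimal illustrative sketch, not the authors' code; the class and field names are assumptions, and the arrays are initialized to uniform placeholders that a real model would overwrite.

```python
import numpy as np

# Minimal tabular POMDP container mirroring the tuple <S, A, T, R, Omega, O>.
# All names here are illustrative; values are uniform placeholders.
class TabularPOMDP:
    def __init__(self, n_states, n_actions, n_obs):
        self.n_states = n_states
        self.n_actions = n_actions
        self.n_obs = n_obs
        # T[a, s, s2] = P(s2 | s, a): state transition probabilities
        self.T = np.full((n_actions, n_states, n_states), 1.0 / n_states)
        # O[a, s2, o] = P(o | a, s2): observation probabilities
        self.O = np.full((n_actions, n_states, n_obs), 1.0 / n_obs)
        # R[s, a]: expected immediate reward
        self.R = np.zeros((n_states, n_actions))
```

Each row of `T` and `O` is a probability distribution, so rows must sum to one after any update.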

  9. POMDP (3) Belief state b: • The agent's estimate of which state it is currently in; • A probability distribution over the state set S; • A summary of all past information; • Updated at each step by Bayes rule, based on the latest action a, the latest observation o, and the previous belief state b: b'(s') = O(s', a, o) Σs T(s, a, s') b(s) / P(o | a, b)
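The Bayes belief update above can be sketched in a few lines. This is an illustrative implementation, not the authors' code; it assumes the slices for the chosen action a and received observation o have already been taken, so `T[s, s2] = P(s2 | s, a)` and `O[s2] = P(o | a, s2)`.

```python
import numpy as np

# Bayes belief update: b'(s') proportional to O(s',a,o) * sum_s T(s,a,s') b(s)
def belief_update(b, T, O):
    unnorm = O * (b @ T)       # elementwise O(s',a,o) times the predicted belief
    norm = unnorm.sum()        # P(o | a, b), the normalizer
    if norm == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnorm / norm
```

The normalizer P(o | a, b) is also what the lifelong-learning algorithm later uses to weight its sampled models.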

  10. POMDP (4) Policy: • A mapping from belief states to actions; • Tells the agent which action it should take given the current belief state. Optimal policy: maximize the expected discounted reward over the horizon length: V*(b) = max over a of [ R(b, a) + γ Σo P(o | a, b) V*(b') ], where R(b, a) = Σs b(s) R(s, a) is the immediate reward and the second term is the discounted future reward. • V*(b) is piecewise linear and convex in the belief state (Sondik, 1971); • Represent V*(b) by a set of |S|-dimensional vectors {α1*, …, αm*}: V*(b) = max over i of αi*·b
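The piecewise-linear convex representation makes evaluating the value function a max over dot products. A minimal sketch, with made-up α-vectors purely for illustration:

```python
import numpy as np

# V*(b) = max_i (alpha_i . b): evaluate the PWLC value function at belief b
def value(b, alphas):
    return max(float(np.dot(a, b)) for a in alphas)

# Index of the alpha vector (and hence the action attached to it) that
# dominates at belief b
def greedy_alpha(b, alphas):
    return max(range(len(alphas)), key=lambda i: float(np.dot(alphas[i], b)))
```

In a solved POMDP each α-vector carries an associated action, so `greedy_alpha` is effectively the policy lookup π(b).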

  11. POMDP (5) Policy learning: • Solve for the vectors {α1*, …, αm*}; • Point-based value iteration (PBVI) algorithm; • Iteratively updates the vector α and the value V for a set of sampled belief points. One step from the horizon: α(s) = R(s, a). The (n+1)-step vectors are computed from the n-step results by the backup g(a, b) = R(·, a) + γ Σo argmax over i of g(a, o, i)·b, where g(a, o, i)(s) = Σs' αi(s') O(s', a, o) T(s, a, s').
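The PBVI backup at a single belief point can be sketched as follows. This is an illustrative implementation under assumed array conventions (`T[a, s, s2]`, `O[a, s2, o]`, `R[s, a]`), not the authors' code.

```python
import numpy as np

# One point-based value-iteration backup at belief point b, given the
# step-n alpha-vector set `alphas`; returns the step-(n+1) vector for b.
def pbvi_backup(b, alphas, T, O, R, gamma=0.95):
    n_actions = T.shape[0]
    n_obs = O.shape[2]
    best_vec, best_val = None, -np.inf
    for a in range(n_actions):
        g = R[:, a].astype(float).copy()            # immediate reward term
        for o in range(n_obs):
            # g_{a,o,i}(s) = sum_{s'} alpha_i(s') O(s',a,o) T(s,a,s')
            cand = [T[a] @ (O[a, :, o] * alpha) for alpha in alphas]
            # keep the candidate that scores best at this belief point
            g += gamma * max(cand, key=lambda v: float(v @ b))
        if g @ b > best_val:
            best_vec, best_val = g, float(g @ b)
    return best_vec
```

Running this backup over the whole set of sampled belief points, and repeating, is the iterative update the slide describes.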

  12. Outline • Introduction • Partially observable Markov decision processes (POMDPs) • Model definition & offline learning • Lifelong-learning algorithm • Experimental results

  13. Model definition (1) • Feature extraction – EMI sensor • An EMI model is fit to the sensor measurements; the model parameters are extracted by a nonlinear fitting method.

  14. Model definition (2) • Feature extraction – GPR sensor [Figure: GPR waveform over time and down-track position] • Raw moments – energy features • Central moments – variance and asymmetry of the wave

  15. Model definition (3) • Definition: observation Ω – EMI feature vectors are vector-quantized against an EMI codebook, and GPR feature vectors against a GPR codebook; Ω is the union of the two codebooks. • Definition: state S [Figure: a target buried beneath the ground surface, with the surrounding area discretized into states s1–s9]
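The vector-quantization step maps a continuous feature vector to the index of its nearest codeword, and that index is the discrete observation. A minimal sketch, assuming Euclidean nearest-neighbor quantization (the slide does not state the distance metric) and an illustrative codebook:

```python
import numpy as np

# Map a feature vector to the index of the closest codeword; that index
# serves as the discrete observation symbol in Omega.
def quantize(feature, codebook):
    dists = np.linalg.norm(codebook - feature, axis=1)
    return int(np.argmin(dists))
```

The EMI and GPR codebooks would each be quantized this way, with the GPR indices offset so the two alphabets do not collide in the union Ω.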

  16. Model definition (4) • Estimate |S| and |Ω| • Variational Bayesian (VB) expectation-maximization (EM) method for model selection. • Bayesian learning: posterior = likelihood × prior / evidence • Criterion: compare model evidence (marginal likelihood)

  17. Model definition (5) • Estimate |S| and |Ω| Candidate models: – HMMs with two sets of observations (EMI and GPR measurements at each sensing point, collected along horizontal and vertical sensing sequences over the underlying states) – |S| = 1, 5, 9, 13, … – |Ω| = 2, 3, 4, …

  18. Model definition (6) • Estimate |S| and |Ω| [Figure: model-evidence curves used to estimate |S| and |Ω|]

  19. Model definition (7) • Specification of action A 10 sensing actions, allowing movement in the four compass directions: 1: Stay, GPR sensing; 2: South, GPR sensing; 3: North, GPR sensing; 4: West, GPR sensing; 5: East, GPR sensing; 6: Stay, EMI sensing; 7: South, EMI sensing; 8: North, EMI sensing; 9: West, EMI sensing; 10: East, EMI sensing. 5 declaration actions, declaring the current position as one type of target: 11: Declare as 'metal mine'; 12: Declare as 'plastic mine'; 13: Declare as 'Type-1 clutter'; 14: Declare as 'Type-2 clutter'; 15: Declare as 'clean'.

  20. Model definition (8) • Estimate T Across all 5 target types (metal mine, plastic mine, type-1 clutter, type-2 clutter, clean), a total of 29 states are defined. [Figure: the 29 states laid out spatially across the five target types]

  21. Model definition (9) • Estimate T • "Stay" actions do not cause a state transition – identity matrix • Other sensing actions cause state transitions – computed by elementary geometric probability, e.g. for a = "walk south and then sense with EMI" or a = "walk south and then sense with GPR" • "Declaration" actions reset the problem – uniform distribution over states [Figure: geometric layout of states around state 5] δ: distance traveled by the robot in a single step; σ1, σ2, σ3 and σ4 denote the four borders of state 5, as well as their respective area measures.

  22. Model definition (10) • Estimate T • Assume each mine or clutter item is buried separately; • State transitions happen only within the states of a single target as the robot moves; • "Clean" (state 29) acts as a bridge between targets. [Figure: the metal-mine states s1–s9 connect to other targets only through the "clean" state]

  23. Model definition (11) • Estimate T • The state transition matrix is block-diagonal, with one block per target type: metal mine, plastic mine, Type-1 clutter, Type-2 clutter and "clean" • Model expansion: add more diagonal blocks, one per new target

  24. Model definition (12) • Specification of reward R • Sensing: -1. Each sensing action (either EMI or GPR) has a cost of -1. • Correct declaration: +10. Correctly declare a target. • Partially correct declaration: +5. Confused between different types of landmines, or between different types of clutter. • Incorrect declaration: large penalty. Missed detection (declare as "clean" or clutter when it is a landmine): -100. False alarm (declare as a landmine when it is clean or clutter): -50.
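The declaration part of the reward specification can be written as a small lookup. This is an illustrative sketch with the numeric values from the slide; the label strings are assumptions, and incorrect declarations among non-mine labels (e.g. declaring "clean" on clutter), which the slide leaves implicit, are treated here as the -50 penalty.

```python
SENSE_COST = -1  # each EMI or GPR sensing action

def declaration_reward(true_label, declared):
    mines = {"metal mine", "plastic mine"}
    clutter = {"type-1 clutter", "type-2 clutter"}
    if declared == true_label:
        return 10                 # correct declaration
    if true_label in mines and declared in mines:
        return 5                  # confused between mine types
    if true_label in clutter and declared in clutter:
        return 5                  # confused between clutter types
    if true_label in mines:
        return -100               # missed mine: declared clean or clutter
    return -50                    # false alarm or other incorrect declaration
```

The large asymmetry (-100 for a miss vs. -50 for a false alarm) is what pushes the learned policy toward conservative, sensing-heavy behavior near ambiguous targets.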

  25. Outline • Introduction • Partially observable Markov decision processes (POMDPs) • Model definition & offline learning • Lifelong-learning algorithm • Experimental results

  26. Lifelong learning (1) – Model-based algorithm – No training data available in advance: learn the POMDP model by a Bayesian approach during the exploration & exploitation processes. – Assume a rough model is given, but some model parameters are uncertain. – An oracle is available that can provide exact information about the target label, size and position, but using the oracle is expensive. – Criteria for using the oracle: 1. The policy selects the "oracle query" action 2. The agent encounters new observations – new knowledge 3. After extensive sensing the agent still cannot make a decision – the target is too difficult

  27. Lifelong learning (2) – An "oracle query" consists of three steps: 1. Measure data with both sensors on a grid 2. The true target label is revealed 3. Build the target model from the measured data – Two learning approaches: 1. Model expansion (more target types are considered) 2. Model hyper-parameter update

  28. Lifelong learning (3) Dirichlet distribution • A distribution over multinomial distribution parameters. • The conjugate prior to the multinomial distribution. • We can place a Dirichlet prior on each state-action pair of the transition probability T and the observation function O: Dir(θ | u1, …, uK) ∝ ∏k θk^(uk − 1), where θ are the multinomial parameters and u the Dirichlet hyper-parameters.
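Conjugacy is what makes the hyper-parameter update on the next slides cheap: the posterior Dirichlet parameters are just the prior parameters plus the observed counts. A minimal sketch for one state-action row of T (or O):

```python
import numpy as np

# Conjugate update: posterior Dirichlet parameters = prior + observed counts
def dirichlet_update(prior_u, counts):
    return np.asarray(prior_u, dtype=float) + np.asarray(counts, dtype=float)

# Point estimate of the multinomial row: the Dirichlet posterior mean
def posterior_mean(u):
    u = np.asarray(u, dtype=float)
    return u / u.sum()
```

Each oracle query thus updates the uncertain rows of T and O by adding the counts measured on the query grid to the running Dirichlet parameters.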

  29. Lifelong learning (4) Algorithm: 1. Start with an imperfect model M0, containing "clean" and some mine or clutter types, with the corresponding S and Ω; S and Ω can be expanded during the learning process; 2. Include "oracle query" as one possible action; 3. Set the learning rate λ; 4. Set Dirichlet priors according to the imperfect model M0, for every unknown transition probability and every unknown observation probability.

  30. Lifelong learning (4) Algorithm (continued): 5. Sample N models Mi and solve their policies πi; 6. Initialize the weights wi = 1/N; 7. Initialize the history h = {}; 8. Initialize the belief state b0 for each model; 9. Run the experiment. At each time step: a. Compute the optimal action for each model: ai = πi(bi) for i = 1, …, N; b. Pick an action a according to the weights wi: p(ai) = wi; c. If one of the three query conditions is met (exploration): (1) Sense the current local area on a grid; (2) The current target label is revealed; (3) Build the sub-model for the current target and compute the hyper-parameters. If the target is a new target type, expand the model by including the new target type as a diagonal block; else (the target is an existing target type), update the Dirichlet parameters of the current target type (next page).
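Step 9b above, picking one model's proposed action with probability proportional to the model weight, can be sketched as a weighted draw. This is an illustrative implementation, not the authors' code:

```python
import random

# Each sampled model i proposes a_i = pi_i(b_i); execute one proposal
# with probability proportional to the model weight w_i.
def pick_action(actions, weights, rng=random):
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for a, w in zip(actions, weights):
        acc += w
        if r < acc:
            return a
    return actions[-1]   # guard against floating-point round-off
```

As the weights concentrate on the best-fitting sampled model, this draw converges to simply executing that model's policy.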

  31. Lifelong learning (4) Algorithm (continued): d. If a query is not required (exploitation): (1) Take action a; (2) Receive observation o; (3) Update the belief state for each sampled model; (4) Update the history h. e. Update the weights wi by the forward-backward algorithm. f. Pruning, at regular intervals: (1) Remove the model samples with the lowest weights and redraw new ones; (2) Solve the new models' policies; (3) Update the beliefs according to the history h up to the current time; (4) Recompute the weights according to the history h up to the current time.

  32. Outline • Introduction • Partially observable Markov decision processes (POMDPs) • Model definition & offline learning • Lifelong-learning algorithm • Experimental results

  33. Results (1) • Data description: – minefields of 1.6×1.6 m2 – sensing on a spatial grid of 2 cm by 2 cm – two sensors: EMI and GPR • Robot navigation: – search almost everywhere to avoid missing landmines – active sensing to minimize the cost – "basic path" + "lanes": the "basic path" restrains the robot from moving across the lanes; within the lanes, the robot takes actions to determine its sensing positions.

  34. Results (2) • Offline-learning approach: performance summary Metal clutter: soda can, shell, nail, quarter, penny, screw, lead, rod, ball bearing. Nonmetal clutter: rock, bag of wet sand, bag of dry sand, CD.

  35. Results (3) • Offline-learning approach: Minefield 1 [Figure: detection result vs. ground truth; declaration marks for "clean", metal mine, plastic mine, type-1 clutter, type-2 clutter and "unknown"] P: plastic mine; M: metal mine; other: clutter. 1 missed detection; 2 false alarms.

  36. Results (4) • Sensor deployment [Figure: sensing marks for the EMI and GPR sensors, with declaration marks for "clean", metal mine, plastic mine and "unknown"] – Plastic mine: GPR sensor – Metal mine: EMI sensor – "Clean" areas and mine centers: few sensing actions (2-3 in general) – Mine/"clean" interfaces: many sensing actions

  37. Results (5) • Lifelong-learning approach: Minefield 1 (initial learning from Minefield 1) [Figure: detection result vs. ground truth; red rectangular regions mark oracle queries; other marks are declarations: red = metal mine, pink = plastic mine, yellow = clutter 1, cyan = clutter 2, blue "c" = clean]

  38. Results (6) • Lifelong-learning approach: compared with offline learning [Figure: difference between the parameters of the model learned by lifelong learning and the model learned offline (training data given in advance), over time. The three large error drops correspond to adding new targets to the model.]

  39. Results (7) • Lifelong-learning approach: Minefield 2, sensed after the model was learned from Minefield 1 [Figure: detection result vs. ground truth, including 19 rocks; red rectangular regions mark oracle queries; other marks are declarations: red = metal mine, pink = plastic mine, yellow = clutter 1, cyan = clutter 2, blue "c" = clean]

  40. Results (8) • Lifelong-learning approach: Minefield 3 [Figure: detection result vs. ground truth; red rectangular regions mark oracle queries; other marks are declarations: red = metal mine, pink = plastic mine, yellow = clutter 1, cyan = clutter 2, blue "c" = clean]