Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Presentation Transcript


  1. Integrating POMDP and RL for a Two Layer Simulated Robot Architecture Presented by Alp Sardağ

  2. Two Layer Architecture • The lower layer provides fast, short-horizon decisions. • The lower layer is designed to keep the robot out of trouble. • The upper layer ensures that the robot continually works toward its target task or goal.

  3. Advantages • Offers reliability. • Reliability: the robot must be able to deal with failures of sensors and actuators; otherwise, hardware failure means mission failure. • Examples of robots operating outside direct human control: • Space exploration • Office robots

  4. The System • It has two levels of control: • The lower level controls the actuators that move the robot around and provides a set of behaviors that can be used by the higher level of control. • The upper level, the planning system, plans a sequence of actions to move the robot from its current location to the goal.

  5. The Architecture • The bottom level is implemented by RL: • RL, as an incremental learner, is able to learn online. • RL can adapt to changes in the environment. • RL reduces programmer intervention.

  6. The Architecture • The higher level is a POMDP planner: • The POMDP planner operates quickly once a policy has been generated. • The POMDP planner can provide the reinforcement needed by the lower-level behaviors.
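A minimal sketch of how the two layers might fit together, assuming a hypothetical POMDPPlanner with update_belief/select_behavior methods and per-behavior RL modules with an act method (these names are illustrative, not from the paper):

```python
# Illustrative two-layer control loop: the POMDP planner chooses which
# low-level RL behavior is active; the active behavior issues motor commands.
# All class and method names here are assumptions for the sketch.
class TwoLayerController:
    def __init__(self, planner, behaviors):
        self.planner = planner        # upper layer: POMDP policy over behaviors
        self.behaviors = behaviors    # lower layer: e.g. {"forward": ..., "left": ..., "right": ...}

    def step(self, observation):
        # Upper layer: fold the latest observation into the belief state,
        # then pick which behavior should be active.
        self.planner.update_belief(observation)
        name = self.planner.select_behavior()
        # Lower layer: the selected RL module reacts quickly to raw sensing,
        # keeping the robot out of trouble while pursuing the planner's choice.
        command = self.behaviors[name].act(observation)
        return name, command
```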

  7. The Test • For testing, the Khepera robot simulator is used. • Khepera has limited sensors. • It has a well-defined environment. • The simulator can run much faster than real time. • The simulator does not require human intervention for low-battery conditions and sensor failures.

  8. Methods for Low-Level Behaviors • Subsumption • Learning from examples. • Behavioral cloning.

  9. Methods for Low-Level Behaviors • Neural systems tend to be robust to noise and perturbations in the environment. • GeSAM is a neural-network-based robot hand control system; it uses an adaptive neural network. • Neural networks often require long training periods and large amounts of data.

  10. Methods for Low-Level Behaviors • RL can learn continuously. • RL provides adaptation to sensor drift and changes in actuators. • Even in many extreme cases, such as sensor or actuator failures, RL adapts enough to allow the robot to accomplish its mission.

  11. Planning at the Top • POMDPs deal with uncertainty. • For the Khepera, with its limited sensors, determining the exact state is very difficult. • Also, the effects of the actuators may not be deterministic.

  12. Planning at the Top • Some rewards are associated with the goal state. • Some rewards are associated with performing some action in a certain state. • This allows complex, compound goals to be defined.

  13. Drawback • The current POMDP solution method: • Does not scale well with the size of the state space. • Exact solutions are only feasible for very small POMDP planning problems. • Requires that the robot be given a map, which is not always feasible.

  14. What is Gained? • By combining RL and POMDP, the system is robust to changes. • RL will learn how to use the damaged sensors and actuators. • Continuous learning has some drawbacks when using backpropagation neural networks, such as over-training. • The POMDP adapts to sensor and actuator failures by adjusting the transition probabilities.

  15. The Simulator • Pulse encoders are not used in this work. • The simulation results can be successfully transferred to a real robot. • The sensor model includes stochastic modeling of noise and responds similarly to the real sensors. • The simulation environment includes some stochastic modeling of wheel slippage and acceleration. • Hooks are added to the simulator to allow sensor failures to be simulated. • Effector failures are simulated in the code.
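A rough sketch of what a noisy sensor reading with a failure hook could look like; the Gaussian noise model, the 0-1023 clamp, and the constant failure value are assumptions, not the simulator's actual behavior:

```python
import random

SENSOR_MAX = 1023  # assumed 10-bit IR reading range

def read_sensor(true_distance, failed=False, noise_std=20.0):
    # Hypothetical failure hook: a dead sensor reports a constant, uninformative value.
    if failed:
        return 0
    # Otherwise add Gaussian noise and clamp to the sensor range.
    noisy = true_distance + random.gauss(0.0, noise_std)
    return max(0, min(SENSOR_MAX, int(noisy)))
```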

  16. RL Behaviors • Three basic behaviors: move forward, turn right, and turn left. • The robot is always moving or performing an action. • RL is responsible for dealing: • With obstacles, • With adjusting to sensor or actuator malfunctions.

  17. RL Behaviors • The goal of the RL modules is to maximize the reward given to them by the POMDP planner. • The reward is a function of how long it took to make a desired state transition. • Each behavior has its own RL module. • Only one RL module can be active at a given time. • Q-learning with table lookup is used to approximate the value function. • Fortunately, the problem is so far small enough for table lookup.
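A minimal tabular Q-learning module in the spirit described above; the learning rate, discount, and epsilon-greedy action selection are illustrative assumptions:

```python
import random
from collections import defaultdict

class QLearningBehavior:
    """Sketch of a per-behavior Q-learning module with table lookup (not the paper's code)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # (state, action) -> value: the lookup table
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        # Epsilon-greedy selection over the table (exploration scheme is assumed).
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard Q-learning backup:
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error
```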

  18. POMDP Planning • Since robots can rarely determine their state from sensor observations, completely observable MDPs (COMDPs) do not work well in many real-world robot planning tasks. • It is more appropriate to maintain a probability distribution over states and to update it using the transition and observation probabilities.
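The belief update mentioned above can be written as b'(s') ∝ O(o | s', a) Σ_s T(s' | s, a) b(s). A small sketch, assuming the transition and observation models are stored as nested dictionaries (the data layout is an assumption):

```python
def update_belief(belief, action, observation, T, O):
    # Bayesian belief update: b'(s') ∝ O(o | s', a) * sum_s T(s' | s, a) * b(s).
    # belief: dict state -> probability; T[a][s][s2], O[a][s2][o]: assumed layouts.
    new_belief = {}
    for s2 in belief:
        predicted = sum(T[action][s][s2] * belief[s] for s in belief)
        new_belief[s2] = O[action][s2][observation] * predicted
    total = sum(new_belief.values())
    # Normalize; total can only be zero if the observation is impossible under the model.
    return {s2: p / total for s2, p in new_belief.items()} if total > 0 else new_belief
```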

  19. Sensor Grouping • Khepera has 8 sensors that report distance values between 0 and 1024. • The observations are reduced to 16: • The sensors are grouped in pairs to make 4 pseudo sensors, • Thresholding is applied to the pseudo sensor outputs. • The POMDP planner is now robust to single sensor failures.
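One way the 8 raw readings could be collapsed into one of 16 observations is sketched below; the pairing order, the threshold value, and the use of max within a pair are assumptions:

```python
THRESHOLD = 512  # assumed cutoff on the raw distance reading

def observation_from_sensors(readings):
    # Pair the 8 sensors into 4 pseudo sensors and threshold each pair,
    # giving a 4-bit observation (16 possible values).
    assert len(readings) == 8
    bits = []
    for left, right in zip(readings[0::2], readings[1::2]):
        # Using max within a pair means one dead sensor (stuck at 0)
        # does not silence the whole pseudo sensor.
        bits.append(1 if max(left, right) >= THRESHOLD else 0)
    # Pack the 4 bits into a single observation index in [0, 15].
    return (bits[0] << 3) | (bits[1] << 2) | (bits[2] << 1) | bits[3]
```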

  20. Solving a POMDP • The Witness algorithm is used to compute the optimal policy for the POMDP. • Witness does not scale well with the size of the state space.

  21. Environment and State Space • 64 possible states for the robot: • 16 discrete positions. • The robot's heading is discretized into the four compass directions. • Sensor information is reduced to 4 bits by combining the sensors in pairs and thresholding. • The LP solution required several days on a Sun Ultra 2 workstation.
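The 64-state space factors as 16 positions × 4 headings, which could be indexed as in this small sketch (the ordering is an assumption):

```python
HEADINGS = ("N", "E", "S", "W")

def state_index(position, heading):
    # Map (position in 0..15, compass heading) to a state id in 0..63.
    return position * 4 + HEADINGS.index(heading)
```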

  22. Environment and State Space

  23. Interface Between Layers • The POMDP uses the current belief state to select the low-level behavior to activate. • The implementation tracks the state with the highest probability: the most likely current state. • If the most likely current state changes to the state that the POMDP wants, a reward of +1 is generated; otherwise, –1.
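A sketch of that reward signal, assuming the belief is a dict over states and the planner exposes the state it is trying to reach (names and exact trigger condition are illustrative):

```python
def behavior_reward(belief, desired_state):
    # Track the most likely state under the current belief and reward the
    # active behavior with +1 if it is the state the planner wanted, else -1.
    most_likely = max(belief, key=belief.get)
    return 1 if most_likely == desired_state else -1
```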

  24. Hypothesis • Since RLPOMDP is adaptive, the author expects the overall performance to degrade gracefully as sensors and actuators gradually fail.

  25. Evaluation • State 13 is the goal state. • The POMDP state transition and observation probabilities are obtained by placing the robot in each of the 64 states and taking each action ten times. • With the policy in place, the RL modules are trained in the same way. • For each system configuration (RL or hand-coded), the simulation is started from every position and orientation and the performance is recorded.
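The transition model could be estimated from those trials by simple frequency counting, roughly as in this sketch (data structures are assumptions; the observation model would be estimated the same way):

```python
from collections import Counter, defaultdict

def estimate_transition_model(samples):
    # samples: iterable of (state, action, next_state) triples collected by
    # placing the robot in each state and repeating each action several times.
    counts = defaultdict(Counter)
    for s, a, s_next in samples:
        counts[(s, a)][s_next] += 1
    # Convert counts to conditional probabilities T(s' | s, a).
    T = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T[(s, a)] = {s_next: n / total for s_next, n in outcomes.items()}
    return T
```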

  26. Metrics • Failures during a trial evaluate reliability. • Average steps to the goal assess efficiency.

  27. Gradual Sensor Failure • Battery power is used up; dust accumulates on the sensors.

  28. Intermittent Actuator Failure • The right motor control signal fails intermittently.

  29. Conclusion • RLPOMDP exhibits robust behavior in the presence of sensor and actuator degradation. • Future work: scaling up the problem. • To overcome the scaling problem of RL table lookup, neural nets can be used (learn/forget cycle). • To increase the size of the state space for the POMDP, non-optimal solution algorithms will be investigated. • New behaviors will be added.
