Reinforcement Learning Applications in Robotics


Overview

- Policy Gradient Algorithms
- RL for Quadruped Locomotion
- PEGASUS Algorithm
- Autonomous Helicopter Flight
- High Speed Obstacle Avoidance
- RL for Biped Locomotion
- Poincaré-Map RL
- Dynamic Planning
- Hierarchical Approach
- RL for Acquisition of Robot Stand-Up Behavior

RL for Quadruped Locomotion [Kohl04]

- Simple Policy-Gradient Example
- Optimize the gait of the Sony Aibo robot

- Use Parameterized Policy
- 12 Parameters
- Front + rear locus (height, x-pos, y-pos)
- Height of the front and the rear of the body
- …

Quadruped Locomotion

- Policy: no notion of state – open-loop control!
- Start with an initial policy
- Generate t = 15 random policies R_i
- Each R_i perturbs every parameter of the current policy by a randomly chosen +ε_j, 0, or -ε_j
- Evaluate the value of each policy on the real robot
- Estimate the gradient for each parameter
- Update the policy in the direction of the gradient (a minimal sketch follows below)
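A minimal sketch of this gradient-estimation loop, assuming a hypothetical `evaluate` function that stands in for the timed walks on the physical Aibo:

```python
import random

def policy_gradient_step(theta, epsilons, evaluate, t=15, step_size=2.0):
    # Generate t random policies R_i: each parameter j is independently
    # perturbed by +eps_j, 0, or -eps_j.
    policies = [[p + random.choice((-e, 0.0, e))
                 for p, e in zip(theta, epsilons)]
                for _ in range(t)]
    scores = [evaluate(p) for p in policies]  # walking speed per policy

    # Estimate the gradient of each parameter from grouped score averages.
    gradient = []
    for j, e in enumerate(epsilons):
        groups = {-1: [], 0: [], 1: []}
        for p, s in zip(policies, scores):
            key = round((p[j] - theta[j]) / e) if e else 0
            groups[key].append(s)
        avg = {k: (sum(v) / len(v) if v else 0.0) for k, v in groups.items()}
        # If leaving the parameter unchanged scored best, do not move it.
        if avg[0] > avg[1] and avg[0] > avg[-1]:
            gradient.append(0.0)
        else:
            gradient.append(avg[1] - avg[-1])

    # Normalize the gradient and take a fixed-size step along it.
    norm = sum(g * g for g in gradient) ** 0.5 or 1.0
    return [p + step_size * g / norm for p, g in zip(theta, gradient)]
```

One call corresponds to one iteration of the 15-policy evaluation described on the next slide; `step_size` and the per-parameter ε_j are tuning constants.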

Quadruped Locomotion

- The walking speed of a policy is estimated by timing the Aibos – an automated process
- Each policy is evaluated 3 times
- One iteration (3 x 15 evaluations) takes 7.5 minutes

Quadruped Gait: Results

- Better than the best known gait for AIBO!

Pegasus [Ng00]

- Policy gradient algorithms:
- Use a finite time horizon, evaluate the value of the policy
- The value of a policy in a stochastic environment is hard to estimate
- => Stochastic optimization process
- PEGASUS:
- For all policy-evaluation trials, use a fixed set of start states (scenarios)
- Use "fixed randomization" for policy evaluation
- Only works in simulation!
- The same conditions hold for each evaluation trial
- => Deterministic optimization process! (see the sketch below)
- Can be solved by any optimization method
- Commonly used: gradient ascent, random hill climbing
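To illustrate the idea, here is a sketch of a PEGASUS-style value estimate, assuming a hypothetical `simulate` step function; fixing the scenarios and the random seeds makes the returned value a deterministic function of the policy:

```python
import random

def pegasus_value(policy, scenarios, seeds, simulate, horizon=100):
    total = 0.0
    for start_state, seed in zip(scenarios, seeds):
        rng = random.Random(seed)      # fixed randomization per scenario
        state, ret = start_state, 0.0
        for _ in range(horizon):       # finite time horizon
            action = policy(state)
            state, reward = simulate(state, action, rng)
            ret += reward
        total += ret
    # Same scenarios + same seeds on every call => deterministic in
    # `policy`, so gradient ascent or hill climbing can optimize it.
    return total / len(scenarios)
```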

Autonomous Helicopter Flight [Ng04a, Ng04b]

- Autonomously learn to fly an unmanned helicopter
- The helicopter costs $70,000 => catastrophic exploration must be avoided!
- Learn Dynamics from the observation of a Human pilot
- Use PEGASUS to:
- Learn to Hover
- Learn to fly complex maneuvers
- Inverted Helicopter flight

Helicopter Flight: Model Identification

- 12-dimensional state space
- World Coordinates (Position + Rotation) + Velocities
- 4-dimensional actions
- 2 rotor-plane pitch angles
- Rotor blade tilt
- Tail rotor tilt
- Actions are selected every 20 ms

Helicopter Flight: Model Identification

- A human pilot flies the helicopter; the data is logged
- 391 s of training data
- State reduced to 8 dimensions (the position can be estimated from the velocities)
- Learn the transition probabilities P(s_{t+1} | s_t, a_t)
- Supervised learning with locally weighted linear regression (a sketch follows below)
- Model Gaussian noise to obtain a stochastic model
- Implemented a simulator for model validation
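A sketch of one-step dynamics prediction with locally weighted linear regression, the supervised learner named above; `X` (logged state-action rows), `Y` (successor states), and the kernel bandwidth are illustrative assumptions:

```python
import numpy as np

def lwr_predict(query, X, Y, bandwidth=1.0):
    # Gaussian weights: logged samples near the query dominate the fit.
    d2 = np.sum((X - query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    # Weighted least squares with a bias term.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    WX = Xb * w[:, None]
    beta, *_ = np.linalg.lstsq(WX.T @ Xb, WX.T @ Y, rcond=None)
    # Mean prediction of the next state; the learned Gaussian noise is
    # added on top of this to obtain the stochastic model.
    return np.append(query, 1.0) @ beta
```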

Helicopter Flight: Hover Control

- A desired hovering position is given
- Very simple policy class
- The edges (connections) of the policy are chosen using human prior knowledge
- Learning essentially tunes the linear gains of the controller
- Quadratic reward function:
- Punishes deviation from the desired position and orientation (a hedged reconstruction follows below)
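A plausible reconstruction of such a quadratic penalty on position and heading, with placeholder weights α (the exact coefficients and terms are assumptions, not taken from the paper):

```latex
R(s) = -\bigl( \alpha_x (x - x^*)^2 + \alpha_y (y - y^*)^2
             + \alpha_z (z - z^*)^2 + \alpha_\omega (\omega - \omega^*)^2 \bigr)
```

where (x*, y*, z*) is the desired hovering position and ω* the desired heading.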

Helicopter Flight: Hover Control

- Results:
- Better performance than the human expert (shown in red)

Helicopter Flight: Flying maneuvers

- Fly 3 maneuvers from the most difficult RC-helicopter competition class
- Trajectory following:
- Punish the distance from the point projected onto the trajectory
- Additional reward for making progress along the trajectory

Helicopter Flight: Results

- Videos:
- Video 1, Video 2

Helicopter Flight: Inverted Flight

- Very difficult for humans
- Unstable!
- Recollect data for inverted flight
- Use the same methods as before
- Learned in 4 days!
- from data collection to flight experiment
- Stable inverted flight controller
- sustained position

Video

High Speed Obstacle Avoidance [Michels05]

- Obstacle Avoidance with RC car in unstructured Environments
- Estimate depth information from monocular cues
- Learn controller with PEGASUS for obstacle avoidance
- Train in a graphical simulation: does it work in the real environment?

Estimating Depth Information

- Supervised learning
- Divide the image into 16 vertical stripes
- Use features of the stripe and its neighboring stripes as the input vector
- Target values (shortest distance within a stripe) come either from the simulation or from laser range finders
- Linear regression
- Output of the vision system (see the sketch below):
- The angle of the stripe with the largest distance
- The distance of that stripe
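A sketch of this pipeline under the assumptions above: one linear regressor per stripe predicts the nearest-obstacle distance, and the controller steers toward the clearest stripe. `stripe_features`, `weights`, and `stripe_angles` are hypothetical names:

```python
import numpy as np

def vision_output(stripe_features, weights, stripe_angles):
    # Predicted nearest-obstacle distance for each of the 16 stripes;
    # each feature vector already includes the neighboring stripes.
    distances = np.array([f @ w for f, w in zip(stripe_features, weights)])
    best = int(np.argmax(distances))
    # Steering target: the direction of the stripe with the most clearance.
    return stripe_angles[best], distances[best]
```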

Obstacle Avoidance: Control

- Policy: 6 Parameters
- Again, a very simple policy is used
- Reward:
- Punishes deviation from the desired speed and the number of crashes

Obstacle Avoidance: Results

- Using a graphical simulation to train the vision system also works for outdoor environments
- Video

RL for Biped Robots

- Often used only for simplified planar models
- Poincaré-map-based RL [Morimoto04]
- Dynamic Planning [Stilman05]
- Other Examples for RL in real robots:
- Strongly Simplify the Problem: [Zhou03]

Poincaré-Map-Based RL

- Improve walking controllers with RL
- Poincaré map: the intersection points of an n-dimensional trajectory with an (n-1)-dimensional hyperplane
- Predict the state of the biped a half cycle ahead, at fixed phases of the gait

Poincaré Map

- Learn the mapping:
- Input space: x = (d, d′)
- d is the distance between stance foot and body, d′ its velocity
- Action space:
- Modulate the via-points of the joint trajectories
- Function approximator: Receptive Field Weighted Regression (RFWR) with a fixed grid (schematic below)
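RFWR predicts with a normalized blend of local linear models, one per receptive field; schematically (c_k are the fixed grid centers, D_k the receptive-field widths):

```latex
\hat{y}(x) = \frac{\sum_k w_k(x)\, \hat{y}_k(x)}{\sum_k w_k(x)},
\qquad
w_k(x) = \exp\!\left(-\tfrac{1}{2}(x - c_k)^\top D_k\, (x - c_k)\right)
```

where each ŷ_k(x) is a local linear model fit around its center c_k.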

Via Points

- Nominal trajectories from human walking patterns
- The control output is used to modulate the via-points (marked with circles)
- Hand-selected via-points
- Via-points of one joint are incremented by the same amount

Learning the Value function

- Reward function:
- +0.1 if the height of the robot is > 0.35 m
- -0.1 otherwise
- Standard semi-MDP update rules
- Only the value function at the two half-cycle phases needs to be learned
- Model-based actor-critic approach
- A denotes the actor, which outputs the via-point modulation
- Update rule (a generic actor-critic form is sketched below)
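As a hedged placeholder for the update rule, a standard discrete-time actor-critic update with TD error δ, critic V, actor parameters w, and learning rates α, β is:

```latex
\delta_k = r_k + \gamma V(x_{k+1}) - V(x_k), \qquad
V(x_k) \leftarrow V(x_k) + \alpha\, \delta_k, \qquad
w \leftarrow w + \beta\, \delta_k\, \frac{\partial A(x_k; w)}{\partial w}
```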

Dynamic Programming for Biped Locomotion [Stilman05]

- 4-link planar robot
- Dynamic Programming for Reduced Dimensional Spaces
- Manual temporal decomposition of the problem into phases of single and double support
- Use intuitive reductions of the state space for both phases

State-Increment Dynamic Programming

- 8-dimensional state space: the four joint angles and their velocities
- Discretize the state space by a coarse grid
- Use dynamic programming (see the sketch below):
- The interval ε is defined as the minimum time interval required for any state index to change
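A sketch of value iteration over such a discretized grid, assuming a hypothetical `step` function that integrates the robot dynamics over that minimum interval and returns the successor cell and reward:

```python
import numpy as np

def value_iteration(n_states, actions, step, gamma=0.99, sweeps=500):
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(sweeps):
        for s in range(n_states):
            # Backed-up return of each discretized action from this cell.
            returns = [r + gamma * V[s2]
                       for s2, r in (step(s, a) for a in actions)]
            policy[s] = int(np.argmax(returns))
            V[s] = returns[policy[s]]
    return V, policy
```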

State Space Considerations

- Decompose into 2 state-space components (DS + SS)
- There are important distinctions between the dynamics of DS and SS
- Periodic system:
- DP cannot be applied separately to the state-space components
- Establish a mapping between the components for the DS and SS transitions

State Space Reduction

- Double Support:
- Constant step length (df)
- Cannot change during DS
- Can change after robot completes SS
- Equivalent to 5-bar linkage model
- Entire state space can be described by 2 DoF (use k1 and k2)
- 5-d state space
- 10x16x16x12x12 grid => 368640 States

State Space Reduction

- Single Support
- Compass 2-link Model
- Assume k1 and k2 are constant
- Stance knee angle k1 has small range in human walking
- Swing knee k2 has strong effect on df, but can be prescribed in accordance with h2 with little effect on the robot‘s CoM
- 4-D state space
- 35x35x18x18 grid => 396900 states

State-Space Reduction

- Phase Transitions
- DS to SS transition occurs when the rear foot leaves the ground
- Mapping:
- SS to DS transition occurs when the swing leg makes contact
- Mapping:

Action Space, Rewards

- Use discretized torques
- DS: hip and both knee joints can accelerate the CoM
- Fix hip action to zero to gain better resolution for the knee joints
- Discretize the 2-D action space from +- 5.4 Nm into 7x7 intervals
- SS: only choose the hip torque
- 17 intervals in the range of +- 1.8 Nm
- States x Actions
- 368640x49 + 396900x17 = 24,810,660 cells (!!)
- Reward:

Results

- 11 hours of computation
- The computed policy locates a limit cycle through the space.

Performance under error

- Alter different properties of the robot in simulation
- Do not relearn the policy
- A wide range of disturbances is accepted
- Even if the dynamics model used is incorrect!
- Wide set of acceptable states allows the actual trajectory to be distinct from the expected limit cycle

Learning of a Stand-up Behavior [Morimoto00]

- Learning to stand up with a three-link planar robot.
- 6-D state space
- Angles + Velocities
- Hierarchical Reinforcement Learning
- Task decomposition by Sub-goals
- Decompose task into:
- Non–linear problem in a lower dimensional space
- Nearly-linear problem in a high-dimensional space

Upper-level Learning

- Coarse discretization of postures
- No speed information in the state space (3-D state space)
- Actions: select a new sub-goal (a target posture)

Upper-Level Learning

- Reward Function:
- Reward success of stand-up
- Reward also for the success of a subgoal
- Choosing sub-goals that are easier to reach from the current state is preferred
- Use Q(λ)-learning to learn the sequence of sub-goals (the standard update is shown below)
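For reference, the standard tabular Q(λ) update with eligibility traces e(s, a), learning rate α, and trace-decay parameter λ is:

```latex
\delta_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t), \qquad
Q(s, a) \leftarrow Q(s, a) + \alpha\, \delta_t\, e_t(s, a) \quad \forall s, a
```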

Lower-level learning

- Lower level is free to choose at which speed to reach sub-goal (desired posture)
- 6-D state space
- Use Incremental Normalized Gaussian networks (ING-net) as function approximator
- RBF network with rule for allocating new RBF-centers
- Action Space:
- Torque-Vector:

Lower-level learning

- Reward:
- -1.5 if the robot falls down
- Continuous-time actor-critic learning [Doya99] (the TD error is shown below)
- Actor and critic are both learned with ING-nets
- Control output:
- A combination of a linear servo controller and a non-linear feedback controller
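The continuous-time TD error from [Doya99], which drives both the critic and the actor updates (τ is the time constant of reward discounting):

```latex
\delta(t) = r(t) - \frac{1}{\tau} V(x(t)) + \dot{V}(x(t))
```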

Results:

- Simulation Results
- Hierarchical architecture 2x faster than plain architecture
- Real Robot
- Before Learning
- During Learning
- After Learning
- Learned on average in 749 trials (7/10 learning runs)
- Used on average 4.3 subgoals

The end

- For People who are interested in using RL:
- RL-Toolbox
- www.igi.tu-graz.ac.at/ril-toolbox
- Thank you

Literature

- [Kohl04] Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion, N. Kohl and P. Stone, 2005
- [Ng00] PEGASUS: A policy search method for large MDPs and POMDPs, A. Ng and M. Jordan, 2000
- [Ng04a] Autonomous inverted helicopter flight via reinforcement learning, A. Ng et al., 2004
- [Ng04b] Autonomous helicopter flight via reinforcement learning, A. Ng et al., 2004
- [Michels05] High Speed Obstacle Avoidance using Monocular Vision and Reinforcement Learning, J. Michels, A. Saxena and A. Ng, 2005
- [Morimoto04] A Simple Reinforcement Learning Algorithm For Biped Walking, J. Morimoto and C. Atkeson, 2004

Literature

- [Stilman05] Dynamic Programming in Reduced Dimensional Spaces: Dynamic Planning for Robust Biped Locomotion, M. Stilman, C. Atkeson and J. Kuffner, 2005
- [Morimoto00] Acquisition of Stand-Up Behavior by a Real Robot using Hierarchical Reinforcement Learning, J. Morimoto and K. Doya, 2000
- [Morimoto98] Hierarchical Reinforcement Learning of Low-Dimensional Subgoals and High-Dimensional Trajectories, J. Morimoto and K. Doya, 1998
- [Zhou03] Dynamic Balance of a Biped Robot Using Fuzzy Reinforcement Learning Agents, C. Zhou and Q. Meng, 2003
- [Doya99] Reinforcement Learning in Continuous Time and Space, K. Doya, 1999
