## Machine Learning Techniques For Autonomous Aerobatic Helicopter Flight


Joseph Tighe

**Helicopter Setup**
- XCell Tempest helicopter
- Microstrain 3DM-GX1 orientation sensor
- Triaxial accelerometers
- Rate gyros
- Magnetometer
- Novatel RT2 GPS

**What are some differences between this problem and ones we've seen so far?**

Static learning:
- Set training and testing sets.
- Try to "learn" from the training set to predict the testing set.
- The task we are learning is static (it does not change from one trial to the next).

Learning control:
- A training set can still be known upfront.
- There is no testing set.
- We are learning how to perform a dynamic task, so we need to be able to adapt to changes mid-task.

**Helicopter Environment and Controls**
- To fully describe the helicopter's "state" mid-flight we need:
  - Position: (x, y, z)
  - Orientation (expressed as a unit quaternion)
  - Velocity: (x', y', z')
  - Angular velocity: (ωx, ωy, ωz)
- The helicopter can be controlled by:
  - (u1, u2): pitch
  - (u3): tail rotor
  - (u4): collective pitch angle

**What is needed to fly autonomously?**
- Trajectory: the desired path for the helicopter to follow.
- Dynamics model:
  - Inputs: the current state and the controls (u1, u2, u3, u4).
  - Output: a prediction of where the helicopter will be at the next time step.
- Controller: the application that feeds the helicopter the correct controls to fly the desired trajectory.

**Trajectory**
- A path through space that fully describes the helicopter's flight.
- It is specified by a sequence of states, each containing:
  - Position: (x, y, z)
  - Orientation (expressed as a unit quaternion)
  - Velocity: (x', y', z')
  - Angular velocity: (ωx, ωy, ωz)
- For flips this is relatively simple to encode by hand.
- Later we will look at a way to learn this trajectory from multiple demonstrations of the same maneuver.

**Simple Dynamics Model (Car)**
- What state information is needed?
  - Position on the ground: (x, y)
  - Orientation on the ground: (θ)
  - Speed: (x', y')
- What are the controls?
  - Current gear
  - Accelerator/brake
  - Steering wheel position
- What would the dynamics model do?
  - Given a state and controls, compute an acceleration vector and an angular acceleration vector.

**Helicopter Dynamics Model**
- Our state and controls are more complicated than in the car example.
- There are also many hidden variables that we can't expect to model accurately: air, rotor speed, actuator delays, etc.
- Conclusion: this is a much harder problem than the car example, so we'll have to learn the model.

**Controller**
- Given a target trajectory and the current state, compute the best controls for the helicopter.
- The controls are (u1, u2, u3, u4):
  - (u1, u2): pitch
  - (u3): tail rotor
  - (u4): collective pitch angle

**Overview of the Two Approaches**
- Given one example flight and a target trajectory specified by hand, learn a model and controller that can fly the trajectory.
- Given a number of example flights of the same maneuver, learn the trajectory, the model, and a controller that can fly the trajectory.

**Approach 1: Known Trajectory**
P. Abbeel, A. Coates, M. Quigley, and A. Ng, An Application of Reinforcement Learning to Aerobatic Helicopter Flight, NIPS 2007.

**Overview**
(Pipeline diagram: flight data → dynamics model; known trajectory + penalty function → reward function; reinforcement learning combines these into a policy.) We want a robot to follow a desired trajectory.
Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

**Markov Decision Processes**
- Modeled as a sextuple (S, A, P(·|·,·), H, s(0), R):
  - S: all possible states of our system
  - A: all possible actions we can perform
  - P(s'|s, a): the probability that taking action a in state s at time t leads to state s' at time t+1
  - H: the horizon over which the system will run (not strictly needed)
  - s(0): the start state
  - R(a, s, s'): the reward for transitioning from state s to s' after taking action a
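The sextuple above can be written down directly as code; a minimal sketch of a finite-horizon MDP and policy evaluation (the two-state "hover/flip" example, its transitions, and rewards are invented purely for illustration):

```python
# A finite-horizon MDP as a plain container; states, actions, and rewards
# below are toy values, not the helicopter's.
class MDP:
    def __init__(self, states, actions, P, R, H, s0):
        self.states, self.actions = states, actions
        self.P = P      # P[(s, a)] -> list of (s_next, prob)
        self.R = R      # R(a, s, s_next) -> float
        self.H = H      # horizon (number of time steps)
        self.s0 = s0    # start state

def expected_return(mdp, policy):
    """Expected sum of rewards over the horizon for policy(t, s) -> a."""
    # V[s] holds the expected reward-to-go; computed backwards from the end.
    V = {s: 0.0 for s in mdp.states}
    for t in reversed(range(mdp.H)):
        V = {s: sum(p * (mdp.R(policy(t, s), s, s2) + V[s2])
                    for s2, p in mdp.P[(s, policy(t, s))])
             for s in mdp.states}
    return V[mdp.s0]

# Toy 2-state example: action "go" moves hover -> flip; being in flip pays 1.
P = {("hover", "go"): [("flip", 1.0)], ("hover", "stay"): [("hover", 1.0)],
     ("flip", "go"): [("flip", 1.0)], ("flip", "stay"): [("flip", 1.0)]}
R = lambda a, s, s2: 1.0 if s2 == "flip" else 0.0
toy = MDP(["hover", "flip"], ["go", "stay"], P, R, H=3, s0="hover")
print(expected_return(toy, lambda t, s: "go"))  # 3.0: reward 1 at each of 3 steps
```

Note that the policy takes the time step t as an argument, matching the slide's point that the mapping (and the reward) can differ at each step of the horizon.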
- The reward function can be unique for each time step t.

**Markov Decision Processes**
- Once we have this model, we wish to find a policy π(s) that maximizes the expected reward.
- π(s) is a mapping from the set of states S to the set of actions A, with a unique mapping for each time step.
- V(s') is the sum of rewards achieved by following π from s'.

**Back to Helicopter Modeling**
- For our problem:
  - S: the range of orientations and speeds that are allowed
  - A: our range of control inputs
  - H: the length of our trajectory
  - s(0): where the helicopter starts
  - P(s'|s, a): our dynamics model (unknown)
  - R(a, s, s'): tied to the desired trajectory (trivially computed)
  - π(s): our controller (unknown)

**Overview**
(Pipeline diagram repeated: flight data → dynamics model; known trajectory + penalty function → reward function; reinforcement learning → policy.)
Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

**Reinforcement Learning**
- Tries to find a policy that maximizes the long-term reward in an environment, often modeled as an MDP.
- First, an exploration phase visits state/action pairs whose transition probabilities are still unknown.
- Once the MDP transition probabilities are modeled well enough, an exploitation phase maximizes the sum of rewards over time.

**Exploration vs. Exploitation**
- More exploration gives a more accurate MDP model.
- More exploitation gives a better policy for the given model.
- What issues might we have with the exploration stage for our problem?
- Aggressive exploration can cause the helicopter to crash.

**Apprenticeship Learning**
1. Exploration: start with an example flight.
2. Compute a dynamics model and a reward function based on the target trajectory and the sample flight data, giving you an MDP model.
3. Exploitation: find a controller (policy π) that maximizes this reward.
4. Exploration: fly the helicopter with the current controller and add this data to the sample flight data.
5. If we flew the target trajectory, stop; otherwise go back to step 2.

**Dynamics Model**
- A linear model; we must learn the parameters A, B, C, D, E.
- g: gravity field; b: body coordinate frame; w: Gaussian random variable.
- (The figure shows the model's forward, sideways, and up/down components.)
Figure taken from: P. Abbeel, A. Coates, M. Quigley, and A. Ng, An Application of Reinforcement Learning to Aerobatic Helicopter Flight, NIPS 2007.

**Approach 2: Learn Trajectory**
A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

**Learning Trajectory and Controller**
(Pipeline diagram: flight data → dynamics model; learned trajectory + penalty function → reward function; reinforcement learning → policy.) We want a robot to follow a desired trajectory.
Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

**Key Difficulties**
- It is often very difficult to specify a trajectory by hand: it is hard to articulate exactly how a task is performed, and the trajectory should obey the system dynamics.
- Use an expert demonstration as the trajectory — but getting perfect demonstrations is hard.
- So: use multiple suboptimal demonstrations.
Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

**Problem Setup**
- Given multiple demonstrations of the same maneuver:
  - s: sequences of states
  - u: control inputs
  - M: number of demos
  - Nk: length of demo k, for k = 0..M-1
- Goal: find a "hidden" target trajectory of length T.

**Graphical Model**
- The intended (hidden) trajectory satisfies the dynamics.
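That first property — a hidden intended trajectory that evolves according to the dynamics — can be sketched with a toy model (the 2-state linear dynamics matrix and the noise scale below are invented for illustration, standing in for the learned helicopter model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear dynamics z[t+1] = A @ z[t] + w: a position/velocity integrator
# with small Gaussian process noise. A and the noise scale are made up.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
T = 50

z = np.zeros((T, 2))
z[0] = [0.0, 1.0]                        # start at position 0, unit velocity
for t in range(T - 1):
    w = rng.normal(scale=0.01, size=2)   # process noise
    z[t + 1] = A @ z[t] + w

print(z.shape)   # (50, 2): a hidden trajectory consistent with the dynamics
```

Every state in the sampled sequence is, by construction, (approximately) reachable under the model — which is exactly what the graphical model demands of the intended trajectory.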
- Each expert trajectory is a noisy observation of one of the hidden states — but we don't know exactly which one.
- (The figure shows the intended trajectory, the expert demonstrations, and the time indices that align them.)
Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

**Learning Algorithm**
- If the time alignment τ is unknown, inference is hard; if τ is known, we have a standard HMM.
- So: make an initial guess for τ, then alternate between:
  - Fix τ and run EM on the resulting HMM.
  - Choose a new τ using dynamic programming.
Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

**Algorithm Overview**
1. Make an initial guess for τ, say an even step size of T/N.
2. E-step: find a trajectory by smoothing the expert demonstrations.
3. M-step: with this trajectory, update the covariances using the standard EM update.
4. E-step: run dynamic time warping to find a τ that maximizes P(z, y), the joint probability of the current trajectory and the expert demonstrations.
5. M-step: find the time-warp distribution d given τ.
6. Repeat steps 2–5 until convergence.

**Dynamic Time Warping**
- Used in speech recognition and biological sequence alignment (Needleman–Wunsch).
- Given a distribution of time warps d, dynamic programming is used to solve for τ.
Figure taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

**Results for Loops**
Figure taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

**Details: Drift**
- The various expert demonstrations tend to drift in different ways and at different times.
- Because these drift errors are highly correlated between time steps, Gaussian noise does a poor job of modeling them.
- Instead, drift is explicitly modeled as a slowly changing translation in space for each time point.

**Details: Prior Knowledge**
- It is also possible to incorporate expert advice or prior knowledge.
- For example: flips should keep the helicopter's center fixed in space, and loops should lie on a plane in space.
- This prior knowledge is used as additional constraints in both EM steps of the algorithm.

**Dynamics Model**
(Pipeline diagram: flight data → dynamics model; trajectory + penalty function → reward function; reinforcement learning → policy.)
Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

**Standard Modeling Approach**
- Collect data: the pilot attempts to cover all flight regimes.
- Build a global model of the dynamics.
- Result: 3G error!
Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

**Errors Aligned Over Time**
- The errors observed in the "crude" model are clearly consistent after aligning the demonstrations.
Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

**New Modeling Approach**
- The key observation is that the errors in the various demonstrations are the same.
- This can be thought of as revealing the hidden variables discussed earlier: air, rotor speed, actuator delays, etc.
- We can use this error to correct the "crude" model.

**Time-Varying Model**
- f: the "crude" model.
- A bias term: computed as the difference between the crude model's predicted trajectory and the target trajectory in a small window of time around t.
- Plus a Gaussian noise term.

**Final**
(Pipeline diagram: flight data → corrected dynamics model; learned trajectory + penalty function → reward function; reinforcement learning → policy.)
Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

**Summary**
- The trajectory, the dynamics model, and the controller are all learned.
- The dynamics model is specific to the portion of the maneuver being performed.

**Compare Two Techniques**

Technique 1:
- Hand-specified trajectory.
- Learns a global model and controller.
- For a new maneuver: an example flight must be given and a new trajectory specified.

Technique 2:
- Learns the trajectory, a time-varying model, and a controller.
- For a new maneuver: a couple of example flights must be given, plus about 30 minutes of learning.
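As a closing illustration of the alignment machinery in approach 2, here is a minimal dynamic time warping sketch. It computes only the classic DTW distance between two 1-D sequences (the toy sequences and the absolute-difference local cost are assumptions for illustration; the paper's version warps full state trajectories under a learned warp distribution):

```python
import numpy as np

def dtw(x, y):
    """Classic dynamic-time-warping distance between two 1-D sequences."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)   # D[i, j]: best cost aligning x[:i], y[:j]
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])        # local match cost
            D[i, j] = cost + min(D[i - 1, j],      # x[i-1] repeats against y
                                 D[i, j - 1],      # y[j-1] repeats against x
                                 D[i - 1, j - 1])  # one-to-one match
    return D[n, m]

a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]   # same shape, time-stretched at the start
print(dtw(a, b))                     # 0.0: DTW aligns away the timing difference
print(dtw(a, [2.0, 2.0, 2.0]))       # > 0: genuinely different sequences
```

This captures why DTW suits the demonstrations here: each expert flies the same maneuver at slightly different speeds, and the dynamic program absorbs those timing differences before shape differences are penalized.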