Machine Learning Techniques For Autonomous Aerobatic Helicopter Flight

Presentation Transcript

  1. Machine Learning Techniques For Autonomous Aerobatic Helicopter Flight Joseph Tighe

  2. Helicopter Setup • XCell Tempest helicopter • Microstrain 3DM-GX1 orientation sensor • Triaxial accelerometers • Rate gyros • Magnetometer • Novatel RT2 GPS

  3. What are some differences between this problem and ones we’ve seen so far? Static Learning vs. Learning Control • Static learning: a fixed training set and testing set • We try to “learn” from the training set to predict the testing set • The task being learned is static (it does not change from one trial to the next) • Learning control: the training set can still be known upfront, but there is no testing set • We are learning how to perform a dynamic task, so we need to be able to adapt to changes mid-task

  4. Helicopter Environment and Controls • To fully describe the helicopter’s “state” mid-flight: • Position: (x, y, z) • Orientation (expressed as a unit quaternion) • Velocity: (x’, y’, z’) • Angular velocity: (ωx, ωy, ωz) • The helicopter can be controlled by: • (u1, u2): Cyclic pitch • (u3): Tail rotor • (u4): Collective pitch angle

  5. What is needed to fly autonomously? • Trajectory • The desired path for the helicopter to follow • Dynamics model • Inputs: the current state and the controls (u1, u2, u3, u4) • Output: a prediction of where the helicopter will be at the next time step • Controller • The application that feeds the helicopter the correct controls to fly the desired trajectory

  6. Trajectory • A path through space that fully describes the helicopter's flight. • It is specified by a sequence of states that contain: • Position: (x, y, z) • Orientation (expressed as a unit quaternion) • Velocity: (x’, y’, z’) • Angular velocity: (ωx, ωy, ωz) • For flips this is relatively simple to encode by hand • Later we will look at a way to learn this trajectory from multiple demonstrations of the same maneuver

  7. Simple Dynamics Model (Car) • What state information is needed? • Position on the ground: (x, y) • Orientation on the ground: (θ) • Speed: (x’, y’) • What are the controls? • Current gear • Accelerator/brake • Steering wheel position • What would the dynamics model do? • Given a state and controls, compute an acceleration vector and an angular acceleration vector
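The car model on this slide can be sketched as a one-step update function. This is an illustrative kinematic sketch only: the state layout, the bicycle-style steering term, and the time step are assumptions, not from the slides.

```python
import math

def car_dynamics(state, controls, dt=0.1):
    """Toy car dynamics model: given a state and controls, predict the next state.

    state    = (x, y, theta, v)   ground position, heading, forward speed
    controls = (accel, steer)     accelerator/brake input, steering angle
    All names here are illustrative placeholders.
    """
    x, y, theta, v = state
    accel, steer = controls
    # Acceleration changes speed; steering turns the heading (bicycle-style).
    v_next = v + accel * dt
    theta_next = theta + v * math.tan(steer) * dt
    # Position advances along the current heading.
    x_next = x + v * math.cos(theta) * dt
    y_next = y + v * math.sin(theta) * dt
    return (x_next, y_next, theta_next, v_next)
```

Rolling this function forward from a start state traces out a trajectory, which is exactly the role the dynamics model plays in the pipeline.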

  8. Helicopter Dynamics Model • Our state and controls are more complicated than in the car example. • There are also many hidden variables that we can’t expect to model accurately. • Air, rotor speed, actuator delays, etc. • Conclusion: this is a much harder problem than the car example, so we’ll have to learn the model

  9. Controller • Given a target trajectory and the current state, compute the best controls for the helicopter. • The controls are (u1, u2, u3, u4) • (u1, u2): Cyclic pitch • (u3): Tail rotor • (u4): Collective pitch angle

  10. Overview of the two approaches • Given one example flight and a target trajectory specified by hand, learn a model and controller that can fly the trajectory • Given a number of example flights of the same maneuver, learn the trajectory, model, and controller that can fly the trajectory

  11. Approach 1: Known Trajectory P. Abbeel, A. Coates, M. Quigley, and A. Ng, An Application of Reinforcement Learning to Aerobatic Helicopter Flight, NIPS 2007

  12. Overview • Pipeline: Data → Dynamics Model; Known Trajectory + Penalty Function → Reward Function; Dynamics Model + Reward Function → Reinforcement Learning → Policy. We want a robot to follow a desired trajectory. Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008

  13. Markov Decision Processes • Modeled as a sextuple (S, A, P(·|·, ·), H, s(0), R) • S: all possible states of our system • A: all possible actions we can perform • P(s’|s, a): the probability that taking action a in state s at time t will lead to state s’ at time t+1 • H: the horizon over which the system will run (not strictly needed) • s(0): the start state • R(a, s, s’): the reward for transitioning from state s to s’ after taking action a. This function can be unique for each time step t.
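The sextuple above can be written down directly as a small container with a sampling step. This is a hedged sketch: the finite, dictionary-based representations of S, A, P, and R are assumptions chosen for illustration, not from the papers.

```python
import random

class MDP:
    """A finite MDP mirroring the sextuple (S, A, P, H, s0, R) on the slide.

    S, A are lists; P maps (s, a) to a dict {s_next: probability};
    R is a callable R(a, s, s_next) returning a reward.
    """
    def __init__(self, S, A, P, H, s0, R):
        self.S, self.A, self.P, self.H, self.s0, self.R = S, A, P, H, s0, R

    def step(self, s, a, rng=random):
        """Sample s' ~ P(.|s, a) and return (s', reward)."""
        nxt = self.P[(s, a)]
        s_next = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        return s_next, self.R(a, s, s_next)
```

A two-state example: `MDP(S=[0, 1], A=["go"], P={(0, "go"): {1: 1.0}}, H=1, s0=0, R=lambda a, s, s2: 5.0)` deterministically transitions 0 → 1 with reward 5.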

  14. Markov Decision Processes • Once we have this model we wish to find a policy, π(s), that maximizes the expected reward • π(s) is a mapping from the set of states S to the set of actions A, with a unique mapping for each time step. • Vπ(s’) is the sum of rewards achieved by following π from s’
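A minimal sketch of evaluating Vπ for a finite-horizon MDP, expanding the expectation over P(s’|s, a) recursively. The dictionary-based transition table and the per-time-step policy table `pi[t][s]` are illustrative assumptions.

```python
def policy_value(P, R, pi, s, t, H):
    """Expected sum of rewards from state s at time t when following policy pi.

    pi[t][s] gives the action at time t (the slide notes the mapping can be
    unique per time step). P[(s, a)] is a dict {s_next: probability}, and
    R(a, s, s_next) is the reward function.
    """
    if t == H:
        return 0.0  # horizon reached: no further reward
    a = pi[t][s]
    # Expectation over next states: immediate reward plus value-to-go.
    return sum(p * (R(a, s, s2) + policy_value(P, R, pi, s2, t + 1, H))
               for s2, p in P[(s, a)].items())
```

On a chain where every step earns reward 1, the value over a horizon of H steps is simply H.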

  15. Back to helicopter modeling • For our problem: • S: the range of orientations and speeds that are allowed • A: our range of control inputs • H: the length of our trajectory • s(0): where the helicopter starts • P(s’|s, a): our dynamics model (unknown) • R(a, s, s’): tied to the desired trajectory (trivially computed) • π(s): our controller (unknown)

  16. Overview • Pipeline: Data → Dynamics Model; Known Trajectory + Penalty Function → Reward Function; Dynamics Model + Reward Function → Reinforcement Learning → Policy. We want a robot to follow a desired trajectory. Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008

  17. Reinforcement Learning • Tries to find a policy that maximizes the long-term reward in an environment, often modeled as an MDP. • First, an exploration phase explores state/action pairs whose transition probabilities are still unknown. • Once the MDP transition probabilities are modeled well enough, an exploitation phase maximizes the sum of rewards over time.
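As a generic illustration of the exploration/exploitation trade-off (not the method used in these papers, which, as the next slides explain, avoid random exploration for safety), here is the standard epsilon-greedy rule:

```python
import random

def epsilon_greedy(Q, s, actions, eps, rng=random):
    """Textbook explore/exploit rule: with probability eps pick a random
    action (explore), otherwise the best-known action under Q (exploit).

    Q maps (state, action) pairs to estimated values; unseen pairs default
    to 0.0. Purely illustrative; not the papers' approach.
    """
    if rng.random() < eps:
        return rng.choice(actions)       # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit
```

With `eps=0` this always exploits; with `eps=1` it always explores. The helicopter problem rules out the random-exploration end of this spectrum, which motivates the apprenticeship approach on the following slides.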

  18. Exploration vs. Exploitation • More exploration gives a more accurate MDP model • More exploitation gives a better policy for the given model • What issues might we have with the exploration stage for our problem? • Aggressive exploration can cause the helicopter to crash

  19. Apprenticeship Learning • Exploration: start with an example flight • Compute a dynamics model and reward function based on the target trajectory and the sample flight, giving you an MDP model • Exploitation: find a controller (policy π) that maximizes this reward • Exploration: fly the helicopter with the current controller and add this data to the sample flight data • If we flew the target trajectory, stop; otherwise go back to step 2
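The steps above can be sketched as a loop. All the callables here are placeholders the caller would supply; none of these names come from the papers.

```python
def apprenticeship_learning(initial_flight, fit_dynamics, find_policy,
                            fly, reached_target, max_iters=20):
    """Sketch of the apprenticeship-learning loop on the slide.

    - fit_dynamics(data)    -> dynamics model learned from all flight data
    - find_policy(model)    -> controller maximizing the reward under the model
    - fly(policy)           -> new flight data gathered with that controller
    - reached_target(flight)-> True once the target trajectory is flown
    """
    data = list(initial_flight)            # step 1: start with an example flight
    for _ in range(max_iters):
        model = fit_dynamics(data)         # step 2: learn dynamics / build MDP
        policy = find_policy(model)        # step 3: exploit the current model
        flight = fly(policy)               # step 4: explore with the controller
        data.extend(flight)                #         and add the new data
        if reached_target(flight):         # step 5: stop when on-trajectory
            return policy, data
    return policy, data
```

The key point the sketch captures: exploration is never random; every new batch of data comes from flying the best controller found so far.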

  20. Dynamics Model • Linear model • We must learn the parameters A, B, C, D, E • g: gravity • b: body coordinate frame • w: Gaussian random variable (noise) • (Figure: the model equations for the forward, sideways, and up/down accelerations.) Figure taken from: P. Abbeel, A. Coates, M. Quigley, and A. Ng, An Application of Reinforcement Learning to Aerobatic Helicopter Flight, NIPS 2007
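Because the model is linear in its parameters, it can be fit from logged flight data by ordinary least squares. A minimal sketch, assuming the state features and controls are stacked into a single design matrix; the layout is illustrative, and the paper fits separate equations for the forward, sideways, and up/down axes.

```python
import numpy as np

def fit_linear_dynamics(X, Y):
    """Least-squares fit of a linear dynamics model.

    X: (n, d) matrix, one row per time step of [state features, controls].
    Y: (n, k) matrix of the observed next-step accelerations.
    Returns the (d, k) parameter matrix minimizing ||X @ theta - Y||^2.
    """
    theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return theta
```

With noise-free synthetic data the fit recovers the true parameters exactly, which makes the routine easy to sanity-check before applying it to real flight logs.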

  21. Sample Flight

  22. Hard Trajectory

  23. Approach 2: Learn Trajectory A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008.

  24. Learning Trajectory and Controller • Pipeline: Data → Dynamics Model; Trajectory + Penalty Function → Reward Function; Dynamics Model + Reward Function → Reinforcement Learning → Policy. We want a robot to follow a desired trajectory. Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008

  25. Key difficulties • Often very difficult to specify trajectory by hand. • Difficult to articulate exactly how a task is performed. • The trajectory should obey the system dynamics. • Use an expert demonstration as trajectory. • But, getting perfect demonstrations is hard. • Use multiple suboptimal demonstrations. Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008

  26. Expert Air Shows

  27. Problem Setup • Given: • Multiple demonstrations of the same maneuver • s: sequence of states • u: control inputs • M: number of demos • Nk: length of demo k, for k = 0..M-1 • Goal: • Find a “hidden” target trajectory of length T

  28. Graphical model • Intended trajectory satisfies dynamics. • Expert trajectory is a noisy observation of one of the hidden states. • But we don’t know exactly which one. (Figure labels: intended trajectory, expert demonstrations, time indices.) Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008

  29. Learning algorithm • Make an initial guess for τ (the time alignment). • Alternate between: • Fix τ. Run EM on the resulting HMM. • Choose a new τ using dynamic programming. If τ is unknown, inference is hard. If τ is known, we have a standard HMM. Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008

  30. Algorithm Overview • Make an initial guess for τ: say an even step size of T/N • E-step: find a trajectory by smoothing the expert demonstrations • M-step: with this trajectory, update the covariances using the standard EM update • E-step: run dynamic time warping to find a τ that maximizes P(z, y), the joint probability of the current trajectory and the expert examples • M-step: find d given τ • Repeat steps 2–5 until convergence

  31. Dynamic Time Warping • Used in speech recognition and biological sequence alignment (Needleman-Wunsch) • Given a distribution of time warps (d), dynamic programming is used to solve for τ. Figure taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008
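A minimal sketch of the generic dynamic-time-warping recurrence; this is the textbook dynamic-programming version, not the paper's warp-distribution variant.

```python
import numpy as np

def dtw_align(a, b):
    """Minimal dynamic-time-warping cost between two 1-D sequences.

    Fills the classic DP table: D[i, j] is the cheapest cost of aligning
    the first i elements of a with the first j elements of b, where each
    step may match, or repeat an element of either sequence.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Allowed moves: diagonal match, or stretch either sequence.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

Two demonstrations of the same maneuver played at slightly different speeds align with near-zero cost, which is exactly the property the time-alignment step exploits.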

  32. Expert Examples Time Aligned

  33. Results for Loops Figure taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008

  34. Details: Drift • The various expert demonstrations tend to drift in different ways and at different times. • Because these drift errors are highly correlated between time steps, Gaussian noise does a poor job of modeling them • Instead, drift is explicitly modeled as a slowly changing translation in space at each time point.

  35. Details: Prior Knowledge • It is also possible to incorporate expert advice or prior knowledge • For example: flips should keep the helicopter's center fixed in space, and loops should lie on a plane in space • This prior knowledge is used as additional constraints in both EM steps of the algorithm

  36. Dynamics Model • Pipeline: Data → Dynamics Model; Trajectory + Penalty Function → Reward Function; Dynamics Model + Reward Function → Reinforcement Learning → Policy. Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008

  37. Standard modeling approach • Collect data • Pilot attempts to cover all flight regimes. • Build a global model of dynamics • (Figure annotation: 3G error!) Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008

  38. Errors aligned over time • Errors observed in the “crude” model are clearly consistent after aligning demonstrations. Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008

  39. New Modeling Approach • The key observation is that the errors in the various demonstrations are consistent. • This can be thought of as revealing the hidden variables discussed earlier: • Air, rotor speed, actuator delays, etc. • We can use this error to correct a “crude” model.

  40. Time-varying Model • f: the “crude” model • β: the bias, computed as the difference between the crude model’s predicted trajectory and the target trajectory in a small window of time around t • w: Gaussian noise
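A hedged sketch of the bias term: average the crude model's error against the target trajectory in a small window around time t. The function names and the window size are illustrative assumptions, not from the paper.

```python
import numpy as np

def local_bias(pred, target, t, window=5):
    """Bias for a time-varying model: mean difference between the target
    trajectory and the crude model's predictions in a window around t.
    The window size is an arbitrary illustrative choice."""
    lo, hi = max(0, t - window), min(len(pred), t + window + 1)
    return np.mean(np.asarray(target[lo:hi]) - np.asarray(pred[lo:hi]), axis=0)

def corrected_prediction(f, state, control, bias):
    """Time-varying model: the crude model f plus the local bias term
    (the Gaussian noise term w is omitted from this deterministic sketch)."""
    return f(state, control) + bias
```

Because the bias is recomputed per time window, the resulting dynamics model is specific to each portion of the maneuver rather than global.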

  41. Final • Pipeline: Data → Dynamics Model; Trajectory + Penalty Function → Reward Function; Dynamics Model + Reward Function → Reinforcement Learning → Policy. Slide taken from: A. Coates, P. Abbeel, and A. Ng, Learning for Control from Multiple Demonstrations, ICML 2008

  42. Summary • The trajectory, dynamics model and controller are all learned • The dynamics model is specific to a portion of the maneuver being performed

  43. Compare Two Techniques • Technique 1: • Hand-specified trajectory • Learn a global model and controller • For a new maneuver: an example flight must be given and a new trajectory specified • Technique 2: • Learn the trajectory, a time-varying model, and the controller • For a new maneuver: a couple of example flights must be given, plus 30 min of learning

  44. Autonomous Air Show

  45. Error of Autonomous Air Show