


Presentation Transcript


  1. Abstract • LSPI (Least-Squares Policy Iteration) works well for value function approximation • The Gaussian kernel is a popular choice of basis function, but it cannot treat discontinuities well • We propose a new type of basis function, the Geodesic Gaussian Kernel • We apply our method to robot control tasks

  2. Maze Problem • Task: guide a robot to the goal from any starting position, choosing among the actions up, down, left, and right • Reward: +1 on reaching the goal, 0 otherwise • Condition: no supervision other than the reward at the goal • Goal: select the optimal action in each position

  3. Markov Decision Process (MDP) • A model consisting of {S, A, T, R} • S: finite set of states, e.g. si = (xi, yi), i = 1, 2, 3, … • A: finite set of actions, e.g. up, down, left, right • T: transition function, specifying the next state s' reached from state s under action a • R: immediate reward function • Assume the MDP is given or can be estimated from data • Policy function π: specifies the action to take in each state • Goal: learn a good policy function from the MDP
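For concreteness, the following is a minimal Python sketch of such a grid-maze MDP. The grid size, wall layout, and goal position are illustrative assumptions, not the maze used in the slides.

```python
# Hypothetical grid-maze MDP {S, A, T, R}; layout chosen only for illustration.
WIDTH, HEIGHT = 5, 5
GOAL = (4, 4)
WALLS = {(2, 1), (2, 2), (2, 3)}          # a partition with a single opening

STATES = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)
          if (x, y) not in WALLS]
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def transition(s, a):
    """T: deterministic transition -- move one cell, or stay if blocked."""
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return nxt if nxt in STATES else s

def reward(s, a):
    """R: +1 for reaching the goal, 0 otherwise."""
    return 1.0 if transition(s, a) == GOAL else 0.0
```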

  4. Reinforcement Learning (RL) • Iterating steps 1 and 2 below (policy iteration) gives the optimal policy π 1. Evaluate the action-value function Q^π(s,a): the discounted sum of future rewards obtained when taking action a in state s and following π thereafter (Sutton, 1998), Q^π(s,a) = E[ Σ_{t=1}^∞ γ^(t-1) r(s_t, a_t) | s_1 = s, a_1 = a ], where r(s,a) is the immediate reward for taking action a in state s and γ is the discount factor (0 < γ < 1) 2. Update the policy greedily with respect to Q^π • Problem: Q^π(s,a) cannot be evaluated directly
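Step 2 (the policy update) is a greedy argmax over the current Q estimate. A minimal sketch, assuming Q is stored as a dictionary keyed by (state, action) pairs; `states` and `actions` are the containers from the MDP sketch above.

```python
def improve_policy(Q, states, actions):
    """Greedy policy improvement: pick the action maximizing Q in each state."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```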

  5. Bellman Equation • Q^π(s,a) can be evaluated through the recursive form Q^π(s,a) = R(s,a) + γ Σ_{s'} T(s'|s,a) Q^π(s', π(s')) • Problem: the number of parameters (one value per state-action pair) becomes very large in large state and action spaces • Slow learning • Overfitting
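In the tabular case the recursion can be solved by repeated Bellman backups. A minimal sketch for a deterministic MDP, reusing the hypothetical `transition` and `reward` functions from the maze sketch; `pi` maps each state to an action.

```python
def evaluate_q(pi, states, actions, transition, reward, gamma=0.9, n_sweeps=200):
    """Iterate Q(s,a) <- r(s,a) + gamma * Q(s', pi(s')) with s' = T(s,a)."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_sweeps):
        for s in states:
            for a in actions:
                s_next = transition(s, a)
                Q[(s, a)] = reward(s, a) + gamma * Q[(s_next, pi[s_next])]
    return Q
```

Note that the dictionary Q holds one parameter per state-action pair, which is exactly the scaling problem the slide points out.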

  6. Least-Squares Policy Iteration (Lagoudakis and Parr, 2005) • Linear architecture: Q̂(s,a; w) = Σ_{i=1}^K w_i φ_i(s,a), where the φ_i(s,a) are fixed basis functions, w is the vector of learned weights, and K is the number of basis functions • Learn w so as to optimally approximate the Bellman equation in the least-squares sense • The number of learning parameters is reduced dramatically (from one per state-action pair to K) • Problem: how do we choose φ_i(s,a)?
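The policy-evaluation core of LSPI can be written as LSTD-Q: accumulate a K×K linear system from sampled transitions (s, a, r, s') and solve it for the weights w. The sketch below is only an illustration of that idea, not the authors' implementation; `phi`, `samples`, and `pi` are assumed interfaces.

```python
import numpy as np

def lstdq(samples, phi, pi, gamma=0.9):
    """Fit w so that Q_hat(s,a) = phi(s,a).T @ w approximates the Bellman equation.

    samples : list of (s, a, r, s_next) transition tuples
    phi     : maps (state, action) to a length-K feature vector (numpy array)
    pi      : current policy, maps a state to an action
    """
    K = len(phi(samples[0][0], samples[0][1]))
    A = np.zeros((K, K))
    b = np.zeros(K)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, pi(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A + 1e-6 * np.eye(K), b)  # small ridge term for stability
```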

  7. Popular Choice: Gaussian Kernel • φ(s) = exp( − ED(s, s_c)² / (2σ²) ), where ED is the Euclidean distance and s_c is the kernel centre state • Bell shape centered on s_c • Smooth surface • Problem: when a kernel is placed near the partition, the Gaussian tail goes over the partition
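A minimal implementation of this basis function; the width σ is a free parameter, not fixed by the slide. In LSPI such a state kernel is typically copied into the feature block of the chosen action to form φ_i(s,a).

```python
import numpy as np

def ordinary_gaussian_kernel(s, s_c, sigma=1.0):
    """Bell shape centred on s_c, using the plain Euclidean distance ED(s, s_c)."""
    ed = np.linalg.norm(np.asarray(s, dtype=float) - np.asarray(s_c, dtype=float))
    return np.exp(-ed**2 / (2.0 * sigma**2))
```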

  8. Value Function with Discontinuities • Value function approximated by 20 randomly located ordinary Gaussian kernels, shown in log scale alongside the optimal value function • Less accurate around the partition • Undesired policies are obtained around the partition

  9. Aim of This Research • Gaussian kernels are not suited to approximating discontinuous value functions • The value function is smooth along the maze but discontinuous across the partition • Goal: propose a new kernel, the Geodesic Gaussian Kernel, based on the state-space structure

  10. Gaussian Kernels on Graph • Ordinary Gaussian: the distance between s and s_c is the Euclidean distance • Geodesic Gaussian: the distance between s and s_c is the shortest-path distance on the state-space graph, computed with Dijkstra's algorithm (see the sketch below)
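A minimal sketch of the construction: run Dijkstra's algorithm from the kernel centre over the state-space graph, then apply the Gaussian to the resulting shortest-path distances. The adjacency-dict representation of `graph` is an assumption made for illustration.

```python
import heapq
import numpy as np

def dijkstra(graph, source):
    """Shortest-path distances from `source`; graph = {node: [(neighbour, edge_cost), ...]}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def geodesic_gaussian_kernel(graph, s_c, sigma=1.0):
    """Gaussian of the geodesic (shortest-path) distance from the centre state s_c.

    Because the distance is measured along the graph, the tail cannot leak
    across a partition: states on the other side are far away along the graph.
    """
    sp = dijkstra(graph, s_c)
    return {s: np.exp(-d**2 / (2.0 * sigma**2)) for s, d in sp.items()}
```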

  11. Example of Kernels • Geodesic Gaussian vs. ordinary Gaussian, each centered at s_c • The tail of the Geodesic Gaussian does not go across the partition

  12. Value Function by Geodesic Gaussian • Value function approximated by 20 randomly located Geodesic Gaussian kernels, shown in log scale alongside the optimal value function • Accurate around the partition • Desired policies are obtained around the walls

  13. Experimental Results • Mazes: Sutton's maze and a three-room maze • Performance measure: fraction of optimal states, averaged over 100 runs

  14. Discussions • Ordinary Gaussian: • A large width suffers from the tail problem • A small width avoids the tail problem, but is less smooth along the state space • Geodesic Gaussian (with a rather large width): • Smooth along the state space, while the discontinuity across the partitions is preserved

  15. Arm Robot Control • 2-DOF robot arm • Task: lead the hand to the apple • Reward: +1 on reaching the apple, 0 otherwise • Figure: the arm's state space

  16. Learned Value Functions by Ordinary Gaussian • The learned value function is smooth over the obstacle

  17. Learned Value Functions by Geodesic Gaussian • The learned value function is smooth along the state space

  18. Summary of Results (average over 30 runs)

  19. Khepera Robot Navigation • The Khepera robot has 8 IR sensors measuring the distance to obstacles (values 0–1030) • Task: explore an unknown maze without collision • 6 actions, with rewards: +10 (a1), +5 (a2/a3), 0 (a4/a5), −4 (a6), −20 (collision)
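Written as code, the reward specification is a small lookup table; the action names a1-a6 follow the slide, while the collision flag is a hypothetical interface.

```python
# Per-action rewards as listed on the slide; a collision overrides them.
ACTION_REWARD = {"a1": 10, "a2": 5, "a3": 5, "a4": 0, "a5": 0, "a6": -4}

def khepera_reward(action, collided):
    return -20 if collided else ACTION_REWARD[action]
```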

  20. Difficulty of the Task • The state space is high-dimensional (8-D) and large (1030^8 possible sensor readings) • The entire state space cannot be explored, so interpolation/extrapolation is needed • State transitions are highly stochastic

  21. State Space and Graph • The graph over the 8-D state space is built with a Self-Organizing Map and projected onto a 2-D subspace for visualization; the partitions are visible in the projection (a simplified graph-construction sketch follows)
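As a simplified stand-in for the SOM-based construction, the sketch below connects each sampled 8-D sensor vector to its k nearest neighbours with Euclidean edge lengths; the resulting adjacency dict can be fed to the Dijkstra-based geodesic kernel sketched earlier. This k-NN substitute is illustrative only and is not the authors' method.

```python
import numpy as np

def knn_graph(samples, k=5):
    """Undirected k-nearest-neighbour graph over sampled states (e.g. 8-D sensor vectors)."""
    X = np.asarray(samples, dtype=float)
    n = len(X)
    edges = set()
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        for j in np.argsort(d)[1:k + 1]:       # k nearest neighbours, skipping self
            edges.add((min(i, int(j)), max(i, int(j))))
    graph = {i: [] for i in range(n)}
    for i, j in edges:
        w = float(np.linalg.norm(X[i] - X[j]))
        graph[i].append((j, w))                # undirected: store both directions
        graph[j].append((i, w))
    return graph
```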

  22. Learned Value Functions by Ordinary Gaussian • When Khepera faces an obstacle, it goes backward (and then forward again): it gets trapped by a local maximum of the learned value function

  23. Learned Value Functions by Geodesic Gaussian • When Khepera faces an obstacle, it makes a turn (and then goes forward)

  24. Summary of Results (average over 30 runs)

  25. Conclusion • Value function approximation: good basis functions are needed • Ordinary Gaussian kernels smooth over discontinuities • Geodesic Gaussian kernels are smooth along the state space while preserving discontinuities across partitions • The graph-based Geodesic Gaussian kernel is promising for high-dimensional continuous problems!
