
Meeting 9 - RL Wrap Up and B&W Tournament


Presentation Transcript


  1. Meeting 9 - RL Wrap Up and B&W Tournament • Course logistics notes • Reinforcement Learning Wrap Up (30 minutes) • Temporal Difference Learning and Variants • Examples • Continuous state-action spaces • Tournament time – move to 5110

  2. Course Logistics • Assignment 1: • Tournament code was due several minutes ago • Final code and paper are due Thursday; the extension is a result of last weekend's McGill open house, which prevented lab access for some • Tournament after a short conclusion of RL

  3. Value Estimation and Greedy Policy Improvement: Exercise • The exercise was posted in elaborated form on Friday 27 January – See the course website. • The exercise is due on Tuesday 7 February at 16h00. • This will count as one quiz toward your grade.

  4. Reinforcement Learning

  5. RL – Setting

  6. Value and Action-Value Functions • Value V(s) of a state under the policy: • Action-Value Q(s,a): take any action a and follow the policy thereafter
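In the usual discounted-return notation (discount factor γ, expectation over trajectories generated by the policy π), these definitions read:

```latex
V^{\pi}(s)   = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right].
```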

  7. Generalized Policy Iteration (for policy improvement) • Iterative cycle of value estimation and improvement of policy:
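A minimal tabular sketch of this cycle, using a made-up two-state, two-action MDP (the transition probabilities, rewards, and sweep counts below are arbitrary choices for illustration, not from the slides):

```python
import numpy as np

# Hypothetical toy MDP: P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

policy = np.zeros(2, dtype=int)   # arbitrary initial deterministic policy
V = np.zeros(2)

for _ in range(50):               # generalized policy iteration cycle
    # 1) Value estimation: Bellman expectation sweeps for the current policy.
    for _ in range(20):
        V = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                      for s in range(2)])
    # 2) Policy improvement: act greedily with respect to the current value estimate.
    Q = R + gamma * np.einsum('sap,p->sa', P, V)
    policy = Q.argmax(axis=1)

print("greedy policy:", policy, "state values:", V.round(2))
```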

  8. Value Estimation via Temporal Difference Learning • Idea: use the sampling idea of Monte-Carlo, but instead of adjusting V(st) to better match the observed return Rt, use a revised (bootstrapped) estimate of the return: rt+1 + γV(st+1) • The value function update formula becomes: V(st) ← V(st) + α [rt+1 + γV(st+1) − V(st)] • Note: if V(st+1) were a perfect estimate, this would be a DP update • This value estimation method is called Temporal-Difference (TD) learning
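A minimal sketch of this tabular TD(0) update in code, assuming a hypothetical episodic environment with reset() and step(action) returning (next_state, reward, done), and a callable behaviour policy; all of these names are placeholders, not from the slides:

```python
from collections import defaultdict

def td0_value_estimation(env, policy, episodes=500, alpha=0.1, gamma=0.9):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])  # bootstrapped return estimate
            V[s] += alpha * (target - V[s])                     # move V(s) toward the target
            s = s_next
    return V
```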

  9. Temporal-Difference (TD) Learning

  10. TD Learning: Advantages • No model of the environment (rewards, probabilities) is needed • TD only needs experience with the environment. • On-line, incremental learning • Both TD and MC converge. TD converges faster • Learn before knowing final outcome • Less memory and peak computation required

  11. TD for learning action values ( “Q-Learning”) • A simple ansatz provides a TD version of action-value learning: • Bellman equation for Qπ: • Dynamic programming update for Q: • TD update for Q(s,a):
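In the standard (Watkins) form, the Q-learning TD update is Q(st,at) ← Q(st,at) + α [rt+1 + γ maxa Q(st+1,a) − Q(st,at)]. A minimal tabular sketch under the same hypothetical environment interface as above:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)  # keyed by (state, action)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # off-policy TD update
            s = s_next
    return Q
```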

  12. On-Policy SARSA learning • Policy improvement: at each time-step, choose the action mostly greedily (e.g. ε-greedy) with respect to Q • After receiving the reward, seeing st+1, and choosing at+1, update the action values according to the action-value TD formula: Q(st,at) ← Q(st,at) + α [rt+1 + γ Q(st+1,at+1) − Q(st,at)]
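A minimal tabular SARSA sketch under the same hypothetical environment interface; note the target uses the action actually chosen at st+1 rather than a max, which is what makes it on-policy:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular on-policy SARSA with an epsilon-greedy policy."""
    Q = defaultdict(float)

    def choose(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda act: Q[(s, act)])

    for _ in range(episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = None if done else choose(s_next)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # on-policy TD update
            s, a = s_next, a_next
    return Q
```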

  13. Grid World – Example • Goal – prey runs about at random • Agent – predator chasing the prey

  14. Grid World Example Pursuit before learning

  15. Grid World Example Upon further learning trials

  16. Grid World Example Learned pursuit task

  17. Learning the pursuit behavior

  18. Grid World Example State space and learned value function

  19. Grid World Example • State sequence st • Action sequence at • Reward sequence rt • Value sequence V(st) • TD error (delta) sequence δt = rt+1 + γV(st+1) − V(st)

  20. Continuous (or large) state spaces • Previously implied table implementation: Value[state] • Impractical for continuous or large (multi-dimensional) state spaces • Examples: Chess (~10^43), Robot manipulator (continuous) • B&W state space size? • Both storage problems and generalization problems • Generalization problem: Agent will not have a chance to explore most states of a sufficiently large state space

  21. State Space Generalization: Approaches • Quantize continuous state spaces • Circumvents the generalization problem: forces a small number of states • Quantization of continuous state variables, perhaps coarse • Example: angle in the cart-pole problem quantized to (-90, -30, 0, 30, 90)° • Tilings: impose overlapping grids • More general approach: function approximation • Simple case: represent V(s) with a set of weights wi and basis functions fi(s) • V(s) = w1 f1(s) + w2 f2(s) + ... + wn fn(s) (a code sketch follows below) • More refined methods of function approximation (e.g. neural networks) in coming weeks • Added benefit: V(s) generalizes to predict the value of states not yet visited (interpolating between their values, for example)
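A minimal sketch of TD(0) value estimation with this linear form (semi-gradient TD(0)); the one-dimensional state, the Gaussian basis functions, and the (s, r, s', done) transition tuples are assumptions made for illustration:

```python
import numpy as np

# Hypothetical basis functions for a 1-D state in [0, 1]: a bias term plus Gaussian bumps.
CENTERS = np.linspace(0.0, 1.0, 5)

def features(s):
    return np.concatenate(([1.0], np.exp(-((s - CENTERS) ** 2) / 0.02)))

def td0_linear(transitions, alpha=0.05, gamma=0.9):
    """Semi-gradient TD(0) with V(s) = w . f(s):
    w <- w + alpha * [r + gamma*V(s') - V(s)] * f(s)."""
    w = np.zeros(len(features(0.0)))
    for s, r, s_next, done in transitions:
        v_next = 0.0 if done else w @ features(s_next)
        delta = r + gamma * v_next - w @ features(s)   # TD error under the linear model
        w += alpha * delta * features(s)               # gradient of V(s) w.r.t. w is f(s)
    return w
```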

  22. Robot that learns to stand

  23. Robot that learns to stand

  24. Robot that learns to stand

  25. Robot that learns to stand • No static solution to the stand-up problem here: the robot must use momentum to stand up • RL used: • Two-layer hierarchy • TD learning for action values Q(s,a) applied to plan a sequence of sub-goals that may lead to success • A continuous version of TD learning applied to learn to achieve the sub-goals • Robot trained for several hundred trials in simulation plus 100 or so trials on the physical robot • Details: • J. Morimoto and K. Doya, “Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning”, Robotics and Autonomous Systems 36 (2001)

  26. Robot that learns to stand • After several hundred attempts in simulation

  27. Robot that learns to stand • After ~100 additional trials on the physical robot

  28. Wrap up

  29. Wrap up • Next time: • Brief overview of planning in AI • End of 1st section of course devoted to problem solving, search, and RL

  30. Wrap up • Required readings • Russell and Norvig • Chapter 11: Planning • Acknowledgements: • Rich Sutton, Doina Precup, RL materials used here • Kenji Doya, Standing robot materials

  31. Inaugural B&W Computer Tournament • Number of competitors? • Duration of typical game? • t ≈ 50 (total) moves × 10 sec / move ≈ 8 minutes • Stage 1: • Round-robin play • 3 games against randomly selected opponents • t ≈ 35 minutes • Top 8 agents advance. Scoring: Draw = 0, Win = +1, Loss = -1. • Stage 2: • Single elimination seeded bracket play: (((1 8),(4 5)),((2 7),(3 6))) • Top four competitors receive bonus (will deal fairly with drawn agents) • Draws: game drawn after 50 (total) moves, or by referee decision
