
Presentation Transcript


  1. http://www.youtube.com/watch?v=ucz3JpvDQjk

  2. Learning: uses real experience • Planning: uses simulated experience

  3. Prioritized Sweeping • Priority queue over (s, a) pairs whose estimated value would change non-trivially • When a pair is processed, the effect on its predecessor pairs is computed and they are added to the queue • “All animals are equal, but some are more equal than others.” — George Orwell, Animal Farm
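The slide's two bullets can be sketched as code. This is a minimal tabular version assuming a deterministic learned model and my own names (`model`, `predecessors`, `theta`), not the lecture's notation:

```python
import heapq
from collections import defaultdict

def prioritized_sweeping(model, predecessors, num_updates,
                         gamma=0.95, alpha=0.5, theta=1e-4):
    """Planning with a priority queue over (s, a) pairs.

    model[(s, a)] -> (reward, next_state)    (deterministic model, an assumption)
    predecessors[s] -> iterable of (s', a') pairs whose transition leads to s
    """
    Q = defaultdict(float)
    pq = []            # min-heap of (-priority, s, a)
    on_queue = set()

    def push(s, a, priority):
        # Only queue pairs whose value would change non-trivially.
        if priority > theta and (s, a) not in on_queue:
            heapq.heappush(pq, (-priority, s, a))
            on_queue.add((s, a))

    # Seed the queue with every modeled pair.
    for (s, a), (r, s2) in model.items():
        push(s, a, abs(r + gamma * max_q(Q, model, s2) - Q[(s, a)]))

    for _ in range(num_updates):
        if not pq:
            break
        _, s, a = heapq.heappop(pq)
        on_queue.discard((s, a))
        r, s2 = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max_q(Q, model, s2) - Q[(s, a)])
        # State s's value changed: recompute priorities of its predecessors.
        for (sb, ab) in predecessors.get(s, ()):
            rb, _ = model[(sb, ab)]
            push(sb, ab, abs(rb + gamma * max_q(Q, model, s) - Q[(sb, ab)]))
    return Q

def max_q(Q, model, s):
    """Greedy value of state s under the modeled actions (0 if terminal)."""
    actions = [a for (s_key, a) in model if s_key == s]
    return max((Q[(s, a)] for a in actions), default=0.0)
```

On a small chain, the reward at the goal propagates backward through the queue in priority order rather than by uniform sweeps.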

  4. Prioritized Sweeping vs. Dyna-Q

  5. Trajectory sampling: perform backups along simulated trajectories • Samples from the on-policy distribution • May be able to (usefully) ignore large parts of the state space
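A minimal sketch of trajectory sampling, assuming a deterministic model and an epsilon-greedy behavior policy (both my choices, not stated on the slide); only states actually visited along simulated trajectories are ever backed up:

```python
import random

def trajectory_sampling(model, Q, actions, start_state, num_trajectories,
                        max_steps=100, gamma=0.95, alpha=0.1, epsilon=0.1):
    """Back up Q-values only along trajectories simulated from the model,
    concentrating updates on the on-policy state distribution."""
    for _ in range(num_trajectories):
        s = start_state
        for _ in range(max_steps):
            # Epsilon-greedy behavior policy on the current Q (an assumption).
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q.get((s, b), 0.0))
            if (s, a) not in model:      # unmodeled pair: treat as terminal
                break
            r, s2 = model[(s, a)]
            target = r + gamma * max(Q.get((s2, b), 0.0) for b in actions)
            q = Q.get((s, a), 0.0)
            Q[(s, a)] = q + alpha * (target - q)
            s = s2
    return Q
```

States the policy never reaches contribute no computation, which is the "usefully ignore large parts of state space" point.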

  6. How to back up • Full vs. sample backups • Deep vs. shallow backups • Heuristic search
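The full-vs-sample distinction can be made concrete. A hedged sketch with assumed data structures (`P` mapping (s, a) to a successor distribution, `R` mapping (s, a, s') to reward):

```python
import random

def full_backup(V, P, R, s, actions, gamma=0.95):
    """Full (expected) backup: sweep every successor, weighted by its
    transition probability. Costly per backup, but no sampling noise."""
    return max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                   for s2, p in P[(s, a)].items())
               for a in actions)

def sample_backup(V, P, R, s, a, gamma=0.95):
    """Sample backup: draw one successor from the model. Cheap and noisy;
    it equals the full backup in expectation."""
    succs, probs = zip(*P[(s, a)].items())
    s2 = random.choices(succs, weights=probs)[0]
    return R[(s, a, s2)] + gamma * V[s2]
```

With a deterministic model the two coincide; with many successors, sample backups trade per-update cost for variance.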

  7. R-Max • Questions? • What was interesting?

  8. R-Max (1/2) • Stochastic games • Matrix: strategic form • A sequence of games drawn from some set of games • More general than MDPs (how would you map one to an MDP?) • Implicit vs. explicit exploration? • Epsilon-greedy, UCB1, optimistic value-function initialization • What would a pessimistic initialization do? • Assumptions: the agent recognizes the state it's in and observes the actions taken and payoffs received • Maximin
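Two of these bullets, maximin and optimistic initialization, fit in a few lines. A simplified sketch: this computes the pure-strategy maximin of a strategic-form payoff matrix, whereas R-Max proper uses the mixed-strategy maximin (a linear program); `R_MAX` is the assumed payoff upper bound:

```python
R_MAX = 1.0  # assumed upper bound on one-step payoff (the "R-max" constant)

def maximin(payoff):
    """Pure-strategy maximin for the row player of a payoff matrix:
    the best payoff guaranteed regardless of the opponent's choice.
    (Simplification: R-Max solves the mixed-strategy maximin via an LP.)"""
    return max(min(row) for row in payoff)

def optimistic_payoff(known_payoff=None):
    """Optimistic initialization: every unknown entry is assumed to pay
    R_MAX, which is what drives R-Max's implicit exploration."""
    return known_payoff if known_payoff is not None else R_MAX
```

Because unknown entries look maximally rewarding, the greedy policy is pulled toward under-explored states without any explicit exploration bonus; a pessimistic initialization would instead make the agent avoid them.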

  9. R-Max (2/2) • On each step, either near-optimal payoff or efficient learning • Mixing time: the smallest T after which π guarantees an expected payoff of U(π) − ε • Approaches the optimal policy in time polynomial in T, 1/ε, and ln(1/δ) • PAC • Why are K1 updates needed?
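The mixing-time and PAC bullets can be written out more explicitly. This is an informal paraphrase of the R-Max guarantee from memory, not verbatim from the slide:

```latex
% Mixing time of a policy \pi:
T_{\epsilon}(\pi) \;=\; \min\bigl\{\, T : \text{the expected average payoff of }
\pi \text{ over } T \text{ steps is at least } U(\pi) - \epsilon \,\bigr\}

% R-Max guarantee (informal): with probability at least 1 - \delta, the
% algorithm attains an expected average payoff within 2\epsilon of the best
% value achievable by any policy with mixing time T, after a number of steps
% polynomial in N, k, T, 1/\epsilon, and \ln(1/\delta).
```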

  10. Efficient Reinforcement Learning with Relocatable Action Models • http://bit.ly/1mJ3KDU • RAM • Grid-World Experiment • Algorithm
