Reinforcement Learning: Eligibility Traces

Presentation Transcript


  1. Reinforcement Learning: Eligibility Traces. Speaker: 虞台文 (Intelligent Multimedia Lab, Department of Computer Science and Engineering, Tatung University)

  2. Content • n-step TD Prediction • Forward View of TD(λ) • Backward View of TD(λ) • Equivalence of the Forward and Backward Views • Sarsa(λ) • Q(λ) • Eligibility Traces for Actor-Critic Methods • Replacing Traces • Implementation Issues

  3. Reinforcement Learning: Eligibility Traces. n-Step TD Prediction (Intelligent Multimedia Lab, CSIE, Tatung University)

  4. Elementary Methods: Monte Carlo methods, dynamic programming, and TD(0).

  5. Monte Carlo vs. TD(0) • Monte Carlo: observes the rewards of all steps in an episode • TD(0): observes only one step ahead

  6. n-Step TD Prediction: a spectrum of backups from 1-step TD, through 2-step, 3-step, ..., n-step backups, up to Monte Carlo (full-episode) backups.

  7. n-Step TD Prediction: the corrected n-step truncated return.
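In the standard notation of Sutton and Barto, which this deck appears to follow, the corrected n-step truncated return is:

```latex
% Corrected n-step truncated return: truncate the return after n rewards,
% then correct the truncation with the current value estimate of the state
% reached n steps later.
R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n}
            + \gamma^{n} V_t(s_{t+n})
```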

  8. Backups: backup diagrams for Monte Carlo, TD(0), and n-step TD.

  9. n-Step TD Backup: online vs. offline updating. When updating offline, the new V(s) takes effect only in the next episode.
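Under the same assumed notation, the n-step TD backup and the online/offline distinction can be written as:

```latex
% Increment computed at time t from the corrected n-step return:
\Delta V_t(s_t) = \alpha \bigl[ R_t^{(n)} - V_t(s_t) \bigr]
% Online updating applies each increment immediately:
V_{t+1}(s_t) = V_t(s_t) + \Delta V_t(s_t)
% Offline updating accumulates the increments and applies their sum only at
% the end of the episode, so the new V(s) is first used in the next episode:
V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta V_t(s)
```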

  10. Error Reduction Property (online and offline): the maximum error of the expected n-step return is smaller than the maximum error of the current value estimate V.
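Stated in the assumed standard notation, the error reduction property bounds the worst-case error of the expected n-step return by γ^n times the worst-case error of the current estimate:

```latex
\max_{s} \Bigl| \mathbb{E}_{\pi}\bigl[ R_t^{(n)} \mid s_t = s \bigr] - V^{\pi}(s) \Bigr|
  \;\le\; \gamma^{n} \max_{s} \bigl| V_t(s) - V^{\pi}(s) \bigr|
```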

  11. Example (Random Walk): a five-state chain A, B, C, D, E starting in the middle; all rewards are 0 except a reward of 1 for reaching the right terminal. The true values are V(A) = 1/6, V(B) = 2/6, V(C) = 3/6, V(D) = 4/6, V(E) = 5/6. Consider 2-step TD, 3-step TD, ... Which n is optimal?

  12. start 1 0 0 0 0 1 online offline Average RMSE Over First 10 Trials Example (19-state Random Walk)

  13. Exercise (Random Walk): standard moves; rewards of +1 and -1 at the two terminals.

  14. Exercise (Random Walk): standard moves; rewards of +1 and -1 at the two terminals. • Evaluate the value function for the random policy. • Approximate the value function using n-step TD (try different n's and λ's) and compare their performance; a sketch follows below. • Find the optimal policy.
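A minimal sketch of how the prediction part of this exercise could be coded with tabular n-step TD; the chain length, reward placement, start state, discount, and uniform random policy are illustrative assumptions rather than the slide's exact specification:

```python
import numpy as np

def n_step_td(n, alpha, n_states=19, episodes=10, gamma=1.0, seed=0):
    """Tabular n-step TD prediction on a simple random-walk chain."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states + 2)              # indices 0 and n_states+1 are terminal
    for _ in range(episodes):
        states = [(n_states + 1) // 2]      # start in the middle of the chain
        rewards = [0.0]                     # dummy entry so rewards[t] is the reward at time t
        T = float('inf')                    # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                s_next = states[t] + rng.choice((-1, 1))   # uniform random policy
                r = 1.0 if s_next == n_states + 1 else (-1.0 if s_next == 0 else 0.0)
                states.append(s_next)
                rewards.append(r)
                if s_next in (0, n_states + 1):
                    T = t + 1
            tau = t - n + 1                 # time whose state estimate is updated now
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:             # correct the truncated return with V
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V[1:-1]                          # estimates for the non-terminal states

print(n_step_td(n=4, alpha=0.1))
```

Sweeping n and the step size with this kind of loop, and plotting RMSE against known true values, reproduces the style of comparison shown on the previous slides.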

  15. Reinforcement Learning: Eligibility Traces. The Forward View of TD(λ) (Intelligent Multimedia Lab, CSIE, Tatung University)

  16. Averaging n-step Returns • We are not limited to using a single n-step TD return as the backup target • We can instead back up toward an average of n-step returns whose weights sum to 1; this still counts as one backup (example below).
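A compound backup of this kind, using the textbook's example weights (assumed here), averages a 2-step and a 4-step return; any nonnegative weights summing to 1 are valid:

```latex
R_t^{\text{avg}} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}
```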

  17. TD()  -Return • TD() is a method for averaging all n-step backups • weight by n1(time since visitation) • Called-return • Backup using -return: w1 w2 w3 wTt 1

  18. TD()  -Return • TD() is a method for averaging all n-step backups • weight by n1(time since visitation) • Called-return • Backup using -return: w1 w2 w3 wTt

  19. Forward View of TD() A theoretical view

  20. TD() on the Random Walk

  21. Reinforcement Learning: Eligibility Traces. The Backward View of TD(λ) (Intelligent Multimedia Lab, CSIE, Tatung University)

  22. Why the Backward View? • The forward view is acausal: not implementable • The backward view is causal: implementable • In the offline case, it achieves the same result as the forward view.

  23-25. Eligibility Traces • Each state s is associated with an additional memory variable, its eligibility trace e_t(s), defined by the accumulating update given below.
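The standard accumulating-trace definition, assumed here and consistent with the later contrast against replacing traces, is:

```latex
% Accumulating eligibility trace: every trace decays by gamma*lambda each step,
% and the trace of the state visited at time t is incremented by 1.
e_t(s) =
\begin{cases}
  \gamma \lambda \, e_{t-1}(s)      & \text{if } s \neq s_t \\
  \gamma \lambda \, e_{t-1}(s) + 1  & \text{if } s = s_t
\end{cases}
```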

  26. Eligibility ≈ Recency of Visiting • At any time, the traces record which states have recently been visited, where "recently" is defined in terms of γλ. • The traces indicate the degree to which each state is eligible for undergoing learning changes should a reinforcing event occur. • Reinforcing event: the moment-by-moment 1-step TD errors.

  27. Reinforcing Event: the moment-by-moment 1-step TD error.
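In the assumed standard notation, the 1-step TD error is:

```latex
\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)
```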

  28. TD() Eligibility Traces Reinforcing Events Value updates

  29. Online TD()

  30. Backward View of TD()

  31. Backward View vs. MC & TD(0) • Setting λ to 0, we get TD(0) • Setting λ to 1, we get MC, but in a better way: • TD(1) can be applied to continuing tasks • it works incrementally and online (instead of waiting until the end of the episode) • How about 0 < λ < 1?

  32. Reinforcement Learning: Eligibility Traces. Equivalence of the Forward and Backward Views (Intelligent Multimedia Lab, CSIE, Tatung University)

  33. Offline TD()’s Offline Forward TD()  -Return Offline Backward TD()

  34-35. Forward View = Backward View: summed over an episode, the forward-view updates equal the backward-view updates for every state (see the proof).
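In the assumed standard notation, the offline equivalence states that the episode total of backward-view updates for any state equals the episode total of forward-view (λ-return) updates for that state:

```latex
% I_{s s_t} is 1 if s = s_t and 0 otherwise.
\sum_{t=0}^{T-1} \alpha \, \delta_t \, e_t(s)
  \;=\;
\sum_{t=0}^{T-1} \alpha \bigl[ R_t^{\lambda} - V_t(s_t) \bigr] I_{s s_t}
\qquad \text{for all } s
```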

  36. Offline -return (forward) Online TD() (backward) Average RMSE Over First 10 Trials TD() on the Random Walk

  37. Reinforcement Learning: Eligibility Traces. Sarsa(λ) (Intelligent Multimedia Lab, CSIE, Tatung University)

  38. Sarsa() • TD()  • Use eligibility traces for policy evaluation • How can eligibility traces be used for control? • Learn Qt(s, a) rather than Vt(s).

  39. Sarsa() Eligibility Traces Reinforcing Events Updates

  40. Sarsa()

  41. Sarsa()  Traces in Grid World • With one trial, the agent has much more information about how to get to the goal • not necessarily the best way • Considerably accelerate learning

  42. Reinforcement Learning: Eligibility Traces. Q(λ) (Intelligent Multimedia Lab, CSIE, Tatung University)

  43. Q-Learning • An off-policy method: it breaks from the greedy policy from time to time to take exploratory actions • because of this, a simple eligibility trace cannot be implemented easily • How can eligibility traces be combined with Q-learning? • Three methods: • Watkins's Q(λ) • Peng's Q(λ) • Naïve Q(λ)

  44. Watkins's Q(λ): estimation policy (e.g., greedy) vs. behavior policy (e.g., ε-greedy); the backup follows the greedy path only up to the first non-greedy action.

  45. Watkins's Q(λ): how to define the eligibility traces? Backups cover two cases: • Case 1: both the behavior and estimation policies take the greedy path. • Case 2: the behavior path takes a non-greedy action before the episode ends.
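A standard rendering of Watkins's Q(λ), assumed here, keeps traces alive only while the behavior policy follows the greedy (estimation) policy and cuts them to zero at the first non-greedy action:

```latex
% Trace: decays along the greedy path, is zeroed after an exploratory action,
% and is incremented for the state-action pair actually visited.
e_t(s,a) = I_{s s_t} I_{a a_t} +
\begin{cases}
  \gamma \lambda \, e_{t-1}(s,a) & \text{if } Q_{t-1}(s_t, a_t) = \max_{a} Q_{t-1}(s_t, a) \\
  0                              & \text{otherwise}
\end{cases}
% Reinforcing event backs up toward the greedy (estimation-policy) action:
\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)
% Update every state-action pair:
Q_{t+1}(s,a) = Q_t(s,a) + \alpha \, \delta_t \, e_t(s,a)
```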

  46. Watkins's Q()

  47. Watkins's Q()

  48. Peng's Q() • Cutting off traces loses much of the advantage of using eligibility traces. • If exploratory actions are frequent, as they often are early in learning, then only rarely will backups of more than one or two steps be done, and learning may be little faster than 1-step Q-learning. • Peng's Q() is an alternate version of Q() meant to remedy this.

  49. Peng's Q(λ): backups. Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3). • Never cuts traces • Backs up toward the max action except at the end • The book reports that it outperforms Watkins's Q(λ) and performs almost as well as Sarsa(λ) • Disadvantage: difficult to implement.

  50. Peng's Q(λ). See Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3), for the notation.
