Reinforcement Learning: Eligibility Traces

Presentation Transcript


  1. Reinforcement Learning: Eligibility Traces. Speaker: 虞台文 (Intelligent Multimedia Lab, Department of Computer Science and Engineering, Tatung University)

  2. Content • n-step TD Prediction • Forward View of TD(λ) • Backward View of TD(λ) • Equivalence of the Forward and Backward Views • Sarsa(λ) • Q(λ) • Eligibility Traces for Actor-Critic Methods • Replacing Traces • Implementation Issues

  3. Reinforcement Learning: Eligibility Traces. n-Step TD Prediction (Intelligent Multimedia Lab, CSIE, Tatung University)

  4. Elementary Methods: Monte Carlo methods, dynamic programming, and TD(0).

  5. Monte Carlo vs. TD(0) • Monte Carlo: observes the rewards of all steps in an episode • TD(0): observes only one step ahead

  6. n-Step TD Prediction: a spectrum of backups from 1-step TD, through 2-step, 3-step, ..., n-step backups, up to Monte Carlo (full-episode) backups.

  7. n-Step TD Prediction: the corrected n-step truncated return.
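In the standard notation of Sutton and Barto, which this deck appears to follow, the corrected n-step truncated return is:

```latex
% Corrected n-step truncated return: truncate the return after n rewards,
% then correct the truncation with the current value estimate of the state
% reached n steps later.
R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n}
            + \gamma^{n} V_t(s_{t+n})
```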

  8. Backups: backup diagrams for Monte Carlo, TD(0), and n-step TD.

  9. n-Step TD Backup: online vs. offline updating. When updating offline, the new V(s) takes effect only in the next episode.
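Under the same assumed notation, the n-step TD backup and the online/offline distinction can be written as:

```latex
% Increment computed at time t from the corrected n-step return:
\Delta V_t(s_t) = \alpha \bigl[ R_t^{(n)} - V_t(s_t) \bigr]
% Online updating applies each increment immediately:
V_{t+1}(s_t) = V_t(s_t) + \Delta V_t(s_t)
% Offline updating accumulates the increments and applies their sum only at
% the end of the episode, so the new V(s) is first used in the next episode:
V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta V_t(s)
```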

  10. Error Reduction Property (online and offline): the maximum error of the expected n-step return is smaller than the maximum error of the current value estimate V.
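Stated in the assumed standard notation, the error reduction property bounds the worst-case error of the expected n-step return by γ^n times the worst-case error of the current estimate:

```latex
\max_{s} \Bigl| \mathbb{E}_{\pi}\bigl[ R_t^{(n)} \mid s_t = s \bigr] - V^{\pi}(s) \Bigr|
  \;\le\; \gamma^{n} \max_{s} \bigl| V_t(s) - V^{\pi}(s) \bigr|
```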

  11. Example (Random Walk): a five-state chain A, B, C, D, E starting in the middle; all rewards are 0 except a reward of 1 for reaching the right terminal. The true values are V(A) = 1/6, V(B) = 2/6, V(C) = 3/6, V(D) = 4/6, V(E) = 5/6. Consider 2-step TD, 3-step TD, ... Which n is optimal?

  12. start 1 0 0 0 0 1 online offline Average RMSE Over First 10 Trials Example (19-state Random Walk)

  13. Exercise (Random Walk): standard moves; rewards of +1 and -1 at the two terminals.

  14. Exercise (Random Walk): standard moves; rewards of +1 and -1 at the two terminals. • Evaluate the value function for the random policy. • Approximate the value function using n-step TD (try different n's and λ's) and compare their performance; a sketch follows below. • Find the optimal policy.
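A minimal sketch of how the prediction part of this exercise could be coded with tabular n-step TD; the chain length, reward placement, start state, discount, and uniform random policy are illustrative assumptions rather than the slide's exact specification:

```python
import numpy as np

def n_step_td(n, alpha, n_states=19, episodes=10, gamma=1.0, seed=0):
    """Tabular n-step TD prediction on a simple random-walk chain."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states + 2)              # indices 0 and n_states+1 are terminal
    for _ in range(episodes):
        states = [(n_states + 1) // 2]      # start in the middle of the chain
        rewards = [0.0]                     # dummy entry so rewards[t] is the reward at time t
        T = float('inf')                    # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                s_next = states[t] + rng.choice((-1, 1))   # uniform random policy
                r = 1.0 if s_next == n_states + 1 else (-1.0 if s_next == 0 else 0.0)
                states.append(s_next)
                rewards.append(r)
                if s_next in (0, n_states + 1):
                    T = t + 1
            tau = t - n + 1                 # time whose state estimate is updated now
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:             # correct the truncated return with V
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V[1:-1]                          # estimates for the non-terminal states

print(n_step_td(n=4, alpha=0.1))
```

Sweeping n and the step size with this kind of loop, and plotting RMSE against known true values, reproduces the style of comparison shown on the previous slides.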

  15. Reinforcement Learning: Eligibility Traces. The Forward View of TD(λ) (Intelligent Multimedia Lab, CSIE, Tatung University)

  16. Averaging n-step Returns • We are not limited to using a single n-step TD return as the backup target • We can instead back up toward an average of n-step returns whose weights sum to 1; this still counts as one backup (example below).
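A compound backup of this kind, using the textbook's example weights (assumed here), averages a 2-step and a 4-step return; any nonnegative weights summing to 1 are valid:

```latex
R_t^{\text{avg}} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}
```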

  17. TD()  -Return • TD() is a method for averaging all n-step backups • weight by n1(time since visitation) • Called-return • Backup using -return: w1 w2 w3 wTt 1

  18. TD()  -Return • TD() is a method for averaging all n-step backups • weight by n1(time since visitation) • Called-return • Backup using -return: w1 w2 w3 wTt

  19. Forward View of TD() A theoretical view

  20. TD() on the Random Walk

  21. Reinforcement Learning: Eligibility Traces. The Backward View of TD(λ) (Intelligent Multimedia Lab, CSIE, Tatung University)

  22. Why the Backward View? • The forward view is acausal: not implementable • The backward view is causal: implementable • In the offline case, it achieves the same result as the forward view.

  23-25. Eligibility Traces • Each state s is associated with an additional memory variable, its eligibility trace e_t(s), defined by the accumulating update given below.
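The standard accumulating-trace definition, assumed here and consistent with the later contrast against replacing traces, is:

```latex
% Accumulating eligibility trace: every trace decays by gamma*lambda each step,
% and the trace of the state visited at time t is incremented by 1.
e_t(s) =
\begin{cases}
  \gamma \lambda \, e_{t-1}(s)      & \text{if } s \neq s_t \\
  \gamma \lambda \, e_{t-1}(s) + 1  & \text{if } s = s_t
\end{cases}
```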

  26. Eligibility ≈ Recency of Visiting • At any time, the traces record which states have recently been visited, where "recently" is defined in terms of γλ. • The traces indicate the degree to which each state is eligible for undergoing learning changes should a reinforcing event occur. • Reinforcing event: the moment-by-moment 1-step TD errors.

  27. Reinforcing Event: the moment-by-moment 1-step TD error.
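In the assumed standard notation, the 1-step TD error is:

```latex
\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)
```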

  28. TD() Eligibility Traces Reinforcing Events Value updates

  29. Online TD()

  30. Backward View of TD()

  31. Backward View vs. MC & TD(0) • Setting λ to 0, we get TD(0) • Setting λ to 1, we get MC, but in a better way: • TD(1) can be applied to continuing tasks • it works incrementally and online (instead of waiting until the end of the episode) • How about 0 < λ < 1?

  32. Reinforcement Learning: Eligibility Traces. Equivalence of the Forward and Backward Views (Intelligent Multimedia Lab, CSIE, Tatung University)

  33. Offline TD()’s Offline Forward TD()  -Return Offline Backward TD()

  34-35. Forward View = Backward View: summed over an episode, the forward-view updates equal the backward-view updates for every state (see the proof).
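In the assumed standard notation, the offline equivalence states that the episode total of backward-view updates for any state equals the episode total of forward-view (λ-return) updates for that state:

```latex
% I_{s s_t} is 1 if s = s_t and 0 otherwise.
\sum_{t=0}^{T-1} \alpha \, \delta_t \, e_t(s)
  \;=\;
\sum_{t=0}^{T-1} \alpha \bigl[ R_t^{\lambda} - V_t(s_t) \bigr] I_{s s_t}
\qquad \text{for all } s
```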

  36. Offline -return (forward) Online TD() (backward) Average RMSE Over First 10 Trials TD() on the Random Walk

  37. Reinforcement Learning: Eligibility Traces. Sarsa(λ) (Intelligent Multimedia Lab, CSIE, Tatung University)

  38. Sarsa() • TD()  • Use eligibility traces for policy evaluation • How can eligibility traces be used for control? • Learn Qt(s, a) rather than Vt(s).

  39. Sarsa() Eligibility Traces Reinforcing Events Updates

  40. Sarsa()

  41. Sarsa()  Traces in Grid World • With one trial, the agent has much more information about how to get to the goal • not necessarily the best way • Considerably accelerate learning

  42. Reinforcement Learning: Eligibility Traces. Q(λ) (Intelligent Multimedia Lab, CSIE, Tatung University)

  43. Q-Learning • An off-policy method: it breaks from the greedy policy from time to time to take exploratory actions • because of this, a simple eligibility trace cannot be implemented easily • How can eligibility traces be combined with Q-learning? • Three methods: • Watkins's Q(λ) • Peng's Q(λ) • Naïve Q(λ)

  44. Watkins's Q(λ): estimation policy (e.g., greedy) vs. behavior policy (e.g., ε-greedy); the backup follows the greedy path only up to the first non-greedy action.

  45. Watkins's Q(λ): how to define the eligibility traces? Backups cover two cases: • Case 1: both the behavior and estimation policies take the greedy path. • Case 2: the behavior path takes a non-greedy action before the episode ends.
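A standard rendering of Watkins's Q(λ), assumed here, keeps traces alive only while the behavior policy follows the greedy (estimation) policy and cuts them to zero at the first non-greedy action:

```latex
% Trace: decays along the greedy path, is zeroed after an exploratory action,
% and is incremented for the state-action pair actually visited.
e_t(s,a) = I_{s s_t} I_{a a_t} +
\begin{cases}
  \gamma \lambda \, e_{t-1}(s,a) & \text{if } Q_{t-1}(s_t, a_t) = \max_{a} Q_{t-1}(s_t, a) \\
  0                              & \text{otherwise}
\end{cases}
% Reinforcing event backs up toward the greedy (estimation-policy) action:
\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)
% Update every state-action pair:
Q_{t+1}(s,a) = Q_t(s,a) + \alpha \, \delta_t \, e_t(s,a)
```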

  46. Watkins's Q()

  47. Watkins's Q()

  48. Peng's Q() • Cutting off traces loses much of the advantage of using eligibility traces. • If exploratory actions are frequent, as they often are early in learning, then only rarely will backups of more than one or two steps be done, and learning may be little faster than 1-step Q-learning. • Peng's Q() is an alternate version of Q() meant to remedy this.

  49. Peng's Q(λ): backups. Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3). • Never cuts traces • Backs up toward the max action except at the end • The book reports that it outperforms Watkins's Q(λ) and performs almost as well as Sarsa(λ) • Disadvantage: difficult to implement.

  50. Peng's Q(λ). See Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3), for the notation.
