
Temporal Difference Learning


Presentation Transcript


  1. Temporal Difference Learning Mark Romero – 11/03/2011

  2. Introduction • Temporal Difference Learning combines ideas from Monte Carlo methods and Dynamic Programming • It still samples the environment based on some policy • It determines the current estimate based on previous estimates • Predictions are adjusted over time to match other, more accurate predictions • Temporal Difference Learning is popular for its simplicity and its suitability for on-line applications

  3. MC vs TD • Constant-α MC: V(st) <- V(st) + α[R(t) – V(st)], where R(t) is the actual return (reward) following time t and α is a constant step-size parameter. Because the actual return is used, we must wait until the end of the episode to determine the update to V.
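The constant-α MC update can be sketched in a few lines of Python. This is an illustrative every-visit MC sketch, not code from the slides; the names V, episode, alpha, and gamma are assumptions, and episode is taken to be the list of (state, reward) pairs recorded while following the policy.

def constant_alpha_mc_update(V, episode, alpha, gamma=1.0):
    # Constant-alpha every-visit MC: V can only be updated after the
    # episode ends, because the actual return R(t) must be known.
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G                         # R(t) = r(t+1) + gamma * R(t+1)
        V[state] = V[state] + alpha * (G - V[state])   # move toward the actual return
    return V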

  4. MC vs TD • TD(0): V(st) <- V(st) + α[rt+1 + γV(st+1) – V(st)], where rt+1 is the observed reward and γ is the discount rate. The TD method only waits until the next time step: at time t+1 a target can be formed and an update made using the observed reward, rt+1, and the existing estimate, V(st+1). In effect, TD(0) targets rt+1 + γV(st+1) instead of R(t) in the MC method. This is called bootstrapping because the update is based on a previous estimate.
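For contrast with the MC update above, here is a one-step sketch of the TD(0) backup in Python; the names V, s, r_next, and s_next are illustrative, not from the slides.

def td0_update(V, s, r_next, s_next, alpha, gamma):
    # Bootstrapping: the target uses the current estimate V[s_next]
    # instead of the actual return R(t), so there is no waiting for the episode to end.
    td_target = r_next + gamma * V[s_next]
    V[s] = V[s] + alpha * (td_target - V[s])
    return V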

  5. Pseudo Code
Initialize V(s) arbitrarily, and π to the policy to be evaluated
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a <- action given by π for s
        Take action a; observe reward r and next state s'
        V(s) <- V(s) + α[r + γV(s') – V(s)]
        s <- s'
    until s is terminal
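The pseudocode above can be turned into a small runnable Python sketch. The 5-state random-walk environment and the uniform-random policy below are illustrative assumptions added for demonstration; only the TD(0) backup itself comes from the slide.

import random

def td0_evaluate(num_episodes=1000, alpha=0.1, gamma=1.0, n_states=5):
    # Tabular TD(0) prediction on a simple random walk:
    # states 0..n_states-1, terminating off either end,
    # reward +1 only when exiting to the right (assumed setup).
    V = {s: 0.0 for s in range(n_states)}
    for _ in range(num_episodes):
        s = n_states // 2                      # start in the middle
        while True:
            # Policy pi being evaluated: move left or right with equal probability.
            s_next = s + random.choice([-1, +1])
            if s_next < 0:
                r, terminal = 0.0, True        # exited to the left
            elif s_next >= n_states:
                r, terminal = 1.0, True        # exited to the right
            else:
                r, terminal = 0.0, False
            v_next = 0.0 if terminal else V[s_next]
            V[s] = V[s] + alpha * (r + gamma * v_next - V[s])   # TD(0) backup
            if terminal:
                break
            s = s_next
    return V

With gamma = 1.0 the learned values should approximately approach the true values 1/6, 2/6, ..., 5/6 for this walk.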

  6. Advantages over MC • Lends itself naturally to on-line applications • MC must wait until the end of the episode to make its update, while TD needs only one time step • This turns out to be a critical consideration: some applications have very long episodes, or no episodes at all • TD learns from every transition • MC methods generally discount or throw out episodes in which an experimental (exploratory) action was taken • In practice, TD converges faster than constant-α MC, although no formal proof of this has been developed

  7. Soundness • Is TD sound? • Yes: for any fixed policy π, the TD algorithm has been proven to converge to Vπ, provided a sufficiently small constant step-size parameter is used, or the step-size parameter decreases according to the usual stochastic approximation conditions.
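For reference, the "usual stochastic approximation conditions" are the standard Robbins–Monro conditions on the step-size sequence αt, written here in LaTeX:

\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty

The first condition ensures the steps remain large enough to overcome initial conditions and noise; the second ensures they eventually become small enough for the estimates to converge.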
