
Decision Making in Intelligent Systems Lecture 6


Presentation Transcript


  1. Decision Making in Intelligent Systems, Lecture 6 BSc course Kunstmatige Intelligentie 2008 Bram Bakker Intelligent Systems Lab Amsterdam Informatics Institute Universiteit van Amsterdam bram@science.uva.nl

  2. Overview of this lecture • Mid-term exam • Unified view • Relationship between MC and TD methods • Intermediate methods • Interactions with function approximation

  3. Mid-term exam, question 3

  4. Unified View

  5. TD(λ) • New variable called the eligibility trace e • On each step, decay all traces by γλ and increment the trace for the current state by 1 • Accumulating trace

  6. Prediction: On-line Tabular TD(λ)
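The algorithm box on this slide did not survive the transcript; below is a minimal Python sketch of on-line tabular TD(λ) prediction with accumulating traces. The environment interface (`env.reset`, `env.step`) and the `policy` callable are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def tabular_td_lambda(env, policy, n_states, n_episodes,
                      alpha=0.1, gamma=1.0, lam=0.9):
    """On-line tabular TD(lambda) prediction with accumulating traces.

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done),
    and policy(state) -> action; both are assumptions made for this sketch.
    """
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        e = np.zeros(n_states)          # eligibility traces, reset at episode start
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            e *= gamma * lam            # decay all traces by gamma * lambda
            e[s] += 1.0                 # accumulating trace: increment current state
            V += alpha * delta * e      # update every state in proportion to its trace
            s = s_next
    return V
```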

  7. Standard one-step TD • λ = 0 gives standard 1-step TD learning = TD(0)

  8. N-step TD Prediction • Idea: Look farther into the future when you do a TD backup (1, 2, 3, …, n steps)

  9. Mathematics of N-step TD Prediction • Monte Carlo: use the full return • TD: use V to estimate the remaining return after one step • n-step TD: use V to estimate the remaining return after n steps (2-step return, n-step return; see the formulas below)
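The return definitions referenced on this slide (the equations themselves are missing from the transcript) are the standard ones from Sutton & Barto:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T \quad \text{(Monte Carlo: the full return)}$$

$$R^{(1)}_t = r_{t+1} + \gamma V_t(s_{t+1}) \quad \text{(TD: use } V \text{ to estimate the remaining return)}$$

$$R^{(2)}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2}), \qquad R^{(n)}_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$$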

  10. Random Walk Examples • How does 2-step TD work here? • How about 3-step TD?

  11. A Larger Example • Task: 19-state random walk • Do you think there is an optimal n? For every task?

  12. Averaging n-step Returns • n-step methods were introduced to help understand TD(λ) • Idea: backup an average of several returns • e.g. backup half of the 2-step return and half of the 4-step return (see the formula below) • Called a complex backup • Draw each component • Label with the weights for that component
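As a concrete instance of the example on this slide, averaging half of the 2-step and half of the 4-step return gives the complex backup

$$R^{\text{avg}}_t = \tfrac{1}{2} R^{(2)}_t + \tfrac{1}{2} R^{(4)}_t, \qquad \Delta V_t(s_t) = \alpha \left[ R^{\text{avg}}_t - V_t(s_t) \right].$$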

  13. Forward View of TD(λ) • TD(λ) is a method for averaging all n-step backups • weight the n-step return by λ^(n-1) (time since visitation) • λ-return • Backup using the λ-return (see the formulas below)
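The λ-return and the corresponding backup, in Sutton & Barto's notation:

$$R^{\lambda}_t = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R^{(n)}_t, \qquad \Delta V_t(s_t) = \alpha \left[ R^{\lambda}_t - V_t(s_t) \right].$$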

  14. λ-return Weighting Function (figure showing the weights given to each n-step return until termination and to the final return after termination)

  15. Relation to TD(0) and MC • λ-return can be rewritten as shown below • If λ = 1, you get MC • If λ = 0, you get TD(0)
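The rewritten λ-return referred to above, for an episode terminating at time T:

$$R^{\lambda}_t = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R^{(n)}_t + \lambda^{T-t-1} R_t.$$

With λ = 1 only the second term survives and the λ-return equals the Monte Carlo return R_t; with λ = 0 only the one-step term survives and it equals the TD(0) target R_t^(1).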

  16. Forward View of TD(λ) II • Look forward from each state to determine its update from future states and rewards

  17. Backward View of TD(λ) • The forward view was for theory • The backward view is for mechanism • New variable called the eligibility trace e_t(s) • On each step, decay all traces by γλ and increment the trace for the current state by 1 • Accumulating trace (see the equations below)
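The accumulating trace referred to on this slide, in the book's notation:

$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$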

  18. Backward View • Shout the TD error δ_t backwards over time • The strength of your voice decreases with temporal distance by γλ
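The quantity being "shouted" is the one-step TD error, and every state is updated in proportion to its trace:

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), \qquad \Delta V_t(s) = \alpha\, \delta_t\, e_t(s) \quad \text{for all } s.$$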

  19. Relation of Backward View to MC & TD(0) • Using the update rule above • As before, if you set λ to 0, you get TD(0) • If you set λ to 1, you get MC, but in a better way • Can apply TD(1) to continuing tasks • Works incrementally and on-line (instead of waiting until the end of the episode)

  20. Forward View = Backward View • The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating • The book shows the equivalence (see below) • On-line updating with a small α is similar (algebra shown in the book)
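A sketch of the equivalence the book proves for off-line updating with accumulating traces: for every state s, the episode's summed backward-view updates equal its summed forward-view updates,

$$\sum_{t=0}^{T-1} \alpha\, \delta_t\, e_t(s) \;=\; \sum_{t=0}^{T-1} \alpha \left[ R^{\lambda}_t - V_t(s_t) \right] \mathbb{1}(s_t = s).$$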

  21. The two views

  22. Variable λ • Can generalize to a variable λ • Here λ is a function of time, λ_t • Could define, for example, λ_t as a function of the state s_t
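A minimal sketch of the generalized trace, assuming (as in the book's variable-λ section) that λ_t may depend on the state visited at time t:

$$e_t(s) = \begin{cases} \gamma\lambda_t\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda_t\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$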

  23. Linear Gradient Descent Q(λ)
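A minimal Python sketch of one update of linear gradient-descent Q(λ) in the spirit of this slide, using Watkins's convention of cutting traces after exploratory actions; the sizes, hyperparameters, and feature vector `phi` are illustrative assumptions, not the lecture's code.

```python
import numpy as np

# Illustrative sizes and hyperparameters (assumptions, not the lecture's values).
n_features, n_actions = 1000, 3
alpha, gamma, lam = 0.1, 1.0, 0.9

theta = np.zeros((n_actions, n_features))  # one linear weight vector per action
e = np.zeros_like(theta)                   # eligibility traces, same shape as theta

def q(phi, a):
    """Linear action value Q(s, a) = theta_a . phi(s) for a feature vector phi(s)."""
    return theta[a] @ phi

def q_lambda_update(phi, a, r, phi_next, done, next_action_greedy):
    """One step of Watkins's Q(lambda) with linear function approximation."""
    global theta, e
    # TD error toward the greedy (max) action value of the next state.
    target = r if done else r + gamma * max(q(phi_next, b) for b in range(n_actions))
    delta = target - q(phi, a)
    # Accumulating trace: decay all traces, then add grad_theta Q(s, a) = phi for action a.
    e *= gamma * lam
    e[a] += phi
    theta += alpha * delta * e
    if not next_action_greedy:             # Watkins's Q(lambda): cut traces after exploration
        e[:] = 0.0
```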

  24. Mountain-Car Results

  25. Baird’s Counterexample

  26. Baird’s Counterexample Cont.

  27. Should We Bootstrap?

  28. Summary • Unified view • Intermediate methods: • n-step returns / λ-return, eligibility traces • Can do better than the extreme cases • Provide an efficient, incremental way to combine MC and TD • Includes advantages of MC (learn when current values are not (yet) accurate; can deal with violations of the Markov property) • Includes advantages of TD (using the TD error, bootstrapping) • Can significantly speed learning • Does have a slight cost in computation • Issues with convergence when 0 < λ < 1 and using function approximation; but still usually better results when 0 < λ < 1

  29. Actor-Critic Methods • Explicit representation of policy as well as value function • Minimal computation to select actions • Can learn an explicitly stochastic and/or continuous, high-dimensional policy • Can put constraints on policies • Appealing as psychological and neural models

  30. Actor-Critic Details
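The details referred to here are, in the standard tabular formulation from Sutton & Barto: the critic computes the TD error and updates V, and the actor adjusts its action preferences p(s, a) (which define a softmax policy) in the direction the TD error suggests:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t), \qquad p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta\, \delta_t.$$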

  31. Eligibility Traces for Actor-Critic Methods • Critic: on-policy learning of V^π; use TD(λ) as described before • Actor: needs eligibility traces for each state-action pair • The actor-critic update equations are changed to use these traces (see the sketch below)
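A sketch of how the actor update changes when the actor keeps a trace e_t(s, a) for every state-action pair, following the same accumulating-trace pattern (the exact trace-increment variant differs between formulations, so treat this as an assumption):

$$p(s, a) \leftarrow p(s, a) + \beta\, \delta_t\, e_t(s, a) \quad \text{for all } s, a, \qquad e_t(s, a) = \gamma\lambda\, e_{t-1}(s, a) + \mathbb{1}(s = s_t,\, a = a_t).$$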

  32. Open question • What is the relationship between dynamic programming and learning?

  33. Next Class • Lecture next Monday: • Chapter 9 of Sutton & Barto
