Decision Making in Intelligent Systems
Lecture 6

BSc course Kunstmatige Intelligentie 2008

Bram Bakker

Intelligent Systems Lab Amsterdam

Informatics Institute

Universiteit van Amsterdam

[email protected]

Overview of this lecture
  • Mid-term exam
  • Unified view
    • Relationship between MC and TD methods
    • Intermediate methods
    • Interactions with function approximation
TD(λ)
  • New variable called the eligibility trace e
    • On each step, decay all traces by γλ and increment the trace for the current state by 1
    • Accumulating trace
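Written out in standard notation (s_t is the state visited at time t), the accumulating-trace update referred to above is:

    e_t(s) = γλ e_{t−1}(s) + 1   if s = s_t
    e_t(s) = γλ e_{t−1}(s)       otherwise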
Standard one-step TD
  • λ = 0 gives standard one-step TD learning, i.e. TD(0)
N-step TD Prediction
  • Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps)
Mathematics of n-step TD Prediction
  • Monte Carlo: uses the complete return
  • TD: uses V to estimate the remaining return after one step
  • n-step TD: intermediate cases
    • 2-step return
    • n-step return (all returns are written out below)
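A reconstruction of these returns in the book's notation (r denotes rewards, V_t the current value estimate, T the terminal time step):

    Complete (MC) return:  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … + γ^{T−t−1} r_T
    1-step (TD) return:    R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1})
    2-step return:         R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2})
    n-step return:         R_t^{(n)} = r_{t+1} + γ r_{t+2} + … + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})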
Random Walk Examples
  • How does 2-step TD work here?
  • How about 3-step TD?
A Larger Example
  • Task: 19 state random walk
  • Do you think there is an optimal n? Is it the same for everything?
Averaging n-step Returns
  • n-step methods were introduced to help with understanding TD(λ)
  • Idea: back up an average of several returns
    • e.g. back up half of the 2-step return and half of the 4-step return (see the example below)
  • Called a complex backup ("one backup" in the diagram)
    • Draw each component
    • Label each component with its weight
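As a concrete instance of the half-2-step / half-4-step example above, the averaged return and the corresponding backup are:

    R_t^{avg} = ½ R_t^{(2)} + ½ R_t^{(4)}
    ΔV_t(s_t) = α [R_t^{avg} − V_t(s_t)]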
Forward View of TD(λ)
  • TD(λ) is a method for averaging all n-step backups
    • each n-step return is weighted proportionally to λ^{n−1} (n−1 = time since visitation)
    • this defines the λ-return (see below)
  • Backup using the λ-return (see below)
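Written out (standard definitions):

    λ-return:                R_t^λ = (1 − λ) Σ_{n=1..∞} λ^{n−1} R_t^{(n)}
    Backup using λ-return:   ΔV_t(s_t) = α [R_t^λ − V_t(s_t)]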
λ-return Weighting Function
  • The figure in the slides shows the weight given to each n-step return until termination, and the remaining weight given to the complete return after termination

Relation to TD(0) and MC
  • The λ-return can be rewritten as a weighted sum of n-step returns until termination plus a remaining weight on the complete return after termination (see below)
  • If λ = 1, you get MC
  • If λ = 0, you get TD(0)
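A reconstruction of the rewritten form (for an episode terminating at time T):

    R_t^λ = (1 − λ) Σ_{n=1..T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t

With λ = 1 all weight falls on the complete return R_t (Monte Carlo); with λ = 0 all weight falls on the one-step return R_t^{(1)} (TD(0)).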

Forward View of TD(λ) II
  • Look forward from each state to determine its update from future states and rewards
Backward View of TD(λ)
  • The forward view was for theory
  • The backward view is for mechanism
  • New variable called the eligibility trace
    • On each step, decay all traces by γλ and increment the trace for the current state by 1
    • Accumulating trace
Backward View
  • Shout δ_t backwards over time
  • The strength of your voice decreases with temporal distance as γλ
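The quantities behind this metaphor, written out in standard notation: the TD error that gets "shouted", and the trace-weighted update each state receives:

    δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)
    ΔV_t(s) = α δ_t e_t(s)   for all s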
Relation of Backwards View to MC & TD(0)
  • Using the update rule ΔV_t(s) = α δ_t e_t(s):
  • As before, if you set λ to 0, you get TD(0)
  • If you set λ to 1, you get MC, but in a better way
    • Can apply TD(1) to continuing tasks
    • Works incrementally and on-line (instead of waiting until the end of the episode); see the sketch below
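A minimal Python sketch of on-line tabular TD(λ) with accumulating traces, to make the mechanism concrete. The environment interface (reset/step) and the policy function are illustrative assumptions, not part of the original slides:

```python
import numpy as np

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=1.0, lam=0.9):
    """Run one episode of on-line tabular TD(lambda) with accumulating traces.

    Assumed (hypothetical) interface: env.reset() -> state index,
    env.step(action) -> (next_state, reward, done); policy(state) -> action;
    V is a NumPy float array of value estimates indexed by state.
    """
    e = np.zeros_like(V)              # eligibility traces, one per state
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        # TD error; bootstrap with 0 beyond a terminal state
        target = r + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]
        e[s] += 1.0                   # accumulating trace for the current state
        V += alpha * delta * e        # update every state in proportion to its trace
        e *= gamma * lam              # decay all traces by gamma*lambda
        s = s_next
    return V
```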

Forward View = Backward View
  • The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
  • The book shows that the sum of the forward updates equals the sum of the backward updates over an episode (algebra shown in the book)
  • On-line updating with small α is similar
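A reconstruction of the off-line equivalence the book proves (𝟙 is the indicator function):

    Σ_{t=0..T−1} α δ_t e_t(s) = Σ_{t=0..T−1} α [R_t^λ − V_t(s_t)] 𝟙(s_t = s)   for all s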

Variable λ
  • Can generalize to a variable λ
  • Here λ is a function of time
    • Could define λ_t in various ways (an example follows below)
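One possible definition, given as an illustrative assumption rather than necessarily the one on the original slide: make λ depend on the state visited, λ_t = λ(s_t), so the trace update becomes

    e_t(s) = γ λ_t e_{t−1}(s) + 1   if s = s_t
    e_t(s) = γ λ_t e_{t−1}(s)       otherwise

e.g. setting λ(s) close to 0 at states whose values are already estimated accurately, so that traces are cut there.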
Summary
  • Unified view
  • Intermediate methods:
    • n-step returns / λ-return, eligibility traces
    • Can do better than either extreme case
  • Provide an efficient, incremental way to combine MC and TD
    • Includes advantages of MC (learns when current values are not (yet) accurate; can deal with violations of the Markov property)
    • Includes advantages of TD (uses the TD error, bootstrapping)
  • Can significantly speed up learning
  • Does have a slight cost in computation
  • Issues with convergence when 0 < λ < 1 and using function approximation, but still usually better results when 0 < λ < 1
Actor-Critic Methods
  • Explicit representation of policy as well as value function
  • Minimal computation to select actions
  • Can learn an explicitly stochastic and/or continuous, high-dimensional policy
  • Can put constraints on policies
  • Appealing as psychological and neural models
Eligibility Traces for Actor-Critic Methods
  • Critic: on-policy learning of V^π; use TD(λ) as described before
  • Actor: needs eligibility traces for each state-action pair
  • We change the actor's update equation to use these traces
  • Can change the other actor-critic update in the same way (one standard formulation is sketched below)
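A reconstruction of one standard formulation (following Sutton & Barto, 1st edition; p denotes the actor's action preferences):

    p_{t+1}(s, a) = p_t(s, a) + α δ_t e_t(s, a)

where e_t(s, a) is an accumulating trace over state-action pairs,

    e_t(s, a) = γλ e_{t−1}(s, a) + 1   if s = s_t and a = a_t
    e_t(s, a) = γλ e_{t−1}(s, a)       otherwise

and the alternative ("other") update increments the trace at the visited pair by 1 − π_t(s_t, a_t) instead of 1.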

Open question
  • What is the relationship between dynamic programming and learning?
Next Class
  • Lecture next Monday:
    • Chapter 9 of Sutton & Barto