Decision Making in Intelligent Systems, Lecture 6

BSc course Kunstmatige Intelligentie 2008

Bram Bakker

Intelligent Systems Lab Amsterdam

Informatics Institute

Universiteit van Amsterdam

bram@science.uva.nl

Overview of this lecture
  • Mid-term exam
  • Unified view
    • Relationship between MC and TD methods
    • Intermediate methods
    • Interactions with function approximation
TD(λ)
  • New variable called the eligibility trace e_t(s)
    • On each step, decay all traces by γλ and increment the trace for the current state by 1
    • Accumulating trace: e_t(s) = γλ e_{t-1}(s) + 1 if s = s_t, and e_t(s) = γλ e_{t-1}(s) otherwise (sketched below)
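As a small illustration, this is how the accumulating-trace rule might look in Python (the trace dictionary `e` and the parameter names `gamma`, `lam` are illustrative choices, not from the slides):

```python
def update_traces(e, s_t, gamma, lam):
    """Accumulating eligibility traces: decay every trace by gamma*lambda,
    then add 1 to the trace of the state visited at this step."""
    for s in e:
        e[s] *= gamma * lam          # decay all traces
    e[s_t] = e.get(s_t, 0.0) + 1.0   # increment the trace of the current state
    return e
```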
Standard one-step TD
  • l=0 standard 1-step TD-learning = TD(0)
N-step TD Prediction
  • Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)
Mathematics of N-step TD Prediction
  • Monte Carlo: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … + γ^(T-t-1) r_T
  • TD: R_t^(1) = r_{t+1} + γ V_t(s_{t+1})
    • Use V to estimate the remaining return
  • n-step TD:
    • 2-step return: R_t^(2) = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2})
    • n-step return: R_t^(n) = r_{t+1} + γ r_{t+2} + … + γ^(n-1) r_{t+n} + γ^n V_t(s_{t+n}) (see the sketch below)
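As an illustration of these definitions, a small Python sketch of the n-step return (the reward/state lists and the value table `V` are hypothetical names; `rewards[k]` stands for r_{k+1}):

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """R_t^(n): n discounted rewards followed by the discounted value estimate
    of the state reached after n steps; truncates to the full return at episode end."""
    T = len(rewards)                            # episode length
    G = 0.0
    for k in range(t, min(t + n, T)):
        G += gamma ** (k - t) * rewards[k]      # gamma^0 r_{t+1} + gamma^1 r_{t+2} + ...
    if t + n < T:
        G += gamma ** n * V[states[t + n]]      # bootstrap: + gamma^n V(s_{t+n})
    return G
```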
Random Walk Examples
  • How does 2-step TD work here?
  • How about 3-step TD?
A Larger Example
  • Task: 19-state random walk
  • Do you think there is an optimal n? Is the same n best for every task?
Averaging n-step Returns


  • n-step methods were introduced here mainly to help understand TD(λ)
  • Idea: back up an average of several returns
    • e.g. half of the 2-step return and half of the 4-step return (see the sketch below)
  • Called a complex backup
    • Draw each component
    • Label with the weights for that component
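For example, a complex backup that mixes half of the 2-step and half of the 4-step return could be written as follows (reusing the hypothetical `n_step_return` sketch from above; `alpha` is the step size):

```python
# Complex backup target: average of the 2-step and 4-step returns,
# with weights 0.5 and 0.5 (the weights must sum to 1).
target = 0.5 * n_step_return(rewards, states, V, t, 2, gamma) \
       + 0.5 * n_step_return(rewards, states, V, t, 4, gamma)
V[states[t]] += alpha * (target - V[states[t]])
```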
Forward View of TD(λ)
  • TD(λ) is a method for averaging all n-step backups
    • each n-step return is weighted in proportion to λ^(n-1), fading with the number of steps since visitation; the factor (1-λ) normalizes the weights
    • λ-return: R_t^λ = (1-λ) Σ_{n=1..∞} λ^(n-1) R_t^(n)
  • Backup using the λ-return: ΔV_t(s_t) = α [R_t^λ - V_t(s_t)] (sketched below)
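A sketch of the λ-return computed directly from this definition, truncated at the end of the episode (all weight for n ≥ T-t collapses onto the complete return); it reuses the hypothetical `n_step_return` helper from above:

```python
def lambda_return(rewards, states, V, t, gamma, lam):
    """R_t^lambda = (1 - lam) * sum_n lam^(n-1) * R_t^(n), with the leftover
    weight lam^(T-t-1) assigned to the complete (Monte Carlo) return."""
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):          # n-step returns that still bootstrap
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, states, V, t, n, gamma)
    # n = T - t and beyond: the complete return, receiving all remaining weight
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, states, V, t, T - t, gamma)
    return G_lam
```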
λ-return Weighting Function
  • Until termination: each n-step return receives weight (1-λ) λ^(n-1)
  • After termination: all remaining weight, λ^(T-t-1), goes to the complete (Monte Carlo) return
Relation to TD(0) and MC
  • The λ-return can be rewritten as: R_t^λ = (1-λ) Σ_{n=1..T-t-1} λ^(n-1) R_t^(n) + λ^(T-t-1) R_t
    (the sum covers the n-step returns until termination; the last term carries all the weight after termination)
  • If λ = 1, you get MC: R_t^λ = R_t
  • If λ = 0, you get TD(0): R_t^λ = R_t^(1)
Forward View of TD(λ) II
  • Look forward from each state to determine update from future states and rewards:
Backward View of TD(λ)
  • The forward view was for theory
  • The backward view is for mechanism
  • New variable called the eligibility trace e_t(s)
    • On each step, decay all traces by γλ and increment the trace for the current state by 1
    • Accumulating trace: e_t(s) = γλ e_{t-1}(s) + 1 if s = s_t, and e_t(s) = γλ e_{t-1}(s) otherwise
Backward View
  • Shout the TD error δ_t = r_{t+1} + γ V_t(s_{t+1}) - V_t(s_t) backwards over time
  • The strength of your voice decreases with temporal distance by γλ (see the sketch below)
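Putting the backward view together, a minimal sketch of one tabular TD(λ) episode (the `env.reset()`/`env.step()` interface and the `policy` callable are assumed, hypothetical names; the trace update reuses the `update_traces` helper sketched earlier):

```python
from collections import defaultdict

def td_lambda_episode(env, V, policy, alpha, gamma, lam):
    """One episode of tabular TD(lambda) with accumulating traces (backward view)."""
    e = defaultdict(float)                  # eligibility trace per state
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        # TD error: delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        update_traces(e, s, gamma, lam)     # decay all traces, bump e(s_t)
        # "Shout delta backwards": every state is updated in proportion to its trace
        for state, trace in e.items():
            V[state] += alpha * delta * trace
        s = s_next
    return V
```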
Relation of Backwards View to MC & TD(0)
  • Using the update rule ΔV_t(s) = α δ_t e_t(s), for all states s:
  • As before, if you set λ to 0, you get TD(0)
  • If you set λ to 1, you get MC, but in a better way
    • Can apply TD(1) to continuing tasks
    • Works incrementally and on-line (instead of waiting until the end of the episode)
Forward View = Backward View
  • The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
  • The book shows that, summed over an episode, the backward-view updates equal the forward-view updates; the key step is that the off-line λ-return error can be written as a sum of TD errors: R_t^λ - V(s_t) = Σ_{k=t..T-1} (γλ)^(k-t) δ_k
  • On-line updating with small α is similar


Variable λ
  • Can generalize to a variable λ
  • Here λ_t is a function of time
    • Could define, for example, λ_t = λ(s_t), so that the degree of bootstrapping depends on the current state (sketched below)
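A variable λ only changes the trace decay; for instance, with a hypothetical `lam_of(state)` function mapping each state to a value in [0, 1]:

```python
# Accumulating trace with a state-dependent lambda: lambda_t = lam_of(s_t).
for state in e:
    e[state] *= gamma * lam_of(s)   # decay uses the lambda of the current state
e[s] = e.get(s, 0.0) + 1.0
```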
Summary
  • Unified view
  • Intermediate methods:
    • n-step returns, the λ-return, eligibility traces
    • Can do better than either extreme case (pure MC or one-step TD)
  • Provide an efficient, incremental way to combine MC and TD
    • Include the advantages of MC (learn when current values are not (yet) accurate; can deal with violations of the Markov property)
    • Include the advantages of TD (using the TD error, bootstrapping)
  • Can significantly speed up learning
  • Does have a slight cost in computation
  • There are convergence issues when 0 < λ < 1 is combined with function approximation, but results are still usually better with 0 < λ < 1
Actor-Critic Methods
  • Explicit representation of policy as well as value function
  • Minimal computation to select actions
  • Can learn an explicitly stochastic and/or continuous, high-dimensional policy
  • Can put constraints on policies
  • Appealing as psychological and neural models
Eligibility Traces for Actor-Critic Methods
  • Critic: on-policy learning of V^π; use TD(λ) as described before
  • Actor: needs eligibility traces for each state-action pair
  • We change the actor update from
    p_t(s_t, a_t) ← p_t(s_t, a_t) + β δ_t
    to
    p_t(s, a) ← p_t(s, a) + β δ_t e_t(s, a), for all s, a
    where e_t(s, a) is an accumulating trace over state-action pairs
  • Can change the other actor-critic update from
    p_t(s_t, a_t) ← p_t(s_t, a_t) + β δ_t [1 - π_t(s_t, a_t)]
    to
    p_t(s, a) ← p_t(s, a) + β δ_t e_t(s, a)
    where the trace of the visited pair is now incremented by 1 - π_t(s_t, a_t) instead of 1 (see the sketch below)
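A rough sketch of one actor-critic step with traces for both the critic (per state) and the actor (per state-action pair); the preference table `p`, step sizes `alpha`/`beta`, and the overall structure are illustrative assumptions, not the lecture's exact formulation:

```python
def actor_critic_step(s, a, r, s_next, done, V, p, e_v, e_p,
                      alpha, beta, gamma, lam):
    """One backward-view update: V holds state values (critic), p holds
    action preferences (actor), e_v / e_p are their eligibility traces."""
    # Critic: TD error from the current value estimates
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]

    # Critic: TD(lambda) update over state traces
    for k in e_v:
        e_v[k] *= gamma * lam
    e_v[s] = e_v.get(s, 0.0) + 1.0
    for k, tr in e_v.items():
        V[k] += alpha * delta * tr

    # Actor: p(s,a) <- p(s,a) + beta * delta * e(s,a), with state-action traces
    for k in e_p:
        e_p[k] *= gamma * lam
    e_p[(s, a)] = e_p.get((s, a), 0.0) + 1.0
    for k, tr in e_p.items():
        p[k] += beta * delta * tr
    return delta
```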

Open question
  • What is the relationship between dynamic programming and learning?
Next Class
  • Lecture next Monday:
    • Chapter 9 of Sutton & Barto