
Decision Making in Intelligent Systems, Lecture 6

BSc course Kunstmatige Intelligentie 2008

Bram Bakker

Intelligent Systems Lab Amsterdam

Informatics Institute

Universiteit van Amsterdam

[email protected]



Overview of this lecture

  • Mid-term exam

  • Unified view

    • Relationship between MC and TD methods

    • Intermediate methods

    • Interactions with function approximation



Mid-term exam, question 3



Unified View



TD(λ)

  • New variable called eligibility trace e

    • On each step, decay all traces by γλ and increment the trace for the current state by 1

    • Accumulating trace (written out below)
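
In Sutton & Barto's notation, the accumulating eligibility trace is

    e_t(s) = \gamma \lambda \, e_{t-1}(s) + 1 \quad \text{if } s = s_t, \qquad e_t(s) = \gamma \lambda \, e_{t-1}(s) \quad \text{otherwise}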



Prediction: On-line Tabular TD(λ)
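
The algorithm box from this slide is not reproduced in the transcript. A minimal Python sketch of on-line tabular TD(λ) with accumulating traces, assuming a fixed policy and an env.reset()/env.step() interface (these names are illustrative, not from the slides):

    import numpy as np

    # On-line tabular TD(lambda) prediction with accumulating traces
    def tabular_td_lambda(env, policy, n_states, alpha=0.1, gamma=1.0, lam=0.8, n_episodes=100):
        V = np.zeros(n_states)
        for _ in range(n_episodes):
            e = np.zeros(n_states)                  # eligibility traces, reset each episode
            s = env.reset()
            done = False
            while not done:
                s_next, r, done = env.step(policy(s))
                # TD error: delta = r + gamma * V(s') - V(s)
                delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
                # decay all traces by gamma*lambda, bump the trace of the current state
                e *= gamma * lam
                e[s] += 1.0
                # update every state in proportion to its trace
                V += alpha * delta * e
                s = s_next
        return V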



Standard one-step TD

  • l=0 standard 1-step TD-learning = TD(0)



N-step TD Prediction

  • Idea: Look farther into the future when you do a TD backup (1, 2, 3, …, n steps)



Mathematics of N-step TD Prediction

  • Monte Carlo: use the complete return as the target (the returns referred to here are written out after this list)

  • TD:

    • Use V to estimate the remaining return after one step

  • n-step TD:

    • 2-step return: two rewards, then bootstrap from V

    • n-step return: n rewards, then bootstrap from V
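
In the book's notation, the targets just listed are:

    R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T    (Monte Carlo: complete return)

    R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})    (one-step TD target)

    R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})    (2-step return)

    R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})    (n-step return)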



Random Walk Examples

  • How does 2-step TD work here?

  • How about 3-step TD?



A Larger Example

  • Task: 19-state random walk

  • Do you think there is an optimal n? A single one for every problem?



Averaging n-step Returns

Figure: one complex backup, averaging several n-step backups

  • n-step methods were introduced mainly to help understand TD(λ)

  • Idea: backup an average of several returns

    • e.g. backup half of the 2-step and half of the 4-step return (written out below)

  • Called a complex backup

    • Draw each component

    • Label with the weights for that component
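
For that example the complex backup uses the target

    R_t^{avg} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}

Any set of non-negative weights that sums to 1 gives a valid complex backup.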



Forward View of TD(λ)

  • TD(λ) is a method for averaging all n-step backups

    • weight each n-step return by λ^(n-1) (time since visitation)

    • λ-return: see the formula below

  • Backup using the λ-return: see below
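
The λ-return and the corresponding forward-view backup, in the book's notation:

    R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}

    \Delta V_t(s_t) = \alpha \left[ R_t^{\lambda} - V_t(s_t) \right]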



λ-return Weighting Function

Figure: the weighting given to each n-step return is (1 - λ)λ^(n-1) until termination; after termination, all remaining weight, λ^(T-t-1), falls on the complete return.



Relation to TD(0) and MC

  • The λ-return can be rewritten with one term that applies until termination and one that applies after termination (see below)

  • If λ = 1, you get MC: the λ-return reduces to the complete return R_t

  • If λ = 0, you get TD(0): all weight falls on the one-step return R_t^(1)
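
Written out for an episodic task, the λ-return is

    R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t

The first (sum) term covers the n-step returns until termination; the second term is the weight left over after termination, placed on the complete return.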



Forward View of TD(λ) II

  • Look forward from each state to determine update from future states and rewards:



Backward View of TD(λ)

  • The forward view was for theory

  • The backward view is for mechanism

  • New variable called eligibility trace

    • On each step, decay all traces by γλ and increment the trace for the current state by 1

    • Accumulating trace



Backward View

  • Shout the TD error δt backwards over time

  • The strength of your voice decreases with temporal distance by γλ (the exact update is below)
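
In the book's notation, the TD error being shouted back and the resulting update to every state are

    \delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)

    \Delta V_t(s) = \alpha \, \delta_t \, e_t(s) \quad \text{for all } s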



Relation of Backwards View to MC & TD(0)

  • Using the backward-view update rule ΔV_t(s) = α δ_t e_t(s):

  • As before, if you set λ to 0, you get TD(0)

  • If you set λ to 1, you get MC, but in a better way (see the note after this list)

    • Can apply TD(1) to continuing tasks

    • Works incrementally and on-line (instead of waiting to the end of the episode)
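
Why these special cases hold: with λ = 0 the trace is 1 for the current state s_t and 0 for every other state, so the update collapses to the ordinary TD(0) update ΔV_t(s_t) = α δ_t; with λ = 1 the traces decay only by γ, and (for off-line updating) the discounted sum of TD errors telescopes into the Monte Carlo error R_t - V_t(s_t), so TD(1) performs MC-style updates.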




Forward View = Backward View

  • The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating

  • The book shows the identity sketched below (algebra in the book)

  • On-line updating with small α is similar
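
A rough statement of the identity, in the book's notation: summed over an episode (off-line updating),

    \sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t) \, \mathbf{1}[s = s_t] \quad \text{for all } s

where ΔV_t^{TD}(s) = α δ_t e_t(s) are the backward-view updates and ΔV_t^{λ}(s_t) = α [R_t^λ - V_t(s_t)] are the forward-view updates.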



The two views



Variable λ

  • Can generalize to a variable λ

  • Here λ is a function of time

    • Could define, for example, λ_t = λ(s_t), i.e. let λ depend on the state visited at time t (trace update below)
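
With a variable λ the accumulating trace becomes, roughly,

    e_t(s) = \gamma \lambda_t \, e_{t-1}(s) + \mathbf{1}[s = s_t]

so each state's credit decays according to the λ in effect at that time step.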



Linear Gradient Descent Q(λ)
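
The algorithm box on this slide is not reproduced in the transcript. Below is a minimal sketch, not the book's pseudocode, of Watkins-style Q(λ) with linear function approximation; the feature function phi(s, a), the env.reset()/env.step() interface, and ε-greedy exploration are illustrative assumptions.

    import numpy as np

    # Sketch: Watkins-style Q(lambda) with linear function approximation.
    # phi(s, a) is assumed to return a feature vector of length n_features
    # (e.g. from tile coding); env.reset()/env.step() are illustrative names.
    def linear_q_lambda(env, phi, n_features, n_actions,
                        alpha=0.05, gamma=1.0, lam=0.9, epsilon=0.1, n_episodes=200):
        theta = np.zeros(n_features)          # Q(s, a) = theta . phi(s, a)

        def q(s, a):
            return float(theta @ phi(s, a))

        for _ in range(n_episodes):
            e = np.zeros(n_features)          # eligibility trace over feature weights
            s = env.reset()
            done = False
            while not done:
                q_s = np.array([q(s, b) for b in range(n_actions)])
                if np.random.rand() < epsilon:
                    a = np.random.randint(n_actions)
                else:
                    a = int(np.argmax(q_s))
                exploratory = q_s[a] < q_s.max()   # chosen action is not greedy

                s_next, r, done = env.step(a)
                delta = r - q_s[a]
                if not done:
                    delta += gamma * max(q(s_next, b) for b in range(n_actions))

                # Watkins's variant: cut the traces of earlier pairs after an exploratory
                # action, so only TD errors along the greedy path are propagated back
                e = phi(s, a).astype(float) if exploratory else gamma * lam * e + phi(s, a)
                theta += alpha * delta * e
                s = s_next
        return theta

With binary tile-coding features, theta @ phi(s, a) reduces to summing the weights of the active tiles, which is the usual setup for the mountain-car experiments.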



Mountain-Car Results



Baird’s Counterexample



Baird’s Counterexample Cont.



Should We Bootstrap?



Summary

  • Unified view

  • Intermediate methods:

    • n-step returns / λ-return, eligibility traces

    • Can do better than extreme cases

  • Provide efficient, incremental way to combine MC and TD

    • Includes advantages of MC (learn when current values are not (yet) accurate; can deal with violations of Markov property)

    • Includes advantages of TD (using TD error, bootstrapping)

  • Can significantly speed learning

  • Does have a slight cost in computation

  • Issues with convergence when 0 < λ < 1 and using function approximation; but still usually better results when 0 < λ < 1



Actor-Critic Methods

  • Explicit representation of policy as well as value function

  • Minimal computation to select actions

  • Can learn an explicitly stochastic and/or continuous, high-dimensional policy

  • Can put constraints on policies

  • Appealing as psychological and neural models



Actor-Critic Details
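
The diagram and equations for this slide are not in the transcript; in the book's notation, the basic actor-critic updates are roughly:

    \pi_t(s, a) = \frac{e^{p(s, a)}}{\sum_b e^{p(s, b)}}    (Gibbs softmax policy over the action preferences p)

    \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)    (the critic's TD error)

    p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \, \delta_t    (the actor strengthens or weakens the action just taken)

The critic itself updates V(s_t) toward r_{t+1} + γV(s_{t+1}) by ordinary TD learning.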



Eligibility Traces for Actor-Critic Methods

  • Critic: On-policy learning of V^π. Use TD(λ) as described before.

  • Actor: Needs eligibility traces for each state-action pair.

  • We change the actor's update equation so that it uses an eligibility trace over state-action pairs (written out below)

  • The other actor-critic update can be changed in the same way, using a correspondingly modified trace
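
In the book's notation, roughly, the actor update with traces becomes

    p_{t+1}(s, a) = p_t(s, a) + \beta \, \delta_t \, e_t(s, a)

where the state-action trace is e_t(s, a) = γλ e_{t-1}(s, a) plus an increment of 1 when s = s_t and a = a_t; for the variant of the actor update involving 1 - π_t(s_t, a_t), the increment 1 is replaced by 1 - π_t(s_t, a_t).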



Open question

  • What is the relationship between dynamic programming and learning?



Next Class

  • Lecture next Monday:

    • Chapter 9 of Sutton & Barto

