
Reinforcement Learning: an introduction, part 4



Presentation Transcript


  1. Reinforcement Learning: an introduction, part 4. Ann Nowé, Ann.nowe@vub.ac.be, http://como.vub.ac.be. By Sutton and Barto.

  2. Backup diagrams in DP. State-value function for policy π: [backup diagram with V(s) at the root, expanding to successor values V(s1), V(s2), V(s3) and their successors V(s1'), V(s2'), V(s3')]. Action-value function for policy π: [backup diagram with Q(s,a) at the root, expanding through successor states s1, s2 to action values Q(s1,a1), Q(s1,a2), Q(s2,a1), Q(s2,a2)].
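
As a sketch of what these two backups compute, assuming a tabular, model-based setting; the dictionary layout and names below are illustrative, not taken from the slides:

```python
# One-step backups corresponding to the two diagrams (tabular, model-based).
# P[s][a] is a list of (prob, next_state, reward) triples, pi[s][a] is the
# policy's probability of taking a in s, gamma is the discount factor.

def backup_v(s, P, pi, V, gamma):
    # V(s) = sum_a pi(a|s) * sum_s' p(s'|s,a) * [ r + gamma * V(s') ]
    return sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
               for a in P[s])

def backup_q(s, a, P, pi, Q, gamma):
    # Q(s,a) = sum_s' p(s'|s,a) * [ r + gamma * sum_a' pi(a'|s') * Q(s',a') ]
    return sum(p * (r + gamma * sum(pi[s2][a2] * Q[(s2, a2)] for a2 in pi[s2]))
               for p, s2, r in P[s][a])
```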

  3. Dynamic Programming, model based. [Backup diagram: full-width backups over all possible successor states, expanding down to terminal states T.]

  4. Recall value iteration in DP: Q(s,a) ← Σ_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q(s',a')].
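
A minimal value-iteration sketch in the Q(s,a) form the slide recalls, assuming a known model; the model layout, discount factor, and stopping tolerance are illustrative choices:

```python
# Value iteration over Q(s,a) for a known model.
# P[s][a] = [(prob, next_state, reward), ...]; terminal states simply do not
# appear as keys of P.

def value_iteration(P, gamma=0.9, tol=1e-6):
    Q = {(s, a): 0.0 for s in P for a in P[s]}
    while True:
        delta = 0.0
        for s in P:
            for a in P[s]:
                # Q(s,a) <- sum_s' p(s'|s,a) * [ r + gamma * max_a' Q(s',a') ]
                new_q = sum(p * (r + gamma * max((Q[(s2, a2)] for a2 in P.get(s2, {})),
                                                 default=0.0))
                            for p, s2, r in P[s][a])
                delta = max(delta, abs(new_q - Q[(s, a)]))
                Q[(s, a)] = new_q
        if delta < tol:
            return Q
```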

  5. RL, model free. [Backup diagram: sample backups along experienced trajectories down to terminal states T, rather than full sweeps over a model.]

  6. Q-learning, a value iteration approach: Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') - Q(s,a)]. Q-learning is off-policy.
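
A sketch of the corresponding tabular update, with Q stored as a dictionary keyed by (state, action); the step size, discount, and ε values below are assumptions:

```python
import random

# Tabular Q-learning update. Off-policy: the target maximises over next
# actions, regardless of which action the behaviour policy takes next.
def q_learning_step(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# A typical behaviour policy used while learning.
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a2: Q[(s, a2)])
```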

  7. Example. [MDP diagram: six states (1-6) and actions a-d, with transition probabilities between 0.2 and 1.0 and rewards ranging from R=1 to R=10.] Observed epochs: Epoch 1: 1,2,4; Epoch 2: 1,6; Epoch 3: 1,3; Epoch 4: 1,2,5; Epoch 6: 2,5.

  8. Some convergence issues. Q-learning is guaranteed to converge in a Markovian setting. Tsitsiklis, J.N., Asynchronous Stochastic Approximation and Q-learning. Machine Learning, Vol. 16:185-202, 1994.
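
Besides the Markov property, the theorem also requires that every state-action pair keeps being updated and that the learning rates sum to infinity while their squares sum to a finite value. A minimal sketch of one schedule with that property; the harmonic decay is just one common choice, not prescribed by the slides:

```python
from collections import defaultdict

# Per-(state, action) learning rates. With alpha_n = 1/n for the n-th update
# of a pair, sum_n alpha_n diverges and sum_n alpha_n^2 converges, which is
# the standard step-size condition in the convergence theorem.
visits = defaultdict(int)

def step_size(s, a):
    visits[(s, a)] += 1
    return 1.0 / visits[(s, a)]
```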

  9. Proof by Tsitsiklis, cont. On the convergence of Q-learning

  10. Proof by Tsitsiklis: on the convergence of Q-learning. [Annotated Q-learning update for Q(s,a), labelling the noise term, the "learning factor", the q vector (possibly with outdated components), and the contraction mapping.]

  11. Proof by Tsitsiklis, cont. Stochastic approximation, written as a vector update: q_i(t+1) = q_i(t) + α_i(t) [F_i(q(t)) - q_i(t) + noise], where other components q_j may enter with outdated values.

  12. Proof by Tsitsiklis, cont. Relating Q-learning to stochastic approximation. [Annotated correspondence: the contraction mapping F is the Bellman operator (i-th component); the sampled target supplies the noise term; the learning factor can vary in time.]
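
To make the mapping on this slide concrete, here is a sketch that writes one Q-learning update in the stochastic-approximation form F(q) - q + noise, assuming a known model only so that the operator and the zero-mean noise can both be evaluated; the function and variable names are illustrative:

```python
# Q-learning written in the stochastic-approximation form used in the proof.
# P[s][a] = [(prob, next_state, reward), ...].

def bellman_operator(Q, s, a, P, gamma):
    # (F Q)(s,a) = sum_s' p(s'|s,a) * [ r + gamma * max_a' Q(s',a') ]  (a gamma-contraction)
    return sum(p * (r + gamma * max((Q[(s2, a2)] for a2 in P.get(s2, {})), default=0.0))
               for p, s2, r in P[s][a])

def q_learning_as_stochastic_approximation(Q, s, a, r, s2, P, alpha, gamma):
    # One sampled transition gives a noisy estimate of (F Q)(s,a).
    sampled_target = r + gamma * max((Q[(s2, a2)] for a2 in P.get(s2, {})), default=0.0)
    noise = sampled_target - bellman_operator(Q, s, a, P, gamma)  # zero mean given the model
    # Q(s,a) <- Q(s,a) + alpha * [ (F Q)(s,a) - Q(s,a) + noise ]
    Q[(s, a)] += alpha * (bellman_operator(Q, s, a, P, gamma) - Q[(s, a)] + noise)
```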

  13. Sarsa: on-policy TD control. Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') - Q(s,a)]. When is Sarsa = Q-learning?
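
For contrast with the Q-learning step above, a sketch of the Sarsa update; the target uses the next action actually chosen by the behaviour policy:

```python
# Sarsa update. On-policy: the target uses the next action a2 that the
# behaviour policy actually selects in s2.
def sarsa_step(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

If that behaviour policy is greedy with respect to Q, the chosen a2 is the maximising action and the Sarsa target coincides with the Q-learning target, which answers the question on the slide.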

  14. Q-learning versus Sarsa. Q-learning is off-policy; Sarsa is on-policy.

  15. Cliff Walking example. Actions: up, down, left, right. Reward: cliff -100, goal 0, default -1. Action selection: ε-greedy, with ε = 0.1. Sarsa takes exploration into account.
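
A compact sketch of this experiment under the stated rewards and ε = 0.1; the 4x12 grid, step size, and episode count are assumptions (the slide does not give them), following the usual Sutton and Barto layout:

```python
import random
from collections import defaultdict

# Cliff walking: start bottom-left, goal bottom-right, cliff along the bottom row.
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
CLIFF = {(3, c) for c in range(1, 11)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
EPS, ALPHA, GAMMA = 0.1, 0.5, 1.0

def step(s, a):
    r_ = min(max(s[0] + a[0], 0), ROWS - 1)
    c_ = min(max(s[1] + a[1], 0), COLS - 1)
    if (r_, c_) in CLIFF:
        return START, -100, False        # fall off the cliff, back to start
    if (r_, c_) == GOAL:
        return GOAL, 0, True             # goal reward 0, as on the slide
    return (r_, c_), -1, False           # default step cost -1

def eps_greedy(Q, s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def run(sarsa, episodes=500):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = START, False
        a = eps_greedy(Q, s)
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(Q, s2)
            target = Q[(s2, a2)] if sarsa else max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += ALPHA * (r + GAMMA * (0 if done else target) - Q[(s, a)])
            s, a = s2, a2
    return Q
```

With this setup Q-learning tends to learn the shortest path along the cliff edge and occasionally falls off during ε-greedy exploration, while Sarsa, learning about the policy it actually follows, prefers the safer detour; that is the sense in which Sarsa takes exploration into account.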

  16. Q-learning for CAC. Acceptance criterion: maximize network revenue. [Diagram: system states S1 = (2,4), S2 = (3,4), S3 = (3,3); on a Class-1 arrival the agent compares Q(s1,A1) with Q(s1,R1), on a Class-2 arrival it compares Q(s3,A2) with Q(s3,R2).]
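
A sketch of the admission decision the diagram suggests, on the assumption that the state counts the calls carried per class and that A/R stand for accept/reject; none of these names come from the slide itself:

```python
# Call admission control with learned Q-values: on a class-k arrival in state s,
# accept if the learned value of accepting is at least that of rejecting.
# The acceptance criterion on the slide (maximize network revenue) would enter
# through the reward, e.g. the revenue earned by an accepted call.
def admit(Q, state, call_class):
    return Q[(state, ("accept", call_class))] >= Q[(state, ("reject", call_class))]
```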

  17. Continuous Time Q-learning for CAC [Bratke]. [Timeline: event times t0 = 0, t1, t2, ..., tn marked by call arrivals and call departures; the system state (x, then y, ...) changes only at these events.]
