
Tópicos Especiais em Aprendizagem

Tópicos Especiais em Aprendizagem. Prof. Reinaldo Bianchi, Centro Universitário da FEI, 2006. Objective of this lecture: Reinforcement Learning, covering Monte Carlo methods, Temporal-Difference learning, and Eligibility Traces. Today's lecture covers chapters 5, 6 and 7 of Sutton & Barto.





Presentation Transcript


  1. Tópicos Especiais em Aprendizagem • Prof. Reinaldo Bianchi • Centro Universitário da FEI • 2006

  2. Objective of this Lecture • Reinforcement Learning: • Monte Carlo Methods. • Temporal-Difference Learning. • Eligibility Traces. • Today's lecture: chapters 5, 6 and 7 of Sutton & Barto.

  3. Temporal-Difference Learning • Chapter 6 of the Sutton & Barto book.

  4. Temporal Difference Learning • Objectives of this chapter... • Introduce Temporal Difference (TD) learning • Focus first on policy evaluation, or prediction, methods • Then extend to control methods

  5. Central Idea • If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. • TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. • Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics. • Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

  6. TD(0) Prediction • Policy evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^\pi$. • Recall the simple every-visit Monte Carlo update: $V(s_t) \leftarrow V(s_t) + \alpha [R_t - V(s_t)]$, whose target $R_t$ is the actual return after time $t$. • The simplest TD method, TD(0), uses $V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$, whose target $r_{t+1} + \gamma V(s_{t+1})$ is an estimate of the return.

  7. TD(0) Prediction • We know from DP that $V^\pi(s) = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \}$; TD(0) uses a single sample of this expectation, with the current estimate $V$ in place of $V^\pi$, as its target.

  8. Simple Monte Carlo • [Backup diagram: a complete sample trajectory from $s_t$ down to a terminal state T; the full return is observed before $V(s_t)$ is updated.]

  9. Simplest TD Method • [Backup diagram: a single sample transition from $s_t$ to $s_{t+1}$; the update bootstraps from $V(s_{t+1})$ instead of waiting for a terminal state T.]

  10. cf. Dynamic Programming • [Backup diagram: a full one-step expectation over all possible successor states, computed from a model; it bootstraps from the current estimates $V(s_{t+1})$.]
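The three backup diagrams above correspond to three different update targets. As a summary, in the chapter's notation (this side-by-side listing is an editorial sketch, not reproduced from a slide):

$$
\begin{aligned}
\text{MC target (sampled full return):}\quad & R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T \\
\text{TD(0) target (one sample step, then bootstrap):}\quad & r_{t+1} + \gamma V(s_{t+1}) \\
\text{DP target (full expectation, requires a model):}\quad & E_\pi\{\, r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s \,\}
\end{aligned}
$$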

  11. Tabular TD(0) Algorithm
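The algorithm box itself is not reproduced in the transcript. Below is a minimal Python sketch of tabular TD(0), assuming a hypothetical environment interface (`env.reset()` and `env.step(action)` returning `(next_state, reward, done)`) and a fixed `policy(state)` function; none of these names come from the slides.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation (sketch).

    env.reset() -> initial state; env.step(a) -> (next_state, reward, done);
    policy(state) -> action.  These interfaces are assumptions.
    """
    V = defaultdict(float)            # state-value estimates, initialized to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])   # TD(0) target
            V[s] += alpha * (target - V[s])                      # TD(0) update
            s = s_next
    return V
```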

  12. TD Bootstraps and Samples • Bootstrapping: update involves an estimate • MC does not bootstrap • DP bootstraps • TD bootstraps • Sampling: update does not involve an expected value • MC samples • DP does not sample • TD samples

  13. Example: Driving Home

  14. Driving Home • Changes recommended by Monte Carlo methods ($\alpha = 1$) • Changes recommended by TD methods ($\alpha = 1$)
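To make the two kinds of changes concrete, the sketch below applies both updates with $\alpha = 1$ to a hypothetical trip home; the state names and minute values are illustrative stand-ins, not a transcription of the slide's table.

```python
# Hypothetical driving-home data: (state, minutes spent getting to the next
# state, predicted minutes-to-go from this state).  Values are illustrative.
trajectory = [
    ("leaving office",  5, 30),
    ("reach car",      15, 35),
    ("exit highway",   10, 15),
    ("secondary road", 10, 10),
    ("home street",     3,  3),
]

# Actual time-to-go observed from each state (computed backwards).
actual_remaining = []
remaining = 0
for _, elapsed, _ in reversed(trajectory):
    remaining += elapsed
    actual_remaining.insert(0, remaining)

for i, (state, elapsed, v) in enumerate(trajectory):
    # Monte Carlo change (alpha = 1): move all the way to the actual outcome.
    mc_change = actual_remaining[i] - v
    # TD(0) change (alpha = 1): move to elapsed time plus the next prediction.
    v_next = trajectory[i + 1][2] if i + 1 < len(trajectory) else 0
    td_change = elapsed + v_next - v
    print(f"{state:>15}: MC change {mc_change:+d}, TD change {td_change:+d}")
```

The MC column only becomes available once the trip is over; the TD column can be computed after each individual step.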

  15. Advantages of TD Learning • TD methods do not require a model of the environment, only experience. • TD, but not MC, methods can be fully incremental: • You can learn before knowing the final outcome • Less memory • Less peak computation • You can learn without the final outcome • From incomplete sequences • Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

  16. Random Walk Example • Empirically compare the prediction abilities of TD(0) and constant-α MC applied to the small Markov process below: a chain of five states, A through E, with a terminal state at each end.

  17. Random Walk Example • All episodes start in the center state, C, and proceed either left or right by one state on each step, with equal probability. • Episodes terminate either on the extreme left or the extreme right. • Rewards: • When an episode terminates on the right a reward of +1 occurs; • all other rewards are zero.
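A minimal sketch of this random walk as an episode generator (state labels A through E, start at C, reward +1 only on right termination); the function name and data layout are assumptions. Episodes produced this way can be fed to the TD(0) sketch above or to a constant-α MC update.

```python
import random

STATES = ["A", "B", "C", "D", "E"]       # non-terminal states; C is the start

def generate_episode():
    """Simulate one episode of the 5-state random walk.

    Returns a list of (state, reward) pairs; the reward is +1 only on the
    transition that terminates off the right end, 0 otherwise.
    """
    pos = 2                               # index of C
    episode = []
    while True:
        state = STATES[pos]
        pos += random.choice([-1, +1])    # left or right with equal probability
        if pos < 0:                       # terminated on the left: reward 0
            episode.append((state, 0))
            return episode
        if pos >= len(STATES):            # terminated on the right: reward +1
            episode.append((state, 1))
            return episode
        episode.append((state, 0))        # non-terminal transition: reward 0
```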

  18. Random Walk Example • Values learned by TD(0) after various numbers of episodes.

  19. Random Walk Example • The final estimate is about as close as the estimates ever get to the true values. • With a constant step-size parameter (in this example), the values fluctuate indefinitely in response to the outcomes of the most recent episodes.

  20. TD and MC on the Random Walk • Learning curves for TD(0) and constant-α MC. • Data averaged over 100 sequences of episodes.

  21. Optimality of TD(0) • Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence. • Compute updates according to TD(0), but only update the estimates after each complete pass through the data. • For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α. • Constant-α MC also converges under these conditions, but to a different answer!
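A sketch of batch TD(0) as described on this slide, assuming episodes are stored as lists of (state, reward, next_state) transitions with `next_state = None` at termination; the storage format and convergence test are assumptions.

```python
def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-6):
    """Batch TD(0): sweep repeatedly over a fixed set of episodes,
    accumulating increments and applying them only after each full pass."""
    V = {}
    for ep in episodes:
        for s, _, _ in ep:
            V.setdefault(s, 0.0)
    while True:
        increments = {s: 0.0 for s in V}
        for ep in episodes:
            for s, r, s_next in ep:
                target = r + (gamma * V[s_next] if s_next is not None else 0.0)
                increments[s] += alpha * (target - V[s])
        for s in V:
            V[s] += increments[s]
        if max(abs(d) for d in increments.values()) < tol:   # converged
            return V
```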

  22. Random Walk under Batch Updating • After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. • The whole experiment was repeated 100 times.

  23. Example: You are the Predictor • Suppose you observe the following 8 episodes: A, 0, B, 0; B, 1; B, 1; B, 1; B, 1; B, 1; B, 1; B, 0.

  24. You are the Predictor • Given these episodes, what are the best values for the estimates V(A) and V(B)?

  25. You are the Predictor • The prediction that best matches the training data is V(A) = 0 • This minimizes the mean-squared error on the training set • This is what a batch Monte Carlo method gets • If we consider the sequentiality of the problem, then we would set V(A) = 0.75 • This is correct for the maximum-likelihood estimate of a Markov model generating the data • i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?) • This is called the certainty-equivalence estimate • This is what TD(0) gets
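Both answers can be checked directly from the eight episodes listed above. The sketch below computes the batch Monte Carlo estimate of V(A) and the certainty-equivalence estimate obtained from the best-fit Markov model; the episode encoding is an assumption.

```python
# The eight observed episodes, written as (state, reward) sequences.
episodes = [
    [("A", 0), ("B", 0)],
    [("B", 1)], [("B", 1)], [("B", 1)],
    [("B", 1)], [("B", 1)], [("B", 1)],
    [("B", 0)],
]

# Batch Monte Carlo: average the (undiscounted) return observed from A.
returns_from_A = [sum(r for _, r in ep) for ep in episodes if ep[0][0] == "A"]
v_a_mc = sum(returns_from_A) / len(returns_from_A)        # -> 0.0

# Certainty equivalence: in the maximum-likelihood Markov model, A always
# transitions to B with reward 0, so V(A) = 0 + V(B), and V(B) is the
# average immediate outcome observed from B.
b_rewards = [r for ep in episodes for s, r in ep if s == "B"]
v_b = sum(b_rewards) / len(b_rewards)                      # -> 6/8 = 0.75
v_a_ce = 0 + v_b                                           # -> 0.75

print(v_a_mc, v_a_ce)
```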

  26. And now for some methods...

  27. Learning An Action-Value Function • We turn now to the use of TD prediction methods for the control problem. • As usual, we follow the pattern of generalized policy iteration (GPI), only this time using TD methods for the evaluation or prediction part. • As with Monte Carlo methods, we face the need to trade off exploration and exploitation, and again approaches fall into two main classes: • on-policy and • off-policy.

  28. Sarsa: On-Policy TD Control • First, learn an action-value function rather than a state-value function; the TD(0) update applied to Q is $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]$.

  29. Sarsa: On-Policy TD Control • This update is done after every transition from a nonterminal state $s_t$. • If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1})$ is defined as zero. • This rule uses every element of the quintuple of events, $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, that make up a transition from one state-action pair to the next. • This quintuple gives rise to the name Sarsa for the algorithm.

  30. Sarsa: On-Policy TD Control Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:
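A minimal sketch of the resulting Sarsa control loop with ε-greedy action selection; the environment interface (`reset`, `step`, `actions`) is the same hypothetical one used in the earlier TD(0) sketch, not something defined in the slides.

```python
import random
from collections import defaultdict

def sarsa(env, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    """On-policy TD control (Sarsa) sketch."""
    Q = defaultdict(float)                       # Q[(state, action)]

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next) if not done else None
            q_next = 0.0 if done else Q[(s_next, a_next)]      # terminal Q is 0
            Q[(s, a)] += alpha * (r + gamma * q_next - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```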

  31. Windy Gridworld undiscounted, episodic, reward = –1 until goal
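For completeness, a sketch of a windy-gridworld environment that the Sarsa sketch above could be run on. The grid size (7 x 10), per-column wind strengths, start, and goal below follow the usual textbook version of this example and are assumptions here, since the slide's figure is not reproduced in the transcript.

```python
class WindyGridworld:
    """7 x 10 gridworld with an upward wind per column (assumed layout)."""
    WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]      # upward push in each column
    START, GOAL = (3, 0), (3, 7)
    actions = ["up", "down", "left", "right"]
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def reset(self):
        self.pos = self.START
        return self.pos

    def step(self, action):
        row, col = self.pos
        d_row, d_col = self.MOVES[action]
        row = row + d_row - self.WIND[col]      # wind pushes the agent upward
        col = col + d_col
        row = min(max(row, 0), 6)               # clip to the 7 x 10 grid
        col = min(max(col, 0), 9)
        self.pos = (row, col)
        done = self.pos == self.GOAL
        return self.pos, -1, done               # reward -1 until the goal
```

With both sketches in place, `Q = sarsa(WindyGridworld())` would run the kind of experiment summarized on the next slide.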

  32. Results of Sarsa on the Windy Gridworld

  33. Q-Learning: Off-Policy TD Control • One of the most important breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989). • Its simplest form, one-step Q-learning, is defined by $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$.

  34. Q-Learning: Off-Policy TD Control • In this case, the learned action-value function, Q, directly approximates Q*, the optimal action-value function, independent of the policy being followed. • This dramatically simplifies the analysis of the algorithm and enabled early convergence proofs: • all that is required for correct convergence is that all state-action pairs continue to be updated.

  35. Q-Learning Algorithm
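The algorithm box is again missing from the transcript; below is a minimal one-step Q-learning sketch using the same assumed environment interface as the Sarsa example.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Off-policy TD control (one-step Q-learning) sketch."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)                                   # behavior policy
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # greedy target
            s = s_next
    return Q
```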

  36. Example 6.6: Cliff Walking • This example compares Sarsa and Q-learning, highlighting the difference between on-policy (Sarsa) and off-policy (Q-learning) methods. • Consider the gridworld shown in the upper part of Figure 6.13. • Standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left. • Reward is –1 on all transitions except those into the region marked "The Cliff." Stepping into this region incurs a reward of –100 and sends the agent instantly back to the start.

  37. Example 6.6: Cliff Walking • ε-greedy action selection, ε = 0.1

  38. Example 6.6: Cliff Walking • After an initial transient, Q-learning learns values for the optimal policy, the one that travels right along the edge of the cliff. • Unfortunately, this results in its occasionally falling off the cliff because of the ε-greedy action selection. • Sarsa, on the other hand, takes the action selection into account and learns the longer but safer path through the upper part of the grid.

  39. Example 6.6: Cliff Walking • Although Q-learning actually learns the values of the optimal policy, its on-line performance is worse than that of Sarsa, which learns the roundabout policy. • Of course, if ε were gradually reduced, then both methods would asymptotically converge to the optimal policy.

  40. Actor-Critic Methods • Explicit representation of policy as well as value function • Minimal computation to select actions • Can learn an explicit stochastic policy • Can put constraints on policies • Appealing as psychological and neural models

  41. Actor-Critic Details • The critic evaluates the current policy with the TD error $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$. • The critic updates the value function, $V(s_t) \leftarrow V(s_t) + \alpha \delta_t$, while the actor updates its action preferences, $p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$, with actions chosen from a softmax (Gibbs) distribution over the preferences.
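The details slide itself is not reproduced in the transcript. The sketch below follows the standard tabular actor-critic described above: the critic learns V with the TD error, and the actor adjusts softmax action preferences by the same error. The step sizes and the environment interface are assumptions.

```python
import math
import random
from collections import defaultdict

def actor_critic(env, num_episodes=500, alpha=0.1, beta=0.1, gamma=1.0):
    """Tabular actor-critic sketch: TD(0) critic plus a softmax-preference actor."""
    V = defaultdict(float)        # critic: state values
    p = defaultdict(float)        # actor: action preferences p(s, a)

    def softmax_action(s):
        weights = [math.exp(p[(s, a)]) for a in env.actions]
        total = sum(weights)
        return random.choices(env.actions, weights=[w / total for w in weights])[0]

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = softmax_action(s)
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
            V[s] += alpha * delta                                     # critic update
            p[(s, a)] += beta * delta                                 # actor update
            s = s_next
    return V, p
```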

  42. Summary – TD Learning • TD prediction • Introduced one-step tabular model-free TD(0) methods • Extend prediction to control by employing some form of GPI • On-policy control: Sarsa • Off-policy control: Q-learning • These methods bootstrap and sample, combining aspects of DP and MC methods

  43. Eligibility Traces • Chapter 7 of Sutton & Barto.

  44. Eligibility Traces

  45. N-step TD Prediction • Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps)

  46. Mathematics of N-step TD Prediction • Monte Carlo uses the complete return as its target: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T$. • TD uses the 1-step return, $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$, i.e., it uses $V$ to estimate the remaining return. • n-step TD generalizes this: the 2-step return is $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$, and the n-step return is $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$.
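A small sketch of how the n-step return $R_t^{(n)}$ defined above could be computed from a stored episode; the data layout (parallel lists of states and rewards) is an assumption.

```python
def n_step_return(rewards, states, t, n, V, gamma=1.0):
    """Compute the n-step return R_t^(n) from time t.

    rewards[k] is r_{k+1} (the reward received on leaving states[k]);
    states[k] is s_k.  If the episode ends within n steps, the return is
    just the discounted sum of the remaining rewards (as in Monte Carlo).
    """
    T = len(rewards)                        # episode ends after T steps
    horizon = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if horizon < T:                         # bootstrap from V at s_{t+n}
        g += gamma ** n * V[states[horizon]]
    return g
```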

  47. Learning with N-step Backups • Backup (on-line or off-line): $\Delta V_t(s_t) = \alpha [R_t^{(n)} - V_t(s_t)]$. • Error reduction property of n-step returns: $\max_s | E_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) | \le \gamma^n \max_s | V_t(s) - V^\pi(s) |$, i.e., the maximum error using the n-step return is at most $\gamma^n$ times the maximum error using $V$.

  48. Random Walk Examples • How does 2-step TD work here? • How about 3-step TD?

  49. A Larger Example • Task: 19 state random walk • Do you think there is an optimal n (for everything)?

  50. Averaging N-step Returns • n-step methods were introduced mainly to help with understanding TD(λ) • Idea: back up an average of several returns, e.g., half of the 2-step return and half of the 4-step return: $R_t^{avg} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$ • This is called a complex backup • In the backup diagram, draw each component and label it with its weight
