
Tópicos Especiais em Aprendizagem



  1. Tópicos Especiais em Aprendizagem Prof. Reinaldo Bianchi Centro Universitário da FEI 2007

  2. Objective of this Lecture • Reinforcement Learning: • Monte Carlo Methods. • Temporal Difference Learning. • Eligibility Traces. • Today's lecture: chapters 5, 6 and 7 of Sutton & Barto.

  3. Recalling the previous lecture.

  4. What is Reinforcement Learning? • Learning by interaction. • Goal-oriented learning. • Learning about, from, and while interacting with an external environment. • Learning what to do: • How to map situations to actions. • Maximizing a numerical reward signal.

  5. The Agent in RL • Situated in time. • Continuous learning and planning. • Its goal is to affect the environment. [Diagram: Agent and Environment connected by Action, State and Reward.]

  6. Elements of RL • Policy: what to do. • Reward: what is good. • Value: what is good because it predicts reward. • Model: what causes what. [Diagram: Policy, Reward, Value, Model of environment.]

  7. The Agent-Environment Interface • The interaction produces the trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$
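
A minimal sketch of this interaction loop in Python (illustrative only: the `env` object with `reset()`/`step()` and the `policy` function are assumptions, not part of the slides):

```python
# Minimal agent-environment loop (illustrative names, not from the slides).
def run_episode(env, policy, max_steps=1000):
    """Generate one trajectory s_t, a_t, r_{t+1}, s_{t+1}, ..."""
    trajectory = []
    state = env.reset()                 # s_0
    for t in range(max_steps):
        action = policy(state)          # a_t chosen by the current policy
        next_state, reward, done = env.step(action)  # r_{t+1}, s_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:                        # terminal state reached
            break
    return trajectory
```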

  8. The Agent Learns a Policy • Reinforcement learning methods specify how the agent changes its policy as a result of experience. • Roughly, the agent’s goal is to get as much reward as it can over the long run.

  9. Goals and Rewards • Is a scalar reward signal an adequate notion of a goal?—maybe not, but it is surprisingly flexible. • A goal should specify what we want to achieve, not how we want to achieve it. • A goal must be outside the agent’s direct control—thus outside the agent. • The agent must be able to measure success: • explicitly; • frequently during its lifespan.

  10. Returns • Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. • The return is $R_t = r_{t+1} + r_{t+2} + \cdots + r_T$, where T is a final time step at which a terminal state is reached, ending an episode.

  11. Important!!! • These are very different things: • Reward (rt): • What the agent receives when it takes an action. • Return (Rt): • The accumulated reward from time t onward. • The relation between one and the other: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$ • Expected Return (E{Rt}): • What we want to maximize.

  12. Returns for Continuing Tasks • Continuing tasks: interaction does not have natural episodes. • Discounted return: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.
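
As a small illustration (not from the slides), the discounted return of a finite reward sequence can be accumulated backwards using $R_t = r_{t+1} + \gamma R_{t+1}$:

```python
def discounted_return(rewards, gamma=0.9):
    """Return R_0 = r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite reward list."""
    R = 0.0
    for r in reversed(rewards):   # accumulate from the end: R_t = r_{t+1} + gamma * R_{t+1}
        R = r + gamma * R
    return R

print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))  # 1 + 0.9**3 * 5 = 4.645
```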

  13. The Markov Property • Ideally, a state should summarize past sensations so as to retain all “essential” information, i.e., it should have the Markov Property: $\Pr\{s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t\}$ for all $s'$, $r$, and histories.

  14. Defining a Markov Decision Process • To define a finite MDP, you need to give: • the state set S and the action sets A(s). • one-step “dynamics” defined by transition probabilities: $P_{ss'}^{a} = \Pr\{s_{t+1}=s' \mid s_t=s, a_t=a\}$ • reward expectations: $R_{ss'}^{a} = E\{r_{t+1} \mid s_t=s, a_t=a, s_{t+1}=s'\}$
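
One possible in-memory sketch of such a finite MDP (the two states, two actions, and all numbers below are made up for illustration):

```python
# Hypothetical 2-state, 2-action finite MDP (all names and numbers invented).
# P[s][a] maps each successor s' to its transition probability P^a_{ss'};
# R[(s, a, s')] is the expected immediate reward R^a_{ss'}.
P = {
    "s0": {"a0": {"s0": 0.7, "s1": 0.3}, "a1": {"s1": 1.0}},
    "s1": {"a0": {"s0": 1.0},            "a1": {"s0": 0.4, "s1": 0.6}},
}
R = {
    ("s0", "a0", "s0"): 0.0, ("s0", "a0", "s1"): 1.0,
    ("s0", "a1", "s1"): 2.0,
    ("s1", "a0", "s0"): 0.0,
    ("s1", "a1", "s0"): -1.0, ("s1", "a1", "s1"): 0.5,
}
```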

  15. Value Functions • The value of a state is the expected return starting from that state; it depends on the agent’s policy. • State-value function for policy π: $V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$

  16. Value Functions • The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π. • Action-value function for policy π: $Q^{\pi}(s,a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$

  17. Bellman Equation for a Policy π • The basic idea: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma R_{t+1}$ • So: $V^{\pi}(s) = E_{\pi}\{r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\}$ • Or, without the expectation operator: $V^{\pi}(s) = \sum_a \pi(s,a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^{\pi}(s') \right]$

  18. Policy Iteration • Interleave policy evaluation (E) and policy improvement / “greedification” (I): $\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} V^*$

  19. Policy Iteration

  20. Value Iteration • Recall the full policy evaluation backup: $V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V_k(s') \right]$ • Here is the full value iteration backup: $V_{k+1}(s) = \max_a \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V_k(s') \right]$
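
A minimal value iteration sketch over the dictionary-style MDP sketched earlier (an assumed representation, not the lecture's own code):

```python
def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Approximate V* by repeatedly applying the full value iteration backup."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            backups = [
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[s][a].items())
                for a in P[s]
            ]
            new_v = max(backups)        # V_{k+1}(s) = max_a sum_{s'} P [R + gamma V_k(s')]
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:               # stop when values stop changing
            return V
```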

  21. Value Iteration Cont.

  22. End of the Review • Important: • The basic concepts must be well understood. • Problem: • DP requires the state-transition model P. • How can we solve the problem if the model is not known?

  23. Monte Carlo Methods • Chapter 5 of Sutton & Barto.

  24. Monte Carlo Methods • Monte Carlo methods allow learning from complete sample returns. • Defined for episodic tasks. • Monte Carlo methods allow learning directly from experience: • On-line: no model is needed to reach the optimal solution. • Simulated: no complete model is needed.

  25. Wikipedia: Monte Carlo Definition • Monte Carlo methods are a widely used class of computational algorithms for simulating the behavior of various physical and mathematical systems. • They are distinguished from other simulation methods (such as molecular dynamics) by being stochastic, usually by using random numbers, as opposed to deterministic algorithms. • Because of the repetition of algorithms and the large number of calculations involved, Monte Carlo needs large computing power.

  26. Wikipedia: Monte Carlo Definition • A Monte Carlo algorithm is a numerical Monte Carlo method used to find solutions to mathematical problems (which may have many variables) that cannot easily be solved, for example, by integral calculus, or other numerical methods. • For many types of problems, its efficiency relative to other numerical methods increases as the dimension of the problem increases.

  27. Monte Carlo principle • Consider the game of solitaire: what’s the chance of winning with a properly shuffled deck? • Hard to compute analytically because winning or losing depends on a complex procedure of reorganizing cards. • Insight: why not just play a few hands, and see empirically how many do in fact win? • [Figure: four simulated hands (Lose, Lose, Win, Lose), so the estimated chance of winning is 1 in 4.] • More generally, we can approximate a probability density function using only samples from that density. http://nlp.stanford.edu/local/talks/mcmc_2004_07_01.ppt

  28. Monte Carlo principle • Given a very large set X and a distribution p(x) over it. • We draw a set of N samples $\{x^{(i)}\}_{i=1}^{N}$ from p(x). • We can then approximate the distribution using these samples: $p_N(x) = \frac{1}{N} \sum_{i=1}^{N} \delta_{x^{(i)}}(x)$

  29. Monte Carlo principle • We can also use these samples to compute expectations: $E[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})$ • And even use them to find a maximum: $\hat{x} = \arg\max_{x^{(i)}} p(x^{(i)})$
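
A tiny numerical illustration of the expectation estimator (the choice of a standard normal p(x) and f(x) = x², and the sample size, are assumptions made just for this example):

```python
import random

# Draw N samples from p(x) (here a standard normal) and estimate E[f(x)].
N = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(N)]

f = lambda x: x * x
estimate = sum(f(x) for x in samples) / N   # (1/N) * sum_i f(x^(i))
print(estimate)                             # close to the true value E[x^2] = 1
```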

  30. Monte Carlo Example: Approximation of π (the number)... • If a circle of radius r = 1 is inscribed inside a square with side length L = 2, then we obtain: area of the circle = πr² = π, area of the square = L² = 4, so the ratio of the areas is π/4. http://twt.mpei.ac.ru/MAS/Worksheets/approxpi.mcd

  31. MC Example: Approximation of π (the number)... • Inside the square, we can place N points at random, with uniform distribution over the (x,y) coordinates. • Then we count how many points have fallen inside the circle. http://twt.mpei.ac.ru/MAS/Worksheets/approxpi.mcd

  32. MC Example: Approximation of π (the number)... • If N is large enough, the ratio $N_{circle}/N$ approaches the ratio of the areas, so $N_{circle}/N \approx \pi/4$ and therefore $\pi \approx 4\,N_{circle}/N$. http://twt.mpei.ac.ru/MAS/Worksheets/approxpi.mcd

  33. MC Example: Approximation of π (the number)... • For N = 1000: • NCircle = 768 • Pi = 3.072 • Error = 0.07

  34. MC Example: Approximation of π (the number)... • For N = 10000: • NCircle = 7802 • Pi = 3.1208 • Error = 0.021

  35. MC Example: Approximation of π (the number)... • For N = 100000: • NCircle = 78559 • Pi = 3.1426 • Error = 0.008
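
A sketch that reproduces this experiment (the counts and errors will differ from the slide's numbers because the points are random):

```python
import random

def estimate_pi(n_points):
    """Throw n_points uniformly into the square [-1, 1] x [-1, 1] and count hits in the unit circle."""
    n_circle = 0
    for _ in range(n_points):
        x, y = random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:       # inside the circle of radius 1
            n_circle += 1
    return 4.0 * n_circle / n_points   # pi ~= 4 * N_circle / N

for n in (1_000, 10_000, 100_000):
    print(n, estimate_pi(n))
```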

  36. Monte Carlo Policy Evaluation • Goal: learn $V^{\pi}(s)$ • Given: some number of episodes under π which contain s • Idea: average the returns observed after visits to s

  37. Monte Carlo Policy Evaluation • Every-Visit MC: average returns for every time s is visited in an episode • First-visit MC: average returns only for first time s is visited in an episode • Both converge asymptotically.

  38. First-visit Monte Carlo policy evaluation
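
A minimal sketch of first-visit MC policy evaluation, assuming each episode is a list of (state, action, reward) steps as produced by the interaction loop sketched earlier:

```python
from collections import defaultdict

def first_visit_mc_evaluation(episodes, gamma=1.0):
    """Estimate V(s) by averaging the return that follows the first visit to s in each episode.
    Each episode is a list of (state, action, reward) steps."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        first_visit_returns = {}
        # Walk the episode backwards so G is the return following each step;
        # the entry for each state is overwritten until only its first visit remains.
        for state, _action, reward in reversed(episode):
            G = reward + gamma * G
            first_visit_returns[state] = G
        for state, G in first_visit_returns.items():
            returns_sum[state] += G
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```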

  39. Blackjack example • Objective: have your card sum be greater than the dealer’s without exceeding 21. • States (200 of them): • current sum (12-21) • dealer’s showing card (ace-10) • do I have a usable ace? • Reward: +1 for winning, 0 for a draw, -1 for losing • Actions: stick (stop receiving cards), hit (receive another card) • Policy: stick if my sum is 20 or 21, else hit

  40. Blackjack value functions

  41. Backup diagram for Monte Carlo • Entire episode included • Only one choice at each state (unlike DP) • MC does not bootstrap • Time required to estimate one state does not depend on the total number of states

  42. Monte Carlo Estimation of Action Values (Q) • Monte Carlo is most useful when a model is not available • We want to learn Q* • $Q^{\pi}(s,a)$: the average return starting from state s and action a, thereafter following π • Also converges asymptotically if every state-action pair is visited • Exploring starts: every state-action pair has a non-zero probability of being the starting pair

  43. Monte Carlo Control • MC policy iteration: Policy evaluation using MC methods followed by policy improvement • Policy improvement step: greedify with respect to value (or action-value) function

  44. Convergence of MC Control • The policy improvement theorem tells us that: $Q^{\pi_k}(s, \pi_{k+1}(s)) = \max_a Q^{\pi_k}(s, a) \ge Q^{\pi_k}(s, \pi_k(s)) = V^{\pi_k}(s)$ • This assumes exploring starts and an infinite number of episodes for MC policy evaluation • To solve the latter: • update only to a given level of performance • alternate between evaluation and improvement per episode

  45. Monte Carlo Exploring Starts • Fixed point is the optimal policy π* • Proof is an open question
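
A compact sketch in the spirit of Monte Carlo ES, not the book's exact algorithm box; the `generate_episode(s0, a0, policy)` helper, `states` list and `actions(s)` function are assumptions:

```python
import random
from collections import defaultdict

def mc_exploring_starts(generate_episode, states, actions, n_episodes=10_000, gamma=1.0):
    """Monte Carlo ES sketch: first-visit averaging of Q plus per-episode greedification.
    `generate_episode(s0, a0, policy)` must return a list of (state, action, reward) steps."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {s: random.choice(actions(s)) for s in states}
    for _ in range(n_episodes):
        s0 = random.choice(states)            # exploring start: random initial state...
        a0 = random.choice(actions(s0))       # ...and random initial action
        episode = generate_episode(s0, a0, policy)
        G, first_returns = 0.0, {}
        for s, a, r in reversed(episode):     # return following each (s, a)
            G = r + gamma * G
            first_returns[(s, a)] = G         # ends up holding the first-visit return
        for (s, a), ret in first_returns.items():
            counts[(s, a)] += 1
            Q[(s, a)] += (ret - Q[(s, a)]) / counts[(s, a)]        # incremental average
            policy[s] = max(actions(s), key=lambda b: Q[(s, b)])   # greedify at s
    return Q, policy
```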

  46. Blackjack example continued • Exploring starts • Initial policy as described before

  47. On-policy Monte Carlo Control • On-policy: learn about the policy currently being executed. • How do we get rid of exploring starts? • Need soft policies: π(s,a) > 0 for all s and a • e.g., an ε-soft policy: non-greedy actions have probability $\frac{\varepsilon}{|A(s)|}$ and the greedy action has probability $1 - \varepsilon + \frac{\varepsilon}{|A(s)|}$ • Similar to GPI: move the policy towards the greedy policy (i.e., ε-greedy) • Converges to the best ε-soft policy

  48. On-policy Monte Carlo Control
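
A sketch of the on-policy variant, replacing exploring starts with the ε-soft (ε-greedy) policy of slide 47; the `run_episode(select_action)` helper and `actions(s)` function are again assumptions:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, actions, state, epsilon=0.1):
    """Greedy action with probability 1 - eps + eps/|A(s)|, each other action with eps/|A(s)|."""
    if random.random() < epsilon:
        return random.choice(actions(state))
    return max(actions(state), key=lambda a: Q[(state, a)])

def on_policy_mc_control(run_episode, actions, n_episodes=10_000, gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control sketch with an eps-soft behavior policy.
    `run_episode(select_action)` must return a list of (state, action, reward) steps."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(n_episodes):
        episode = run_episode(lambda s: epsilon_greedy(Q, actions, s, epsilon))
        G, first_returns = 0.0, {}
        for s, a, r in reversed(episode):
            G = r + gamma * G
            first_returns[(s, a)] = G                   # first-visit return for (s, a)
        for (s, a), ret in first_returns.items():
            counts[(s, a)] += 1
            Q[(s, a)] += (ret - Q[(s, a)]) / counts[(s, a)]
    return Q
```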

  49. On-policy MC Control
