This lecture explores advanced decision-making techniques in intelligent systems, focusing on reinforcement learning (RL) and Monte Carlo methods. It covers the full RL problem, including the Markov Decision Process (MDP) model, dynamic programming (DP), and methods for RL without an MDP. Key topics include policy evaluation, Monte Carlo policy evaluation, action value estimation, and the distinctions between on-policy and off-policy methods. Examples such as Blackjack are used to illustrate these concepts, emphasizing the advantages of learning through direct interaction with environments.
Decision Making in Intelligent Systems, Lecture 4
BSc course Kunstmatige Intelligentie 2008
Bram Bakker
Intelligent Systems Lab Amsterdam, Informatics Institute, Universiteit van Amsterdam
bram@science.uva.nl
Overview of this lecture • Solving the full RL problem • Given that we have the MDP model • Dynamic Programming (DP) • Given that we do not have the MDP model • Monte Carlo (MC) methods • Temporal Difference (TD) methods
Summary of last lecture: dynamic programming • Policy evaluation: estimate value of current policy • Policy improvement: determine new current policy by being greedy w.r.t. current value function • Policy iteration: alternate the above two processes • Value iteration: combine the above two processes in 1 step • Bootstrapping: updating estimates based on your own estimates • Full backups (to be contrasted with sample backups) • Generalized Policy Iteration (GPI)
Monte Carlo Methods • Monte Carlo methods learn from complete sample returns • Only defined for episodic tasks • Monte Carlo methods learn directly from experience • On-line: no model necessary, and still attains optimality • Simulated: no need for a full model
Monte Carlo Policy Evaluation • Goal: learn Vπ(s) • Given: some number of episodes under π which contain s • Idea: average returns observed after visits to s • Every-visit MC: average returns for every time s is visited in an episode • First-visit MC: average returns only for the first time s is visited in an episode • Both converge asymptotically
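A minimal sketch of first-visit MC policy evaluation, assuming undiscounted episodes given as lists of (state, reward) pairs, where the reward is the one received on the transition out of that state (the function and variable names are illustrative, not from the slides):

```python
from collections import defaultdict

def first_visit_mc(episodes):
    """First-visit Monte Carlo policy evaluation (undiscounted sketch).

    episodes: list of episodes, each a list of (state, reward) pairs
    generated by following the policy pi.
    Returns a dict mapping each state to the average of the returns
    observed after its first visit in each episode.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        rewards = [r for (_, r) in episode]
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s in seen:            # first-visit: only the first occurrence counts
                continue
            seen.add(s)
            G = sum(rewards[t:])     # actual return following time t (gamma = 1)
            returns_sum[s] += G
            returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```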
Monte Carlo [backup diagram: a single sampled trajectory followed all the way to a terminal state T]
cf. Dynamic Programming [backup diagram: a full one-step backup over all possible successor states]
Blackjack example • Object: Have your card sum be greater than the dealer's without exceeding 21. • States (200 of them): • current sum (12-21) • dealer's showing card (ace-10) • do I have a usable ace? • Reward: +1 for winning, 0 for a draw, -1 for losing • Actions: stick (stop receiving cards), hit (receive another card) • Initial Policy: Stick if my sum is 20 or 21, else hit
Monte Carlo Estimation of Action Values (Q) • Monte Carlo is most useful when a model is not available • We want to learn Q*: state-action values • Qπ(s,a): the average return starting from state s, taking action a, and thereafter following π • Also converges asymptotically if every state-action pair is visited • Exploring starts: every state-action pair has a non-zero probability of being the starting pair
Monte Carlo Control • MC policy iteration: Policy evaluation using MC methods followed by policy improvement • Policy improvement step: greedify with respect to value (or action-value) function
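A hedged sketch of the policy improvement (greedification) step in MC control, assuming a tabular action-value dictionary and a finite action set (all names are illustrative):

```python
def greedy_policy(Q, states, actions):
    """Policy improvement step of MC policy iteration:
    make the policy greedy with respect to the current action-value estimate.

    Q: dict mapping (state, action) -> estimated return
    Returns a dict mapping each state to its greedy action.
    """
    return {s: max(actions, key=lambda a: Q.get((s, a), 0.0)) for s in states}
```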
Blackjack example continued • Exploring starts • Initial policy as described before
On-policy Monte Carlo Control • On-policy: learn about the policy currently being executed • So, learn about a policy which explores • Simplest: policy with exploring starts • More sophisticated: soft policies: π(s,a) > 0 for all s and a
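One common soft policy is ε-greedy; a minimal sketch, assuming a tabular Q and a finite action list (names illustrative):

```python
import random

def epsilon_greedy_action(Q, s, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one.
    Every action keeps probability at least epsilon / len(actions) > 0,
    so the policy stays soft and keeps exploring."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```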
Off-policy Monte Carlo control • Behavior policy generates behavior in the environment • Estimation policy is the policy being learned about • Usually: try to estimate the best deterministic policy, based on a policy which keeps exploring • Weight the returns obtained under the behavior policy to estimate state values under the estimation policy • Weight by the relative probability of the observed states and actions occurring under the two policies (importance sampling)
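A rough sketch of the weighting idea for a single episode, following the weighted importance-sampling formulation in Sutton & Barto rather than the exact slide layout; it assumes access to the action probabilities of both policies, and all names are illustrative:

```python
def weighted_is_update(episode, pi_prob, b_prob, Q, C, gamma=1.0):
    """Off-policy MC update for one episode using weighted importance sampling.

    episode: list of (state, action, reward) tuples generated by the behavior policy
    pi_prob(s, a): probability of action a in state s under the estimation policy
    b_prob(s, a):  probability of action a in state s under the behavior policy
    Q: dict (state, action) -> value estimate
    C: dict (state, action) -> cumulative importance-sampling weight
    """
    G = 0.0   # return accumulated backwards through the episode
    W = 1.0   # relative probability of the trajectory tail under the two policies
    for (s, a, r) in reversed(episode):
        G = gamma * G + r
        C[(s, a)] = C.get((s, a), 0.0) + W
        # Weighted incremental average of the returns.
        Q[(s, a)] = Q.get((s, a), 0.0) + (W / C[(s, a)]) * (G - Q.get((s, a), 0.0))
        W *= pi_prob(s, a) / b_prob(s, a)
        if W == 0.0:  # the estimation policy would never take this action; stop
            break
```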
Incremental Implementation • MC can be implemented incrementally • Saves memory • Compute the weighted average of the returns either non-incrementally (store all returns and re-average) or with an incremental equivalent that updates the running average after each new return
Simple incremental implementation: V(s_t) ← V(s_t) + α [R_t − V(s_t)], where the target R_t is the actual return after time t
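The simple incremental form above translates directly to code; a minimal sketch (names illustrative):

```python
def constant_alpha_mc_update(V, s_t, R_t, alpha=0.1):
    """Constant-alpha MC: move V(s_t) a fraction alpha toward the target,
    where the target R_t is the actual return observed after time t."""
    v = V.get(s_t, 0.0)
    V[s_t] = v + alpha * (R_t - v)
```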
Summary of MC • MC has several advantages over DP: • Can learn directly from interaction with the environment • No need for full models • No need to learn about ALL states • Less harmed by violations of the Markov property (later in the book) • MC methods provide an alternative policy evaluation process • One issue to watch for: maintaining sufficient exploration • exploring starts, soft policies • Introduced the distinction between on-policy and off-policy methods • No bootstrapping (as opposed to DP)
Overview of this lecture • Solving the full RL problem • Given that we have the MDP model • Dynamic Programming (DP) • Given that we do not have the MDP model • Monte Carlo (MC) methods • Temporal Difference (TD) methods
Temporal Difference (TD) methods • Learn from experience, like MC • Can learn directly from interaction with environment • No need for full models • Estimate values based on estimated values of next states, like DP • Bootstrapping (like DP) • Issue to watch for: maintaining sufficient exploration
Bootstrapping • A full policy-evaluation backup (DP): the new estimated value of each state s is based on the estimated old values of all possible successor states s' • Bootstrapping: estimating values based on your own estimates of values
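For reference, the full DP policy-evaluation backup has the following form (standard Sutton & Barto notation, not reproduced from the slide image); every successor state s' contributes through the model probabilities, which is why it is a full backup and requires the MDP model:

```latex
V(s) \leftarrow \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^{a}_{ss'}
      \left[ \mathcal{R}^{a}_{ss'} + \gamma V(s') \right]
```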
TD Prediction • Policy evaluation (the prediction problem): for a given policy π, compute the state-value function Vπ • Recall constant-α MC: V(s_t) ← V(s_t) + α [R_t − V(s_t)], where the target R_t is the actual return after time t • The simplest TD method, TD(0): V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)], where the target r_{t+1} + γ V(s_{t+1}) is an estimate of the return
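A minimal sketch of the TD(0) update applied online after each transition (names illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """TD(0): update V(s) toward the target r + gamma * V(s_next),
    an estimate of the return that bootstraps from the current estimate
    of the next state. At a terminal transition the successor value is 0."""
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    v = V.get(s, 0.0)
    V[s] = v + alpha * (r + gamma * v_next - v)
```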
Simple Monte Carlo [backup diagram: one sampled episode followed to its terminal state T; update toward the actual return]
Dynamic Programming [backup diagram: a full one-step backup over all possible successor states; update toward an expected value computed from current estimates]
Simplest TD Method: TD(0) [backup diagram: one sampled step, then bootstrap from the estimated value of the next state]
TD methods bootstrap and sample • Bootstrapping: update involves an estimate • MC does not bootstrap • DP bootstraps • TD bootstraps • Sampling: update does not involve an expected value • MC samples • DP does not sample • TD samples
Driving Home • Changes recommended by Monte Carlo methods (α = 1) • Changes recommended by TD methods (α = 1)
Advantages of TD Learning • TD methods do not require a model of the environment, only experience • TD, but not MC, methods can be fully incremental • You can learn before knowing the final outcome • Less memory • Less peak computation • You can learn without the final outcome • From incomplete sequences • Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?
Random Walk Example • Values learned by TD(0) after various numbers of episodes
TD and MC on the Random Walk • Data averaged over 100 sequences of episodes
Optimality of TD(0) • Batch updating: compute updates according to TD(0), but only update the estimates after each complete pass through the data (end of the episode). • Let's assume we train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence. • For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α. • Constant-α MC also converges under these conditions, but to a different answer!
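A rough sketch of batch TD(0): accumulate the TD increments over the whole stored data set and apply them only at the end of each pass, repeating until the values stop changing (names and the episode representation are illustrative):

```python
def batch_td0(episodes, alpha=0.01, gamma=1.0, sweeps=1000):
    """Batch TD(0) on a fixed, finite set of episodes.

    episodes: list of episodes, each a list of (s, r, s_next) transitions,
              with s_next = None marking the end of the episode."""
    V = {}
    for _ in range(sweeps):
        delta = {}  # increments accumulated during this pass, applied afterwards
        for episode in episodes:
            for (s, r, s_next) in episode:
                v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
                td_error = r + gamma * v_next - V.get(s, 0.0)
                delta[s] = delta.get(s, 0.0) + alpha * td_error
        for s, d in delta.items():  # update estimates only after the full pass
            V[s] = V.get(s, 0.0) + d
    return V
```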
You are the Predictor • Suppose you observe the following 8 episodes:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
You are the Predictor • The prediction that best matches the training data is V(A) = 0 • This minimizes the mean-squared error on the training set • This is what a batch Monte Carlo method gets • If we consider the sequentiality of the problem, then we would set V(A) = .75 • This is correct for the maximum-likelihood estimate of a Markov model generating the data • i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?) • This is called the certainty-equivalence estimate • This is what TD(0) gets
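One way to see where the .75 comes from (assuming no discounting): fit the maximum-likelihood Markov model to the 8 episodes and compute the values it predicts.

```latex
\hat{P}(A \to B) = 1,\ \hat{R}(A \to B) = 0; \qquad
\hat{P}(B \to \text{terminal},\, r{=}1) = \tfrac{6}{8},\quad
\hat{P}(B \to \text{terminal},\, r{=}0) = \tfrac{2}{8}
\;\Longrightarrow\; V(B) = \tfrac{6}{8} = 0.75,\quad V(A) = 0 + V(B) = 0.75
```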
Sarsa: On-Policy TD Control • Turn TD into a control method by always updating the policy to be (ε-)greedy with respect to the current action-value estimate • S A R S A stands for: State, Action, Reward, (next) State, (next) Action
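A minimal sketch of the Sarsa update for one (S, A, R, S', A') transition, assuming the next action has already been chosen by the current ε-greedy policy (names illustrative):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0, terminal=False):
    """Sarsa (on-policy TD control): update Q(s, a) toward
    r + gamma * Q(s_next, a_next), where a_next is the action actually
    selected in s_next by the current (epsilon-greedy) policy."""
    q_next = 0.0 if terminal else Q.get((s_next, a_next), 0.0)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * (r + gamma * q_next - q)
```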
Windy Gridworld • undiscounted, episodic, reward = –1 until the goal is reached
Cliffwalking • ε-greedy, ε = 0.1
The Book • Part I: The Problem • Introduction • Evaluative Feedback • The Reinforcement Learning Problem • Part II: Elementary Solution Methods • Dynamic Programming • Monte Carlo Methods • Temporal Difference Learning • Part III: A Unified View • Eligibility Traces • Generalization and Function Approximation • Planning and Learning • Dimensions of Reinforcement Learning • Case Studies
Summary • TD prediction • Introduced one-step tabular model-free TD methods • Extend prediction to control by employing some form of GPI • On-policy control: Sarsa • Off-policy control: Q-learning • These methods bootstrap and sample, combining aspects of DP and MC methods
Questions • What is common to all three classes of methods? – DP, MC, TD • What are the principal strengths and weaknesses of each? • In what sense is our RL view complete? • In what senses is it incomplete? • What are the principal things missing? • The broad applicability of these ideas… • What does the term bootstrapping refer to? • What is the relationship between DP and learning?
Next Class • Lecture next Monday, 5 March: • Chapters 6 & 7 of Sutton & Barto