
580.691 Learning Theory Reza Shadmehr & Jörn Diedrichsen Reinforcement Learning 1: Generalized policy iteration.


Presentation Transcript


  1. 580.691 Learning Theory Reza Shadmehr & Jörn Diedrichsen Reinforcement Learning 1: Generalized policy iteration

  2. In a given situation, the action an organism performs can be followed by a certain outcome, which might be rewarding or non-rewarding to the organism. The action also brings the organism into a new state, from which there may again be an opportunity to obtain reward. How do we learn to choose actions so as to maximize reward? This is the problem addressed by reinforcement learning. In contrast to supervised learning, the reward we see only tells us whether the action we chose was good or bad, not what the "correct" action would have been. Furthermore, the actions we take and the reward can be separated in time, so the problem arises of how to assign credit for the reward to actions. Thus, reinforcement learning has some aspects of supervised learning, but with a very "poor" teacher.

  3. Reinforcement Learning: The lay of the land. Say you are a rat in a maze. At any given place you can go left or right, and you might get food as a result. [Diagram: maze with big circles for states and small circles for the actions (wait, left, right) available in each state.] The big circles are states; we say that the state at time t has the value s_t, indicating that the rat is at a certain location. The small circles are actions, which the rat can take in any given state. Actions transport the actor into a new state. We define a policy π, a probabilistic mapping from states to actions.
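The slide's own formula for the policy is not reproduced in this transcript; in the standard notation it can be written as a conditional probability of choosing an action given the current state:

$$\pi(s, a) = P(a_t = a \mid s_t = s).$$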

  4. Reinforcement Learning: The lay of the land. Often the outcome of an action is not fully certain. For example, when going left the rat might have a 10% chance of falling through a trap door. Once in the trap, the experimenter might free the rat with a probability of 20% on every time step. [Diagram: maze as before, with transition probabilities such as P=0.9 / P=0.1 for going left and P=0.2 / P=0.8 for escaping the trap.] Thus, we define the transition probability that you go to state s' when you are in state s and perform action a.
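The transition probability is not written out in the transcript; in the usual notation (the slide's exact symbols may differ) it is

$$\mathcal{P}^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a).$$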

  5. Reinforcement Learning: reward. [Diagram: maze as before, with rewards such as r=6, r=8, and r=1 attached to particular transitions.] If you are in a state s at time t and take an action a that brings you to another state at time t+1, then you receive the reward r_{t+1}. In a discrete episode, the return is defined as the sum of all the rewards we get from now until the terminal time T. The goal of reinforcement learning is to find the policy π that maximizes the expected return from each state. This is the optimal policy. We can also define the expected reward.
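The return and the expected reward referred to here are not reproduced in the transcript; written in the standard notation they are

$$R_t = r_{t+1} + r_{t+2} + \dots + r_T, \qquad \mathcal{R}^a_{ss'} = E\!\left[r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s'\right].$$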

  6. Reinforcement Learning: episodic and continuous environments. Now, life does not come in discrete episodes, but rather as a continuous stream of behavior. If we want to define the return in a continuing case like this (where there is no end time T), we need to introduce temporal discounting. That is, reward we get right now is worth more than reward we will get tomorrow: the contribution of a reward to the return decreases exponentially with its delay. Temporal discounting can be demonstrated in humans and animals: given the choice between 2 food pellets now and 4 food pellets in 2 hrs, which does the rat prefer? Would you rather have $10 now or $12 in a month? With temporal discounting we can rewrite each episodic environment as a continuing one, by introducing nodes for the terminal states that have a transition probability of 1 onto themselves.
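The discounted return is not written out in the transcript; with a discount factor $0 \le \gamma < 1$ it takes the standard form

$$R_t = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}.$$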

  7. Value function and Bellman equations. Most reinforcement learning algorithms are based on estimating the value function. The value function is the expected return of a state under a certain policy. The recursive definition of the value function is known as the Bellman equation. We can write down a Bellman equation for each state; the value function is then the unique solution to this system of equations. Correspondingly, we can define Q, the action-value function: the expected return when performing action a from state s and thereafter following the policy.
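The formulas themselves are missing from the transcript; in the usual notation the value function, its Bellman equation, and the action-value function are

$$V^\pi(s) = E_\pi\!\left[R_t \mid s_t = s\right] = \sum_a \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma\, V^\pi(s')\right],$$

$$Q^\pi(s,a) = E_\pi\!\left[R_t \mid s_t = s,\ a_t = a\right] = \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma\, V^\pi(s')\right].$$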

  8. Evaluating policies: Dynamic programming. The first question a learner has to answer is how good the current policy is. That is, we need a method that evaluates the value function for a policy. Dynamic programming finds the solution using a simple iterative scheme. For large state spaces it can be beneficial to update only a subset of states in each iteration and then use these to update other states. BUT: in general, dynamic programming requires that we know the transition probabilities P and the expected rewards R.
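A minimal sketch of iterative policy evaluation by dynamic programming, assuming the transition probabilities P[s, a, s'], expected rewards R[s, a, s'], and policy pi[s, a] are given as NumPy arrays (the array names are illustrative, not from the slides):

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation (dynamic programming).

    P[s, a, s2]  : transition probability from s to s2 under action a
    R[s, a, s2]  : expected reward for that transition
    pi[s, a]     : probability of choosing action a in state s
    Returns V[s] : value of each state under policy pi.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # One Bellman backup for every state under the current estimate of V
        V_new = np.einsum('sa,sat,sat->s', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```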

  9. Evaluating policies: Experience and Monte Carlo. The learner often does not know the environment. So how can we learn the value function? Experience sampling using Monte Carlo methods can be quite time-intensive, but it does not require any knowledge of the environment.
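A sketch of first-visit Monte Carlo policy evaluation from pure experience; `sample_episode` is a hypothetical function that runs the current policy in the environment and returns the visited states and received rewards:

```python
from collections import defaultdict

def mc_evaluate(sample_episode, n_episodes=1000, gamma=0.9):
    """First-visit Monte Carlo policy evaluation.

    sample_episode() -> (states, rewards), where rewards[t] follows states[t].
    No knowledge of transition probabilities or expected rewards is needed.
    """
    returns = defaultdict(list)
    for _ in range(n_episodes):
        states, rewards = sample_episode()
        G = 0.0
        # Work backwards so G accumulates the discounted return from each step
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            if states[t] not in states[:t]:      # first visit to this state only
                returns[states[t]].append(G)
    # Value estimate = average return observed from each state
    return {s: sum(g) / len(g) for s, g in returns.items()}
```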

  10. Optimal value function. Now that we have a value function, we can define the optimal value function: the value function under the best policy. Similarly, we can define the optimal action-value function. How do we find the optimal value function if we only know the value function of the current policy?
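The defining formulas are missing from the transcript; in the usual notation the optimal value function and optimal action-value function are

$$V^*(s) = \max_\pi V^\pi(s), \qquad Q^*(s,a) = \max_\pi Q^\pi(s,a) = \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma\, V^*(s')\right].$$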

  11. Optimizing the policy. How do we find the optimal value function if we only know the value function of the current policy? The key is to realize that if, at a single state s, we switch from the action prescribed by π to an action a whose action-value under π is at least as large as V^π(s), and we otherwise follow π, then the new policy is at least as good as π. This is known as the policy improvement theorem. For example, we can improve the policy greedily, by choosing for every state s the action a with the largest action-value. We can now iterate policy evaluation and policy improvement until the policy no longer changes; by the definition of the optimal value function, the policy we converge on is the optimal one. Thus, by iterating policy evaluation and policy improvement we find the optimal policy and value function. This is known as generalized policy iteration.
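The formulas dropped from this slide can be filled in with the standard statement (the slide's exact notation may differ): if for some state s we pick an action a with

$$Q^\pi(s, a) \ge V^\pi(s)$$

and otherwise follow π, the resulting policy π' satisfies $V^{\pi'}(s) \ge V^\pi(s)$ for all s. Greedy improvement chooses, for every state,

$$\pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma\, V^\pi(s')\right].$$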

  12. Generalized policy iteration. We can alternate policy evaluations (E) and policy improvements (I) until convergence. [Figure: state values of the example maze after successive evaluation (E) and improvement (I) steps.]
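A compact sketch of the full alternation of E and I steps, reusing the `evaluate_policy` routine sketched above (greedy, deterministic improvement; the array layout is again illustrative):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation (E) and greedy improvement (I) until the policy is stable."""
    n_states, n_actions, _ = P.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start with a uniform policy
    while True:
        V = evaluate_policy(P, R, pi, gamma)                # E step
        # Action values under the current value function
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        greedy = np.argmax(Q, axis=1)                       # I step
        pi_new = np.eye(n_actions)[greedy]                  # one-hot greedy policy
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```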

  13. Exploration vs. Exploitation. When the organism does not know the environment but has to rely on sampling, a greedy policy can get in the way of estimating the value function of the policy. That is, exploitation can get in the way of exploration. This will be your homework 1. Solution 1: instead of being maximally greedy, be ε-greedy. That is, go for the maximum in 1−ε of the cases and choose among the other options in ε of the cases. Solution 2: do not go for the maximum, but choose each action with a probability given by a softmax function, sometimes called the Gibbs/Boltzmann distribution. The parameter τ determines how "soft" the selection is: as τ→0 the softmax function approaches maximally greedy selection. τ is often called the temperature of the distribution. As the temperature decreases, the distribution "crystallizes" around one point; when the temperature rises, the distribution becomes more and more diffuse.
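A sketch of the two action-selection rules applied to one vector q of action values (illustrative only, not the homework solution):

```python
import numpy as np

_rng = np.random.default_rng()

def epsilon_greedy(q, eps=0.1):
    """With probability 1 - eps pick the greedy action, otherwise a random one."""
    if _rng.random() < eps:
        return int(_rng.integers(len(q)))
    return int(np.argmax(q))

def softmax_action(q, tau=1.0):
    """Gibbs/Boltzmann selection: low tau -> nearly greedy, high tau -> nearly uniform."""
    z = (np.asarray(q) - np.max(q)) / tau      # subtract the max for numerical stability
    p = np.exp(z) / np.sum(np.exp(z))
    return int(_rng.choice(len(q), p=p))
```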

  14. For computational purposes, we can write the policy and the transition probabilities as matrices, borrowing a formalism that is widely used for discrete stochastic processes. The state s becomes a vector of indicator variables. To do this we have to be careful about how we define our actions: the probability of a transition to s' must depend ONLY on the action taken, not on the last state. That is, when we have 5 states and can go left or right from each of them, we need to define 10 actions (go left from state 1, go right from state 1, and so on), as illustrated in the sketch below.
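A small illustration of this indicator-vector / matrix formalism for a hypothetical 3-state chain with state-specific left/right actions (the numbers are made up; the maze from the earlier slides would need its own matrices):

```python
import numpy as np

# Policy matrix: Pi[s, a] = probability of taking action a in state s.
# Actions are state-specific ("left from 0", "right from 0", "left from 1", ...),
# so the next-state distribution can depend only on the action.
Pi = np.array([[0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
               [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]])

# Transition matrix: T[a, s2] = probability of landing in state s2 after action a.
T = np.array([[1.0, 0.0, 0.0],   # left  from state 0
              [0.0, 1.0, 0.0],   # right from state 0
              [0.9, 0.0, 0.1],   # left  from state 1
              [0.0, 0.0, 1.0],   # right from state 1
              [0.0, 1.0, 0.0],   # left  from state 2
              [0.0, 0.0, 1.0]])  # right from state 2

s = np.array([0.0, 1.0, 0.0])    # indicator vector: currently in state 1
s_next = s @ Pi @ T              # distribution over next states: [0.45, 0.0, 0.55]
```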
