
CPSC 7373: Artificial Intelligence Lecture 11: Reinforcement Learning

Presentation Transcript


  1. CPSC 7373: Artificial Intelligence, Lecture 11: Reinforcement Learning. Jiang Bian, Fall 2012, University of Arkansas at Little Rock

  2. Reinforcement Learning • In MDPs, we learned how to determine an optimal sequence of actions for an agent in a stochastic environment. • An agent that knows the correct model of the environment can navigate it, finding its way to the positive rewards and avoiding the negative penalties. • Reinforcement learning can guide the agent to an optimal policy even though it knows nothing about the rewards when it starts out.

  3. Reinforcement Learning (3x4 grid world, columns 1-4, rows a-c) • What if we don't know where the +100 and -100 rewards are when we start? • A reinforcement learning agent can learn to explore the territory, find where the rewards are, and then learn an optimal policy. • An MDP solver can only do that once it knows exactly where the rewards are.

  4. RL Example • Backgammon is a stochastic game. • In the 1990s, Gerald Tesauro at IBM wrote a program to play backgammon. • Approach #1: try to learn the utility of a game state, using examples labeled by expert human backgammon players. • Only a small number of states were labeled, and the program tried to generalize from those labels using supervised learning. • Approach #2: no human expertise and no supervision. • One copy of the program played against another; at the end of each game the winner received a positive reward and the loser a negative one. • This version learned to perform at the level of the very best players in the world, learning from about 200,000 games of self-play.

  5. Forms of Learning • Supervised: examples (x1, y1), (x2, y2), …; learn y = f(x) • Unsupervised: examples x1, x2, …; learn P(X = x) • Reinforcement: state-action sequences s, a, s, a, … with rewards r; learn an optimal policy: what is the right thing to do in any of the states?

  6. Forms of Learning • Examples (classify each as Supervised, Unsupervised, or Reinforcement): • Speech recognition: examples of voice recordings and the corresponding transcript text for each recording; from them we try to learn a model of language. • Star data: for each star, a list of the different emission frequencies of light reaching Earth; we analyze the spectral emissions and try to find clusters of similar types of stars that may be of interest to astronomers. • Lever pressing: a rat is trained to press a lever to get a release of food when certain conditions are met. • Elevator controller: a sequence of button presses, and a wait time we are trying to minimize; a bank of elevators in a building needs some policy to decide which elevator goes up and which goes down in response to the percepts, i.e., the button presses at the various floors.

  7. MDP Review • Markov Decision Processes: • List of states: S1, …, Sn • List of actions: a1, …, ak • State transition matrix: T(S, a, S') = P(S'|a, S) • Reward function: R(S') or R(S, a, S') • Finding the optimal policy π(s): in each state, look at all possible actions and choose the one with the highest expected utility, weighting each outcome by its transition probability (a minimal sketch of this step follows below).
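A minimal sketch of that policy-extraction step in Python (not from the slides; the names states, actions, P, and U are hypothetical placeholders for the MDP components listed above, with P[(s, a)] holding (probability, next state) pairs):

    # pi(s) = argmax_a sum_{s'} P(s'|s,a) * U(s')
    def extract_policy(states, actions, P, U):
        pi = {}
        for s in states:
            pi[s] = max(actions,
                        key=lambda a: sum(p * U[s2] for (p, s2) in P[(s, a)]))
        return pi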

  8. Agents of RL • Problem with MDPs: the reward function R and/or the transition model P may be unknown. • Utility-based agent: learns R (with P known), and uses P and R to compute the utility function U by solving the MDP. • Q-learning agent: learns a utility function Q(s, a) over state-action pairs. • Reflex agent: learns the policy directly (stimulus-response).

  9. Passive and Active • Passive RL: the agent has a fixed policy and executes that policy. • E.g., your friend is driving from Little Rock to Dallas; you learn about the rewards (say, a shortcut), but you can't change your friend's driving behavior (the policy). • Active RL: the agent changes its policy as it goes. • E.g., you take over control of the car and adjust the policy based on what you have learned. • This also gives you the possibility to explore.

  10. Passive Temporal Difference Learning (3x4 grid world, columns 1-4, rows a-c) • Given: a fixed policy π, utility estimates U(s), visit counts Ns(s), and observed rewards r • If s' is new then U[s'] <- r' • If s is not null then: increment Ns[s], and U[s] <- U[s] + α(Ns[s]) (r + γ U[s'] - U[s]) • α(n): learning rate, e.g., 1/(n+1) • (a sketch of this update as code follows below)
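A minimal sketch of that passive TD update in Python (the table names U and Ns and the discount GAMMA are assumptions; the 1/(n+1) learning-rate schedule is the one suggested on the slide):

    U, Ns = {}, {}        # utility estimates and visit counts, keyed by state
    GAMMA = 1.0           # discount factor (assumed; not given on the slide)

    def alpha(n):
        # decaying learning rate, as suggested on the slide
        return 1.0 / (n + 1)

    def td_update(s, r, s_prime, r_prime):
        # one observed transition: state s with reward r, then state s' with reward r'
        if s_prime not in U:                 # if s' is new then U[s'] <- r'
            U[s_prime] = r_prime
        if s is not None:                    # if s is not null then ...
            Ns[s] = Ns.get(s, 0) + 1
            U[s] = U.get(s, 0.0) + alpha(Ns[s]) * (r + GAMMA * U[s_prime] - U[s])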

  11. Passive Agent Results

  12. Weakness (3x4 grid world) • Problem: the policy is fixed!

  13. Active RL: Greedy • π <- π': after each utility update, recompute the new optimal policy. • How should the agent behave? Always choose the action with the highest expected utility? • Exploration vs. exploitation: occasionally try "suboptimal" actions! • Random actions? (see the ε-greedy sketch below)
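One standard way to "occasionally try suboptimal actions" is ε-greedy action selection. The slide only raises the question, so this is an illustrative sketch rather than the lecture's method; the value table Q is a hypothetical placeholder, and the same idea works with expected utilities:

    import random

    def epsilon_greedy(s, actions, Q, epsilon=0.1):
        # with probability epsilon, explore a random action;
        # otherwise exploit the action that currently looks best
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))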

  14. Errors in Utility • The agent tracks π, U(s), and Ns(s). • Reasons for errors in U: not enough samples (random fluctuations), or not a good policy. • Questions: which errors make U too low? Which make U too high? Which are reduced as the counts N grow?

  15. Exploration Agents • An exploration agent will: • be more proactive about exploring the world when it is uncertain, and • fall back to exploiting the (sub-)optimal policy when it becomes more certain about the world. • The TD update is as before: if s' is new then U[s'] <- r'; if s is not null then increment Ns[s] and U[s] <- U[s] + α(Ns[s]) (r + γ U[s'] - U[s]) • But the utility used to choose actions is optimistic: U+(s) = +R if Ns(s) < Ne, otherwise U(s) • (a sketch of this exploration utility follows below)
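A minimal sketch of that optimistic exploration utility in Python (R_PLUS and N_E are assumed constants standing in for the slide's +R and the visit threshold Ne):

    R_PLUS = 100.0   # optimistic reward assumed for under-explored states (assumption)
    N_E = 5          # visits required before trusting the learned estimate (assumption)

    def exploration_utility(s, U, Ns):
        # pretend rarely-visited states are highly rewarding, so the agent
        # is drawn to explore them; otherwise use the learned estimate U[s]
        if Ns.get(s, 0) < N_E:
            return R_PLUS
        return U.get(s, 0.0)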

  16. Exploratory agent results

  17. Q-Learning (3x4 grid world, columns 1-4, rows a-c) • U -> π: the policy for each state is determined by the expected utility of the available actions. • But what if the transition model P is unknown? • Q-learning: learn a value Q(s, a) for each state-action pair, so a policy can be read off without P (see the sketch below).
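A minimal sketch of reading a greedy policy straight off a Q table, with no transition model needed (the table Q is a hypothetical placeholder):

    def greedy_policy_from_q(s, actions, Q):
        # pi(s) = argmax_a Q(s, a); no P(s'|s,a) is required
        return max(actions, key=lambda a: Q.get((s, a), 0.0))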

  18. Q-learning • Update rule: Q(s,a) <- Q(s,a) + α (R(s) + γ max_a' Q(s',a') - Q(s,a)) • (Q-table for the 3x4 grid world, rows a-c, columns 1-4, all entries initialized to 0.) • (a sketch of this update as code follows below)
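A minimal sketch of that Q-learning backup in Python (the table names, per-pair visit counts, and GAMMA are assumptions; the 1/(n+1) learning-rate schedule mirrors the TD slide):

    from collections import defaultdict

    Q = defaultdict(float)      # Q[(s, a)], all entries start at 0 as in the slide's table
    N_sa = defaultdict(int)     # visit counts per (state, action) pair
    GAMMA = 1.0                 # discount factor (assumed)

    def alpha(n):
        return 1.0 / (n + 1)

    def q_update(s, a, r, s_prime, actions):
        # one backup after observing state s, action a, reward r, next state s'
        N_sa[(s, a)] += 1
        best_next = max(Q[(s_prime, a2)] for a2 in actions)
        Q[(s, a)] += alpha(N_sa[(s, a)]) * (r + GAMMA * best_next - Q[(s, a)])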

  19. Conclusion • If we know P, we can learn R and derive U by solving the MDP. • If we know neither P nor R, we can use Q-learning, which learns Q(s, a) as a utility over state-action pairs. • We also learned about the trade-off between exploration and exploitation.
