Announcements
• Homework 5 due Tuesday, October 30
• Book Review due Tuesday, October 30
• Lab 3 due Thursday, November 1
Reinforcement Learning
Lecture 11
Reinforcement Learning
• Addresses the question of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals
• A reward or penalty signal indicates the desirability of the resulting state
• Example problems:
  • control a mobile robot
  • learn to optimize operations in a factory
  • learn to play a board game
RL Diagram
[Diagram: the agent observes the current state, performs an action, and the environment responds with a reward and the next state]
• Process: at each step t the agent senses state st, chooses action at, and receives reward rt
• Goal: learn to choose actions that maximize the cumulative discounted reward r0 + γr1 + γ²r2 + …, where 0 ≤ γ < 1
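This interaction loop is easy to state in code. Below is a minimal Python sketch; the environment interface (reset(), step() returning a state, a reward, and a done flag) and the policy function are illustrative assumptions, not an API from the lecture.

    def run_episode(env, policy, gamma=0.9, max_steps=100):
        """Run one episode and return the cumulative discounted reward."""
        state = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(max_steps):
            action = policy(state)                   # agent chooses an action
            state, reward, done = env.step(action)   # environment responds
            total += discount * reward               # accumulate r0 + γr1 + γ²r2 + …
            discount *= gamma
            if done:
                break
        return total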
Simple Grid World
• Markov Decision Process (MDP)
  • Agent perceives a set S of distinct states
  • Agent has a set A of actions that it can perform
  • Environment responds by giving the agent a reward rt = r(st, at)
  • Environment produces the succeeding state st+1 = δ(st, at)
• Task of agent: learn a policy π : S → A with π(st) = at
[Grid-world diagram: r(s,a) immediate reward values — every move into the goal state G earns 100, all other moves earn 0]
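The slides' grid world can be sketched directly from these definitions. The layout below (a 2×3 grid with the absorbing goal G in the top-right cell) is my assumption from the figure; delta and reward mirror δ(s,a) and r(s,a).

    ROWS, COLS = 2, 3
    GOAL = (0, 2)                       # assumed position of G
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]

    def delta(state, action):
        """Deterministic successor s' = δ(s, a); G is absorbing."""
        if state == GOAL:
            return state
        dr, dc = ACTIONS[action]
        r, c = state[0] + dr, state[1] + dc
        if 0 <= r < ROWS and 0 <= c < COLS:
            return (r, c)
        return state                    # bumping a wall leaves the state unchanged

    def reward(state, action):
        """Immediate reward r(s, a): 100 for a move into G, otherwise 0."""
        return 100 if state != GOAL and delta(state, action) == GOAL else 0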
Learning a Policy
• Need to learn a policy π that maximizes reward over time
• Define the cumulative discounted value
  V^π(st) = rt + γrt+1 + γ²rt+2 + … = Σ_i γ^i rt+i
• Learn the optimal policy π*, the policy that maximizes V^π(s) for all states s
[Grid-world diagram: Q(s,a) values — expected rewards over time with γ = 0.9, e.g. 100 for the move into G, then 90, 81, and 72 for moves farther from G]
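The cumulative value is just a discounted sum, so a two-line helper suffices (no assumptions beyond the definition above):

    def discounted_return(rewards, gamma=0.9):
        """V = r0 + γ·r1 + γ²·r2 + … for a finite reward sequence."""
        return sum(gamma**i * r for i, r in enumerate(rewards))

    # Two moves from G: one zero-reward step, then 100 for entering G.
    print(discounted_return([0, 100]))   # 0 + 0.9·100 = 90.0

Note how this reproduces the 90 appearing one step away from G in the figure.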
Using Values to Find the Optimal Policy
[Grid-world diagram: V*(s) values — the highest expected cumulative reward obtainable from each state]
• Acting greedily with respect to V* yields an optimal policy: π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
[Grid-world diagram: one optimal policy — arrows following a shortest path to G from every cell]
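Given a table of V*(s) values, extracting the greedy policy is one line per state. A sketch reusing delta and reward from the grid-world code above:

    def greedy_policy(V, gamma=0.9):
        """π*(s) = argmax_a [ r(s,a) + γ·V(δ(s,a)) ] for every non-goal state."""
        return {s: max(ACTIONS, key=lambda a: reward(s, a) + gamma * V[delta(s, a)])
                for s in STATES if s != GOAL}

Ties between equally good actions are broken arbitrarily by max, which is why the slide says "one" optimal policy: more than one can exist.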
Temporal Difference Learning
• Learn iteratively by reducing the discrepancy between estimated values for adjacent states
• Initially all values are zero
• As the agent moves about the environment, the values of states are updated according to the following formula:
  V(st) ← V(st) + α [rt + γV(st+1) − V(st)]
  where α is the reinforcement learning constant (the learning rate)
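The update touches only the two states on either side of a single transition, so it fits in one function (a sketch reusing STATES and GOAL from the earlier grid-world code):

    def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
        """One TD backup: move V(s) toward r + γ·V(s')."""
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

    # Example: the agent steps from the cell below G into G, earning 100.
    V = {s: 0.0 for s in STATES}
    td_update(V, (1, 2), 100, GOAL)
    print(V[(1, 2)])   # 10.0 — one tenth of the discrepancy, since α = 0.1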
Calculating the Value of a State
• Where do these values come from?
• Use the Bellman equation:
  V^π(s) = Σ_a π(s,a) [r(s,a) + γ Σ_s' P(s'|s,a) V^π(s')]
[Grid-world diagram: V*(s) values — the highest expected cumulative reward from each state]
Our GridWorld
• Our grid world is deterministic, so the Bellman equation simplifies to
  V^π(s) = Σ_a π(s,a) [r(s,a) + γ V^π(δ(s,a))]
• Need a policy π(s,a): the probability of taking action a in state s
• Suppose the agent selects all available actions with equal probability
[Grid-world diagram: π(s,a) values — .5 per move in cells with two legal moves, .33 in cells with three, and 1 at the absorbing goal G]
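One application of this simplified equation to a single state looks like the sketch below; legal_actions is my helper matching the .5/.33 probabilities in the figure, and delta/reward come from the earlier grid-world code.

    def legal_actions(s):
        """Moves that keep the agent on the grid (matching the .5/.33 figure)."""
        return [a for a in ACTIONS if delta(s, a) != s]

    def backup(s, old_V, gamma=0.9):
        """One backup: V(s) = Σ_a π(s,a) [ r(s,a) + γ·old_V(δ(s,a)) ]."""
        acts = legal_actions(s)
        return sum((reward(s, a) + gamma * old_V[delta(s, a)]) / len(acts)
                   for a in acts)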
Our GridWorld
• Initialize all values to 0
• After one application of the Bellman equation, only the cells with a move into G pick up nonzero value — the immediate reward has propagated one step
[Grid-world diagrams: values before and after the first sweep]
Our GridWorld
• Step 2 (use the old value of s' in every backup)
• Step 3
[Grid-world diagrams: value estimates after sweeps 2 and 3 — nonzero values spread one cell farther from G with each sweep]
Our GridWorld
• Step 4
• … Step 58: the values have essentially converged to V^π for the equiprobable policy
[Grid-world diagrams: value estimates after sweeps 4 and 58]
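The 58 steps on these slides are just repeated synchronous sweeps of backup() over all states. A sketch:

    def evaluate_equiprobable(sweeps=58, gamma=0.9):
        """Iterative policy evaluation for the equiprobable policy."""
        V = {s: 0.0 for s in STATES}      # initialize all values to 0
        for _ in range(sweeps):
            old = dict(V)                 # "use old value of s'"
            for s in STATES:
                if s != GOAL:
                    V[s] = backup(s, old, gamma)
        return V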
Finding the Optimal Policy
• Modify the Bellman equation: instead of averaging over the actions the policy might take,
  V^π(s) = Σ_a π(s,a) [r(s,a) + γ V^π(δ(s,a))]
  back up only the best action in each state:
  V*(s) = max_a [r(s,a) + γ V*(δ(s,a))]
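Swapping the sum for a max in the sweep above turns policy evaluation into value iteration, which converges to V*. A sketch on the same assumed grid:

    def value_iteration(gamma=0.9, tol=1e-6):
        """Sweep V*(s) = max_a [ r(s,a) + γ·V*(δ(s,a)) ] until convergence."""
        V = {s: 0.0 for s in STATES}
        while True:
            old = dict(V)
            for s in STATES:
                if s != GOAL:
                    V[s] = max(reward(s, a) + gamma * old[delta(s, a)]
                               for a in legal_actions(s))
            if max(abs(V[s] - old[s]) for s in STATES) < tol:
                return V

With the assumed layout this yields 100 for the cells one move from G, then 90 and 81 farther out, matching the V*(s) figures; feeding the result to greedy_policy() recovers an optimal policy.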
Our GridWorld
• Initialize all values to 0
• After one application of the modified Bellman equation, the cells with a move into G jump straight to 100 — the max backup takes the best action rather than averaging
[Grid-world diagrams: optimal-value estimates before and after the first sweep]
Our GridWorld
• Step 2 (use the old value of s')
• Step 3: with the max backup the values settle on V* after only a few sweeps, far faster than the 58 sweeps needed under the equiprobable policy
[Grid-world diagrams: optimal-value estimates after sweeps 2 and 3]
Other GridWorld
• Agent can move in 4 directions from each cell
• A move that would take the agent off the grid leaves it in place with reward −1
• If the agent is in state A, all moves take it to state A' and it receives a reward of +10
• If the agent is in state B, all moves take it to state B' and it receives a reward of +5
[Grid diagram: grid world with special states A (+10 → A') and B (+5 → B'); state values under a random policy]
• Why is A valued less than 10 and B valued more than 5?
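A sketch that evaluates the random policy on this grid world; the 5×5 layout and the positions of A, A', B, B' are assumptions (the classic textbook version of this example), not stated on the slide.

    N, GAMMA = 5, 0.9
    A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)   # assumed positions
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def step(s, move):
        """Return (next_state, reward) for one move from state s."""
        if s == A:
            return A_PRIME, 10.0
        if s == B:
            return B_PRIME, 5.0
        r, c = s[0] + move[0], s[1] + move[1]
        if 0 <= r < N and 0 <= c < N:
            return (r, c), 0.0
        return s, -1.0                      # off-grid move: stay put, reward -1

    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    for _ in range(1000):                   # plenty of sweeps to converge
        old = dict(V)
        for s in V:
            V[s] = sum((rew + GAMMA * old[s2]) / len(MOVES)
                       for s2, rew in (step(s, m) for m in MOVES))

    print(round(V[A], 1), round(V[B], 1))   # roughly 8.8 and 5.3

Under this assumed layout the values come out to roughly 8.8 for A and 5.3 for B, which answers the question: V(A) = 10 + γ·V(A'), and A' sits on the bottom edge where the random policy keeps earning −1 penalties, so V(A') is negative; V(B) = 5 + γ·V(B'), and B' sits mid-grid with positive value, close to A and B.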