
Presentation Transcript


  1. Announcements
  • Homework 5 due Tuesday, October 30
  • Book Review due Tuesday, October 30
  • Lab 3 due Thursday, November 1

  2. Reinforcement Learning (Lecture 11)

  3. Reinforcement Learning
  • Addresses the question of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals
  • Uses a reward or penalty to indicate the desirability of the resulting state
  • Example problems:
    • controlling a mobile robot
    • learning to optimize operations in a factory
    • learning to play a board game

  4. RL Diagram
  [Diagram: the agent observes the current state, performs an action in the environment, and the environment returns a reward and the next state.]
  • Process: observe state st, choose action at, receive reward rt and new state st+1
  • Goal: learn to choose actions that maximize the discounted return r0 + γr1 + γ²r2 + ..., where 0 ≤ γ < 1
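
A minimal sketch of this loop in Python; the Environment class, its step method, and the toy dynamics are hypothetical stand-ins, not anything from the lecture:

```python
class Environment:
    """Hypothetical stand-in for the diagram's environment box:
    it answers each action with a reward and a successor state."""
    def step(self, state, action):
        reward = 100 if state == 5 else 0   # placeholder dynamics
        next_state = min(state + 1, 5)      # placeholder transition
        return reward, next_state

def run_episode(env, policy, state, gamma=0.9, steps=10):
    """Accumulate the discounted return r0 + gamma*r1 + gamma^2*r2 + ..."""
    total, discount = 0.0, 1.0
    for _ in range(steps):
        action = policy(state)
        reward, state = env.step(state, action)
        total += discount * reward
        discount *= gamma
    return total

print(run_episode(Environment(), policy=lambda s: "right", state=0))
```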

  5. Simple Grid World
  • Markov Decision Process (MDP)
  • Agent perceives a set S of distinct states
  • Agent has a set A of actions that it can perform
  • Environment responds by giving the agent a reward rt = r(st, at)
  • Environment produces the succeeding state st+1 = δ(st, at)
  • Task of agent: learn a policy π : S → A, where π(st) = at
  [Grid diagram: r(s,a) (immediate reward) values; every action entering the goal state G is worth 100, all other actions 0.]
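
Because this world is deterministic, the whole MDP comes down to the two functions δ and r. A sketch, assuming a small grid with G in the top-right corner (the slide's exact layout is an assumption):

```python
# States are (row, col) cells; the goal cell G is where the reward lives.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ROWS, COLS = 2, 3                  # assumed size; the figure shows a small grid
GOAL = (0, 2)                      # assumed position of G
STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]

def delta(state, action):
    """Deterministic successor function st+1 = delta(st, at); a move that
    would leave the grid keeps the agent where it is."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else state

def reward(state, action):
    """Immediate reward r(st, at): 100 for entering G, 0 for everything else."""
    return 100 if state != GOAL and delta(state, action) == GOAL else 0
```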

  6. Learning a policy
  • Need to learn a policy π that maximizes reward over time
  • Define the cumulative value Vπ(st) = rt + γrt+1 + γ²rt+2 + ... = Σ_i γ^i rt+i
  • Learn the optimal policy π*, which maximizes Vπ(st) for all states st
  [Grid diagram: Q(s,a) values, the expected rewards over time, when γ = 0.9.]
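
Since Vπ is just a discounted sum, a two-line helper is enough to check the figure's numbers (a sketch, not lecture code):

```python
def discounted_return(rewards, gamma=0.9):
    """V(st) = rt + gamma*rt+1 + gamma^2*rt+2 + ... for one observed reward sequence."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# Two steps from G the agent collects rewards (0, 100), so its value is
# 0 + 0.9 * 100 = 90.0, matching the 90 entries in the figure.
print(discounted_return([0, 100]))
```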

  7. Using values to find the optimal policy
  [Grid diagrams: V*(s) values, the value of the highest expected reward from each state, and one optimal policy.]

  8. Temporal Difference Learning
  • Learn iteratively by reducing the discrepancy between estimated values for adjacent states
  • Initially all values are zero
  • As the agent moves about the environment, state values are updated according to
    V(s) ← V(s) + α(r + γV(s') - V(s))
    where α is the reinforcement learning constant; a code sketch follows below
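
The formula itself did not survive in this transcript, so the sketch below assumes the standard tabular TD(0) update, which matches the slide's description of shrinking the discrepancy between adjacent states:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward the one-step target r + gamma*V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {"s1": 0.0, "s2": 0.0}                 # initially all values are zero
td0_update(V, "s1", r=0, s_next="s2")      # no reward yet, so nothing changes
td0_update(V, "s2", r=100, s_next="s2")    # V(s2) moves 10% of the way toward 100
print(V)                                   # {'s1': 0.0, 's2': 10.0}
```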

  9. Calculating the Value of a State
  • Where do these values come from?
  • Use the Bellman equation:
    Vπ(s) = Σ_a π(s,a) Σ_s' P(s'|s,a) (r(s,a,s') + γVπ(s'))
  [Grid diagram: V*(s) values, the value of the highest expected reward from each state.]

  10. Our GridWorld
  • It is deterministic, so the Bellman equation can be simplified:
    Vπ(s) = Σ_a π(s,a) (r(s,a) + γVπ(δ(s,a)))
  • Need a policy π(s,a)
  • Suppose the agent selects all legal actions with equal probability (see the sweep sketched below)
  [Grid diagram: π(s,a) values; each cell's legal actions get equal probability, e.g. .5 with two legal moves and .33 with three.]
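
A sketch of one such sweep; legal_actions, delta, and reward are parameters standing in for the lecture's A(s), δ, and r, and old values of s' are used throughout the sweep, as the step-by-step slides do:

```python
def bellman_sweep(V, states, legal_actions, delta, reward, gamma=0.9):
    """One synchronous application of the simplified Bellman equation
    under the equiprobable policy pi(s,a) = 1/|A(s)|:
    V(s) <- sum_a pi(s,a) * (r(s,a) + gamma * V(delta(s,a)))."""
    new_V = {}
    for s in states:
        acts = legal_actions(s)        # e.g. 2 moves in a corner, 3 on an edge
        p = 1.0 / len(acts)
        new_V[s] = sum(p * (reward(s, a) + gamma * V[delta(s, a)]) for a in acts)
    return new_V

# Starting from all zeros and sweeping repeatedly gives steps 1, 2, 3, ...:
# V = {s: 0.0 for s in STATES}
# for _ in range(58):
#     V = bellman_sweep(V, STATES, legal_actions, delta, reward)
```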

  11. Our GridWorld
  • Initialize all values to 0
  • After one application of the Bellman equation
  [Grid diagrams: the all-zero starting values and the values after the first update.]

  12. Our GridWorld
  • Step 2 (use old value of s')
  • Step 3
  [Grid diagrams: values after the second and third updates.]

  13. Our GridWorld
  • Step 4
  • ... Step 58
  [Grid diagrams: values after step 4 and after step 58.]

  14. Finding the Optimal Policy
  • Modify the Bellman equation from
    Vπ(s) = Σ_a π(s,a) (r(s,a) + γVπ(δ(s,a)))
  • to
    V*(s) = max_a (r(s,a) + γV*(δ(s,a)))
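
Swapping the expectation over actions for a max turns the evaluation sweep into value iteration; a sketch under the same assumptions as before:

```python
def optimal_sweep(V, states, legal_actions, delta, reward, gamma=0.9):
    """One synchronous backup of the optimality equation:
    V*(s) <- max_a (r(s,a) + gamma * V*(delta(s,a)))."""
    return {s: max(reward(s, a) + gamma * V[delta(s, a)]
                   for a in legal_actions(s))
            for s in states}

def greedy_policy(V, states, legal_actions, delta, reward, gamma=0.9):
    """Once V* has converged, an optimal policy simply acts greedily on it."""
    return {s: max(legal_actions(s),
                   key=lambda a: reward(s, a) + gamma * V[delta(s, a)])
            for s in states}
```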

  15. Our GridWorld
  • Initialize all values to 0
  • After one application of the Bellman equation
  [Grid diagrams: the all-zero starting values and the values after the first max-based update.]

  16. Our GridWorld
  • Step 2 (use old value of s')
  • Step 3
  [Grid diagrams: values after the second and third max-based updates.]

  17. Other GridWorld
  • Agent can move in 4 directions from each cell
  • If the agent moves off the grid, reward = -1
  • If the agent is in state A, all moves take it to state A' and it receives a reward of +10
  • If the agent is in state B, all moves take it to state B' and it receives a reward of +5
  [Grid diagram: state values following a random policy, with special states A (+10) and B (+5).]
  • Why is A valued less than 10 and B valued more than 5?
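
A sketch of these dynamics; this slide appears to reproduce Sutton and Barto's well-known 5x5 example, so the grid size and the A/A'/B/B' coordinates below follow that version and should be treated as assumptions:

```python
SIZE = 5                            # assumed 5x5 grid
A, A_PRIME = (0, 1), (4, 1)         # assumed coordinates of A and A'
B, B_PRIME = (0, 3), (2, 3)         # assumed coordinates of B and B'
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (reward, next_state) under this grid's special dynamics."""
    if state == A:                  # every action from A jumps to A', reward +10
        return 10, A_PRIME
    if state == B:                  # every action from B jumps to B', reward +5
        return 5, B_PRIME
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < SIZE and 0 <= nc < SIZE):
        return -1, state            # moving off the grid costs -1, state unchanged
    return 0, (nr, nc)
```

Evaluating a random policy against step is one way to investigate the closing question empirically.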
