
Reinforcement Learning


Presentation Transcript


  1. Reinforcement Learning Vimal

  2. The learner here is a decision-making agent • It keeps making decisions to take actions in the environment and receives rewards (or penalties) • Over a set of trial-and-error runs, the agent is expected to learn the best policy, the one that maximizes the total reward.

  3. Example: Playing Chess • Could we use a supervised learner to play the game? • We would need a costly teacher to teach the game • In many cases there is no single best move • The goodness of a move may depend on the moves that follow • A sequence of moves is good if we win

  4. Example: Robot in a maze • Task: the robot can move in one of four compass directions and must learn a sequence of moves that takes it out of the maze. • What is the best move here? • As long as the robot is inside the maze, there is no feedback. • There is no opponent in the environment (except the environment itself) • We may be playing against time

  5. Example: Backgammon

  6. From these examples • A decision maker (agent) is placed in an environment • The decision maker should learn to make decisions • At any time, the environment is in a certain state (one of many possible states) • The decision maker has a set of actions available • An action taken by the agent changes the state of the environment • There is a reward (or penalty) for choosing an action in a state • At times, rewards come late, only after a complete sequence of actions has been carried out

  7. The learner here is the agent (learning to make decisions) • It keeps making decisions to take actions in the environment and receives rewards (or penalties) • Over a set of trial-and-error runs, the agent is expected to learn the best policy, the one that maximizes the total reward.

  8. Reinforcement Learning vs. Supervised Learning

  9. Supervised Learning (learning with a teacher) vs. Reinforcement Learning (learning with a critic) • The critic can tell how well we have been doing in the past • The critic never tells us anything ahead of time

  10. Supervised Learning (learning with a teacher) vs. Reinforcement Learning (learning with a critic) • The critic can tell how well we have been doing in the past • The critic never tells us anything ahead of time • The agent learns an internal value for intermediate states that reflects how good they are on the path leading to the goal and the real reward • With this, the agent can learn to take local actions that work to maximize the reward.

  11. The Big Picture: your action influences the state of the world, which in turn determines your reward

  12. Let's consolidate… • The goal is to learn successful control policies by experimenting in the environment • Perform sequences of actions, observe their consequences, and learn a control policy • Control policy: a policy that chooses actions which maximize the reward accumulated over time by the agent from a given initial state • Control policy π : S → A (see the sketch below)
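
To make the mapping π : S → A concrete, here is a minimal sketch of a deterministic control policy represented as a plain lookup table; the state and action names are illustrative, not from the slides:

```python
# A deterministic control policy pi: S -> A as a simple lookup table.
# The state and action names below are illustrative placeholders.
policy = {
    "s1": "right",
    "s2": "right",
    "s3": "up",
}

def act(state):
    """Return the action the policy chooses in the given state."""
    return policy[state]

print(act("s1"))  # 'right'
```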

  13. Episode (Trial) • The sequence of actions that takes us from an initial state to a final state • There may be designated start states (or not), depending on the nature of the problem • Trials are repeated (in principle, infinitely often) to arrive at a policy (we will come to this later)

  14. Markov Decision Process (MDP) • An MDP is a formal model of the RL problem • At each discrete time point the agent observes the state s_t and chooses an action a_t • It receives a reward r_t from the environment and the state changes to s_{t+1} • Markov assumption: r_t = r(s_t, a_t) and s_{t+1} = δ(s_t, a_t), i.e. r_t and s_{t+1} depend only on the current state and action • In general, the functions r and δ may not be deterministic and are not necessarily known to the agent (a toy encoding is sketched below)
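
As a concrete illustration of such a deterministic MDP, the reward function r and the transition function δ can be encoded as lookup tables; the states, actions, and reward values below are assumptions made for this sketch only:

```python
# Toy deterministic MDP: r(s, a) and delta(s, a) as lookup tables.
# All names and values here are illustrative assumptions.
STATES = ["s1", "s2", "goal"]
ACTIONS = ["left", "right"]

DELTA = {  # delta(s, a) -> next state
    ("s1", "left"): "s1", ("s1", "right"): "s2",
    ("s2", "left"): "s1", ("s2", "right"): "goal",
    ("goal", "left"): "goal", ("goal", "right"): "goal",
}

REWARD = {sa: 0.0 for sa in DELTA}  # r(s, a) -> immediate reward
REWARD[("s2", "right")] = 100.0     # only the move into the goal is rewarded

def step(state, action):
    """One time step: return (r_t, s_{t+1}) for the chosen action."""
    return REWARD[(state, action)], DELTA[(state, action)]

print(step("s1", "right"))  # (0.0, 's2')
```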

  15. Agent's Learning Task • Learn an action policy that produces the greatest possible cumulative reward for the robot over time • The cumulative (discounted) reward starting from a state s_t is V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{i=0}^{∞} γ^i r_{t+i} • Here γ (0 ≤ γ < 1) is the discount factor for future rewards; delayed rewards are discounted exponentially
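
A small sketch of computing this discounted sum for a finite reward sequence (the numbers are made up for illustration):

```python
def discounted_return(rewards, gamma=0.9):
    """Return r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# A reward of 100 arriving three steps in the future is worth 0.9**3 * 100 today.
print(discounted_return([0, 0, 0, 100], gamma=0.9))  # about 72.9
```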

  16. Alternative rewards • Finite-horizon reward: Σ_{i=0}^{h} r_{t+i} • Average reward (over the entire lifetime): lim_{h→∞} (1/h) Σ_{i=0}^{h} r_{t+i}
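
For comparison, minimal sketches of the two alternatives; the reward sequence is illustrative:

```python
def finite_horizon_return(rewards, h):
    """Finite-horizon reward: undiscounted sum of the next h rewards."""
    return sum(rewards[:h])

def average_reward(rewards):
    """Average reward over a (finite slice of the) agent's lifetime."""
    return sum(rewards) / len(rewards)

rs = [0, 0, 0, 100, 0]
print(finite_horizon_return(rs, h=3))  # 0 (the reward lies beyond the horizon)
print(average_reward(rs))              # 20.0
```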

  17. Agent's Learning Task (contd.) • Optimal policy: π* = argmax_π V^π(s), for all s • We use V*(s) to denote the value function of the optimal policy π*

  18. Example: Grid world environment • Six possible states • Arrows represent possible actions • G: goal state • One optimal policy, denoted π*, gives the best thing to do in each state • Compute the values of the states under this policy, denoted V*

  19. Example

  20. Example

  21. Example: TD-Gammon • Immediate reward: +100 if we win, -100 if we lose, 0 for all other states • Trained by playing 1.5 million games against itself • Now approximately equal to the best human player

  22. Value function • We will consider deterministic worlds first • Given a policy π (adopted by the agent), define an evaluation function over states: V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{i=0}^{∞} γ^i r_{t+i} • Property (recursive form): V^π(s) = r(s, π(s)) + γ V^π(δ(s, π(s))) (see the evaluation sketch below)
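
The recursive property suggests a simple iterative way to evaluate V^π in a deterministic world. The corridor world, rewards, and policy in this sketch are illustrative assumptions:

```python
# Iterative policy evaluation using V(s) = r(s, pi(s)) + gamma * V(delta(s, pi(s))).
# The world, rewards, and policy below are illustrative assumptions.
GAMMA = 0.9
DELTA = {("s1", "right"): "s2", ("s2", "right"): "goal", ("goal", "stay"): "goal"}
REWARD = {("s1", "right"): 0.0, ("s2", "right"): 100.0, ("goal", "stay"): 0.0}
policy = {"s1": "right", "s2": "right", "goal": "stay"}

V = {s: 0.0 for s in policy}
for _ in range(50):  # sweep until the values settle
    for s, a in policy.items():
        V[s] = REWARD[(s, a)] + GAMMA * V[DELTA[(s, a)]]

print(V)  # roughly {'s1': 90.0, 's2': 100.0, 'goal': 0.0}
```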

  23. Example: Grid world environment • Six possible states • Arrows represent possible actions • G: goal state • One optimal policy, denoted π* • What is the best thing to do in each state? • Compute the values of the states under this policy, denoted V*

  24. The Q Function • Q(s, a) is the maximum discounted cumulative reward that can be achieved starting from state s and applying action a as the first action: Q(s, a) = r(s, a) + γ V*(δ(s, a)) • The optimal policy is then π*(s) = argmax_a Q(s, a) (a small sketch follows)
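
A minimal sketch of both definitions, assuming r, δ, and V* are already known for a tiny made-up world (all names and numbers are illustrative):

```python
# Q(s, a) = r(s, a) + gamma * V*(delta(s, a));  pi*(s) = argmax_a Q(s, a).
# The world below is a small illustrative assumption, with V* given.
GAMMA = 0.9
ACTIONS = ["left", "right"]
DELTA = {("s1", "left"): "s1", ("s1", "right"): "s2",
         ("s2", "left"): "s1", ("s2", "right"): "goal",
         ("goal", "left"): "goal", ("goal", "right"): "goal"}
REWARD = {sa: 0.0 for sa in DELTA}
REWARD[("s2", "right")] = 100.0
V_STAR = {"s1": 90.0, "s2": 100.0, "goal": 0.0}  # assumed known for this sketch

def q(s, a):
    return REWARD[(s, a)] + GAMMA * V_STAR[DELTA[(s, a)]]

def pi_star(s):
    """Greedy policy: once Q is known, the best action is just an argmax over a."""
    return max(ACTIONS, key=lambda a: q(s, a))

print(q("s1", "right"), pi_star("s1"))  # 90.0 right
```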

  25. Why do we need the Q function? • If the agent learns the Q function instead of the V* function, then • it can select optimal actions even when it has no knowledge of the functions r and δ • The Q function enables making decisions without look-ahead

  26. Why do we need the Q function? • If the agent learns the Q function instead of the V* function, then • it can select optimal actions even when it has no knowledge of the functions r and δ • The Q function enables making decisions without look-ahead

  27. Q Learning

  28. Q Learning • Let Q̂ denote the agent's current approximation to Q and consider the iterative update rule Q̂(s, a) ← r + γ max_{a'} Q̂(s', a') • Under some assumptions (each ⟨s, a⟩ visited infinitely often), this converges to the true Q

  29. Q Learning algorithm (in deterministic worlds) • For each (s, a), initialise the table entry Q̂(s, a) := 0 • Observe the current state s • Do forever: • Select an action a and execute it • Receive the immediate reward r • Observe the new state s' • Update the table entry: Q̂(s, a) := r + γ max_{a'} Q̂(s', a') • s := s' (a runnable sketch follows)
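
A minimal runnable sketch of this deterministic Q-learning loop on a tiny made-up corridor world (the state layout, rewards, and random action selection are assumptions for illustration):

```python
import random

# Deterministic Q-learning on a toy corridor world (illustrative assumptions).
GAMMA = 0.9
STATES = ["s1", "s2", "s3", "goal"]
ACTIONS = ["left", "right"]
DELTA = {("s1", "left"): "s1", ("s1", "right"): "s2",
         ("s2", "left"): "s1", ("s2", "right"): "s3",
         ("s3", "left"): "s2", ("s3", "right"): "goal"}
REWARD = {sa: 0.0 for sa in DELTA}
REWARD[("s3", "right")] = 100.0          # only entering the goal is rewarded

Q = {(s, a): 0.0 for s in STATES[:-1] for a in ACTIONS}   # initialise table to 0

for episode in range(500):
    s = "s1"
    while s != "goal":
        a = random.choice(ACTIONS)                    # select an action and execute it
        r, s_next = REWARD[(s, a)], DELTA[(s, a)]     # immediate reward, new state
        future = 0.0 if s_next == "goal" else max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] = r + GAMMA * future                # table update rule
        s = s_next

print(Q[("s1", "right")])   # converges to about 81.0 (0.9**2 * 100)
```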

  30. Example: updating Q̂ given the Q̂ values from a previous iteration (shown on the arrows)

  31. An Example

  32. Arrows indicate the associative strength between two problem states. Start the maze… (Maze diagram: Start, S2, S3, S4, S5, S7, S8, Goal)

  33. The first response leads to S2… The next state is chosen by randomly sampling from the possible next states, weighted by their associative strength (associative strength = line width in the diagram)

  34. Suppose the randomly sampled response leads to S3…

  35. At S3, choices lead to either S2, S4, or S7. S7 was picked (randomly)

  36. By chance, S3 was picked next…

  37. The next response is S4

  38. And S5 was chosen next (randomly)

  39. And the goal is reached…

  40. The goal is reached, so we strengthen the associative connection between the goal state and the last response. The next time S5 is reached, part of the associative strength is passed back to S4…

  41. Start the maze again…

  42. Let's suppose that after a couple of moves we end up at S5 again

  43. S5 is now likely to lead to the goal through the strengthened route. In reinforcement learning, strength is also passed back to the previous state. This paves the way for the next pass through the maze.

  44. The situation after lots of restarts…

  45. Convergence Theorem: Q Learning • If each state-action pair is visited infinitely often, then Q̂_n(s, a) converges to Q(s, a) as n tends to infinity, for all s, a • Proof: TB-1, page 379

  46. Exploration versus Exploitation • The Q-learning algorithm doesn't say how we should choose an action • If we always choose the action that maximises our current estimate of Q̂, we could end up never exploring better alternatives • To converge to the true Q values we must favour higher estimated Q̂ values, while still keeping a chance of choosing actions with worse estimates, for exploration • An action-selection function of the following form may be employed, where k > 0: P(a_i | s) = k^{Q̂(s, a_i)} / Σ_j k^{Q̂(s, a_j)} (see the sketch below)
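
A small sketch of this selection rule; the Q̂ values and the constant k used below are illustrative:

```python
import random

# Probabilistic action selection: P(a_i | s) proportional to k ** Q_hat(s, a_i), k > 0.
# Larger k exploits high-value actions more; k close to 1 explores more uniformly.
def select_action(q_values, k=2.0):
    """q_values: dict mapping action -> current Q_hat(s, action)."""
    actions = list(q_values)
    weights = [k ** q_values[a] for a in actions]  # relative weights k ** Q_hat
    return random.choices(actions, weights=weights, k=1)[0]

q_hat = {"left": 1.0, "right": 3.0}                # illustrative estimates
print(select_action(q_hat, k=2.0))                 # 'right' with probability 8/10
```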

  47. Nondeterministic case • What if the reward and the state transition are not deterministic? For example, in Backgammon learning and play depend on rolls of dice! • Then V and Q need to be redefined by taking expected values • Similar reasoning and a convergent update iteration apply (a sketch of a commonly used rule follows) • Will continue next week.
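
As a preview, a commonly used form of the nondeterministic update blends the old estimate with the new sample via a learning rate α rather than overwriting it; this sketch is an assumption about the rule to be covered, not taken from the slides:

```python
# Q_hat <- (1 - alpha) * Q_hat + alpha * (r + gamma * max_a' Q_hat(s', a'))
# alpha is a (typically decaying) learning rate; values below are illustrative.
def q_update(q_old, r, max_q_next, alpha=0.1, gamma=0.9):
    """One nondeterministic-case Q update blending old estimate and new sample."""
    return (1 - alpha) * q_old + alpha * (r + gamma * max_q_next)

print(q_update(q_old=50.0, r=0.0, max_q_next=100.0))  # 0.9*50 + 0.1*90 = 54.0
```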

  48. Summary • Reinforcement learning is suitable for learning in uncertain environments where rewards may be delayed and subject to chance • The goal of a reinforcement learning program is to maximise the eventual reward • Q-learning is a form of reinforcement learning that doesn't require the learner to have prior knowledge of how its actions affect the environment
