Presentation Transcript


  1. Reinforcement Learning Ata Kaban A.Kaban@cs.bham.ac.uk School of Computer Science University of Birmingham

  2. Learning by reinforcement • Examples: • Learning to play Backgammon • Robot learning to dock on battery charger • Characteristics: • No direct training examples – delayed rewards instead • Need for exploration & exploitation • The environment is stochastic and unknown • The actions of the learner affect future rewards

  3. Brief history & successes • Minsky’s PhD thesis (1954): Stochastic Neural-Analog Reinforcement Computer • Analogies with animal learning and psychology • TD-Gammon (Tesauro, 1992) – big success story • Job-shop scheduling for NASA space missions (Zhang and Dietterich, 1997) • Robotic soccer (Stone and Veloso, 1998) – part of the world-champion approach • ‘An approximate solution to a complex problem can be better than a perfect solution to a simplified problem’

  4. The RL problem • States $s \in S$ • Actions $a \in A$ • Immediate rewards $r_t$ • Eventual reward $V = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$ from any starting state • Discount factor $0 \le \gamma < 1$
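To make the eventual reward concrete, here is a minimal sketch (not from the slides) of how a discounted return is accumulated from a sequence of immediate rewards; the reward values and $\gamma = 0.9$ are purely illustrative.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A delayed reward of 100 received three steps from now is worth
# 0.9**3 * 100 = 72.9 when discounted back to the present.
print(discounted_return([0, 0, 0, 100], gamma=0.9))
```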

  5. Markov Decision Process (MDP) • MDP is a formal model of the RL problem • At each discrete time point: • Agent observes state $s_t$ and chooses action $a_t$ • Receives reward $r_t$ from the environment and the state changes to $s_{t+1}$ • Markov assumption: $r_t = r(s_t, a_t)$ and $s_{t+1} = \delta(s_t, a_t)$, i.e. $r_t$ and $s_{t+1}$ depend only on the current state and action • In general, the functions $r$ and $\delta$ may not be deterministic and are not necessarily known to the agent
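As a minimal sketch of how such an MDP might be represented in code, the tables below encode a tiny hypothetical deterministic world; the states, actions and reward numbers are invented for illustration and do not come from the slides.

```python
# Hypothetical two-state deterministic MDP, for illustration only.
states  = ["s0", "s1"]
actions = ["left", "right"]

# delta: (state, action) -> next state
delta = {("s0", "right"): "s1", ("s0", "left"): "s0",
         ("s1", "right"): "s1", ("s1", "left"): "s0"}

# r: (state, action) -> immediate reward
r = {("s0", "right"): 0, ("s0", "left"): 0,
     ("s1", "right"): 1, ("s1", "left"): 0}

def step(state, action):
    """One discrete time step under the Markov assumption: the reward and
    the next state depend only on the current state and action."""
    return r[(state, action)], delta[(state, action)]
```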

  6. Agent's Learning Task • Execute actions in the environment, observe the results, and learn an action policy $\pi : S \to A$ that maximises $V^{\pi}(s) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$ from any starting state $s$ in $S$. Here $0 \le \gamma < 1$ is the discount factor for future rewards • Note: • The target function is $\pi : S \to A$ • There are no training examples of the form $(s, a)$, only of the form $((s, a), r)$
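Under a known deterministic model such as the two-state sketch above, $V^{\pi}(s)$ can be approximated by simply following the policy and summing discounted rewards; the finite horizon used to truncate the sum is an assumption made to keep the sketch runnable.

```python
def value_of_policy(pi, start, gamma=0.9, horizon=100):
    """Approximate V^pi(start) by following pi (a dict S -> A) and
    truncating the discounted sum after `horizon` steps."""
    total, state = 0.0, start
    for t in range(horizon):
        reward, next_state = step(state, pi[state])   # step() from the MDP sketch above
        total += (gamma ** t) * reward
        state = next_state
    return total

pi = {"s0": "right", "s1": "right"}   # an example action policy
print(value_of_policy(pi, "s0"))
```

Of course, the point of the lecture is that the agent does not know $r$ and $\delta$, so it cannot evaluate policies this way and must instead learn from experience of the form ((s, a), r).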

  7. Example: TD-Gammon • Immediate reward: +100 if win, -100 if lose, 0 for all other states • Trained by playing 1.5 million games against itself • Now approximately equal to the best human player

  8. Example: Mountain-Car • States: position and velocity • Actions: accelerate forward, accelerate backward, coast • Rewards: either reward = -1 for every step until the car reaches the top, or reward = +1 at the top and 0 otherwise, with discount $\gamma < 1$ • Either way, the eventual reward is maximised by minimising the number of steps to the top of the hill
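For concreteness, a sketch of one step of mountain-car dynamics is given below; the slide only specifies states, actions and rewards, so the update constants follow the commonly used Sutton & Barto formulation and are an assumption, not part of the original slide.

```python
import math

def mountain_car_step(x, v, action):
    """One step of mountain-car dynamics (constants are the common
    Sutton & Barto ones, assumed here, not taken from the slide).
    action: -1 = accelerate backward, 0 = coast, +1 = accelerate forward."""
    v = min(max(v + 0.001 * action - 0.0025 * math.cos(3 * x), -0.07), 0.07)
    x = min(max(x + v, -1.2), 0.6)
    if x <= -1.2:                # hit the left wall: velocity is reset
        v = 0.0
    done = x >= 0.5              # car has reached the top of the hill
    reward = -1                  # -1 per step until the top, as on the slide
    return x, v, reward, done
```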

  9. Q Learning algorithm (in deterministic worlds) • For each $(s, a)$ initialise the table entry $\hat{Q}(s, a) := 0$ • Observe the current state $s$ • Do forever: • Select an action $a$ and execute it • Receive immediate reward $r$ • Observe the new state $s'$ • Update the table entry as follows: $\hat{Q}(s, a) := r + \gamma \max_{a'} \hat{Q}(s', a')$ • $s := s'$
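A sketch of this tabular algorithm is given below, written against the small deterministic MDP sketched under slide 5; since the slide does not prescribe how to select actions (that is the topic of slide 11), the sketch simply picks them at random.

```python
import random

def q_learning(states, actions, step, start, gamma=0.9, episodes=1000, horizon=50):
    """Tabular Q-learning for a deterministic world:
    Q(s, a) := r + gamma * max_a' Q(s', a')."""
    Q = {(s, a): 0.0 for s in states for a in actions}   # initialise all table entries to 0
    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            a = random.choice(actions)       # select an action and execute it
            reward, s_next = step(s, a)      # receive immediate reward, observe new state
            Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next                       # s := s'
    return Q

# e.g. Q = q_learning(states, actions, step, "s0") with the slide-5 sketch
```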

  10. Example: updating $\hat{Q}$ for one state-action pair, given the $\hat{Q}$ values from a previous iteration shown on the arrows of the slide's state diagram (figure not reproduced in this transcript)
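Since the slide's figure cannot be shown here, the numbers below are illustrative only: suppose moving right from the current state yields immediate reward 0 and leads to a state whose outgoing actions currently have $\hat{Q}$ estimates 63, 81 and 100 on the arrows.

```python
# Illustrative values only (not taken from the slide's figure).
gamma = 0.9
q_new = 0 + gamma * max(63, 81, 100)   # Q(s, a_right) := r + gamma * max_a' Q(s', a')
print(q_new)                           # 90.0
```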

  11. Exploration versus Exploitation • The Q-learning algorithm doesn't say how we should choose an action • If we always choose the action that maximises our current estimate of Q we may never explore better alternatives • To converge on the true Q values we must favour higher estimated Q values but still keep a chance of choosing actions with lower estimates, for the sake of exploration (see the convergence proof of the Q-learning algorithm in [Mitchell, sec. 13.3.4]). An action selection function of the following form may be employed, where $k > 0$: $P(a_i \mid s) = \frac{k^{\hat{Q}(s, a_i)}}{\sum_j k^{\hat{Q}(s, a_j)}}$
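A direct sketch of that selection rule, with $k$ treated as a tunable constant (the value 2.0 below is an arbitrary example):

```python
import random

def select_action(Q, s, actions, k=2.0):
    """Choose a_i with probability k**Q(s, a_i) / sum_j k**Q(s, a_j), k > 0.
    Larger k favours exploitation of high Q estimates; k close to 1 gives
    near-uniform exploration."""
    weights = [k ** Q[(s, a)] for a in actions]
    return random.choices(actions, weights=weights)[0]
```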

  12. Summary • Reinforcement learning is suitable for learning in uncertain environments where rewards may be delayed and subject to chance • The goal of a reinforcement learning program is to maximise the eventual reward • Q-learning is a form of reinforcement learning that doesn't require the learner to have prior knowledge of how its actions affect the environment

  13. Further topics: Nondeterministic case • What if the reward and the state transition are not deterministic? E.g. in Backgammon, learning and playing depend on rolls of the dice! • Then V and Q need to be redefined by taking expected values • Similar reasoning and a convergent update iteration still apply • Will continue next week.
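As a preview, a sketch of the nondeterministic-world update is given below; the decaying learning rate $\alpha_n = 1/(1 + \mathrm{visits}_n(s, a))$ is the standard form from Mitchell's treatment and is not spelled out on this slide.

```python
def q_update_stochastic(Q, visits, s, a, reward, s_next, actions, gamma=0.9):
    """Nondeterministic-world Q update: a decaying-rate running average of the
    deterministic target, Q(s,a) := (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```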
