
# Reinforcement Learning


## Reinforcement Learning

16 January 2009

RG Knowledge Based Systems

Hans Kleine Büning

### Outline

• Motivation

• Applications

• Markov Decision Processes

• Q-learning

• Examples

How to program a robot to ride a bicycle?

### Reinforcement Learning: The Idea

• A way of programming agents by reward and punishment without specifying how the task is to be achieved

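[Figure: the agent-environment loop. The environment sends a state and a reward (drawn as €€€) to the agent; the agent sends an action back to the environment.]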

### Learning to Ride a Bicycle

States:

• Angle of the handlebars

• Angular velocity of the handlebars

• Angle of the bicycle to the vertical

• Angular velocity of the bicycle relative to the vertical

• Angular acceleration of the bicycle relative to the vertical


### Learning to Ride a Bicycle

Actions:

• Torque to be applied to the handlebars

• Displacement of the center of mass from the bicycle's plane (in cm)


### Learning to Ride a Bicycle

Reward, decided by one test: is the angle of the bicycle to the vertical greater than 12°?

• yes: Reward = -1

• no: Reward = 0
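The bicycle task can be sketched in a few lines of Python. This is only an illustration of the state variables and the reward rule from the slides, not the original implementation: the class and field names are hypothetical, angles are assumed to be stored in radians, and the 12° test is assumed to apply to a tilt toward either side.

```python
import math
from dataclasses import dataclass

FALL_ANGLE_DEG = 12.0  # threshold taken from the slide


@dataclass
class BicycleState:
    """The five state variables from the slides (field names are hypothetical)."""
    handlebar_angle: float             # assumed unit: rad
    handlebar_angular_velocity: float  # assumed unit: rad/s
    tilt_angle: float                  # angle of bicycle to vertical, rad
    tilt_angular_velocity: float       # assumed unit: rad/s
    tilt_angular_acceleration: float   # assumed unit: rad/s^2


def reward(state: BicycleState) -> float:
    """Return -1 if the bicycle tilts more than 12 degrees from vertical, else 0."""
    # Assumption: the sign of the tilt does not matter, so compare the magnitude.
    if abs(math.degrees(state.tilt_angle)) > FALL_ANGLE_DEG:
        return -1.0
    return 0.0
```

Note that the agent is never told how to balance. The only feedback is the -1 once the tilt exceeds the threshold.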

### Reinforcement Learning: Applications

• Board games

  • The TD-Gammon program, based on reinforcement learning, has become a world-class backgammon player

• Mobile robot control

  • Learning to ride a bicycle

  • Pole balancing

  • Acrobot

• Sequential process control

  • Elevator dispatching

### Key Features of Reinforcement Learning

• Learner is not told which actions to take

• Trial and error search

• Possibility of delayed reward:

• Sacrifice of short-term gains for greater long-term gains

• Considers the whole problem of a goal-directed agent interacting with an uncertain environment

### The Agent-Environment Interaction

• Agent and environment interact at discrete time steps: $t = 0, 1, 2, \ldots$

• Agent observes state at step $t$: $s_t \in S$

• produces action at step $t$: $a_t \in A$

• gets resulting reward: $r_{t+1} \in \mathbb{R}$

• and resulting next state: $s_{t+1} \in S$
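The interaction loop above can be sketched as follows. The function `run_episode`, the callback signatures, and the toy line-world environment are all hypothetical and exist only to show the order of events: observe the state, choose an action, receive the reward and the next state.

```python
from typing import Callable, List, Tuple

State = int
Action = int


def run_episode(
    step: Callable[[State, Action], Tuple[float, State]],
    policy: Callable[[State], Action],
    start_state: State,
    num_steps: int,
) -> List[Tuple[State, Action, float]]:
    """Run the agent-environment loop and record (s_t, a_t, r_{t+1}) per step."""
    trajectory = []
    state = start_state
    for _ in range(num_steps):
        action = policy(state)                    # agent produces a_t
        reward, next_state = step(state, action)  # environment returns r_{t+1}, s_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


def toy_step(state: State, action: Action) -> Tuple[float, State]:
    """Hypothetical environment: states 0..4 on a line, reward 1 on reaching state 4."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return reward, next_state


# A fixed policy that always moves right (action +1).
trajectory = run_episode(toy_step, lambda s: 1, start_state=0, num_steps=4)
```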

### The Agent’s Goal:

• Roughly, the agent's goal is to get as much reward as it can over the long run

Policy π is

• a mapping from states to actions: π(s) = a

• Reinforcement learning methods specify how the agent changes its policy as a result of experience

[Figure: possible outcomes of an action, with transition probabilities P = 0.8, P = 0.1, and P = 0.1]
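A deterministic policy over a small discrete state set can be stored as a lookup table, and "reward over the long run" is often formalized as a discounted sum of rewards. The sketch below assumes that formalization; the discount factor `gamma` does not appear on the slides, and the state and action names are hypothetical.

```python
from typing import Dict, Sequence

# A tabular policy: pi(s) = a is a dictionary lookup (names are hypothetical).
policy: Dict[str, str] = {"s0": "right", "s1": "right", "s2": "up"}


def discounted_return(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Sum r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one recorded episode.

    Assumption: a discounted return is used; gamma is not given on the slides.
    """
    total = 0.0
    # Work backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With `gamma` close to 1 the agent weighs late rewards almost as much as early ones, which is what makes sacrificing short-term gains for long-term gains worthwhile.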

### Methods

| | Discrete states | Continuous states |
|---|---|---|
| Model (reward function and transition probabilities) is known | Dynamic programming | Value function approximation + dynamic programming |
| Model (reward function or transition probabilities) is unknown | Reinforcement learning, Monte Carlo methods | Value function approximation + reinforcement learning |

### Blackjack

• Standard rules of blackjack hold

• State space:

  • element[0]: current value of the player's hand (4-21)

  • element[1]: value of the dealer's face-up card (2-11)

  • element[2]: whether the player has a usable ace (0/1)

• Starting states:

  • player has any 2 cards (uniformly distributed), dealer has any 1 card (uniformly distributed)

• Actions:

  • HIT

  • STICK

• Rewards:

  • -1 for a loss

  • 0 for a draw

  • 1 for a win
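The state description above maps directly onto a three-element tuple. The following sketch enumerates every combination of the three listed ranges and builds a tabular policy over them. The constant names are hypothetical, and the "stick on 20 or more" rule is an arbitrary baseline chosen for illustration, not the optimal policy.

```python
from itertools import product

HIT, STICK = 0, 1
REWARDS = {"loss": -1, "draw": 0, "win": 1}


def all_states():
    """Enumerate (player_sum, dealer_card, usable_ace) over the listed ranges."""
    player_sums = range(4, 22)   # 4..21
    dealer_cards = range(2, 12)  # 2..11, ace counted as 11
    usable_ace = (0, 1)
    return list(product(player_sums, dealer_cards, usable_ace))


states = all_states()

# Arbitrary baseline policy (an assumption for illustration): stick on 20 or 21.
baseline_policy = {s: (STICK if s[0] >= 20 else HIT) for s in states}
```

The enumeration is an upper bound on the reachable states: some tuples, such as a hand value of 4 with a usable ace, cannot occur in play.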

### Blackjack: Optimal Policy

### Grid World

• States: grid cells

• Actions: Left, Up, Right, Down

• Rewards:

  • Bonus: 20

  • Food: 1

  • Predator: -10

  • Empty cell: -0.1

• Transition probabilities:

  • 0.80: the agent moves to the cell it intended to reach

  • 0.20: the agent moves to any other adjacent cell or stays where it was (if it is at the border of the grid world, it wraps around to the opposite side)
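The grid-world dynamics can be sketched as a sampling function. The slides do not say how the remaining 0.20 is divided, so this sketch assumes it is spread uniformly over the three other adjacent cells and staying put. The grid size, the function names, and the (row, column) cell encoding are likewise hypothetical.

```python
import random

# (row, column) offsets for the four actions.
ACTIONS = {"left": (0, -1), "up": (-1, 0), "right": (0, 1), "down": (1, 0)}
REWARDS = {"bonus": 20.0, "food": 1.0, "predator": -10.0, "empty": -0.1}


def move(cell, delta, rows, cols):
    """Move one step; leaving the grid wraps around to the opposite side."""
    return ((cell[0] + delta[0]) % rows, (cell[1] + delta[1]) % cols)


def sample_next_cell(cell, action, rows, cols, rng):
    """Sample the next cell: intended move with probability 0.8, otherwise a slip."""
    if rng.random() < 0.8:
        return move(cell, ACTIONS[action], rows, cols)
    # Assumption: the remaining 0.2 is uniform over the other three
    # adjacent cells plus staying in place.
    alternatives = [
        move(cell, delta, rows, cols)
        for name, delta in ACTIONS.items()
        if name != action
    ]
    alternatives.append(cell)
    return rng.choice(alternatives)


rng = random.Random(0)
next_cell = sample_next_cell((0, 0), "right", rows=4, cols=5, rng=rng)
```

Passing the random generator in explicitly keeps runs reproducible, which helps when comparing learning algorithms on the same grid.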