
Reinforcement Learning

- In previous chapters:
- Learning from examples.

- Reinforcement learning:
- Learning what to do.
- Learning to fly (a helicopter).
- Learning to play a game.
- Learning to walk.

- Learning based on rewards.

Relation to MDPs

- Feedback can be provided at the end of the sequence of actions, or more frequently.
- Compare chess (feedback only at the end of the game) with ping-pong (feedback after each point).

- No complete model of environment.
- Transitions may be unknown.

- Reward function unknown.

Agents

- Utility-based agent:
- Learns utility function on states.

- Q-learning agent:
- Learns utility function on (action, state) pairs.

- Reflex agent:
- Learns function mapping states to actions.

Passive Reinforcement Learning

- Assume fully observable environment.
- Passive learning:
- Policy is fixed (behavior does not change).
- The agent learns how good each state is.

- Similar to policy evaluation, but:
- Transition function and reward function are unknown.

- Why is it useful?
- For future policy revisions.

Direct Utility Estimation

- For each state the agent ever visits:
- For each time the agent visits the state:
- Keep track of the accumulated reward (the return) from that visit onwards.
- Estimate the utility as the average of these returns.

- Similar to inductive learning:
- Learning a function on states using samples.

- Weaknesses:
- Ignores correlations between utilities of neighboring states.
- Converges very slowly.
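As a concrete illustration, here is a minimal sketch of direct utility estimation in Python. The trial format and the sample data are assumptions made for the example, not part of the slides.

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U(s) as the average discounted return observed
    from each visit to s onward. Each trial is a list of
    (state, reward) pairs produced by following the fixed policy."""
    totals = defaultdict(float)   # sum of returns observed from each state
    counts = defaultdict(int)     # number of visits to each state
    for trial in trials:
        # Compute the return from each position by scanning backwards.
        ret = 0.0
        returns = []
        for state, reward in reversed(trial):
            ret = reward + gamma * ret
            returns.append((state, ret))
        for state, ret in returns:
            totals[state] += ret
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Example: two trials in a tiny chain world (hypothetical data).
trials = [
    [("A", -0.04), ("B", -0.04), ("C", 1.0)],
    [("A", -0.04), ("C", 1.0)],
]
print(direct_utility_estimation(trials))
```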

Adaptive Dynamic Programming

- Learns transitions and state utilities.
- Plugs values into Bellman equations.
- Solves equations with linear algebra, or policy iteration.
- Problem:
- Intractable for a large number of states.

- Example: backgammon.
- 10^50 equations, with 10^50 unknowns.
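A minimal sketch of passive ADP for small state spaces (class and method names are illustrative, not from the slides); it learns transition counts from observations under the fixed policy, then re-solves the Bellman equations iteratively rather than by linear algebra:

```python
from collections import defaultdict

class PassiveADPAgent:
    """Passive ADP: learn T and R from observed transitions under a
    fixed policy, then solve the Bellman equations for U."""

    def __init__(self, policy, gamma=0.9):
        self.policy, self.gamma = policy, gamma
        self.R = {}                     # observed reward in each state
        self.n_sa = defaultdict(int)    # N(s, a)
        self.n_sas = defaultdict(int)   # N(s, a, s')
        self.U = defaultdict(float)

    def observe(self, s, r, s2):
        """Record one transition from s to s2 with reward r in s."""
        a = self.policy[s]
        self.R[s] = r
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s2)] += 1
        self._policy_evaluation(iterations=20)

    def _policy_evaluation(self, iterations):
        # Solve the linear Bellman equations iteratively; fine for
        # small state spaces, intractable for 10^50 states.
        for _ in range(iterations):
            for s in list(self.R):
                a = self.policy[s]
                total = self.n_sa[(s, a)]
                expected = sum(self.n_sas[(s, a, nxt)] / total * self.U[nxt]
                               for (ss, aa, nxt) in list(self.n_sas)
                               if ss == s and aa == a)
                self.U[s] = self.R[s] + self.gamma * expected

# Example with hypothetical states and a fixed policy:
agent = PassiveADPAgent(policy={"A": "right", "B": "right"})
agent.observe("A", -0.04, "B")
agent.observe("B", -0.04, "C")   # "C" acts as a terminal state here
print(dict(agent.U))
```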

Temporal Difference

- Every time we make a transition from state s to state s':
- Update utility of s' (on its first visit):
U[s'] = the currently observed reward.

- Update utility of s:
U[s] = (1 - α) U[s] + α (r + γ U[s']).

α: learning rate

r: reward received in state s

γ: discount factor
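The update rule translates directly into code; a minimal sketch, assuming utilities are kept in a plain dict:

```python
def td_update(U, s, r, s2, alpha=0.1, gamma=0.9):
    """One temporal-difference update after a transition s -> s2,
    where r is the reward received in s. Modifies U in place."""
    if s2 not in U:
        U[s2] = 0.0   # first visit; could also use the reward observed at s2
    U[s] = (1 - alpha) * U.get(s, 0.0) + alpha * (r + gamma * U[s2])

# Example (hypothetical values):
U = {}
td_update(U, "A", -0.04, "B")
print(U)
```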

Properties of Temporal Difference

- What happens when an unlikely transition occurs?
- U[s] becomes a bad approximation of the true utility.
- However, unlikely transitions occur rarely, so U[s] is rarely a bad approximation.

- The average value of U[s] converges to the correct value.
- If the learning rate α decreases over time, U[s] itself converges to the correct value.

Hybrid Methods

- ADP:
- More accurate, slower, intractable for large numbers of states.

- TD:
- Less accurate, faster, tractable.

- An intermediate approach: Pseudo-experiences:
- Imagine transitions that have not happened.
- Update utilities according to those transitions.
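This pseudo-experience idea is essentially what Dyna-style algorithms do. A minimal sketch, assuming a learned model that maps each state to the list of (reward, next state) outcomes observed so far (the names are illustrative):

```python
import random

def planning_updates(U, model, n_updates=10, alpha=0.1, gamma=0.9):
    """Dyna-style pseudo-experiences: replay imagined transitions
    drawn from the learned model and apply ordinary TD updates.
    model: dict mapping state s to a list of observed (r, s') pairs."""
    states = list(model)
    for _ in range(n_updates):
        s = random.choice(states)         # imagine being in s
        r, s2 = random.choice(model[s])   # imagine an observed outcome
        U[s] = ((1 - alpha) * U.get(s, 0.0)
                + alpha * (r + gamma * U.get(s2, 0.0)))
```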

Hybrid Methods

- Making ADP more efficient:
- Do a limited number of adjustments after each transition.
- Use estimated transition probabilities to identify the most useful adjustments.
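This heuristic is known as prioritized sweeping. A minimal sketch, assuming helper functions for the Bellman backup and a predecessor table (both illustrative assumptions):

```python
import heapq

def prioritized_sweep(U, predecessors, bellman_backup, s_changed,
                      n_updates=10, threshold=1e-3):
    """After U[s_changed] moves, push its predecessors onto a priority
    queue and process the states whose utilities would change most.
    bellman_backup(s) returns the new Bellman estimate of U[s];
    predecessors[s] lists states with a transition into s."""
    pq = [(-abs(bellman_backup(p) - U.get(p, 0.0)), p)
          for p in predecessors.get(s_changed, [])]
    heapq.heapify(pq)
    for _ in range(n_updates):
        if not pq:
            break
        neg_change, s = heapq.heappop(pq)
        if -neg_change < threshold:
            break                            # remaining changes are tiny
        U[s] = bellman_backup(s)             # apply the adjustment
        for p in predecessors.get(s, []):    # propagate to predecessors
            change = abs(bellman_backup(p) - U.get(p, 0.0))
            if change > threshold:
                heapq.heappush(pq, (-change, p))
```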

Active Reinforcement Learning

- Using passive reinforcement learning, utilities of states and transition probabilities are learned.
- Those utilities and transitions can be plugged into Bellman equations.
- Problem?
- Bellman equations give optimal solutions given correct utility and transition functions.
- Passive reinforcement learning produces approximate estimates of those functions.

- Solutions?

Exploration/Exploitation

- The goal is to maximize utility.
- However, utility function is only approximately known.
- Dilemma: should the agent
- Maximize utility based on current knowledge (exploitation), or
- Try to improve current knowledge (exploration)?

- Answer:
- A little of both.
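One simple way to get "a little of both" is an ε-greedy rule (not named on the slides): exploit the current estimates with probability 1 − ε and try a random action with probability ε.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimated Q (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((a, s), 0.0))
```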

Exploration Function

- U[s] = R[s] + γ max_a f(Q(a, s), N(a, s)).
R[s]: current reward.

γ: discount factor.

Q(a, s): estimated utility of performing action a in state s.

N(a, s): number of times action a has been performed in state s.

f(u, n): preference as a function of the estimated utility u and the degree of exploration n so far for (a, s).

- Initialization: U[s] = an optimistically large value.
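The slides leave f unspecified; one common concrete choice (an assumption here) is to return an optimistic reward estimate R+ until (a, s) has been tried at least N_e times, and the utility estimate u afterwards:

```python
R_PLUS = 2.0   # optimistic estimate of the best possible reward (assumed)
N_E = 5        # try each (action, state) pair at least this many times

def f(u, n):
    """Exploration function: optimistic while (a, s) is under-explored,
    then fall back to the learned utility estimate u."""
    return R_PLUS if n < N_E else u

def exploratory_value(s, actions, R, Q, N, gamma=0.9):
    """U[s] = R[s] + gamma * max_a f(Q(a, s), N(a, s))."""
    return R[s] + gamma * max(f(Q.get((a, s), 0.0), N.get((a, s), 0))
                              for a in actions)
```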

Q-learning

- Learning utilities of state-action pairs:
U[s] = max_a Q(a, s).

- Learning can be done using TD:
Q(a, s) = (1 - β) Q(a, s) + β (R(s) + γ max_a' Q(a', s')).

β: learning rate

γ: discount factor

s': next state

a': possible action at the next state
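The update translates directly into code; a minimal sketch with Q stored as a dict keyed by (action, state) pairs:

```python
def q_update(Q, s, a, r, s2, actions, beta=0.1, gamma=0.9):
    """One Q-learning update for the transition (s, a, r, s2).
    actions: actions available in the next state s2. Modifies Q in place."""
    best_next = max((Q.get((a2, s2), 0.0) for a2 in actions), default=0.0)
    Q[(a, s)] = ((1 - beta) * Q.get((a, s), 0.0)
                 + beta * (r + gamma * best_next))

# Utility of a state, as on the slide: U[s] = max_a Q(a, s).
def utility(Q, s, actions):
    return max(Q.get((a, s), 0.0) for a in actions)
```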

Generalization in Reinforcement Learning

- How do we apply reinforcement learning to problems with huge numbers of states (chess, backgammon, …)?
- Solution similar to estimating probabilities of a huge number of events:
- Learn parametric functions defined on features of each state.

- Example: chess.
- About 20 features are adequate for describing the current board.
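A minimal sketch of the parametric idea: represent the utility as a weighted linear function of state features and adjust the weights with a TD-style gradient step. The two "chess-like" features in the example are invented for illustration.

```python
def u_hat(weights, features):
    """Linear parametric utility: U_hat(s) = sum_i w_i * f_i(s)."""
    return sum(w * f for w, f in zip(weights, features))

def td_weight_update(weights, feats_s, r, feats_s2, alpha=0.01, gamma=0.9):
    """Move each weight along the TD error times its feature value
    (the gradient step for a linear utility function)."""
    error = r + gamma * u_hat(weights, feats_s2) - u_hat(weights, feats_s)
    return [w + alpha * error * f for w, f in zip(weights, feats_s)]

# Example with two invented features (say, material and mobility):
w = [0.5, 0.1]
w = td_weight_update(w, [1.0, 0.3], 0.0, [0.8, 0.5])
print(w)
```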

Learning Parametric Utility Functions For Backgammon

- First approach:
- Design a weighted linear function of 16 terms.
- Collect training set of board states.
- Ask human experts to evaluate training states.

- Result:
- Program not competitive with human experts.
- Collecting training data was very tedious.

Learning Parametric Utility Functions For Backgammon

- Second approach:
- Design a weighted linear function of 16 terms.
- Let the system play against itself.
- Reward provided at the end of each game.

- Result (after 300,000 games, taking a few weeks):
- Program competitive with best players in the world.
