# Reinforcement Learning


### Reinforcement Learning

Mitchell, Ch. 13 (see also the Sutton & Barto book, available online)

#### Rationale

• Learning from experience; adaptive control
• Examples are not explicitly labeled, and feedback is delayed
• Problem of credit assignment – which action(s) led to the payoff?
• Trade-off between short-term gains (immediate reward) and long-term consequences
#### Agent Model

• Transition function T: S × A → S (the environment)
• Reward function R: S × A → ℝ (the payoff)
• Transitions may be stochastic, but Markov
• Policy = decision function, π: S → A
• "Rationality" – maximize long-term expected reward
• Discounted long-term reward (a convergent series)
• Alternatives: finite time horizon, uniform weights

V^π(s_t) = E[ r_t + γ·r_{t+1} + γ²·r_{t+2} + … ] = E[ Σ_{i=0}^{∞} γ^i · r_{t+i} ],   with 0 ≤ γ < 1
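As a quick illustration of why the discounted sum converges, here is a minimal Python sketch (the reward sequence and the value of γ are made-up examples, not from the slides):

```python
# Discounted return of a finite reward sequence with discount factor 0 <= gamma < 1.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Illustrative values: four rewards of 1.0 give 1 + 0.9 + 0.81 + 0.729 = 3.439;
# an infinite stream of 1.0 rewards would converge to 1 / (1 - gamma) = 10.
print(discounted_return([1.0, 1.0, 1.0, 1.0]))
```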

#### Markov Decision Processes (MDPs)

• If R and T (= P) are known, solve for the value function V^π(s)
• Policy evaluation
• Bellman equations: V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) · V^π(s')
• Dynamic programming: |S| linear equations in |S| unknowns (see the sketch below)
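Because the Bellman equations are linear for a fixed policy, policy evaluation can be done by solving that |S| × |S| system directly. A minimal sketch, assuming a small made-up MDP where `P_pi` and `r_pi` (hypothetical names) hold the transition probabilities and expected rewards under π:

```python
import numpy as np

gamma = 0.9
# Hypothetical 3-state example: P_pi[s, s'] = P(s' | s, pi(s)), r_pi[s] = R(s, pi(s)).
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([0.0, 0.0, 1.0])

# Bellman equations: V = r_pi + gamma * P_pi @ V  =>  (I - gamma * P_pi) V = r_pi
V = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
print(V)   # value of each state under the fixed policy pi
```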
#### MDPs: Finding Optimal Policies

• Value iteration – update V(s) iteratively until the greedy policy π(s) = argmax_a [R(s, a) + γ Σ_{s'} P(s' | s, a) · V(s')] stops changing (see the sketch after this list)
• Policy iteration – alternate between choosing π greedily and re-evaluating V over all states
• Monte Carlo sampling – run random scenarios under π and take the average observed return as V(s)
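A minimal value-iteration sketch under the same known-model assumption (the array shapes and stopping tolerance are illustrative choices, not from the slides):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration for a known model.
    P[a] is an |S| x |S| transition matrix for action a; R[s, a] is the expected reward.
    Returns the optimal values V and the greedy policy pi(s) = argmax_a Q(s, a).
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = np.stack([R[:, a] + gamma * P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```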
#### Q-learning: Model-Free

• Q-function: reformulate the value function over states and actions, Q(s, a) ≡ r(s, a) + γ · V*(δ(s, a)); it can be learned from experience without knowing R or T (= δ) (update rule sketched below)
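A minimal sketch of the resulting model-free update: one Q-learning step from a single observed transition. The learning rate α is an assumption for the nondeterministic case; the deterministic rule corresponds to α = 1.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Update the tabular estimate Q[s, a] from one experienced transition (s, a, r, s').
    No model of R or T is used - only the sampled reward and observed next state.
    """
    target = r + gamma * np.max(Q[s_next])      # bootstrap from the best next action
    Q[s, a] += alpha * (target - Q[s, a])       # move the estimate toward the target
    return Q
```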
#### Convergence

• Theorem: Q̂ converges to Q* if every state–action pair is visited infinitely often (assuming bounded rewards, |r| < ∞, and 0 ≤ γ < 1)
• Proof idea: after each full sweep in which every (s, a) pair is visited, the magnitude of the largest error in the Q table is reduced to at most γ times its previous value (written out below)
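The contraction step behind the proof can be written out as follows (a sketch; Δ_n is notation introduced here for the largest error after n full sweeps, not a symbol from the slides):

```latex
% Largest error in the Q table after n full sweeps (notation introduced here):
\Delta_n \;\equiv\; \max_{s,a} \bigl|\hat{Q}_n(s,a) - Q^*(s,a)\bigr|
% One more sweep in which every (s, a) pair is updated gives
\Delta_{n+1} \;\le\; \gamma \, \Delta_n
% so with 0 <= gamma < 1 the worst-case error shrinks geometrically and \hat{Q}_n -> Q^*.
```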
#### Training

• "On-policy" learning
• Exploitation vs. exploration
• Will the relevant parts of the state space be explored if the agent sticks to its current (sub-optimal) policy?
• ε-greedy policies: choose the action with the maximum Q value most of the time, or a random action with probability ε (sketched, together with the SARSA update, after this list)
• "Off-policy" learning
• Learn from simulations or recorded traces
• SARSA: train from a database of examples <s, a, r, s', a'>
• Actor-critic methods
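A minimal sketch of ε-greedy action selection and the SARSA update from one <s, a, r, s', a'> tuple (the random seed and the ε, α, γ values are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA update: bootstrap from the action a' actually taken in s',
    rather than from the max over actions as in Q-learning."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```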
#### Convergence is not the problem

• Representing a large Q table is the problem (domains with many states or continuous actions)
• How can a large Q table be represented compactly?
• Neural networks
• Function approximation
• Basis functions (see the sketch below)
• Hierarchical decomposition of the state space
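One way to avoid an explicit table is a linear function approximator over hand-chosen basis features; a minimal sketch (the feature map, state format, and hyperparameters are all illustrative assumptions):

```python
import numpy as np

N_ACTIONS = 4

def features(s, a):
    """Illustrative basis: a 2-D state (x, y) plus a bias term, one block per action."""
    phi = np.zeros((N_ACTIONS, 3))
    phi[a] = [s[0], s[1], 1.0]
    return phi.ravel()

class LinearQ:
    """Approximate Q(s, a) ~ w . phi(s, a) and learn w by semi-gradient Q-learning,
    instead of storing one table entry per (s, a) pair."""
    def __init__(self, n_features=3 * N_ACTIONS, alpha=0.01, gamma=0.9):
        self.w = np.zeros(n_features)
        self.alpha, self.gamma = alpha, gamma

    def q(self, s, a):
        return float(self.w @ features(s, a))

    def update(self, s, a, r, s_next):
        best_next = max(self.q(s_next, b) for b in range(N_ACTIONS))
        td_error = (r + self.gamma * best_next) - self.q(s, a)
        self.w += self.alpha * td_error * features(s, a)   # semi-gradient step
```

The same idea extends to a neural network: replace `features` and `w` with a network and the semi-gradient step with backpropagation on the TD error.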