Reinforcement Learning

Mitchell, Ch. 13

(see also Barto & Sutton book on-line)

Rationale
  • Learning from experience
  • Adaptive control
  • Examples not explicitly labeled, delayed feedback
  • Problem of credit assignment – which action(s) led to payoff?
  • Trade off short-term thinking (immediate reward) against long-term consequences
Agent Model
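  • Transition function – T: S×A → S, the environment
  • Reward function – R: S×A → ℝ, the payoff
  • Stochastic, but Markov: the next state and reward depend only on the current state and action
  • Policy = decision function, π: S → A
  • “Rationality” – maximize long-term expected reward
    • Discounted long-term reward: Σ_t γ^t r_t (a convergent series for 0 ≤ γ < 1 and bounded rewards)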
    • Alternatives: finite time horizon, uniform weights
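As a minimal sketch of the discounted long-term reward, assuming a discount factor gamma in [0, 1]: the helper name `discounted_return` and its parameters are illustrative and do not come from the slides. Setting gamma to 1 over a finite reward list gives the finite-horizon, uniform-weight alternative.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum over t of gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Work backwards through the sequence: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # 1 + 0.5 * 1 + 0.25 * 1
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

With a constant reward of 1 and gamma = 0.9, the sum approaches 1 / (1 - 0.9) = 10 as the sequence grows, which is the sense in which the series converges.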


Markov Decision Processes (MDPs)
  • If R and T (= P) are known, solve for the value function V^π(s)
  • Policy evaluation
  • Bellman equations
  • Dynamic programming (|S| equations in |S| unknowns)
  • Finding optimal policies
  • Value iteration – update V(s) iteratively until the greedy policy π(s) = argmax_a [R(s,a) + γ Σ_s’ P(s’|s,a) V(s’)] stops changing
  • Policy iteration – alternate between choosing π and updating V over all states
  • Monte Carlo sampling: run random scenarios using π and take the average return observed from s as V(s)
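The value-iteration bullet above can be sketched as follows, under stated assumptions: the MDP is small and tabular, `P[s][a]` is a list of (probability, next_state) pairs, and `R[s][a]` is the immediate reward. These data layouts, the function name, and the stopping tolerance are illustrative choices, not something the slides specify. The sketch stops when V stops changing (a common variant of stopping when the greedy policy does) and then reads off the greedy policy.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Tabular value iteration; returns (V, greedy_policy).

    P[s][a]: list of (probability, next_state) pairs (assumed layout).
    R[s][a]: immediate reward for taking action a in state s (assumed layout).
    """
    n_states = len(P)
    V = [0.0] * n_states

    def q_value(s, a, values):
        # One-step lookahead: reward plus discounted expected next value.
        return R[s][a] + gamma * sum(p * values[s2] for p, s2 in P[s][a])

    for _ in range(max_iters):
        new_V = [
            max(q_value(s, a, V) for a in range(len(P[s])))
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:
            break

    # Greedy policy with respect to the final value estimates.
    policy = [
        max(range(len(P[s])), key=lambda a: q_value(s, a, V))
        for s in range(n_states)
    ]
    return V, policy


if __name__ == "__main__":
    # Hypothetical two-state MDP: action 0 stays put, action 1 tries to switch.
    P = [
        [[(1.0, 0)], [(0.8, 1), (0.2, 0)]],
        [[(1.0, 1)], [(1.0, 0)]],
    ]
    R = [[0.0, 1.0], [2.0, 0.0]]
    print(value_iteration(P, R))
```

Policy evaluation for a fixed π is the same loop with the max over actions replaced by the single action π(s); alternatively the |S| linear Bellman equations can be solved directly.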
Q-learning: model-free
  • Q-function: reformulate the value function over state-action pairs, Q(s,a), so that it can be learned without knowing R and T (= δ)
  • Theorem: Q converges to Q*, provided every state-action pair is visited infinitely often (assuming bounded rewards, |r| < ∞)
  • Proof sketch: over each interval in which all of S×A is visited, the magnitude of the largest error in the Q table shrinks by at least a factor of γ
  • “on-policy”
    • exploitation vs. exploration
    • will relevant parts of the space be explored if stick to current (sub-optimal) policy?
    • ε-greedy policies: choose the action with the max Q value most of the time, and a random action with probability ε
  • “off-policy”
    • learn from simulations or traces
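    • SARSA: training examples are tuples <s,a,r,s’,a’> (SARSA itself is usually classed as on-policy, since a’ is the action the current policy actually takes)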
  • Actor-critic
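A minimal sketch of tabular Q-learning with an ε-greedy behaviour policy, to make the bullets above concrete. Everything named here is an assumption for illustration: the environment callback `step(s, a)` returning (reward, next_state, done), the four-state chain `chain_step`, and the learning rate `alpha`, which the slides do not mention. The learner only sees sampled transitions, never R or T.

```python
import random


def q_learning(step, n_states, n_actions, start_state, episodes=500,
               alpha=0.5, gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning; `step(s, a)` is a hypothetical environment callback."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            if rng.random() < epsilon:
                # Explore: uniformly random action.
                a = rng.randrange(n_actions)
            else:
                # Exploit: greedy action, ties broken at random.
                best = max(Q[s])
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])
            r, s2, done = step(s, a)
            # Off-policy target: best next action, not the one taken next.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if done:
                break
    return Q


def chain_step(s, a):
    """Hypothetical chain 0-1-2-3: action 0 moves left, action 1 moves right.

    Reaching state 3 pays reward 1 and ends the episode.
    """
    s2 = max(0, s - 1) if a == 0 else s + 1
    done = s2 == 3
    return (1.0 if done else 0.0), s2, done


if __name__ == "__main__":
    for row in q_learning(chain_step, n_states=4, n_actions=2, start_state=0):
        print(row)
```

Replacing `max(Q[s2])` in the target with `Q[s2][a2]`, where `a2` is the action the ε-greedy policy actually picks in `s2`, would turn this into a SARSA update over <s,a,r,s’,a’> tuples.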
Convergence is not the problem
  • representation of large Q table is the problem (domains with many states or continuous actions)
  • how to represent large Q tables?
    • neural network
    • function approximation
    • basis functions
    • hierarchical decomposition of state space
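One of the options above, function approximation with basis functions, can be sketched as a linear Q-function: Q(s,a) is a weighted sum of features of s, with one weight vector per action, so storage no longer grows with the number of states. This is only an illustrative sketch. The polynomial basis in `features`, the scalar state, the class name `LinearQ`, and the step size `alpha` are all assumptions, not choices made in the slides.

```python
def features(s):
    """Hypothetical basis functions for a scalar state s in [0, 1]."""
    return [1.0, s, s * s]


class LinearQ:
    """Q(s, a) approximated as the dot product of w[a] and features(s)."""

    def __init__(self, n_actions, n_features=3):
        self.w = [[0.0] * n_features for _ in range(n_actions)]

    def value(self, s, a):
        return sum(wi * fi for wi, fi in zip(self.w[a], features(s)))

    def update(self, s, a, r, s2, done, alpha=0.1, gamma=0.9):
        """One semi-gradient Q-learning step on the weights of action a."""
        n_actions = len(self.w)
        best_next = 0.0 if done else max(self.value(s2, b) for b in range(n_actions))
        td_error = r + gamma * best_next - self.value(s, a)
        # The gradient of Q(s, a) with respect to w[a] is just features(s).
        for i, fi in enumerate(features(s)):
            self.w[a][i] += alpha * td_error * fi
        return td_error


if __name__ == "__main__":
    q = LinearQ(n_actions=2)
    q.update(s=0.5, a=1, r=1.0, s2=0.5, done=True)
    print(q.value(0.5, 1))
```

A neural network would replace the fixed `features` and linear weights with a learned nonlinear mapping, while the update keeps the same temporal-difference error.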