
Using MDP Characteristics to Guide Exploration in Reinforcement Learning


Presentation Transcript


  1. Using MDP Characteristics to Guide Exploration in Reinforcement Learning Paper: Bohdana Ratitch & Doina Precup Presenter: Michael Simon Some pictures/formulas gratefully borrowed from slides by Ratitch

  2. MDP Terminology • Transition probabilities - P^a_{s,s'} • Expected reward - R^a_{s,s'} • Return
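
A quick reference for this notation, using the usual definitions; the discounted form of the return is an assumption here, since the slide's own formula is not reproduced in the transcript:

```latex
P^{a}_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\}
\qquad
R^{a}_{ss'} = \mathbb{E}\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}
\qquad
R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
```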

  3. Reinforcement Learning • Learning only from environmental rewards • Goal: achieve the best payoff possible • Must balance exploitation with exploration • Exploration can take large amounts of time • In theory, the structure of the problem/model can assist exploration • But which characteristics help in our MDP case?

  4. Goals/Approach • Find MDP Characteristics... • ... that affect performance... • ... and test on them. • Use MDP Characteristics... • ... to tune parameters. • ... to select algorithms. • ... to create strategy.

  5. Back to RL • Undirected • Sufficient Exploration • Simple, but can be exponential • Directed • Extra Computation/Storage, but possibly polynomial • Often uses aspects of the model to its advantage

  6. RL Methods - Undirected • ε-greedy exploration • Probability 1-ε of exploiting based on your best greedy guess at the moment • Explore with probability ε, selecting an action uniformly at random • Boltzmann distribution
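
A minimal sketch of both undirected rules over a vector of action-value estimates; the function names and signatures are illustrative, not from the paper:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """With probability epsilon take a uniformly random action,
    otherwise take the current greedy action."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature);
    high temperature ~ near-uniform exploration, low temperature ~ near-greedy."""
    if rng is None:
        rng = np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                          # for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))
```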

  7. RL Methods - Directed • Maximize value w/ exploration bonuses • Different options for the bonus: • Counter-based (least frequently taken) • Recency-based (least recently taken) • Error-based (largest change in the value estimate) • Interval Estimation (upper confidence bound based on sample variance)
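
The directed schemes share one template: act greedily with respect to the value estimate plus a weighted exploration bonus. A sketch of that template, with the counter- and recency-based bonuses noted in comments (names are illustrative):

```python
import numpy as np

def directed_action(q_values, bonus, kappa):
    """Pick the action maximizing Q(s, a) + kappa * bonus(s, a)."""
    return int(np.argmax(np.asarray(q_values) + kappa * np.asarray(bonus)))

# Example bonus definitions for the current state s at time t:
#   counter-based:  bonus[a] = -visit_count[s, a]          # prefer least frequently taken
#   recency-based:  bonus[a] = t - last_taken_time[s, a]   # prefer least recently taken
```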

  8. Properties of MDPs • State Transition Entropy • Controllability • Variance of Immediate Rewards • Risk Factor • Transition Distance • Transition Variability

  9. State Transition Entropy • Measures the stochasticity of state transitions • High STE = good for exploration, since one action can reach many different states • But high stochasticity also means higher variance in the samples • High STE = more samples needed
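
State transition entropy for a state-action pair is the entropy of its next-state distribution; a sketch of that computation (whether the paper additionally normalizes it, e.g. by log |S|, is not shown on the slide):

```python
import numpy as np

def state_transition_entropy(p_next):
    """Entropy of the next-state distribution P(. | s, a).
    p_next: 1-D array of transition probabilities summing to 1."""
    p = np.asarray(p_next, dtype=float)
    p = p[p > 0.0]                     # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log(p)))
```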

  10. Controllability - Calculation • How much the environment’s response differs depending on which action is taken • Can also be thought of as normalized information gain
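
One way to read "normalized information gain" is the information gained about the next state from knowing which action was taken (under a uniform choice of action), scaled to [0, 1]. The sketch below follows that reading; it is an assumption, not necessarily the paper's exact formula:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0.0]
    return float(-np.sum(p * np.log(p)))

def controllability(p_next_per_action):
    """Normalized information gain about the next state from knowing the action.
    p_next_per_action: array of shape (|A|, |S|); row a is P(. | s, a)."""
    P = np.asarray(p_next_per_action, dtype=float)
    n_actions = P.shape[0]
    mixture = P.mean(axis=0)           # next-state distribution under uniform actions
    gain = entropy(mixture) - np.mean([entropy(row) for row in P])
    return float(gain / np.log(n_actions))   # divide by the maximum possible gain
```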

  11. Controllability - Usage • High controllability = real control over where your actions take you • Different actions lead to different parts of the state space • But more variance = more sampling needed • Prefer actions leading to controllable states • i.e., actions with high Forward Controllability (FC)
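
"Actions leading to controllable states" suggests that forward controllability is the expected controllability of the successor state; a sketch under that assumption:

```python
import numpy as np

def forward_controllability(p_next, state_controllability):
    """FC(s, a) = sum_{s'} P(s' | s, a) * C(s'):
    expected controllability of the state that action a leads to from s."""
    return float(np.dot(np.asarray(p_next, dtype=float),
                        np.asarray(state_controllability, dtype=float)))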

  12. Proposed Method • Undirected • Explore w/ probability ε • For experiments: • K1, K2 ∈ {0, 1}, ε ∈ {0.1, 0.4, 0.9} (one additional parameter fixed to 1)

  13. Proposed Method • Directed • Pick the action maximizing the bonus-adjusted value • For experiments: • K0 ∈ {1, 10, 50}, K1, K2 ∈ {0, 1}, K3 = 1 • The exploration bonus is recency-based
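
The maximized expression itself is not reproduced on the slide, so the sketch below is only a guess at its general shape: a greedy value boosted by a K0-scaled mix of the recency bonus and the K1/K2-weighted MDP characteristics. Treat every detail of the combination as an assumption.

```python
import numpy as np

def directed_score(q, recency_bonus, ste, fc, K0, K1, K2, K3):
    """Hypothetical bonus-adjusted value for one action:
    Q(s, a) plus a K0-scaled mix of the recency bonus (weight K3)
    and the characteristics STE and FC (weights K1, K2)."""
    return q + K0 * (K3 * recency_bonus + K1 * ste + K2 * fc)

def pick_directed_action(q_values, recency, ste, fc, K0=10.0, K1=1.0, K2=1.0, K3=1.0):
    scores = [directed_score(q_values[a], recency[a], ste[a], fc[a], K0, K1, K2, K3)
              for a in range(len(q_values))]
    return int(np.argmax(scores))
```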

  14. Experiments • Random MDPs • 225 states • 3 actions • Branching factor of 1-20 • Transition probabilities and rewards uniform on [0, 1] • 0.01 chance of termination • MDPs divided into 4 groups: • Low STE vs. high STE • High variation (test) vs. low variation (control)
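
A sketch of how MDPs matching this description could be generated; details such as folding the 0.01 termination chance into each transition row and renormalizing the uniform weights are my assumptions:

```python
import numpy as np

def random_mdp(n_states=225, n_actions=3, max_branching=20, p_terminate=0.01,
               rng=None):
    """Random MDP roughly as described on the slide: each (s, a) pair transitions
    to 1-20 uniformly chosen successors, with transition weights and rewards drawn
    uniformly from [0, 1]; weights are renormalized so that each row leaves a 0.01
    chance of termination."""
    if rng is None:
        rng = np.random.default_rng(0)
    P = np.zeros((n_states, n_actions, n_states))   # transition probabilities
    R = np.zeros((n_states, n_actions, n_states))   # expected rewards
    for s in range(n_states):
        for a in range(n_actions):
            k = int(rng.integers(1, max_branching + 1))        # branching factor
            succ = rng.choice(n_states, size=k, replace=False)  # successor states
            w = rng.uniform(0.0, 1.0, size=k)
            P[s, a, succ] = (1.0 - p_terminate) * w / w.sum()
            R[s, a, succ] = rng.uniform(0.0, 1.0, size=k)
    return P, R
```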

  15. Experiments Continued • Performance Measures • Return estimates • Run the greedy policy from 50 different states, 30 trials per state; average the returns and normalize • Penalty measure • R_max = upper bound on the return of the optimal policy • R_t = normalized greedy return after trial t • T = # of trials
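
The penalty formula itself is not shown on the slide; the natural reading of the listed quantities is the average shortfall of the greedy return from the optimum over the learning run, i.e. (an assumption):

```latex
\text{Penalty} = \frac{1}{T} \sum_{t=1}^{T} \bigl( R_{\max} - R_t \bigr)
```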

  16. Graphs, Glorious Graphs

  17. More Graphs, Glorious Graphs

  18. Discussion • Significant results obtained when using STE and FC • Results correspond with the presence of STE • Characteristic values can be calculated prior to learning • But this requires model knowledge • Rug sweeping and more judgements • SARSA

  19. It’s over!
