
Using MDP Characteristics to Guide Exploration in Reinforcement Learning


Presentation Transcript


  1. Using MDP Characteristics to Guide Exploration in Reinforcement Learning Paper: Bohdana Ratitch & Doina Precup Presenter: Michael Simon Some pictures/formulas gratefully borrowed from slides by Ratitch

  2. MDP Terminology • Transition probabilities - P^a_{s,s'} • Expected reward - R^a_{s,s'} • Return
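
A quick reference for this notation, using the usual definitions; the discounted form of the return is an assumption here, since the slide's own formula is not reproduced in the transcript:

```latex
P^{a}_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\}
\qquad
R^{a}_{ss'} = \mathbb{E}\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}
\qquad
R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
```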

  3. Reinforcement Learning • Learning only from environmental rewards • Goal: achieve the best payoff possible • Must balance exploitation with exploration • Exploration can take large amounts of time • In theory, the structure of the problem/model can assist exploration • But which characteristics help in our MDP case?

  4. Goals/Approach • Find MDP Characteristics... • ... that affect performance... • ... and test on them. • Use MDP Characteristics... • ... to tune parameters. • ... to select algorithms. • ... to create strategy.

  5. Back to RL • Undirected • Sufficient Exploration • Simple, but can be exponential • Directed • Extra Computation/Storage, but possibly polynomial • Often uses aspects of the model to its advantage

  6. RL Methods - Undirected • ε-greedy exploration • Probability 1-ε of exploiting based on your best greedy guess at the moment • Explore with probability ε, selecting an action uniformly at random • Boltzmann distribution
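
A minimal sketch of both undirected rules over a vector of action-value estimates; the function names and signatures are illustrative, not from the paper:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """With probability epsilon take a uniformly random action,
    otherwise take the current greedy action."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature);
    high temperature ~ near-uniform exploration, low temperature ~ near-greedy."""
    if rng is None:
        rng = np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                          # for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))
```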

  7. RL Methods - Directed • Maximize value w/ exploration bonuses • Different options for the bonus: • Counter-based (least frequently taken) • Recency-based (least recently taken) • Error-based (largest change in the value estimate) • Interval Estimation (upper confidence bound based on sample variance)
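
The directed schemes share one template: act greedily with respect to the value estimate plus a weighted exploration bonus. A sketch of that template, with the counter- and recency-based bonuses noted in comments (names are illustrative):

```python
import numpy as np

def directed_action(q_values, bonus, kappa):
    """Pick the action maximizing Q(s, a) + kappa * bonus(s, a)."""
    return int(np.argmax(np.asarray(q_values) + kappa * np.asarray(bonus)))

# Example bonus definitions for the current state s at time t:
#   counter-based:  bonus[a] = -visit_count[s, a]          # prefer least frequently taken
#   recency-based:  bonus[a] = t - last_taken_time[s, a]   # prefer least recently taken
```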

  8. Properties of MDPs • State Transition Entropy • Controllability • Variance of Immediate Rewards • Risk Factor • Transition Distance • Transition Variability

  9. State Transition Entropy • Measures the stochasticity of state transitions • High STE = good for exploration, since one action can reach many different states • But high stochasticity also means higher variance in the samples • High STE = more samples needed
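
State transition entropy for a state-action pair is the entropy of its next-state distribution; a sketch of that computation (whether the paper additionally normalizes it, e.g. by log |S|, is not shown on the slide):

```python
import numpy as np

def state_transition_entropy(p_next):
    """Entropy of the next-state distribution P(. | s, a).
    p_next: 1-D array of transition probabilities summing to 1."""
    p = np.asarray(p_next, dtype=float)
    p = p[p > 0.0]                     # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log(p)))
```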

  10. Controllability - Calculation • How much the environment’s response differs depending on which action is taken • Can also be thought of as normalized information gain
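
One way to read "normalized information gain" is the information gained about the next state from knowing which action was taken (under a uniform choice of action), scaled to [0, 1]. The sketch below follows that reading; it is an assumption, not necessarily the paper's exact formula:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0.0]
    return float(-np.sum(p * np.log(p)))

def controllability(p_next_per_action):
    """Normalized information gain about the next state from knowing the action.
    p_next_per_action: array of shape (|A|, |S|); row a is P(. | s, a)."""
    P = np.asarray(p_next_per_action, dtype=float)
    n_actions = P.shape[0]
    mixture = P.mean(axis=0)           # next-state distribution under uniform actions
    gain = entropy(mixture) - np.mean([entropy(row) for row in P])
    return float(gain / np.log(n_actions))   # divide by the maximum possible gain
```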

  11. Controllability - Usage • High controllability = real control over where your actions take you • Different actions lead to different parts of the state space • But more variance = more sampling needed • Prefer actions leading to controllable states • i.e., actions with high Forward Controllability (FC)
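
"Actions leading to controllable states" suggests that forward controllability is the expected controllability of the successor state; a sketch under that assumption:

```python
import numpy as np

def forward_controllability(p_next, state_controllability):
    """FC(s, a) = sum_{s'} P(s' | s, a) * C(s'):
    expected controllability of the state that action a leads to from s."""
    return float(np.dot(np.asarray(p_next, dtype=float),
                        np.asarray(state_controllability, dtype=float)))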

  12. Proposed Method • Undirected • Explore w/ probability ε • For experiments: • K1, K2 ∈ {0, 1}, ε ∈ {0.1, 0.4, 0.9} (one additional parameter fixed to 1)

  13. Proposed Method • Directed • Pick the action maximizing the bonus-adjusted value • For experiments: • K0 ∈ {1, 10, 50}, K1, K2 ∈ {0, 1}, K3 = 1 • The exploration bonus is recency-based
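
The maximized expression itself is not reproduced on the slide, so the sketch below is only a guess at its general shape: a greedy value boosted by a K0-scaled mix of the recency bonus and the K1/K2-weighted MDP characteristics. Treat every detail of the combination as an assumption.

```python
import numpy as np

def directed_score(q, recency_bonus, ste, fc, K0, K1, K2, K3):
    """Hypothetical bonus-adjusted value for one action:
    Q(s, a) plus a K0-scaled mix of the recency bonus (weight K3)
    and the characteristics STE and FC (weights K1, K2)."""
    return q + K0 * (K3 * recency_bonus + K1 * ste + K2 * fc)

def pick_directed_action(q_values, recency, ste, fc, K0=10.0, K1=1.0, K2=1.0, K3=1.0):
    scores = [directed_score(q_values[a], recency[a], ste[a], fc[a], K0, K1, K2, K3)
              for a in range(len(q_values))]
    return int(np.argmax(scores))
```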

  14. Experiments • Random MDPs • 225 states • 3 actions • Branching factor of 1-20 • Transition probabilities and rewards uniform on [0, 1] • 0.01 chance of termination • MDPs divided into 4 groups: • Low STE vs. high STE • High variation (test) vs. low variation (control)
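
A sketch of how MDPs matching this description could be generated; details such as folding the 0.01 termination chance into each transition row and renormalizing the uniform weights are my assumptions:

```python
import numpy as np

def random_mdp(n_states=225, n_actions=3, max_branching=20, p_terminate=0.01,
               rng=None):
    """Random MDP roughly as described on the slide: each (s, a) pair transitions
    to 1-20 uniformly chosen successors, with transition weights and rewards drawn
    uniformly from [0, 1]; weights are renormalized so that each row leaves a 0.01
    chance of termination."""
    if rng is None:
        rng = np.random.default_rng(0)
    P = np.zeros((n_states, n_actions, n_states))   # transition probabilities
    R = np.zeros((n_states, n_actions, n_states))   # expected rewards
    for s in range(n_states):
        for a in range(n_actions):
            k = int(rng.integers(1, max_branching + 1))        # branching factor
            succ = rng.choice(n_states, size=k, replace=False)  # successor states
            w = rng.uniform(0.0, 1.0, size=k)
            P[s, a, succ] = (1.0 - p_terminate) * w / w.sum()
            R[s, a, succ] = rng.uniform(0.0, 1.0, size=k)
    return P, R
```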

  15. Experiments Continued • Performance Measures • Return estimates • Run the greedy policy from 50 different states, 30 trials per state; average the returns and normalize • Penalty measure • R_max = upper bound on the return of the optimal policy • R_t = normalized greedy return after trial t • T = # of trials
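
The penalty formula itself is not shown on the slide; the natural reading of the listed quantities is the average shortfall of the greedy return from the optimum over the learning run, i.e. (an assumption):

```latex
\text{Penalty} = \frac{1}{T} \sum_{t=1}^{T} \bigl( R_{\max} - R_t \bigr)
```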

  16. Graphs, Glorious Graphs

  17. More Graphs, Glorious Graphs

  18. Discussion • Significant results obtained when using STE and FC • Results correspond with the presence of STE • Characteristic values can be calculated prior to learning • But this requires model knowledge • Rug sweeping and more judgements • SARSA

  19. It’s over!
