Decision Making Under Uncertainty, Lec #9: Approximate Value Function • UIUC CS 598: Section EA • Professor: Eyal Amir • Spring Semester 2006 • Some slides by Jeremy Wyatt (U Birmingham), Ron Parr (Duke), Eduardo Alonso, and Craig Boutilier (Toronto)
So Far • MDPs, RL • Transition matrix (n×n) • Reward vector (n entries) • Value function – table (n entries) • Policy – table (n entries) • Time to solve is linear programming with O(n) variables • Problem: the number of states (n) is too large in many systems
Today: Approximate Values • Motivation: • Representation of value function is smaller • Generalization: fewer state-action pairs must be observed • Some spaces (e.g., continuous) must be approximated • Function approximation • Have a parametrized representation for value function
Overview: Approximate Reinforcement Learning • Consider V̂(·; θ): X → ℝ, where X is the state space and θ denotes the vector of parameters of the approximator • Want to learn θ so that V̂ approximates V* well • Gradient descent: find the direction Δθ in which V̂(·; θ) can change to improve performance the most • Use a step size α: θnew = θ + α Δθ, giving the new estimate V̂(s; θnew)
Overview: Approximate Reinforcement Learning • In RL, at each time step the learning system uses a function approximator to select an action a = π̂(s; θ) • Should adjust θ so as to produce the maximum possible value for each s • Would like π̂ to solve the parametric global optimization problem π̂(s; θ) ∈ argmax_{a∈A} Q(s,a), ∀s∈S
Supervised Learning and Reinforcement Learning • SL: s and a = π(s) are given • Regression uses the direction Δ = a − π̂(s; θ) • SL: we can check whether Δ = 0 and decide if the correct map is given by π̂ at s • RL: the direction Δ is not available • RL: cannot draw conclusions about correctness without exploring the values of Q(s,a) for all a
How to measure the performance of FA methods? • Regression SL minimizes the mean-squared error (MSE) over some distribution P of the inputs • In our value prediction problem the inputs are states and the target function is V, so MSE(θt) = Σ_{s∈S} P(s) [V(s) − Vt(s)]² • P is important because it is usually not possible to reduce the error to zero at all states
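A minimal numeric sketch of this error measure for a linear approximator (the state distribution P, the "true" values, the features, and the weights below are made-up illustrations, not from the slides):

```python
import numpy as np

# Weighted MSE of an approximate value function, as on the slide:
#   MSE(theta_t) = sum_s P(s) * (V(s) - V_t(s))**2

def weighted_mse(theta, features, V_true, P):
    """features: |S| x k matrix of basis values b_i(s); theta: k weights."""
    V_hat = features @ theta            # V_t(s) = sum_i theta_i * b_i(s)
    return np.sum(P * (V_true - V_hat) ** 2)

# Tiny example: 3 states, 2 features (all values are illustrative)
features = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
V_true   = np.array([1.0, 2.0, 1.5])
P        = np.array([0.5, 0.3, 0.2])    # state-visitation distribution
print(weighted_mse(np.array([1.0, 1.0]), features, V_true, P))
```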
More on MSE • Convergence results are available for the on-policy distribution, i.e., the frequency with which states are encountered while interacting with the environment • Question: why minimize MSE? • Our goal is to make predictions that aid in finding a better policy; the best predictions for that purpose are not necessarily the best for minimizing MSE • There is no better alternative at the moment
Gradient-Descent for Reinforcement Learning • θt is a parameter vector of real-valued components • Vt(s) is a smooth, differentiable function of θt for all s∈S • Minimize the error on the observed example ⟨st, V(st)⟩: adjust θt by a small amount in the direction that reduces the error the most on that example: θt+1 = θt + α [V(st) − Vt(st)] ∇θt Vt(st) • where ∇θt Vt(st) denotes the gradient of Vt(st) with respect to θt (the vector of partial derivatives)
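A sketch of this single update for a linear approximator, where ∇θ V(s) is just the feature vector b(s) (the step size, feature values, and target below are illustrative choices):

```python
import numpy as np

# One gradient-descent update: theta <- theta + alpha * (target - V(s)) * grad V(s),
# for a linear approximator V(s) = theta . b(s), whose gradient w.r.t. theta is b(s).

def gd_value_update(theta, b_s, target, alpha=0.1):
    v_s = theta @ b_s                               # current estimate V_t(s)
    return theta + alpha * (target - v_s) * b_s     # gradient of linear V is b(s)

theta = np.zeros(3)
b_s = np.array([1.0, 0.0, 0.5])                     # features b_i(s) of the observed state
theta = gd_value_update(theta, b_s, target=2.0)
print(theta)
```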
Value Prediction • If vt, the t-th training example, is not the true value V(st) but some approximation of it: θt+1 = θt + α [vt − Vt(st)] ∇θt Vt(st) • Similarly, for n-step or λ-returns: θt+1 = θt + α [Rλt − Vt(st)] ∇θt Vt(st), or θt+1 = θt + α δt et, where δt is the TD (Bellman) error and et = γλ et−1 + ∇θt Vt(st) is the eligibility trace
Gradient-Descent TD(λ) for V
Initialize θ arbitrarily and e = 0
Repeat (for each episode):
  s ← initial state of episode
  Repeat (for each step of the episode):
    a ← action given by π for s
    Take action a, observe reward r and next state s′
    δ ← r + γ V(s′) − V(s)
    e ← γλ e + ∇θ V(s)
    θ ← θ + α δ e
    s ← s′
  Until s is terminal
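A Python sketch of this loop for a linear V(s) = θ·φ(s). The environment interface (reset/step returning next state, reward, done), the fixed policy, and the feature map φ are assumptions made for illustration; they are not defined in the slides:

```python
import numpy as np

# Gradient-descent TD(lambda) for policy evaluation with a linear approximator.

def td_lambda(env, policy, phi, n_features, episodes=100,
              alpha=0.05, gamma=0.95, lam=0.8):
    theta = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        e = np.zeros(n_features)                 # eligibility trace
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v_s = theta @ phi(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - v_s     # TD error
            e = gamma * lam * e + phi(s)         # gradient of linear V(s) is phi(s)
            theta += alpha * delta * e
            s = s_next
    return theta
```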
Approximation Models • Most widely used (and simple) gradient-descent approximators: • NN using the error backpropagation algorithm • Linear form • Factored value functions (your presentation) [GPK] • First-order value functions (your presentation) [BRP] • Restricted policies (your presentation) [KMN]
Function Approximation • Common approach to solving MDPs: • find a functional form f(θ) for the VF that is tractable • e.g., not exponential in the number of variables • attempt to find parameters θ s.t. f(θ) offers the "best fit" to the "true" VF • Example: • use a neural net to approximate the VF • inputs: state features; output: value or Q-value • generate samples of the "true" VF to train the NN • e.g., use dynamics to sample transitions and train on Bellman backups (bootstrap on the current approximation given by the NN)
Linear Function Approximation • Assume a set of basis functions B = { b1 ... bk } • each bi : S → ℝ, generally compactly representable • A linear approximator is a linear combination of these basis functions; for some weight vector w: V(s) = Σi wi bi(s) • Several questions: • what is the best weight vector w? • what is a "good" basis set B? • what does this buy us computationally?
Flexibility of Linear Decomposition • Assume each basis function is compact • e.g., refers to only a few variables: b1(X,Y), b2(W,Z), b3(A) • Then the VF is compact: • V(X,Y,W,Z,A) = w1 b1(X,Y) + w2 b2(W,Z) + w3 b3(A) • For a given representation size (10 parameters, assuming binary variables), we get more value flexibility (32 distinct values) than a piecewise-constant representation
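A small sketch of this counting argument, assuming binary variables (the tables, weights, and random values are illustrative; three compact tables hold 4 + 4 + 2 = 10 parameters yet their weighted sum can take up to 4 × 4 × 2 = 32 distinct values over the 2^5 joint states):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
b1 = rng.normal(size=(2, 2))   # b1(X, Y): 4 parameters
b2 = rng.normal(size=(2, 2))   # b2(W, Z): 4 parameters
b3 = rng.normal(size=2)        # b3(A):    2 parameters
w = np.array([1.0, 1.0, 1.0])

values = {round(w[0] * b1[x, y] + w[1] * b2[wv, z] + w[2] * b3[a], 10)
          for x, y, wv, z, a in itertools.product(range(2), repeat=5)}
print(len(values))             # up to 32 distinct values from only 10 parameters
```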
Interpretation of Averagers • Averagers interpolate: [figure: a query point x inside a grid cell is assigned a value interpolated from the values y1, y2, y3, y4 at the surrounding grid vertices]
Linear methods? • So, if we can find good basis sets (that allow a good fit), VF can be more compact • Several have been proposed: coarse coding, tile coding, radial basis functions, factored (project on a few variables), feature selection, first-order
Linear Approx: Components • Assume basis set B = { b1 ... bk } • each bi : S → ℝ • we view each bi as an n-vector • let A be the n × k matrix [ b1 ... bk ] • Linear VF: V(s) = Σi wi bi(s) • Equivalently: V = Aw • so our approximation of V must lie in the subspace spanned by B • let B be that subspace
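A short numeric sketch of this matrix view (the basis values and weights below are arbitrary illustrations):

```python
import numpy as np

# With n states and k basis functions, A is the n x k matrix whose columns are the
# basis vectors b_i, and any linear VF is V = A w.

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0],
              [0.5, 0.5],
              [0.0, 0.0]])   # n = 5 states, k = 2 basis functions (columns b_1, b_2)
w = np.array([2.0, -1.0])
V = A @ w                    # V(s) = sum_i w_i * b_i(s), for every state at once
print(V)                     # lies in the 2-dimensional subspace spanned by B
```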
Policy Evaluation • The approximate state-value function is given by Vt(s) = Σi wi bi(s) • The gradient of this approximate value function with respect to the weights is simply ∂Vt(s)/∂wi = bi(s), i.e., ∇w Vt(s) = b(s)
Approximate Value Iteration • We might compute an approximate V by: • Let V0 = Aw0 for some weight vector w0 • Perform Bellman backups to produce V1 = Aw1; V2 = Aw2; V3 = Aw3; etc. • Unfortunately, even if V0 is in the subspace spanned by B, L*(V0) = L*(Aw0) will generally not be • So we need to find the best approximation to L*(Aw0) in B before we can proceed
Projection • We wish to find a projection of our VF estimates into B minimizing some error criterion • We'll use the max norm • Given V lying outside B, we want a w s.t. ‖Aw − V‖∞ is minimal
Projection as Linear Program • Finding a w that minimizes ‖Aw − V‖∞ can be accomplished with a simple LP • The number of variables is small (k+1), but the number of constraints is large (2 per state) • this defeats the purpose of function approximation • but let's ignore that for the moment • Vars: w1, ..., wk, φ • Minimize: φ • S.T. φ ≥ V(s) − Aw(s), ∀s and φ ≥ Aw(s) − V(s), ∀s • φ measures the max-norm difference between V and the "best fit"
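A sketch of this LP using scipy's linprog as one convenient solver (the matrix A and target V below are small made-up examples):

```python
import numpy as np
from scipy.optimize import linprog

# Variables (w_1..w_k, phi); minimize phi subject to
#   phi >= V(s) - (Aw)(s)  and  phi >= (Aw)(s) - V(s)  for every state s.

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0],
              [0.5, 0.5]])           # n = 4 states, k = 2 basis functions
V = np.array([1.0, 3.0, 2.0, 1.8])   # target VF lying outside span(A)

n, k = A.shape
c = np.zeros(k + 1)
c[-1] = 1.0                                            # minimize phi
# Rewrite both constraint families in "<=" form for linprog:
A_ub = np.vstack([np.hstack([-A, -np.ones((n, 1))]),   # -(Aw)(s) - phi <= -V(s)
                  np.hstack([ A, -np.ones((n, 1))])])  #  (Aw)(s) - phi <=  V(s)
b_ub = np.concatenate([-V, V])
bounds = [(None, None)] * k + [(0, None)]              # weights free, phi >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w, phi = res.x[:k], res.x[-1]
print(w, phi)                                          # phi = max-norm error of best fit
```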
Approximate Value Iteration • Run value iteration, but after each Bellman backup, project the result back into the subspace B • Choose an arbitrary w0 and let V0 = Aw0 • Then iterate: • Compute Ṽt = L*(Awt−1) • Let Vt = Awt be the projection of Ṽt into B • The error at each step is given by φ • final error and convergence are not assured • An analog exists for policy iteration as well
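A compact sketch of this loop on a tiny made-up MDP. The transition model, rewards, discount, and basis are illustrative, and the projection step here is least-squares for brevity; the max-norm LP projection from the previous sketch could be substituted to match the slides exactly:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1, 0.0],   # P[a, s, s']: transition probabilities for action 0
               [0.1, 0.8, 0.1],
               [0.0, 0.1, 0.9]],
              [[0.5, 0.5, 0.0],   # ... and for action 1
               [0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5]]])
R = np.array([[0.0, 1.0],         # R[s, a]
              [0.5, 0.0],
              [1.0, 0.2]])
A = np.array([[1.0, 0.0],         # basis matrix: 3 states, 2 basis functions
              [1.0, 1.0],
              [0.0, 1.0]])

w = np.zeros(A.shape[1])
for t in range(100):
    V = A @ w                                        # V_t = A w_t
    Q = R + gamma * np.einsum('ast,t->sa', P, V)     # one-step lookahead values
    backup = Q.max(axis=1)                           # Bellman backup L*(A w_t)
    w = np.linalg.lstsq(A, backup, rcond=None)[0]    # project back into span(A)

print(A @ w)                                         # approximate value function
```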
Convergence Guarantees • Monte Carlo: converges to the minimal MSE: ‖Q̂MC − Qπ‖π = minw ‖Q̂w − Qπ‖π ≡ ε • TD(0): converges close to the minimal MSE: ‖Q̂TD − Qπ‖π ≤ ε/(1−γ) [TV] • DP may diverge: there exist counterexamples [B, TV]
VFA Convergence Results • Linear TD(λ) converges if we visit states using the on-policy distribution • Off-policy linear TD(λ) and linear Q-learning are known to diverge in some cases • Q-learning and value iteration used with some averagers (including k-nearest-neighbour and decision trees) have almost-sure convergence if particular exploration policies are used • A special case of policy iteration with Sarsa-style updates and linear function approximation converges • Residual algorithms are guaranteed to converge, but only very slowly
VFA: TD-Gammon • TD(λ) learning and a backprop net with one hidden layer • 1,500,000 training games (self-play) • Equivalent in skill to the top dozen human players • Backgammon has ~10^20 states, so it can't be solved using DP
Homework • Read about POMDPs: [Littman ’96] Ch.6-7
FA, is it the answer? • Not all FA methods are suited for use in RL. For example, NN and statistical methods assume a static training set over which multiple passes are made. In RL, however, it is important that learning be able to occur on-line, while interacting with the environment (or a model of the environment) • In particular, Q-learning seems to lose its convergence properties when integrated with FA
A linear, gradient-descent version of Watkins's Q(λ)
Initialize θ arbitrarily and e = 0
Repeat (for each episode):
  s ← initial state of episode
  For all a ∈ A(s):
    Fa ← set of features present in s, a
    Qa ← Σ_{i∈Fa} θ(i)
  Repeat (for each step of episode):
    With probability 1−ε:
      a ← argmaxa Qa
      e ← γλ e
    else:
      a ← a random action ∈ A(s)
      e ← 0
    For all i ∈ Fa: e(i) ← e(i) + 1
    Take action a, observe reward r and next state s′
    δ ← r − Qa
    For all a ∈ A(s′):
      Fa ← set of features present in s′, a
      Qa ← Σ_{i∈Fa} θ(i)
    a′ ← argmaxa Qa
    δ ← δ + γ Qa′
    θ ← θ + α δ e
    s ← s′
  until s′ is terminal
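A Python sketch of this algorithm for binary features, where Q(s,a) is the sum of θ(i) over the active feature set F(s,a). The environment interface (reset/step), the feature function returning active indices, and the action set are assumptions made for illustration:

```python
import numpy as np

# Linear, gradient-descent Watkins's Q(lambda) with binary features and
# accumulating eligibility traces, following the pseudocode above.

def watkins_q_lambda(env, features, n_features, n_actions, episodes=200,
                     alpha=0.05, gamma=0.95, lam=0.8, eps=0.1):
    theta = np.zeros(n_features)
    q = lambda s, a: theta[features(s, a)].sum()   # features(s, a) -> indices in F(s,a)
    for _ in range(episodes):
        s = env.reset()
        e = np.zeros(n_features)
        done = False
        while not done:
            if np.random.rand() > eps:             # greedy action: keep traces
                a = max(range(n_actions), key=lambda b: q(s, b))
                e *= gamma * lam
            else:                                  # exploratory action: cut traces
                a = np.random.randint(n_actions)
                e[:] = 0.0
            e[features(s, a)] += 1.0               # accumulate traces on active features
            s_next, r, done = env.step(a)
            delta = r - q(s, a)
            if not done:
                delta += gamma * max(q(s_next, b) for b in range(n_actions))
            theta += alpha * delta * e
            s = s_next
    return theta
```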