
A Finite Sample Upper Bound on the Generalization Error for Q-Learning





  1. A Finite Sample Upper Bound on the Generalization Error for Q-Learning S.A. Murphy Univ. of Michigan CALD: February, 2005

  2. Outline • Two Examples • Q-functions & Q-Learning • The Goal • Finite Sample Bounds • Discussion

  3. Two Examples • Treatment: Managing Drug Dependence, Mental Illness, HIV infection • Preventive Intervention: Increasing and Maintaining Activity • Both are multi-stage decision problems: repeated decisions are made over time for each subject.

  4. Managing Drug Dependence • Goal is to reduce long term abuse. • Individuals present in an acute stage. • What should the first treatment be? (medication?, psychosocial therapy?) • At what point do we say the individual is not responding? • What should the second treatment be for nonresponders? (medication?, psychosocial therapy?) • What should the second treatment be for responders? • What information should be used in making these decisions?

  5. Improving Activity • Goal is to maximize long term weekly step counts. • Physically inactive individuals • How should the web coach set the weekly goal? • What framing should the web coach use in providing feedback on past week’s performance? • What information should the web coach (policy) use to make these decisions?

  6. Commonalities • One training set of finite-horizon trajectories. • Actions are made according to a known stochastic (exploration) policy. • System dynamics are poorly understood. • The policy class is restricted: explicitly constrained to be interpretable and/or implicitly constrained because of function approximation.

  7. T+1 Decisions • Observations made prior to the t-th action (a vector of continuous and discrete variables) • Action at the t-th time • Reward at the t-th time
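The slide's symbols were lost in extraction; the notation below is an assumption consistent with the slide's descriptions and is used in the sketches that follow. For t = 0, \ldots, T: O_t denotes the observations made prior to the t-th action, A_t the action at the t-th time, and R_t the reward at the t-th time, so that each trajectory has the form (O_0, A_0, R_0, O_1, A_1, R_1, \ldots, O_T, A_T, R_T). The known stochastic exploration policy that generated the training data is written p.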

  8. The Goal Given a training set of n finite-horizon trajectories, estimate the policy that maximizes the mean total reward over the restricted policy class.
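A plausible rendering of this objective in the assumed notation (the slide's own formulas were lost):

V(\pi) = E_{\pi}\Big[\sum_{t=0}^{T} R_t\Big], \qquad \hat{\pi} \approx \arg\max_{\pi \in \Pi} V(\pi),

where \Pi is the restricted policy class described on slide 16 and E_{\pi} is defined on slide 9.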

  9. Q-functions & Q-Learning Definition: denotes expectation when the actions are chosen according to the policy, denotes expectation when the actions are chosen according to the stochastic exploration policy,

  10. Q-functions The Q-functions for a policy π are defined recursively for t = T, T-1, …, 0 (the recursion is sketched after slide 12).

  11. Q-functions The Q-functions for an optimal policy are defined by the analogous recursion for t = T, T-1, …, 0, with a maximum over actions in place of the policy's action (also sketched after slide 12).

  12. Q-functions An optimal policy (optimal overall, not only over the restricted class) is obtained by choosing, at each time t, an action that maximizes the optimal Q-function.
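A sketch of the recursions referred to on slides 10-12, in the assumed notation; conditioning is written on the current observation for brevity (in the paper the conditioning may involve the full history of observations and actions):

Q^{\pi}_{T+1} \equiv 0, \qquad Q^{\pi}_{t}(o, a) = E\big[\, R_t + Q^{\pi}_{t+1}\big(O_{t+1}, \pi_{t+1}(O_{t+1})\big) \,\big|\, O_t = o,\ A_t = a \big], \quad t = T, \ldots, 0;

Q^{*}_{T+1} \equiv 0, \qquad Q^{*}_{t}(o, a) = E\big[\, R_t + \max_{a'} Q^{*}_{t+1}(O_{t+1}, a') \,\big|\, O_t = o,\ A_t = a \big], \quad t = T, \ldots, 0;

\pi^{*}_{t}(o) \in \arg\max_{a} Q^{*}_{t}(o, a).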

  13. Q-learning with finite-horizon trajectories Given an approximation space for each Q-function, first minimize, over the approximation space for time T, the average squared error between the time-T reward and the candidate Q-function; set the time-T estimate to the minimizer (a code sketch follows slide 15).

  14. Q-Learning For each t = T-1, …, 0, minimize, over the approximation space for time t, the average squared error between the one-step target (the time-t reward plus the maximum over actions of the estimated time-(t+1) Q-function) and the candidate Q-function; set the time-t estimate to the minimizer, and so on.

  15. Q-Learning The estimated policy acts greedily with respect to the estimated Q-functions: at each time t it selects an action that maximizes the estimated time-t Q-function.
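A minimal code sketch of the batch Q-learning procedure on slides 13-16, assuming linear function approximation with a user-supplied feature map; the data layout, function names, and feature map are illustrative assumptions, not the paper's implementation.

import numpy as np

def fit_q(trajectories, phi, actions, T):
    # Backward-recursive least-squares fit of linear Q-functions (slides 13-14).
    # trajectories: list of per-subject lists [(o_0, a_0, r_0), ..., (o_T, a_T, r_T)]
    # phi: assumed feature map phi(t, o, a) -> 1-D numpy array of length k
    # actions: the finite set of possible actions (cf. slide 19)
    k = phi(0, trajectories[0][0][0], actions[0]).shape[0]
    theta = [np.zeros(k) for _ in range(T + 1)]
    for t in range(T, -1, -1):
        X, y = [], []
        for traj in trajectories:
            o_t, a_t, r_t = traj[t]
            if t == T:
                target = r_t                      # terminal stage: regress the final reward
            else:
                o_next = traj[t + 1][0]           # observation preceding the (t+1)-th action
                target = r_t + max(theta[t + 1] @ phi(t + 1, o_next, a) for a in actions)
            X.append(phi(t, o_t, a_t))
            y.append(target)
        # Least-squares minimization over the linear approximation space for time t
        theta[t], *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
    return theta

def greedy_policy(theta, phi, actions):
    # Estimated policy (slide 15): act greedily with respect to the fitted Q-functions.
    def pi_hat(t, o):
        return max(actions, key=lambda a: float(theta[t] @ phi(t, o, a)))
    return pi_hat

The backward loop mirrors slides 13-14: the time-T fit uses only the final reward, and each earlier fit plugs the already-fitted time-(t+1) estimate into its one-step target.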

  16. The Goal Approximate each Q-function by a linear combination of k features. This implicitly constrains the class of policies; call this constrained class the restricted policy class (a sketch follows).
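A hedged rendering of the linear approximation and the induced restricted policy class, using assumed symbols \theta_t for the coefficients and \phi_t for the k features:

Q_t(o, a; \theta_t) = \sum_{j=1}^{k} \theta_{tj}\, \phi_{tj}(o, a) = \theta_t^{\top} \phi_t(o, a), \qquad \Pi = \big\{ \pi_{\theta} : \pi_{\theta, t}(o) \in \arg\max_{a} \theta_t^{\top} \phi_t(o, a) \big\}.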

  17. Goal: Given a learning algorithm and approximation classes, assess the ability of the learning algorithm to produce the best policy in the class. Construct an upper bound on the difference between the mean total reward of the best policy in the restricted class and the mean total reward of the estimated policy, where each mean is computed with the actions chosen according to the corresponding policy.
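In the assumed notation, the quantity to be bounded (the generalization error) can be written as

\max_{\pi \in \Pi} V(\pi) - V(\hat{\pi}), \qquad V(\pi) = E_{\pi}\Big[\sum_{t=0}^{T} R_t\Big],

where \hat{\pi} is the estimated policy produced by Q-learning.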

  18. We can expect that our estimator of the Q-functions will be close to a projection of the Q-functions onto the approximation spaces if the training set is large. The projection is defined by the same backward recursion as Q-learning, with expectations under the exploration policy replacing sample averages, for t = T, T-1, …, 0 (a sketch follows).
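One natural way to write this projection, stated here as an assumption about the slide's formulas rather than a quotation of them, with \mathcal{Q}_t the approximation space at time t and E_p as on slide 9:

\tilde{Q}_T = \arg\min_{Q \in \mathcal{Q}_T} E_{p}\big[\big(R_T - Q(O_T, A_T)\big)^2\big],

\tilde{Q}_t = \arg\min_{Q \in \mathcal{Q}_t} E_{p}\Big[\big(R_t + \max_{a} \tilde{Q}_{t+1}(O_{t+1}, a) - Q(O_t, A_t)\big)^2\Big], \quad t = T-1, \ldots, 0.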

  19. Finite Sample Bounds Primary Assumptions: • (1) is invertible for each t • (2) The number of possible actions is finite • (3) for L > 1

  20. Definition:

  21. For with probability at least 1- δ for n satisfying is the size of the action space; k is the number of features.

  22. The message The goal of the Q-learning algorithm is to produce estimated Q-functions that are close to the projections defined on slide 18. This is different from the goal of producing a policy that will maximize the mean total reward.

  23. For policies and with probability at least 1- δ for all satisfying

  24. for n satisfying the complexity constraint,

  25. Suppose there is a policy for which is maximal. Then for and with probability at least 1- δ for n satisfying the complexity constraint from before.

  26. One term in the bound is an approximation error: if it is zero, then members of the approximation space are arbitrarily close to the optimal Q-function (optimal overall, not just over the restricted class).

  27. For policies and with probability at least 1- δ for n satisfying the complexity constraint from before.

  28. The difference in values of two policies can be related to their Q-functions: Kakade (2003).
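The identity alluded to here is presumably a finite-horizon version of the performance difference lemma (Kakade, 2003); the exact statement on the slide was not preserved, so the standard form is sketched below as an assumption, for deterministic policies \pi and \pi', with the actions inside the expectation chosen by \pi:

V(\pi) - V(\pi') = E_{\pi}\Big[\sum_{t=0}^{T} \Big( Q^{\pi'}_{t}(O_t, A_t) - Q^{\pi'}_{t}\big(O_t, \pi'_t(O_t)\big) \Big)\Big].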

  29. The message The goal of the Q-learning algorithm is to produce estimated Q-functions that are close to the projections. This is different from the goal of producing a policy whose value is close to the best value attainable in the class.

  30. The message Using function approximation in Q-learning provides a way to add information to the data but the price is bias. Other methods that add information (e.g. modeling the dynamics) can be expected to incur a bias as well.

  31. Discussion • When the information in the training set is small relative to the observation space, parameterizing the Q-functions is one way to add information. But how can one reduce the bias of Q-learning? • Policy search with importance weights? This has low bias but high variance across training sets; Q-learning has lower variance but higher bias. • Can we construct algorithms that tell us when we must add information to the training set so as to reduce variability? • What kinds of statistical tests would be most useful in assessing whether the estimated policy is better than an ad hoc policy?

  32. This seminar can be found at: http://www.stat.lsa.umich.edu/~samurphy/seminars/cald0205.ppt The paper can be found at: http://www.stat.lsa.umich.edu/~samurphy/papers/Qlearning.pdf samurphy@umich.edu

  33. A policy search method with importance sampling weights would employ a variant of the importance-weighted estimator of a policy's value (sketched below).
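The estimator on the slide was not preserved; the standard importance-weighted (inverse-probability-weighted) value estimator, given here as an assumption about the intended variant, is, for a deterministic candidate policy \pi and known exploration policy p_t,

\hat{V}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \Bigg( \prod_{t=0}^{T} \frac{\mathbf{1}\{A_{t,i} = \pi_t(O_{t,i})\}}{p_t(A_{t,i} \mid O_{t,i})} \Bigg) \sum_{t=0}^{T} R_{t,i},

and policy search maximizes \hat{V}(\pi) over \pi \in \Pi.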
