
Developing Dynamic Treatment Regimes for Chronic Disorders


Presentation Transcript


  1. Developing Dynamic Treatment Regimes for Chronic Disorders S.A. Murphy Univ. of Michigan RAND: August, 2005

  2. Goals • Today we review: • Four categories of methods for constructing dynamic treatment regimes using data • Generalization error

  3. Review • Definition of a dynamic treatment regime and the role of tailoring variables, decision options and decision rules • Using scientific theory, clinical experience and expert opinion to construct dynamic treatment regimes • Designing randomized experiments to inform the construction of dynamic treatment regimes

  4. Conceptual Formulation

  5. Four Categories of Methods for Constructing Dynamic Treatment Regimes Using Data (Secondary Analyses)

  6. k Decisions on one individual: $O_j$ = observation made prior to the jth decision point; $A_j$ = decision or action at the jth decision point; $\bar{O}_j = (O_1, \ldots, O_j)$ = present and past observations; $\bar{A}_j = (A_1, \ldots, A_j)$ = present and past decisions.

  7. k Decisions: $O_j$ = observation made prior to the jth decision point; $A_j$ = decision or "action" at the jth decision point; $R_j$ = "reward" following the jth decision point. Primary outcome: $Y = \sum_{j=1}^{k} R_j$.

  8. Goal: Construct decision rules that input data at each decision point and output a recommended decision; these decision rules should lead to a maximal mean Y. The dynamic treatment regime is the sequence of decision rules $(d_1, \ldots, d_k)$, where the jth rule $d_j$ maps the history $(\bar{O}_j, \bar{A}_{j-1})$ to a decision $A_j$.

  9. An example of a simple decision rule: alter treatment at time j if $O_j < c$ (in which case we offer a new treatment in the future); otherwise maintain on the current treatment.

  10. Nature is your best friend and tells you all you need to know! You know the conditional densities $p_j(o_j \mid \bar{o}_{j-1}, \bar{a}_{j-1})$, $j = 1, \ldots, k$, for all values of $(\bar{o}_{j-1}, \bar{a}_{j-1})$.

  11. Use Dynamic Programming (k = 2). Work backward: $Q_2(\bar{o}_2, \bar{a}_2) = E[Y \mid \bar{O}_2 = \bar{o}_2, \bar{A}_2 = \bar{a}_2]$, with $d_2^*(\bar{o}_2, a_1) = \arg\max_{a_2} Q_2(\bar{o}_2, a_1, a_2)$; then $Q_1(o_1, a_1) = E[\max_{a_2} Q_2(\bar{O}_2, a_1, a_2) \mid O_1 = o_1, A_1 = a_1]$, with $d_1^*(o_1) = \arg\max_{a_1} Q_1(o_1, a_1)$.

  12. Data (k = 2) on n subjects: $(O_1, A_1, O_2, A_2, Y)$, where $A_j$ is the decision/action at the jth decision point, assigned via randomization (a SMART design). Then the randomization probabilities $p_j(a_j \mid \bar{o}_j, \bar{a}_{j-1})$ are known by design.

  13. Data (k = 2) on n subjects. Suppose n is large, the observation space is small (e.g., a binary scalar), the decision space is small, and k is small. Use a nonparametric model for the conditional distributions of each $O_j$ given the past (e.g., empirical cell frequencies) and construct decision rules via dynamic programming.
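
The tabular case on slide 13 is easy to make concrete. Below is a minimal sketch, not from the talk, of dynamic programming over a simulated SMART dataset with binary observations and actions (k = 2); the data-generating model and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated SMART data: binary O1, O2 and randomized binary A1, A2 (illustrative).
O1 = rng.integers(0, 2, n)
A1 = rng.integers(0, 2, n)                      # randomized with probability 1/2
O2 = rng.binomial(1, 0.3 + 0.4 * (A1 == O1))    # O2 responds to matching A1 to O1
A2 = rng.integers(0, 2, n)                      # randomized with probability 1/2
Y = O2 + (A2 == O2) + rng.normal(0, 1, n)       # outcome favors matching A2 to O2

# Stage 2: nonparametric Q2 = E[Y | O1, A1, O2, A2] via cell means.
Q2 = np.zeros((2, 2, 2, 2))
for o1 in range(2):
    for a1 in range(2):
        for o2 in range(2):
            for a2 in range(2):
                cell = (O1 == o1) & (A1 == a1) & (O2 == o2) & (A2 == a2)
                Q2[o1, a1, o2, a2] = Y[cell].mean()

V2 = Q2.max(axis=3)          # value after the optimal second decision
d2 = Q2.argmax(axis=3)       # optimal second-stage decision rule

# Stage 1: Q1(o1, a1) = E[ V2(o1, a1, O2) | O1 = o1, A1 = a1 ].
Q1 = np.zeros((2, 2))
for o1 in range(2):
    for a1 in range(2):
        cell = (O1 == o1) & (A1 == a1)
        Q1[o1, a1] = V2[o1, a1, O2[cell]].mean()

d1 = Q1.argmax(axis=1)       # optimal first-stage decision rule
print("d1(o1):", d1, "  d2(o1,a1,o2):\n", d2)
```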

  14. Our Setting (k = 2): The number of subjects n is not large compared to the size of the observation space and decision space. We are forced to use function approximation (i.e., parametric or semiparametric models) to add information to the data (e.g., to effectively pool over subjects' data). The game is "What are you going to approximate?"

  15. Four Categories of Methods • Likelihood-based (Thall et al., 2000, 2002; POMDPs in reinforcement learning) • Q-learning (Watkins, 1989), a popular method from reinforcement learning: regression • A-learning (Murphy, 2003; Robins, 2004): regression on a mean-zero space • Weighting (Murphy et al., 2002; related to policy search in reinforcement learning): weighted mean

  16. Conceptual Formulation

  17. Q-Learning

  18. Q-Learning: Approximate the Q-functions $Q_1, Q_2$ using some approximation class (splines, linear regression, etc.). Fit by regression (least squares, penalized least squares).

  19. A Simple Version of Batch Q-Learning: Approximate the Q-functions by linear working models, e.g. $Q_j(\bar{O}_j, \bar{A}_j) \approx \alpha_j^\top f_j(\bar{O}_j, \bar{A}_{j-1}) + \left(\beta_j^\top g_j(\bar{O}_j, \bar{A}_{j-1})\right) A_j$, j = 1, 2, where $f_j, g_j$ are known feature vectors.

  20. Decision Rules: $\hat{d}_j(\bar{o}_j, \bar{a}_{j-1}) = \arg\max_{a_j} \hat{Q}_j(\bar{o}_j, \bar{a}_{j-1}, a_j)$, j = 1, 2.
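
The recipe on slides 18-20 can be sketched in a few lines. This is a minimal illustration, not the talk's code: linear working models for Q2 and then Q1 are fit by ordinary least squares on simulated two-stage data, and the decision rules take the argmax over the fitted treatment effect; the feature choices and the data-generating model are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated two-stage data with randomized binary actions A1, A2 (illustrative).
O1 = rng.normal(size=n)
A1 = rng.integers(0, 2, n)
O2 = 0.5 * O1 + A1 + rng.normal(size=n)
A2 = rng.integers(0, 2, n)
Y = O2 + A2 * (O2 - 0.5) + rng.normal(size=n)    # A2 = 1 helps when O2 > 0.5

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 2 working model: Q2 = a + b*O2 + A2*(c + d*O2), fit to Y.
X2 = np.column_stack([np.ones(n), O2, A2, A2 * O2])
beta2 = ols(X2, Y)

# Plug in A2 = 0 and A2 = 1, take the max -> stage-2 value V2.
Q2_0 = beta2[0] + beta2[1] * O2
Q2_1 = Q2_0 + beta2[2] + beta2[3] * O2
V2 = np.maximum(Q2_0, Q2_1)

# Stage 1 working model: Q1 = a + b*O1 + A1*(c + d*O1), fit to V2.
X1 = np.column_stack([np.ones(n), O1, A1, A1 * O1])
beta1 = ols(X1, V2)

# Estimated decision rules: treat when the fitted advantage of A = 1 is positive.
d2 = lambda o2: (beta2[2] + beta2[3] * o2 > 0).astype(int)
d1 = lambda o1: (beta1[2] + beta1[3] * o1 > 0).astype(int)
print("d2 treats when O2 >", -beta2[2] / beta2[3])   # ~0.5 (fitted slope is positive)
```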

  21. Disadvantages of batch Q-learning that motivate A-learning: • (1) We are adding information that we do not need in order to construct the decision rules. Essentially this information assists in pooling subjects' data. If this information is false, then any decrease in variance achieved by using the information may be offset by bias. • (2) The unnecessary information may imply that implicit, unpleasant assumptions are made on the conditional distribution of each observation given the past.

  22. Disadvantages of batch Q-learning that motivate A-learning: • (3) Depending on the approximation class (e.g., the model) for the Q-functions, the model for the system dynamics may not be coherent; that is, it may not be possible for this model to be true: you might not be able to generate data with these models as the true Q-functions. • (4) Usually when we add information to the data, this information is related to our understanding of the causal structure. It turns out that in this case the unnecessary information concerns non-causal quantities.

  23. First point: Recall $Q_j = V_j + (Q_j - V_j)$, where $V_j(\bar{o}_j, \bar{a}_{j-1}) = \max_{a_j} Q_j(\bar{o}_j, \bar{a}_{j-1}, a_j)$ is the value and $Q_j - V_j$ is the advantage. You have modeled both the advantage and the value functions. You don't need to model the value.

  24. Second and third points: Recall you modeled $Q_2(\bar{o}_2, \bar{a}_2) = E[Y \mid \bar{O}_2 = \bar{o}_2, \bar{A}_2 = \bar{a}_2]$, and you have a model for $Q_1(o_1, a_1) = E[\max_{a_2} Q_2(\bar{O}_2, a_1, a_2) \mid O_1 = o_1, A_1 = a_1]$. The two models must cohere: the stage-1 model must equal the conditional expectation of the maximized stage-2 model, which a given approximation class may not permit.

  25. Fourth point: In general the weightings of the various observations in the Q-function are non-causal!! Why is this? Berkson's paradox: conditioning on a common effect of two independent causes induces a spurious association between them.
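
Berkson's paradox is easy to see in a simulation (an illustrative sketch, not part of the slides): two independent variables become correlated once we condition on their common effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

talent = rng.normal(size=n)          # two independent traits
looks = rng.normal(size=n)
famous = talent + looks > 1.5        # selection on the common effect

print(np.corrcoef(talent, looks)[0, 1])                  # ~ 0
print(np.corrcoef(talent[famous], looks[famous])[0, 1])  # clearly negative
```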

  26. Conceptual Formulation

  27. A-Learning

  28. A-Learning: Approximate/parameterize the advantages: $\mu_j(\bar{o}_j, \bar{a}_j) = Q_j(\bar{o}_j, \bar{a}_j) - \max_{a_j'} Q_j(\bar{o}_j, \bar{a}_{j-1}, a_j')$.

  29. A-Learning: Equivalently, we can approximate/parameterize the contrasts relative to a reference action, e.g. $C_j(\bar{o}_j, \bar{a}_j) = Q_j(\bar{o}_j, \bar{a}_{j-1}, a_j) - Q_j(\bar{o}_j, \bar{a}_{j-1}, 0)$. Why is this equivalent to approximating the advantages??? Because $C_j$ and the advantage $\mu_j$ differ only by a term that does not depend on $a_j$, so both are maximized by the same decision rule.

  30. A-Learning: The estimating equations will be in terms of the randomization probabilities $p_j(a_j \mid \bar{o}_j, \bar{a}_{j-1})$, which are known by design in a SMART.

  31. A Simple Version of Batch A-Learning: Approximate the contrasts by linear working models, e.g. $C_j(\bar{O}_j, \bar{A}_j) \approx \left(\psi_j^\top g_j(\bar{O}_j, \bar{A}_{j-1})\right) A_j$ for binary $A_j$, j = 1, 2. Then the estimating function is in terms of the centered actions $A_j - p_j(1 \mid \bar{O}_j, \bar{A}_{j-1})$ and the outcome residuals.

  32. A Simple Version of Batch A-Learning (continued): solve the resulting estimating equations, working backward from stage 2 to stage 1, for $\hat{\psi}_2$ and then $\hat{\psi}_1$.

  33. Constructed Decision Rules: $\hat{d}_j(\bar{o}_j, \bar{a}_{j-1}) = \arg\max_{a_j} \hat{C}_j(\bar{o}_j, \bar{a}_{j-1}, a_j)$, j = 1, 2.
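
The robustness property behind A-learning can be sketched for a single decision (illustrative, not the talk's exact estimating equations). Because the action is randomized, the centered regressor (A - p)g(O) has mean zero given O, so the advantage parameters psi are estimated consistently even when the nuisance model h(O) is badly misspecified; the data-generating model below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20_000, 0.5

O = rng.normal(size=n)
A = rng.binomial(1, p, n)                        # randomized with known prob p
# Truth: advantage of A=1 vs A=0 is 1 - 2*O; nuisance mean is nonlinear in O.
Y = np.sin(3 * O) + A * (1 - 2 * O) + rng.normal(size=n)

# Working model: Y ~ h(O; b) + (A - p) * (psi0 + psi1 * O),
# with h deliberately misspecified as linear in O.
X = np.column_stack([np.ones(n), O, (A - p), (A - p) * O])
coef = np.linalg.lstsq(X, Y, rcond=None)[0]
psi = coef[2:]
print("psi estimates:", psi)         # close to (1, -2) despite the bad h

# Constructed decision rule: treat when the estimated advantage is positive.
d = lambda o: (psi[0] + psi[1] * o > 0).astype(int)
```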

  34. Disadvantages of batch Q-learning that motivated A-learning: • We are adding information that we do not need in order to construct the decision rules. Essentially this information assists in pooling subjects' data. The unnecessary information may imply that implicit, unpleasant assumptions are made on the conditional distribution of each observation given the past. • Depending on the approximation class (e.g., the model) for the Q-functions, the model for the system dynamics may not be coherent; that is, it may not be possible for this model to be true: you might not be able to generate data with these models as the true Q-functions.

  35. Telescoping decomposition of the conditional mean (k = 2): $E[Y \mid \bar{O}_2, \bar{A}_2] = E[Y] + \sum_{j=1}^{2} \left\{ E[Y \mid \bar{O}_j, \bar{A}_{j-1}] - E[Y \mid \bar{O}_{j-1}, \bar{A}_{j-1}] \right\} + \sum_{j=1}^{2} \left\{ E[Y \mid \bar{O}_j, \bar{A}_j] - E[Y \mid \bar{O}_j, \bar{A}_{j-1}] \right\}$. Only the second sum (the action terms) is needed to construct the decision rules.

  36. Disadvantages of batch Q-Learning that motivated A-Learning: • Usually when we add information to the data, this information is related to our understanding of the causal structure. It turns out that in this case, the unnecessary information involves non-causal quantities.

  37. Conceptual Formulation

  38. Disadvantages of batch A-learning that motivate Weighting: • We are adding information that we do not need in order to construct the decision rules. Essentially this information assists in pooling subjects' data. If this information is false, then any decrease in variance achieved by using the information may be offset by bias! • Except in very simple cases, we are implicitly (as opposed to explicitly) approximating/modeling the best decision rules. Often experts have good ideas about the form of the best decision rules.

  39. Disadvantages of batch A-learning that motivate Weighting: • Often there are constraints on the decision rules; we will want to find the best decision rules within the constrained space. These constraints may be: • Decision rules must make sense to clinicians and patients. • In impoverished environments, one may not have access to all of the observations collected in the experimental trial.

  40. Weighting: Consider the likelihood of the data: $\prod_{j=1}^{k} p_j(o_j \mid \bar{o}_{j-1}, \bar{a}_{j-1}) \prod_{j=1}^{k} \pi_j(a_j \mid \bar{o}_j, \bar{a}_{j-1})$, where the $\pi_j$ are the randomization probabilities. If the regime $d = (d_1, \ldots, d_k)$ is implemented, then the likelihood is $\prod_{j=1}^{k} p_j(o_j \mid \bar{o}_{j-1}, \bar{a}_{j-1}) \prod_{j=1}^{k} \mathbf{1}\!\left[a_j = d_j(\bar{o}_j, \bar{a}_{j-1})\right]$.

  41. Weighting: Given parameterized decision rules $d_j(\cdot\,; \theta)$, we'd like to choose θ to maximize the mean outcome under the regime, $E_{d_\theta}[Y]$.

  42. A Simple Version of Weighting: Maximize over θ the weighted mean $\frac{1}{n} \sum_{i=1}^{n} \left( \prod_{j=1}^{k} \frac{\mathbf{1}\!\left[A_{ij} = d_j(\bar{O}_{ij}, \bar{A}_{i,j-1}; \theta)\right]}{\pi_j(A_{ij} \mid \bar{O}_{ij}, \bar{A}_{i,j-1})} \right) Y_i$.
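
A minimal sketch of the weighted mean on slide 42 for the one-decision case (illustrative; the threshold parameterization of the rule and the data-generating model are assumptions): each subject whose randomized action agrees with the candidate rule is weighted by the inverse of its randomization probability, and θ is chosen to maximize the resulting weighted mean of Y.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 10_000, 0.5

O = rng.normal(size=n)
A = rng.binomial(1, p, n)                   # randomized; prob p known by design
Y = O + A * (O - 0.3) + rng.normal(size=n)  # treating helps when O > 0.3

def ipw_value(theta):
    """Weighted-mean estimate of E[Y] under the rule d(o) = 1{o > theta}."""
    d = (O > theta).astype(int)
    agree = (A == d)
    w = agree / np.where(A == 1, p, 1 - p)  # indicator / randomization prob
    return np.mean(w * Y)

# Policy search: maximize the weighted mean over a grid of thresholds.
grid = np.linspace(-2, 2, 401)
theta_hat = grid[np.argmax([ipw_value(t) for t in grid])]
print("estimated threshold:", theta_hat)    # near the true 0.3
```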

  43. Discussion • The four classes of methods make tradeoffs. • The classes differ in the size of the model space (parameter space): the more information you add via modeling assumptions, the smaller the model space and the lower the variance in estimation, and the greater the potential for bias due to the use of incorrect information. • What are we talking about? • Bias of what?! • Variance of what?! • Do we really want/need standard errors for the estimators of the β's, γ's, θ's??!

  44. Generalization Error

  45. Generalization Error • The use of function approximation combined with the likelihood-based, Q-learning, and A-learning methods implicitly constrains the space of decision rules. • Yet do these methods try to find the best decision rules within the constrained space? No. • What is the goal here? • We want to be able to characterize the generalization error (mean, variance, confidence interval, error bound, etc.).

  46. One decision only. Data: $(O, A, Y)$ on n subjects; $A$ is randomized with probability $\pi(a \mid O)$.

  47. Goal: Find the decision rules that are best (maximize mean Y) within a restricted class of decision rules, e.g., threshold rules of the form $d_\theta(o) = \mathbf{1}\!\left[\theta_0 + \theta_1^\top o > 0\right]$.

  48. The estimand is $\theta^* = \arg\max_\theta E_{d_\theta}[Y]$, or equivalently $\theta^* = \arg\max_\theta E\!\left[ \frac{\mathbf{1}[A = d_\theta(O)]}{\pi(A \mid O)}\, Y \right]$.

  49. The generalization error is $\max_\theta E_{d_\theta}[Y] - E_{d_{\hat\theta}}[Y]$, the shortfall of the estimated rule relative to the best rule in the restricted class; it is random through $\hat\theta$. The bias is the mean of the generalization error over the distribution of the training data.
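
One way to see the generalization error and its mean (an illustrative simulation under the same kind of single-decision setup as above; the data-generating model is an assumption, not the talk's example): repeatedly fit θ on a small trial by maximizing the weighted mean, compute the true value of the fitted rule by Monte Carlo, and subtract it from the best value attainable in the class.

```python
import numpy as np

rng = np.random.default_rng(5)
p = 0.5

def true_value(theta, m=200_000):
    """Monte Carlo value E[Y] when everyone follows d(o) = 1{o > theta}."""
    O = rng.normal(size=m)
    A = (O > theta).astype(int)
    return np.mean(O + A * (O - 0.3))       # same illustrative outcome model

def fit_theta(n=500):
    """Estimate theta on a trial of size n by maximizing the IPW weighted mean."""
    O = rng.normal(size=n)
    A = rng.binomial(1, p, n)
    Y = O + A * (O - 0.3) + rng.normal(size=n)
    grid = np.linspace(-2, 2, 201)
    vals = [np.mean((A == (O > t)) / np.where(A == 1, p, 1 - p) * Y) for t in grid]
    return grid[np.argmax(vals)]

best = true_value(0.3)                      # best value in the restricted class
gen_errors = [best - true_value(fit_theta()) for _ in range(50)]
print("mean generalization error (bias):", np.mean(gen_errors))
```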
