  1. Appendix: Additional Thoughts and Technical Detail Concerning Data Priors and Model Selection by Prediction

  2. There are several methods by which data priors might be incorporated into model selection for BMS and NML. We describe an approach that leads to extensions of NML* and to prediction/cross-validation.

  3. Treat the prior probability of a given data outcome, D0i, as the probability that the outcome of a (virtual) prior study had been D0i. • Bayesian analysis allows combination of successive results (the posterior of the first is the prior for the second). • To a certain degree, our methods maintain consistency with this Bayesian property: either explicitly or implicitly, all the well-formed methods we have been able to formulate treat the data prior in exactly this way.

  4. To set the stage, consider first a case where we have a single prior data set with some assigned probability.

  5. Suppose we have current data D1 and prior data D2, but we believe there is only a probability of, say, 0.5 that this report of prior data is true. • Bayesian analysis would say that with probability 0.5, p(Mi|D) = p(Mi|D1&D2), and with probability 0.5, p(Mi|D) = p(Mi|D1). So: • P(Mi|D) = (0.5)p(Mi|D1&D2) + (0.5)p(Mi|D1)
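
A minimal numeric sketch of this posterior mixing; the posterior values are hypothetical, chosen only for illustration:

    # Mixing posteriors when the prior-data report is true with probability 0.5.
    w = 0.5                       # belief that the prior data D2 is genuine
    post_with_D2    = 0.80        # p(Mi | D1 & D2), assumed value
    post_without_D2 = 0.60        # p(Mi | D1), assumed value
    p_Mi_given_D = w * post_with_D2 + (1 - w) * post_without_D2
    print(p_Mi_given_D)           # 0.70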

  6. In this approach the posteriors are mixed in accord with our beliefs. • There are two ways to generalize this idea. • In the first, the different possible examples of prior data are all independently possible, and each has a probability in (0,1); the probabilities need not sum to 1.0.

  7. In this generalization, if we have a data prior p*(Dj), then every possible combination of these data possibilities has an associated probability, e.g. p(Dk1, Dk2, Dk3) = p*(Dk1)p*(Dk2)p*(Dk3). • We then calculate a posterior for each of these combinations together with the present data D1.

  8. The resultant posterior is the probabilistic sum of all the posteriors. Let each combination of prior data possibilities be denoted Ck, with associated probability p(Ck). Then • P(Mi|D*,D1) = Σk p(Ck)p(Mi|Ck&D1)
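
A sketch of this enumeration, assuming three hypothetical prior-data possibilities; each candidate is independently in or out of a combination Ck:

    from itertools import product

    # Hypothetical prior-data possibilities, each independently true with
    # its own probability (the probabilities need not sum to 1).
    p_star = {"Dk1": 0.7, "Dk2": 0.4, "Dk3": 0.9}

    # Probability of each combination Ck (each possibility included or not):
    combos = {}
    for flags in product([False, True], repeat=len(p_star)):
        names = tuple(n for n, f in zip(p_star, flags) if f)
        prob = 1.0
        for (n, p), f in zip(p_star.items(), flags):
            prob *= p if f else (1 - p)
        combos[names] = prob

    # The combination probabilities sum to 1 even though p_star does not.
    assert abs(sum(combos.values()) - 1.0) < 1e-12
    # The mixed posterior is then sum_k combos[Ck] * p(Mi | Ck & D1), where
    # each conditional posterior comes from an ordinary Bayesian update.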

  9. This approach seems well formed in terms of traditional Bayesian analysis. • It has the advantage that it can handle the case of two actual prior studies, each then having p = 1.0. • It is, however, completely useless in practice, given the number of possible prior data sets and the explosion of all possible subsets of those.

  10. The second way to generalize is the one we pursue: • Suppose we stipulate that the data prior represents examples of data outcomes, exactly one of which is true. The data prior probabilities thus add to 1.0.

  11. If exactly one were true, then we need consider only each combination of D1 and D*k: • P(Mi|D) = Σk p(D*k)p(Mi|D1&D*k) • For model selection we sum these across the Mi in each model class: • Σi Σk p(D*k)p(Mi|D1&D*k) • It is this approach and justification that we pursue in what follows.
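
A small runnable sketch of this favored mixture; the data prior and the conditional posteriors below are hypothetical numbers, for illustration only:

    # Exactly one virtual prior outcome is true, so p(D*_k) sums to 1.
    data_prior = {"D*1": 0.2, "D*2": 0.5, "D*3": 0.3}

    # Assumed values of p(Mi | D1 & D*_k) for two model instances:
    post = {
        ("M1", "D*1"): 0.55, ("M1", "D*2"): 0.40, ("M1", "D*3"): 0.70,
        ("M2", "D*1"): 0.45, ("M2", "D*2"): 0.60, ("M2", "D*3"): 0.30,
    }

    def mixed(model):
        # P(Mi | D) = sum_k p(D*_k) p(Mi | D1 & D*_k)
        return sum(p * post[(model, k)] for k, p in data_prior.items())

    for m in ("M1", "M2"):
        print(m, mixed(m))
    # For model-class selection, sum the mixed posteriors over the Mi
    # belonging to each class.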

  12. We use the matrix of all model classes: for Model Classes M = J, K, etc., index all their parameters by θM: (θ1, θ2, …, θi). We omit the M when referring to all the models together, and place J, K, etc. in the superscript when we want to restrict consideration to the parameters of a given class. • Assume we have a parameter prior, P0(θ), in addition to a data prior, P0(Di). • To maintain consistency with the Bayesian property above, and other desirable properties, it is critical that we start with the full matrix of all models, not just one model at a time.

  13. The BMS* criterion for Model Class K (with parameters θK) is the joint probability of the observed data and all parameters θi within Model K: p(Dobs, K) = Σi p(Dobs, θKi). • Thus we are basing model selection on the prediction that Model K will hold and that the data to be observed will occur.
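
A toy computation of this criterion; the joint matrix entries, the column-to-class assignment, and the observed-outcome index are all assumptions made for illustration:

    import numpy as np

    # Toy joint matrix p(D, theta): rows index data outcomes, columns index
    # the parameter values of all models laid side by side; entries are
    # random stand-ins normalized to sum to 1.
    rng = np.random.default_rng(0)
    joint = rng.random((4, 6))
    joint /= joint.sum()

    # Assume columns 0-2 belong to Model K, columns 3-5 to Model J,
    # and outcome 1 is the one actually observed.
    cols = {"K": slice(0, 3), "J": slice(3, 6)}
    d_obs = 1

    # BMS*(K) = sum_i p(Dobs, theta_Ki): sum the observed row over K's columns.
    bms_star = {m: joint[d_obs, c].sum() for m, c in cols.items()}
    print(bms_star)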

  14. Now we consider the data prior, P0(Di), as the probability that Di had already been observed, prior to the present study.

  15. Let D0i represent outcome Di of an identical (hypothetical, virtual) prior study. We note that: p’(Dobs, K|D0i) = p’(Dobs, D0i, K)/p0(D0i) • Here p’ refers to the probability calculated on the basis of the original model with the specified parameter prior, but without the separately specified data prior.

  16. This characterization contains the joint probability of the virtual prior study outcome, the present study outcome, and the parameters of all the models. • Thus a graphic depiction of our extended method should use a 3-D joint matrix (not yet available as a picture).

  17. The 3-D matrix has one axis for the parameters of all models, a second for the outcomes of the present study, and a third for the outcomes of the ‘virtual’ prior study: • On one axis are the parameter values of the various models: θK1, θK2, θK3, …, θKn1, θJ1, θJ2, θJ3, …, θJn2 • On the second axis are the potential data outcomes of the present study: D1, D2, D3, … • On the third axis are the (virtual) data outcomes of the prior study: D01, D02, D03, …

  18. A given entry in this joint matrix is obtained from the model and the parameter prior: P(θi, D0j, Dk) = P0(θi)P(D0j|θi)P(Dk|θi) • This characterization assumes independence of the actual present study and the virtual prior study, an assumption that seems reasonable.
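
A toy construction of this 3-D joint matrix, assuming discrete axes and random likelihood tables as stand-ins for real models:

    import numpy as np

    # Axes: n_theta parameter values (all models pooled), n0 virtual prior
    # outcomes, n present-study outcomes.
    rng = np.random.default_rng(1)
    n_theta, n0, n = 5, 3, 4
    p0_theta  = rng.dirichlet(np.ones(n_theta))          # P0(theta_i)
    lik_prior = rng.dirichlet(np.ones(n0), n_theta)      # P(D0_j | theta_i)
    lik_pres  = rng.dirichlet(np.ones(n),  n_theta)      # P(D_k  | theta_i)

    # Independence of the two studies given theta:
    # P(theta_i, D0_j, D_k) = P0(theta_i) P(D0_j|theta_i) P(D_k|theta_i)
    joint3 = p0_theta[:, None, None] * lik_prior[:, :, None] * lik_pres[:, None, :]
    assert abs(joint3.sum() - 1.0) < 1e-12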

  19. All the Bayesian methods we consider for incorporating data priors start by calculating a joint BMS score for each virtual prior data outcome, the present observed data outcome, and Model K: P(D0i, Dobs, K) • This is a sum of the joint probabilities over the parameter values of Model K, within the vector of the 3-D matrix defined by D0i and Dobs.

  20. We now want to weight this BMS joint score by the probabilities of the virtual prior data, i.e. the data prior, and sum across all the virtual prior outcomes. The justification for this probability mixing was given in the introduction to this section on data priors. • The weights justified there lead to the following BMS criterion: BMS*DP(K) = Σi p’(Dobs, D0i, K)p0(D0i) • The upper-case K here means we are really taking two sums, one over data and the other over the model class.
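
Continuing the 3-D matrix sketch after slide 18, a minimal computation of BMS*DP(K); the assignment of parameter rows to Model K, the observed-outcome index, and the data prior are all assumptions:

    # Assume theta rows 0-2 belong to Model K, outcome 2 was observed,
    # and the data prior over the n0 = 3 virtual outcomes sums to 1.
    d_obs  = 2
    K_rows = slice(0, 3)
    p0_D0  = np.array([0.2, 0.5, 0.3])

    # p'(Dobs, D0_i, K) = sum over Model K's theta rows in the vector
    # defined by (D0_i, Dobs); then weight by the data prior and sum.
    bms_joint = joint3[K_rows, :, d_obs].sum(axis=0)   # one value per D0_i
    bms_dp_K  = (bms_joint * p0_D0).sum()              # BMS*_DP(K)
    print(bms_dp_K)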

  21. In this formulation there are normalizing constants that could be altered to allow two other weight descriptions. Thus we have considered all three of the following weightings (the weights are in brackets): • Σi p’(Dobs, D0i, K)[p0(D0i)] (1) • Σi p’(Dobs, D0i, K)[p0(D0i)/pA(D0i)] (2) • Σi p’(Dobs, D0i, K)[p0(D0i)/pA(Dobs, D0i)] (3) • Recall that A refers to the marginal data probability obtained using the original parameter prior.

  22. Note that the weighting scheme is a matter of convenience: one can choose to assign the same effective weights whatever scheme is assumed. But mathematical equivalence is not conceptual equivalence. When we know something about the data (a prior), we have to know how to specify this knowledge: the interpretation of the weights changes with the underlying assumptions, so when translating prior knowledge into weights, one must do so in a way consistent with those assumptions.

  23. BMS*DP(K) = Σi p’(Dobs, D0i, K)p0(D0i) • For this weighting scheme, we can interpret the weight as the probability that, of all outcomes that could have occurred in our prior (virtual) study, this was the outcome that did occur.

  24. For this method, equal weights give a constant times the BMS* criterion and score that would have been obtained in normal BMS without a data prior, so relative model selection remains unchanged. • This formulation also has the nice property that adding models to the set under consideration will not change the relative BMS preference among the initial models.

  25. We must keep our justification clearly in mind. Suppose we had an actual prior replication. One could list its outcome as a data prior with p = 1.0. If we did so, then our favored formula would combine the two results in the usual Bayesian manner. But this is a special case. • In general, given two or more actual prior studies, the results would have to be combined before using the formula.

  26. The method we propose is of course not aimed at actual prior replications. In general our prior data knowledge comes not from replications but from relevant sources of all kinds, including vaguely similar studies, conceptual thinking, vetted theories, and so on.

  27. Our formulation applies to equal or unequal model class probabilities. • To extend to NML it is convenient to restate the criterion in terms of means.

  28. In equation form, the BMS data-prior criterion stated in terms of means is: Σi [μK(Dobs, D0i)/Σh Σj μK(Dj, D0h)] P0(K) P0(D0i) • Here i (and h) index the prior data outcomes and j indexes the present data outcomes. The term in brackets assumes a particular prior data outcome D0i: it is the mean for the present observed data, Model K, and prior data outcome D0i, divided by the sum of means for Model K across all present and prior data outcomes. This term is multiplied by the prior probability of Model K. Finally, we weight such terms by the probability of the prior data outcome and sum.
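
Again continuing the 3-D matrix sketch, the mean-based criterion for a single Model K, with a hypothetical class prior P0(K):

    # mu_K(D_j, D0_h): mean over Model K's parameter rows of the joint entries.
    mu_K = joint3[K_rows, :, :].mean(axis=0)        # shape (n0, n)
    P0_K = 0.5                                      # assumed class prior

    # sum_i [ mu_K(Dobs, D0_i) / sum_{h,j} mu_K(D_j, D0_h) ] P0(K) P0(D0_i)
    score_mean = sum(
        (mu_K[i, d_obs] / mu_K.sum()) * P0_K * p0_D0[i]
        for i in range(len(p0_D0))
    )
    print(score_mean)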

  29. NML with Data Priors • We previously characterized BMS and NML in terms of means and maxima, respectively. Thus to obtain an NML data-prior model selection criterion, it is natural to replace the means in the data-prior BMS equation with maxima.

  30. The NML result becomes: Σi [maxK(Dobs, D0i)/Σj Σh maxK(Dj, D0h)] w0(K) P0(D0i) • In words, one calculates an NML-like score for each Model K by taking the max within the vector defined by Dobs and D0i, and then dividing by the sum of such maxima for Model K for all present and past data outcomes. This ratio is multiplied by a model weight w0(K), and the result is weighted by the data prior P0(D0i) and summed over prior data outcomes. The weight w0(K) could reasonably be set equal to the sum of the prior probabilities, or weights, over the Model K class.
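
The NML variant, continuing the same sketch: identical structure, with the max over Model K's parameter rows in place of the mean, and an assumed class weight w0(K):

    max_K = joint3[K_rows, :, :].max(axis=0)        # shape (n0, n)
    w0_K  = 0.5                                     # e.g. summed class prior

    score_max = sum(
        (max_K[i, d_obs] / max_K.sum()) * w0_K * p0_D0[i]
        for i in range(len(p0_D0))
    )
    print(score_max)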

  31. For the purposes of exposition we have up to this point been too restrictive in the types of data priors allowed. We now wish to generalize, in an entirely natural way that does not alter the approach.

  32. We have characterized data priors (as in the equation below) in terms of alternative outcomes of the present study: BMS*DP(K) = Σi p’(Dobs, D0i, K)p0(D0i) • However, the data prior could be based on a different virtual study, e.g. a prior study measuring accuracy and a present study measuring response time. • Simpler variants are of course also possible, such as choosing a data prior for a study with a smaller number of observations than the present study.

  33. BMS*DP(K) = Σi p’(Dobs, D0i, K)p0(D0i) • There is another important proviso, and generalization: the prior study and the present study could be based on the same model class but different parameters. In general we want the parameters in the two cases to be similar, but not identical. In effect we need a prior on the covariance of the parameters.

  34. Thus we need a joint prior on the parameters of the two (or more) studies: P0(θobs,g, θ0,h) • For a given prior data outcome, the joint probability of D0i and Dobs becomes:

  35. p’(Dobs, D0i, K) = Σg Σh P0(θobs,g, θ0,h)p(Dobs|θobs,g)p(D0i|θ0,h) • We then apply the previously specified equation for data priors, using this joint probability. As usual, these joint probabilities are calculated on the basis of all the model classes under consideration.

  36. How do we specify P0(θobs,g, θ0,h)? • It is usual to do so using a hierarchical structure: P0(θobs,g, θ0,h) = Ση P0(θobs,g|η)P0(θ0,h|η)P0(η) • Here the hyperparameter η specifies the joint distribution over θobs and θ0. For example, this might be a Gaussian distribution with some parameterized variance/covariance structure.
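
A minimal sketch of such a hierarchical prior: draw a shared hyperparameter η, then draw the two studies' parameters conditionally independently given η. The Gaussian form and the variance values are assumptions made for illustration:

    import numpy as np
    rng = np.random.default_rng(2)

    def sample_joint_params(n_draws, between_sd=1.0, within_sd=0.3):
        eta       = rng.normal(0.0, between_sd, n_draws)   # P0(eta)
        theta_obs = rng.normal(eta, within_sd)             # P0(theta_obs | eta)
        theta_0   = rng.normal(eta, within_sd)             # P0(theta_0   | eta)
        return theta_obs, theta_0

    t_obs, t_0 = sample_joint_params(10_000)
    # The two studies' parameters are similar but not identical: they covary.
    print(np.corrcoef(t_obs, t_0)[0, 1])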

  37. Given that we are inventing a data prior from a whole set of partially relevant and often vague considerations, it is not entirely clear whether any of this new machinery is necessary. It may be sufficient to simply provide a best guess for the outcome of the present study.

  38. This approach for BMS can be carried over to NML, but details are omitted.

  39. Thus we have a data prior method for model selection. • But what about computation? Is the method usable? • Without data priors (or ‘flat’ ones), standard sampling methods suffice for BMS. Computing an NML score is far more difficult and typically not computationally feasible, because the max of a distribution is much harder to obtain by sampling than its mean. We therefore suggested it might be best to compute a BMS score and, if we prefer, treat the result as an approximation to NML.

  40. The need to specify a data prior introduces an additional issue of computational complexity, for both BMS and NML: the data outcome space is typically extremely complex and high dimensional (much more so than the parameter space; after all, the models aim to simplify the data space). • Thus it may be very difficult to specify the data outcome prior, and almost impossible to sum weighted model selection scores across it.

  41. We believe we may be able to approximate the computation by using a reasonably sized sample of data outcomes, each with an appropriately assigned probability. • We will be exploring this hypothesis with simulations in the near future.
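
A sketch of the proposed approximation: sample a modest set of virtual prior outcomes from the data prior and average the joint scores, rather than summing over the full outcome space. The joint_score routine is a hypothetical stand-in for p’(Dobs, D0i, K):

    import numpy as np
    rng = np.random.default_rng(3)

    def approx_bms_dp(joint_score, outcomes, probs, n_samples=200):
        # Sampling D0_i ~ p0 makes the weights implicit, so the plain
        # average of scores is an unbiased estimate of the weighted sum.
        idx = rng.choice(len(outcomes), size=n_samples, p=probs)
        return np.mean([joint_score(outcomes[i]) for i in idx])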

  42. In summary: • Using the joint probability characterization outlined in Part I, and treating data priors in terms of probabilistic outcomes of a virtual prior study, we can develop a data prior approach to model selection, one that can be stated in both BMS and NML terms.

  43. Research in Progress: comments, criticisms, and suggestions are welcome.

  44. Extensions to Cross Validation Methods [This section is in a more preliminary phase of construction than the prior sections.]

  45. Extensions to Predictive Validation and Cross Validation (CV) • We want to start by saying that CV methods also need to take priors into account: what we know determines inference in all methods.

  46. The third major class of model selection methods involves prediction and cross-validation. Such methods include many cross-validation variants, prequential prediction methods such as accumulated prediction error (APE), and certain bootstrapped simulation versions of these methods.

  47. For example, one might prefer the model that, when fit to one half of a data set, best predicts the other half (split-half cross-validation). Or one might predict a single observation from a fit to all the others, and do this for every observation in turn (leave-one-out cross-validation). Or one might sequentially use a model fit to trials 1 to n to predict trial n+1 and accumulate the prediction error, favoring the model with the lowest APE (prequential analysis). Sketches of each scheme are given below.
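
Minimal sketches of the three schemes, assuming a hypothetical fit(train) routine that returns a fitted model exposing a logpdf(x) method:

    import numpy as np

    def split_half_cv(data, fit):
        # Fit each half, score its predictions of the other half.
        a, b = data[::2], data[1::2]
        return fit(a).logpdf(b).sum() + fit(b).logpdf(a).sum()

    def leave_one_out_cv(data, fit):
        # Predict each observation from a fit to all the others.
        n = len(data)
        return sum(fit(np.delete(data, i)).logpdf(data[i]) for i in range(n))

    def prequential_ape(data, fit):
        # Accumulated prediction error: fit trials 1..n, predict trial n+1.
        return -sum(fit(data[:i]).logpdf(data[i]) for i in range(1, len(data)))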

  48. We now want to consider the relation of these approaches to our proposed BMS and NML data-prior-based methods, and propose ways to incorporate data and parameter priors into these methods.

  49. One might think that prediction methods eliminate the need to introduce parameter priors and data priors, but this represents a serious conceptual error. • To take just one simple example among an infinite set, if we know before a study that Model A is one hundred times as probable as Model B, but analysis of the study shows that Model B cross-validates slightly better than A, it would not be sensible to prefer Model B.

  50. To take another simple example, suppose we are testing a fair coin against a biased coin. We divide the trials in half and see how well each model, fit to one half, predicts the other half. If the observed data show a preponderance of tails, the bias model will do best at cross-validation (the excess tails tend to occur in both halves of the split sample). But suppose prior knowledge tells us that if a coin is biased, it is biased towards heads. Now the bias model should do less well, and the fair-coin model much better.
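
A small simulation of this coin example; the tails rate, sample size, and clipping bounds are illustrative assumptions:

    import numpy as np
    rng = np.random.default_rng(4)

    def loglik(p, flips):                  # Bernoulli log-likelihood, heads = 1
        heads = flips.sum()
        tails = len(flips) - heads
        return heads * np.log(p) + tails * np.log(1 - p)

    flips = rng.random(100) < 0.4          # observed data with excess tails
    a, b = flips[:50], flips[50:]

    def cv_score(estimate):
        # Split-half CV: each half's estimate predicts the other half.
        return loglik(estimate(a), b) + loglik(estimate(b), a)

    fair         = cv_score(lambda half: 0.5)
    biased       = cv_score(lambda half: np.clip(half.mean(), 0.01, 0.99))
    # Prior knowledge "bias, if any, is toward heads": restrict p >= 0.5.
    biased_heads = cv_score(lambda half: np.clip(half.mean(), 0.5, 0.99))
    print(fair, biased, biased_heads)
    # The unconstrained bias model exploits the excess tails; once the prior
    # restricts the bias to heads, its estimate collapses to 0.5 and its
    # cross-validation advantage over the fair coin vanishes.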
