Connections between MCMC and Likelihood Methods

Donald A. Pierce with Ruggero Bellio. Winter 2010, OSU. Slides are at www.science.oregonstate.edu/~piercedo/osu-mcmc-mpl.ppt

It is popular these days to be “Bayesian”, in large part due to the utility of MCMC and in particular (Win)BUGS. However, substantive prior information is seldom used, the aim being “objective Bayes”, and connections to likelihood inference are interesting. Largely, the gain in MCMC is in utilizing rather intractable likelihood functions: integrating over latent variates, e.g. latent cluster effects or covariates observed with error. However, if everything except observed data is a random variable, issues of inference become highly (too?) automatic.

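A key issue in this is the contrast of profile and integrated likelihoods for an interest parameter $\psi$ with nuisance parameter $\lambda$, namely $L_P(\psi) = \max_\lambda L(\psi,\lambda) = L(\psi,\hat\lambda_\psi)$ and $L_I(\psi) = \int L(\psi,\lambda)\,\pi(\lambda\mid\psi)\,d\lambda$. Modern higher-order likelihood theory suggests, surprisingly, that integrated likelihoods can overcome shortcomings of profile likelihood. A posterior for $\psi$ is an instance of integrated likelihood. That is, $\pi(\psi,\lambda\mid y) \propto L(\psi,\lambda)\,\pi(\psi)\,\pi(\lambda\mid\psi)$, so $\pi(\psi\mid y) \propto \pi(\psi)\int L(\psi,\lambda)\,\pi(\lambda\mid\psi)\,d\lambda$.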
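A small numerical illustration of the contrast, not taken from the slides: for a normal sample with interest parameter sigma and nuisance parameter mu under a flat prior, the profile likelihood peaks at sqrt(S/n) while the integrated likelihood peaks at sqrt(S/(n-1)). The data, grids and function name below are made up for the sketch.

```python
import numpy as np

def log_profile_and_integrated(y, sigma_grid, mu_grid):
    """Log PL and log IL for sigma in a N(mu, sigma^2) sample, flat prior on mu."""
    y = np.asarray(y, dtype=float)
    n = y.size
    ss = ((y[None, :] - mu_grid[:, None]) ** 2).sum(axis=1)      # sum of squares per mu
    loglik = (-n * np.log(sigma_grid)[:, None]
              - ss[None, :] / (2.0 * sigma_grid[:, None] ** 2))  # rows: sigma, cols: mu
    row_max = loglik.max(axis=1)
    log_pl = row_max                                             # maximize over mu
    dmu = mu_grid[1] - mu_grid[0]
    # Riemann sum over mu, computed stably by factoring out each row's maximum
    log_il = row_max + np.log(np.exp(loglik - row_max[:, None]).sum(axis=1) * dmu)
    return log_pl, log_il

y = np.array([1.2, -0.4, 0.3, 2.1, 0.8])          # made-up data, n = 5
sigma_grid = np.linspace(0.3, 3.0, 541)
mu_grid = np.linspace(-6.0, 8.0, 1401)
log_pl, log_il = log_profile_and_integrated(y, sigma_grid, mu_grid)
sigma_hat_pl = sigma_grid[np.argmax(log_pl)]      # close to sqrt(S / n)
sigma_hat_il = sigma_grid[np.argmax(log_il)]      # close to sqrt(S / (n - 1))
```

With only five observations the two maximizers differ visibly, which is the familiar degrees-of-freedom correction that the profile likelihood misses.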

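An integrated likelihood is approximated very well by a Laplace approximation, $\int L(\psi,\lambda)\,d\lambda \approx (2\pi)^{p/2}\,L(\psi,\hat\lambda_\psi)\,|j_{\lambda\lambda}(\psi,\hat\lambda_\psi)|^{-1/2}$, where $\hat\lambda_\psi$ is the constrained MLE of the nuisance parameter $\lambda$ for fixed $\psi$, $j_{\lambda\lambda}$ is the observed information for $\lambda$, and $p = \dim\lambda$. Hence, the MCMC posterior for “flat” priors is essentially $L_P(\psi)\,|j_{\lambda\lambda}(\psi,\hat\lambda_\psi)|^{-1/2}$, with $L_P(\psi) = L(\psi,\hat\lambda_\psi)$ the profile likelihood. We will see that this depends substantially on the representation of the nuisance parameter --- to be avoided in frequentist or likelihood inference. The approximation above is, within reason, valid for any such representation (not that this is so comforting).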
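A toy check of both remarks, using an example that is not in the slides: an exponential sample with total s and a single parameter, integrated once as a rate and once as a log rate. The values of n and s are hypothetical. In both representations the Laplace approximation is accurate to under one percent, yet the two integrals differ by a factor n/s that depends on the data.

```python
import math

def log_laplace_integral(log_f_at_mode, neg_hessian_at_mode):
    """Log of the one-dimensional Laplace approximation to the integral of f."""
    return (log_f_at_mode + 0.5 * math.log(2.0 * math.pi)
            - 0.5 * math.log(neg_hessian_at_mode))

n, s = 10, 7.5  # hypothetical sample size and sum of the observations

# Representation 1: rate lam, integrand lam**n * exp(-s * lam), flat in lam
lam_hat = n / s
log_laplace_rate = log_laplace_integral(n * math.log(lam_hat) - n, n / lam_hat**2)
log_exact_rate = math.lgamma(n + 1) - (n + 1) * math.log(s)

# Representation 2: theta = log(lam), integrand exp(n*theta - s*exp(theta)), flat in theta
log_laplace_log = log_laplace_integral(n * math.log(lam_hat) - n, float(n))
log_exact_log = math.lgamma(n) - n * math.log(s)

ratio_rate = math.exp(log_laplace_rate - log_exact_rate)   # Laplace / exact
ratio_log = math.exp(log_laplace_log - log_exact_log)      # Laplace / exact
representation_gap = log_exact_rate - log_exact_log        # equals log(n / s)
```

In a model with an interest parameter, the analogue of s would typically vary with that parameter, so the two flat-prior analyses would give different marginal posteriors even though each is well approximated.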

Regarding “flat” priors: in practice those used in WinBUGS manual examples seem advisable, i.e. proper but very diffuse for parameters on $(-\infty,\infty)$, e.g. dnorm(0,1E-6), and implicitly for the logs of inherently positive parameters, e.g. dgamma(1E-6,1E-6). The latter is to obtain approximate invariance to scale for scale parameters, a natural requirement. If, to facilitate convergence, the prior for the interest parameter $\psi$ is chosen otherwise, then for likelihood analysis one should divide the posterior of $\psi$ by the prior. Geyer & Thompson (1992 JRSS-B) gave a method for computing the likelihood using MCMC, but the proposal here is far simpler.
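Dividing the posterior by the prior can be sketched from posterior draws alone. This is a minimal sketch, assuming a scalar interest parameter and a histogram density estimate; the function name, bin count, sparse-bin threshold and the conjugate normal toy model are all choices made here, not the authors' procedure.

```python
import numpy as np

def log_likelihood_from_posterior(samples, log_prior, bins=40, min_count=200):
    """Histogram estimate of log posterior minus log prior, up to a constant."""
    counts, edges = np.histogram(samples, bins=bins)
    mids = 0.5 * (edges[:-1] + edges[1:])
    keep = counts >= min_count               # drop sparse bins, where the estimate is noisy
    log_post = np.log(counts[keep] / (counts.sum() * np.diff(edges)[keep]))
    return mids[keep], log_post - log_prior(mids[keep])

# Toy check with a conjugate normal model (all numbers are made up):
# prior psi ~ N(0, 1), likelihood proportional to N(psi; 2.0, 0.5**2).
rng = np.random.default_rng(0)
post_var = 1.0 / (1.0 + 1.0 / 0.5**2)
post_mean = post_var * (2.0 / 0.5**2)
draws = rng.normal(post_mean, np.sqrt(post_var), size=200_000)  # stand-in for MCMC output

grid, log_lik = log_likelihood_from_posterior(draws, lambda x: -0.5 * x**2)
a, b, _ = np.polyfit(grid, log_lik, 2)       # the log-likelihood is quadratic in this toy
psi_max = -b / (2.0 * a)                     # vertex of the fitted parabola
```

The posterior mean here is 1.6, but the recovered likelihood peaks near 2.0, the value the data alone support.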

An attempt to generally improve on profile likelihood was the Cox-Reid approximate conditional likelihood (ACL), $L_{ACL}(\psi) = L_P(\psi)\,|j_{\lambda\lambda}(\psi,\hat\lambda_\psi)|^{-1/2}$, requiring that the nuisance parameter $\lambda$ be represented as ‘orthogonal’ to $\psi$, i.e. that $\hat\lambda_\psi$ varies slowly with $\psi$. However, orthogonal parameters are not at all uniquely defined, resulting in arbitrariness of the ACL that must be resolved. A partial indication of our interests is that the ACL is formally the same as the above approximation to the posterior for $\psi$ using flat priors.

Barndorff-Nielsen developed the modified profile likelihood (MPL), $L_{MP}(\psi) = L_P(\psi)\,|j_{\lambda\lambda}(\psi,\hat\lambda_\psi)|^{-1/2}\,|\partial\hat\lambda_\psi/\partial\hat\lambda|^{-1}$, that is invariant to representation of the nuisance parameter --- a really key issue. This was a remarkable stroke of intuition, and B-N only showed that the MPL approximates what is desired for the primary special settings: exponential families, regression-scale models, etc. We have been developing the idea that what the MPL in general approximates is a suitable integrated likelihood, hence with close connections to MCMC.
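The invariance can be checked directly for a scalar nuisance parameter. A sketch, assuming the MPL in its usual three-factor form and a smooth one-to-one reparametrization $\nu = \nu(\psi,\lambda)$; the symbol $\nu$ is introduced here only for illustration:

```latex
\begin{align*}
L_{MP}(\psi) &= L_P(\psi)\,\bigl|j_{\lambda\lambda}(\psi,\hat\lambda_\psi)\bigr|^{-1/2}
               \left|\frac{\partial\hat\lambda_\psi}{\partial\hat\lambda}\right|^{-1}, \\
j_{\nu\nu}(\psi,\hat\nu_\psi) &= j_{\lambda\lambda}(\psi,\hat\lambda_\psi)
               \left(\left.\frac{\partial\lambda}{\partial\nu}\right|_{(\psi,\hat\nu_\psi)}\right)^{2}, \\
\frac{\partial\hat\lambda_\psi}{\partial\hat\lambda} &=
               \left.\frac{\partial\lambda}{\partial\nu}\right|_{(\psi,\hat\nu_\psi)}
               \frac{\partial\hat\nu_\psi}{\partial\hat\nu}\,
               \left.\frac{\partial\nu}{\partial\lambda}\right|_{(\hat\psi,\hat\lambda)}, \\
\bigl|j_{\lambda\lambda}\bigr|^{-1/2}
  \left|\frac{\partial\hat\lambda_\psi}{\partial\hat\lambda}\right|^{-1}
  &= \bigl|j_{\nu\nu}\bigr|^{-1/2}
     \left|\frac{\partial\hat\nu_\psi}{\partial\hat\nu}\right|^{-1}
     \times \left|\left.\frac{\partial\nu}{\partial\lambda}\right|_{(\hat\psi,\hat\lambda)}\right|^{-1}.
\end{align*}
```

The trailing factor is evaluated at the overall MLE and does not involve $\psi$, so the two versions of the MPL differ only by a constant. The $\psi$-dependent Jacobian $\partial\lambda/\partial\nu$ at $(\psi,\hat\nu_\psi)$, which is what makes the ACL depend on the representation, cancels between the last two factors.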

Example (Pierce & Peters 1992): case-control study, 40 sets with 2:1 matching, 30/80 of controls “exposed”. Figure: solid line PL, dashed lines conditional likelihood and MPL.

The concept of ‘orthogonal’ parameter, for ACL and for MCMC, needs clarification. In principle there is an ‘ideal’ choice of orthogonal parameter such that the integrated likelihood, i.e. the Bayes posterior (with uniform priors), approximates the MPL. Some goals are: (a) to actually compute this, either from the likelihood or the posterior samples, (b) to recover the PL from the posterior distribution, and (c) to approximate the MPL in this way, even if not as in (a). These are not completed, but some progress has been made.

Example: Binary data on 50 subjects, repeated observations at up to five times, total of 220 observations. Suitable for a logistic mixed model with latent random intercepts for subjects. The interest parameter is $\sigma$, the standard deviation of the random intercepts. There are seven nuisance parameters: constant term, 2 treatment parameters, 4 for time effects. The usual parametrization is not orthogonal: the vector of canonical regression parameters is ‘attenuated’ by a factor depending on $\sigma$, suggesting the attenuated regression vector as an approximately orthogonal parameter.
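The slide's attenuation formula did not survive extraction. A common choice, assumed here and not confirmed by the source, is the logistic-normal approximation beta / sqrt(1 + c^2 sigma^2) with c = 16 sqrt(3) / (15 pi). The sketch below compares it with the exact marginal probability by Gauss-Hermite quadrature; the function names and the values of eta and sigma are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

C = 16.0 * np.sqrt(3.0) / (15.0 * np.pi)   # about 0.588, the usual logistic-normal constant

def attenuate(beta, sigma):
    """Approximate marginal (population-averaged) coefficients for random-intercept SD sigma."""
    return np.asarray(beta, dtype=float) / np.sqrt(1.0 + (C * sigma) ** 2)

def marginal_prob(eta, sigma, n_nodes=60):
    """E[expit(eta + sigma * Z)], Z ~ N(0, 1), by Gauss-Hermite quadrature."""
    z, w = hermegauss(n_nodes)              # nodes and weights for weight function exp(-z^2 / 2)
    return float(np.sum(w / (1.0 + np.exp(-(eta + sigma * z)))) / np.sqrt(2.0 * np.pi))

eta, sigma = 1.0, 1.0                       # hypothetical linear predictor and SD
exact = marginal_prob(eta, sigma)
approx = float(1.0 / (1.0 + np.exp(-attenuate(eta, sigma))))
```

Under this assumption, the reparametrized nuisance vector would be `attenuate(beta, sigma)`, which is roughly what the marginal data determine regardless of sigma.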

WinBUGS posterior densities of $\sigma$ using flat priors: heavy line original parametrization, light line using the approximately orthogonal nuisance parameters.

Posterior samples: Sigma vs constant term, for original and orthogonal parametrizations. This provides a clue that we can use posterior samples to assess and correct for lack of orthogonality.

Important but confusing issue --- clearly, if we transform the posterior samples as $(\psi,\lambda) \to (\psi,\nu(\psi,\lambda))$, the marginal distribution of $\psi$ is unchanged. Part of the reason reparametrization of $\lambda$ matters is that this is done in the model specification, where in contrast to the above there is no (implicit) Jacobian involved in the density. Having samples from the joint distribution of $(\psi,\lambda)$, it would be possible but impractical to divide the density by the Jacobian, to avoid re-doing MCMC. We can achieve this aim otherwise by resampling from the posterior samples with weights inversely proportional to the reciprocal Jacobian.
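The resampling step itself is ordinary weighted resampling with replacement. A minimal sketch of the mechanics only: in practice the weights would be the Jacobian factor relating the two parametrizations, evaluated at each posterior draw, whereas the toy below uses a known exponential tilt purely to check that the resampler behaves. The function name and all numbers are hypothetical.

```python
import numpy as np

def resample(samples, weights, size, rng):
    """Weighted resampling (with replacement) of the rows of `samples`."""
    w = np.asarray(weights, dtype=float)
    idx = rng.choice(len(w), size=size, replace=True, p=w / w.sum())
    return np.asarray(samples)[idx]

# Toy check: tilting N(0, 1) draws by weights exp(x) should give draws close to N(1, 1).
rng = np.random.default_rng(1)
x = rng.normal(size=200_000)
x_new = resample(x, np.exp(x), size=100_000, rng=rng)
```

Because only the draws and one weight per draw are needed, the MCMC run does not have to be repeated. The price is a smaller effective sample size when the weights are very uneven.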

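Recall that to very good approximation the MCMC posterior, for flat priors, is essentially $L_P(\psi)\,|j_{\lambda\lambda}(\psi,\hat\lambda_\psi)|^{-1/2}$, which can be expressed approximately as $L_P(\psi)\,|\mathrm{cov}(\lambda\mid\psi)|^{1/2}$, with $\mathrm{cov}(\lambda\mid\psi)$ the posterior covariance of the nuisance parameter $\lambda$ given $\psi$. We can approximate the final factor from the MCMC samples at hand, and thus approximate the PL by dividing the posterior density of $\psi$ by our estimate of that factor. There are, however, issues involving the distinction between posterior and sampling theory.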

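A transparent way to do this, although there may be more accurate ways: choose bins for $\psi$ (e.g. 20 using quantiles), for each of these compute the sample estimate of $|\mathrm{cov}(\lambda\mid\psi)|^{1/2}$ from the nuisance-parameter samples falling in the bin, and then smooth (the logs of) these by quadratic regression on the bin classmarks.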
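The binning recipe might be coded as follows. The per-bin quantity on the slide was lost, so this sketch assumes it is half the log-determinant of the within-bin sample covariance of the nuisance samples, and it uses the within-bin mean of psi as the classmark. The function name and the toy data are hypothetical.

```python
import numpy as np

def log_adjustment(psi, lam, n_bins=20):
    """Smoothed log |cov(lam | psi)|^(1/2) from joint posterior samples.

    psi: (N,) samples of the interest parameter; lam: (N,) or (N, p) nuisance samples.
    Returns quadratic coefficients in psi, highest power first.
    """
    psi = np.asarray(psi, dtype=float)
    lam = np.asarray(lam, dtype=float).reshape(len(psi), -1)
    edges = np.quantile(psi, np.linspace(0.0, 1.0, n_bins + 1))   # quantile bins
    which = np.clip(np.searchsorted(edges, psi, side="right") - 1, 0, n_bins - 1)
    marks, half_logdet = [], []
    for b in range(n_bins):
        in_bin = which == b
        cov = np.atleast_2d(np.cov(lam[in_bin], rowvar=False))
        marks.append(psi[in_bin].mean())                  # classmark: within-bin mean (assumed)
        half_logdet.append(0.5 * np.linalg.slogdet(cov)[1])
    return np.polyfit(marks, half_logdet, 2)

# Toy joint samples with known answer: log sd(lam | psi) = 0.1 + 0.3 * psi.
rng = np.random.default_rng(2)
psi = rng.normal(size=200_000)
lam = rng.normal(size=200_000) * np.exp(0.1 + 0.3 * psi)
coef = log_adjustment(psi, lam)
```

An estimate of the log PL on a grid is then the log posterior density of psi minus `np.polyval(coef, grid)`, up to an additive constant.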

Red right: MCMC posterior, original parametrization. Red left (dashed): after above adjustment. Black: PL computed by quadrature.

What should be the meaning of ‘orthogonal’ parameter for use in the APL? We said earlier that the constrained MLE $\hat\lambda_\psi$ of the nuisance parameter should vary slowly with $\psi$, which is related to the more usual definition that the (expected) cross-information terms are zero. But if $\lambda$ satisfies this definition then so does any 1-1 transformation of it --- very unsatisfactory. Further, this could not be a requirement for validity of APL, since linear transformations leave the APL unchanged even though not conforming at all to such requirements. This suggests more difficulties than first thought in utilizing plots such as on slide 12 for such purposes.
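Both remarks can be made concrete for a scalar nuisance parameter. A sketch, writing $i$ for expected and $j$ for observed information, taking the adjustment factor to be $|j_{\lambda\lambda}(\psi,\hat\lambda_\psi)|^{-1/2}$, and using a function $g$ and constants $a \neq 0$, $b$ that are illustrative only:

```latex
\begin{align*}
\text{(i) } \nu = g(\lambda):\quad
  & i_{\psi\nu} = i_{\psi\lambda}\,\frac{d\lambda}{d\nu} = 0
    \ \text{ whenever } i_{\psi\lambda} = 0, \\
  & \bigl|j_{\nu\nu}(\psi,\hat\nu_\psi)\bigr|^{-1/2}
    = \bigl|j_{\lambda\lambda}(\psi,\hat\lambda_\psi)\bigr|^{-1/2}\,
      \bigl|g'(\hat\lambda_\psi)\bigr|, \\
\text{(ii) } \nu = a\lambda + b\psi:\quad
  & i_{\psi\nu} = \frac{1}{a}\Bigl(i_{\psi\lambda} - \frac{b}{a}\,i_{\lambda\lambda}\Bigr), \\
  & \bigl|j_{\nu\nu}(\psi,\hat\nu_\psi)\bigr|^{-1/2}
    = |a|\,\bigl|j_{\lambda\lambda}(\psi,\hat\lambda_\psi)\bigr|^{-1/2}.
\end{align*}
```

In (i) orthogonality is preserved, yet the adjustment picks up the factor $|g'(\hat\lambda_\psi)|$, which varies with $\psi$ unless $g$ is linear. In (ii) with $b \neq 0$ orthogonality is destroyed, yet the adjustment changes only by the constant $|a|$.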

There is in principle a reparametrization such that MPL and IL agree (related to Severini, 2007 Biometrika). The constrained MLE $\hat\lambda_\psi$ can be thought of as a function of $(\psi;\hat\psi,\hat\lambda)$ if the MLE is sufficient, otherwise of $(\psi;\hat\psi,\hat\lambda,a)$ with $a$ an ancillary. If $\hat\lambda$ there is taken as a variable, this defines a new nuisance parameter representation. This representation of the NP depends on $\hat\psi$ or on $(\hat\psi,a)$ --- no real problem for Bayesian methods. Define the new nuisance parameter as the inverse function, solving that equation for the variable put in place of $\hat\lambda$. Then the MPL is the Laplace approximation to the integrated likelihood based on this representation of the nuisance parameter.

Theory for this: the Laplace approximations in the original and the new parametrizations of the nuisance parameter differ only by a Jacobian factor, and we are matching that Jacobian with the final factor of the MPL. Actually we need only derivatives. The difficulty in all this is in utilizing, for likelihood, variations in the MLE while holding fixed a suitable ancillary “a”. Roughly speaking, a suitable ancillary is the ratio of observed to expected information.
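One way to read the matching argument, for a scalar nuisance parameter. This is a sketch under assumptions: the function name $g$ and the symbol $\nu$ are introduced here for illustration, the MPL is taken in its usual three-factor form $L_P\,|j_{\lambda\lambda}|^{-1/2}\,|\partial\hat\lambda_\psi/\partial\hat\lambda|^{-1}$, and $\hat\psi$ and the ancillary $a$ are held at their observed values.

```latex
\begin{align*}
\hat\lambda_\psi &= g(\psi;\hat\psi,\hat\lambda,a)
  && \text{constrained MLE as a function of the data}, \\
\lambda &= g(\psi;\hat\psi,\nu,a)
  && \text{defines } \nu \text{ implicitly, with } \hat\nu_\psi = \hat\lambda
     \text{ for every } \psi, \\
\left.\frac{\partial\lambda}{\partial\nu}\right|_{\nu=\hat\lambda}
  &= \frac{\partial\hat\lambda_\psi}{\partial\hat\lambda}, \\
\int L\bigl(\psi,\lambda(\psi,\nu)\bigr)\,d\nu
  &\approx \sqrt{2\pi}\,L_P(\psi)\,\bigl|j_{\nu\nu}(\psi,\hat\nu_\psi)\bigr|^{-1/2} \\
  &= \sqrt{2\pi}\,L_P(\psi)\,\bigl|j_{\lambda\lambda}(\psi,\hat\lambda_\psi)\bigr|^{-1/2}
     \left|\frac{\partial\hat\lambda_\psi}{\partial\hat\lambda}\right|^{-1}.
\end{align*}
```

The last equality uses $j_{\nu\nu} = j_{\lambda\lambda}\,(\partial\lambda/\partial\nu)^2$ at the constrained maximum. Only the derivative of $g$ in its third argument at the observed $\hat\lambda$ enters, which is consistent with the remark that only derivatives are needed.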

Ex: Two exponential samples, with the nuisance parameter reparametrized orthogonally in terms of the two means. The constrained MLE of the nuisance parameter then provides the corresponding parametric function. Set this equal to the new nuisance parameter and solve for the inverse. Then, up to Laplace approximation, the MPL is the IL for the resulting nuisance parameter representation. Figure: log PL, with log ACL and MCMC posterior for the “obvious” orthogonal parametrization; for this example MPL = PL.
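The formulas for this example were lost, so the following check rests on an assumed setup: two exponential samples of equal size n with means nu/sqrt(psi) and nu*sqrt(psi), so that psi is the ratio of means and nu is orthogonal to it. The sample size and totals are hypothetical. Under these assumptions the sketch confirms numerically that the ACL differs from the PL while the MPL coincides with it up to a constant.

```python
import numpy as np

# Assumed setup (not recoverable from the slide): two exponential samples of equal
# size n, means nu/sqrt(psi) and nu*sqrt(psi), so psi is the ratio of means.
n, s1, s2 = 8, 5.0, 14.0                      # hypothetical sample size and sample totals
psi = np.linspace(0.5, 15.0, 400)

psi_hat = s2 / s1                             # overall MLE of psi
nu_hat = np.sqrt(s1 * s2) / n                 # overall MLE of nu
nu_hat_psi = (s1 * np.sqrt(psi) + s2 / np.sqrt(psi)) / (2 * n)   # constrained MLE of nu

log_pl = -2 * n * np.log(nu_hat_psi) - 2 * n
j_nunu = 2 * n / nu_hat_psi**2                # observed information for nu at nu_hat_psi
log_acl = log_pl - 0.5 * np.log(j_nunu)
# nu_hat_psi = nu_hat * (sqrt(psi/psi_hat) + sqrt(psi_hat/psi)) / 2, so holding psi_hat fixed:
d_nuhatpsi_d_nuhat = 0.5 * (np.sqrt(psi / psi_hat) + np.sqrt(psi_hat / psi))
log_mpl = log_acl - np.log(d_nuhatpsi_d_nuhat)

mpl_minus_pl = log_mpl - log_pl               # constant in psi
acl_minus_pl = log_acl - log_pl               # varies with psi
```

Here the extra MPL factor exactly undoes the information adjustment, which matches the slide's remark that MPL = PL for this example.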

Our MCMC example is not very suitable for investigating all this --- MPL is (again) very near the PL. When the likelihood is intractable, or when the MLE is not sufficient, can we use the MCMC to approximate the MPL? Is it better to approximate the reparametrization for which IL = MPL, or better to compute the required Jacobian more directly? An issue is whether there can, in principle, be enough information in the likelihood, or posterior samples, to approximate the MPL. Can we tell from the posterior samples how the joint distribution would change for slightly different data?

There is yet another parametrization such that locally the nuisance parameter becomes a translation parameter. In this parametrization the answer to that question is “yes”. An aim is to capitalize on this without solving for that new parametrization, perhaps taking advantage of the fact that the product of the final two terms in the MPL is invariant to reparametrization. We have had some success for a single nuisance parameter, but there remains much to do.
