Bayesian statistics –MCMC techniques How to sample from the posterior distribution
Bayes formula – the problem term Bayes theorem: Seems straight forward. You multiply the likelihood (which you know) with your prior (which you have specified). So far, so analytic. But, the integral may not be analytically tractable. Solutions: Numerical integration or Monte Carlo-integration of the integral ... or sampling from the posterior distribution using Markov chain Monte Carlo (MCMC).
Normalization Bayes theorem: • Note that the problematic integral is simply a normalization factor. The result of that integral contains no function dependency on the parameters, . Normalization is that constant in a probability density function which makes the probabilities integrate to one. • The functional form of the posterior distribution depends only on the two terms in the denominator, the likelihood f(D| ) and the prior f(). • If we can find a way of sampling from a distribution without knowing it’s normalization constant, we can use those samples to tell us about all properties of the posterior distribution (mean, variance, covariance, quantiles etc.) • How to sample from a distribution f(x)=g(x)/A without knowing A?
Markov chains Organize a series of stochastic variables, X1,...,Xn like this: Each variable depends on it’s past only through the nearest past. f(xt|xt-1,...,x1)=f(xt|xt-1). This simplifies the model for the combined series a lot. f(xt,xt-1,...,x1)= f(xt|xt-1,...,x1) f(xt|xt-1,...,x1)...f(x1). Markov chains can be very practical for modelling data with some time-dependecy. Example: The autoregressive model in the “water temperature” series. … … X1 X2 X3 Xi-1 Xi Xi+1 Xn
Markov chains –stationary distribution X1 X2 X3 X4 X5 X6 Each element in the chain will have a distribution. By simulating the chain multiple times and plotting the histograms, you get a sense of this. Sometimes, a Markov chain will converge towards a fixed distribution, the stationary distribution. Ex: Random walk with reflective boundaries (-1,1). Xi=Xi-1+ i, iN(0,2), if Xi<-1 then set Xi to -1-(Xi+1), if Xi>1 then set Xi to 1-(Xi-1). While we sampled from the normal distribution, we get something that is uniformly distributed. Even if we did not know how to sample from the uniform directly, we could do it indirectly using this Markov chain. Idea: Make a Markov chain that converges towards the posterior, f(|D).
MCMC – why? i=i* accept Propose a new value from the previous values, i*,f(i*| i-1) • Make a Markov chains timeseries of parameter samples, (1, 2, 3,…), which has stationary distribution f(|D), in the following way (Metropolis-Hastings): • Acceptance with probability: • We don’t need the troublesome normalization constant, f(D). We can drop it! • We may even drop further normalization constants in the likelihood, prior and proposal density. • Solves the normalization problem in Bayesian statistics! i-2 i-1 reject i=i-1
Everything solved? What about the proposal distribution? • Almost every proposal distribution, i*,f(i*| i-1), will do. There’s a lot of (too much?) freedom here. • Can restrict ourselves to go through each the parameters, one at a time. For =((1), (2)), propose a change in (1), accept/reject, then do the same for (2). • Pros: Simpler. Cons: Great parameter dependency means little change. • Special case: A Gibbs sample of a parameter comes from it’s distribution conditioned on the data and all the rest of the parameters. • The proposal is now specified by the model, so no need to choose anything. Automatic methods possible, where we only need to specify the model (WinBUGS). • Make a Markov chains timeseries of parameter samples, (1, 2, 3,…), which has stationary distribution f(|D), in the following way (Metropolis-Hastings): • Acceptance with probability: • We don’t need the troublesome normalization constant, f(D). We can drop it! • We may even drop further normalization constants in the likelihood, prior and proposal density. • Solves the normalization problem in Bayesian statistics! (1) f((1), (2)|D) (2)
MCMC - convergence Sampling using MCMC means that if we keep doing it for enough time, we get a sample from the posterior distribution. Question 1: How much is “enough time”? Markov chains will not converge immediately. We need ways of testing whether convergence has occurred or not. • Start many chains (starting with different initial positions in parameter space). When do they start to mix? Compare plots of the parameter values or of the likelihoods. • Gelman’s ANOVA test for MCMC samples. • Check if we are getting the sample results if we rerun the analysis from different starting points. After determining the point of convergence, we discard the samples from before that (burn-in) and only keep samples from later than this point.
MCMC - dependency Question 2: How many samples do we need? In a Markov chain, one sample depends on the previous one. How will that affect the precision of the total sample and how should we relate to that? • Look at the timeseries plot of a single chain. Do we see dependency? • Estimate auto-correlation -> effective number of samples. • Keep only each k’th sample, so that the dependency is almost gone. We can then get a target number of (almost) independent samples. If we have a set of independent samples, we can then use standard sampling theory to say something about the precision of means and covariances, for instance. Autocorrelation plot. Keep each sample vs keep each 600th sample.
MCMC – causes for concern • High posterior dependency between parameters: Gives high dependency between samples and thus low efficiency. Look at scatterplots and to see if this is the case. • Multimodality – posterior density has many “peaks”. Chains starting from different places “converge” to different places. Greatly reduced mixing. Usual solution within WinBUGS: Re-parametrize or sample a lot!