Bayesian inference Gil McVean, Department of Statistics Monday 17th November 2008
Questions to ask… • What is likelihood-based inference? • What is Bayesian inference and why is it different? • How do you estimate parameters in a Bayesian framework? • How do you choose a suitable prior? • How do you compare models in Bayesian inference?
A recap on likelihood • For any model the maximum information about model parameters is obtained by considering the likelihood function • The likelihood function is proportional to the probability of observing the data given a specified parameter value • The likelihood principle states that all information about the parameters of interest is contained in the likelihood function
An example • Suppose we have data generated from a Poisson distribution. We want to estimate the parameter $\lambda$ of the distribution • The probability of observing a particular value $x$ of the random variable is $P(X = x \mid \lambda) = e^{-\lambda}\lambda^{x}/x!$ • If we have observed a series of iid Poisson RVs $x_1, \dots, x_n$ we obtain the joint likelihood by multiplying the individual probabilities together: $L(\lambda) = \prod_{i=1}^{n} e^{-\lambda}\lambda^{x_i}/x_i!$
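A minimal Python sketch of the joint Poisson log-likelihood described above. The function name `poisson_log_likelihood` and the example counts are illustrative assumptions, not taken from the slides.

```python
import math

def poisson_log_likelihood(lam, counts):
    """Log of the joint Poisson likelihood for iid counts at rate lam."""
    # Each term is log(e^-lam * lam^x / x!) = -lam + x*log(lam) - log(x!)
    return sum(-lam + x * math.log(lam) - math.lgamma(x + 1) for x in counts)

counts = [3, 5, 4]               # hypothetical data, for illustration only
mle = sum(counts) / len(counts)  # for the Poisson, the MLE is the sample mean
```

Working on the log scale turns the product of probabilities into a sum, which avoids numerical underflow for long series.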
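Relative likelihood • We can compare the evidence for different parameter values through their relative likelihood • For example, suppose we observe counts of 12, 22, 14 and 8 from a Poisson process • The maximum likelihood estimate is the sample mean, 14. The relative likelihood is given by $L(\lambda)/L(\hat\lambda) = e^{-n(\lambda - \hat\lambda)}(\lambda/\hat\lambda)^{\sum_i x_i}$, where $\lambda$ is the Poisson parameter, $n = 4$, $\hat\lambda = 14$ and $\sum_i x_i = 56$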
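The relative likelihood for these four counts can be sketched in a few lines of Python. The function name `relative_likelihood` is an illustrative choice; the factorial terms cancel in the ratio, so they do not appear.

```python
import math

counts = [12, 22, 14, 8]          # counts quoted on the slide
n, total = len(counts), sum(counts)
lam_hat = total / n               # Poisson MLE = sample mean

def relative_likelihood(lam):
    """L(lam) / L(lam_hat) for iid Poisson counts; the x! terms cancel."""
    return math.exp(-n * (lam - lam_hat) + total * math.log(lam / lam_hat))
```

By construction the function equals 1 at the MLE and falls below 1 on either side of it.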
Maximum likelihood estimation • The maximum likelihood estimate (MLE) is the set of parameter values that maximise the probability of observing the data we got • The MLE is consistent in that it converges to the truth as the sample size gets infinitely large • The MLE is asymptotically efficient in that it achieves the minimum possible variance (the Cramér-Rao Lower Bound) as n→∞ • However, the MLE is often biased for finite sample sizes • For example, the MLE for the variance parameter in a normal distribution is the sample variance with divisor n rather than n−1, which underestimates the true variance on average
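The finite-sample bias of the variance MLE can be checked by simulation. The sketch below assumes standard normal data, a sample size of 5 and a fixed random seed, none of which come from the slides.

```python
import random

def mle_variance(xs):
    """MLE of the normal variance: squared deviations divided by n, not n-1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

random.seed(1)
n, reps = 5, 20000                # assumed sample size and number of replicates
avg = sum(
    mle_variance([random.gauss(0.0, 1.0) for _ in range(n)])
    for _ in range(reps)
) / reps
expected = (n - 1) / n            # theoretical mean of the MLE when the true variance is 1
```

The average of the MLE across replicates should sit close to `expected` (0.8 here) rather than the true variance of 1.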
Confidence intervals and likelihood • Thanks to the CLT there is another useful result that allows us to define confidence intervals from the log-likelihood surface • Specifically, for a single parameter, the set of values for which the log-likelihood is no more than 1.92 below its maximum defines an approximate 95% confidence interval • In the limit of large sample size the likelihood ratio test statistic is approximately chi-squared distributed under the null; 1.92 is half of 3.84, the 95% point of the chi-squared distribution with 1 degree of freedom • This is a very useful result, but shouldn’t be assumed to hold • i.e. check with simulation
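As a rough illustration of the 1.92 rule, the sketch below scans a grid of Poisson rates for the counts 12, 22, 14 and 8 used in the relative likelihood slide. The grid range and step size are arbitrary assumptions.

```python
import math

counts = [12, 22, 14, 8]
n, total = len(counts), sum(counts)
lam_hat = total / n

def log_lik_drop(lam):
    """Maximum log-likelihood minus the log-likelihood at lam (always >= 0)."""
    return n * (lam - lam_hat) - total * math.log(lam / lam_hat)

# Scan rates from 0.01 to 40 in steps of 0.01 (assumed grid)
grid = [i / 100 for i in range(1, 4001)]
inside = [lam for lam in grid if log_lik_drop(lam) <= 1.92]
lower, upper = min(inside), max(inside)   # approximate 95% interval
```

The resulting interval is not symmetric about the MLE, which is one way it differs from an estimate plus or minus two standard errors.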
Likelihood ratio tests • Suppose we have two models, H0 and H1, in which H0 is a special case of H1 • We can compare the likelihood of the MLEs for the two models • Note the likelihood under H1 can be no worse than under H0 • Theory shows that if H0 is true, then twice the difference in log-likelihood is asymptotically χ² distributed with degrees of freedom equal to the difference in the number of parameters between H0 and H1 • The likelihood ratio test statistic is $\Lambda = 2[\ell(\hat\theta_1) - \ell(\hat\theta_0)]$, where $\ell$ is the log-likelihood and $\hat\theta_0$, $\hat\theta_1$ are the MLEs under H0 and H1 • Theory also tells us that if H1 is true, then the likelihood ratio test is the most powerful test for discriminating between H0 and H1 • Useful, though perhaps not as useful as it sounds
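A small worked sketch of a likelihood ratio test, reusing the Poisson counts 12, 22, 14 and 8 from the relative likelihood slide. The null value of 10 is a hypothetical choice made only for illustration; H1 leaves the rate free, so the test has 1 degree of freedom.

```python
import math

counts = [12, 22, 14, 8]
n, total = len(counts), sum(counts)
lam_hat = total / n               # MLE under H1 (rate free)
lam_null = 10.0                   # hypothetical H0 value, not from the slides

def log_lik(lam):
    """Poisson log-likelihood up to an additive constant (the x! terms)."""
    return -n * lam + total * math.log(lam)

lrt_stat = 2.0 * (log_lik(lam_hat) - log_lik(lam_null))
# Upper tail of a chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lrt_stat / 2.0))
```

The `erfc` identity holds only for 1 degree of freedom; other cases need a general chi-squared tail function.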
Criticisms of the frequentist approach • The choice between models using P-values is focused on rejecting the null rather than proving the appropriateness of the alternative • Representing uncertainty through the use of confidence intervals is messy and unintuitive • Cannot say that the probability of the true parameter being within the interval is 0.95 • The frequentist approach requires a predefined experimental approach that must be followed through to completion (at which point data are analysed) • Bayesian inference naturally adapts to interim analysis, changes in stopping rules, combining data from different sources • Focusing on point estimation leads to models that are ‘over-fitted’ to data
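Bayesian estimators • Bayesian statistics aims to make statements about the probability attached to different parameter values given the data you have collected • It makes use of Bayes’ theorem: $P(\theta \mid D) = P(D \mid \theta)\,P(\theta) / P(D)$, where $P(\theta)$ is the prior, $P(D \mid \theta)$ the likelihood, $P(\theta \mid D)$ the posterior and $P(D)$ the normalising constant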
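Bayes' theorem is easiest to see on a discrete set of parameter values. The sketch below assumes three candidate values for a coin's probability of heads, a uniform prior and a single observed head; all of these are illustrative assumptions.

```python
def posterior(prior, likelihood):
    """Posterior = prior * likelihood, divided by the normalising constant."""
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalised)          # normalising constant P(data)
    return [u / evidence for u in unnormalised]

thetas = [0.25, 0.5, 0.75]                # hypothetical candidate values of P(head)
prior = [1 / 3, 1 / 3, 1 / 3]             # uniform prior over the candidates
likelihood = thetas                       # P(one head | theta) is theta itself
post = posterior(prior, likelihood)
```

With a uniform prior the posterior is simply the likelihood rescaled to sum to one, so the largest candidate value receives half of the posterior mass here.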
Are parameters random variables? • The single most important conceptual difference between Bayesian statistics and frequentist statistics is the notion that the parameters you are interested in are themselves random variables • This notion is encapsulated in the use of a subjective prior for your parameters • Remember that to construct a confidence interval we have to define the set of possible parameter values • A prior does the same thing, but also gives a weight to different values
Example: coin tossing • I toss a coin twice and observe two heads • I want to perform inference about the probability of obtaining a head on a single throw for the coin in question • The MLE of the probability is 1.0 – yet I have a very strong prior belief that the answer is 0.5 • Bayesian statistics forces the researcher to be explicit about prior beliefs but, in return, can be very specific about what information has been gained by performing the experiment • It also provides a natural way for combining data from different experiments
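One way to make the coin-tossing example concrete is a conjugate Beta prior, under which the posterior is again a Beta distribution. The Beta(50, 50) prior standing in for a "very strong" belief in 0.5 is an assumption of this sketch, as are the function names.

```python
def beta_posterior(a, b, heads, tails):
    """Beta(a, b) prior plus binomial data gives a Beta(a + heads, b + tails) posterior."""
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Assumed strong prior centred on 0.5, versus a flat Beta(1, 1) prior
strong = beta_posterior(50, 50, heads=2, tails=0)
flat = beta_posterior(1, 1, heads=2, tails=0)
```

Under the strong prior two heads barely move the posterior mean away from 0.5, whereas under the flat prior it moves to 0.75; neither reproduces the MLE of 1.0.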
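The posterior • Bayesian inference about parameters is contained in the posterior distribution • The posterior can be summarised in various ways, e.g. by the posterior mean or a credible interval • [Figure: prior and posterior densities, with the posterior mean and a credible interval marked]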
Choosing priors • A prior reflects your belief before the experiment • This might be relatively unfocused • Uniform distributions in the case of single parameters • Jeffreys prior (and other ‘uninformative’ priors) • Or might be highly focused • In the coin-tossing experiment, most of my prior would be on P=0.5 • In an association study, my prior on a SNP being causal might be 1/10^7
Using posteriors • Posterior summary to provide statements about point estimates and certainty • Posterior prediction to make statements about future events • Posterior predictive simulation to check the fit of the model to data
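Posterior prediction can be sketched by simulation: draw a parameter from the posterior, then simulate a future observation given that draw. The example below assumes a Beta(52, 50) posterior for a coin (a Beta(50, 50) prior updated with two heads); the prior, seed and number of draws are illustrative assumptions.

```python
import random

random.seed(0)
a_post, b_post = 52, 50           # assumed posterior: Beta(50, 50) prior plus two heads
draws = 20000
heads = 0
for _ in range(draws):
    p = random.betavariate(a_post, b_post)   # draw a parameter value from the posterior
    heads += random.random() < p             # simulate one future toss given that value
pred_heads = heads / draws        # Monte Carlo estimate of P(next toss is a head)
```

For a single future toss the predictive probability equals the posterior mean, so the simulation can be checked against 52/102.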
Bayes factors • Bayes factors can be used to compare the evidence for different models • These do not need to be nested • Bayes factors generalise the likelihood ratio by integrating the likelihood over the prior • Importantly, if model 2 is a subset of model 1, it does not follow that the Bayes factor in favour of model 1 is necessarily greater than 1 • The subspace of model 1 that improves the likelihood may be very small, and the extra parameter carries an extra cost • It is generally accepted that a BF of 3 is worth mention, a BF of 10 is strong evidence and a BF of 100 is decisive (Jeffreys)
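A minimal sketch of a Bayes factor for a coin, comparing M1 (probability of heads given a Beta prior) against the nested model M0 (probability fixed at 0.5). The uniform default prior and the function names are assumptions; the marginal likelihood under M1 has a closed form via the beta function.

```python
import math

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bayes_factor(heads, tails, a=1.0, b=1.0):
    """BF for M1 (p ~ Beta(a, b)) against M0 (p = 0.5) for a given toss sequence."""
    # Marginal likelihood under M1: integral of p^heads (1-p)^tails over the prior
    log_m1 = log_beta(a + heads, b + tails) - log_beta(a, b)
    log_m0 = (heads + tails) * math.log(0.5)
    return math.exp(log_m1 - log_m0)
```

With five heads and five tails the Bayes factor falls below 1 even though M0 is a special case of M1, because M1 spreads its prior over many values of p that fit the data poorly.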
Example • Consider the crossing data of Bateson and Punnett in which we want to estimate the recombination fraction • I will use a beta prior for the recombination fraction with parameters 3 and 7
• Conditional on the total sample (381), the likelihood function is described by the multinomial • We get the following posterior distribution [Figure: posterior density of the recombination fraction] • Posterior mean = 0.134 • Posterior mode = 0.13 • 95% equal-tailed probability interval (ETPI) = 0.10 – 0.16 • Comparing the model to one in which r = 0.5 gives a BF of 3.9
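A grid-approximation sketch of this posterior. The slides do not list the counts or the genetic model, so both are assumptions here: the classic Bateson and Punnett sweet pea F2 counts (284, 21, 21, 55; total 381) and a coupling-phase model with cell probabilities (2+θ)/4, (1−θ)/4, (1−θ)/4, θ/4 where θ = (1−r)². The Beta(3, 7) prior is the one stated on the slide.

```python
import math

counts = (284, 21, 21, 55)        # assumed F2 phenotype counts, total 381

def log_posterior(r, a=3.0, b=7.0):
    """Unnormalised log posterior of the recombination fraction r."""
    theta = (1.0 - r) ** 2        # assumed coupling-phase parameterisation
    probs = ((2 + theta) / 4, (1 - theta) / 4, (1 - theta) / 4, theta / 4)
    log_lik = sum(k * math.log(p) for k, p in zip(counts, probs))
    log_prior = (a - 1) * math.log(r) + (b - 1) * math.log(1 - r)
    return log_lik + log_prior

grid = [(i + 0.5) / 2000 for i in range(2000)]        # midpoints on (0, 1)
log_w = [log_posterior(r) for r in grid]
peak = max(log_w)
weights = [math.exp(v - peak) for v in log_w]         # rescale to avoid underflow
norm = sum(weights)
weights = [w / norm for w in weights]

post_mean = sum(r * w for r, w in zip(grid, weights))
post_mode = grid[log_w.index(peak)]

# Equal-tailed 95% interval from the cumulative posterior mass
cum, lower, upper = 0.0, None, None
for r, w in zip(grid, weights):
    cum += w
    if lower is None and cum >= 0.025:
        lower = r
    if upper is None and cum >= 0.975:
        upper = r
```

If the assumed counts and model match those behind the slide, the summaries should land near the quoted values; any discrepancy would point to a difference in those assumptions rather than in the method.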
Bayesian inference and the notion of shrinkage • The notion of shrinkage is that you can obtain better estimates by assuming a certain degree of similarity among the things you want to estimate and a lack of complexity • Practically, this means three things • Borrowing information across observations • Penalising inferences that are very different from anything else • Penalising more complex models • The notion of shrinkage is implicit in the use of priors in Bayesian statistics • There are also forms of frequentist inference where shrinkage is used • But NOT MLE