
## Bayesian Methods and Subjective Probability

Daniel Thorburn

Stockholm University

2011-01-10

### Outline

• Background to Bayesian statistics

• Two simple rules

• Why not design-based?

• Bayes, Public statistics and sampling

• De Finetti theorem, Bayesian Bootstrap

• Preposterior analysis

• Statistics in science

• Complementary Bayesian methods

### 1. Background

• Mathematically:

• Probability is a positive, finite, normed, σ-additive measure defined on a σ-algebra

• But what does that correspond to in real life?

### What is the probability of heads in the following sequence? Does it change? And when?

• This is a fair coin

• I am now going to toss it in the corner

• I have tossed it but no one has seen the result

• I have got a glimpse of it but you have not

• I know the result but you don't

• I tell you the result

• Laplace's definition: all outcomes are equally probable if there is no information to the contrary. (Number of favourable elementary events / number of possible elementary events.)

• Choose heads and bet on it, with your neighbour. You get one krona if you are right and lose one if you are wrong. When should you change from indifference?

• Frequency interpretation (LLN). If there is an infinite sequence of independent experiments, then the relative frequency converges a.s. towards the true value. This cannot be used as a definition, for two reasons:

• It is circular: independence is defined in terms of probability

• It is logically impossible to define uncountably many different quantities by a countable procedure.

### Probabilities do not exist (de Finetti)

• They only describe your lack of knowledge

• If there is a God almighty, he knows everything now, in the past and in the future. ("God does not play dice", Einstein)

• But lack of knowledge is personal, thus probability is subjective

• Kolmogorov's axioms alone say nothing about the relation to reality

• Probability is the language which describes uncertainty

• If you do not know a quantity you should describe your opinion in terms of probability

• Probability is subjective and varies between persons and over time, depending on the background information.

### Rational behaviour one person

• Axiomatic foundations of probability, of the following type:

• For any two events A and B, exactly one of the following must hold: A ≺ B, A ≻ B or A ∼ B (read: B more likely than A, A more likely than B, equally likely)

• If A1, A2, B1 and B2 are four events such that A1 ∩ A2 = B1 ∩ B2 = ∅, A1 ≳ B1 and A2 ≳ B2, then A1 ∪ A2 ≳ B1 ∪ B2. If further A1 ≻ B1 or A2 ≻ B2, then A1 ∪ A2 ≻ B1 ∪ B2

• If these axioms hold, all events can be assigned probabilities that obey Kolmogorov's axioms (Villegas, Annals of Mathematical Statistics, 1964)

• Axioms for behaviour, of the following type:

• If you prefer A to B, and B to C then you must also prefer A to C

• If you want to behave rationally, then you must behave as if all events were assigned probabilities (Anscombe and Aumann, Annals of Mathematical Statistics, 1963)

• Axioms for probability (these six are enough to prove that a probability obeying Kolmogorov's axioms can be defined, together with the definition of conditional probability)

• For any two events A and B, exactly one of the following must hold: A ≺ B, A ≻ B or A ∼ B (read: B more likely than A, A more likely than B, equally likely)

• If A1, A2, B1 and B2 are four events such that A1 ∩ A2 = B1 ∩ B2 = ∅, A1 ≳ B1 and A2 ≳ B2, then A1 ∪ A2 ≳ B1 ∪ B2. If further A1 ≻ B1 or A2 ≻ B2, then A1 ∪ A2 ≻ B1 ∪ B2

• If A is any event, then A ≳ ∅ (the impossible, empty event)

• If Ai is a strictly decreasing sequence of events and B a fixed event such that Ai ≳ B for all i, then ∩i Ai ≳ B

• There exists one random variable which has a uniform distribution

• For any events A, B and D: (A|D) ≲ (B|D) if and only if A ∩ D ≲ B ∩ D

• Further, one needs some axioms about comparing outcomes (utilities) in order to be able to prove rationality

• For any two outcomes, A and B, one either prefers A to B or B to A or is indifferent

• If you prefer A to B, and B to C then you must also prefer A to C

• If P1 and P2 are two distributions over outcomes, they may be compared, and you are indifferent between A and the distribution with P(A) = 1

• Two measurability axioms like

• If A is any outcome and P a distribution, then the event that P gives an outcome preferred to A can be compared (in likelihood) to other events

• If P1 is preferred to P2 and A is an event with A ≻ ∅, then the game giving P1 if A occurs is preferred to the game giving P2 if A occurs, provided the results under not-A are the same

• If you prefer P1 to P and P to P2, then there exist numbers a > 0 and b > 0 such that P1 with probability 1−a and P2 with probability a is preferred to P, which in turn is preferred to P1 with probability b and P2 with probability 1−b

### There is only one type of number, which may be known or unknown.

• Classical inference has a mess of different types of numbers e.g.

• Parameters

• Latent variables like in factor analysis

• Random variables

• Observations

• Independent (explaining) variables

• Dependent variables

• Constants

• and so on

• Superstition!


### Rule 1

• What you know/believe in advance + The information in the data = What you know/believe afterwards

• This is described by Bayes Formula:

• P(θ|K) · P(X|θ,K) ∝ P(θ|X,K)

• or, in terms of the likelihood:

• P(θ|K) · L(θ|X) ∝ P(θ|X,K)


### Rule 1 corollary

• What you believe afterwards + the information in a new study = What you believe after both studies

• The result of the inference should be possible to use as an input to the next study

• It should thus be of the same form!

• Note that hypothesis testing and confidence intervals can never appear on the left-hand side, so they do not follow Rule 1
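Rule 1 and its corollary are easy to check numerically. A minimal sketch, assuming a hypothetical three-point prior for a coin's heads probability: one batch update on all the data gives the same posterior as using the first study's posterior as the prior for the second.

```python
from math import comb

# Hypothetical three-point prior over a coin's heads probability theta
prior = {0.25: 1/3, 0.50: 1/3, 0.75: 1/3}

def update(p, heads, tosses):
    """Rule 1: prior(theta) * likelihood(theta), then normalise -> posterior."""
    post = {th: w * comb(tosses, heads) * th**heads * (1 - th)**(tosses - heads)
            for th, w in p.items()}
    z = sum(post.values())
    return {th: w / z for th, w in post.items()}

# One analysis of all data ...
batch = update(prior, heads=7, tosses=10)
# ... equals using the first posterior as the prior for the second study
sequential = update(update(prior, heads=3, tosses=5), heads=4, tosses=5)
```

The binomial constants cancel in the normalisation, which is why the split makes no difference.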


### Rule 2

• Your knowledge must be given in a form that can be used for deciding actions. (At least in a well-formulated problem with well-defined losses/utility.)

• If you are rational, you must use the rule which minimizes expected losses (maximizes utility)

• D_opt = argmin_D E(Loss(D, θ) | X, K)

= argmin_D ∫ Loss(D, θ) P(θ | X, K) dθ

• Note that classical design-based inference has no interface with decisions.
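A minimal sketch of the decision rule above, with a made-up discrete posterior and a quadratic loss (both are illustrative assumptions, not from the slides); the integral becomes a sum:

```python
# Hypothetical decision problem: discrete posterior over theta, quadratic loss
posterior = {0: 0.2, 1: 0.5, 2: 0.3}        # P(theta | X, K)

def loss(d, theta):
    return (d - theta) ** 2

def expected_loss(d):
    # E(Loss(d, theta) | X, K) under the posterior
    return sum(loss(d, th) * p for th, p in posterior.items())

d_opt = min([0, 1, 2], key=expected_loss)    # argmin over the decisions
```

With quadratic loss the optimal decision tracks the posterior mean (here 1.1), so d_opt = 1.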

### Statistical tests are useless

• They cannot be used to combine with new data.

• They cannot be used even in simple decision problems.

• They can be compared to the blunt plastic knife given to a three-year-old child

• He cannot do much sensible with it

• But he cannot harm himself either

### 3. An example of the stupidity of frequency-based (design-based) methods

N = 4, n = 2, SRS. Dichotomous data, black or white. The variable is known to come in pairs, i.e. the total is T = 0, 2 or 4.

If you observe 1 white, you know for sure that the population contains 2 white. If you observe 0 or 2 white, the only unbiased estimate is T* = 0 resp. T* = 4.

The variance of this estimator is 4/3 if T = 2 (= 1/6 · 4 + 4/6 · 0 + 1/6 · 4) and 0 if T = 0 or 4

So when t = 1 and you know the true value for sure, the reported design-based variance is 4/3, and when t = 0 or 2 and you are genuinely uncertain, the reported design-based variance is 0. (The standard unbiased variance estimates are 2 resp. 0.)

### Bayesian analysis works OK

• We saw the Bayesian analysis when t = 1 (T* = 2).

• If all possibilities are equally likely a priori, the posterior estimate of T when t = 0 (resp. t = 2) is T* = 2/7 (resp. 26/7), and the posterior variance is 24/49.
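These numbers can be verified by direct enumeration with exact arithmetic, using the hypergeometric likelihood of an SRS:

```python
from fractions import Fraction as F
from math import comb

N, n = 4, 2
# Prior: the total T is 0, 2 or 4, all equally likely (values come in pairs)
prior = {T: F(1, 3) for T in (0, 2, 4)}

def lik(t, T):
    """Hypergeometric P(t white in an SRS of n | T white in the population)."""
    return F(comb(T, t) * comb(N - T, n - t), comb(N, n))

def posterior(t):
    post = {T: prior[T] * lik(t, T) for T in prior}
    z = sum(post.values())
    return {T: p / z for T, p in post.items()}

post0 = posterior(0)
mean0 = sum(T * p for T, p in post0.items())                # 2/7
var0 = sum((T - mean0) ** 2 * p for T, p in post0.items())  # 24/49
```

`math.comb` returns 0 when k > n, so impossible combinations drop out automatically.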

### Always stupid?

• It is always stupid to believe that the variance of an estimator is a measure of precision in one particular case. (It is defined as a long-run property over many repetitions)

• But it is not always so obvious and so stupid as in this example.

• Is this a consequence of the unusual prior that T must be even?

### Example without the prior info: still stupid, but not quite as much

If you observe t = 1, the true error is never larger than 1, but the standard deviation is larger than 1 for all possible parameter values.

### Always stupid?

• It is always stupid to assume that the variance of an estimator is a measure of precision in one particular case. (It is defined as a long-run property over many repetitions)

• But it is not always so obvious and stupid as in these examples.

• Under suitable regularity conditions, design-based methods are asymptotically as efficient as Bayesian methods


• Many people say that one should choose the approach that is best for the problem at hand. Classical or Bayesian.

• So do Bayesians.

• But they also draw the conclusion:

• Always use Bayesian methods!

• Classical methods can sometimes be seen as quick and dirty approximations to Bayesian methods.

• Then you may use them.


### 3. What is special for many statistical surveys, e.g. public statistics?

The producer of the survey is not the user.

• Often many readers and many users.

• The producer has no interest in the figures per se

• P(θ | K_user) is not known to the producer and is not unique: P(θ | K_user) · L(θ | X) ∝ P(θ | X, K_user)

• Publish L(θ | X), so that any reader can plug in his own prior

• Usually given in the form of the posterior with a vague, uninformative (often = constant) prior

L(θ | X) ∝ P(θ | K0) · L(θ | X) ∝ P(θ | X, K0)

### Describing the likelihood

• Estimates are often asymptotically normal. Then it is enough to give the posterior mean and variance, or a (symmetric) 95% prediction interval (for large samples)

• When the maximum likelihood estimator is approximately efficient and normal, the ML estimate and the inverse Fisher information are enough (→ standard confidence interval)

• Asymptotically efficient → for large samples almost as good as Bayesian estimates, which are known to be admissible also for finite samples


### What is special for many statistical surveys, e.g. public statistics?

There is no parameter, or more exactly:

The parameter consists of all the N values of all the units in the population.

• Use this vector as the parameter θ in Bayes' formula.

• If you are interested in a certain function of it, e.g. the total Y_T, integrate out all nuisance parameters in the posterior to get the marginal of interest:

P(Y_T | X, K) = ∫ … ∫_{Σ θ_i = Y_T} P(θ_1, …, θ_N | X, K) ∏_{i=1}^{N−1} dθ_i

### 5. De Finetti's theorem

• Random variables are said to be exchangeable if there is no information in the ordering. This is for instance the case with SRS

• If a sequence of random variables is infinitely exchangeable, then they can be described as independent variables given θ, where θ is a latent random variable. (The proof is simple but needs some knowledge of random processes. Formally θ is defined on the tail σ-algebra)

• Latent means in this case that it does not exist, but can be useful when describing the distribution.

• This imaginary random variable can take the place of a parameter

• But note that it does not exist (is not defined) until the full infinite sequence has been defined and the full sequence will never be observed.

• Note also that most sequences in the real world are not independent but only exchangeable. If you toss a coin 1000 times and get 800 heads, it is more likely that the next toss will be heads (compared to the case with 200 heads).

• So obviously there is a dependence between the first 1000 tosses and the 1001st

### Dichotomous variables or The Polya Urn scheme

• In an urn there is one white and one black ball.

• Draw one ball at random. Note its colour.

• Put it back together with one more ball of the same colour

• Draw one at random

• This sequence can be shown to be exchangeable, and by de Finetti's theorem it can be described as:

• Take θ ~ U(0,1) = Beta(1,1)

• Draw balls independently with this probability of being white

• There is no way to see the difference between a Bernoulli sequence (binomial distribution) with an unknown p and a Polya urn scheme. Since the outcomes follow the same distribution there cannot exist any test to differentiate between them.
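The equivalence claimed above can be checked exactly: under the urn scheme every reordering of a sequence has the same probability, and that probability equals the Beta(1,1) mixture of Bernoulli sequences.

```python
from fractions import Fraction as F
from math import factorial

def polya_prob(seq):
    """Exact probability of a white/black sequence ('w'/'b') under the
    Polya urn starting from one white and one black ball."""
    white, black, p = 1, 1, F(1)
    for c in seq:
        if c == "w":
            p *= F(white, white + black)
            white += 1
        else:
            p *= F(black, white + black)
            black += 1
    return p

def mixture_prob(seq):
    """Same sequence under the de Finetti mixture: independent draws given
    theta ~ U(0,1); the integral of theta^k (1-theta)^(n-k) is k!(n-k)!/(n+1)!."""
    k, n = seq.count("w"), len(seq)
    return F(factorial(k) * factorial(n - k), factorial(n + 1))
```

Exchangeability shows up as order-invariance; the matching values confirm that no test can separate the two descriptions.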

### Dichotomous variables or The Polya Urn scheme

• We could have started with another number of balls. This would have given other parameters in the prior Beta distribution

• Beta(a, b): a white balls and b black balls

• E(Beta) = a / (a + b)

• Var(Beta) = ab / ((a + b)²(a + b + 1))

### Dichotomous variables or The Polya Urn scheme

• This can be used to derive the posterior distribution of the number (y_T) of balls/persons with a certain property (white) in a population, given an observed SRS sample of size n with y_S white balls/persons.

• Use a prior with parameters so that the expected value is your best guess of the unknown proportion and the standard deviation describes your uncertainty about it.

### Properties

• The posterior distribution can be shown to be beta-binomial: given y_S white in the sample, the number of white among the N − n unsampled units is beta-binomial with parameters (a + y_S, b + n − y_S).

• With both prior parameters set to 0, the expected value is Np* and the variance is p*(1 − p*)N(N − n)/(n + 1), where p* = y_S/n.

• The design-based estimate and variance estimator are good approximations to this (equal, apart from n in place of n + 1)

### Simulation

• It is often easier to simulate from the posterior than to give its exact form.

• In this case the urn scheme gives a simple way to simulate the full population. Just continue with the Polya sampling scheme starting from the sample

• If you repeat this 1000 times, say, and plot the 1000 simulated population totals in a histogram, you will get a good description of the distribution of the unknown quantity
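A sketch of this simulation for dichotomous data (the sample below is hypothetical):

```python
import random

def polya_complete(sample, N, rng):
    """Continue the Polya scheme from the observed sample (list of 0/1)
    until the whole population of size N is filled in; return the total."""
    white, total = sum(sample), len(sample)
    while total < N:
        if rng.random() < white / total:   # draw a ball from the current "urn"
            white += 1                     # put it back with one of the same colour
        total += 1
    return white

rng = random.Random(42)
sample = [1, 1, 0, 1, 0]                   # hypothetical SRS, n = 5, 3 white
totals = [polya_complete(sample, N=100, rng=rng) for _ in range(1000)]
# A histogram of `totals` approximates the posterior of the population total.
```

Every simulated total is consistent with the sample (between 3 and 98 white here), and the spread reflects the genuine posterior uncertainty.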

### Dirichlet-multinomial

• If the distribution is discrete with a finite number of categories, a similar procedure is possible

• Just draw from the set of all observations and put it back together with a similar observation. Continue until the population size N is reached.

• Repeat and you get a number of populations which are drawn from the posterior distribution.

• For each population compute the parameter of interest, e.g. the mean or median, and plot the values in a histogram

• If this is described as in de Finetti's theorem, the parameter comes from a Dirichlet distribution and the observations are conditionally independent multinomial.

### The Bayesian Bootstrap

• This procedure is called the Bayesian bootstrap (if an uninformative prior, i.e. all parameters = 0, is used)

• This can be generalised to variables measured on a continuous scale

• The design-based estimate gives the same mean estimate as this (for polytomous populations).

• The design-based variance estimator is also close to the true variance apart from a factor n/(n+1)

• Note that if the distribution is skewed, this method does not work well, since it does not use the prior information about skewness (nor do the design-based methods)

• Note also that with many categories it may be better to use even smaller parameters e.g. 0.9.
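For variables on a continuous scale, the same idea is usually implemented with Dirichlet(1, …, 1) weights over the observed values (Rubin's Bayesian bootstrap). A sketch:

```python
import random

def bayesian_bootstrap_means(data, reps=2000, seed=7):
    """Rubin's Bayesian bootstrap: each replicate draws Dirichlet(1,...,1)
    weights (normalised standard exponentials) over the observations and
    records the weighted mean, giving draws from the posterior of the mean."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        w = [rng.expovariate(1.0) for _ in data]
        s = sum(w)
        out.append(sum(wi * x for wi, x in zip(w, data)) / s)
    return out

draws = bayesian_bootstrap_means([1.0, 2.0, 3.0, 4.0, 5.0])
```

The histogram of `draws` approximates the posterior of the mean under the uninformative prior; each draw necessarily lies within the observed range.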

### Other Bayesian models

• There are many other models/methods within Bayesian survey sampling, than the Bayesian Bootstrap

• Another approach starts with a normal-gamma model

• Given μ and σ², data come from an iid Normal(μ, σ²) model

• The variance σ² follows a priori an inverse gamma distribution

• The mean μ follows a priori a normal distribution with mean m and variance kσ²

• and later relaxes the normality assumption

• but I do not have enough time here.

### 7. Preposterior analysis: study/experimental design

• In the design of a survey one must take the posterior distribution into account.

• You may e.g. want to

• Get a small posterior variance

• Get a short 95 % prediction interval

• Make a good decision

• This analysis of the possible posterior consequences, carried out before the experiment, is called preposterior analysis

### Preposterior analysis with public statistics

• Usually when you make a survey for your own benefit you should use your own prior both in the preposterior and the posterior analysis

• With public statistics you should have a flat prior in the posterior analysis,

• e.g. the posterior variance is Var(θ | X, K0)

• But the design decision is yours and you should use all your information for that decision

• e.g. find the design which minimizes E(Var(θ | X, K0) | K_You)

### Example: Neyman allocation; Dichotomous data

• M strata with N_m elements. The unknown proportion in stratum m is p_m.

• How many elements should be drawn from each stratum in order to estimate the average proportion best?

• Neyman: choose n_m ∝ N_m (p_m(1 − p_m))^1/2

• But p_m is unknown. Classical people: use your best subjective guess p_m0

• Neyman: choose n_m ∝ N_m (p_m0(1 − p_m0))^1/2
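A small numerical sketch of the Neyman allocation with guessed proportions (the strata sizes and guesses below are made up for illustration):

```python
from math import sqrt

# Hypothetical strata sizes and best subjective guesses p_m0
N = [1000, 500, 250]
p0 = [0.50, 0.10, 0.30]
n_total = 200

# Neyman: n_m proportional to N_m * sqrt(p_m0 * (1 - p_m0))
w = [Nm * sqrt(p * (1 - p)) for Nm, p in zip(N, p0)]
n = [n_total * wm / sum(w) for wm in w]
```

Large strata with proportions near 1/2 get the biggest samples, since both factors in the weight are then large.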

### Bayesians: Do not use a one-point prior. It is too subjective! Take also your prior uncertainty into account!

• Choose e.g. the prior Beta(a_m, b_m), m ≤ M,

• where p_m0 = a_m / (a_m + b_m)

• and Var(p_m) = a_m b_m / ((a_m + b_m)²(a_m + b_m + 1))

### Example: Optimal allocation

• M strata of sizes N_m, dichotomous data, independent priors Beta(a_m, b_m) (as we saw above). Posterior variance:

• The expected value of this is

• Minimising this gives approximately the sample sizes

• The terms (a_m + b_m) on the left-hand side should not be there in the case of public statistics (c = cost)

• This differs from Neyman allocation, since it takes the prior uncertainty about the proportions into account

### 8. Statistics in science: science is more complicated

• One may divide it into (at least) three phases:

• Exploratory

• Trying to get a good picture, convincing yourself

• Proving the fact: convincing others

• These phases may require different approaches and priors:

• Exploratory: your own prior, but critical

• During work: your own prior, often informative, based on theory, arguments, experience or the exploration

• In the presentation: usually a vague prior; other priors are possible, but give also the vague one

### 8.1 Exploratory

• Sometimes called hypothesis generating

• Most theories are false. Most substances are useless against cancer.

• Use your own priors, which most often say that all facts are most unlikely. Some examples

• Screening: all substances have a probability of 0.001 of having some effect

• Regression situations with an abundance (M) of explanatory variables. If all variables are ordered by importance, all M! orderings are equally likely. Given the ordering, the m:th regression coefficient is N(0, 1/(m − 1)²) (after standardisation of X). (Another possibility is that the reduction in unexplained variance 1 − R² is Beta(1, m²))

• When there is a support from theory or previous experiences other priors may be used

### 8.2 Getting a good picture yourself (assuming that you are the scientist)

• In classical terms this is the phase when you can formulate the hypotheses that you want to test. (Your prior is strong enough to formulate hypotheses)

• (In classical theory there is no description of the first phase. Mostly one said: if you can formulate a hypothesis you may test it, as long as you do not formulate too many)

• The reporting in this phase is quite similar to what was said about official statistics, i.e. try to give a good picture of the likelihood function

• But, contrary to public statistics, in the design of experiments it is your own posterior precision that should be maximised. (It is assumed that you are an expert in the field, and there is no reason to believe that your opinion is far from the present state of knowledge)

### 8.3 Proving scientific facts

• It is very easy to convince people who believe in the fact from the beginning.

• It is often fairly simple to convince yourself even if you are broadminded

• But to prove a scientific fact you must convince also those that have reasonable doubts.

### Proving scientific facts

• P(θ | X, K) ∝ P(X | θ, K) · P(θ | K)

• A person is convinced of a fact when his posterior probability is close to one for the fact.

• But to prove the fact scientifically this must hold for all reasonable priors including those describing reasonable doubt.

• Even if there is no such person, this must hold also for that prior, as long as it is reasonable

• I.e. a result is proved if

inf { P(θ | X, K) : K reasonable } > 1 − α for some small α

• Reporting: Use vague priors, but also show what the consequences are for some priors with (un-) reasonable doubt.

• When you prove something, all available data should be used. Type: meta-analysis. In some fields one study is usually not enough to convince people

• Designing experiments: design your experiments so that you maximise E(inf { P(θ | X, K) : K reasonable } | K_You) (if you are convinced).

### What is reasonable doubt? Convincing others

• You have to contemplate what is meant by reasonable doubt.

• Depends on the importance of the subject.

• It can be just putting very small prior probability on the fact to be proven

• But you must also try to find the possible flaws in your theory and design your experiments to counter them.

### Priors with reasonable doubt

• Use priors with reasonable doubt

• In an experiment to prove telepathic effects you could e.g. use priors like P(log-odds ratio = 0) = 0.9999. If the log-odds ratio is different from 0, it may be modelled as N(0, σ²), where σ² may be moderate or large.

• If the posterior e.g. says that P(p > 0.5) > 0.95, you may consider the fact as proved. (Roughly, this means that you need about 20 successes in a row, where a random guess has probability 1/2 of being correct.)

• Never use the prior P = 1, since you must always be open-minded. (Only fundamentalists do so; they will never change their opinion, whatever the evidence.)

• In more standard situations you will probably not need quite so negative priors
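The rough "20 successes in a row" figure can be reproduced with a simplified two-point version of this prior, where the alternative is taken to be perfect telepathy (an assumption made here only to keep the arithmetic exact):

```python
from fractions import Fraction as F

prior_alt = F(1, 10000)   # P(telepathy); the 0.9999 point mass sits on "no effect"

def posterior_alt(k):
    """P(telepathy | k straight correct guesses). Simplifying assumption
    (not in the slides): under telepathy every guess is correct, so the
    Bayes factor against pure chance (prob 1/2 per guess) is 2^k."""
    odds = prior_alt / (1 - prior_alt) * F(2) ** k
    return odds / (1 + odds)
```

After 20 straight successes the posterior probability exceeds 0.95, while 10 successes still leave "no effect" as the more likely explanation.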

### Modelling flaws in the theory/study Example: Several studies needed

• An argument often met in medical studies is that no effect is proven until it is corroborated by at least three independent studies.

• This means that different conclusions must be drawn with different prior knowledge. (No, one or two previous studies).

• People arguing like that violate the Neyman-Pearson theory

• How can this be modelled in Bayesian terms?

### Some type of multilevel model

• A possible model is X_i = μ + a_i + b_i + e_i, where μ is the worldwide mean, a_i is the unknown methodology bias of study number i, b_i is the site-specific bias, and e_i is the random error of the experiment (usually with a known (posterior) distribution)

• μ has an uninformative prior

• b_i can often be assumed to follow normal distributions with common mean and variance following Normal-χ²-distributions.

• a_i probably has prior distributions with much longer tails.

• The prior distributions for ai and bi might be estimated from other studies.

• If the prior for a_i is chosen so that two similar outliers out of two trials are not impossible but three similar outliers are unlikely, we end up requiring three independent trials with similar results.

• There should probably also be a selection bias term included. This can be done, but it is too complicated for this short talk.

• In the same way, the distribution of μ may depend on how strict the inclusion criteria are.

• One may argue that the k trials are not independent. Studies following the same protocol may get the same bias.

### Some situations which design-based sampling cannot handle

• Many people say that one should choose the approach that is best for the problem at hand. Some problems are more difficult than others to handle with design-based methods

• For instance:

• Missing data

• Multiple imputation

• Small area estimation

• Outlier detection

• Editing

• Meta-analysis

• Synthetic estimation

• Coding and classification

• Total survey design

### Missing data

• A model for the missingness mechanism is needed. The following Bayesian notions are commonly used, but not everyone realises that they are Bayesian:

• Missing completely at random (MCAR):

F(x, y, z; θ_xyz) = F(x, y; θ_xy) F(z; θ_z)

• Missing at random (MAR): given what you know, the response mechanism is independent of the other variables: F(x, y, z; θ_xyz) = F(y; θ_y | X = x) F(z; θ_z | X = x) F(x; θ_x) (where x is known; for unit non-response), or F(y, z; θ_yz | y_R) = F(y; θ_y | y_R) F(z; θ_z | y_R) (for item non-response)

• Not missing at random (NMAR)

• Here X are auxiliary variables, Y study variables and Z the missingness indicator; each parameter is indexed with the variable for which it contains information.

### Multiple imputation

• There are many different situations. We only look at one situation with two y-variables (but use an MCMC technique).

• For some respondents one of the y-variables is missing, but which one differs between respondents

• We assume MAR!

• We also assume for now that (y_1i, y_2i) comes from a normal superpopulation with unknown mean θ and unknown covariance matrix Σ.

### Multiple imputation MCMC-procedure

1. Impute starting values for all missing values

2. Put b = 1

3. Find the true posterior of θ and Σ (assuming a vague normal-inverse-gamma prior)

4. Draw possible θ_b and Σ_b from this distribution

5. Find the distribution of the missing values given these parameter values

6. Draw new random numbers from this distribution and impute them

7. Draw a random value from the conditional distribution of the sum of the non-sampled units (given parameters and imputed values)

8. Add all values to get estimates of the totals Y_T1,b and Y_T2,b, and save them

9. If b < B0 + B, set b = b + 1 and go to step 3; else stop
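A toy version of this loop, with strong simplifying assumptions not in the slides (unit variances, a known correlation rho = 0.5, flat priors on the two means) so that every conditional draw is elementary:

```python
import random
import statistics

def mi_gibbs(pairs, B0=200, B=800, seed=1):
    """Sketch of the data-augmentation loop for pairs (y1, y2) where some
    y2 are missing (None). Assumed (not from the slides): unit variances,
    known correlation rho, flat priors on the means."""
    rng = random.Random(seed)
    rho = 0.5
    n = len(pairs)
    y1 = [a for a, _ in pairs]
    # start by imputing the mean of the observed y2 values
    m_obs = statistics.mean(b for _, b in pairs if b is not None)
    y2 = [b if b is not None else m_obs for _, b in pairs]
    draws = []
    for b in range(B0 + B):
        # draw the two means from their flat-prior posteriors
        m1 = rng.gauss(statistics.mean(y1), 1 / n ** 0.5)
        m2 = rng.gauss(statistics.mean(y2), 1 / n ** 0.5)
        # redraw every missing y2 from its conditional given y1 and the means
        for i, (_, obs) in enumerate(pairs):
            if obs is None:
                y2[i] = rng.gauss(m2 + rho * (y1[i] - m1), (1 - rho**2) ** 0.5)
        if b >= B0:
            draws.append(sum(y2))          # save a draw of the y2 total
    return draws

pairs = [(x, x) for x in range(1, 8)] + [(8.0, None), (9.0, None), (10.0, None)]
draws = mi_gibbs(pairs)
```

After the burn-in the saved totals are (approximately) draws from the posterior of the y2 total; plotting them gives the full posterior rather than a single point estimate.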

### Multiple imputation cont.

• This is an ergodic Markov chain. The distribution converges to the true posterior distribution of T, θ and Σ.

• Choose a burn-in period B0 so large that convergence is reached.

• The remaining B observations are thus drawn from the true posterior

• This distribution may be plotted

### Multiple imputation not fully Bayesian

1-6. As before, to get imputed data set number b

7. Estimate the total T_b (or whatever is of interest) from the completed sample with standard methods, together with its variance estimate S_b

8. Save them

9. If b < B0 + B, set b = b + 1 and go to step 3; else stop

10. Compute the mean of the last B estimates T_b; use this as the estimate of the total

11. Compute the mean of the last B variance estimates S_b

12. Compute the variance of the last B estimates T_b

13. Use the sum of the last two values as a variance estimator for the estimate of the total

• Note that this is a mixture of Bayesian and design-based variances, but it works as a classical estimate.
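The final combination (mean of the estimates; mean within-imputation variance plus between-imputation variance) can be sketched as follows; note that Rubin's full rule multiplies the between term by (1 + 1/B), while the slide uses the simple sum:

```python
def combine(estimates, variances):
    """Combine B completed-data estimates: point estimate = mean of the
    estimates; variance estimate = mean within-imputation variance +
    between-imputation variance (the slide's simple sum)."""
    B = len(estimates)
    t_bar = sum(estimates) / B
    within = sum(variances) / B
    between = sum((t - t_bar) ** 2 for t in estimates) / (B - 1)
    return t_bar, within + between

t_hat, v_hat = combine([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
```

The between-imputation term is what carries the extra uncertainty due to the missing data; with identical imputations it vanishes and the classical variance remains.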


### Multiple imputation cont.

• Posterior under normality

• But what if the distribution is not normal?

• The means are still BLUE estimators of the parameter and the total, and a consistent estimator of the variance

• But if the distribution is skewed it will not be particularly good. This is not a defect of multiple imputation, but a problem with skewed distributions in general

### Conclusions

• Always use Bayesian methods

• You will get new tools (e.g. full posterior distributions)

• You will produce something useful

• You will be logically consistent

• You will be able to tackle many more problems within the theory

• You may use design-based methods as quick and dirty methods when you know that the result will be almost equivalent to the Bayesian approach.