## Kevin Murphy UBC CS & Stats 9 February 2005


**Why I am a Bayesian (and why you should become one, too), or Classical statistics considered harmful**

Kevin Murphy, UBC CS & Stats, 9 February 2005

**Where does the title come from?**

• “Why I am not a Bayesian”, Glymour, 1981
• “Why Glymour is a Bayesian”, Rosenkrantz, 1983
• “Why isn’t everyone a Bayesian?”, Efron, 1986
• “Bayesianism and causality, or, why I am only a half-Bayesian”, Pearl, 2001

Many other such philosophical essays…

**Frequentist vs Bayesian**

| Frequentist | Bayesian |
| --- | --- |
| Prob = objective relative frequencies | Prob = degrees of belief (uncertainty) |
| Params are fixed unknown constants, so cannot write e.g. $P(\theta = 0.5 \mid D)$ | Can write $P(\text{anything} \mid D)$ |
| Estimators should be good when averaged across many trials | Estimators should be good for the available data |

Source: “All of Statistics”, Larry Wasserman

**Outline**

• Hypothesis testing – Bayesian approach
• Hypothesis testing – classical approach
• What’s wrong with the classical approach?

**Coin flipping**

HHTHT
HHHHH

What process produced these sequences?

The following slides are from Tenenbaum & Griffiths.

**Hypotheses in coin flipping**

Describe processes by which D could be generated (statistical models):

• Fair coin, P(H) = 0.5
• Coin with P(H) = p
• Markov model
• Hidden Markov model
• ...

D = HHTHT

**Hypotheses in coin flipping**

Describe processes by which D could be generated (generative models):

• Fair coin, P(H) = 0.5
• Coin with P(H) = p
• Markov model
• Hidden Markov model
• ...
D = HHTHT

**Representing generative models**

• Graphical model notation: Pearl (1988), Jordan (1998)
• Variables are nodes, edges indicate dependency
• Directed edges show the causal process of data generation

[Figure: graphical models over nodes d1, d2, d3, d4 for the fair coin, P(H) = 0.5, and for the Markov model, with the data HHTHT shown as nodes d1 to d5.]

**Models with latent structure**

• Not all nodes in a graphical model need to be observed
• Some variables reflect latent structure, used in generating D but unobserved

[Figure: graphical models with latent nodes: a node p together with d1, d2, d3, d4 for P(H) = p, and nodes s1, s2, s3, s4 together with d1, d2, d3, d4 for the hidden Markov model, with the data HHTHT shown as nodes d1 to d5.]

How do we select the “best” model?

**Bayes’ rule**

$$P(H \mid D) = \frac{P(D \mid H)\, P(H)}{\sum_{H'} P(D \mid H')\, P(H')}$$

• $P(H \mid D)$: posterior probability
• $P(D \mid H)$: likelihood
• $P(H)$: prior probability
• Denominator: sum over the space of hypotheses

**The origin of Bayes’ rule**

• A simple consequence of using probability to represent degrees of belief
• For any two random variables A and B:

$$P(A, B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A), \quad \text{so} \quad P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

**Why represent degrees of belief with probabilities?**

• Good statistics
  • consistency, and worst-case error bounds
• Cox axioms
  • necessary to cohere with common sense
• “Dutch Book” + survival of the fittest
  • if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord
• Provides a theory of incremental learning
  • a common currency for combining prior knowledge and the lessons of experience

**Hypotheses in Bayesian inference**

• Hypotheses H refer to processes that could have generated the data D
• Bayesian inference provides a distribution over these hypotheses, given D
• P(D|H) is the probability of D being generated by the process identified by H
• Hypotheses H are mutually exclusive: only one process could have generated D

**Coin flipping**

• Comparing two simple hypotheses
  • P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses
  • P(H) = 0.5 vs.
P(H) = p

**Comparing two simple hypotheses**

• Contrast simple hypotheses:
  • H1: “fair coin”, P(H) = 0.5
  • H2: “always heads”, P(H) = 1.0
• Bayes’ rule:

$$P(H_i \mid D) = \frac{P(D \mid H_i)\, P(H_i)}{\sum_j P(D \mid H_j)\, P(H_j)}$$

• With two hypotheses, use odds form

**Bayes’ rule in odds form**

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

Posterior odds = Bayes factor (likelihood ratio) × prior odds

**Data = HHTHT**

D: HHTHT; H1, H2: “fair coin”, “always heads”

• $P(D \mid H_1) = 1/2^5$, $P(H_1) = 999/1000$
• $P(D \mid H_2) = 0$, $P(H_2) = 1/1000$
• $P(H_1 \mid D) / P(H_2 \mid D) = \infty$

**Data = HHHHH**

D: HHHHH; H1, H2: “fair coin”, “always heads”

• $P(D \mid H_1) = 1/2^5$, $P(H_1) = 999/1000$
• $P(D \mid H_2) = 1$, $P(H_2) = 1/1000$
• $P(H_1 \mid D) / P(H_2 \mid D) \approx 30$

**Data = HHHHHHHHHH**

D: HHHHHHHHHH; H1, H2: “fair coin”, “always heads”

• $P(D \mid H_1) = 1/2^{10}$, $P(H_1) = 999/1000$
• $P(D \mid H_2) = 1$, $P(H_2) = 1/1000$
• $P(H_1 \mid D) / P(H_2 \mid D) \approx 1$

**Coin flipping**

• Comparing two simple hypotheses
  • P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses
  • P(H) = 0.5 vs. P(H) = p

**Comparing simple and complex hypotheses**

• Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?

[Figure: graphical models over d1, d2, d3, d4 for the fair coin, P(H) = 0.5, vs. P(H) = p with an additional node p.]

**Comparing simple and complex hypotheses**

• P(H) = p is more complex than P(H) = 0.5 in two ways:
  • P(H) = 0.5 is a special case of P(H) = p
  • for any observed sequence X, we can choose p such that X is at least as probable as it is under P(H) = 0.5

[Figure: probability of each sequence under the two hypotheses; the best-fitting value is p = 1.0 for HHHHH and p = 0.6 for HHTHT.]

**Comparing simple and complex hypotheses**

• How can we deal with this?
  • frequentist: hypothesis testing
  • information theorist: minimum description length
  • Bayesian: just use probability theory!

**Comparing simple and complex hypotheses**

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

• Computing $P(D \mid H_1)$ is easy: $P(D \mid H_1) = 1/2^N$
• Compute $P(D \mid H_2)$ by averaging over p:

$$P(D \mid H_2) = \int_0^1 P(D \mid p)\, P(p)\, dp$$

Here $P(D \mid H_2)$ is the marginal likelihood, $P(D \mid p)$ is the likelihood, and $P(p)$ is the prior.

**Likelihood and prior**

• Likelihood: $P(D \mid p) = p^{N_H} (1-p)^{N_T}$
  • $N_H$: number of heads
  • $N_T$: number of tails
• Prior: $P(p) \propto$ ?

**A simple method of specifying priors**

• Imagine some fictitious trials, reflecting a set of previous experiences
  • strategy often used with neural networks
• e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair
• In fact, this is a sensible statistical idea...

**Likelihood and prior**

• Likelihood: $P(D \mid p) = p^{N_H} (1-p)^{N_T}$
• Prior: $P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1}$, that is, Beta($F_H$, $F_T$)
  • $F_H$: fictitious observations of heads
  • $F_T$: fictitious observations of tails
  • $F_H$ and $F_T$ act as pseudo-counts

**Posterior ∝ prior × likelihood**

• Prior: $P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1}$
• Likelihood: $P(D \mid p) = p^{N_H} (1-p)^{N_T}$
• Posterior: $P(p \mid D) \propto p^{N_H + F_H - 1} (1-p)^{N_T + F_T - 1}$

Same form!

**Conjugate priors**

• Exist for many standard distributions
  • formula for exponential family conjugacy
• Define prior in terms of fictitious observations
• Beta is conjugate to Bernoulli (coin-flipping)

[Figure: Beta priors for $F_H = F_T = 1$, $F_H = F_T = 3$, and $F_H = F_T = 1000$.]

**Normalizing constants**

• Prior: $P(p) = \frac{1}{B(F_H, F_T)}\, p^{F_H - 1} (1-p)^{F_T - 1}$
• Normalizing constant for the Beta distribution: $B(a, b) = \frac{\Gamma(a)\, \Gamma(b)}{\Gamma(a + b)}$
• Posterior: $P(p \mid D) = \frac{1}{B(N_H + F_H, N_T + F_T)}\, p^{N_H + F_H - 1} (1-p)^{N_T + F_T - 1}$
• Hence the marginal likelihood is

$$P(D \mid H_2) = \frac{B(N_H + F_H,\, N_T + F_T)}{B(F_H,\, F_T)}$$

**Comparing simple and complex hypotheses**

• $P(D \mid H_1) = 1/2^N$ is the likelihood for H1
• $P(D \mid H_2) = \int_0^1 P(D \mid p)\, P(p)\, dp$ is the marginal likelihood (“evidence”) for H2

**Marginal likelihood for H1 and H2**

[Figure: probability of the data under H1 and H2.]

Marginal likelihood is an average over all values of p.

**Bayesian model selection**

• Simple and complex hypotheses
  can be compared directly using Bayes’ rule
  • requires summing over latent variables
• Complex hypotheses are penalized for their greater flexibility: “Bayesian Occam’s razor”
• Maximum likelihood cannot be used for model selection (it always prefers the hypothesis with the largest number of parameters)

**Outline**

• Hypothesis testing – Bayesian approach
• Hypothesis testing – classical approach
• What’s wrong with the classical approach?

**Example: Belgian euro-coins**

• A Belgian euro spun N = 250 times came up heads X = 140 times.
• “It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%” – Barry Blight, LSE (reported in the Guardian, 2002)

Source: MacKay, exercise 3.15

**Classical hypothesis testing**

• Null hypothesis H0, e.g. θ = 0.5 (unbiased coin)
• For classical analysis, we don’t need to specify an alternative hypothesis, but later we will use H1: θ ≠ 0.5
• Need a decision rule that maps data D to accept/reject of H0
• Define a scalar measure of deviance d(D) from the null hypothesis, e.g. $N_h$ or $\chi^2$

**P-values**

• Define the p-value of threshold $\tau$ as

$$\text{pval}(\tau) = P(d(D) \ge \tau \mid H_0)$$

• Intuitively, the p-value of the data is the probability of getting data at least that extreme given H0
• Usually choose $\tau$ so that the false rejection rate of H0 is below the significance level $\alpha = 0.05$
• Often use an asymptotic approximation to the distribution of d(D) under H0 as $N \to \infty$

**P-value for euro coins**

• N = 250 trials, X = 140 heads
• P-value is “less than 7%”
• If N = 250 and X = 141, pval = 0.0497, so we can reject the null hypothesis at the significance level of 5%.
• This does not mean P(H0|D) = 0.07!
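The p-values quoted for the euro coin can be checked with an exact binomial tail sum. The following is a minimal Python sketch, not the computation used in the slides; the helper name `two_sided_pvalue` is hypothetical, and it assumes the deviance is the distance of the head count from N/2, which gives a symmetric two-sided test under θ = 0.5.

```python
from math import comb


def two_sided_pvalue(n: int, x: int) -> float:
    """Exact two-sided p-value for x heads in n spins under H0: theta = 0.5.

    Adds up the probability of every head count that lies at least as far
    from n/2 as the observed count x. The name is illustrative only.
    """
    distance = abs(x - n / 2)
    # Under H0 every sequence has probability 2**-n, so count sequences.
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= distance)
    return extreme / 2 ** n


p140 = two_sided_pvalue(250, 140)  # the observed euro-coin data
p141 = two_sided_pvalue(250, 141)  # one more head than observed
print(f"X=140: pval={p140:.4f}   X=141: pval={p141:.4f}")
```

The first value should fall under the quoted 7% and the second just under 5%, which shows how a single extra head moves the result across the conventional significance threshold.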
The two-sided p-value for X = 140 is computed in MATLAB as `pval = (1 - binocdf(139, n, 0.5)) + binocdf(110, n, 0.5)`, with `n = 250`.

**Bayesian analysis of euro-coin**

• Assume P(H0) = P(H1) = 0.5
• Assume P(p) ~ Beta(α, α)
• Setting α = 1 yields a uniform (non-informative) prior.

**Bayesian analysis of euro-coin**

• If α = 1, the Bayes factor $B = P(D \mid H_1) / P(D \mid H_0)$ is below 1, so H0 (unbiased) is (slightly) more probable than H1 (biased).
• By varying α over a large range, the best we can do is make B = 1.9, which does not strongly support the biased coin hypothesis.
• Other priors yield similar results.
• Bayesian analysis contradicts classical analysis.

**Outline**

• Hypothesis testing – Bayesian approach
• Hypothesis testing – classical approach
• What’s wrong with the classical approach?
  • Violates likelihood principle
  • Violates stopping rule principle
  • Violates common sense
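The Bayes factors in the euro-coin analysis can be reproduced from the Beta normalizing constants. This is a minimal Python sketch under stated assumptions: H1 places a Beta(α, α) prior on p, H0 fixes p = 0.5, and the function names and the grid of α values are illustrative choices, not taken from the slides.

```python
from math import exp, lgamma, log


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bayes_factor(n_heads: int, n_tails: int, alpha: float) -> float:
    """Bayes factor P(D|H1) / P(D|H0) for one observed sequence of spins.

    H0: p = 0.5 exactly.  H1: p ~ Beta(alpha, alpha)  (assumed prior).
    """
    # Marginal likelihood under H1: B(N_H + alpha, N_T + alpha) / B(alpha, alpha)
    log_m1 = log_beta(n_heads + alpha, n_tails + alpha) - log_beta(alpha, alpha)
    # Likelihood under H0: (1/2)^N
    log_m0 = (n_heads + n_tails) * log(0.5)
    return exp(log_m1 - log_m0)


b_uniform = bayes_factor(140, 110, alpha=1.0)      # uniform prior on p
alphas = [10 ** (k / 10) for k in range(-10, 41)]  # grid from 0.1 to 10^4
b_best = max(bayes_factor(140, 110, a) for a in alphas)
print(f"B at alpha=1: {b_uniform:.2f}   largest B on grid: {b_best:.2f}")
```

Working in log space avoids underflow in $(1/2)^{250}$ and in the Beta functions. With a uniform prior the factor comes out below 1, and even the most favourable α on this grid leaves it below 2, in line with the conclusion that the data do not strongly support a biased coin.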