
# Bayesian Essentials and Bayesian Regression - PowerPoint PPT Presentation

Bayesian Essentials and Bayesian Regression. Distribution Theory 101: marginal and conditional distributions. Simulating from a joint distribution: (i) draw from the marginal on X; (ii) condition on this draw, and draw from the conditional of Y|X.


## PowerPoint Slideshow about 'Bayesian Essentials and Bayesian Regression' - omer

Presentation Transcript

### Bayesian Essentials and Bayesian Regression


Distribution Theory 101

• Marginal and Conditional Distributions:

uniform

• To draw from the joint:

• i. draw from marginal on X

• ii. Condition on this draw, and draw from conditional of Y|X
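The two-step recipe can be sketched in R. The specific joint distribution here is an assumption chosen for illustration (uniform on the triangle 0 < y < x < 1), not necessarily the one pictured on the slide.

```r
# Assumed example: (X, Y) uniform on the triangle 0 < y < x < 1.
# The joint density is 2 on the triangle, so the marginal of X is p(x) = 2x
# and the conditional of Y given X = x is Uniform(0, x).
set.seed(1)
n <- 10000

# i. draw from the marginal on X by inverting its CDF, F(x) = x^2
x <- sqrt(runif(n))

# ii. condition on each draw of X and draw Y | X
y <- runif(n, min = 0, max = x)

# each pair (x[i], y[i]) is a draw from the joint distribution
draws <- cbind(x = x, y = y)
```

Plotting `draws` should fill the lower triangle of the unit square evenly.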

• Make inferences about unknown quantities using available information.

• Inference -- make probability statements

• unknowns --

• parameters, functions of parameters, states or latent variables, “future” outcomes, outcomes conditional on an action

• Information –

• data-based

• non data-based

• theories of behavior; “subjective views”; there is an underlying structure

• parameters are finite or in some range

• Ex: Conjoint Survey

• 500 respondents rank, rate, choose among product configurations.

• Small Amount of Information per respondent

• Response Variable Discrete

• Ex: Retail Scanning Data

• very large number of products

• large number of geographical units (markets, zones, stores)

• limited variation in some marketing mix vars

• Must make plausible predictions for decision making!

Note: any function proportional to the data density can be called the likelihood.

• LP: the likelihood contains all information relevant for inference. That is, as long as I have the same likelihood function, I should make the same inferences about the unknowns.

• Implies analysis is done conditional on the data.

• p(θ|D) ∝ p(D|θ) p(θ)

• Posterior ∝ “Likelihood” × Prior

• Modern Bayesian computing – simulation methods for generating draws from the posterior distribution p(θ|D).

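Output from Bayesian Inf:

p(θ|D), a high dimensional dist

Summarize this object via simulation:

marginal distributions of θ and of functions h(θ)

don’t just compute E[θ|D] and Var(θ|D)

See D, compute:

p(D_f|D) = ∫ p(D_f|θ) p(θ|D) dθ, where D_f is future data

“Predictive Distribution”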
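A predictive distribution averages the data density over the posterior, and by simulation that is just two nested draws. The sketch below assumes, purely for illustration, a Beta(8, 4) posterior for a success probability and 10 future Bernoulli trials.

```r
# Assumed setup: theta | D ~ Beta(8, 4); future data are successes in m = 10 trials.
set.seed(2)
R <- 20000
m <- 10

theta <- rbeta(R, 8, 4)                      # draws from the posterior p(theta | D)
ynew  <- rbinom(R, size = m, prob = theta)   # one future outcome per posterior draw

# simulation estimate of the predictive probabilities for 0..m successes
pred <- as.vector(table(factor(ynew, levels = 0:m))) / R

# a plug-in answer conditions on a point estimate and ignores parameter uncertainty
plug <- dbinom(0:m, size = m, prob = mean(theta))
```

Comparing `pred` with `plug` shows the predictive distribution is more spread out, since it accounts for uncertainty in θ.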

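• Loss: L(a,θ) where a = action; θ = state of nature

• Bayesian decision theory: choose the action that minimizes posterior expected loss, min over a of E[L(a,θ)|D] = ∫ L(a,θ) p(θ|D) dθ

• Estimation problem is a special case: the action is the estimate of θ; under squared-error loss the Bayes estimate is the posterior mean E[θ|D]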
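A minimal sketch of the estimation special case, assuming a Beta(3, 7) posterior picked only for illustration. It minimizes simulated expected loss over a grid of candidate actions and shows that the answer depends on the loss function.

```r
# Assumed posterior: theta | D ~ Beta(3, 7).
set.seed(3)
theta <- rbeta(50000, 3, 7)            # posterior draws
actions <- seq(0, 1, by = 0.01)        # candidate point estimates

# posterior expected loss of each action, approximated by averaging over draws
loss_sq  <- sapply(actions, function(a) mean((a - theta)^2))
loss_abs <- sapply(actions, function(a) mean(abs(a - theta)))

a_sq  <- actions[which.min(loss_sq)]   # close to the posterior mean
a_abs <- actions[which.min(loss_abs)]  # close to the posterior median

c(a_sq = a_sq, post_mean = mean(theta), a_abs = a_abs, post_median = median(theta))
```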

An estimator is admissible if there exists no other estimator with lower risk for all values of θ. The Bayes estimator minimizes expected (average) risk, which implies that it is admissible.

The Bayes estimator does the best for every D. Therefore, it must work at least as well as any other estimator.

• Bayesian Inference delivers an integrated approach to:

• Inference – including “estimation” and “testing”

• Prediction – with a full accounting for uncertainty

• Decision – with likelihood and loss (these are distinct!)

• Bayesian Inference is conditional on available info.

• The right answer to the right question.

• Bayes estimators are admissible. All admissible estimators are Bayes (Complete Class Thm). Which Bayes estimator?

Prior washes out – locally uniform!!! Bayes is consistent unless you have dogmatic prior.

Benefits-

finite sample answer to right question

full accounting for uncertainty

integrated approach to inf/decision

“Costs”-

computational (true any more? < classical!!)

prior (cost or benefit?)

esp. with many parms

(hierarchical/non-parametric problems)

Before simulation methods, Bayesians used posterior expectations, E[h(θ)|D], of various functions h as a summary of the posterior.

If p(θ|D) is in a convenient form (e.g. normal), then I might be able to compute this expectation analytically for some h. Via iid simulation, I can approximate it for all h.
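As a rough illustration, assume a normal posterior, θ|D ~ N(1, 0.5²), chosen only because one h has a known answer to check against. The same draws then handle an h that has no simple closed form.

```r
# Assumed posterior: theta | D ~ N(1, 0.5^2).
set.seed(4)
theta <- rnorm(100000, mean = 1, sd = 0.5)   # iid draws from the posterior

# h(theta) = exp(theta): Monte Carlo estimate versus the exact lognormal mean
h_mc    <- mean(exp(theta))
h_exact <- exp(1 + 0.5^2 / 2)

# h(theta) = 1{theta^2 > 1}: just as easy by simulation
p_mc <- mean(theta^2 > 1)

c(h_mc = h_mc, h_exact = h_exact, p_mc = p_mc)
```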

• Models with convenient analytic properties almost invariably come from conjugate families.

• Why do I care now?

• - conjugate models are used as building blocks

• build intuition re functions of Bayesian inference

• Definition:

• A prior is conjugate to a likelihood if the posterior is in the same class of distributions as prior.

• Basically, conjugate priors are like the posterior from some imaginary dataset with a diffuse prior.

• Beta-Binomial model
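A small sketch of the Beta-Binomial update, with made-up numbers. A Beta(a, b) prior combined with y successes in n trials gives a Beta(a + y, b + n - y) posterior; the grid computation is only a numerical check that posterior ∝ likelihood × prior lands in the same family.

```r
# Assumed numbers for illustration
a <- 2; b <- 2       # Beta prior parameters
n <- 20; y <- 14     # observed trials and successes

# conjugate update
a_post <- a + y
b_post <- b + n - y
post_mean <- a_post / (a_post + b_post)

# the posterior mean is a weighted average of the prior mean and the sample proportion
w <- (a + b) / (a + b + n)
avg <- w * a / (a + b) + (1 - w) * y / n

# numerical check on a grid: normalize likelihood x prior and compare to the Beta density
theta <- seq(0.001, 0.999, length.out = 999)
unnorm <- dbinom(y, n, theta) * dbeta(theta, a, b)
grid_post <- unnorm / sum(unnorm * (theta[2] - theta[1]))
max_diff <- max(abs(grid_post - dbeta(theta, a_post, b_post)))
```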

Is this model complete? For non-experimental data, don’t we need a model for the joint distribution of y and x?

Regression model

simultaneous systems are not written this way!

rules out x=f(β)!!!

If Ψ is a priori indep of (β,σ),

What is conjugate prior? Comes from form of likelihood function. Here we condition on X.

Let’s complete the square on β, or rewrite by projecting y on X (the column space of X).

Geometry of regression

(figure: vectors x1 and x2, with components β1x1 and β2x2)

• No one ever computes a matrix inverse directly.

• Two numerically stable methods:

• QR decomposition of X

• Cholesky root of X’X and compute inverse using root

• Non-Bayesians have to worry about singularity or near singularity of X’X. We don’t! more later
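Both routes can be sketched on simulated data (the data-generating values are arbitrary). Neither forms an explicit inverse: the QR route solves the least squares problem directly, and the Cholesky route does two triangular solves.

```r
set.seed(5)
n <- 100; k <- 3
X <- cbind(1, matrix(rnorm(n * (k - 1)), ncol = k - 1))
y <- X %*% c(1, 2, -1) + rnorm(n)

# QR decomposition of X
b_qr <- qr.coef(qr(X), y)

# Cholesky root of X'X: X'X = R'R with R upper triangular,
# so solve R'u = X'y (forward) and then R b = u (backward)
R <- chol(crossprod(X))
u <- forwardsolve(t(R), crossprod(X, y))
b_chol <- backsolve(R, u)

cbind(b_qr, b_chol)
```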

In Bayesian computations, the fundamental matrix operation is the Cholesky root. chol() in R

The Cholesky root is the generalization of the square root applied to positive definite matrices.

As Bayesians with proper priors, we don’t ever have to worry about singular matrices!

U is upper triangular with positive diagonal elements. U^-1 is easy to compute by recursively solving TU = I for T (backsolve() in R).
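A short check of these facts, using an arbitrary positive definite matrix as the example:

```r
Sigma <- matrix(c(4, 2,
                  2, 3), nrow = 2, byrow = TRUE)

U <- chol(Sigma)                 # upper triangular root: t(U) %*% U equals Sigma
Uinv <- backsolve(U, diag(2))    # solves U %*% T = I, so T is the inverse of U

# (U'U)^-1 = U^-1 (U^-1)', so the inverse of Sigma comes from the root
Sigma_inv <- Uinv %*% t(Uinv)

# chol2inv() does the same thing in one call
Sigma_inv2 <- chol2inv(U)
```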

Cholesky roots can be useful to simulate from Multivariate Normal Distribution.

To simulate a matrix of draws from MVN (each row is a separate draw) in R,

Y=matrix(rnorm(n*k),ncol=k)%*%chol(Sigma)

Y=t(t(Y)+mu)
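A self-contained version of the two lines above, with assumed values for n, mu and Sigma. It works because each row z' of the standard normal matrix becomes z'U, whose covariance is U'U = Sigma.

```r
set.seed(6)
n <- 100000; k <- 2
mu <- c(1, -2)                        # assumed mean vector
Sigma <- matrix(c(1.0, 0.5,
                  0.5, 2.0), k, k)    # assumed covariance matrix

Y <- matrix(rnorm(n * k), ncol = k) %*% chol(Sigma)  # rows are N(0, Sigma)
Y <- t(t(Y) + mu)                                    # add mu to every row

colMeans(Y)   # should be near mu
cov(Y)        # should be near Sigma
```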

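| UNIT | Y | X1 | X2 |
|------|----|----------|----------|
| A | 1 | 0.23815 | 0.43730 |
| A | 2 | 0.55508 | 0.47938 |
| A | 3 | 3.03399 | -2.17571 |
| A | 4 | -1.49488 | 1.66929 |
| B | 10 | -1.74019 | 0.35368 |
| B | 9 | 1.40533 | -1.26120 |
| B | 8 | 0.15628 | -0.27751 |
| B | 7 | -0.93869 | -0.04410 |
| B | 6 | -3.06566 | 0.14486 |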

    myreg=function(y,X){
    #
    # purpose: compute lsq regression
    #
    # arguments:
    #   y -- vector of dep var
    #   X -- array of indep vars
    #
    # output:
    #   list containing lsq coef and std errors
    #
    XpXinv=chol2inv(chol(crossprod(X)))          # (X'X)^-1 via the Cholesky root
    bhat=XpXinv%*%crossprod(X,y)                 # lsq coefficients
    res=as.vector(y-X%*%bhat)                    # residuals
    ssq=as.numeric(res%*%res/(nrow(X)-ncol(X)))  # error variance estimate
    se=sqrt(diag(ssq*XpXinv))                    # std errors
    list(b=bhat,std_errors=se)
    }

Regression with R

Regression likelihood

This is called an inverted gamma distribution. It can also be related to the inverse of a Chi-squared distribution.

Note that the conjugate prior suggested by the form of the likelihood has a prior on β which depends on σ.

Prior:

Interpretation as from another dataset.

Inverted Chi-Square:

Draw from prior?
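One way to draw from this prior, sketched under the parameterization β|σ² ~ N(betabar, σ²A^-1) and σ² ~ ν·ssq/χ²_ν. The hyperparameter values below are arbitrary assumptions.

```r
set.seed(9)
k <- 2
betabar <- c(0, 0)          # assumed prior mean
A <- diag(c(0.5, 2))        # assumed prior precision (scaled by 1/sigmasq)
nu <- 10; ssq <- 1          # assumed prior df and scale for sigmasq
R <- 100000

RA <- chol(A)               # A = RA'RA, so RA^-1 z has covariance A^-1

# draw sigmasq first, then beta given sigmasq
sigmasq <- nu * ssq / rchisq(R, nu)
beta <- matrix(0, R, k)
for (r in 1:R) {
  beta[r, ] <- betabar + sqrt(sigmasq[r]) * backsolve(RA, rnorm(k))
}

mean(sigmasq)   # theoretical mean is nu * ssq / (nu - 2)
```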

Scheme: [y|X, β, σ²] [β|σ²] [σ²]

1) Draw [σ² | y, X]

2) Draw [β | σ², y, X]

3) Repeat

The Bayes Estimator is the posterior mean of β.

Marginal on β is a multivariate student t.

Who cares?

The Bayes Estimator is the posterior mean of β.

This is a “shrinkage” estimator.

Is this reasonable?
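To see the shrinkage, the sketch below assumes the natural conjugate form of the posterior mean, (X'X + A)^-1 (X'y + A·betabar), with prior mean betabar and prior precision A. The data and the prior settings are invented for illustration.

```r
set.seed(7)
n <- 50
X <- cbind(1, rnorm(n))
y <- X %*% c(1, 2) + rnorm(n)
betabar <- c(0, 0)                       # assumed prior mean

# least squares estimate
bhat <- solve(crossprod(X), crossprod(X, y))

# posterior mean for prior precision A = a * I
post_mean <- function(a) {
  A <- a * diag(2)
  solve(crossprod(X) + A, crossprod(X, y) + A %*% betabar)
}

b_loose <- post_mean(0.01)   # weak prior: almost the least squares estimate
b_tight <- post_mean(1000)   # tight prior: pulled strongly toward betabar

cbind(bhat, b_loose, b_tight)
```

The posterior mean moves from the least squares estimate toward the prior mean as the prior precision grows.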

These determine prior location and spread for both coefs and error variance.

It has become customary to assess a “diffuse” prior:

This can be problematic. Var(y) might be a better choice.

Classic “non-informative” prior (improper):

• Is this “non-informative”?

• Of course not, it says that β is large with high prior “probability”

• Is this wise computationally?

• No, I have to worry about singularity in X’X

• Is this a good procedure?

• No, it is not admissible. Shrinkage is good!

    runireg=
    function(Data,Prior,Mcmc){
    #
    # purpose:
    #   draw from posterior for a univariate regression model with natural conjugate prior
    #
    # Arguments:
    #   Data -- list of data
    #     y,X
    #   Prior -- list of prior hyperparameters
    #     betabar,A   prior mean, prior precision
    #     nu, ssq     prior on sigmasq
    #   Mcmc -- list of MCMC parms
    #     R number of draws
    #     keep -- thinning parameter
    #
    # output:
    #   list of beta, sigmasq draws
    #   beta is k x 1 vector of coefficients
    # model:
    #   Y=Xbeta+e  var(e_i) = sigmasq
    # priors: beta| sigmasq ~ N(betabar,sigmasq*A^-1)
    #         sigmasq ~ (nu*ssq)/chisq_nu
    #
    # unpack arguments (this excerpt makes one draw; Mcmc$R and Mcmc$keep are not used)
    y=Data$y; X=Data$X
    betabar=Prior$betabar; A=Prior$A; nu=Prior$nu; ssq=Prior$ssq
    n=length(y); k=ncol(X)
    RA=chol(A)
    W=rbind(X,RA)
    z=c(y,as.vector(RA%*%betabar))
    IR=backsolve(chol(crossprod(W)),diag(k))
    #   W'W=R'R ;  (W'W)^-1 = IR IR'  -- this is UL decomp
    btilde=crossprod(t(IR))%*%crossprod(W,z)
    res=z-W%*%btilde
    s=as.numeric(t(res)%*%res)
    #
    # first draw sigmasq
    #
    sigmasq=(nu*ssq + s)/rchisq(1,nu+n)
    #
    # now draw beta given sigmasq
    #
    beta = btilde + sqrt(sigmasq)*IR%*%rnorm(k)
    list(beta=beta,sigmasq=sigmasq)
    }
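The data-augmentation trick in this code, stacking the Cholesky root of A under X and RA·betabar under y, can be checked against the direct formulas. This is a sketch on simulated data with arbitrary prior settings.

```r
set.seed(8)
n <- 30; k <- 2
X <- cbind(1, rnorm(n))
y <- as.vector(X %*% c(1, -1) + rnorm(n))
betabar <- c(0.5, 0.5)                    # assumed prior mean
A <- matrix(c(2, 0.5,
              0.5, 1), k, k)              # assumed prior precision

# augmented regression: W'W = X'X + A and W'z = X'y + A betabar
RA <- chol(A)
W <- rbind(X, RA)
z <- c(y, as.vector(RA %*% betabar))
b_aug <- qr.coef(qr(W), z)
s_aug <- sum((z - W %*% b_aug)^2)

# direct computation of the same quantities
b_direct <- as.vector(solve(crossprod(X) + A, crossprod(X, y) + A %*% betabar))
d <- b_direct - betabar
s_direct <- sum((y - X %*% b_direct)^2) + as.numeric(t(d) %*% A %*% d)
```

The residual sum of squares from the augmented regression equals the data sum of squares plus the prior quadratic form, which is the quantity added to nu*ssq when drawing sigmasq.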

Form of the likelihood suggests that the natural conjugate (convenient) prior for Σ would be of the Inverted Wishart form:

denoted Σ ~ IW(ν, V)

• ν - tightness

V - location

however, as V increases, spread also increases

limitations: i. small ν -- thick tail ii. only one tightness parm

Prior:

Posterior:

    rmultireg=
    function(Y,X,Bbar,A,nu,V){
    # note: Y,X,A,Bbar must be matrices!
    # assumption: rwishart() is the bayesm function, returning a list with
    #   IW (inverted Wishart draw) and CI, where IW = CI CI'
    n=nrow(Y); m=ncol(Y); k=ncol(X)
    RA=chol(A)
    W=rbind(X,RA)
    Z=rbind(Y,RA%*%Bbar)
    IR=backsolve(chol(crossprod(W)),diag(k))
    # W'W = R'R  &  (W'W)^-1 = IRIR'  -- this is the UL decomp!
    Btilde=crossprod(t(IR))%*%crossprod(W,Z)
    # IRIR'(W'Z) = (X'X+A)^-1(X'Y + ABbar)
    S=crossprod(Z-W%*%Btilde)
    #
    # first draw Sigma
    #
    rwout=rwishart(nu+n,chol2inv(chol(V+S)))
    #
    # now draw B given Sigma   note beta ~ N(vec(Btilde),Sigma (x) Cov)
    #       Cov=(X'X + A)^-1  = IR t(IR)
    #       Sigma=CICI'
    #       therefore, cov(beta)= Omega = CICI' (x) IR IR' = (CI (x) IR) (CI (x) IR)'
    #       so to draw beta we do beta= vec(Btilde) +(CI (x) IR)vec(Z_km)
    #       Z_km is k x m matrix of N(0,1)
    #       since vec(ABC) = (C' (x) A)vec(B), we have
    #       B = Btilde + IR Z_km CI'
    #
    B = Btilde + IR%*%matrix(rnorm(m*k),ncol=m)%*%t(rwout$CI)
    list(B=B,Sigma=rwout$IW)
    }

SUR:

set of regressions “related” via correlated errors

BUT, no joint conjugate prior!!