Bayesian Essentials and Bayesian Regression

1 / 52

# Bayesian Essentials and Bayesian Regression - PowerPoint PPT Presentation

Bayesian Essentials and Bayesian Regression. Y. 1. 1. X. Distribution Theory 101. Marginal and Conditional Distributions:. uniform. Simulating from Joint. To draw from the joint: i. draw from marginal on X ii. Condition on this draw, and draw from conditional of Y|X.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Bayesian Essentials and Bayesian Regression' - omer

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Bayesian Essentials and Bayesian Regression

Y

1

1

X

Distribution Theory 101
• Marginal and Conditional Distributions:

uniform

Simulating from Joint
• To draw from the joint:
• i. draw from marginal on X
• ii. Condition on this draw, and draw from conditional of Y|X
The Goal of Inference
• Make inferences about unknown quantities using available information.
• Inference -- make probability statements
• unknowns --
• parameters, functions of parameters, states or latent variables, “future” outcomes, outcomes conditional on an action
• Information –
• data-based
• non data-based
• theories of behavior; “subjective views” there is an underlying structure
• parameters are finite or in some range
Data Aspects of Marketing Problems
• Ex: Conjoint Survey
• 500 respondents rank, rate, chose among product configurations.
• Small Amount of Information per respondent
• Response Variable Discrete
• Ex: Retail Scanning Data
• very large number of products
• large number of geographical units (markets, zones, stores)
• limited variation in some marketing mix vars
• Must make plausible predictions for decision making!
The likelihood principle

Note: any function proportional to data density can be called the likelihood.

• LP: the likelihood contains all information relevant for inference. That is, as long as I have same likelihood function, I should make the same inferences about the unknowns.
• Implies analysis is done conditional on the data,.
Bayes theorem
• p(|D)  p(D| ) p()
• Posterior  “Likelihood” × Prior
• Modern Bayesian computing– simulation methods for generating draws from the posterior distribution p(|D).
Summarizing the posterior

Output from Bayesian Inf:

A high dimensional dist

Summarize this object via simulation:

marginal distributions of

don’t just compute

Prediction

See D,compute:

“Predictive Distribution”

Decision theory
• Loss: L(a,) where a=action; =state of nature
• Bayesian decision theory:
• Estimation problem is a special case:
Sampling properties of Bayes estimators

An estimator is admissible if there exist no other estimator with lower risk for all values of . The Bayes estimator minimizes expected (average) risk which implies they are admissible:

The Bayes estimator does the best for every D. Therefore, it must work as at least as well as any other estimator.

Bayes Inference: Summary
• Bayesian Inference delivers an integrated approach to:
• Inference – including “estimation” and “testing”
• Prediction – with a full accounting for uncertainty
• Decision – with likelihood and loss (these are distinct!)
• Bayesian Inference is conditional on available info.
• The right answer to the right question.
• Bayes estimators are admissible. All admissible estimators are Bayes (Complete Class Thm). Which Bayes estimator?
Bayes/Classical Estimators

Prior washes out – locally uniform!!! Bayes is consistent unless you have dogmatic prior.

Benefits/Costs of Bayes Inf

Benefits-

finite sample answer to right question

full accounting for uncertainty

integrated approach to inf/decision

“Costs”-

computational (true any more? < classical!!)

prior (cost or benefit?)

esp. with many parms

(hierarchical/non-parameric problems)

Bayesian Computations

Before simulation methods, Bayesians used posterior expectations of various functions as summary of posterior.

If p(θ|D) is in a convenient form (e.g. normal), then I might be able to compute this for some h. Via iid simulation for all h.

Conjugate Families
• Models with convenient analytic properties almost invariably come from conjugate families.
• Why do I care now?
• - conjugate models are used as building blocks
• build intuition re functions of Bayesian inference
• Definition:
• A prior is conjugate to a likelihood if the posterior is in the same class of distributions as prior.
• Basically, conjugate priors are like the posterior from some imaginary dataset with a diffuse prior.
Regression model

Is this model complete? For non-experimental data, don’t we need a model for the joint distribution of y and x?

two separate analyses

Regression model

simultaneous systems are not written this way!

rules out x=f(β)!!!

If Ψ is a priori indep of (β,σ),

Conjugate Prior

What is conjugate prior? Comes from form of likelihood function. Here we condition on X.

Let’s – complete the square on βor rewrite by projecting y on X (column space of X).

y

x2

2x2

1x1

x1

Geometry of regression
• No one ever computes a matrix inverse directly.
• Two numerically stable methods:
• QR decomposition of X
• Cholesky root of X’X and compute inverse using root
• Non-Bayesians have to worry about singularity or near singularity of X’X. We don’t! more later
Cholesky Roots

In Bayesian computations, the fundamental matrix operation is the Cholesky root. chol() in R

The Cholesky root is the generalization of the square root applied to positive definite matrices.

As Bayesians with proper priors, we don’t ever have to worry about singular matrices!

U is upper triangular with positive diagonal elements. U-1 is easy to compute by recursively solving TU = I for T, backsolve() in R.

Cholesky Roots

Cholesky roots can be useful to simulate from Multivariate Normal Distribution.

To simulate a matrix of draws from MVN (each row is a separate draw) in R,

Y=matrix(rnorm(n*k),ncol=k)%*%chol(Sigma)

Y=t(t(Y)+mu)

data.txt:

UNIT Y X1 X2

A 1 0.23815 0.43730

A 2 0.55508 0.47938

A 3 3.03399 -2.17571

A 4 -1.49488 1.66929

B 10 -1.74019 0.35368

B 9 1.40533 -1.26120

B 8 0.15628 -0.27751

B 7 -0.93869 -0.04410

B 6 -3.06566 0.14486

myreg=function(y,X){

#

# purpose: compute lsq regression

#

# arguments:

# y -- vector of dep var

# X -- array of indep vars

#

# output:

# list containing lsq coef and std errors

#

XpXinv=chol2inv(chol(crossprod(X)))

bhat=XpXinv%*%crossprod(X,y)

res=as.vector(y-X%*%bhat)

ssq=as.numeric(res%*%res/(nrow(X)-ncol(X)))

se=sqrt(diag(ssq*XpXinv))

list(b=bhat,std_errors=se)

}

Regression with R
Regression likelihood

This is called an inverted gamma distribution. It can also be related to the inverse of a Chi-squared distribution.

Note the conjugate prior suggested by the form the likelihood has a prior on βwhich depends on σ.

Bayesian Regression

Prior:

Interpretation as from another dataset.

Inverted Chi-Square:

Draw from prior?

IID Simulations

Scheme: [y|X, , 2] [|2] [2]

1) Draw [2 | y, X]

2) Draw [ | 2,y, X]

3) Repeat

Bayes Estimator

The Bayes Estimator is the posterior mean of β.

Marginal on β is a multivariate student t.

Who cares?

Shrinkage and Conjugate Priors

The Bayes Estimator is the posterior mean of β.

This is a “shrinkage” estimator.

Is this reasonable?

Assessing Prior Hyperparameters

These determine prior location and spread for both coefs and error variance.

It has become customary to assess a “diffuse” prior:

This can be problematic. Var(y) might be a better choice.

Improper or “non-informative” priors

Classic “non-informative” prior (improper):

• Is this “non-informative”?
• Of course not, it says that  is large with high prior “probability”
• Is this wise computationally?
• No, I have to worry about singularity in X’X
• Is this a good procedure?
• No, it is not admissible. Shrinkage is good!
runireg
• runireg=
• function(Data,Prior,Mcmc){
• #
• # purpose:
• # draw from posterior for a univariate regression model with natural conjugate prior
• #
• # Arguments:
• # Data -- list of data
• # y,X
• # Prior -- list of prior hyperparameters
• # betabar,A prior mean, prior precision
• # nu, ssq prior on sigmasq
• # Mcmc -- list of MCMC parms
• # R number of draws
• # keep -- thinning parameter
• #
• # output:
• # list of beta, sigmasq draws
• # beta is k x 1 vector of coefficients
• # model:
• # Y=Xbeta+e var(e_i) = sigmasq
• # priors: beta| sigmasq ~ N(betabar,sigmasq*A^-1)
• # sigmasq ~ (nu*ssq)/chisq_nu
runireg
• RA=chol(A)
• W=rbind(X,RA)
• z=c(y,as.vector(RA%*%betabar))
• IR=backsolve(chol(crossprod(W)),diag(k))
• # W\'W=R\'R ; (W\'W)^-1 = IR IR\' -- this is UL decomp
• btilde=crossprod(t(IR))%*%crossprod(W,z)
• res=z-W%*%btilde
• s=t(res)%*%res
• #
• # first draw Sigma
• #
• sigmasq=(nu*ssq + s)/rchisq(1,nu+n)
• #
• # now draw beta given Sigma
• #
• beta = btilde + as.vector(sqrt(sigmasq))*IR%*%rnorm(k)
• list(beta=beta,sigmasq=sigmasq)
• }
Inverted Wishart distribution

Form of the likelihood suggests that natural conjugate (convenient prior) for  would be of the Inverted Wishart form:

denoted

• - tightness

V- location

however, as  increases, spread also increases

limitations: i. small  -- thick tail ii. only one tightness parm

Drawing from Posterior: rmultireg

rmultireg=

function(Y,X,Bbar,A,nu,V)

RA=chol(A)

W=rbind(X,RA)

Z=rbind(Y,RA%*%Bbar)

# note: Y,X,A,Bbar must be matrices!

IR=backsolve(chol(crossprod(W)),diag(k))

# W\'W = R\'R & (W\'W)^-1 = IRIR\' -- this is the UL decomp!

Btilde=crossprod(t(IR))%*%crossprod(W,Z)

# IRIR\'(W\'Z) = (X\'X+A)^-1(X\'Y + ABbar)

S=crossprod(Z-W%*%Btilde)

#

rwout=rwishart(nu+n,chol2inv(chol(V+S)))

#

# now draw B given Sigma note beta ~ N(vec(Btilde),Sigma (x) Cov)

# Cov=(X\'X + A)^-1 = IR t(IR)

# Sigma=CICI\'

# therefore, cov(beta)= Omega = CICI\' (x) IR IR\' = (CI (x) IR) (CI (x) IR)\'

# so to draw beta we do beta= vec(Btilde) +(CI (x) IR)vec(Z_mk)

# Z_mk is m x k matrix of N(0,1)

# since vec(ABC) = (C\' (x) A)vec(B), we have

# B = Btilde + IR Z_mk CI\'

#

B = Btilde + IR%*%matrix(rnorm(m*k),ncol=m)%*%t(rwout\$CI)

Conjugacy is Fragile!

SUR:

set of regressions “related” via correlated errors

BUT, no joint conjugate prior!!