## Bayesian Essentials and Bayesian Regression


Simulating from Joint

- To draw from the joint distribution of (X, Y):
- i. draw X from its marginal distribution
- ii. conditioning on this draw, draw Y from the conditional distribution of Y|X (a small sketch follows)
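As a minimal sketch (not from the slides), here is the two-step draw for a bivariate normal pair, where both the marginal of X and the conditional of Y|X are available in closed form; the parameter values are purely illustrative:

set.seed(1)
ndraw <- 1000
mu_x <- 0; mu_y <- 0; sd_x <- 1; sd_y <- 1; rho <- 0.7   # illustrative parameters
x <- rnorm(ndraw, mean = mu_x, sd = sd_x)                # i. draw from the marginal of X
cond_mean <- mu_y + rho*(sd_y/sd_x)*(x - mu_x)           # mean of Y | X = x
cond_sd <- sd_y*sqrt(1 - rho^2)                          # sd of Y | X = x
y <- rnorm(ndraw, mean = cond_mean, sd = cond_sd)        # ii. draw from the conditional of Y|X
cor(x, y)                                                # should be close to rho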

The Goal of Inference

- Make inferences about unknown quantities using available information.
- Inference -- make probability statements about the unknowns
- unknowns --
- parameters, functions of parameters, states or latent variables, “future” outcomes, outcomes conditional on an action
- Information --
- data-based
- non-data-based
- theories of behavior; “subjective views” that there is an underlying structure
- parameters are finite or lie in some range

Data Aspects of Marketing Problems

- Ex: Conjoint Survey
- 500 respondents rank, rate, or choose among product configurations.
- Small Amount of Information per respondent
- Response Variable Discrete
- Ex: Retail Scanning Data
- very large number of products
- large number of geographical units (markets, zones, stores)
- limited variation in some marketing mix vars
- Must make plausible predictions for decision making!

The likelihood principle

Note: any function proportional to the data density can be called the likelihood.

- LP: the likelihood contains all information relevant for inference. That is, as long as I have the same likelihood function, I should make the same inferences about the unknowns.
- This implies that analysis is done conditional on the data.

Bayes theorem

- p(θ|D) ∝ p(D|θ) p(θ)
- Posterior ∝ “Likelihood” × Prior
- Modern Bayesian computing -- simulation methods for generating draws from the posterior distribution p(θ|D).

Summarizing the posterior

Output from Bayesian inference: the full posterior p(θ|D), a high-dimensional distribution.

Summarize this object via simulation: report the marginal distributions of individual parameters (and of functions of interest); don't just compute a single point estimate.

Decision theory

- Loss: L(a,θ) where a = action; θ = state of nature
- Bayesian decision theory: choose the action that minimizes posterior expected loss, min_a E[L(a,θ) | D] = min_a ∫ L(a,θ) p(θ|D) dθ (see the sketch below)
- The estimation problem is a special case: the action is the estimate, and under squared-error loss the optimal action is the posterior mean
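A minimal sketch (not from the slides) of computing the Bayes action by simulation: given posterior draws of θ, approximate the posterior expected loss on a grid of actions and pick the minimizer. Here the posterior draws and the squared-error loss are purely illustrative, so the Bayes action should be approximately the posterior mean:

set.seed(1)
theta_draws <- rnorm(10000, mean = 2, sd = 0.5)      # stand-in for draws from p(theta|D)
loss <- function(a, theta) (a - theta)^2             # squared-error loss L(a,theta)
actions <- seq(0, 4, by = 0.01)                      # grid of candidate actions
exp_loss <- sapply(actions, function(a) mean(loss(a, theta_draws)))  # E[L(a,theta)|D] by Monte Carlo
actions[which.min(exp_loss)]                         # approximate Bayes action (~ posterior mean here)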

Sampling properties of Bayes estimators

An estimator is admissible if there exists no other estimator with risk at least as low for all values of θ and strictly lower for some θ. The Bayes estimator minimizes expected (average) risk, which implies that Bayes estimators are admissible:

The Bayes estimator does the best possible for every D; therefore, averaging over D, it must do at least as well as any other estimator.

Bayes Inference: Summary

- Bayesian Inference delivers an integrated approach to:
- Inference – including “estimation” and “testing”
- Prediction – with a full accounting for uncertainty
- Decision – with likelihood and loss (these are distinct!)
- Bayesian Inference is conditional on available info.
- The right answer to the right question.
- Bayes estimators are admissible. All admissible estimators are Bayes (Complete Class Thm). Which Bayes estimator?

Bayes/Classical Estimators

Asymptotically, the prior washes out -- it acts as if locally uniform! Bayes estimators are consistent unless you have a dogmatic prior (one that puts zero probability near the true value).

Benefits/Costs of Bayes Inf

- Benefits:
- finite-sample answer to the right question
- full accounting for uncertainty
- integrated approach to inference/decision
- “Costs”:
- computational (is this still true? arguably less than for classical methods!!)
- the prior (cost or benefit?)
- especially with many parameters (hierarchical/non-parametric problems)

Bayesian Computations

Before simulation methods, Bayesians used posterior expectations of various functions as the summary of the posterior:

E[h(θ) | D] = ∫ h(θ) p(θ|D) dθ

If p(θ|D) is in a convenient form (e.g. normal), this can be computed analytically, but only for some h. Via iid simulation from the posterior it can be approximated for any h (see the sketch below).
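A minimal sketch (not from the slides): if we can make iid draws from the posterior, the posterior expectation of any h is just a Monte Carlo average. Here the posterior is taken to be N(1, 0.2^2) and h(θ) = exp(θ), both purely illustrative:

set.seed(1)
theta_draws <- rnorm(100000, mean = 1, sd = 0.2)   # iid draws from the (illustrative) posterior
mean(exp(theta_draws))                             # Monte Carlo estimate of E[h(theta)|D]
exp(1 + 0.2^2/2)                                   # exact value (lognormal mean), for comparison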

Conjugate Families

- Models with convenient analytic properties almost invariably come from conjugate families.
- Why do I care now?
- conjugate models are used as building blocks
- build intuition about how Bayesian inference works
- Definition:
- A prior is conjugate to a likelihood if the posterior is in the same class of distributions as prior.
- Basically, conjugate priors are like the posterior from some imaginary dataset with a diffuse prior.

Regression model

Is this model complete? For non-experimental data, don’t we need a model for the joint distribution of y and x?

Regression model

Simultaneous systems are not written this way! The model rules out x = f(β)!!!

If Ψ (the parameter of the marginal distribution of x) is a priori independent of (β,σ), then inference about (β,σ) can be made conditional on X: the marginal distribution of x contributes no information about the regression parameters.

Conjugate Prior

What is the conjugate prior? It comes from the form of the likelihood function. Here we condition on X.

The quadratic form in the likelihood suggests a normal prior for β.

Let's complete the square on β, or, equivalently, rewrite the likelihood by projecting y on X (the column space of X).

Traditional regression

- No one ever computes a matrix inverse directly.
- Two numerically stable methods:
- QR decomposition of X
- Cholesky root of X’X and compute inverse using root
- Non-Bayesians have to worry about singularity or near-singularity of X’X. We don’t! More on this later. (A small sketch of both methods follows.)
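A minimal sketch of the two numerically stable methods on simulated data (the data and dimensions are purely illustrative):

set.seed(1)
n <- 100; k <- 3
X <- cbind(1, matrix(rnorm(n*(k-1)), ncol = k-1))   # design matrix with an intercept
y <- X %*% c(1, -2, 0.5) + rnorm(n)
b_qr <- qr.coef(qr(X), y)                           # 1. via the QR decomposition of X
XpXinv <- chol2inv(chol(crossprod(X)))              # 2. via the Cholesky root of X'X
b_chol <- XpXinv %*% crossprod(X, y)
cbind(b_qr, b_chol)                                 # the two methods agree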

Cholesky Roots

In Bayesian computations, the fundamental matrix operation is the Cholesky root. chol() in R

The Cholesky root is the generalization of the square root applied to positive definite matrices.

As Bayesians with proper priors, we don’t ever have to worry about singular matrices!

U is upper triangular with positive diagonal elements (chol(A) returns U with A = U'U). U^-1 is easy to compute by recursively solving UT = I for T; backsolve() in R.

Cholesky Roots

Cholesky roots can be used to simulate from the Multivariate Normal distribution.

To simulate a matrix of draws from MVN (each row is a separate draw) in R,

Y=matrix(rnorm(n*k),ncol=k)%*%chol(Sigma)

Y=t(t(Y)+mu)
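For concreteness, a small usage sketch with illustrative values of mu and Sigma (not from the slides):

set.seed(1)
n <- 5000; k <- 2
mu <- c(1, -1)
Sigma <- matrix(c(1, 0.8, 0.8, 2), nrow = 2)
Y <- matrix(rnorm(n*k), ncol = k) %*% chol(Sigma)   # rows now have covariance Sigma (mean 0)
Y <- t(t(Y) + mu)                                   # add the mean vector to every row
colMeans(Y)                                         # approximately mu
var(Y)                                              # approximately Sigma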

Regression with R

data.txt:

UNIT Y X1 X2

A 1 0.23815 0.43730

A 2 0.55508 0.47938

A 3 3.03399 -2.17571

A 4 -1.49488 1.66929

B 10 -1.74019 0.35368

B 9 1.40533 -1.26120

B 8 0.15628 -0.27751

B 7 -0.93869 -0.04410

B 6 -3.06566 0.14486

df=read.table("data.txt",header=TRUE)

myreg=function(y,X){

#

# purpose: compute lsq regression

#

# arguments:

# y -- vector of dep var

# X -- array of indep vars

#

# output:

# list containing lsq coef and std errors

#

XpXinv=chol2inv(chol(crossprod(X)))

bhat=XpXinv%*%crossprod(X,y)

res=as.vector(y-X%*%bhat)

ssq=as.numeric(res%*%res/(nrow(X)-ncol(X)))

se=sqrt(diag(ssq*XpXinv))

list(b=bhat,std_errors=se)

}
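A usage sketch with the data frame read above, assuming Y is the response and X1, X2 are the regressors (an intercept column is added by hand):

y <- df$Y
X <- cbind(1, df$X1, df$X2)   # add an intercept column
out <- myreg(y, X)
out$b                         # least-squares coefficients
out$std_errors                # conventional standard errors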

Regression likelihood

Viewed as a function of σ², the likelihood has the form of an inverted gamma distribution. It can also be related to the inverse of a Chi-squared distribution.

Note that the conjugate prior suggested by the form of the likelihood has a prior on β which depends on σ.

Bayesian Regression

Prior (natural conjugate form; cf. the runireg code below):

beta | sigmasq ~ N(betabar, sigmasq*A^-1)

sigmasq ~ (nu*ssq)/chisq_nu (an Inverted Chi-square distribution)

Interpretation: the prior looks like the posterior from another (imaginary) dataset.

How do we draw from this prior? (See the sketch below.)
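A minimal sketch of drawing from this natural conjugate prior; the hyperparameter values below are purely illustrative:

set.seed(1)
k <- 2
betabar <- rep(0, k)                 # prior mean of beta
A <- 0.01*diag(k)                    # prior precision (small values => diffuse)
nu <- 3; ssq <- 1                    # prior df and scale for sigmasq
sigmasq <- nu*ssq/rchisq(1, nu)      # sigmasq ~ nu*ssq/chisq_nu
beta <- betabar + sqrt(sigmasq)*backsolve(chol(A), rnorm(k))   # beta | sigmasq ~ N(betabar, sigmasq*A^-1)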

Bayes Estimator

The Bayes Estimator is the posterior mean of β.

The marginal posterior on β is a multivariate Student t.

Who cares?

Shrinkage and Conjugate Priors

The Bayes Estimator is the posterior mean of β:

btilde = (X'X + A)^-1 (X'X bhat + A betabar), where bhat is the least-squares estimate.

This is a “shrinkage” estimator: it pulls the least-squares estimate toward the prior mean betabar, with weights determined by X'X and the prior precision A (see the sketch below).

Is this reasonable?
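A minimal sketch (illustrative data and prior) showing the posterior mean as a weighted combination that shrinks the least-squares estimate toward the prior mean:

set.seed(1)
n <- 50; k <- 2
X <- cbind(1, rnorm(n))
y <- X %*% c(2, 1) + rnorm(n)
betabar <- rep(0, k)                                        # prior mean
A <- 5*diag(k)                                              # prior precision (larger => more shrinkage)
bhat <- chol2inv(chol(crossprod(X))) %*% crossprod(X, y)    # least-squares estimate
btilde <- solve(crossprod(X) + A, crossprod(X) %*% bhat + A %*% betabar)   # posterior mean
cbind(bhat, btilde)                                         # btilde is pulled toward betabar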

Assessing Prior Hyperparameters

These determine prior location and spread for both the coefficients and the error variance.

It has become customary to assess a “diffuse” prior (e.g. a prior mean of zero and a very small prior precision A):

This can be problematic: a default prior scale for the error variance ignores the scale of the data. Var(y) might be a better choice.

Improper or “non-informative” priors

Classic “non-informative” prior (improper): p(β, σ²) ∝ 1/σ², flat on β.

- Is this “non-informative”?
- Of course not; the flat prior says that β is large with high prior “probability”
- Is this wise computationally?
- No, I have to worry about singularity in X’X
- Is this a good procedure?
- No, it is not admissible. Shrinkage is good!

runireg

- runireg=
- function(Data,Prior,Mcmc){
- #
- # purpose:
- # draw from posterior for a univariate regression model with natural conjugate prior
- #
- # Arguments:
- # Data -- list of data
- # y,X
- # Prior -- list of prior hyperparameters
- # betabar,A prior mean, prior precision
- # nu, ssq prior on sigmasq
- # Mcmc -- list of MCMC parms
- # R number of draws
- # keep -- thinning parameter
- #
- # output:
- # list of beta, sigmasq draws
- # beta is k x 1 vector of coefficients
- # model:
- # Y=Xbeta+e var(e_i) = sigmasq
- # priors: beta| sigmasq ~ N(betabar,sigmasq*A^-1)
- # sigmasq ~ (nu*ssq)/chisq_nu

runireg

- # (fragment of the function body: n = length(y) and k = ncol(X) are defined earlier in the full function)
- RA=chol(A)
- W=rbind(X,RA)
- z=c(y,as.vector(RA%*%betabar))
- IR=backsolve(chol(crossprod(W)),diag(k))
- # W'W=R'R ; (W'W)^-1 = IR IR' -- this is the UL decomposition
- btilde=crossprod(t(IR))%*%crossprod(W,z)
- res=z-W%*%btilde
- s=t(res)%*%res
- #
- # first draw Sigma
- #
- sigmasq=(nu*ssq + s)/rchisq(1,nu+n)
- #
- # now draw beta given Sigma
- #
- beta = btilde + as.vector(sqrt(sigmasq))*IR%*%rnorm(k)
- list(beta=beta,sigmasq=sigmasq)
- }
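A hedged usage sketch, assuming the complete runireg function wraps the one-draw core above in a loop of R draws and follows the argument lists described in the header comments (the output element names depend on that full implementation):

Data <- list(y = df$Y, X = cbind(1, df$X1, df$X2))
Prior <- list(betabar = rep(0, 3), A = 0.01*diag(3), nu = 3, ssq = var(df$Y))
Mcmc <- list(R = 2000, keep = 1)
out <- runireg(Data = Data, Prior = Prior, Mcmc = Mcmc)
# posterior summaries would then come from the stored draws, e.g.
# colMeans(out$betadraw); quantile(out$sigmasqdraw, c(.025, .975))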

Inverted Wishart distribution

The form of the likelihood suggests that the natural conjugate (convenient) prior for Σ would be of the Inverted Wishart form,

denoted Σ ~ IW(ν, V)

- ν -- tightness (degrees of freedom)
- V -- location

however, as V increases, the spread also increases

limitations: i. small ν -- thick tails; ii. only one tightness parameter
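A minimal sketch of drawing Σ ~ IW(ν, V) with base R, under the assumption that Σ ~ IW(ν, V) is defined by Σ^-1 ~ Wishart(ν, V^-1) (parameterization conventions for the inverted Wishart vary):

set.seed(1)
m <- 3
nu <- m + 3                                             # illustrative tightness (degrees of freedom)
V <- diag(m)                                            # illustrative location parameter
Winv <- rWishart(1, df = nu, Sigma = solve(V))[, , 1]   # draw Sigma^-1 ~ Wishart(nu, V^-1)
Sigma <- solve(Winv)                                    # one draw from IW(nu, V)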

Drawing from Posterior: rmultireg

rmultireg=

function(Y,X,Bbar,A,nu,V) {
# (fragment of the function body: n = nrow(Y), m = ncol(Y), k = ncol(X) are defined earlier in the full function)

RA=chol(A)

W=rbind(X,RA)

Z=rbind(Y,RA%*%Bbar)

# note: Y,X,A,Bbar must be matrices!

IR=backsolve(chol(crossprod(W)),diag(k))

# W'W = R'R & (W'W)^-1 = IR IR' -- this is the UL decomp!

Btilde=crossprod(t(IR))%*%crossprod(W,Z)

# IR IR'(W'Z) = (X'X+A)^-1(X'Y + A Bbar)

S=crossprod(Z-W%*%Btilde)

#

rwout=rwishart(nu+n,chol2inv(chol(V+S)))

#

# now draw B given Sigma; note vec(B) ~ N(vec(Btilde), Sigma (x) Cov)

# Cov = (X'X + A)^-1 = IR t(IR)

# Sigma = CI CI'

# therefore, cov(vec(B)) = Omega = CI CI' (x) IR IR' = (CI (x) IR)(CI (x) IR)'

# so to draw B we set vec(B) = vec(Btilde) + (CI (x) IR) vec(Z_mk)

# Z_mk is a k x m matrix of N(0,1) draws

# since vec(ABC) = (C' (x) A) vec(B), we have

# B = Btilde + IR Z_mk CI'

#

B = Btilde + IR%*%matrix(rnorm(m*k),ncol=m)%*%t(rwout$CI)

# return the two draws; rwout$IW is the inverted Wishart draw of Sigma

list(B=B,Sigma=rwout$IW)

}
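A hedged usage sketch, assuming rwishart() from the bayesm package (used in the draw above) and that the completed function is called with matrix arguments as noted in the comments; the data and hyperparameters are purely illustrative:

library(bayesm)                                    # provides rwishart(), used inside rmultireg
set.seed(1)
n <- 100; m <- 2; k <- 2
X <- cbind(1, rnorm(n))                            # illustrative design matrix
B_true <- matrix(c(1, 2, -1, 0.5), nrow = k)
Y <- X %*% B_true + matrix(rnorm(n*m), ncol = m)   # illustrative multivariate responses
out <- rmultireg(Y, X, Bbar = matrix(0, k, m), A = 0.01*diag(k), nu = m + 3, V = diag(m))
out$B                                              # one posterior draw of the coefficient matrix
out$Sigma                                          # one posterior draw of the error covariance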

Conjugacy is Fragile!

SUR (Seemingly Unrelated Regressions):

a set of regressions “related” through correlated errors.

BUT, there is no joint conjugate prior!!
