Bayesian Essentials and Bayesian Regression

Distribution Theory 101

  • Marginal and Conditional Distributions:

[Figure: joint distribution of (X, Y) on the unit square, with its uniform marginal and conditional densities indicated]


Simulating from Joint

  • To draw from the joint:

    • i. draw from the marginal on X

    • ii. condition on this draw, and draw from the conditional of Y|X
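A minimal sketch of this two-step draw in R, using an assumed illustrative joint (X uniform on (0,1), Y|X uniform on (0,X)) rather than the figure's exact example:

set.seed(1)
R <- 10000
x <- runif(R)           # i. draw from the marginal of X
y <- runif(R, 0, x)     # ii. draw from the conditional of Y|X, here Unif(0, X)
joint <- cbind(x, y)    # each row is one draw from the joint of (X, Y)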


The Goal of Inference

  • Make inferences about unknown quantities using available information.

  • Inference -- make probability statements

  • unknowns --

    • parameters, functions of parameters, states or latent variables, “future” outcomes, outcomes conditional on an action

  • Information –

    • data-based

    • non data-based

      • theories of behavior; “subjective views” (e.g., that there is an underlying structure)

      • parameters are finite or in some range


Data Aspects of Marketing Problems

  • Ex: Conjoint Survey

    • 500 respondents rank, rate, or choose among product configurations.

    • Small Amount of Information per respondent

    • Response Variable Discrete

  • Ex: Retail Scanning Data

    • very large number of products

    • large number of geographical units (markets, zones, stores)

    • limited variation in some marketing mix vars

    • Must make plausible predictions for decision making!


The likelihood principle

Note: any function proportional to the data density can be called the likelihood.

  • LP: the likelihood contains all information relevant for inference. That is, as long as I have same likelihood function, I should make the same inferences about the unknowns.

  • Implies that the analysis is done conditional on the data.


Bayes theorem

  • p(|D)  p(D| ) p()

  • Posterior ∝ “Likelihood” × Prior

  • Modern Bayesian computing – simulation methods for generating draws from the posterior distribution p(θ|D).


Summarizing the posterior

Output from Bayesian inference: the posterior p(θ|D), a high-dimensional distribution.

Summarize this object via simulation:

  • marginal distributions of individual parameters and of functions of interest

  • don’t just compute a point estimate such as the posterior mean


Prediction

Observe D, then compute:

p(ỹ | D) = ∫ p(ỹ | θ) p(θ | D) dθ

the “Predictive Distribution”


Decision theory

  • Loss: L(a,) where a=action; =state of nature

  • Bayesian decision theory:

  • Estimation problem is a special case:


Sampling properties of Bayes estimators

An estimator is admissible if there exists no other estimator with lower risk for all values of θ. The Bayes estimator minimizes expected (average) risk, which implies that it is admissible:

The Bayes estimator does the best for every D. Therefore, it must work at least as well as any other estimator.


Bayes Inference: Summary

  • Bayesian Inference delivers an integrated approach to:

    • Inference – including “estimation” and “testing”

    • Prediction – with a full accounting for uncertainty

    • Decision – with likelihood and loss (these are distinct!)

  • Bayesian Inference is conditional on available info.

  • The right answer to the right question.

  • Bayes estimators are admissible. All admissible estimators are Bayes (Complete Class Thm). Which Bayes estimator?


Bayes/Classical Estimators

Asymptotically, the prior washes out – it acts as locally uniform!!! Bayes is consistent unless you have a dogmatic prior.


Benefits/Costs of Bayes Inf

Benefits –

  • finite-sample answer to the right question

  • full accounting for uncertainty

  • integrated approach to inference/decision

“Costs” –

  • computational (true any more? less than classical!!)

  • prior (cost or benefit?), esp. with many parms (hierarchical/non-parametric problems)


Bayesian Computations

Before simulation methods, Bayesians used posterior expectations of various functions as the summary of the posterior:

E[h(θ) | D] = ∫ h(θ) p(θ|D) dθ

If p(θ|D) is in a convenient form (e.g. normal), then I might be able to compute this analytically for some h. Via iid simulation, it can be computed for any h.
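A minimal sketch of this idea in R, assuming a hypothetical normal posterior purely for illustration:

theta <- rnorm(100000, mean = 1, sd = 0.5)   # iid draws from an assumed posterior theta|D ~ N(1, 0.5^2)
h <- function(x) exp(x)                      # any function h of interest
mean(h(theta))                               # Monte Carlo estimate of E[h(theta)|D]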


Conjugate Families

  • Models with convenient analytic properties almost invariably come from conjugate families.

  • Why do I care now?

    • conjugate models are used as building blocks

    • build intuition about how Bayesian inference functions

  • Definition:

  • A prior is conjugate to a likelihood if the posterior is in the same class of distributions as prior.

  • Basically, conjugate priors are like the posterior from some imaginary dataset with a diffuse prior.


    Beta-Binomial model

    Likelihood: s successes in n Bernoulli trials, p(s | θ) ∝ θ^s (1 − θ)^(n−s).

    Need a prior! Take θ ~ Beta(a, b), p(θ) ∝ θ^(a−1) (1 − θ)^(b−1); the posterior is then θ | s ~ Beta(a + s, b + n − s).
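    A minimal sketch of Beta-Binomial posterior simulation in R (the counts and prior hyperparameters below are illustrative assumptions):

    n <- 20; s <- 7       # assumed data: s successes in n trials
    a <- 1; b <- 1        # Beta(1,1) (uniform) prior on theta
    theta <- rbeta(10000, a + s, b + n - s)    # posterior is Beta(a+s, b+n-s)
    quantile(theta, c(0.025, 0.5, 0.975))      # posterior summary via simulation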





    Regression model

    The regression model: y = Xβ + ε, ε ~ N(0, σ²I). Is this model complete? For non-experimental data, don’t we need a model for the joint distribution of y and x?


    Regression model

    Factor the joint distribution of y and x as p(y, x | β, σ, Ψ) = p(y | x, β, σ) p(x | Ψ):

    • simultaneous systems are not written this way!

    • rules out x = f(β)!!!

    If Ψ is a priori indep of (β,σ), the posterior factors into two separate analyses, and we can analyze the regression conditioning on x.


    Conjugate Prior

    What is the conjugate prior? It comes from the form of the likelihood function. Here we condition on X.

    The quadratic form in β suggests a normal prior.

    Let’s complete the square on β, or rewrite by projecting y on X (the column space of X).
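    For reference, a sketch of the standard identity behind completing the square on β (not verbatim from the slide), with β̂ the least-squares estimate:

    $$(y - X\beta)'(y - X\beta) = (\beta - \hat\beta)'X'X(\beta - \hat\beta) + (y - X\hat\beta)'(y - X\hat\beta), \qquad \hat\beta = (X'X)^{-1}X'y,$$

    the cross term vanishing because $X'(y - X\hat\beta) = 0$.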


    Geometry of regression

    [Figure: y projected onto the column space of X spanned by x1 and x2; the fitted vector is β̂1·x1 + β̂2·x2]


    Traditional regression

    • No one ever computes a matrix inverse directly.

    • Two numerically stable methods (sketched after this list):

      • QR decomposition of X

      • Cholesky root of X’X and compute inverse using root

    • Non-Bayesians have to worry about singularity or near-singularity of X’X. We don’t! (more later)
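    A small self-contained sketch of both routes in R, using an assumed simulated design and response:

    set.seed(1)
    n <- 100
    X <- cbind(1, rnorm(n), rnorm(n))       # illustrative design matrix with intercept
    y <- X %*% c(1, 2, -1) + rnorm(n)       # illustrative response
    b_qr   <- qr.coef(qr(X), y)                                  # via the QR decomposition of X
    b_chol <- chol2inv(chol(crossprod(X))) %*% crossprod(X, y)   # via the Cholesky root of X'X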


    Cholesky Roots

    In Bayesian computations, the fundamental matrix operation is the Cholesky root. chol() in R

    The Cholesky root is the generalization of the square root applied to positive definite matrices.

    As Bayesians with proper priors, we don’t ever have to worry about singular matrices!

    U is upper triangular with positive diagonal elements and satisfies Σ = U’U (this is what chol() returns). U⁻¹ is easy to compute by recursively solving TU = I for T, backsolve() in R.


    Cholesky Roots

    Cholesky roots can be useful to simulate from Multivariate Normal Distribution.

    To simulate a matrix of draws from MVN (each row is a separate draw) in R,

    Y=matrix(rnorm(n*k),ncol=k)%*%chol(Sigma)

    Y=t(t(Y)+mu)
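    A self-contained version of the same two lines, with assumed illustrative values for mu and Sigma:

    set.seed(1)
    n <- 10000; k <- 2
    mu <- c(1, -1)
    Sigma <- matrix(c(1, 0.5, 0.5, 2), ncol = 2)
    Y <- matrix(rnorm(n*k), ncol = k) %*% chol(Sigma)   # rows ~ N(0, Sigma), since Sigma = U'U
    Y <- t(t(Y) + mu)                                   # shift each row by mu
    colMeans(Y); var(Y)                                 # should be close to mu and Sigma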


    Regression with R

    data.txt:

    UNIT Y X1 X2

    A 1 0.23815 0.43730

    A 2 0.55508 0.47938

    A 3 3.03399 -2.17571

    A 4 -1.49488 1.66929

    B 10 -1.74019 0.35368

    B 9 1.40533 -1.26120

    B 8 0.15628 -0.27751

    B 7 -0.93869 -0.04410

    B 6 -3.06566 0.14486

    df=read.table("data.txt",header=TRUE)

    myreg=function(y,X){

    #

    # purpose: compute lsq regression

    #

    # arguments:

    # y -- vector of dep var

    # X -- array of indep vars

    #

    # output:

    # list containing lsq coef and std errors

    #

    XpXinv=chol2inv(chol(crossprod(X)))

    bhat=XpXinv%*%crossprod(X,y)

    res=as.vector(y-X%*%bhat)

    ssq=as.numeric(res%*%res/(nrow(X)-ncol(X)))

    se=sqrt(diag(ssq*XpXinv))

    list(b=bhat,std_errors=se)

    }

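    A hypothetical call using data.txt and the df read in above:

    X <- cbind(1, df$X1, df$X2)    # add an intercept column
    out <- myreg(df$Y, X)
    out$b                          # least-squares coefficients
    out$std_errors                 # standard errors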


    Regression likelihood

    p(y | X, β, σ²) ∝ (σ²)^(−n/2) exp{ −(y − Xβ)’(y − Xβ) / (2σ²) } = (σ²)^(−n/2) exp{ −[νs² + (β − β̂)’X’X(β − β̂)] / (2σ²) }

    where β̂ = (X’X)⁻¹X’y, νs² = (y − Xβ̂)’(y − Xβ̂), and ν = n − k.


    Regression likelihood

    The factor in σ² is called an inverted gamma distribution. It can also be related to the inverse of a Chi-squared distribution (a scaled χ²_ν draw in the denominator: σ² = νs²/χ²_ν).

    Note that the conjugate prior suggested by the form of the likelihood has a prior on β which depends on σ.


    Bayesian Regression

    Prior: β | σ² ~ N(β̄, σ²A⁻¹), with σ² ~ νs̄²/χ²_ν.

    Interpretation: as if from another (imaginary) dataset.

    Inverted Chi-Square: σ² is a scaled inverse of a χ²_ν draw.

    Draw from the prior? Draw a χ²_ν variate and set σ² = νs̄²/χ²_ν; then draw β | σ² ~ N(β̄, σ²A⁻¹).
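    A minimal sketch of one draw from this prior in R (hyperparameter values are illustrative assumptions):

    nu <- 5; ssq <- 1
    betabar <- c(0, 0)
    A <- 0.01 * diag(2)                                  # prior precision (scaled by 1/sigmasq)
    sigmasq <- nu * ssq / rchisq(1, nu)                  # sigmasq ~ nu*ssq/chisq_nu
    beta <- betabar + sqrt(sigmasq) * backsolve(chol(A), rnorm(2))   # beta | sigmasq ~ N(betabar, sigmasq*A^-1)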





    IID Simulations

    Scheme: [y | X, β, σ²] [β | σ²] [σ²]

    1) Draw [σ² | y, X]

    2) Draw [β | σ², y, X]

    3) Repeat
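    A self-contained sketch of one such draw in R (the simulated data and prior hyperparameters below are purely illustrative; runireg on a later slide implements the same draws via an augmented regression):

    set.seed(1)
    n <- 50; k <- 2
    X <- cbind(1, rnorm(n))
    y <- X %*% c(1, 2) + rnorm(n)
    betabar <- rep(0, k); A <- 0.01 * diag(k); nu <- 3; ssq <- 1
    btilde <- solve(crossprod(X) + A, crossprod(X, y) + A %*% betabar)
    s <- sum((y - X %*% btilde)^2) + t(btilde - betabar) %*% A %*% (btilde - betabar)
    sigmasq <- as.numeric((nu * ssq + s) / rchisq(1, nu + n))                      # 1) draw sigmasq | y, X
    beta <- btilde + sqrt(sigmasq) * backsolve(chol(crossprod(X) + A), rnorm(k))   # 2) draw beta | sigmasq, y, X
    # repeating 1)-2) many times gives iid draws from the joint posterior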



    Bayes Estimator

    The Bayes Estimator is the posterior mean of β: E[β | y, X] = β̃ = (X’X + A)⁻¹(X’y + Aβ̄).

    The marginal posterior on β is a multivariate Student t.

    Who cares?


    Shrinkage and Conjugate Priors

    The Bayes Estimator is the posterior mean of β, which can be written as β̃ = (X’X + A)⁻¹(X’Xβ̂ + Aβ̄), a matrix-weighted average of the least-squares estimate β̂ and the prior mean β̄.

    This is a “shrinkage” estimator: it pulls β̂ toward β̄.

    Is this reasonable?


    Assessing Prior Hyperparameters

    These determine prior location and spread for both coefs and error variance.

    It has become customary to assess a “diffuse” prior:

    This can be problematic. Var(y) might be a better choice for the prior scale of σ².


    Improper or “non-informative” priors

    Classic “non-informative” prior (improper): p(β, σ²) ∝ 1/σ²

    • Is this “non-informative”?

      • Of course not, it says that β is large with high prior “probability”

    • Is this wise computationally?

      • No, I have to worry about singularity in X’X

    • Is this a good procedure?

      • No, it is not admissible. Shrinkage is good!


    runireg

    runireg=
    function(Data,Prior,Mcmc){
    #
    # purpose:
    #   draw from posterior for a univariate regression model with natural conjugate prior
    #
    # Arguments:
    #   Data  -- list of data: y, X
    #   Prior -- list of prior hyperparameters:
    #            betabar, A  (prior mean, prior precision)
    #            nu, ssq     (prior on sigmasq)
    #   Mcmc  -- list of MCMC parms:
    #            R    (number of draws)
    #            keep (thinning parameter)
    #
    # output:
    #   list of beta, sigmasq draws
    #   beta is k x 1 vector of coefficients
    #
    # model:
    #   Y = Xbeta + e   var(e_i) = sigmasq
    # priors:
    #   beta | sigmasq ~ N(betabar, sigmasq*A^-1)
    #   sigmasq ~ (nu*ssq)/chisq_nu


    runireg

    RA=chol(A)
    W=rbind(X,RA)
    z=c(y,as.vector(RA%*%betabar))
    IR=backsolve(chol(crossprod(W)),diag(k))
    # W'W = R'R ; (W'W)^-1 = IR IR' -- this is the UL decomp
    btilde=crossprod(t(IR))%*%crossprod(W,z)
    res=z-W%*%btilde
    s=t(res)%*%res
    #
    # first draw sigmasq
    #
    sigmasq=(nu*ssq + s)/rchisq(1,nu+n)
    #
    # now draw beta given sigmasq
    #
    beta = btilde + as.vector(sqrt(sigmasq))*IR%*%rnorm(k)
    list(beta=beta,sigmasq=sigmasq)
    }
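    A hypothetical call, assuming the full function unpacks y, X and the hyperparameters from its list arguments as documented in the header (the fragment above shows only the draw step and returns a single draw):

    Data  <- list(y = df$Y, X = cbind(1, df$X1, df$X2))
    Prior <- list(betabar = rep(0, 3), A = 0.01 * diag(3), nu = 3, ssq = 1)
    Mcmc  <- list(R = 1000, keep = 1)
    out <- runireg(Data, Prior, Mcmc)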





    Inverted Wishart distribution

    The form of the likelihood suggests that the natural conjugate (convenient prior) for Σ would be of the Inverted Wishart form:

    denoted Σ ~ IW(ν, V)

    • ν – tightness

    • V – location

    • however, as V increases, the spread also increases

    limitations: i. small ν – thick tails  ii. only one tightness parm
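    A brief sketch of drawing Σ from an Inverted Wishart using rwishart from the bayesm package (ν and V below are illustrative; rwishart draws W ~ Wishart and returns its inverse as $IW, so the Wishart scale passed in is the inverse of V, mirroring the rmultireg code on the next slide):

    library(bayesm)
    nu <- 10
    V  <- diag(3)
    out <- rwishart(nu, chol2inv(chol(V)))   # draws W ~ Wishart(nu, V^-1) and IW = W^-1
    Sigma_draw <- out$IW                     # one draw from the Inverted Wishart with location governed by V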




    Drawing from Posterior: rmultireg

    rmultireg=

    function(Y,X,Bbar,A,nu,V){
    n=nrow(Y); m=ncol(Y); k=ncol(X)

    RA=chol(A)

    W=rbind(X,RA)

    Z=rbind(Y,RA%*%Bbar)

    # note: Y,X,A,Bbar must be matrices!

    IR=backsolve(chol(crossprod(W)),diag(k))

    # W'W = R'R & (W'W)^-1 = IRIR' -- this is the UL decomp!

    Btilde=crossprod(t(IR))%*%crossprod(W,Z)

    # IRIR'(W'Z) = (X'X+A)^-1(X'Y + ABbar)

    S=crossprod(Z-W%*%Btilde)

    #

    rwout=rwishart(nu+n,chol2inv(chol(V+S)))

    #

    # now draw B given Sigma note beta ~ N(vec(Btilde),Sigma (x) Cov)

    # Cov=(X'X + A)^-1 = IR t(IR)

    # Sigma=CICI'

    # therefore, cov(beta)= Omega = CICI' (x) IR IR' = (CI (x) IR) (CI (x) IR)'

    # so to draw beta we do beta= vec(Btilde) +(CI (x) IR)vec(Z_mk)

    # Z_mk is m x k matrix of N(0,1)

    # since vec(ABC) = (C' (x) A)vec(B), we have

    # B = Btilde + IR Z_mk CI'

    #

    B = Btilde + IR%*%matrix(rnorm(m*k),ncol=m)%*%t(rwout$CI)
    list(B=B,Sigma=rwout$IW)
    }


    Conjugacy is Fragile!

    SUR (Seemingly Unrelated Regressions):

    a set of regressions, yᵢ = Xᵢβᵢ + εᵢ, “related” via errors that are correlated across equations

    BUT, no joint conjugate prior!!

