
Additional Topics in Prediction Methodology


robyn


Presentation Transcript


  1. Additional Topics in Prediction Methodology

  2. Introduction • The predictive distribution for the random variable Y0 is meant to capture all the information about Y0 that is contained in Yn. • It does not completely specify Y0, but it does provide a probability distribution of more likely and less likely values of Y0. • E{Y0|Yn} is the best MSPE predictor of Y0.

  3. Hierarchical models have two stages • X ⊂ R^d • f0 = f(x0): known p×1 vector • F = (f_j(x_i)): known n×p matrix • β: unknown p×1 vector of regression coefficients • R = (R(x_i - x_j)): known n×n matrix of correlations among the training data Yn • r0 = (R(x_i - x0)): known n×1 vector of correlations of Y0 with Yn

  4. Predictive Distributions when Z2, R and r0 are known

  5. Interesting features of (a) and (b) • The non-informative prior is the limit of the normal prior as τ² → ∞ • Although the non-informative prior is not a proper distribution, the corresponding predictive distribution is proper. • The same conditioning argument can be applied to derive the posterior mean for both the non-informative prior and the normal prior.

  6. The mean and variance of the predictive distribution (mean) • μ_{0|n}(x0) and σ_{0|n}(x0) depend on x0 only through the regression function f0 and the correlation vector r0 • μ_{0|n}(x0) is a linear unbiased predictor of Y(x0) • The continuity and other smoothness properties of μ_{0|n}(x0) are inherited from the correlation function R(·) and the regressors {f_j(·)}, j = 1, …, p

  7. μ_{0|n}(x0) depends on the parameters σ_Z² and τ² only through their ratio • μ_{0|n}(x0) interpolates the training data: when x0 = x_i, f0 = f(x_i) and r0ᵀR⁻¹ = e_iᵀ, where e_i is the i-th unit vector.

  8. The mean and variance of the predictive distribution (variance) • MSPE(μ_{0|n}(x0)) = σ²_{0|n}(x0) • The variance of the posterior of Y(x0) given Yn should be 0 whenever x0 = x_i, and indeed σ²_{0|n}(x_i) = 0
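The interpolation and zero-variance properties on slides 7 and 8 can be checked numerically. The following is a minimal sketch, not the presenter's code: it assumes one-dimensional inputs, a constant regression function (p = 1), a non-informative prior on the regression coefficient, and a Gaussian correlation function. The names `gauss_corr`, `blup` and `theta` are illustrative, and the returned variance is expressed in units of the process variance.

```python
import numpy as np

def gauss_corr(a, b, theta=25.0):
    """Assumed Gaussian correlation R(h) = exp(-theta * h^2) for 1-D inputs."""
    return np.exp(-theta * (a[:, None] - b[None, :]) ** 2)

def blup(x_train, y_train, x0, theta=25.0):
    """Predictive mean and variance (divided by the process variance) at x0."""
    n = len(x_train)
    F = np.ones((n, 1))                  # constant regression function, p = 1
    f0 = np.ones((len(x0), 1))
    R = gauss_corr(x_train, x_train, theta)
    r0 = gauss_corr(x_train, x0, theta)  # n x m matrix, one column per x0
    Rinv_F = np.linalg.solve(R, F)
    A = F.T @ Rinv_F                     # F' R^{-1} F
    # generalized least squares estimate of the regression coefficient
    beta_hat = np.linalg.solve(A, F.T @ np.linalg.solve(R, y_train))
    resid = y_train - F @ beta_hat
    mean = f0 @ beta_hat + r0.T @ np.linalg.solve(R, resid)
    Rinv_r0 = np.linalg.solve(R, r0)
    h = f0.T - F.T @ Rinv_r0             # p x m
    var = (1.0 - np.sum(r0 * Rinv_r0, axis=0)
           + np.sum(h * np.linalg.solve(A, h), axis=0))
    return mean, var

x_train = np.linspace(0.0, 1.0, 6)
y_train = np.sin(2.0 * np.pi * x_train)   # toy "computer code" output
mean_at_train, var_at_train = blup(x_train, y_train, x_train)
mean_new, var_new = blup(x_train, y_train, np.array([0.5]))
```

At the training inputs the predictor should reproduce `y_train` and the variance should vanish up to rounding, while at the untried input 0.5 the variance stays positive.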

  9. Most important use of Theorem 4.1.1

  10. Predictive Distributions when R and r0 are known • The posterior is a location-shifted and scaled univariate t distribution whose degrees of freedom are increased when there is informative prior information for either β or σ_Z²

  11. Degrees of freedom • Base value for the degrees of freedom: ν_i = n - p • p additional degrees of freedom when the prior on β is informative • ν0 additional degrees of freedom when the prior on σ_Z² is informative

  12. Location shift • The same centering value as in Theorem 4.1.1 (known σ_Z²) • The non-informative prior gives the BLUP

  13. Scale factor σ_i²(x0) (compare 4.1.15 with 4.1.6) • It is an estimate of the scale factor σ²_{0|n}(x0). • Q_i²/ν_i: estimate of σ_Z² • Q_i²: combines information about σ_Z² from the conditional distribution of Yn given σ_Z² with information from the prior of σ_Z² • σ_i²(x_i) = 0 when x_i is any of the training data points.

  14. Predictive Distributions when correlation parameters are unknown • What if the correlations among the observations are unknown (R and r0 are unknown)? • Assume y(·) has a Gaussian prior with correlation function R(·|ψ), where ψ is an unknown vector of parameters • Two issues • The standard error of the plug-in predictor μ_{0|n}(x0|ψ̂), obtained by substituting an estimate ψ̂ from MLE or REML • The Bayesian approach to uncertainty in ψ, which models it by a prior distribution
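The plug-in step on slide 14 can be sketched as a grid search over the profile likelihood of a single correlation parameter. This is only an illustration under stated assumptions: one-dimensional inputs, a constant mean, and an exponential correlation exp(-theta*|h|) chosen here because it keeps the correlation matrix well conditioned. The function name `neg_profile_loglik`, the grid, and the toy data are hypothetical and not taken from the slides.

```python
import numpy as np

def neg_profile_loglik(theta, x, y):
    """n*log(sigma2_hat) + log|R(theta)|, i.e. -2 * profile log-likelihood up to a constant."""
    n = len(x)
    R = np.exp(-theta * np.abs(x[:, None] - x[None, :]))  # assumed correlation family
    F = np.ones((n, 1))                                    # constant mean
    Rinv_F = np.linalg.solve(R, F)
    beta_hat = np.linalg.solve(F.T @ Rinv_F, F.T @ np.linalg.solve(R, y))
    resid = y - F @ beta_hat
    sigma2_hat = resid @ np.linalg.solve(R, resid) / n     # MLE of the process variance
    _, logdet = np.linalg.slogdet(R)
    return n * np.log(sigma2_hat) + logdet

x = np.linspace(0.0, 1.0, 8)
y = np.sin(2.0 * np.pi * x)                # toy training responses
grid = np.logspace(0.0, 2.0, 40)           # candidate theta values from 1 to 100
values = np.array([neg_profile_loglik(t, x, y) for t in grid])
theta_hat = grid[np.argmin(values)]        # plug-in estimate used in the predictor
```

The plug-in predictor then treats `theta_hat` as if it were known, which is why its standard error understates the uncertainty that the Bayesian approach models through a prior.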

  15. Prediction of Multiple Response Models • Several outputs are available from a computer experiment • Several codes are available for computing the same response (fast and slow codes) • Competing responses • Several stochastic models for the joint response • Use these models to describe the optimal predictor for one of the several computed responses.

  16. Modeling Multiple Outputs • Z_i(·): marginally mean-zero stationary Gaussian stochastic processes with unknown variance σ_i² and correlation function R_i • Stationarity of Z_i(·) implies that the correlation between Z_i(x1) and Z_i(x2) depends only on x1 - x2 • Assume Cov(Z_i(x1), Z_j(x2)) = σ_i σ_j R_ij(x1 - x2) • R_ij(·): cross-correlation function of Z_i(·) and Z_j(·) • Linear model: global mean of the Y_i process • f_i(·): known regression functions • β_i: unknown regression parameters

  17. Selecting correlation and cross-correlation functions is complicated • Reason: for any input sites x_l^i, the multivariate normal random vector (Z_1(x_1^1), …)ᵀ must have a nonnegative definite covariance matrix • Solution: construct the Z_i(·) from a set of elementary processes (usually these processes are mutually independent)

  18. Example by Kennedy and O’Hagan • Y_i(x): prior for the i-th code level (i = m is the top-level code). The autoregressive model: • Y_i(x) = ρ_{i-1} Y_{i-1}(x) + δ_i(x), i = 2, …, m • The output of each successively higher-level code i at x is related to the output of the less precise code i-1 at x plus the refinement δ_i(x) • Cov(Y_i(x), Y_{i-1}(w) | Y_{i-1}(x)) = 0 for all w ≠ x • No additional second-order knowledge of code i at x can be obtained from the lower-level code i-1 if the value of code i-1 at x is known (Markov property on the hierarchy of codes) • Since there is no natural hierarchy of computer codes in such applications, we need to find something better.
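For two code levels, the autoregressive construction above determines the joint covariance of (Y_1, Y_2), and both the nonnegative definiteness required on slide 17 and the Markov property can be checked numerically. The sketch below assumes that Y_1 and the refinement δ_2 are independent Gaussian processes with Gaussian correlation functions; the parameter values and the names `corr`, `ar1_joint_cov`, `rho`, `s1`, `sd`, `th1`, `thd` are illustrative only.

```python
import numpy as np

def corr(a, b, theta):
    """Assumed Gaussian correlation exp(-theta * h^2) for 1-D inputs."""
    return np.exp(-theta * (a[:, None] - b[None, :]) ** 2)

def ar1_joint_cov(x, rho=0.8, s1=1.0, sd=0.5, th1=5.0, thd=20.0):
    """Joint covariance of (Y1(x), Y2(x)) when Y2 = rho * Y1 + delta."""
    K1 = s1 ** 2 * corr(x, x, th1)   # Cov(Y1, Y1)
    Kd = sd ** 2 * corr(x, x, thd)   # Cov(delta, delta), delta independent of Y1
    K12 = rho * K1                   # Cov(Y1, Y2)
    K22 = rho ** 2 * K1 + Kd         # Cov(Y2, Y2)
    return np.block([[K1, K12], [K12.T, K22]])

x = np.linspace(0.0, 1.0, 5)
n = len(x)
K = ar1_joint_cov(x)

# Building Y2 from independent elementary processes keeps K nonnegative definite.
min_eig = np.linalg.eigvalsh(K).min()

# Markov property: Cov(Y2(x_i), Y1(x_j) | Y1(x_i)) = 0 for j != i.
i, j = 0, 3
cond_cov = K[n + i, j] - K[n + i, i] * K[i, j] / K[i, i]
```

Here `cond_cov` uses the usual Gaussian conditioning formula, and it vanishes because the cross-covariance is exactly `rho` times the covariance of the lower-level code.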

  19. A more reasonable model • Each constraint function is associated with the objective function plus a refinement • Y_i(x) = ρ_i Y_1(x) + δ_i(x), i = 2, …, m+1 • Ver Hoef and Barry • Form models in the environmental sciences • Include an unknown smooth surface plus a random measurement error • Moving averages over white noise processes

  20. Morris and Mitchell model • Prior information about y(x) is specified by a Gaussian process Y(·) • Prior information about the partial derivatives y^(j)(x) is obtained by considering the “derivative” processes of Y(·) • Y_1(·) = Y(·), Y_2(·) = Y^(1)(·), …, Y_{1+m}(·) = Y^(m)(·) • Natural prior for y^(j)(x): • The covariances between Y(x1), Y^(j)(x2) and Y^(i)(x1), Y^(j)(x2) are:
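The formulas for this slide did not survive in the transcript. As a hedged reconstruction, the standard results for a stationary, mean-square differentiable process with Cov(Y(x1), Y(x2)) = σ_Z² R(x1 - x2) are given below; the notation x_{1,i} for the i-th coordinate of x1 and h = x1 - x2 is an assumption and may differ from the original slide.

```latex
\begin{align*}
\operatorname{Cov}\bigl(Y(x_1),\, Y^{(j)}(x_2)\bigr)
  &= \sigma_Z^2 \,\frac{\partial R(x_1 - x_2)}{\partial x_{2,j}}
   = -\,\sigma_Z^2 \,\frac{\partial R(h)}{\partial h_j}\bigg|_{h = x_1 - x_2}, \\[4pt]
\operatorname{Cov}\bigl(Y^{(i)}(x_1),\, Y^{(j)}(x_2)\bigr)
  &= \sigma_Z^2 \,\frac{\partial^2 R(x_1 - x_2)}{\partial x_{1,i}\,\partial x_{2,j}}
   = -\,\sigma_Z^2 \,\frac{\partial^2 R(h)}{\partial h_i\,\partial h_j}\bigg|_{h = x_1 - x_2}.
\end{align*}
```

Both follow from differentiating the covariance function with respect to the relevant input coordinates, which is valid when R is twice differentiable at the origin.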

  21. Optimal Predictors for Multiple Outputs • The best MSPE predictor based on the training data is E{Y0 | Y_1^{n1} = y_1^{n1}, …, Y_m^{nm} = y_m^{nm}} • Here Y0 = Y_1(x0), Y_i^{ni} = (Y_i(x_1^i), …), and y_i^{ni} is its observed value, for i = 1, …, m

  22. The joint distribution is the multivariate normal distribution
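Because the joint distribution is multivariate normal, the best MSPE predictor has a closed form. The block below states the standard conditioning result in generic notation that is assumed here rather than taken from the slide: Y^n stacks all training outputs, μ_0 and μ_n are the prior means, and the Σ blocks partition the joint covariance matrix.

```latex
\begin{align*}
\begin{pmatrix} Y_0 \\ Y^n \end{pmatrix}
  &\sim N\!\left(
     \begin{pmatrix} \mu_0 \\ \mu_n \end{pmatrix},
     \begin{pmatrix} \sigma_0^2 & \Sigma_{0n} \\ \Sigma_{n0} & \Sigma_{nn} \end{pmatrix}
   \right), \\[4pt]
E\{Y_0 \mid Y^n = y^n\}
  &= \mu_0 + \Sigma_{0n}\,\Sigma_{nn}^{-1}\,(y^n - \mu_n), \\[4pt]
\operatorname{Var}\{Y_0 \mid Y^n = y^n\}
  &= \sigma_0^2 - \Sigma_{0n}\,\Sigma_{nn}^{-1}\,\Sigma_{n0}.
\end{align*}
```

The blocks Σ_{0n} and Σ_{nn} are where the marginal correlations, cross-correlations, and process variances enter the predictor.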

  23. Conditional expectation ….. • In practice, this predictor cannot be used directly (it requires knowledge of the marginal correlation functions, the joint correlation functions, and the ratios of all the process variances) • Empirical versions are of practical use: • Assume that each of the correlation matrices R_i and cross-correlation matrices R_ij is known up to a vector of parameters ψ. • Estimate ψ using MLE or REML

  24. Example 1 • The 14-point training design is space-filling, which allows us to learn over the entire input space • Compare two models • The predictor of y(·) based on y(·) alone • The predictor of y(·) based on (y(·), y^(1)(·), y^(2)(·)) • The second gives a visually better fit and has a 24% smaller ERMSPE

  25. Thank you!
