
Predicting Output from Computer Experiments



Presentation Transcript


  1. Predicting Output from Computer Experiments Design and Analysis of Computer Experiments Chapter 3 Kevin Leyton-Brown

  2. Overview
  • Overall program in this chapter:
    • predict the output of a computer simulation
    • we're going to review approaches to regression, looking for various kinds of optimality
  • First, we'll talk about just predicting our random variable (§3.2)
    • note: in this setting, we have no "features"
  • Then, we'll consider the inclusion of features in our predictions, based on features in our training data (§3.2, 3.3)
  • In the end, we'll apply these ideas to computer experiments (§3.3)
  • Not covered:
    • an empirical evaluation of seven EBLUPs on small-sample data (§3.3, pp. 69-81)
    • proofs of some esoteric BLUP theorems (§3.4, pp. 82-84)
  • If you've done the reading, you already know:
    • the difference between "minimum MSPE linear unbiased predictors" and BLUPs
    • three different "intuitive" interpretations of r₀ᵀR⁻¹(Yⁿ − Fβ̂)
    • a lot about statistics
    • whether this chapter has anything to do with computer experiments
  • If you haven't, you're in for a treat

  3. Predictors
  • Y₀ is our random variable; our data is Yⁿ = (Y₁, …, Yₙ)ᵀ
    • no "features" – just predict one response from the others
  • A generic predictor Ŷ₀ predicts Y₀ based on Yⁿ
  • There are three kinds of predictors discussed:
    • "Predictors": Ŷ₀(Yⁿ) has unrestricted functional form
    • "Linear predictors" (LPs): Ŷ₀ = a₀ + Σᵢ₌₁ⁿ aᵢYᵢ = a₀ + aᵀYⁿ
    • "Linear unbiased predictors" (LUPs): again, linear predictors Ŷ₀ = a₀ + aᵀYⁿ; furthermore, "unbiased" with respect to a given family F of distributions for (Y₀, Yⁿ)
  • Definition: a predictor Ŷ₀ is unbiased for Y₀ with respect to the class of distributions F over (Y₀, Yⁿ) if, for all F ∈ F, E_F{Ŷ₀} = E_F{Y₀}
    • E_F denotes expectation under the F(·) distribution for (Y₀, Yⁿ)
    • this definition depends on F: a linear predictor is unbiased with respect to a class
    • as F gets bigger, the set of LUPs gets weakly smaller

  4. LUP Example 1
  • Suppose that Yᵢ = β₀ + εᵢ, where εᵢ ~ N(0, σ²), σ² > 0.
  • Define F as those distributions in which
    • β₀ is a given nonzero constant
    • σ² is unknown; we know only that σ² > 0
  • Any Ŷ₀ = a₀ + aᵀYⁿ is an LP of Y₀
  • Which are unbiased? We know that:
    • E{Ŷ₀} = E{a₀ + Σᵢ₌₁ⁿ aᵢYᵢ} = a₀ + β₀ Σᵢ₌₁ⁿ aᵢ (Eq 1)
    • and E{Y₀} = β₀ (Eq 2)
  • For our LP to be unbiased, we must have (Eq 1) = (Eq 2) for all σ²
    • since (Eq 1) and (Eq 2) are independent of σ², we just need that, given β₀, (a₀, a) satisfies a₀ + β₀ Σᵢ₌₁ⁿ aᵢ = β₀
  • Solutions:
    • a₀ = β₀ and a such that Σᵢ₌₁ⁿ aᵢ = 0 (e.g., the data-independent predictor Ŷ₀ = β₀)
    • a₀ = 0 and a such that Σᵢ₌₁ⁿ aᵢ = 1
      • e.g., the sample mean of Yⁿ is the LUP corresponding to a₀ = 0, aᵢ = 1/n
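Not from the slides: a minimal NumPy sketch (all constants are arbitrary illustrative choices) checking that both families of solutions above really are unbiased when β₀ is a known constant.

```python
import numpy as np

# Minimal sketch (not from the slides): check unbiasedness of two linear
# predictors when beta_0 is a known constant. beta0 = 2.0, sigma = 1.5,
# n = 5 are arbitrary illustrative choices.
rng = np.random.default_rng(0)
beta0, sigma, n, reps = 2.0, 1.5, 5, 200_000

Yn = beta0 + sigma * rng.standard_normal((reps, n))  # training responses
Y0 = beta0 + sigma * rng.standard_normal(reps)       # response to predict

pred_const = np.full(reps, beta0)  # a0 = beta0, all a_i = 0 (sum of a_i = 0)
pred_mean = Yn.mean(axis=1)        # a0 = 0, a_i = 1/n       (sum of a_i = 1)

# Both averages should be close to E{Y0} = beta0 = 2.0.
print(pred_const.mean(), pred_mean.mean(), Y0.mean())
```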

  5. LUP Example 2
  • Suppose that Yᵢ = β₀ + εᵢ, where εᵢ ~ N(0, σ²), σ² > 0.
  • Define F as those distributions in which
    • β₀ is an unknown real constant
    • σ² is unknown; we know only that σ² > 0
  • Any Ŷ₀ = a₀ + aᵀYⁿ is an LP of Y₀
  • Which are unbiased? We know that:
    • E{Ŷ₀} = E{a₀ + Σᵢ₌₁ⁿ aᵢYᵢ} = a₀ + β₀ Σᵢ₌₁ⁿ aᵢ (Eq 1)
    • and E{Y₀} = β₀ (Eq 2)
  • For our LP to be unbiased, we must have (Eq 1) = (Eq 2) for all σ² and all β₀
    • since (Eq 1) and (Eq 2) are independent of σ², we just need that, for all β₀, (a₀, a) satisfies a₀ + β₀ Σᵢ₌₁ⁿ aᵢ = β₀
  • Solutions: because the constraint must hold for every β₀, we need a₀ = 0 and a such that Σᵢ₌₁ⁿ aᵢ = 1
    • e.g., the sample mean of Yⁿ is the LUP corresponding to a₀ = 0, aᵢ = 1/n
  • This illustrates that a LUP for F is also a LUP for subfamilies of F

  6. Best Mean Squared Prediction Error (MSPE) Predictors
  • Definition: MSPE(Ŷ₀, F) ≡ E_F{(Ŷ₀ − Y₀)²}
  • Definition: Ŷ₀ is a minimum MSPE predictor at F if, for any predictor Ŷ₀*, MSPE(Ŷ₀, F) ≤ MSPE(Ŷ₀*, F)
    • we'll also call this a best MSPE predictor
  • "Fundamental theorem of prediction":
    • the conditional mean of Y₀ given Yⁿ is the minimum MSPE predictor of Y₀ based on Yⁿ

  7. Best Mean Squared Prediction Error (MSPE) Predictors
  • Theorem: Suppose that (Y₀, Yⁿ) has a joint distribution F for which the conditional mean of Y₀ given Yⁿ exists. Then Ŷ₀ = E{Y₀ | Yⁿ} is the best MSPE predictor of Y₀.
  • Proof: Fix an arbitrary predictor Ŷ₀*(Yⁿ).
    • MSPE(Ŷ₀*, F) = E_F{(Ŷ₀* − Y₀)²} = E_F{(Ŷ₀* − Ŷ₀ + Ŷ₀ − Y₀)²} = E_F{(Ŷ₀* − Ŷ₀)²} + MSPE(Ŷ₀, F) + 2E_F{(Ŷ₀* − Ŷ₀)(Ŷ₀ − Y₀)} ≥ MSPE(Ŷ₀, F) + 2E_F{(Ŷ₀* − Ŷ₀)(Ŷ₀ − Y₀)} (Eq 3)
    • E_F{(Ŷ₀* − Ŷ₀)(Ŷ₀ − Y₀)} = E_F{(Ŷ₀* − Ŷ₀) E_F{(Ŷ₀ − Y₀) | Yⁿ}} = E_F{(Ŷ₀* − Ŷ₀)(Ŷ₀ − E_F{Y₀ | Yⁿ})} = E_F{(Ŷ₀* − Ŷ₀) × 0} = 0
    • Thus, MSPE(Ŷ₀*, F) ≥ MSPE(Ŷ₀, F). ∎
  • Notes:
    • Ŷ₀ = E{Y₀ | Yⁿ} is essentially the unique best MSPE predictor:
      • MSPE(Ŷ₀*, F) = MSPE(Ŷ₀, F) iff Ŷ₀* = Ŷ₀ almost everywhere
    • Ŷ₀ = E{Y₀ | Yⁿ} is always unbiased:
      • E{Ŷ₀} = E{E{Y₀ | Yⁿ}} = E{Y₀} (Why can we condition here?)

  8. Example Continued: Best MSPE Predictors
  • What is the best MSPE predictor when each Yᵢ ~ N(β₀, σ²), with β₀ a known constant?
    • Since the Yᵢ's are independent, [Y₀ | Yⁿ] ~ N(β₀, σ²)
    • Thus, Ŷ₀ = E{Y₀ | Yⁿ} = β₀
  • What if σ² is known and Yᵢ ~ N(β₀, σ²), but β₀ is unknown (i.e., we use the non-informative prior [β₀] ∝ 1)?
    • improper priors do not always give proper posteriors, but here: [Y₀ | Yⁿ = yⁿ] ~ N(ȳⁿ, σ²(1 + 1/n)), where ȳⁿ is the sample mean of the training data yⁿ
    • Thus, the best MSPE predictor of Y₀ is Ŷ₀ = (Σᵢ Yᵢ)/n
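A small Monte Carlo sketch of the second case, assuming NumPy and arbitrary illustrative values for β₀, σ², and n: the sample mean predicts Y₀ and its MSPE comes out near σ²(1 + 1/n), the predictive variance quoted above.

```python
import numpy as np

# Sketch (not from the slides): with sigma^2 known and beta_0 unknown, the
# sample mean predicts Y0 and its MSPE should be sigma^2 * (1 + 1/n).
rng = np.random.default_rng(1)
beta0, sigma, n, reps = -0.7, 2.0, 8, 500_000   # arbitrary illustrative values

Yn = beta0 + sigma * rng.standard_normal((reps, n))
Y0 = beta0 + sigma * rng.standard_normal(reps)

pred = Yn.mean(axis=1)                 # best MSPE predictor under [beta_0] ∝ 1
mspe_hat = np.mean((pred - Y0) ** 2)   # Monte Carlo estimate of the MSPE
print(mspe_hat, sigma**2 * (1 + 1/n))  # both approximately 4.5
```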

  9. Now let's dive in to Gaussian processes (uh oh…)
  • Consider the regression model from chapter 2: Yᵢ ≡ Y(xᵢ) = Σⱼ₌₁ᵖ fⱼ(xᵢ)βⱼ + Z(xᵢ) = fᵀ(xᵢ)β + Z(xᵢ)
    • each fⱼ is a known regression function
    • β is an unknown, nonzero p × 1 vector
    • Z(x) is a zero-mean stationary Gaussian process with dependence specified by Cov{Z(xᵢ), Z(xⱼ)} = σ²_Z R(xᵢ − xⱼ) for some known correlation function R
  • Then the joint distribution of Y₀ = Y(x₀) and Yⁿ = (Y(x₁), …, Y(xₙ))ᵀ is multivariate normal, and the conditional distribution of a multivariate normal gives (Eq 4) on the next slide
    • here f₀ ≡ f(x₀), F is the n × p matrix with ith row fᵀ(xᵢ), r₀ ≡ (R(x₀ − x₁), …, R(x₀ − xₙ))ᵀ, and R is the n × n matrix with (i, j) entry R(xᵢ − xⱼ)
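Spelled out, the joint distribution and the conditional mean it implies (this is (Eq 4) on the next slide) are the standard multivariate-normal facts:

```latex
\begin{pmatrix} Y_0 \\ Y^n \end{pmatrix}
  \sim N_{1+n}\!\left(
    \begin{pmatrix} f_0^\top \beta \\ F\beta \end{pmatrix},\;
    \sigma_Z^2 \begin{pmatrix} 1 & r_0^\top \\ r_0 & R \end{pmatrix}
  \right),
\qquad
E\{Y_0 \mid Y^n\} = f_0^\top \beta + r_0^\top R^{-1}\left(Y^n - F\beta\right).
```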

  10. Gaussian Process Example Continued
  • The best MSPE predictor of Y₀ is Ŷ₀ = E{Y₀ | Yⁿ} = f₀ᵀβ + r₀ᵀR⁻¹(Yⁿ − Fβ) (Eq 4)
  • …But for what class of distributions F is this true?
  • Ŷ₀ depends on:
    • multivariate normality of (Y₀, Yⁿ)
    • β
    • R(·)
  • thus the best MSPE predictor changes when β or R change; however, it remains the same for all σ²_Z > 0

  11. Second GP Example
  • Second example: analogous to the previous linear example, what if we add uncertainty about β?
    • we assume that σ²_Z is known, although the authors say this isn't required
  • Now we have a two-stage model:
    • The first stage, our conditional distribution of (Y₀, Yⁿ) given β, is the same distribution we saw before.
    • The second stage is our prior on β.
  • One can show that the best MSPE predictor of Y₀ is Ŷ₀ = E{Y₀ | Yⁿ} = f₀ᵀ E{β | Yⁿ} + r₀ᵀR⁻¹(Yⁿ − F E{β | Yⁿ})
  • Compare this to what we had in the one-stage case: Ŷ₀ = f₀ᵀβ + r₀ᵀR⁻¹(Yⁿ − Fβ)
  • the authors give a derivation; see the book

  12. So what about E{β | Yⁿ}?
  • Of course, the formula for E{β | Yⁿ} depends on our prior on β
    • when this prior is non-informative, we can derive [β | Yⁿ] ~ N_p[(FᵀR⁻¹F)⁻¹FᵀR⁻¹Yⁿ, σ²_Z(FᵀR⁻¹F)⁻¹]
    • this (somehow) gives us Ŷ₀ = f₀ᵀβ̂ + r₀ᵀR⁻¹(Yⁿ − Fβ̂) (Eq 5), where β̂ = (FᵀR⁻¹F)⁻¹FᵀR⁻¹Yⁿ
  • What sense can we make of (Eq 5)? It is:
    • (1) the sum of the regression predictor f₀ᵀβ̂ and a "correction" r₀ᵀR⁻¹(Yⁿ − Fβ̂)
    • (2) a function of the training data Yⁿ
    • (3) a function of x₀, the point at which a prediction is made
    • recall that f₀ ≡ f(x₀) and r₀ ≡ (R(x₀ − x₁), …, R(x₀ − xₙ))ᵀ
  • For the moment, we consider (1); we consider (2) and (3) in §3.3
    • (that's right, we're still in §3.2!)
  • The correction term is a linear combination of the residuals Yⁿ − Fβ̂ based on the GP model fᵀβ + Z, with prediction-point-specific coefficients: r₀ᵀR⁻¹(Yⁿ − Fβ̂) = Σᵢ cᵢ(x₀)(Yⁿ − Fβ̂)ᵢ, where the weight cᵢ(x₀) is the ith element of R⁻¹r₀ and (Yⁿ − Fβ̂)ᵢ is the ith residual based on the fitted model
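A hedged NumPy sketch of (Eq 5); the function name `eblup_predict` and its arguments are illustrative, not from the book, and `solve` is used rather than forming R⁻¹ explicitly.

```python
import numpy as np

def eblup_predict(F, R, f0, r0, y):
    """Sketch of (Eq 5): generalized-least-squares beta_hat, then the
    regression prediction plus the correlation-weighted correction."""
    Rinv_y = np.linalg.solve(R, y)      # R^{-1} Y^n
    Rinv_F = np.linalg.solve(R, F)      # R^{-1} F
    beta_hat = np.linalg.solve(F.T @ Rinv_F, F.T @ Rinv_y)  # (F'R^-1 F)^-1 F'R^-1 Y^n
    resid = y - F @ beta_hat                                # residuals Y^n - F beta_hat
    correction = r0 @ np.linalg.solve(R, resid)             # r0' R^{-1} (Y^n - F beta_hat)
    return f0 @ beta_hat + correction, beta_hat
```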

  13. Example
  • Suppose the true unknown curve is the 1D dampened cosine: y(x) = e^(−1.4x) cos(7πx/2)
  • 7-point training set:
    • x₁ drawn from [0, 1/7]
    • xᵢ = x₁ + (i − 1)/7, for i = 2, …, 7
  • Consider predicting y using a stationary GP Y(x) = β₀ + Z(x)
    • Z has zero mean, variance σ²_Z, and correlation function R(h) = e^(−136.1h²)
    • F is a 7 × 1 column vector of ones
      • i.e., we have no features, just an intercept β₀
  • Using the regression/correction interpretation of (Eq 5), we can write: Ŷ(x₀) = β̂₀ + Σᵢ₌₁⁷ cᵢ(x₀)(Yᵢ − β̂₀)
    • cᵢ(x₀) is the ith element of R⁻¹r₀
    • (Yᵢ − β̂₀) are the residuals from fitting the constant model
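An end-to-end NumPy sketch of this example. The curve, design, and correlation function follow the slide; the random seed and the assumption that x₁ is drawn uniformly from [0, 1/7] are mine. Since F is a column of ones, the GLS intercept has the closed form β̂₀ = (1ᵀR⁻¹y)/(1ᵀR⁻¹1).

```python
import numpy as np

# Sketch of the dampened-cosine example: constant-mean GP predictor (Eq 5).
rng = np.random.default_rng(3)

def y_true(x):
    return np.exp(-1.4 * x) * np.cos(7 * np.pi * x / 2)

def corr(a, b):
    return np.exp(-136.1 * (a - b) ** 2)   # R(h) = exp(-136.1 h^2)

# 7-point training design: x1 from [0, 1/7] (assumed uniform), x_i = x1 + (i-1)/7
x1 = rng.uniform(0, 1 / 7)
x = x1 + np.arange(7) / 7
y = y_true(x)

R = corr(x[:, None], x[None, :])           # 7 x 7 correlation matrix

def predict(x0):
    r0 = corr(x0, x)                                          # (R(x0-x1), ..., R(x0-x7))
    Rinv_y = np.linalg.solve(R, y)
    Rinv_1 = np.linalg.solve(R, np.ones(7))
    beta0_hat = (np.ones(7) @ Rinv_y) / (np.ones(7) @ Rinv_1)  # GLS intercept
    c = np.linalg.solve(R, r0)                                 # weights c_i(x0) = (R^-1 r0)_i
    return beta0_hat + c @ (y - beta0_hat)                     # Eq 5, constant-mean case

print(predict(0.55), y_true(0.55))                  # prediction vs. the true curve
print(np.allclose([predict(xi) for xi in x], y))    # the predictor interpolates the data
```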

  14. Example Continued
  • Consider predicting y(x₀) at x₀ = 0.55 (plotted as a cross in the slide's figure)
  • The residuals (Yᵢ − β̂₀) and their associated weights cᵢ(x₀) are plotted in the figure
  • Note:
    • weights can be positive or negative
    • the correction to the regression β̂₀ is based primarily on the residuals at the training data points closest to x₀
    • the weights for the 3 furthest training instances are indistinguishable from zero
  • Ŷ(0.55) has interpolated the data
  • What does the whole curve look like?
    • We need to wait for §3.3 to find out…

  15. …but I’ll show you now anyway!

  16. Interpolating the Data
  • The correction term r₀ᵀR⁻¹(Yⁿ − Fβ̂) forces the model to interpolate the data
  • suppose x₀ is xᵢ for some i ∈ {1, …, n}
    • then f₀ = f(xᵢ), and
    • r₀ = (R(xᵢ − x₁), …, R(xᵢ − xₙ))ᵀ, which is the ith row of R (and, by symmetry, its ith column)
  • Because r₀ is the ith column of R, R⁻¹r₀ is the ith column of R⁻¹R = Iₙ, the identity matrix; thus R⁻¹r₀ = (0, …, 0, 1, 0, …, 0)ᵀ = eᵢ, the ith unit vector
  • Hence: r₀ᵀR⁻¹(Yⁿ − Fβ̂) = eᵢᵀ(Yⁿ − Fβ̂) = Yᵢ − fᵀ(xᵢ)β̂
  • and so, by (Eq 5), Ŷ(x₀) = fᵀ(xᵢ)β̂ + (Yᵢ − fᵀ(xᵢ)β̂) = Yᵢ
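A quick numerical check of the key step, with an arbitrary design and the Gaussian correlation from the earlier example: when x₀ coincides with a training point, R⁻¹r₀ is the corresponding unit vector.

```python
import numpy as np

# Check the key step: when x0 = x_i, R^{-1} r0 is the i-th unit vector e_i.
x = np.array([0.05, 0.2, 0.35, 0.5, 0.65, 0.8, 0.95])   # arbitrary design points
R = np.exp(-136.1 * (x[:, None] - x[None, :]) ** 2)     # correlation matrix

i = 3
r0 = np.exp(-136.1 * (x[i] - x) ** 2)   # r0 when x0 = x_i (the i-th row/column of R)
e_i = np.zeros(len(x)); e_i[i] = 1.0
print(np.allclose(np.linalg.solve(R, r0), e_i))   # True
```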

  17. An Example Showing that Best MSPE Predictors Need Not Be Linear
  • Suppose that (Y₀, Y₁) has the joint distribution in which Y₁ ~ U(0, 1) and the conditional distribution of Y₀ given Y₁ = y₁ is uniform over the interval (0, y₁²).
  • The best MSPE predictor of Y₀ is the center of this interval: Ŷ₀ = E{Y₀ | Y₁} = Y₁²/2
  • The minimum MSPE linear unbiased predictor is Ŷ₀ᴸ = −1/12 + ½ Y₁
    • based on a bunch of calculus
  • Their MSPEs are very similar:
    • E{(Y₀ − Y₁²/2)²} ≈ 0.01667
    • E{(Y₀ − (−1/12 + ½ Y₁))²} ≈ 0.01806
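A Monte Carlo sketch reproducing the two MSPEs, assuming the joint distribution stated above (Y₁ uniform on (0, 1), Y₀ conditionally uniform on (0, Y₁²)), which is consistent with the quoted numbers.

```python
import numpy as np

# Monte Carlo sketch of the two MSPEs quoted above.
rng = np.random.default_rng(4)
reps = 2_000_000
Y1 = rng.uniform(0, 1, reps)
Y0 = rng.uniform(0, Y1 ** 2)        # conditionally uniform on (0, Y1^2)

best = Y1 ** 2 / 2                  # conditional-mean (best MSPE) predictor
linear = -1 / 12 + Y1 / 2           # minimum MSPE linear unbiased predictor

print(np.mean((Y0 - best) ** 2))    # approx 0.01667
print(np.mean((Y0 - linear) ** 2))  # approx 0.01806
```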

  18. Best Linear Unbiased MSPE Predictors
  • minimum MSPE predictors depend on the joint distribution of Yⁿ and Y₀
    • thus, they tend to be optimal only within a very restricted class F
  • In an attempt to find predictors that are more broadly optimal, consider:
    • predictors that are linear in Yⁿ; the minimum MSPE predictors among these are called best linear predictors (BLPs)
    • predictors that are both linear and unbiased for Y₀; the minimum MSPE predictors among these are called best linear unbiased predictors (BLUPs)

  19. BLUP Example 1
  • Recall our first example:
    • Yᵢ = β₀ + εᵢ, where εᵢ ~ N(0, σ²), σ² > 0
    • Define F as those distributions in which β₀ is a given nonzero constant and σ² is unknown (σ² > 0)
  • Any Ŷ₀ = a₀ + aᵀYⁿ is a LUP of Y₀ if a₀ + β₀ Σᵢ₌₁ⁿ aᵢ = β₀
  • The MSPE of a linear unbiased predictor Ŷ₀ = a₀ + aᵀYⁿ is
    • E{(a₀ + Σᵢ₌₁ⁿ aᵢYᵢ − Y₀)²} = E{(a₀ + Σᵢ aᵢ(β₀ + εᵢ) − β₀ − ε₀)²} = (a₀ + β₀ Σᵢ aᵢ − β₀)² + σ² Σᵢ aᵢ² + σ² = σ²(1 + Σᵢ aᵢ²) (Eq 6) ≥ σ² (Eq 7)
    • we have equality in (Eq 6) because Ŷ₀ is unbiased (the squared-bias term vanishes)
    • we have equality in (Eq 7) iff aᵢ = 0 for all i ∈ {1, …, n} (and hence a₀ = β₀)
  • Thus, the unique BLUP is Ŷ₀ = β₀

  20. BLUP Example 2
  • Consider again the enlarged model F with β₀ an unknown real and σ² > 0
    • recall that every unbiased Ŷ₀ = a₀ + aᵀYⁿ must satisfy a₀ = 0 and Σᵢ aᵢ = 1
  • The MSPE of Ŷ₀ is E{(Σᵢ aᵢYᵢ − Y₀)²} = (β₀ Σᵢ aᵢ − β₀)² + σ² Σᵢ aᵢ² + σ² = 0 + σ²(1 + Σᵢ aᵢ²) (Eq 8) ≥ σ²(1 + 1/n) (Eq 9)
    • equality holds in (Eq 8) because Σᵢ aᵢ = 1
    • (Eq 9): Σᵢ aᵢ² is minimized subject to Σᵢ aᵢ = 1 when aᵢ = 1/n
  • Thus the sample mean Ŷ₀ = (1/n) Σᵢ Yᵢ is the best linear unbiased predictor of Y₀ for the enlarged F
  • How can the BLUP for a large class not also be the BLUP for a subclass (didn't we see a claim to the contrary earlier)?
    • the previous claim was that every LUP for a class is also a LUP for a subclass; it doesn't hold for BLUPs
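A tiny sketch of the argument behind (Eq 9): among weight vectors that sum to 1, equal weights give the smallest Σᵢ aᵢ², and hence the smallest MSPE σ²(1 + Σᵢ aᵢ²). The values of n and σ² are arbitrary.

```python
import numpy as np

# Among weights with sum(a) = 1, equal weights minimize sum(a^2) and so
# minimize the MSPE sigma^2 * (1 + sum(a^2)) from (Eq 8)-(Eq 9).
rng = np.random.default_rng(5)
n, sigma2 = 6, 1.0

for _ in range(5):
    a = rng.random(n)
    a /= a.sum()                       # random weights that sum to 1
    print(sigma2 * (1 + (a**2).sum()), ">=", sigma2 * (1 + 1/n))

equal = np.full(n, 1 / n)
print(sigma2 * (1 + (equal**2).sum()))  # equals sigma^2 * (1 + 1/n)
```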

  21. BLUP Example 3
  • Consider the measurement error model: Yᵢ = Y(xᵢ) = Σⱼ fⱼ(xᵢ)βⱼ + εᵢ, where the fⱼ are known regression functions, the βⱼ are unknown, and each εᵢ ~ N(0, σ²)
  • Consider the BLUP of Y(x₀) for unknown β and σ² > 0
  • A linear predictor Ŷ₀ = a₀ + aᵀYⁿ is unbiased provided that, for all (β, σ²), E{a₀ + aᵀYⁿ} = a₀ + aᵀFβ is equal to E{Y₀} = fᵀ(x₀)β
    • This implies a₀ = 0 and Fᵀa = f(x₀)
  • The BLUP of Y₀ is Ŷ₀ = fᵀ(x₀)β̂
    • where β̂ = (FᵀF)⁻¹FᵀYⁿ is the ordinary least squares estimator of β
    • and the BLUP is unique
    • This is proved in the chapter notes, §3.4
  • …and now we've reached the end of §3.2!
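A hedged sketch of this BLUP: ordinary least squares for β̂, then prediction by fᵀ(x₀)β̂. The quadratic basis and the data-generating coefficients are arbitrary illustrative choices, not from the book.

```python
import numpy as np

# Sketch of the measurement-error BLUP: beta_hat = (F'F)^{-1} F' Y^n, then
# predict with f(x0)' beta_hat. Basis f(x) = (1, x, x^2) is illustrative.
rng = np.random.default_rng(6)
x = np.linspace(0, 1, 20)
f = lambda t: np.column_stack([np.ones_like(t), t, t**2])   # known regression functions
beta_true = np.array([0.5, -1.0, 2.0])                      # arbitrary "true" coefficients
y = f(x) @ beta_true + 0.1 * rng.standard_normal(x.size)    # eps_i ~ N(0, 0.01)

F = f(x)
beta_hat, *_ = np.linalg.lstsq(F, y, rcond=None)            # ordinary least squares

x0 = np.array([0.4])
print(f(x0) @ beta_hat)       # BLUP of Y(x0) = f(x0)' beta_hat
```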

  22. …that’s all for today!

  23. Prediction for Computer Experiments
  • The idea is to build a "surrogate" or "emulator"
    • a model that predicts the output of a simulation, to spare you from having to run the actual simulation
  • Neural networks, splines, and GPs all work; guess what, they like GPs
  • Let f₁, …, f_p be known regression functions, β a vector of unknown regression coefficients, and Z a stationary GP on X having zero mean, variance σ²_Z, and correlation function R
  • Then we can see experimental output Y(x) as the realization of the random function Y(x) = Σⱼ₌₁ᵖ fⱼ(x)βⱼ + Z(x) = fᵀ(x)β + Z(x)
  • This model implies that Y₀ and Yⁿ have the multivariate normal distribution of §3.2 (the setup behind Eq 4), where β and σ²_Z > 0 are unknown
  • Now, drop the Gaussian assumption to consider a nonparametric moment model based on an arbitrary second-order stationary process Z, for unknown β and σ²_Z: Y(x) = fᵀ(x)β + Z(x), with E{Z(x)} = 0 and Cov{Z(x₁), Z(x₂)} = σ²_Z R(x₁ − x₂)
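One way such a surrogate is commonly built in practice, sketched here with scikit-learn (my choice of library, not the book's): fit a GP to a handful of simulator runs, then predict, with uncertainty, instead of re-running the code. The toy simulator, design, and kernel settings are all illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

# Hedged sketch of a GP surrogate: fit to a few "simulator" runs, then
# predict new outputs (with uncertainty) instead of re-running the simulator.
def expensive_simulation(x):                       # stand-in for the real computer code
    return np.exp(-1.4 * x) * np.cos(7 * np.pi * x / 2)

X_train = np.linspace(0, 1, 7).reshape(-1, 1)      # a small 7-run design
y_train = expensive_simulation(X_train).ravel()

kernel = ConstantKernel(1.0) * RBF(length_scale=0.1)   # variance * Gaussian correlation
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_train, y_train)

X_new = np.array([[0.55]])
mean, std = gp.predict(X_new, return_std=True)     # surrogate prediction + uncertainty
print(mean, std)
```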

  24. Conclusion: is this the right tool when we can get lots of data?
