Multivariate Statistics

Presentation Transcript


  1. Multivariate Statistics Regression Analysis W. M. van der Veld University of Amsterdam

  2. Overview • Digression: Galton revisited • Types of regression • Goals of regression • Spurious effects • Simple regression • Prediction • Fitting a line • OLS estimation • Assessment of the fit (R2) • Assumptions • Confidence intervals of the estimates • The matrix approach • Multiple Regression • Assumptions • OLS estimation • Assessment of the fit (R2)

  3. Digression: Galton revisited

  4. Digression: Galton revisited • Galton was inspired by the work of his cousin (who’s that?) • In one of his ‘early’ studies, Galton looked at the development of the size and weight of sweet pea seeds over two generations. • This led to the conclusion that size and weight have a tendency to regress toward the mean, called ‘reversion’ by Galton. • I.e., the offspring produced are less extreme than their ancestors. • Galton assumed that the same process would hold for human beings. • Indeed, he found that the physical characteristics of thousands of human volunteers showed a similar regression toward mediocrity in human hereditary stature. • It is ironic that today the term ‘regression’ is still used purely because of its role in history; even the term ‘reversion’ would be more appropriate. But in contemporary statistics there is a more accurate description of the process being described. • Which is?

  5. Types of regression

  6. Types of regression

  7. Types of regression

  8. Goals of regression

  9. Goals of regression: Prediction • x1 = Scholastic achievement (at elementary school) • y1 = Choice of secondary school (VMBO, MAVO, HAVO, VWO)

  10. Goals of regression: Prediction • x1 = Scholastic achievement (at elementary school) • x2 = Quality of elementary school • y1 = Choice of secondary school (VMBO, MAVO, HAVO, VWO)

  11. Goals of regression: Causation • x1 = Socio-economic status • x2 = Quality of elementary school • y1 = Choice of secondary school (VMBO, MAVO, HAVO, VWO) • y2 = Scholastic achievement (at elementary school)

  12. Spurious effects The effect of confounding variables

  13. Spurious effects • Below is the ‘true’ model, • Where x2 is the confounding variable • And the relationship between y1 and y2 is partly spurious.

  14. Spurious effects • Say we estimate the effect of y2 on y1, using the simplest model. • What will that effect be? (draw the model) • Now follow three examples of the effect of a confounding variable.

  15. Spurious effects [1] • Say we want to estimate the effect of y2 on y1, taking into account the effect of quality of elementary school. What will that effect be? (draw the model)

  16. Spurious effects [2] • Suppose the correlations have changed. What will that effect be in this case?

  17. Spurious effects [3] • Suppose the correlations have changed. What will that effect be in this case?

  18. Spurious effects • It should be obvious that some relationships are spurious. • In order to get unbiased estimates of the effects, we need to include the confounding variables. • In our example the observed relation was 0.25: • Situation 1: the causal effect almost disappeared (0.05) • Situation 2: the causal effect almost doubled (0.45) • Situation 3: the causal effect changed sign (–0.25) • This is related to the assumption that the error terms are uncorrelated with each other and with the other variables in the model.
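
The effect of a confounder can also be reproduced numerically. Below is a minimal simulation sketch (Python/NumPy, with made-up coefficients rather than the correlations from the slides): regressing y1 on y2 alone absorbs part of the confounder's effect, while adding x2 recovers the direct effect.

```python
# Minimal simulation of a partly spurious relation (illustrative values only).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x2 = rng.normal(size=n)                         # confounder, e.g. school quality
y2 = 0.5 * x2 + rng.normal(size=n)              # scholastic achievement
y1 = 0.2 * y2 + 0.5 * x2 + rng.normal(size=n)   # school choice; true direct effect 0.2

# Simple regression of y1 on y2: the slope also picks up the path through x2.
b_simple = np.cov(y2, y1)[0, 1] / np.var(y2, ddof=1)

# Regression of y1 on y2 and x2 together: the direct effect (about 0.2) is recovered.
X = np.column_stack([np.ones(n), y2, x2])
coef = np.linalg.lstsq(X, y1, rcond=None)[0]    # [intercept, effect of y2, effect of x2]

print(round(b_simple, 2), round(coef[1], 2))    # roughly 0.40 versus 0.20
```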

  19. Simple regression

  20. Prediction • Let f(x) be given: y = 0.5 + 2x • What can we tell about y? • That y equals 0.5 when x equals 0. • That y increases 2 points when x increases 1 point. • And also that when x = 0 => y = 0.5, when x = 1 => y = 2.5, when x = 2 => y = 4.5, et cetera, for all values of x within the range of possible x values. • What’s the use of knowing this?

  21. Prediction • Let’s say x is the number of hours of sunshine today, and y is the amount of sunshine tomorrow (in hours). • We can PREDICT y under the condition that we know x. • And we know x, because we can measure the number of hours of sunshine today. • Of course, we know from weather forecasting that the causal mechanisms are more complex. • Nevertheless, suppose that 50% of our predictions are correct; then who cares that real life is more complex?

  22. Fitting a line • Let’s start all over again. • The regression equation is: y = α + βx + ε • Where: • α is called the intercept, • β is called the regression coefficient, • ε is a random variable indicating the error in the equation, and • y is a random variable. • α and β are unknown population parameters. • We must first know them, before we can start to make predictions or even more (causality). • We can estimate α and β if we have observations for x and y. • How?

  23. Fitting a line • On the right there are 25 observations of an x and a y variable. We can plot the observations in a (2-dimensional) plane. This way we can check whether the relationship is linear. If not, linear regression is not the appropriate method.

  24. Fitting a line • This relationship is pretty linear. We have already drawn the regression line through the observed points. It can be seen that none of the observations lies exactly on the prediction line. Still, this prediction line has some properties that make it the best line.

  25. OLS estimation • So how do we determine the best line? I.e., under what conditions must α and β be estimated? • By making the sum of errors as small as possible. • But because some errors are negative and others positive (so they can cancel each other out), we determine the best line by • making the sum of squared errors as small as possible: min(Σε²). • There are other possibilities (not discussed). • The error is the result of a wrong prediction; thus, when we • predict that ŷ = 4, and • in fact the observed y = 5, then • the error = 1. • Thus: ε = y – ŷ • Note that we know neither ε nor ŷ.

  26. OLS estimation • Let us write the prediction equation as: ŷ = a + bx. • Note that in the prediction there is no error; it’s a prediction! • The error stems from the observed value of y: • ε = y – ŷ = y – (a + bx) • We are looking for the values of a and b that minimize the sum of squared errors: • min(Σε²) = min(Σ(y – (a + bx))²) = min(Σ(y – a – bx)²) = min(g(a, b)), where g(a, b) = Σ(y – a – bx)² • This means that we have to find the (partial) derivatives, set them to zero, and solve for a and b.

  27. OLS estimation • The minimum of g(a, b) = Σ(y – a – bx)² is found by setting its partial derivatives equal to zero and solving. • The parameters of the model are the unknowns: a and b. • The partial derivatives are: • ∂g/∂a = Σ 2(y – a – bx)·(–1) = –2Σ(y – a – bx) • ∂g/∂b = Σ 2(y – a – bx)·(–x) = –2Σx(y – a – bx) • Setting them to zero and solving the two equations results in:
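
The resulting expressions were shown as an image on the original slide; the standard solution of these two normal equations, consistent with the formulas used later on slides 30 and 38, is:
• b = Σ[xi – E(x)][yi – E(y)] / Σ[xi – E(x)]²
• a = E(y) – b·E(x)
where E(x) and E(y) denote the sample means of x and y.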

  28. Digression: terminology

  29. Digression: deviation scores • These formulas can be rewritten as follows, if the variables are expressed in deviation scores:
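
The rewritten formulas were shown as an image; writing xi* = xi – E(x) and yi* = yi – E(y) for the deviation scores, the standard form is presumably:
• b = Σ xi*·yi* / Σ xi*²
• a = 0 (both variables have mean zero when expressed in deviation scores).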

  30. OLS estimation • Using the 25 observations of x and y, we can estimate both α and β. • You could compute both a and b as an exercise at home if you like: b = –0.079829, a = 9.4240 – (–0.079829 × 52.60) = 13.623005 • So, the estimated regression equation is: ŷ = a + bx = 13.623005 – 0.079829x • These estimates are computed using the criterion that Σε² should be minimized. • From that criterion (OLS) it follows that Σε should be 0. • ε = y – ŷ; we already have y, and we can calculate ŷ. • Is it true that Σε = 0? I.e., does our estimation procedure yield parameter estimates with this property?
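
The 25 observations themselves are not reproduced in this transcript, so the sketch below uses placeholder data that merely mimics the reported estimates (intercept near 13.6, slope near –0.08); it applies the OLS formulas above and checks that the residuals sum to approximately zero.

```python
import numpy as np

# Placeholder data standing in for the 25 observed (x, y) pairs from the slides.
rng = np.random.default_rng(0)
x = rng.uniform(30, 75, size=25)
y = 13.6 - 0.08 * x + rng.normal(0.0, 1.0, size=25)

# OLS estimates of the intercept and slope.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x          # predicted values
resid = y - y_hat          # errors: epsilon = y - y_hat

print(a, b)                # estimates of alpha and beta
print(np.sum(resid))       # approximately 0 (up to floating-point precision)
```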

  31. OLS estimation • We have computed all predicted y’s. • It can be seen that in all 25 cases we make prediction errors, i.e. yi – ŷi ≠ 0. • If we sum over the last column, we get: Σε = –0.02. • This deviation is due to rounding errors.

  32. Assessment of fit • How good is the prediction of y with x? • One possibility is the size of the average error. • That is not a good measure, since Σε = 0; hence (1/n)Σε is also 0. • The error is written as: ε = y – ŷ, which is the same as:
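
The expression itself was shown as an image; the usual decomposition at this point is:
• yi – E(y) = [ŷi – E(y)] + [yi – ŷi]
• Squaring and summing over all cases (the cross-product term vanishes under OLS with an intercept) gives:
• Σ[yi – E(y)]² = Σ[ŷi – E(y)]² + Σ[yi – ŷi]²
• i.e. SST = SSR + SSE (the total, regression, and error sums of squares).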

  33. Assessment of fit • SST is the total variation of y [= n·var(y)]. • SST is the sum of SSR and SSE. • If there are no prediction errors, SSE will be zero! • In that case all observations lie on the prediction line, and we have a perfect prediction. • An indication of how good the prediction is, is therefore given by SSR/SST. • This quotient is called R².
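
As a small sketch of the same computation (the fitted values ŷ can come from the regression estimated above; the function name is my own):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = SSR / SST; note SST = SSR + SSE when the model contains an intercept."""
    sst = np.sum((y - y.mean()) ** 2)   # total sum of squares, n*var(y)
    sse = np.sum((y - y_hat) ** 2)      # sum of squared prediction errors
    ssr = sst - sse                     # 'explained' (regression) sum of squares
    return ssr / sst
```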

  34. Assumptions • Up to this point we have made no assumptions at all that involve probability distributions. • Only a number of specified algebraic calculations have been made. • We now make the basic assumptions about the model that are needed for correct OLS estimation: yi = α + βxi + εi, where i = 1, 2, …, n • (1) εi is a random variable with E(εi) = 0 and var(εi) = σ² • (2) cov(εi, εj) = 0 where i ≠ j • Thus: E(yi) = E(α + βxi + εi) = E(α + βxi) = α + βxi, • where α and β are constants, and xi for a given case i is fixed. • Thus: var(yi) = var(α + βxi + εi) = var(εi) = σ², • where α and β are constants, and xi for a given case i is fixed.

  35. Assumptions • (3) cov(εi, xi) = 0: no spurious relations. • For statistical tests it is necessary to also assume the following: • (4) εi is normally distributed with mean zero and variance σ², so εi ~ N(0, σ²). • Under this assumption εi and εj are not only uncorrelated, but also independent.

  36. Confidence intervals of the estimates • Both equations provide OLS estimates of the parameters of the (simple) regression model. But, how close are these estimates to the population values α and β? • Due to sampling variation our estimates deviate from the true population values. • We are interested in the interval in which the true value should lie, given our estimate. • This interval is unfortunately called the confidence interval. • Due to some odd circumstances (Fisher?) we generally use the arbitrary value of 0.95 to define the limits of our interval (95%).

  37. Confidence intervals of the estimates • This might lead to the wrong idea that there is a 0.95 probability that the true population value lies within this interval. • The meaning of a confidence interval is, however: if samples of the same size are drawn repeatedly from a population, and a confidence interval is calculated from each sample, then 95% of these intervals should contain the true population value. • So we are talking here about variation due to sampling. • How do we determine the sampling variance of the parameters a and b? • Let us start with b.

  38. Confidence intervals of the estimates • The equation for the estimation of b is: b = Σ[xi – E(x)][yi – E(y)] / Σ[xi – E(x)]² • From this it can be derived that the variance of b due to sampling error equals: var(b) = σ² / Σ[xi – E(x)]² • We can also compute the standard deviation of b, which is the square root of the variance of b. • This statistic is usually called the standard error. • The standard error of b is: s.e.(b) = √[var(b)] = σ / √[Σ[xi – E(x)]²] • Normally σ is unknown, so we use the estimate s, assuming that the model is correct (i.e., E(y) = ŷ). • Therefore: est. s.e.(b) = s / √[Σ[xi – E(x)]²] • This is an expression for the sampling variation of b.

  39. Confidence intervals of the estimates • This expression of the standard error allows the computation of a confidence interval, so that we get an idea about the closeness of b to β. • If it is assumed that the errors all come from the same normal distribution (assumption 4), then the confidence interval can be computed as: • b ± t(n–2, 1–½α) · [est. s.e.(b)] • Where t is the t-value from a t-distribution with n–2 degrees of freedom and a specified α-level; the estimate of s² has n–2 degrees of freedom, and normally α is chosen to be 0.05. • It is clear that the smaller the standard error, the closer b is to β.
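
A minimal sketch of these computations (Python, with scipy.stats supplying the t-quantile; the function and variable names are my own):

```python
import numpy as np
from scipy import stats

def slope_inference(x, y, conf=0.95):
    """OLS slope b, its estimated standard error, and a confidence interval."""
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    a = y.mean() - b * x.mean()
    resid = y - (a + b * x)
    s2 = np.sum(resid ** 2) / (n - 2)                # estimate of sigma^2 with n-2 df
    se_b = np.sqrt(s2 / sxx)                         # est. s.e.(b)
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, n - 2)  # t(n-2, 1-alpha/2)
    ci = (b - t_crit * se_b, b + t_crit * se_b)
    return b, se_b, ci
```

The t-test of the next slide is then simply t = (b – B) / se_b, compared against the same critical value.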

  40. Confidence intervals of the estimates • Is it possible to test whether an estimate b is equal to some value B? • Yes, we can test whether b and B are statistically different from each other: • t = (b – B) / [est. s.e.(b)] • If |t| < t(n–2, 1–½α), then the difference is said to be statistically not significant. • The test that is usually done in statistical packages is the test of whether the parameter is different from zero (t = b / [est. s.e.(b)]), i.e. whether zero is included in the interval.

  41. Confidence intervals of the estimates • Now the sampling variation of a. • The equation for the estimation of a is: a = E(y) – b·E(x) • From this it can be derived that: var(a) = σ²·Σxi² / (n·Σ[xi – E(x)]²) • The standard error of a: • s.e.(a) = σ·√[Σxi² / (n·Σ[xi – E(x)]²)] • Normally σ is unknown, so we use the estimate s, assuming that the model is correct (i.e., E(y) = ŷ). • Therefore: est. s.e.(a) = s·√[Σxi² / (n·Σ[xi – E(x)]²)] • This is an expression for the sampling variation of a. • Confidence intervals and t-tests are computed in the same way as for b.

  42. The Matrix Approach Multiple Regression

  43. The Matrix Approach • The use of matrix notation has many advantages: • If a problem is written and solved in matrix terms, the solution can be applied to any regression problem no matter how many terms there are in the regression equation.

  44. Multiple Regression • Instead of simple regression we now consider multiple regression, with p independent variables. • Notice that simple regression is a special case of multiple regression.

  45. Multiple Regression • Notice that for each case i we have a separate equation, so with a sample size of n we will have n equations. • These can also be written as a system of equations.
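
The system itself was shown as an image; in matrix notation it is usually written as
• y = Xβ + ε
where y is the n×1 vector of observations on the dependent variable, X is the n×(p+1) matrix containing a column of ones (for the intercept) and the p independent variables, β is the (p+1)×1 vector of coefficients, and ε is the n×1 vector of errors.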

  46. Multiple Regression

  47. Assumptions • Model assumptions: • (1) εi is a random variable • with mean zero, and • constant variance. • (2) There is no correlation between the ith and jth residual terms (i ≠ j).

  48. Assumptions • (3) εi is normally distributed with mean zero and variance σ², so εi ~ N(0, σ²). • Under this assumption εi and εj are not only uncorrelated, but also independent. • (4) The covariance between the x’s and the residual terms is 0: • no forgotten variables that cause spurious relations. • (5) No multicollinearity: • the x variables are not linearly dependent; otherwise det(X′X) = 0 and the inverse does not exist.

  49. OLS Estimation • If these assumptions hold, then the OLS estimators have the following properties: • They are unbiased linear estimators, and • they are also minimum-variance estimators. • This means that the OLS estimator is the best linear unbiased estimator (BLUE) of a parameter θ: • Linear • Unbiased, i.e., E(θ̂) = θ • Minimum variance in the class of all linear unbiased estimators • The unbiasedness and minimum-variance properties mean that OLS estimators are efficient estimators.
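
The estimator itself is not reproduced in this transcript; in matrix terms the usual OLS estimator is b = (X′X)⁻¹X′y. A minimal NumPy sketch (the function and variable names here are my own, not from the slides):

```python
import numpy as np

def ols(X, y):
    """OLS estimates b = (X'X)^-1 X'y; X must already contain a column of ones."""
    # np.linalg.solve raises LinAlgError if X'X is singular,
    # i.e. under perfect multicollinearity (assumption 5).
    return np.linalg.solve(X.T @ X, X.T @ y)

# Usage with two regressors x1, x2 and a dependent variable y (all 1-d arrays):
# X = np.column_stack([np.ones(len(x1)), x1, x2])
# b = ols(X, y)
```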

  50. OLS Estimation • If one or more of the assumptions are not met, then the OLS estimators are no longer the best linear unbiased estimators. • Does this matter? • Yes, it means we require an alternative method for characterizing the association between our y and x variables.
