
Autocorrelation in Regression Analysis



  1. Autocorrelation in Regression Analysis • Tests for Autocorrelation • Examples • Durbin-Watson Tests • Modeling Autoregressive Relationships

  2. What causes autocorrelation? • Misspecification • Data manipulation (before or after receipt of the data) • Event inertia • Spatial ordering

  3. Checking for Autocorrelation • Test: Durbin-Watson statistic: d = Σ (e_t − e_(t−1))² / Σ e_t² • The d-statistic runs from 0 to 4 and falls into five zones:

  0 ...... d-lower ...... d-upper ...... 2 ...... 4−d-upper ...... 4−d-lower ...... 4

  • 0 to d-lower: positive autocorrelation (clearly evident)
  • d-lower to d-upper: zone of indecision (ambiguous – cannot rule out autocorrelation)
  • d-upper to 4−d-upper: autocorrelation is not evident
  • 4−d-upper to 4−d-lower: zone of indecision (ambiguous)
  • 4−d-lower to 4: negative autocorrelation (clearly evident)

  4. Consider the following regression:

        Source |       SS       df       MS            Number of obs =     328
  -------------+------------------------------         F(  2,   325) =   52.63
         Model |  .354067287     2  .177033643         Prob > F      =  0.0000
      Residual |  1.09315071   325  .003363541         R-squared     =  0.2447
  -------------+------------------------------         Adj R-squared =  0.2400
         Total |    1.447218   327  .004425743         Root MSE      =    .058

  ------------------------------------------------------------------------------
         price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
           ice |    .060075    .006827     8.80   0.000     .0466443    .0735056
      quantity |  -2.27e-06   2.91e-07    -7.79   0.000    -2.84e-06   -1.69e-06
         _cons |   .2783773   .0077177    36.07   0.000     .2631944    .2935602
  ------------------------------------------------------------------------------

  Because this is time series data, we should consider the possibility of autocorrelation. To run the Durbin-Watson test, first we have to specify the data as time series with the tsset command. Next we use the dwstat command.

  Durbin-Watson d-statistic( 3, 328) = .2109072
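  A minimal sketch of those steps in Stata, assuming a time variable named t (a hypothetical name) and the variables shown above:

      tsset t                      // declare the data as time series
      regress price ice quantity   // the model shown above
      dwstat                       // Durbin-Watson d (estat dwatson in newer Stata)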

  5. Find the D-upper and D-lower • Check a Durbin-Watson table for the values of d-upper and d-lower. • http://hadm.sph.sc.edu/courses/J716/Dw.html • For n=20 and k=2, α = .05 the values are: • Lower = 1.643 • Upper = 1.704

  Durbin's alternative test for autocorrelation
  ---------------------------------------------------------------------------
      lags(p)  |          chi2               df                 Prob > chi2
  -------------+-------------------------------------------------------------
         1     |        1292.509              1                    0.0000
  ---------------------------------------------------------------------------
  H0: no serial correlation

  6. Alternatives to the d-statistic • The d-statistic is not valid in models with a lagged dependent variable • In the case of a lagged LHS variable you must use the Durbin alternative ("Durbin-a") test (the command is durbina in Stata; sketched below) • Also, the d-statistic tests only for first-order autocorrelation. In other instances you may use the Durbin alternative test • Why would you suspect other than 1st-order autocorrelation?
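  A minimal sketch, assuming the slide-4 model and tsset data; durbina is the user-written command named on this slide, and estat durbinalt is the equivalent built into newer Stata:

      regress price ice quantity
      durbina                 // Durbin's alternative test (user-written command)
      * or, in newer versions of Stata:
      estat durbinalt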

  7. The Runs Test • An alternative to the D-W test is a formalized examination of the signs of the residuals. We would expect that the signs of the residuals will be random in the absence of autocorrelation. • The first step is to estimate the model and predict the residuals.

  8. Runs continued • Next, order the signs of the residuals against time (or spatial ordering in the case of cross-sectional data) and see if there are excessive “runs” of positives or negatives. Alternatively, you can graph the residuals and look for the same trends.

  9. Runs test continued • The final step is to compare the observed number of runs with its expected mean and standard deviation in a standard z-test • Stata does this automatically with the runtest command, as sketched below!
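  A minimal sketch of the runs test in Stata, assuming the slide-4 model and tsset data:

      regress price ice quantity
      predict e, residuals    // store the residuals
      runtest e               // test whether the signs of e form random runs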

  10. Visual diagnosis of autocorrelation (in a single series) • A correlogram is a good tool to identify if a series is autocorrelated
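  One way to draw a correlogram in Stata, assuming the series is price and the data are already tsset:

      corrgram price          // table of autocorrelations with Q statistics
      ac price                // autocorrelation plot with confidence bands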

  11. Dealing with autocorrelation • D-W is not appropriate for auto-regressive (AR) models, where: Y_t = β0 + β1·Y_(t−1) + ε_t • In this case, we use the Durbin alternative test • For AR models, we need to explicitly estimate the correlation between Y_t and Y_(t−1) as a model parameter • Techniques: • AR1 models (closest to regression; 1st order only) • ARIMA (any order)

  12. Dealing with Autocorrelation • There are several approaches to resolving problems of autocorrelation. • Lagged dependent variables • Differencing the Dependent variable • GLS • ARIMA

  13. Lagged dependent variables • The most common solution • Simply create a new variable that equals Y at t−1 and use it as a RHS variable • To do this in Stata, simply use the generate command with the new variable set equal to L.variable (see the sketch below) • gen lagy = L.y • gen laglagy = L2.y • This correction should be based on a theoretical belief about the specification • May cause more problems than it solves • Also costs a degree of freedom (one lost observation) • There are several advanced techniques for dealing with this as well
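  A minimal sketch, assuming the slide-4 variables and tsset data:

      gen lagprice = L.price               // price at t-1
      regress price lagprice ice quantity
      * equivalently, use the lag operator directly:
      regress price L.price ice quantity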

  14. Differencing • Differencing is simply the act of subtracting the previous observation's value from the current observation • To do this in Stata, again use the generate command with a capital D (instead of the L for lags), as sketched below • This process is effective; however, it is an EXPENSIVE correction • This technique "throws away" long-term trends • It assumes that rho = 1 exactly
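  A minimal sketch of differencing in Stata, assuming the slide-4 variables and tsset data (whether to difference the regressors as well is a modeling choice):

      gen dprice = D.price                 // price_t - price_(t-1)
      regress dprice D.ice D.quantity      // model in first differences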

  15. GLS and ARIMA • GLS approaches use maximum likelihood to estimate Rho and correct the model • These are good corrections, and can be replicated in OLS • ARIMA is an acronym for Autoregressive Integrated Moving Average • This process is a univariate “filter” used to cleanse variables of a variety of pathologies before analysis

  16. Corrections based on Rho • There are several ways to estimate rho, the simplest being to calculate it from the residuals: ρ̂ = Σ e_t·e_(t−1) / Σ e_t² • We then estimate the regression by transforming the regressors so that: Y*_t = Y_t − ρ̂·Y_(t−1) and X*_t = X_t − ρ̂·X_(t−1) • This gives the regression: Y*_t = β0(1 − ρ̂) + β1·X*_t + u_t (sketched below)
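  A minimal sketch of this two-step (Cochrane-Orcutt-style) transformation in Stata, assuming the slide-4 model and tsset data; rhohat and the starred variable names are hypothetical:

      regress price ice quantity
      predict e, residuals
      regress e L.e, noconstant            // estimate rho from the residuals
      scalar rhohat = _b[L.e]
      gen pstar = price - rhohat*L.price   // transformed dependent variable
      gen istar = ice - rhohat*L.ice       // transformed regressors
      gen qstar = quantity - rhohat*L.quantity
      regress pstar istar qstar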

  17. High-tech solutions • Stata also offers the option of estimating the model with AR errors (with multiple ways of estimating rho). There is also what is known as a Prais-Winsten regression, which generates values for the otherwise lost first observation (see the sketch below) • For the truly adventurous, there is also the option of doing a full ARIMA model
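  The Prais-Winsten output on the next slide can be reproduced with Stata's prais command, assuming the slide-4 variables and tsset data:

      prais price ice quantity    // iterated Prais-Winsten AR(1) estimates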

  18. Prais-Winsten regression

  Prais-Winsten AR(1) regression -- iterated estimates

        Source |       SS       df       MS            Number of obs =     328
  -------------+------------------------------         F(  2,   325) =   15.39
         Model |  .012722308     2  .006361154         Prob > F      =  0.0000
      Residual |  .134323736   325  .000413304         R-squared     =  0.0865
  -------------+------------------------------         Adj R-squared =  0.0809
         Total |  .147046044   327  .000449682         Root MSE      =  .02033

  ------------------------------------------------------------------------------
         price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
           ice |   .0098603   .0059994     1.64   0.101    -.0019422    .0216629
      quantity |  -1.11e-07   1.70e-07    -0.66   0.512    -4.45e-07    2.22e-07
         _cons |   .2517135   .0195727    12.86   0.000     .2132082    .2902188
  -------------+----------------------------------------------------------------
           rho |   .9436986
  ------------------------------------------------------------------------------
  Durbin-Watson statistic (original)    0.210907
  Durbin-Watson statistic (transformed) 1.977062

  19. ARIMA • The ARIMA model allows us to test the hypothesis of autocorrelation and remove it from the data. • This is an iterative process akin to the purging we did when creating the ystar variable.
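  A sketch of the command behind the output on the next slide (an AR(1) model for price with no covariates), assuming tsset data:

      arima price, ar(1)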

  20. The model

  ARIMA regression

  Sample: 1 to 328                          Number of obs   =        328
                                            Wald chi2(1)    =    3804.80
  Log likelihood = 811.6018                 Prob > chi2     =     0.0000

  ------------------------------------------------------------------------------
               |                 OPG
         price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
  price        |
         _cons |   .2558135   .0207937    12.30   0.000     .2150587    .2965683
  -------------+----------------------------------------------------------------
  ARMA         |
            ar |
           L1. |   .9567067     .01551    61.68   0.000     .9263076    .9871058
  -------------+----------------------------------------------------------------
        /sigma |   .0203009    .000342    59.35   0.000     .0196305    .0209713
  ------------------------------------------------------------------------------
  The ar L1. coefficient is the estimate of rho; its large z statistic marks a significant lag.

  21. The residuals of the ARIMA model • There are a few significant lags farther back. Generally we should expect some, but this pattern is probably an indicator of a seasonal trend (well beyond the scope of this lecture)!
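  One way to produce that residual correlogram in Stata, assuming the AR(1) model above; ehat is a hypothetical name:

      arima price, ar(1)
      predict ehat, residuals   // residuals from the ARIMA fit
      corrgram ehat             // check for remaining autocorrelation
      ac ehat                   // plot the residual correlogram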

  22. ARIMA with a covariate

  ARIMA regression

  Sample: 1 to 328                          Number of obs   =        328
                                            Wald chi2(3)    =    3569.57
  Log likelihood = 812.9607                 Prob > chi2     =     0.0000

  ------------------------------------------------------------------------------
               |                 OPG
         price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
  price        |
           ice |   .0095013   .0064945     1.46   0.143    -.0032276    .0222303
      quantity |  -1.04e-07   1.22e-07    -0.85   0.393    -3.43e-07    1.35e-07
         _cons |   .2531552   .0220777    11.47   0.000     .2098838    .2964267
  -------------+----------------------------------------------------------------
  ARMA         |
            ar |
           L1. |   .9542692     .01628    58.62   0.000     .9223611    .9861773
  -------------+----------------------------------------------------------------
        /sigma |   .0202185   .0003471    58.25   0.000     .0195382    .0208988
  ------------------------------------------------------------------------------
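  The corresponding command, with the slide-4 covariates entered into the ARIMA model:

      arima price ice quantity, ar(1)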

  23. Final thoughts • Each correction has a "best" application • If we wanted to evaluate a mean shift (a dummy-variable-only model), calculating rho would not be a good choice; instead we would want to use the lagged dependent variable • Also, where we want to test the effect of inertia, it is probably better to use the lag

  24. Final Thoughts Continued • In small-N settings, calculating rho tends to be more accurate • ARIMA is one of the best options; however, it is very complicated! • When dealing with time, the number of time periods and the spacing of the observations are VERY IMPORTANT! • When using estimates of rho, a good rule of thumb is to make sure you have 25-30 time points at a minimum, and more if the observations are too close together for the process you are observing!

  25. Next Time: • Review for Exam • Plenary Session • Exam Posting • Available after class Wednesday
