Regression Analysis

Regression Analysis Using Excel

Econometrics • Econometrics is simply the statistical analysis of economic phenomena • Here, we summarize some concepts of basic econometrics and focus on carrying out the econometric estimation in Excel • We discuss some basics of estimating demand, production functions, supply and cost functions

Brief instruction on estimating a demand function • Suppose there exists underlying data on the relation between a dependent variable, Y, and an explanatory or independent variable, X • Suppose we have 6 data points, each a pair of a Y value and an X value, as shown in the following graph

Data points in Y and X, some of which are not on the straight line • [Figure: scatter plot of points A, B, C, D, E and F, with Y on the vertical axis and X on the horizontal axis] • Some of the points do not lie on a straight line, or even a smooth curve • So there is not an exact relationship between Y and X in the graph • The job of the econometrician is to find the line or curve that best fits the data points, in order to assess the relationship between Y and X

Suppose a linear relationship between Y and X • There is some random variation in this linear relationship, as seen by the data points relative to the straight line • [Figure: points A, B, C, D, E and F scattered around a straight line, Y against X] • Mathematically, this variation implies that the linear relationship between Y and X is given by Y = a + bX + ε, where ε is the random error term

a and b are unknown parameters that define the linear relationship, with a being the Y-axis intercept and b the slope of the line • Because the parameters relating Y and X are unknown, the econometrician must estimate the values of a and b • [Figure: points A, B, C, D, E and F with the line Y = a + bX, Y against X] • Note that for any line drawn through the points in the graph, there will be some discrepancy between the actual points and the line: points A and D lie above the line, and points C and E lie below it

The deviations between the actual points and the line are given by the lengths of the dashed lines from the line to the points; these deviations are denoted e^A, e^C, e^D and e^E • The line Y = a + bX represents the expected, or average, relationship between Y and X, so the deviations are analogous to the deviations from the mean used to calculate the variance of a random variable • [Figure: points A, B, C, D, E and F with the line Y = a + bX and dashed lines marking the deviations e^A, e^C, e^D and e^E] • The econometrician finds the values of the parameters a and b that minimize the sum of squared deviations between the actual points and the line

The regression line is the line that minimizes the sum of squared deviations between the line (the expected relation) and the actual data points A, B, C, D, E and F • The values of a and b that achieve this, frequently denoted a^ and b^, are called the parameter estimates • The corresponding line is called the least squares regression line; for the equation Y = a + bX + ε it is given by Y^ = a^ + b^X • [Figure: points A, B, C, D, E and F with the fitted line and the deviations e^A, e^C, e^D and e^E] • The parameter estimates a^ and b^ are the values of a and b that result in the smallest sum of squared errors between a line and the actual data
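As a minimal sketch of the least squares idea on this slide, the following Python snippet computes a^ and b^ from the standard closed-form formulas for a single explanatory variable. The six (X, Y) pairs are invented for illustration and are not the points A to F from the graph; the function names are hypothetical as well.

```python
# Hypothetical data, invented for illustration (not the slide's points A-F).
X = [10, 15, 20, 25, 30, 35]
Y = [130, 112, 105, 90, 84, 67]


def least_squares(x, y):
    """Return (a_hat, b_hat) that minimize the sum of (y - a - b*x)^2."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Cross-deviations and squared deviations around the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = s_xy / s_xx              # slope estimate
    a_hat = y_bar - b_hat * x_bar    # intercept estimate
    return a_hat, b_hat


def sum_squared_errors(x, y, a, b):
    """Sum of squared vertical deviations between the points and the line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a_hat, b_hat = least_squares(X, Y)
print(f"a^ = {a_hat:.4f}, b^ = {b_hat:.4f}")
print(f"minimized sum of squared errors = "
      f"{sum_squared_errors(X, Y, a_hat, b_hat):.4f}")
```

Moving either estimate away from the computed value, for example calling `sum_squared_errors(X, Y, a_hat + 1, b_hat)`, gives a larger sum of squared errors, which is the sense in which the fitted line is "least squares".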

Software for regression analysis • Many software packages are coded to compute the least squares regression line • These packages provide various kinds of parameter estimates for alternative models of relationships, such as economic relationships • Spreadsheet packages, such as Excel, can also do basic regression analysis

Let’s look at some data for a product sold in a Chinese market and the price associated with each quantity sold

Now, let’s use the regression command in Excel to find the least squares estimates of demand • The regression tool is found in TOOLS/Data Analysis/Regression in Excel • Click on TOOLS, then click Data Analysis and then click on Regression • A pull-down box is displayed • You are asked to provide the cell address range for Y, the dependent variable • You are asked to provide the cell address range for X (you can have multiple independent variables, so you can input a matrix of X variables) • You can input labels along with your cell address ranges if the label for each variable is included at the top of the columns containing the Y variable and the X variables; just click Labels

You can also have the regression algorithm compute and plot the residuals, (Y - Y^), that is, each actual Y minus its fitted value Y^ = a^ + b^X • Compute standardized residuals • And adjust the output options for the analysis of variance table and related information
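The residual options above can be sketched in a few lines of Python. The data are invented for illustration, and the scaling of the standardized residuals is an assumption: here each residual is divided by the standard error of the estimate, sqrt(SSE/(n - 2)), which is one common definition and may differ slightly from the scaling Excel applies.

```python
import math

# Hypothetical price (X) and quantity (Y) pairs, invented for illustration.
X = [10, 15, 20, 25, 30, 35]
Y = [130, 112, 105, 90, 84, 67]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
s_xx = sum((x - x_bar) ** 2 for x in X)
b_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / s_xx
a_hat = y_bar - b_hat * x_bar

# Fitted values contain no error term: Y^ = a^ + b^X.
fitted = [a_hat + b_hat * x for x in X]
# Residual = actual Y minus fitted Y^.
residuals = [y - f for y, f in zip(Y, fitted)]

# Assumed scaling: standard error of the estimate, sqrt(SSE / (n - 2)).
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
standardized = [e / s for e in residuals]

for x, y, f, e, z in zip(X, Y, fitted, residuals, standardized):
    print(f"X={x:3d}  Y={y:4d}  Y^={f:8.3f}  resid={e:7.3f}  std resid={z:6.3f}")
```

With an intercept in the model, the residuals sum to zero, which is a quick sanity check on any regression output.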

Once the data are loaded into the Regression tool by filling in the pull-down menu • Then click OK • The output provides regression statistics and an analysis of variance table, which gives information about the estimates a^ and b^ and their significance relative to zero • Where a parameter is insignificant, the independent variable X has no statistically detectable influence on the dependent variable Y in the hypothesized relationship Y = a + bX

Let’s go back to our demand data • We had 12 observations on quantity purchased and the price in Yuan for each quantity purchased • 12 observations are too few to give statistically robust estimates, but the number of observations is sufficient for this simple example problem

The data are given below • From economic theory, we anticipate that the demand relationship will take the form Y = a + bX, with Y the quantity purchased and X the price, and that the estimate of b will be negative (b^ < 0), that is, downward sloping demand

Input the data • Input the quantity purchased by stating the cell range of the quantity data into the Y-input window • Input the price data by stating the cell range of the price data into the X-input window • One can click in each window, and then drag down the cells in each respective Y, then X column of data to input the cell ranges • Our data are set up with labels in the first row of the cell range, so we can click labels and we will get the names of the variable printed out in the analysis of variance table in the resulting output • Then click OK --- to get regression estimate results

Here we show our data, and then the output of the regression analysis on the data • Regression statistics such as R², adjusted R² and the standard error of the estimate • Analysis of variance table giving us the estimate of the intercept term, a^ = 148.64, and the slope of the demand curve, b^ = -2.26 (rounded)
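To illustrate how the reported estimates can be used, the sketch below plugs a^ = 148.64 and b^ = -2.26 into the fitted demand line Y^ = a^ + b^X. The price of 20 Yuan is a hypothetical value chosen for the example, and the function name is an assumption, not something from the Excel output.

```python
A_HAT = 148.64  # intercept estimate reported on the slide
B_HAT = -2.26   # slope estimate reported on the slide (rounded)


def predicted_quantity(price):
    """Fitted quantity from the estimated demand line Y^ = a^ + b^X."""
    return A_HAT + B_HAT * price


price = 20.0  # hypothetical price in Yuan, chosen only for illustration
print(f"Predicted quantity at a price of {price:.0f} Yuan: "
      f"{predicted_quantity(price):.2f}")
```

Because b^ is negative, each additional Yuan in price lowers the predicted quantity by about 2.26 units, which is the downward sloping demand anticipated from theory.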

We also get the sums of squares summary and the F-test which is a test of whether the model as a whole is significant, i.e., Y and X are related in the manner estimated --- downward sloping demand and positive intercept for the Y axis

Similarly, we obtain the standard error of each estimated parameter, se(a^) and se(b^), and the t-ratios: t(a^) = a^/se(a^) for the estimate a^, and t(b^) = b^/se(b^) for the estimate b^ • The t-ratio of a parameter estimate is the ratio of the value of the parameter estimate to its standard error • When the t-ratio is large in absolute value, one can be confident that the true parameter is not equal to zero, so there is a relationship between X and Y in this case • In large samples an absolute t-ratio of 1.96 is sufficient at the 5% level; with our 12 observations (10 residual degrees of freedom) the two-tailed 5% critical value is about 2.23
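The t-ratios described above can be reproduced by hand for the one-variable case. The sketch below uses the textbook formulas for the standard errors of the intercept and slope; the data and all function and key names are hypothetical and are not taken from the Excel output.

```python
import math

# Hypothetical price (X) and quantity (Y) pairs, invented for illustration.
X = [10, 15, 20, 25, 30, 35]
Y = [130, 112, 105, 90, 84, 67]


def ols_with_t_ratios(x, y):
    """Simple regression estimates with standard errors and t-ratios."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a_hat = y_bar - b_hat * x_bar
    # Residual variance uses n - 2 degrees of freedom (two parameters).
    sse = sum((yi - (a_hat + b_hat * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    se_b = math.sqrt(s2 / s_xx)
    se_a = math.sqrt(s2 * (1 / n + x_bar ** 2 / s_xx))
    return {
        "a_hat": a_hat, "se_a": se_a, "t_a": a_hat / se_a,
        "b_hat": b_hat, "se_b": se_b, "t_b": b_hat / se_b,
    }


result = ols_with_t_ratios(X, Y)
for name, value in result.items():
    print(f"{name:6s} = {value:10.4f}")
```

The absolute t-ratio is then compared with the critical value for the residual degrees of freedom (here n - 2 = 4 for the six made-up points) to judge whether the parameter differs from zero.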

P-values are also reported; these give a more precise measure of statistical significance, that is, of whether the estimated parameter differs from zero • The p-value for price is approximately 0.0007, meaning that if the true parameter b (the coefficient of price) were actually zero, and price therefore insignificant in explaining movements in quantity purchased, there would be only about a 7 in 10,000 chance of obtaining an estimate this far from zero • The lower the p-value, the more confident you can be in the particular estimate • Usually p-values of 0.05 or lower are considered low enough for a researcher to be confident in the value of the estimated parameter

The estimated R² = [explained variation] / [total variation] = sum of squares regression / sum of squares total • The range of R² is 0 <= R² <= 1 • The closer R² is to 1, the better the overall fit of the estimated regression equation to the actual data and the relationship of Y to X • So R² measures goodness of fit • Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k), with n = the total number of observations and k = the number of estimated parameters (including the intercept) • n - k represents the residual degrees of freedom after conducting the regression analysis • The adjusted R² penalizes regressions with only a few degrees of freedom (estimating numerous coefficients with small n)
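The two goodness-of-fit formulas on this slide can be checked with a short Python sketch. The data are invented for illustration, the function names are hypothetical, and k = 2 assumes a model with an intercept and one slope.

```python
# Hypothetical price (X) and quantity (Y) pairs, invented for illustration.
X = [10, 15, 20, 25, 30, 35]
Y = [130, 112, 105, 90, 84, 67]


def fit_line(x, y):
    """Least squares intercept and slope for one explanatory variable."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    return y_bar - b_hat * x_bar, b_hat


def r_squared(x, y, k=2):
    """Return (R^2, adjusted R^2); k counts estimated parameters."""
    n = len(y)
    a_hat, b_hat = fit_line(x, y)
    y_bar = sum(y) / n
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum((yi - (a_hat + b_hat * xi)) ** 2 for xi, yi in zip(x, y))
    ss_regression = ss_total - ss_resid           # explained variation
    r2 = ss_regression / ss_total
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)     # penalty for few d.o.f.
    return r2, adj_r2


r2, adj_r2 = r_squared(X, Y)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")
```

The adjusted value is always at or below the unadjusted one, and the gap widens as n - k shrinks.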

The alternative measure of goodness of fit is the F-statistic, which provides a measure of the total variation explained by the regression relative to the total unexplained variation • The greater the F-statistic, the better the overall fit of the regression line through the actual data • The R² measure of goodness of fit cannot give us a rule for how high the measure should be to indicate good fit; the F-statistic does not suffer from this shortcoming • Significance F tells us the significance of the F-statistic: here the value is approximately 0.0007, so there is only a 0.07% chance the model fits the data purely by accident • The lower the significance value of the F-statistic, the more confident one is in the actual regression model estimate
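As a rough sketch of where the F-statistic comes from, the snippet below computes it as explained variation per regression degree of freedom divided by unexplained variation per residual degree of freedom. The data and names are hypothetical; Significance F itself requires the F distribution and is not computed here.

```python
# Hypothetical price (X) and quantity (Y) pairs, invented for illustration.
X = [10, 15, 20, 25, 30, 35]
Y = [130, 112, 105, 90, 84, 67]


def f_statistic(x, y):
    """F = (SS regression / (k - 1)) / (SS residual / (n - k)), with k = 2."""
    n, k = len(x), 2
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a_hat = y_bar - b_hat * x_bar
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum((yi - (a_hat + b_hat * xi)) ** 2 for xi, yi in zip(x, y))
    ss_regression = ss_total - ss_resid
    f_value = (ss_regression / (k - 1)) / (ss_resid / (n - k))
    # Squared t-ratio of the slope, for comparison with F.
    t_b_squared = b_hat ** 2 / ((ss_resid / (n - k)) / s_xx)
    return f_value, t_b_squared


f_value, t_b_squared = f_statistic(X, Y)
print(f"F = {f_value:.3f}, squared slope t-ratio = {t_b_squared:.3f}")
```

With a single explanatory variable, F equals the square of the slope's t-ratio, which is why Significance F and the p-value for price coincide in a one-variable regression.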
