
Statistics and Data Analysis



Presentation Transcript


  1. Statistics and Data Analysis Professor William Greene Stern School of Business IOMS Department / Department of Economics

  2. Statistics and Data Analysis Part 19 – Multiple Regression: 3

  3. Multiple Regression Modeling • Data Preparation • Examining the data • Transformations • Scaling • Analysis of the Regression • Residuals and outliers • Influential data points • The fit of the regression • R squared and adjusted R squared • Analysis of variance • Individual coefficient estimates and t statistics • Testing for significance of a set of coefficients • Prediction

  4. Data Preparation • Get rid of observations with missing values. • Small numbers of missing values: delete those observations. • Large numbers of missing values: you may need to give up on certain variables. • There are theories and methods for filling in missing values. (Advanced techniques. Usually not useful or appropriate for real world work.) • Be sure that “missingness” is not directly related to the values of the dependent variable. E.g., a regression fit after systematically removing “high” values of Y is likely to be biased if you then use the results to describe the entire population.
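
A minimal sketch of this advice in pandas (the file name and the 50% cutoff are assumptions for illustration): drop a variable that is missing too often rather than losing most of the sample, then delete the hopefully few remaining incomplete observations.

```python
# A minimal sketch of the data-preparation advice above (hypothetical file name).
import pandas as pd

df = pd.read_csv("who_data.csv")          # hypothetical file name

# Drop any variable that is missing in more than half the observations.
share_missing = df.isna().mean()
df = df.drop(columns=share_missing[share_missing > 0.5].index)

# Then delete the remaining observations with missing values.
df = df.dropna()
```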

  5. Using Logs • Generally, use logs for “size” variables • Use logs if you are seeking to estimate elasticities • Use logs if your data span a very large range of values and the independent variables do not (a modeling issue – some art mixed in with the science). • If the data contain 0s or negative values then logs will be inappropriate for the study – do not use ad hoc fixes like adding something to Y so it will be positive.
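
A minimal sketch of the same rule, assuming hypothetical column names (PCHexp, GDP_per_capita): log the strictly positive “size” variables and leave anything with zeros or negatives in levels rather than patching it.

```python
# A minimal sketch of the logging advice above (hypothetical file and column names).
import numpy as np
import pandas as pd

df = pd.read_csv("who_data.csv")                      # hypothetical file name

for col in ["PCHexp", "GDP_per_capita"]:              # continuous "size" variables
    if (df[col] > 0).all():
        df["log_" + col] = np.log(df[col])            # safe to take logs
    else:
        print(f"{col} has zero or negative values; keep it in levels")
```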

  6. More on Using Logs • Generally only for continuous variables like income or variables that are essentially continuous. • Not for discrete variables like binary variables or qualitative variables (e.g., stress level = 1,2,3,4,5) • Generally be consistent in the equation – don’t mix logs and levels. • Generally DO NOT take the log of “time” (t) in a model with a time trend. TIME is discrete and not a “measure.”

  7. Residuals • Residual = the difference between the actual value of y and the value predicted by the regression. • E.g., Switzerland: • Estimated equation is DALE = 36.900 + 2.9787*EDUC + .004601*PCHexp • Swiss values are EDUC=9.418360, PCHexp=2646.442 • Regression prediction = 77.1307 • Actual Swiss DALE = 72.71622 • Residual = 72.71622 – 77.1307 = -4.41448 • The regression “overpredicts” Switzerland
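
The same arithmetic written out as a quick check, using only the numbers on the slide:

```python
# The Switzerland example from the slide, reproduced directly from its numbers.
educ, pchexp = 9.418360, 2646.442
prediction = 36.900 + 2.9787 * educ + 0.004601 * pchexp   # approximately 77.13
residual = 72.71622 - prediction                          # approximately -4.41 (overprediction)
print(prediction, residual)
```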

  8. Using Residuals • As indicators of “bad” data • As indicators of observations that deserve attention • As a diagnostic tool to evaluate the regression model

  9. When to Remove “Outliers” • Outliers have very large residuals • Only if it is ABSOLUTELY necessary • The data are obviously miscoded • There is something clearly wrong with the observation • Do not remove outliers just because Minitab flags them. This is not sufficient reason.

  10. Units of Measurement • y = a + b1x1 + b2x2 + e • If you multiply every observation of variable x by the same constant, c, then the regression coefficient will be divided by c. • E.g., multiply X by .001 to change $ to thousands of $, then b is multiplied by 1000. b times x will be unchanged.
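
A small numerical illustration of this point, using made-up data (the income range and coefficients are arbitrary): rescaling x by .001 multiplies its estimated slope by 1000, and b times x is unchanged.

```python
# Synthetic illustration of the units-of-measurement result above.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10_000, 90_000, size=200)        # e.g., income in dollars
y = 5.0 + 0.0002 * x + rng.normal(0, 0.5, 200)   # arbitrary "true" line plus noise

b_dollars = np.polyfit(x, y, 1)[0]               # slope with x measured in $
b_thousands = np.polyfit(x * 0.001, y, 1)[0]     # slope with x measured in $000s

print(b_dollars, b_thousands)                    # second slope is 1000 times the first
print(np.allclose(b_dollars * x, b_thousands * (x * 0.001)))   # b*x is unchanged
```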

  11. Scaling the Data • Units of measurement and coefficients • Macro data and per capita figures • Gasoline data • WHO data • Micro data and normalizations • R&D and Profits

  12. The Gasoline Market Aggregate consumption or expenditure data would not be interesting. Income data are already per capita.

  13. The WHO Data Per Capita GDP and Per Capita Health Expenditure. Aggregate values would make no sense.

  14. Profits and R&D by Industry Is there a relationship between R&D and Profits? This just shows that big industries have larger profits and R&D than small ones. Gujarati, D. Basic Econometrics, McGraw Hill, 1995, p. 388.

  15. Normalized by Sales Profits/Sales = α + β R&D/Sales + ε
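
A minimal sketch of this normalized regression, assuming an illustrative file and column names (rd_profits.csv, Profits, RD, Sales) that are not from the original data set:

```python
# Sketch of the normalized regression Profits/Sales = alpha + beta * R&D/Sales + e.
import numpy as np
import pandas as pd

df = pd.read_csv("rd_profits.csv")            # hypothetical file name
y = df["Profits"] / df["Sales"]               # profits as a share of sales
x = df["RD"] / df["Sales"]                    # R&D intensity

X = np.column_stack([np.ones(len(df)), x])    # constant plus regressor
(alpha, beta), *_ = np.linalg.lstsq(X, y, rcond=None)
print(alpha, beta)
```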

  16. More Movie Madness • McDonald’s and Movies (Craig, Douglas, Greene: International Journal of Marketing) • Log Foreign Box Office(movie,country,year) = α + β1*LogBox(movie,US,year) + β2*LogPCIncome + β4*LogMacsPC + GenreEffect + CountryEffect + ε.

  17. We used McDonald’s Per Capita

  18. Movie Madness Data (n=2198)

  19. Macs and Movies
  Genres (MPAA): 1=Drama 2=Romance 3=Comedy 4=Action 5=Fantasy 6=Adventure 7=Family 8=Animated 9=Thriller 10=Mystery 11=Science Fiction 12=Horror 13=Crime
  Countries and Some of the Data
  Code  Country     Pop (mm)  Per Cap Income  # of McDonald's  Language
  1     Argentina         37           12090              173  Spanish
  2     Chile             15            9110               70  Spanish
  3     Spain             39           19180              300  Spanish
  4     Mexico            98            8810              270  Spanish
  5     Germany           82           25010             1152  German
  6     Austria            8           26310              159  German
  7     Australia         19           25370              680  English
  8     UK                60           23550             1152  English

  20. Movie Genres

  21. CRIME is the left-out (reference) GENRE. AUSTRIA is the left-out (reference) country. Australia and the UK were left out for other reasons (an algebraic problem with only 8 countries).

  22. Model Fit • How well does the model fit the data? • R2 measures fit – the larger the better • Time series: expect .9 or better • Cross sections: it depends • Social science data: .1 is good • Industry or market data: .5 is routine

  23. Pretty Good Fit: R2 = .722 Regression of Fuel Bill on Number of Rooms

  24. Success Measure • Hypothesis: There is no regression. • Equivalent Hypothesis: R2 = 0. • How to test: For now, a rough rule. Look for F > 2 for multiple regression. (The critical F was 4 for simple regression.) F = 144.34 for Movie Madness

  25. A Formal Test of the Regression Model • Is there a significant “relationship?” • Equivalently, is R2 > 0? • Statistically, not numerically. • Testing: • Compute F = [R2/K] / [(1 – R2)/(N – K – 1)], where K is the number of predictors • Determine if F is large using the appropriate “table”

  26. The F Test for the Model • Determine the appropriate “critical” value from the table. • Is the F from the computed model larger than the theoretical F from the table? • Yes: Conclude the relationship is significant • No: Conclude R2 = 0.

  27. For the F table: n1 = numerator degrees of freedom = number of predictors; n2 = denominator degrees of freedom = sample size – number of predictors – 1

  28. Compare Sample F to Critical F • F = 144.34 for Movie Madness • Critical value from the table is 1.57536. • Reject the hypothesis of no relationship.

  29. An Equivalent Approach • What is the “P Value?” • We observed an F of 144.34 (or, whatever it is). • If there really were no relationship, how likely is it that we would have observed an F this large (or larger)? • Depends on N and K • The probability is reported with the regression results as the P Value.
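
A minimal sketch that reproduces the overall F test from the figures reported on these slides (R2 = 57.0%, K = 20 predictors, N = 2198 observations), with scipy supplying the critical value and the P value:

```python
# Overall F test for the Movie Madness regression, from the reported R2, K, and N.
from scipy import stats

r2, k, n = 0.570, 20, 2198
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))          # approximately 144.3
f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)      # approximately 1.575
p_value = stats.f.sf(f_stat, dfn=k, dfd=n - k - 1)    # approximately 0.000

print(f_stat, f_crit, p_value)   # F far exceeds the critical value: reject R2 = 0
```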

  30. The F Test
  S = 0.952237   R-Sq = 57.0%   R-Sq(adj) = 56.6%
  Analysis of Variance
  Source            DF      SS        MS       F        P
  Regression          20    2617.58   130.88   144.34   0.000
  Residual Error    2177    1974.01   0.91
  Total             2197    4591.58

  31. A Huge Theorem • R2 always goes up when you add variables to your model. • Always.

  32. The Adjusted R Squared • Adjusted R2 penalizes your model for obtaining its fit with lots of variables. Adjusted R2 = 1 – [(N-1)/(N-K-1)]*(1 – R2) • Adjusted R2 is often denoted R̄2 (“R-bar squared”) • Adjusted R2 is not the mean of anything and it is not a square. This is just a name.
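
Plugging the Movie Madness figures from the output on the next slide into this formula reproduces the reported adjusted R2:

```python
# Check of the adjusted R2 formula against the reported output
# (R-Sq = 57.0%, R-Sq(adj) = 56.6%, N = 2198, K = 20).
n, k, r2 = 2198, 20, 0.570
adj_r2 = 1 - (n - 1) / (n - k - 1) * (1 - r2)
print(adj_r2)   # approximately 0.566
```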

  33. The Analysis of Variance
  S = 0.952237   R-Sq = 57.0%   R-Sq(adj) = 56.6%
  Analysis of Variance
  Source            DF      SS        MS       F        P
  Regression          20    2617.58   130.88   144.34   0.000
  Residual Error    2177    1974.01   0.91
  Total             2197    4591.58
  If N is very large, R2 and Adjusted R2 will not differ by very much. 2198 is quite large for this purpose.

  34. Exploring the Relationship • F statistic examines the entire relationship. Benchmark: F > 2 is good for a multiple regression. • What about individual coefficients?(E.g., is there a significant relationship between the number of McDonald’s and the local box office result?)

  35. Use individual “T” statistics. T > +2 or T < -2 suggests the variable is “significant.” T for LogPCMacs = +9.66. This is large.
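
As a check on this rule of thumb, the two-sided P value implied by T = 9.66 with 2198 – 20 – 1 = 2177 degrees of freedom (computed here with scipy) is effectively zero:

```python
# P value for the LogPCMacs t statistic reported above.
from scipy import stats

t, df = 9.66, 2198 - 20 - 1
p_value = 2 * stats.t.sf(abs(t), df)
print(p_value)   # far below 0.05, so LogPCMacs is "significant"
```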

  36. What About a Group of Variables? • Is Genre significant? • There are 12 genre variables • Some are “significant” (fantasy, mystery, horror) some are not. • Can we conclude the group as a whole is? • Maybe. We need a test.

  37. Theory for the Test • A larger model has a higher R2 than a smaller one. • (Larger model means it has all the variables in the smaller one, plus some additional ones) • Compute this statistic with a calculator: F = [(R2 of larger model – R2 of smaller model)/(number of added variables)] / [(1 – R2 of larger model)/(N – K – 1)], where K is the number of variables in the larger model

  38. Is Genre Significant? Calc -> Probability Distributions -> F… The critical value shown by Minitab is 1.76. With the 12 Genre indicator variables: R-Squared = 57.0%. Without the 12 Genre indicator variables: R-Squared = 55.4%. The F statistic is 6.750. F is greater than the critical value. Reject the hypothesis that all the genre coefficients are zero.
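
The same calculation from the two R-squared values on the slide, with scipy supplying the critical value instead of the Minitab menu:

```python
# Partial F test for the 12 genre dummies, from the reported R-squared values.
from scipy import stats

r2_full, r2_small = 0.570, 0.554
n, k_full, j = 2198, 20, 12                   # j = number of genre dummies dropped

f_stat = ((r2_full - r2_small) / j) / ((1 - r2_full) / (n - k_full - 1))   # approximately 6.75
f_crit = stats.f.ppf(0.95, dfn=j, dfd=n - k_full - 1)                      # approximately 1.76

print(f_stat, f_crit)   # F exceeds the critical value: genre matters as a group
```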

  39. Now What? • If the value that Minitab shows you is less than your F statistic, then your F statistic is large • I.e., conclude that the group of coefficients is “significant” • This means that at least one is nonzero, not that all necessarily are.

  40. Summary • Data preparation: missing values • Residuals and outliers • Scaling the data • Model fit and analysis of variance: R2 • Testing • One variable (coefficient) – the t test • A set of variables – the F test
