Statistical Inference and Regression Analysis: GB.3302.30

Professor William Greene, Stern School of Business, IOMS Department, Department of Economics

Inference and Regression: Perfect Collinearity

Perfect Multicollinearity • If X does not have full column rank, then at least one column can be written as a linear combination of the other columns. • X’X then does not have full rank and cannot be inverted. • b cannot be computed.

Multicollinearity
Enhanced Monet Area Effect Model: Height and Width Effects
Log(Price) = β1 + β2 log Area + β3 log Aspect Ratio + β4 log Height + β5 Signature + ε
(Aspect Ratio = Height/Width)

Short Rank X
Enhanced Monet Area Effect Model: Height and Width Effects
Log(Price) = β1 + β2 log Area + β3 log Aspect Ratio + β4 log Height + β5 Signature + ε
(Aspect Ratio = Height/Width)
x1 = 1, x2 = logArea, x3 = logAspect, x4 = logHeight, x5 = Signature
x2 = logH + logW
x3 = logH - logW
x4 = logH
x5 = Signature
x2 + x3 - 2x4 = (logH + logW) + (logH - logW) - 2logH = 0, so x4 = 1/2 x2 + 1/2 x3
Xc = 0 for c = [0, 1, 1, -2, 0]
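The rank deficiency above can be checked numerically. The sketch below uses simulated heights, widths, and signature indicators (hypothetical data, not the Monet sample) to build the five columns and confirm that X has rank 4 and that Xc = 0 for c = [0, 1, 1, -2, 0].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
log_h = rng.normal(4.0, 0.3, n)                   # simulated log heights (hypothetical)
log_w = rng.normal(4.2, 0.3, n)                   # simulated log widths (hypothetical)
signature = rng.integers(0, 2, n).astype(float)   # simulated 0/1 signature dummy

X = np.column_stack([
    np.ones(n),      # x1 = constant
    log_h + log_w,   # x2 = log Area
    log_h - log_w,   # x3 = log Aspect Ratio
    log_h,           # x4 = log Height
    signature,       # x5 = Signature
])

c = np.array([0.0, 1.0, 1.0, -2.0, 0.0])
rank = np.linalg.matrix_rank(X)       # 4, not 5: X is short ranked
max_abs_Xc = np.max(np.abs(X @ c))    # zero up to rounding error
print(rank, max_abs_Xc)
```

Dropping any one of x2, x3, or x4 restores full column rank.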

Inference and Regression: Least Squares Fit

Minimizing e’e
b minimizes $e'e = (y - Xb)'(y - Xb)$. Any other coefficient vector has a larger sum of squares. (Least squares is least squares.)
A quick proof: let d be any coefficient vector other than b, and let $u = y - Xd$. Then
$u'u = (y - Xd)'(y - Xd) = [y - Xb - X(d - b)]'[y - Xb - X(d - b)] = [e - X(d - b)]'[e - X(d - b)]$.
Expand, using $X'e = 0$ so that the cross terms vanish, to find
$u'u = e'e + (d - b)'X'X(d - b) > e'e$ (the inequality is strict when X has full column rank).
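A quick numerical check of this decomposition, as a minimal sketch on simulated data (the data and the perturbed vector d are hypothetical): the sum of squares at any d equals e'e plus the quadratic form in (d - b).

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares coefficients
e = y - X @ b                               # least squares residuals

d = b + rng.normal(size=K)                  # some other coefficient vector
u = y - X @ d

lhs = u @ u
rhs = e @ e + (d - b) @ (X.T @ X) @ (d - b)
print(lhs, rhs, e @ e)
```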

Dropping a Variable
An important special case: comparing the results that we get with and without a variable z in the equation, in addition to the other variables in X. Results that we can show using the previous result:
1. Dropping a variable (or variables) cannot improve the fit, that is, reduce the sum of squares. The relevant d is (*, *, *, …, 0), i.e., some vector that has a zero in a particular place.
2. Adding a variable (or variables) cannot degrade the fit, that is, increase the sum of squares. Compare the sum of squares when there is a zero in that location with the sum of squares when the vector does not contain the zero; just reverse the cases.
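Both results can be illustrated with a small sketch on simulated data (hypothetical; z is constructed to be unrelated to y): the regression that includes z never has a larger sum of squared residuals than the one that omits it.

```python
import numpy as np

def sse(X, y):
    """Sum of squared least squares residuals from regressing y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return float(e @ e)

rng = np.random.default_rng(2)
n = 60
x = rng.normal(size=n)
z = rng.normal(size=n)                      # unrelated to y by construction
y = 1.0 + 0.8 * x + rng.normal(size=n)

X_short = np.column_stack([np.ones(n), x])      # z dropped (coefficient forced to 0)
X_long = np.column_stack([np.ones(n), x, z])    # z included

sse_short = sse(X_short, y)
sse_long = sse(X_long, y)
print(sse_short, sse_long)
```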

The Fit of the Regression “Variation:” In the context of the “model” we speak of variation of a variable as movement of the variable, usually associated with (not necessarily caused by) movement of another variable.

Decomposing the Variation of y Total sum of squares = Regression Sum of Squares (SSR) + Residual Sum of Squares (SSE)

Decomposing the Variation

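A Fit Measure
$R^2 = \frac{\text{Regression SS}}{\text{Total SS}} = 1 - \frac{e'e}{\text{Total SS}}$ (Very Important Result.)
R² is guaranteed to lie between zero and one when (a) there is a constant term in X and (b) the line is computed by linear least squares. Without either condition it can fall outside that range.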

Understanding R² R² = squared correlation between y and the prediction of y given by the regression
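The decomposition and both readings of R² can be verified together. The following is a minimal sketch on simulated data (hypothetical), for a least squares regression that includes a constant term.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])        # constant term included
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
e = y - y_hat

tss = np.sum((y - y.mean()) ** 2)           # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)       # regression sum of squares
sse = e @ e                                 # residual sum of squares

r2 = 1.0 - sse / tss
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2  # squared correlation of y and its prediction
print(tss, ssr + sse, r2, r2_corr)
```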

Regression Results
-----------------------------------------------------------------------------
Ordinary     least squares regression ............
LHS=BOX      Mean                 =   20.72065
             Standard deviation   =   17.49244
----------   No. of observations  =         62   DegFreedom   Mean square
Regression   Sum of Squares       =    9203.46        2       4601.72954
Residual     Sum of Squares       =    9461.66       59        160.36711
Total        Sum of Squares       =    18665.1       61        305.98555
----------   Standard error of e  =   12.66361   Root MSE       12.35344
Fit          R-squared            =     .49308   R-bar squared    .47590
Model test   F[  2,   59]         =   28.69497   Prob F > F*      .00000
--------+--------------------------------------------------------------------
        |                  Standard            Prob.      95% Confidence
     BOX|  Coefficient       Error       t    |t|>T*         Interval
--------+--------------------------------------------------------------------
Constant|   -12.0721**      5.30813    -2.27  .0266    -22.4758    -1.6684
CNTWAIT3|    53.9033***    12.29513     4.38  .0000     29.8053    78.0013
  BUDGET|     .12740***      .04492     2.84  .0062      .03936     .21544
--------+--------------------------------------------------------------------

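Adding Variables • R² never falls when a variable z is added to the regression. • A useful general result: $R^2_{Xz} = R^2_X + (1 - R^2_X)\, r^{*2}_{yz}$, where $R^2_X$ is the fit using X alone, $R^2_{Xz}$ is the fit after z is added, and $r^{*}_{yz}$ is the partial correlation between y and z, controlling for X.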

Adding Variables to a Model
What is the effect of adding PN, PD, PS, YEAR to the model (one at a time)?
----------------------------------------------------------------------
Ordinary     least squares regression ............
LHS=G        Mean                 =  226.09444
             Standard deviation   =   50.59182
             Number of observs.   =         36
Model size   Parameters           =          3
             Degrees of freedom   =         33
Residuals    Sum of squares       = 1472.79834
Fit          R-squared            =     .98356
             Adjusted R-squared   =     .98256
Model test   F[  2,    33] (prob) = 987.1(.0000)
Effects of additional variables on the regression below:
-------------
Variable   Coefficient   New R-sqrd   Chg.R-sqrd   Partial-Rsq   Partial F
PD           -26.0499       .9867        .0031        .1880        7.411
PN           -15.1726       .9878        .0043        .2594       11.209
PS            -8.2171       .9890        .0055        .3320       15.904
YEAR          -2.1958       .9861        .0025        .1549        5.864
--------+-------------------------------------------------------------
Variable| Coefficient    Standard Error  t-ratio  P[|T|>t]   Mean of X
--------+-------------------------------------------------------------
Constant|   -79.7535***        8.67255    -9.196    .0000
      PG|   -15.1224***        1.88034    -8.042    .0000      2.31661
       Y|     .03692***         .00132    28.022    .0000      9232.86
--------+-------------------------------------------------------------
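The columns of the table above are linked by a standard identity: the new R² equals the old R² plus (1 - old R²) times the squared partial correlation of y with the added variable. The sketch below simply plugs in the rounded figures printed in the output, so agreement is only to about four decimals.

```python
# Figures copied from the printed output above (rounded as shown there).
r2_base = 0.98356   # R-squared of the regression of G on PG and Y

# variable: (reported new R-squared, reported partial R-squared)
table = {
    "PD": (0.9867, 0.1880),
    "PN": (0.9878, 0.2594),
    "PS": (0.9890, 0.3320),
    "YEAR": (0.9861, 0.1549),
}

# New R^2 implied by the identity R2_new = R2_old + (1 - R2_old) * partial
implied = {
    name: r2_base + (1.0 - r2_base) * partial
    for name, (_, partial) in table.items()
}

for name, (reported, _) in table.items():
    print(f"{name:5s} reported {reported:.4f}  implied {implied[name]:.4f}")
```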

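Adjusted R Squared • Adjusted R² (for degrees of freedom?): $\bar{R}^2 = 1 - \frac{n-1}{n-K}(1 - R^2)$, where n is the number of observations and K is the number of coefficients, including the constant. • Includes a penalty for variables that don’t add much fit. Can fall when a variable is added to the equation.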

Regression Results
-----------------------------------------------------------------------------
Ordinary     least squares regression ............
LHS=BOX      Mean                 =   20.72065
             Standard deviation   =   17.49244
----------   No. of observations  =         62   DegFreedom   Mean square
Regression   Sum of Squares       =    9203.46        2       4601.72954
Residual     Sum of Squares       =    9461.66       59        160.36711
Total        Sum of Squares       =    18665.1       61        305.98555
----------   Standard error of e  =   12.66361   Root MSE       12.35344
Fit          R-squared            =     .49308   R-bar squared    .47590
Model test   F[  2,   59]         =   28.69497   Prob F > F*      .00000
--------+--------------------------------------------------------------------
        |                  Standard            Prob.      95% Confidence
     BOX|  Coefficient       Error       t    |t|>T*         Interval
--------+--------------------------------------------------------------------
Constant|   -12.0721**      5.30813    -2.27  .0266    -22.4758    -1.6684
CNTWAIT3|    53.9033***    12.29513     4.38  .0000     29.8053    78.0013
  BUDGET|     .12740***      .04492     2.84  .0062      .03936     .21544
--------+--------------------------------------------------------------------

Adjusted R-Squared • As we will discover when we study regression with more than one variable, a researcher can increase R² just by adding variables to a model, even if those variables do not really explain y or have any real relationship to it at all. • To have a fit measure that accounts for this, “Adjusted R²” is a number that increases with the correlation but decreases with the number of variables.
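As a worked check, the usual degrees-of-freedom formula, adjusted R² = 1 - [e'e/(n - K)] / [Total SS/(n - 1)], can be applied to the sums of squares printed in the BOX regression output (n = 62 observations, K = 3 coefficients). The function name below is illustrative only.

```python
def adjusted_r2(sse, tss, n, k):
    """Adjusted R-squared: 1 - [SSE/(n-k)] / [TSS/(n-1)]."""
    return 1.0 - (sse / (n - k)) / (tss / (n - 1))

# Numbers copied from the BOX regression output.
sse, tss = 9461.66, 18665.1
n, k = 62, 3

r2 = 1.0 - sse / tss
r2_bar = adjusted_r2(sse, tss, n, k)
print(round(r2, 5), round(r2_bar, 5))
```

Both values agree with the R-squared and R-bar squared entries in the output to the printed precision.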

Notes About Adjusted R²

Inference and Regression: Transformed Data

Linear Transformations of Data • Change units of measurement by dividing every observation by a constant, e.g., dollars to millions of dollars (see the internet buzz regression) by dividing Box by 1,000,000. • Change meaning of variables: x = (x1 = nominal interest = i, x2 = inflation = dp, x3 = GDP), z = (x1 - x2 = real interest = i - dp, x2 = inflation = dp, x3 = GDP) • Change theory of art appreciation: x = (x1 = logHeight, x2 = logWidth, x3 = signature), z = (x1 - x2 = logAspectRatio, x2 = logWidth, x3 = signature)

(Linearly) Transformed Data • How does linear transformation affect the results of least squares? Z = XP for K×K nonsingular P. (Each variable in Z is a linear combination of the variables in X.) • Based on X, $b = (X'X)^{-1}X'y$. • You can show (just multiply it out) that the coefficients when y is regressed on Z are $c = P^{-1}b$. • “Fitted value” is $Zc = XPP^{-1}b = Xb$. The same!! • Residuals from using Z are $y - Zc = y - Xb$ (we just proved this). The same!! • Sum of squared residuals must be identical, as $y - Xb = e = y - Zc$. • R² must also be identical, as $R^2 = 1 - e'e/\text{(same total SS)}$.
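These claims are easy to confirm numerically. The sketch below uses simulated data and an arbitrary nonsingular P (both hypothetical) to check that the coefficients on Z equal the inverse of P times b and that the fitted values are unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

# A nonsingular 3x3 P: the second column of Z is (column 2 of X) - (column 3 of X).
P = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, -1.0, 1.0],
])
Z = X @ P

b, *_ = np.linalg.lstsq(X, y, rcond=None)   # regression of y on X
c, *_ = np.linalg.lstsq(Z, y, rcond=None)   # regression of y on Z

c_from_b = np.linalg.solve(P, b)            # P^{-1} b without forming the inverse
same_fit = np.allclose(Z @ c, X @ b)
print(c, c_from_b, same_fit)
```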

Principal Components • Z = XC • Fewer columns than X • Includes as much ‘variation’ of X as possible • Columns of Z are orthogonal • Why do we do this? • Collinearity • Combine variables of ambiguous identity such as test scores as measures of ‘ability’
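One common way to build such a Z is from the singular value decomposition of the centered data. The sketch below is an illustration on simulated, highly collinear columns (hypothetical data); the source does not say how the PCBUZZ variable in the later regression was constructed, so centering without rescaling is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 62
common = rng.normal(size=n)                  # shared factor driving all four columns
X = np.column_stack([common + 0.3 * rng.normal(size=n) for _ in range(4)])

Xc = X - X.mean(axis=0)                      # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

C = Vt.T[:, :2]                              # weights for the first two components
Z = Xc @ C                                   # Z = XC: fewer columns than X

share = s ** 2 / np.sum(s ** 2)              # share of total variation per component
off_diag = (Z.T @ Z)[0, 1]                   # zero: columns of Z are orthogonal
print(share, off_diag)
```

With columns this collinear, the first component carries most of the variation, which is the motivation for replacing several near-duplicate regressors with one combined variable.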

+----------------------------------------------------+
| Ordinary least squares regression                  |
| LHS=LOGBOX  Mean                 =   16.47993      |
|             Standard deviation   =   .9429722      |
|             Number of observs.   =         62      |
| Residuals   Sum of squares       =   20.54972      |
|             Standard error of e  =   .6475971      |
| Fit         R-squared            =   .6211405      |
|             Adjusted R-squared   =   .5283586      |
+----------------------------------------------------+
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient  | Standard Error |t-ratio |P[|T|>t]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
|Constant|    12.5388***       .98766     12.695   .0000              |
|LOGBUDGT|     .23193          .18346      1.264   .2122     3.71468  |
|STARPOWR|     .00175          .01303       .135   .8935     18.0316  |
|SEQUEL  |     .43480          .29668      1.466   .1492      .14516  |
|MPRATING|    -.26265*         .14179     -1.852   .0700     2.96774  |
|ACTION  |    -.83091***       .29297     -2.836   .0066      .22581  |
|COMEDY  |    -.03344          .23626      -.142   .8880      .32258  |
|ANIMATED|    -.82655**        .38407     -2.152   .0363      .09677  |
|HORROR  |     .33094          .36318       .911   .3666      .09677  |
|4 INTERNET BUZZ VARIABLES                                            |
|LOGADCT |     .29451**        .13146      2.240   .0296     8.16947  |
|LOGCMSON|     .05950          .12633       .471   .6397     3.60648  |
|LOGFNDGO|     .02322          .11460       .203   .8403     5.95764  |
|CNTWAIT3|    2.59489***       .90981      2.852   .0063      .48242  |
+--------+------------------------------------------------------------+

+----------------------------------------------------+
| Ordinary least squares regression                  |
| LHS=LOGBOX  Mean                 =   16.47993      |
|             Standard deviation   =   .9429722      |
|             Number of observs.   =         62      |
| Residuals   Sum of squares       =   25.36721      |
|             Standard error of e  =   .6984489      |
| Fit         R-squared            =   .5323241      |
|             Adjusted R-squared   =   .4513802      |
+----------------------------------------------------+
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient  | Standard Error |t-ratio |P[|T|>t]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
|Constant|    11.9602***       .91818     13.026   .0000              |
|LOGBUDGT|     .38159**        .18711      2.039   .0465     3.71468  |
|STARPOWR|     .01303          .01315       .991   .3263     18.0316  |
|SEQUEL  |     .33147          .28492      1.163   .2500      .14516  |
|MPRATING|    -.21185          .13975     -1.516   .1356     2.96774  |
|ACTION  |    -.81404**        .30760     -2.646   .0107      .22581  |
|COMEDY  |     .04048          .25367       .160   .8738      .32258  |
|ANIMATED|    -.80183*         .40776     -1.966   .0546      .09677  |
|HORROR  |     .47454          .38629      1.228   .2248      .09677  |
|PCBUZZ  |     .39704***       .08575      4.630   .0000     9.19362  |
+--------+------------------------------------------------------------+

Inference and Regression: Model Building and Functional Form

Using Logs

Time Trends in Regression • y = α + β1x + β2t + ε: β2 is the period-to-period increase not explained by anything else. • log y = α + β1 log x + β2t + ε (not log t, just t): 100β2 is the period-to-period % increase not explained by anything else.
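The interpretation of the trend coefficient in the log model can be illustrated with a minimal sketch. The series below is simulated with a 2% per-period trend (a hypothetical value chosen for the example), and the regression recovers roughly that figure as 100β2.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 40
t = np.arange(T, dtype=float)               # time index: t itself, not log t
log_x = rng.normal(size=T)
growth = 0.02                               # true trend: 2% per period (assumed)
log_y = 1.0 + 0.5 * log_x + growth * t + 0.01 * rng.normal(size=T)

X = np.column_stack([np.ones(T), log_x, t])
coef, *_ = np.linalg.lstsq(X, log_y, rcond=None)

pct_per_period = 100.0 * coef[2]            # 100 * beta2, read as % change per period
print(round(pct_per_period, 3))
```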

U.S. Gasoline Market: Price and Income Elasticities. Downward Trend in Gasoline Usage

Application: Health Care Data
German Health Care Usage Data. There are altogether 27,326 observations on German households, 1984-1994.
DOCTOR = 1(number of doctor visits > 0)
HOSPITAL = 1(number of hospital visits > 0)
HSAT = health satisfaction, coded 0 (low) - 10 (high)
DOCVIS = number of doctor visits in last three months
HOSPVIS = number of hospital visits in last calendar year
PUBLIC = insured in public health insurance = 1; otherwise = 0
ADDON = insured by add-on insurance = 1; otherwise = 0
INCOME = household nominal monthly net income in German marks / 10000
HHKIDS = children under age 16 in the household = 1; otherwise = 0
EDUC = years of schooling
FEMALE = 1(female headed household)
AGE = age in years
MARRIED = marital status

Dummy Variable • D = 0 in one case and 1 in the other • Y = a + bX + cD + e • When D = 0, E[Y|X] = a + bX • When D = 1, E[Y|X] = a + c + bX
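A short simulation makes the intercept shift concrete. In this sketch (hypothetical data with a = 1, b = 2, c = 3), least squares recovers c as the vertical gap between the two parallel regression lines.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
d = (rng.random(n) < 0.5).astype(float)     # dummy: 0 in one case, 1 in the other
y = 1.0 + 2.0 * x + 3.0 * d + 0.1 * rng.normal(size=n)

X = np.column_stack([np.ones(n), x, d])
a, b, c = np.linalg.lstsq(X, y, rcond=None)[0]

# E[Y|X] = a + bX when D = 0, and (a + c) + bX when D = 1
intercept_d0 = a
intercept_d1 = a + c
print(round(intercept_d0, 3), round(intercept_d1, 3), round(b, 3))
```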

A Conspiracy Theory for Art Sales at Auction: Sotheby’s and Christie’s conspired on commission rates from 1995 to about 2000.

If the Theory is Correct… [Figure comparing paintings sold from 1995 to 2000 with paintings sold before 1995 or after 2000]

Evidence: Two Dummy Variables
Signature and Conspiracy Effects
The statistical evidence seems to be consistent with the theory.

Set of Dummy Variables • Usually, Z = Type = 1,2,…,K • Y = a + bX + d1 if Type=1 + d2 if Type=2 … + dK if Type=K

A Set of Dummy Variables • Complete set of dummy variables divides the sample into groups. • Fit the regression with “group” effects. • Need to drop one (any one) of the variables to compute the regression. (Avoid the “dummy variable trap.”)

Group Effects in Teacher Ratings

Rankings of 132 U.S. Liberal Arts Colleges. Nancy Burnett: Journal of Economic Education, 1998. Reputation = α + β1Religious + β2GenderEcon + β3EconFac + β4North + β5South + β6Midwest + β7West + ε

Minitab does not like this model.

Too many dummy variables cause perfect multicollinearity • If we use all four region dummies • Reputation = a + bn + … if north • Reputation = a + bm + … if midwest • Reputation = a + bs + … if south • Reputation = a + bw + … if west • Only three are needed, because the four dummies sum to the constant column, so Minitab dropped west • Reputation = a + bn + … if north • Reputation = a + bm + … if midwest • Reputation = a + bs + … if south • Reputation = a + … if west
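The trap is a rank problem and can be seen directly. The sketch below assigns 132 simulated observations to four regions at random (hypothetical data, not the college rankings) and compares the rank of X with all four dummies to the rank with one dropped.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 132
region = rng.integers(0, 4, n)              # 0=north, 1=south, 2=midwest, 3=west (labels assumed)
D = np.eye(4)[region]                       # one 0/1 dummy column per region

X_all = np.column_stack([np.ones(n), D])           # constant + all four dummies
X_drop = np.column_stack([np.ones(n), D[:, :3]])   # constant + three dummies (west omitted)

rank_all = np.linalg.matrix_rank(X_all)     # 4 with 5 columns: perfectly collinear
rank_drop = np.linalg.matrix_rank(X_drop)   # 4 with 4 columns: full column rank
dummies_sum_to_one = np.allclose(D.sum(axis=1), 1.0)
print(rank_all, rank_drop, dummies_sum_to_one)
```

Which dummy is dropped does not change the fit; it only changes the group against which the remaining coefficients are measured.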

Unordered Categorical Variables
House price data (fictitious)
Type 1 = Split level
Type 2 = Ranch
Type 3 = Colonial
Type 4 = Tudor
Use 3 dummy variables for this kind of data. (Not all 4.) Using the variable STYLE itself in the model makes no sense. You could change the numbering scale any way you like; 1, 2, 3, 4 are just labels.

Transform Style to Types

Hedonic House Price Regression Each of these is relative to a Split Level, since that is the omitted category. E.g., the price of a Ranch house is $74,369 less than a Split Level of the same size with the same number of bedrooms.

We used McDonald’s Per Capita

More Movie Madness • McDonald’s and Movies (Craig, Douglas, Greene: International Journal of Marketing) • Log Foreign Box Office(movie,country,year) = α + β1*LogBox(movie,US,year) + β2*LogPCIncome + β4LogMacsPC + GenreEffect + CountryEffect + ε.
