This course provides an in-depth overview of simple and multiple regression analysis, exploring how to interpret the results and make predictions using regression models. Understand the steps of model selection, estimation, validation, and implementation with case study examples and chapter goals related to regression analysis. Learn to determine when regression analysis is appropriate, the assumptions underlying regression, and the influence of each variable in prediction.
Statistics & Data Analysis
Course Number: B01.1305
Course Section: 60
Meeting Time: Monday 6-9:30 pm
Multiple Regression
Class Outline • Overview of simple and multiple regression • Details of multiple regression • Case study example
Multiple Regression Chapters 12-13
Chapter Goals • Determine when regression analysis is appropriate • Understand how regression helps make predictions • Understand the assumptions underlying regression • Interpret results of a regression analysis from a statistical and managerial viewpoint • Understand the steps of model selection, estimation, validation and implementation
What is Multiple Regression Analysis?
• A statistical technique used to analyze the relationship between a single dependent variable and several independent variables
• Objective is to use the independent variables to predict the value of the single dependent variable
• Each independent variable is weighted by the analysis procedure to ensure maximal prediction
• Weights denote the relative contribution of each independent variable to the overall prediction
• Weights facilitate interpretation of the influence of each variable on the prediction
Motivating Example
• Credit company interested in determining which factors affected the number of credit cards used
• Three potential factors were identified
  • Family size
  • Family income
  • Number of cars owned
• Data were collected for each of 8 families
Setting a Baseline
• Let's first calculate a baseline against which to compare the predictive ability of our regression models
• The baseline should represent our best prediction without the use of any independent variables
• For comparison with regression models, the average of the dependent variable gives the best such baseline prediction: it minimizes the sum of squared prediction errors
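As a rough illustration (not part of the course materials), the sketch below computes this baseline with hypothetical credit-card counts for eight families; both the numbers and the variable name are made up.

import numpy as np

cards = np.array([4, 6, 6, 7, 8, 7, 8, 10])      # hypothetical number of cards per family

baseline = cards.mean()                           # best prediction using no independent variables
sse_baseline = np.sum((cards - baseline) ** 2)    # squared prediction error of the baseline
print(f"baseline prediction = {baseline:.2f}, SSE = {sse_baseline:.2f}")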
Simple Regression • We are interested in improving our predictions • Let's determine whether knowledge of one of these independent variables improves our predictions • Simple regression is a procedure for predicting the dependent variable from a single independent variable by minimizing the sum of squared errors of prediction
Correlation Coefficient • Correlation coefficient (r) describes the linear relationship between two variables • Two variables are said to be correlated if changes in one variable are associated with changes in the other variable • What are the properties of the correlation coefficient?
Correlation Matrix

          Y       v1      v2
 v1    0.866
       0.005
 v2    0.829   0.673
       0.011   0.068
 v3    0.342   0.192   0.301
       0.407   0.649   0.469

Cell Contents: Pearson correlation
               P-Value
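For reference, a minimal sketch of how one cell of such a matrix (a Pearson correlation plus its p-value) could be computed in Python; the family-size and card-count values are hypothetical stand-ins, not the course data.

import numpy as np
from scipy import stats

family_size = np.array([2, 2, 4, 4, 5, 5, 6, 6])     # hypothetical v1
cards       = np.array([4, 6, 6, 7, 8, 7, 8, 10])    # hypothetical Y

r, p = stats.pearsonr(family_size, cards)             # Pearson correlation and its p-value
print(f"r = {r:.3f}, p = {p:.3f}")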
Simple Regression Results

The regression equation is
Y = 2.87 + 0.971 v1

Predictor     Coef   SE Coef      T      P
Constant     2.871     1.029   2.79  0.032
v1          0.9714    0.2286   4.25  0.005

S = 0.9562   R-Sq = 75.1%   R-Sq(adj) = 70.9%
Confidence Interval for Prediction
• Because we did not achieve perfect prediction, we also need to estimate the range of predicted values we might expect
• Point estimate is our best estimate of the dependent variable
• From this point estimate, we can also calculate the range of predicted values based on a measure of the prediction errors we expect to make
• For example:
  • The predicted number of credit cards for the average family size of 4.25 is 7.00
  • The expected range (95% prediction interval) is (4.518, 9.482)
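A minimal sketch of how a point estimate and 95% prediction interval could be produced with statsmodels; the data are hypothetical stand-ins for the eight families, so the output will not reproduce the numbers above.

import numpy as np
import statsmodels.api as sm

family_size = np.array([2, 2, 4, 4, 5, 5, 6, 6])     # hypothetical v1
cards       = np.array([4, 6, 6, 7, 8, 7, 8, 10])    # hypothetical Y

fit = sm.OLS(cards, sm.add_constant(family_size)).fit()   # simple regression: Y = b0 + b1*v1

new = np.array([[1.0, 4.25]])                              # [intercept, family size = 4.25]
pred = fit.get_prediction(new).summary_frame(alpha=0.05)
print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])      # point estimate and 95% prediction interval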
Prediction using Several Variables • Just demonstrated how simple regression helped improve our prediction of credit card usage • By using data on family size, our predictions are much more accurate than using the simple arithmetic average • Can we improve our prediction even further by using additional data?
Impact of Multicollinearity
• Ability of additional independent variables to improve prediction depends on:
  • Correlation between the dependent and independent variables
  • Correlation among the independent variables
• Multicollinearity: association between independent variables
• Impact:
  • Reduces any single independent variable's predictive power
  • As collinearity increases, the unique variance explained by each variable decreases
Multiple Regression Equation

The regression equation is
Y = 0.48 + 0.632 v1 + 0.216 v2

Predictor     Coef   SE Coef      T      P
Constant     0.482     1.461   0.33  0.755
v1          0.6322    0.2523   2.51  0.054
v2          0.2158    0.1080   2.00  0.102

S = 0.7810   R-Sq = 86.1%   R-Sq(adj) = 80.6%
Multiple Regression Equation

The regression equation is
Y = 0.29 + 0.635 v1 + 0.200 v2 + 0.272 v3

Predictor     Coef   SE Coef      T      P
Constant     0.286     1.606   0.18  0.867
v1          0.6346    0.2710   2.34  0.079
v2          0.1995    0.1194   1.67  0.170
v3          0.2716    0.4702   0.58  0.594

S = 0.8389   R-Sq = 87.2%   R-Sq(adj) = 77.6%
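A sketch of how this kind of output could be reproduced with statsmodels, using hypothetical stand-in data for the eight families (v1 = family size, v2 = income, v3 = cars owned); the coefficients will not match the Minitab output above.

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({                                    # hypothetical stand-in data
    "v1": [2, 2, 4, 4, 5, 5, 6, 6],
    "v2": [14, 16, 14, 17, 18, 21, 17, 25],
    "v3": [1, 2, 2, 1, 3, 2, 1, 2],
    "Y":  [4, 6, 6, 7, 8, 7, 8, 10],
})

fit = sm.OLS(df["Y"], sm.add_constant(df[["v1", "v2", "v3"]])).fit()
print(fit.summary())                                   # coefficients, t-tests, R-Sq, R-Sq(adj)
print(fit.rsquared, fit.rsquared_adj)                  # adjusted R-Sq can drop when a weak predictor is added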
Summary • Regression analysis is a simple dependence technique that can provide both prediction and explanation
Decision Process for Multiple Regression • STAGE 1: Define objectives • STAGE 2: Research design • STAGE 3: Modeling Assumptions • STAGE 4: Estimate Model and Assess Fit • STAGE 5: Interpret Regression Model • STAGE 6: Validate Results
Define Objectives
• Objectives of multiple regression
  • Form the optimal predictor of the dependent measure
  • Provide an objective means of assessing the predictive power of a set of variables
  • Objectively assess the degree and direction of the relationship between the dependent and independent variables
  • Provide insight into the relationships among the independent variables in their predictive ability
• Appropriate when we are interested in a statistical (not functional) relationship
• Variable selection
  • Ultimate success depends on selecting meaningful variables
  • Measurement error: the degree to which the dependent variable is an accurate measure of the concept being studied
  • Specification error: inclusion of irrelevant variables or omission of relevant variables
Research Design
• Researcher must consider
  • Sample size
  • Nature of independent variables
  • Possible creation of new variables
• Incorporate dummy variables
  • Represent categories in the model
  • Requires k-1 variables to represent k categories
• Represent curvilinear effects with transformations or polynomials
Dummy Variables
• Variables used to represent categorical variables
• Two categories: 0 = male; 1 = female
• k categories (urban, suburban, rural)
  • Requires k-1 variables
  • Choose a base category (residence = urban)
Dummy Variables (cont.) • Interpretation: Regression coefficient represents the expected difference in the dependent variable between the category and base category…holding all other variables constant • Example: A regression model relates a person’s percentage salary increase to seniority in years, gender, and location (urban, suburban, or rural)
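A minimal sketch of dummy-variable coding in Python with hypothetical records; the column names and values are invented for illustration, and the dropped level of each factor serves as the base category.

import pandas as pd

df = pd.DataFrame({                                    # hypothetical records
    "seniority": [3, 7, 2, 10],
    "gender":    ["male", "female", "female", "male"],
    "residence": ["urban", "suburban", "rural", "urban"],
})

# Fix the category order so the first level (male, urban) is the base category.
df["gender"]    = pd.Categorical(df["gender"], categories=["male", "female"])
df["residence"] = pd.Categorical(df["residence"], categories=["urban", "suburban", "rural"])

dummies = pd.get_dummies(df, columns=["gender", "residence"], drop_first=True)
print(dummies)    # k categories become k-1 dummy columns; coefficients measure differences from the base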
Transformations to Linearity
• If the relationship between an independent and dependent variable is not linear, we can straighten it out
• Transformations typically done by trial-and-error
  • Square root
  • Logarithm
  • Inverse
  • Polynomial terms
• Key features to look for
  • Is the relation nonlinear?
  • Is there a pattern of increasing variability along the vertical axis?
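A small sketch of the trial-and-error transformations listed above; the predictor values are hypothetical.

import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])    # hypothetical skewed predictor

x_sqrt = np.sqrt(x)     # square-root transformation
x_log  = np.log(x)      # logarithmic transformation
x_inv  = 1.0 / x        # inverse transformation
# Re-plot the dependent variable against each transformed version and keep
# whichever scatterplot looks most nearly linear with roughly constant spread.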
Adding Polynomial Terms • Polynomials are power transformations of the original variables • Any number of nonlinear components may be added depending on the relationship • Each new polynomial term is entered into the regression equation and has its significance assessed
Adding Polynomial Terms (cont.)

The regression equation is
Y = -589 + 107 X

Predictor      Coef   SE Coef       T      P
Constant    -588.67     42.32  -13.91  0.000
X           106.995     2.700   39.63  0.000

S = 166.9   R-Sq = 94.1%
Adding Polynomial Terms (cont.)

The regression equation is
Y = -28.0 + 11.9 X + 3.30 X^2

Predictor      Coef   SE Coef       T      P
Constant     -28.02     62.32   -0.45  0.654
X            11.853     9.502    1.25  0.215
X^2          3.2961    0.3226   10.22  0.000

S = 116.4   R-Sq = 97.2%   R-Sq(adj) = 97.1%
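A sketch of the same linear-versus-quadratic comparison in statsmodels; the data are simulated for illustration, so the coefficients will differ from the output above.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(1, 30, 60)
y = 3.0 * x**2 + 10.0 * x + rng.normal(0, 100, size=x.size)    # hypothetical curved data

fit_lin  = sm.OLS(y, sm.add_constant(x)).fit()                           # Y = b0 + b1*X
fit_quad = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()  # adds the X^2 term

print(fit_lin.rsquared, fit_quad.rsquared)    # the quadratic term should raise R-Sq
print(fit_quad.pvalues[-1])                   # significance test for the X^2 coefficient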
Transform or Polynomial?
• Data transformations are useful only for simple curvilinear relationships
  • Do not provide a statistical means for assessing appropriateness
  • Only accommodate univariate terms
• Polynomials are restrictive with small sample sizes
  • Also introduce some multicollinearity
• Common practice:
  • Start with the linear component
  • Sequentially add higher-order polynomial terms until the added term is non-significant
Interaction Effects
• Occur when the effect of one independent variable on the dependent variable changes across values of another independent variable
• Example: might expect the effect of family size to depend on family income
  • The change in credit card usage associated with family size might be smaller for families with low incomes and larger for families with higher incomes
• Without the interaction term, we assume that family size has a constant effect on the number of cards used
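A minimal sketch of adding an interaction term with the statsmodels formula interface; the data are hypothetical stand-ins for the credit-card example.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({                      # hypothetical: Y = cards, v1 = family size, v2 = income
    "Y":  [4, 6, 6, 7, 8, 7, 8, 10],
    "v1": [2, 2, 4, 4, 5, 5, 6, 6],
    "v2": [14, 16, 14, 17, 18, 21, 17, 25],
})

# "v1 * v2" expands to v1 + v2 + v1:v2; the v1:v2 term lets the effect of
# family size on card usage change with income level.
fit = smf.ols("Y ~ v1 * v2", data=df).fit()
print(fit.summary())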
Modeling Assumptions
• Linearity of the phenomenon measured
  • Review scatterplots of dependent versus independent variables
• Constant variance of the error terms
  • Review residuals versus fitted values plot
  • Review residuals versus independent variable plot
• Independence of the error terms
  • Review residuals versus order plot
• Normality of the error term distribution
  • Review histogram of residuals
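A sketch of the diagnostic plots listed above, built from simulated data that happens to satisfy the assumptions; variable names are illustrative.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 80)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 80)           # simulated data meeting the assumptions

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid, fitted = fit.resid, fit.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(fitted, resid); axes[0].set_title("Residuals vs fitted")       # constant variance
axes[1].plot(resid, marker="o"); axes[1].set_title("Residuals vs order")       # independence
axes[2].hist(resid); axes[2].set_title("Histogram of residuals")               # normality
plt.tight_layout(); plt.show()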
Estimate Model and Assess Fit
• Model selection
  • Confirmatory specification
  • Sequential search methods
    • Stepwise
    • Forward
    • Backward
• Test whether the fitted model meets the regression assumptions
• Identify influential observations
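A minimal sketch of one sequential search idea, forward selection scored by adjusted R-Sq (Minitab's stepwise procedure uses F-tests, so this is a simplified stand-in); the data are the hypothetical eight-family values used earlier.

import pandas as pd
import statsmodels.api as sm

def forward_select(df, response, candidates):
    # Greedily add the predictor that most improves adjusted R-Sq; stop when nothing helps.
    chosen, best_adj = [], -float("inf")
    while candidates:
        scores = [(sm.OLS(df[response], sm.add_constant(df[chosen + [c]])).fit().rsquared_adj, c)
                  for c in candidates]
        adj, c = max(scores)
        if adj <= best_adj:
            break
        best_adj, chosen = adj, chosen + [c]
        candidates = [v for v in candidates if v != c]
    return chosen, best_adj

df = pd.DataFrame({                                    # hypothetical stand-in data
    "v1": [2, 2, 4, 4, 5, 5, 6, 6],
    "v2": [14, 16, 14, 17, 18, 21, 17, 25],
    "v3": [1, 2, 2, 1, 3, 2, 1, 2],
    "Y":  [4, 6, 6, 7, 8, 7, 8, 10],
})
print(forward_select(df, "Y", ["v1", "v2", "v3"]))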
Interpret Regression Model • Evaluate regression model that was estimated • Assess and interpret regression model coefficients • Evaluate potential variables that were omitted during model selection • Multicollinearity can have an effect
Assessing Multicollinearity • Key issue in interpreting regression analysis is the correlation among the independent variables • In most situations, multicollinearity will exist • Researcher needs to • Assess degree • Determine its impact • Determine an appropriate remedy
Effects of Multicollinearity
• Effects on explanation
  • Limits the size of R-Sq
  • Makes it difficult to add unique explanatory prediction from additional variables
  • Makes determining the contribution of each variable difficult
    • Effects of variables are "mixed"
• Effects on estimation
  • Can prevent coefficient estimation (in extreme cases)
  • Coefficients incorrectly estimated
  • Coefficients having wrong signs
Identifying Multicollinearity
• Variance Inflation Factor (VIF)
  • Tells us the degree to which each independent variable is explained by the other independent variables
  • VIF = 1 indicates no relationship with the other predictors; values above 1 indicate increasing multicollinearity
  • VIF > 5 suggests the coefficients may be poorly estimated because of multicollinearity
• The largest VIF among all predictors is often used as an indicator of severe multicollinearity
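A minimal sketch of computing VIFs with statsmodels, again using the hypothetical stand-in predictors.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({                                     # hypothetical v1 = family size, v2 = income, v3 = cars
    "v1": [2, 2, 4, 4, 5, 5, 6, 6],
    "v2": [14, 16, 14, 17, 18, 21, 17, 25],
    "v3": [1, 2, 2, 1, 3, 2, 1, 2],
})
X = sm.add_constant(X)

for i, name in enumerate(X.columns[1:], start=1):      # skip the constant in column 0
    print(name, round(variance_inflation_factor(X.values, i), 2))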
Remedies for Multicollinearity
• Omit one or more highly correlated variables
  • Could create specification error
• Use the model for prediction only
• Use the correlations to understand the variable relationships
• Use a more sophisticated method of analysis
  • Beyond the scope of this class
Validate Results
• Best guide to model "quality" is testing it on data that were not used to build the model
  • Test on additional or "held-out" data
• Confirm that all outliers and influential points have been addressed
• Confirm that all assumptions are met
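A rough sketch of validation on held-out data using simulated values; the split fraction and data are illustrative assumptions, not from the course.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
X = rng.normal(size=(n, 3))                                    # simulated predictors
y = 1.0 + X @ np.array([0.8, 0.5, 0.0]) + rng.normal(0, 1, n)

train = np.arange(n) < 70                                      # simple 70/30 hold-out split
test  = ~train

fit = sm.OLS(y[train], sm.add_constant(X[train])).fit()
pred = fit.predict(sm.add_constant(X[test]))

holdout_r2 = 1 - np.sum((y[test] - pred) ** 2) / np.sum((y[test] - y[test].mean()) ** 2)
print(f"training R-Sq = {fit.rsquared:.3f}, hold-out R-Sq = {holdout_r2:.3f}")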
Illustration of a Regression Analysis • Marketing managers of a company are having difficulty in evaluating the field sales representatives’ performance • Reps travel among the outlets and create displays trying to increase sales volume • Job involves lots of travel time
Illustration Data
• Data are collected on 51 reps
• DATA:
  • District: district number
  • Profit: rep's net profit margin
  • Area: thousands of square miles
  • POPN: millions of people in district
  • OUTLETS: number of outlets in district
  • COMMIS: 1 = full commission; 0 = partially salaried
Next Time… • Conclude regression analysis • Introduction to time series regression • Review for final exam