This course provides an in-depth overview of simple and multiple regression analysis, exploring how to interpret the results and make predictions using regression models. Understand the steps of model selection, estimation, validation, and implementation with case study examples and chapter goals related to regression analysis. Learn to determine when regression analysis is appropriate, the assumptions underlying regression, and the influence of each variable in prediction.
Statistics & Data Analysis
Course Number: B01.1305
Course Section: 60
Meeting Time: Monday 6-9:30 pm
Multiple Regression
Class Outline • Overview of simple and multiple regression • Details of multiple regression • Case study example
Multiple Regression Chapters 12-13
Chapter Goals • Determine when regression analysis is appropriate • Understand how regression helps make predictions • Understand the assumptions underlying regression • Interpret results of a regression analysis from a statistical and managerial viewpoint • Understand the steps of model selection, estimation, validation and implementation
What is Multiple Regression Analysis?
• A statistical technique used to analyze the relationship between a single dependent variable and several independent variables
• Objective is to use the independent variables to predict the value of the single dependent variable
• Each independent variable is weighted by the analysis procedure to ensure maximal prediction
• Weights denote the relative contribution of each independent variable to the overall prediction
• Weights facilitate interpretation of the influence of each variable on the prediction
Motivating Example
• Credit company interested in determining which factors affected the number of credit cards used
• Three potential factors were identified
  • Family size
  • Family income
  • Number of cars owned
• Data were collected for each of 8 families
Setting a Baseline
• Let's first calculate a baseline against which to compare the predictive ability of our regression models
• The baseline should represent our best prediction without the use of any independent variables
• For comparison with regression models, the average of the dependent variable gives the best such baseline prediction: it minimizes the sum of squared prediction errors
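As a rough illustration (not part of the course materials), the sketch below computes this baseline with hypothetical credit-card counts for eight families; both the numbers and the variable name are made up.

import numpy as np

cards = np.array([4, 6, 6, 7, 8, 7, 8, 10])      # hypothetical number of cards per family

baseline = cards.mean()                           # best prediction using no independent variables
sse_baseline = np.sum((cards - baseline) ** 2)    # squared prediction error of the baseline
print(f"baseline prediction = {baseline:.2f}, SSE = {sse_baseline:.2f}")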
Simple Regression • We are interested in improving our predictions • Let's determine whether knowledge of one of these independent variables improves our predictions • Simple regression is a procedure for predicting the dependent variable from a single independent variable by minimizing the sum of squared errors of prediction
Correlation Coefficient • Correlation coefficient (r) describes the linear relationship between two variables • Two variables are said to be correlated if changes in one variable are associated with changes in the other variable • What are the properties of the correlation coefficient?
Correlation Matrix

          Y       v1      v2
 v1    0.866
       0.005
 v2    0.829   0.673
       0.011   0.068
 v3    0.342   0.192   0.301
       0.407   0.649   0.469

Cell Contents: Pearson correlation
               P-Value
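For reference, a minimal sketch of how one cell of such a matrix (a Pearson correlation plus its p-value) could be computed in Python; the family-size and card-count values are hypothetical stand-ins, not the course data.

import numpy as np
from scipy import stats

family_size = np.array([2, 2, 4, 4, 5, 5, 6, 6])     # hypothetical v1
cards       = np.array([4, 6, 6, 7, 8, 7, 8, 10])    # hypothetical Y

r, p = stats.pearsonr(family_size, cards)             # Pearson correlation and its p-value
print(f"r = {r:.3f}, p = {p:.3f}")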
Simple Regression Results

The regression equation is
Y = 2.87 + 0.971 v1

Predictor     Coef   SE Coef      T      P
Constant     2.871     1.029   2.79  0.032
v1          0.9714    0.2286   4.25  0.005

S = 0.9562   R-Sq = 75.1%   R-Sq(adj) = 70.9%
Confidence Interval for Prediction
• Because we did not achieve perfect prediction, we also need to estimate the range of predicted values we might expect
• Point estimate is our best estimate of the dependent variable
• From this point estimate, we can also calculate the range of predicted values based on a measure of the prediction errors we expect to make
• For example:
  • The predicted number of credit cards for the average family size of 4.25 is 7.00
  • The expected range (95% prediction interval) is (4.518, 9.482)
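A minimal sketch of how a point estimate and 95% prediction interval could be produced with statsmodels; the data are hypothetical stand-ins for the eight families, so the output will not reproduce the numbers above.

import numpy as np
import statsmodels.api as sm

family_size = np.array([2, 2, 4, 4, 5, 5, 6, 6])     # hypothetical v1
cards       = np.array([4, 6, 6, 7, 8, 7, 8, 10])    # hypothetical Y

fit = sm.OLS(cards, sm.add_constant(family_size)).fit()   # simple regression: Y = b0 + b1*v1

new = np.array([[1.0, 4.25]])                              # [intercept, family size = 4.25]
pred = fit.get_prediction(new).summary_frame(alpha=0.05)
print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])      # point estimate and 95% prediction interval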
Prediction using Several Variables • Just demonstrated how simple regression helped improve our prediction of credit card usage • By using data on family size, our predictions are much more accurate than using the simple arithmetic average • Can we improve our prediction even further by using additional data?
Impact of Multicollinearity
• Ability of additional independent variables to improve prediction depends on:
  • Correlation between the dependent and independent variables
  • Correlation among the independent variables
• Multicollinearity: association between independent variables
• Impact:
  • Reduces any single independent variable's predictive power
  • As collinearity increases, the unique variance explained by each variable decreases
Multiple Regression Equation

The regression equation is
Y = 0.48 + 0.632 v1 + 0.216 v2

Predictor     Coef   SE Coef      T      P
Constant     0.482     1.461   0.33  0.755
v1          0.6322    0.2523   2.51  0.054
v2          0.2158    0.1080   2.00  0.102

S = 0.7810   R-Sq = 86.1%   R-Sq(adj) = 80.6%
Multiple Regression Equation

The regression equation is
Y = 0.29 + 0.635 v1 + 0.200 v2 + 0.272 v3

Predictor     Coef   SE Coef      T      P
Constant     0.286     1.606   0.18  0.867
v1          0.6346    0.2710   2.34  0.079
v2          0.1995    0.1194   1.67  0.170
v3          0.2716    0.4702   0.58  0.594

S = 0.8389   R-Sq = 87.2%   R-Sq(adj) = 77.6%
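A sketch of how this kind of output could be reproduced with statsmodels, using hypothetical stand-in data for the eight families (v1 = family size, v2 = income, v3 = cars owned); the coefficients will not match the Minitab output above.

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({                                    # hypothetical stand-in data
    "v1": [2, 2, 4, 4, 5, 5, 6, 6],
    "v2": [14, 16, 14, 17, 18, 21, 17, 25],
    "v3": [1, 2, 2, 1, 3, 2, 1, 2],
    "Y":  [4, 6, 6, 7, 8, 7, 8, 10],
})

fit = sm.OLS(df["Y"], sm.add_constant(df[["v1", "v2", "v3"]])).fit()
print(fit.summary())                                   # coefficients, t-tests, R-Sq, R-Sq(adj)
print(fit.rsquared, fit.rsquared_adj)                  # adjusted R-Sq can drop when a weak predictor is added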
Summary • Regression analysis is a simple dependence technique that can provide both prediction and explanation
Decision Process for Multiple Regression • STAGE 1: Define objectives • STAGE 2: Research design • STAGE 3: Modeling Assumptions • STAGE 4: Estimate Model and Assess Fit • STAGE 5: Interpret Regression Model • STAGE 6: Validate Results
Define Objectives
• Objectives of multiple regression
  • Form the optimal predictor of the dependent measure
  • Provide an objective means of assessing the predictive power of a set of variables
  • Objectively assess the degree and direction of the relationship between the dependent and independent variables
  • Provide insight into the relationships among the independent variables in their predictive ability
• Appropriate when we are interested in a statistical (not functional) relationship
• Variable selection
  • Ultimate success depends on selecting meaningful variables
  • Measurement error: the degree to which the dependent variable is an accurate measure of the concept being studied
  • Specification error: inclusion of irrelevant variables or omission of relevant variables
Research Design
• Researcher must consider
  • Sample size
  • Nature of independent variables
  • Possible creation of new variables
• Incorporate dummy variables
  • Represent categories in the model
  • Requires k-1 variables to represent k categories
• Represent curvilinear effects with transformations or polynomials
Dummy Variables
• Variables used to represent categorical variables
• Two categories: 0 = male; 1 = female
• k categories (urban, suburban, rural)
  • Requires k-1 variables
  • Choose a base category (residence = urban)
Dummy Variables (cont.) • Interpretation: Regression coefficient represents the expected difference in the dependent variable between the category and base category…holding all other variables constant • Example: A regression model relates a person’s percentage salary increase to seniority in years, gender, and location (urban, suburban, or rural)
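A minimal sketch of dummy-variable coding in Python with hypothetical records; the column names and values are invented for illustration, and the dropped level of each factor serves as the base category.

import pandas as pd

df = pd.DataFrame({                                    # hypothetical records
    "seniority": [3, 7, 2, 10],
    "gender":    ["male", "female", "female", "male"],
    "residence": ["urban", "suburban", "rural", "urban"],
})

# Fix the category order so the first level (male, urban) is the base category.
df["gender"]    = pd.Categorical(df["gender"], categories=["male", "female"])
df["residence"] = pd.Categorical(df["residence"], categories=["urban", "suburban", "rural"])

dummies = pd.get_dummies(df, columns=["gender", "residence"], drop_first=True)
print(dummies)    # k categories become k-1 dummy columns; coefficients measure differences from the base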
Transformations to Linearity
• If the relationship between an independent and dependent variable is not linear, we can straighten it out
• Transformations typically done by trial-and-error
  • Square root
  • Logarithm
  • Inverse
  • Polynomial terms
• Key features to look for
  • Is the relation nonlinear?
  • Is there a pattern of increasing variability along the vertical axis?
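A small sketch of the trial-and-error transformations listed above; the predictor values are hypothetical.

import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])    # hypothetical skewed predictor

x_sqrt = np.sqrt(x)     # square-root transformation
x_log  = np.log(x)      # logarithmic transformation
x_inv  = 1.0 / x        # inverse transformation
# Re-plot the dependent variable against each transformed version and keep
# whichever scatterplot looks most nearly linear with roughly constant spread.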
Adding Polynomial Terms • Polynomials are power transformations of the original variables • Any number of nonlinear components may be added depending on the relationship • Each new polynomial term is entered into the regression equation and has its significance assessed
Adding Polynomial Terms (cont.)

The regression equation is
Y = -589 + 107 X

Predictor      Coef   SE Coef       T      P
Constant    -588.67     42.32  -13.91  0.000
X           106.995     2.700   39.63  0.000

S = 166.9   R-Sq = 94.1%
Adding Polynomial Terms (cont.)

The regression equation is
Y = -28.0 + 11.9 X + 3.30 X^2

Predictor      Coef   SE Coef       T      P
Constant     -28.02     62.32   -0.45  0.654
X            11.853     9.502    1.25  0.215
X^2          3.2961    0.3226   10.22  0.000

S = 116.4   R-Sq = 97.2%   R-Sq(adj) = 97.1%
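A sketch of the same linear-versus-quadratic comparison in statsmodels; the data are simulated for illustration, so the coefficients will differ from the output above.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(1, 30, 60)
y = 3.0 * x**2 + 10.0 * x + rng.normal(0, 100, size=x.size)    # hypothetical curved data

fit_lin  = sm.OLS(y, sm.add_constant(x)).fit()                           # Y = b0 + b1*X
fit_quad = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()  # adds the X^2 term

print(fit_lin.rsquared, fit_quad.rsquared)    # the quadratic term should raise R-Sq
print(fit_quad.pvalues[-1])                   # significance test for the X^2 coefficient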
Transform or Polynomial?
• Data transformations are useful only for simple curvilinear relationships
  • Do not provide a statistical means for assessing appropriateness
  • Only accommodate univariate terms
• Polynomials are restrictive with small sample sizes
  • Also introduce some multicollinearity
• Common practice:
  • Start with the linear component
  • Sequentially add higher-order polynomial terms until the added term is non-significant
Interaction Effects
• Occur when the effect of one independent variable on the dependent variable changes across values of another independent variable
• Example: might expect the effect of family size to depend on family income
  • The change in credit card usage associated with family size might be smaller for families with low incomes and larger for families with higher incomes
• Without the interaction term, we assume that family size has a constant effect on the number of cards used
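A minimal sketch of adding an interaction term with the statsmodels formula interface; the data are hypothetical stand-ins for the credit-card example.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({                      # hypothetical: Y = cards, v1 = family size, v2 = income
    "Y":  [4, 6, 6, 7, 8, 7, 8, 10],
    "v1": [2, 2, 4, 4, 5, 5, 6, 6],
    "v2": [14, 16, 14, 17, 18, 21, 17, 25],
})

# "v1 * v2" expands to v1 + v2 + v1:v2; the v1:v2 term lets the effect of
# family size on card usage change with income level.
fit = smf.ols("Y ~ v1 * v2", data=df).fit()
print(fit.summary())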
Modeling Assumptions
• Linearity of the phenomenon measured
  • Review scatterplots of dependent versus independent variables
• Constant variance of the error terms
  • Review residuals versus fitted values plot
  • Review residuals versus independent variable plot
• Independence of the error terms
  • Review residuals versus order plot
• Normality of the error term distribution
  • Review histogram of residuals
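A sketch of the diagnostic plots listed above, built from simulated data that happens to satisfy the assumptions; variable names are illustrative.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 80)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 80)           # simulated data meeting the assumptions

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid, fitted = fit.resid, fit.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(fitted, resid); axes[0].set_title("Residuals vs fitted")       # constant variance
axes[1].plot(resid, marker="o"); axes[1].set_title("Residuals vs order")       # independence
axes[2].hist(resid); axes[2].set_title("Histogram of residuals")               # normality
plt.tight_layout(); plt.show()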
Estimate Model and Assess Fit
• Model selection
  • Confirmatory specification
  • Sequential search methods
    • Stepwise
    • Forward
    • Backward
• Test whether the fitted model meets the regression assumptions
• Identify influential observations
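A minimal sketch of one sequential search idea, forward selection scored by adjusted R-Sq (Minitab's stepwise procedure uses F-tests, so this is a simplified stand-in); the data are the hypothetical eight-family values used earlier.

import pandas as pd
import statsmodels.api as sm

def forward_select(df, response, candidates):
    # Greedily add the predictor that most improves adjusted R-Sq; stop when nothing helps.
    chosen, best_adj = [], -float("inf")
    while candidates:
        scores = [(sm.OLS(df[response], sm.add_constant(df[chosen + [c]])).fit().rsquared_adj, c)
                  for c in candidates]
        adj, c = max(scores)
        if adj <= best_adj:
            break
        best_adj, chosen = adj, chosen + [c]
        candidates = [v for v in candidates if v != c]
    return chosen, best_adj

df = pd.DataFrame({                                    # hypothetical stand-in data
    "v1": [2, 2, 4, 4, 5, 5, 6, 6],
    "v2": [14, 16, 14, 17, 18, 21, 17, 25],
    "v3": [1, 2, 2, 1, 3, 2, 1, 2],
    "Y":  [4, 6, 6, 7, 8, 7, 8, 10],
})
print(forward_select(df, "Y", ["v1", "v2", "v3"]))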
Interpret Regression Model • Evaluate regression model that was estimated • Assess and interpret regression model coefficients • Evaluate potential variables that were omitted during model selection • Multicollinearity can have an effect
Assessing Multicollinearity • Key issue in interpreting regression analysis is the correlation among the independent variables • In most situations, multicollinearity will exist • Researcher needs to • Assess degree • Determine its impact • Determine an appropriate remedy
Effects of Multicollinearity
• Effects on explanation
  • Limits the size of R-Sq
  • Makes it difficult to add unique explanatory prediction from additional variables
  • Makes determining the contribution of each variable difficult
    • Effects of variables are "mixed"
• Effects on estimation
  • Can prevent coefficient estimation (in extreme cases)
  • Coefficients incorrectly estimated
  • Coefficients having wrong signs
Identifying Multicollinearity
• Variance Inflation Factor (VIF)
  • Tells us the degree to which each independent variable is explained by the other independent variables
  • VIF = 1 indicates no relationship with the other predictors; values above 1 indicate increasing multicollinearity
  • VIF > 5 suggests the coefficients may be poorly estimated because of multicollinearity
• The largest VIF among all predictors is often used as an indicator of severe multicollinearity
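A minimal sketch of computing VIFs with statsmodels, again using the hypothetical stand-in predictors.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({                                     # hypothetical v1 = family size, v2 = income, v3 = cars
    "v1": [2, 2, 4, 4, 5, 5, 6, 6],
    "v2": [14, 16, 14, 17, 18, 21, 17, 25],
    "v3": [1, 2, 2, 1, 3, 2, 1, 2],
})
X = sm.add_constant(X)

for i, name in enumerate(X.columns[1:], start=1):      # skip the constant in column 0
    print(name, round(variance_inflation_factor(X.values, i), 2))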
Remedies for Multicollinearity
• Omit one or more highly correlated variables
  • Could create specification error
• Use the model for prediction only
• Use the correlations to understand the variable relationships
• Use a more sophisticated method of analysis
  • Beyond the scope of this class
Validate Results
• Best guide to model "quality" is testing it on data that were not used to build the model
  • Test on additional or "held-out" data
• Confirm that all outliers and influential points have been addressed
• Confirm that all assumptions are met
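A rough sketch of validation on held-out data using simulated values; the split fraction and data are illustrative assumptions, not from the course.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
X = rng.normal(size=(n, 3))                                    # simulated predictors
y = 1.0 + X @ np.array([0.8, 0.5, 0.0]) + rng.normal(0, 1, n)

train = np.arange(n) < 70                                      # simple 70/30 hold-out split
test  = ~train

fit = sm.OLS(y[train], sm.add_constant(X[train])).fit()
pred = fit.predict(sm.add_constant(X[test]))

holdout_r2 = 1 - np.sum((y[test] - pred) ** 2) / np.sum((y[test] - y[test].mean()) ** 2)
print(f"training R-Sq = {fit.rsquared:.3f}, hold-out R-Sq = {holdout_r2:.3f}")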
Illustration of a Regression Analysis • Marketing managers of a company are having difficulty in evaluating the field sales representatives’ performance • Reps travel among the outlets and create displays trying to increase sales volume • Job involves lots of travel time
Illustration Data
• Data are collected on 51 reps
• DATA:
  • District: district number
  • Profit: rep's net profit margin
  • Area: thousands of square miles
  • POPN: millions of people in district
  • OUTLETS: number of outlets in district
  • COMMIS: 1 = full commission; 0 = partially salaried
Next Time… • Conclude regression analysis • Introduction to time series regression • Review for final exam