
Statistics & Data Analysis



  1. Statistics & Data Analysis Course Number B01.1305 Course Section 60 Meeting Time Monday 6-9:30 pm Multiple Regression

  2. Class Outline • Overview of simple and multiple regression • Details of multiple regression • Case study example

  3. Multiple Regression Chapters 12-13

  4. Chapter Goals • Determine when regression analysis is appropriate • Understand how regression helps make predictions • Understand the assumptions underlying regression • Interpret results of a regression analysis from a statistical and managerial viewpoint • Understand the steps of model selection, estimation, validation and implementation

  5. What is Multiple Regression Analysis? • A statistical technique used to analyze the relationship between a single dependent variable and several independent variables • Objective is to use the independent variables to predict the single dependent variable • Each independent variable is weighted by the analysis procedure to ensure maximal prediction • Weights denote the relative contribution of each independent variable to the overall prediction • Weights also facilitate interpretation of each variable's influence
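
  A minimal sketch of fitting a multiple regression in Python with statsmodels, using hypothetical values in place of the credit-card data introduced next:

      import numpy as np
      import statsmodels.api as sm

      # Hypothetical data: y = credit cards used; X columns = family size, income
      X = np.array([[2, 14], [2, 16], [4, 14], [4, 17],
                    [5, 18], [5, 21], [6, 17], [6, 25]], dtype=float)
      y = np.array([4, 4, 7, 6, 8, 7, 8, 10], dtype=float)

      # Add an intercept column and fit by ordinary least squares
      model = sm.OLS(y, sm.add_constant(X)).fit()
      print(model.params)    # fitted weights: intercept, then one per predictor
      print(model.rsquared)  # proportion of variance explained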

  6. Motivating Example • Credit company interested in determining which factors affected the number of credit cards used • Three potential factors were identified • Family size • Family income • Number of cars owned • Data were collected for each of 8 families

  7. Motivating Example (cont)

  8. Setting a Baseline • Let’s first calculate a baseline against which to compare the predictive ability of our regression models • The baseline should represent our best prediction without the use of any independent variables • For comparison with regression models, the average of the dependent variable gives the most accurate baseline prediction
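
  A quick sketch of the baseline idea, again with hypothetical card counts: with no predictors, the sample mean minimizes the sum of squared prediction errors.

      import numpy as np

      y = np.array([4, 4, 7, 6, 8, 7, 8, 10], dtype=float)  # hypothetical card counts

      baseline = y.mean()                # best single-number prediction
      sse = np.sum((y - baseline) ** 2)  # total squared error of the baseline
      print(baseline, sse)               # a useful regression model must beat this SSE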

  9. Baseline Prediction (cont)

  10. Simple Regression • We are interested in improving our predictions • Let’s determine whether knowledge of one of these independent variables helps our predictions • Simple regression is a procedure for predicting the dependent variable from a single independent variable by minimizing the sum of squared prediction errors

  11. Correlation Coefficient • Correlation coefficient (r) describes the linear relationship between two variables • Two variables are said to be correlated if changes in one variable are associated with changes in the other variable • What are the properties of the correlation coefficient?
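
  A sketch of computing r (and the p-value reported in the correlation matrix that follows) with scipy; the vectors are hypothetical.

      import numpy as np
      from scipy.stats import pearsonr

      x = np.array([2, 2, 4, 4, 5, 5, 6, 6], dtype=float)   # hypothetical family sizes
      y = np.array([4, 4, 7, 6, 8, 7, 8, 10], dtype=float)  # hypothetical card counts

      r, p = pearsonr(x, y)  # r always lies between -1 and +1 and is unit-free
      print(r, p)            # p tests the null hypothesis of zero true correlation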

  12. Matrix Plot

  13. Correlation Matrix

                Y        v1       v2
      v1     0.866
            (0.005)
      v2     0.829    0.673
            (0.011)  (0.068)
      v3     0.342    0.192    0.301
            (0.407)  (0.649)  (0.469)

      Cell contents: Pearson correlation (p-value)

  14. Simple Regression Results

      The regression equation is
      Y = 2.87 + 0.971 v1

      Predictor   Coef     SE Coef   T      P
      Constant    2.871    1.029     2.79   0.032
      v1          0.9714   0.2286    4.25   0.005

      S = 0.9562   R-Sq = 75.1%   R-Sq(adj) = 70.9%

  15. Motivating Example (cont)

  16. Confidence Interval for Prediction • Because we did not achieve perfect prediction, we also need to estimate the range of predicted values we might expect • Point estimate is our best estimate of the dependent variable • From this point estimate, we can also calculate the range of predicted values based on a measure of the prediction errors we expect to make • For example: • The predicted number of credit cards for the average family size of 4.25 is 7.00 • The expected range (95% prediction interval) is (4.518, 9.482)
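
  One way to obtain this kind of interval, sketched with hypothetical data, is statsmodels' get_prediction; its summary frame reports both the point estimate and the wider 95% prediction interval for an individual observation.

      import numpy as np
      import statsmodels.api as sm

      x = np.array([2, 2, 4, 4, 5, 5, 6, 6], dtype=float)   # hypothetical family sizes
      y = np.array([4, 4, 7, 6, 8, 7, 8, 10], dtype=float)  # hypothetical card counts

      fit = sm.OLS(y, sm.add_constant(x)).fit()

      # Predict at the average family size 4.25; alpha=0.05 gives 95% intervals
      pred = fit.get_prediction(np.array([[1.0, 4.25]]))  # [intercept, family size]
      print(pred.summary_frame(alpha=0.05)[['mean', 'obs_ci_lower', 'obs_ci_upper']])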

  17. Prediction using Several Variables • Just demonstrated how simple regression helped improve our prediction of credit card usage • By using data on family size, our predictions are much more accurate than using the simple arithmetic average • Can we improve our prediction even further by using additional data?

  18. Impact of Multicollinearity • Ability of additional independent variables to improve prediction depends on: • Correlation between the dependent and independent variables • Correlation among the independent variables • Multicollinearity: association among the independent variables • Impact: • Reduces any single independent variable’s predictive power • As collinearity increases, the unique variance explained by each variable decreases

  19. Multiple Regression Equation

      The regression equation is
      Y = 0.48 + 0.632 v1 + 0.216 v2

      Predictor   Coef     SE Coef   T      P
      Constant    0.482    1.461     0.33   0.755
      v1          0.6322   0.2523    2.51   0.054
      v2          0.2158   0.1080    2.00   0.102

      S = 0.7810   R-Sq = 86.1%   R-Sq(adj) = 80.6%

  20. Multiple Regression Equation

      The regression equation is
      Y = 0.29 + 0.635 v1 + 0.200 v2 + 0.272 v3

      Predictor   Coef     SE Coef   T      P
      Constant    0.286    1.606     0.18   0.867
      v1          0.6346   0.2710    2.34   0.079
      v2          0.1995   0.1194    1.67   0.170
      v3          0.2716   0.4702    0.58   0.594

      S = 0.8389   R-Sq = 87.2%   R-Sq(adj) = 77.6%

  21. Summary • Regression analysis is a simple dependence technique that can provide both prediction and explanation

  22. Decision Process for Multiple Regression • STAGE 1: Define objectives • STAGE 2: Research design • STAGE 3: Modeling Assumptions • STAGE 4: Estimate Model and Assess Fit • STAGE 5: Interpret Regression Model • STAGE 6: Validate Results

  23. Define Objectives • Objectives of multiple regression • Form the optimal predictor of the dependent measure • Provide an objective means of assessing the predictive power of a set of variables • Objectively assess the degree and direction of the relationship between the dependent and independent variables • Provide insight into the relationships among the independent variables in their predictive ability • Appropriate when we are interested in a statistical (not functional) relationship • Variable selection • Ultimate success depends on selecting meaningful variables • Measurement error: degree to which the dependent variable is an accurate measure of the concept being studied • Specification error: inclusion of irrelevant variables or omission of relevant variables

  24. Research Design • Researcher must consider • Sample size • Nature of independent variables • Possible creation of new variables • Incorporate dummy variables • Represent categories in the model • Requires k-1 variables to represent k categories • Represent curvilinear effects with transformations or polynomials

  25. Dummy Variables • Variables used to represent categorical variables • Two categories: 0=male; 1=female • k categories (urban, suburban, rural) • Requires k-1 variables • Choose a base category (residence=urban)
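
  A sketch of k-1 dummy coding with pandas, assuming a hypothetical residence column; dropping the urban column makes urban the base category.

      import pandas as pd

      df = pd.DataFrame({'residence': ['urban', 'suburban', 'rural', 'urban']})  # hypothetical

      # Three categories -> two dummy columns; coefficients on these columns
      # are then differences from the urban base category
      dummies = pd.get_dummies(df['residence']).drop(columns='urban')
      print(dummies)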

  26. Dummy Variables (cont.) • Interpretation: Regression coefficient represents the expected difference in the dependent variable between the category and base category…holding all other variables constant • Example: A regression model relates a person’s percentage salary increase to seniority in years, gender, and location (urban, suburban, or rural)

  27. Dummy Variable (cont.)

  28. Transformations to Linearity • If the relationship between an independent and dependent variable is not linear, we can straighten it out • Transformations typically done by trial-and-error • Square Root • Logarithm • Inverse • Polynomial terms • Key features to look for • Is the relation nonlinear? • Is there a pattern of increasing variability along the vertical axis?
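
  A sketch of the trial-and-error approach in Python with hypothetical curved data: apply a candidate transformation and refit, keeping whichever version fits better.

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(0)
      x = np.linspace(1, 50, 40)
      y = 3.0 * np.log(x) + rng.normal(0, 0.2, 40)  # hypothetical curved relationship

      # Compare a straight-line fit against a log-transformed fit
      linear = sm.OLS(y, sm.add_constant(x)).fit()
      logged = sm.OLS(y, sm.add_constant(np.log(x))).fit()
      print(linear.rsquared, logged.rsquared)  # the better transform straightens the relation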

  29. Transformations to Linearity (cont.)

  30. Transformations to Linearity (cont.)

  31. Adding Polynomial Terms • Polynomials are power transformations of the original variables • Any number of nonlinear components may be added depending on the relationship • Each new polynomial term is entered into the regression equation and has its significance assessed

  32. Adding Polynomial Terms (cont.)

  33. Adding Polynomial Terms (cont.)

      The regression equation is
      Y = -589 + 107 X

      Predictor   Coef      SE Coef   T        P
      Constant    -588.67   42.32     -13.91   0.000
      X           106.995   2.700     39.63    0.000

      S = 166.9   R-Sq = 94.1%

  34. Adding Polynomial Terms (cont.)

      The regression equation is
      Y = -28.0 + 11.9 X + 3.30 X^2

      Predictor   Coef     SE Coef   T       P
      Constant    -28.02   62.32     -0.45   0.654
      X           11.853   9.502     1.25    0.215
      X^2         3.2961   0.3226    10.22   0.000

      S = 116.4   R-Sq = 97.2%   R-Sq(adj) = 97.1%
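
  A sketch of how output like the above can be produced, with hypothetical data: build a design matrix containing both X and X^2 so the quadratic term gets its own coefficient and t-test.

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(1)
      x = np.linspace(0, 30, 60)
      y = -30 + 12 * x + 3.3 * x**2 + rng.normal(0, 100, 60)  # hypothetical curved response

      X = sm.add_constant(np.column_stack([x, x**2]))  # intercept, X, X^2
      fit = sm.OLS(y, X).fit()
      print(fit.summary())  # the t-test on the X^2 row assesses the quadratic term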

  35. Transform or Polynomial? • Data transformations are useful only in simple curvilinear relationships • Do not provide a statistical means for assessing appropriateness • Only accommodate univariate terms • Polynomials are restrictive with small sample sizes • Also introduce some multicollinearity • Common practice: • Start with the linear component • Sequentially add higher-order polynomial terms until the newest term is non-significant

  36. Interaction Effects • Occur when the effect of one independent variable changes across values of another independent variable • Example: We might expect such a relationship between family income and family size • The effect of family size on credit card usage might be smaller for families with low incomes and larger for families with higher incomes • Without the interaction term, we are assuming that family size has a constant effect on the number of cards used
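
  A sketch with statsmodels' formula interface, assuming hypothetical columns named cards, size, and income; the size:income product term lets the family-size effect vary with income.

      import pandas as pd
      import statsmodels.formula.api as smf

      df = pd.DataFrame({            # hypothetical stand-in for the credit-card data
          'cards':  [4, 4, 7, 6, 8, 7, 8, 10],
          'size':   [2, 2, 4, 4, 5, 5, 6, 6],
          'income': [14, 16, 14, 17, 18, 21, 17, 25],
      })

      # 'size * income' expands to size + income + size:income (the interaction)
      fit = smf.ols('cards ~ size * income', data=df).fit()
      print(fit.params)  # the size:income coefficient measures the interaction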

  37. Modeling Assumptions • Linearity of the phenomenon measured • Review scatterplots of dependent versus independent variables • Constant variance of the error terms • Review residuals versus fitted values plot • Review residuals versus independent variable plot • Independence of the error terms • Review residuals versus order plot • Normality of the error term distribution • Review histogram of residuals
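
  A sketch of the standard diagnostic plots with matplotlib, using a hypothetical fitted model; each panel checks one of the assumptions above.

      import numpy as np
      import statsmodels.api as sm
      import matplotlib.pyplot as plt

      rng = np.random.default_rng(2)
      x = np.linspace(0, 10, 50)
      y = 2 + 3 * x + rng.normal(0, 1, 50)          # hypothetical well-behaved data
      fit = sm.OLS(y, sm.add_constant(x)).fit()

      fig, axes = plt.subplots(1, 3, figsize=(12, 3))
      axes[0].scatter(fit.fittedvalues, fit.resid)  # constant variance: look for a flat band
      axes[0].set(xlabel='Fitted value', ylabel='Residual')
      axes[1].plot(fit.resid, marker='o')           # independence: no pattern over order
      axes[1].set(xlabel='Observation order', ylabel='Residual')
      axes[2].hist(fit.resid)                       # normality: roughly bell-shaped
      axes[2].set(xlabel='Residual')
      plt.show()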

  38. Estimate Model and Assess Fit • Model Selection • Confirmatory specification • Sequential Search methods • Stepwise • Forward • Backward • Testing regression analysis for meeting the assumptions • Identifying influential observations
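
  A minimal sketch of the forward variant of sequential search, on hypothetical data: greedily add whichever remaining predictor most improves adjusted R-squared, and stop when nothing improves it.

      import numpy as np
      import pandas as pd
      import statsmodels.api as sm

      rng = np.random.default_rng(3)
      df = pd.DataFrame(rng.normal(size=(50, 4)), columns=['v1', 'v2', 'v3', 'v4'])
      y = 2 * df['v1'] + df['v2'] + rng.normal(0, 1, 50)  # hypothetical response

      def adj_r2(cols):
          return sm.OLS(y, sm.add_constant(df[cols])).fit().rsquared_adj

      selected, remaining, best = [], list(df.columns), -np.inf
      while remaining:
          scores = {c: adj_r2(selected + [c]) for c in remaining}
          col, score = max(scores.items(), key=lambda kv: kv[1])
          if score <= best:  # stop when no candidate improves adjusted R^2
              break
          selected.append(col)
          remaining.remove(col)
          best = score
      print(selected)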

  39. Interpret Regression Model • Evaluate regression model that was estimated • Assess and interpret regression model coefficients • Evaluate potential variables that were omitted during model selection • Multicollinearity can have an effect

  40. Assessing Multicollinearity • Key issue in interpreting regression analysis is the correlation among the independent variables • In most situations, multicollinearity will exist • Researcher needs to • Assess degree • Determine its impact • Determine an appropriate remedy

  41. Effects of Multicollinearity • Effects on explanation • Limits the size of R2 • Makes it difficult to add unique explanatory power with additional variables • Makes determining the contribution of each variable difficult • Effects of variables are “mixed” • Effects on estimation • Can prevent coefficient estimation (in extreme cases) • Coefficients may be estimated incorrectly • Coefficients may have the wrong signs

  42. Identifying Multicollinearity • Variance Inflation Factor (VIF) • Tells us the degree to which each independent variable is explained by the other independent variables • VIF = 1 indicates a predictor is uncorrelated with the others; VIF > 1 otherwise • VIF > 5 is a common rule of thumb for multicollinearity severe enough to distort coefficient estimates • The largest VIF among all predictors is often used as an indicator of severe multicollinearity
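
  A sketch of computing each predictor's VIF with statsmodels' variance_inflation_factor, on a hypothetical DataFrame with two deliberately collinear columns.

      import numpy as np
      import pandas as pd
      import statsmodels.api as sm
      from statsmodels.stats.outliers_influence import variance_inflation_factor

      rng = np.random.default_rng(4)
      v1 = rng.normal(size=100)
      df = pd.DataFrame({'v1': v1,
                         'v2': v1 + rng.normal(0, 0.3, 100),  # collinear with v1 by design
                         'v3': rng.normal(size=100)})

      X = sm.add_constant(df)  # VIF uses the full design matrix, intercept included
      for i, name in enumerate(X.columns[1:], start=1):
          print(name, variance_inflation_factor(X.values, i))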

  43. Remedies for Multicollinearity • Omit one or more highly correlated variables • Could create specification error • Use the model for prediction only • Use the correlations to understand the variable relationships • Use a more sophisticated method of analysis • Beyond the scope of this class

  44. Validate Results • Best guide to model “quality” is testing it on data that were not used to fit the model • Test on additional or “held-out” data • Address all outliers and influential points • Ensure all assumptions are met
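
  A sketch of holdout validation on hypothetical data: fit on a random training portion and judge the model by its error on the held-out rows.

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(5)
      X = rng.normal(size=(60, 2))
      y = 1 + X @ np.array([2.0, -1.0]) + rng.normal(0, 1, 60)  # hypothetical data

      idx = rng.permutation(60)
      train, test = idx[:45], idx[45:]  # roughly a 75/25 split

      fit = sm.OLS(y[train], sm.add_constant(X[train])).fit()
      pred = fit.predict(sm.add_constant(X[test]))
      rmse = np.sqrt(np.mean((y[test] - pred) ** 2))
      print(rmse)  # out-of-sample error is the honest measure of model quality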

  45. Illustration of a Regression Analysis • Marketing managers of a company are having difficulty evaluating the field sales representatives’ performance • Reps travel among the outlets and create displays trying to increase sales volume • Job involves lots of travel time

  46. Illustration Data • Data are collected on 51 reps • Variables: • DISTRICT: district number • PROFIT: rep’s net profit margin • AREA: district area (thousands of square miles) • POPN: district population (millions of people) • OUTLETS: number of outlets in district • COMMIS: 1 = full commission; 0 = partially salaried

  47. Next Time… • Conclude regression analysis • Introduction to time series regression • Review for final exam
