
Review

Learn about the multiple linear regression model and its importance in fitting models to data. Discover the use of dummy variables and hypothesis testing in multiple regression analysis.


Presentation Transcript


  1. Review

  2. Fitting Equations to Data

  3. The Multiple Linear Regression Model: an important statistical model

  4. In Multiple Linear Regression we assume the following model: Y = β0 + β1X1 + β2X2 + ... + βpXp + ε. This model is called the Multiple Linear Regression Model, where β0, β1, β2, ..., βp are unknown parameters and ε is a random disturbance assumed to have a normal distribution with mean 0 and standard deviation σ.

  5. The importance of the Linear model 1. It is the simplest form of a model in which each independent variable has some effect on the dependent variable Y. When fitting models to data one tries to find the simplest form of a model that still adequately describes the relationship between the dependent variable and the independent variables. The linear model is often the first model to be fitted, and it is abandoned only if it turns out to be inadequate.

  6. 2. In many instances a linear model is the most appropriate model to describe the dependence relationship between the dependent variable and the independent variables. This will be true if the dependent variable increases at a constant rate as any of the independent variables is increased while holding the other independent variables constant.

  7. 3. Many non-linear models can be put into the form of a linear model by appropriately transforming the dependent variable and/or any or all of the independent variables (i.e., many non-linear models are linearizable). This important fact ensures the wide utility of the linear model.
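As a worked illustration (the exponential model here is my example, not one taken from the slides), taking logarithms linearizes an exponential model with multiplicative error:

```latex
Y = \alpha e^{\beta X} \varepsilon
\;\Longrightarrow\;
\ln Y = \ln \alpha + \beta X + \ln \varepsilon
```

which has the linear form Y′ = β0 + β1X + ε′ with Y′ = ln Y, β0 = ln α, β1 = β, and ε′ = ln ε.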

  8. Summary of the Statistics used in Multiple Regression

  9. The Least Squares Estimates: the least squares estimates of β0, β1, ..., βp are the values that minimize the residual sum of squares Σ(yi − ŷi)², where ŷi denotes the predicted value of yi obtained from the fitted equation.
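As a concrete sketch (the data here are hypothetical), the least squares estimates can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical data: n = 6 observations on p = 2 independent variables
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Prepend a column of 1's to carry the intercept β0
X1 = np.column_stack([np.ones(len(y)), X])

# The least squares estimates minimize Σ(yi − ŷi)²
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ beta_hat                  # predicted values ŷi
ss_error = np.sum((y - y_hat) ** 2)    # residual sum of squares
print(beta_hat, ss_error)
```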

  10. The Analysis of Variance Table Entries: a) Adjusted Total Sum of Squares: SSTotal = Σ(yi − ȳ)² b) Residual Sum of Squares: SSError = Σ(yi − ŷi)² c) Regression Sum of Squares: SSReg = Σ(ŷi − ȳ)² Note: SSTotal = SSReg + SSError.

  11. The Analysis of Variance Table

  Source      Sum of Squares   d.f.     Mean Square                      F
  Regression  SSReg            p        SSReg/p = MSReg                  MSReg/s²
  Error       SSError          n-p-1    SSError/(n-p-1) = MSError = s²
  Total       SSTotal          n-1
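A minimal self-contained sketch of the table's entries and the overall F statistic, using simulated (hypothetical) data:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(0)
n, p = 40, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, 2.0, -1.0, 0.5])       # hypothetical β0, β1, β2, β3
y = beta[0] + X @ beta[1:] + rng.normal(scale=1.5, size=n)

X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta_hat

ss_total = np.sum((y - y.mean()) ** 2)       # SSTotal
ss_reg = np.sum((y_hat - y.mean()) ** 2)     # SSReg
ss_error = ss_total - ss_reg                 # SSError

s2 = ss_error / (n - p - 1)                  # MSError = s², estimates σ²
F = (ss_reg / p) / s2                        # MSReg / s²
print(F, f_dist.sf(F, p, n - p - 1))         # F and its p-value
```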

  12. Testing Hypotheses Related to Multiple Regression

  13. When testing hypotheses there are two models of interest. 1. The Complete Model: Y = β0 + β1X1 + β2X2 + β3X3 + ... + βpXp + ε. 2. The Reduced Model: the model implied by H0. We are interested in whether the complete model can be simplified to the reduced model.

  14. Some Comments • The complete model contains more parameters and will always provide a better fit to the data than the reduced model. • The Residual Sum of Squares (R.S.S.) for the complete model will always be smaller than the R.S.S. for the reduced model. • If the reduction in the R.S.S. is small as we change from the reduced model to the complete model, the reduced model should be accepted as providing an adequate fit. • If the reduction in the R.S.S. is large, the reduced model should be rejected as providing an adequate fit and the complete model should be kept. These principles form the basis for the following test.

  15. Testing the General Linear Hypothesis The F-test for H0 is performed by carrying out two runs of a multiple regression package.

  16. Run 1: Fit the complete model, resulting in the following ANOVA table:

  Source            df       Sum of Squares
  Regression        p        SSReg
  Residual (Error)  n-p-1    SSError
  Total             n-1      SSTotal

  17. Run 2: Fit the reduced model (q parameters eliminated), resulting in the following ANOVA table:

  Source            df         Sum of Squares
  Regression        p-q        SS1Reg
  Residual (Error)  n-p+q-1    SS1Error
  Total             n-1        SSTotal

  18. The Test: The test is carried out using the test statistic F = [SSH0/q] / s², where SSH0 = SS1Error − SSError = SSReg − SS1Reg and s² = SSError/(n-p-1). The test statistic F has an F-distribution with ν1 = q d.f. in the numerator and ν2 = n − p − 1 d.f. in the denominator if H0 is true.

  19. The ANOVA Table for the Test:

  Source                      df       Sum of Squares   Mean Square             F
  Regression (reduced model)  p-q      SS1Reg           SS1Reg/(p-q) = MS1Reg   MS1Reg/s²
  Departure from H0           q        SSH0             SSH0/q = MSH0           MSH0/s²
  Residual (Error)            n-p-1    SSError          s²
  Total                       n-1      SSTotal
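The two-run procedure above translates directly into code. A sketch with hypothetical data, dropping the last q variables under H0:

```python
import numpy as np
from scipy.stats import f as f_dist

def rss(Xmat, y):
    """Residual sum of squares of a least squares fit (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), Xmat])
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta_hat) ** 2)

rng = np.random.default_rng(1)
n, p, q = 50, 4, 2                    # H0 eliminates q of the p parameters
X = rng.normal(size=(n, p))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)  # last q variables irrelevant

ss_error = rss(X, y)                  # Run 1: complete model
ss1_error = rss(X[:, :p - q], y)      # Run 2: reduced model
ss_h0 = ss1_error - ss_error          # SSH0 = SS1Error − SSError

s2 = ss_error / (n - p - 1)
F = (ss_h0 / q) / s2                  # ~ F(q, n − p − 1) under H0
print(F, f_dist.sf(F, q, n - p - 1))
```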

  20. The Use of Dummy Variables

  21. In the examples so far the independent variables are continuous numerical variables. • Suppose that some of the independent variables are categorical. • Dummy variables are artificially defined variables designed to convert a model including categorical independent variables to the standard multiple regression model.

  22. Example: Comparison of Slopes of k Regression Lines with Common Intercept

  23. Situation: • k treatments or k populations are being compared. • For each of the k treatments we have measured both Y (the response variable) and X (an independent variable). • Y is assumed to be linearly related to X, with the slope dependent on treatment (population) while the intercept is the same for each treatment.

  24. The Model: for an observation from treatment i (i = 1, 2, ..., k), Y = β0 + βiX + ε, so the slope βi depends on the treatment while the intercept β0 is common to all treatments.

  25. This model can be artificially put into the form of the multiple regression model by the use of dummy variables, artificially defined variables that handle the categorical independent variable Treatments.

  26. In this case we define a new variable for each category of the categorical variable. That is, we define Xi for each category i = 1, 2, ..., k of treatments as follows: Xi = X if the observation comes from treatment i, and Xi = 0 otherwise.

  27. Then the model can be written as follows. The Complete Model: Y = β0 + β1X1 + β2X2 + ... + βkXk + ε, where Xi is the dummy variable defined above.

  28. In this case: Dependent Variable: Y; Independent Variables: X1, X2, ..., Xk.

  29. In the above situation we would likely be interested in testing the equality of the slopes, namely the Null Hypothesis H0: β1 = β2 = ... = βk (q = k − 1).

  30. The Reduced Model: Dependent Variable: Y; Independent Variable: X = X1 + X2 + ... + Xk (the original covariate).
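A sketch of building these dummy variables for hypothetical data with k = 2 treatments; the F test of the previous section can then be applied to the complete and reduced designs:

```python
import numpy as np

def slope_dummies(x, treatment, k):
    """Xi carries the x value for observations from treatment i and is 0
    otherwise, giving each treatment its own slope with a shared intercept."""
    X = np.zeros((len(x), k))
    for i in range(k):
        X[:, i] = np.where(treatment == i + 1, x, 0.0)
    return X

# Hypothetical data: covariate x and a treatment label 1..k
x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
treatment = np.array([1, 1, 1, 2, 2, 2])

X_complete = slope_dummies(x, treatment, k=2)  # complete model: X1, X2
X_reduced = X_complete.sum(axis=1)             # reduced model: X = X1 + X2
print(X_complete)
```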

  31. Example: Comparison of Intercepts of k Regression Lines with a Common Slope (One-way Analysis of Covariance)

  32. Situation: • k treatments or k populations are being compared. • For each of the k treatments we have measured both Y (the response variable) and X (an independent variable). • Y is assumed to be linearly related to X, with the intercept dependent on treatment (population) while the slope is the same for each treatment. • Y is called the response variable, while X is called the covariate.

  33. The Model: for an observation from treatment i (i = 1, 2, ..., k), Y = αi + βX + ε, so the intercept αi depends on the treatment while the slope β is common to all treatments.

  34. In this case we define a new variable for each category of the categorical variable. That is, we define Xi for categories i = 1, 2, ..., (k − 1) of treatments as follows: Xi = 1 if the observation comes from treatment i, and Xi = 0 otherwise.

  35. Then the model can be written as follows. The Complete Model: Y = β0 + β1X1 + β2X2 + ... + βk-1Xk-1 + βX + ε, where β0 = αk is the intercept for treatment k and βi = αi − αk is the departure of the intercept for treatment i from it.

  36. In this case: Dependent Variable: Y; Independent Variables: X1, X2, ..., Xk-1, X.

  37. In the above situation we would likely be interested in testing the equality of the intercepts, namely the Null Hypothesis H0: β1 = β2 = ... = βk-1 = 0, i.e. α1 = α2 = ... = αk (q = k − 1).

  38. The Reduced Model: Dependent Variable: Y; Independent Variable: X.

  39. The F Test: as in the general linear hypothesis, F = [SSH0/(k − 1)] / s², compared with the F-distribution with k − 1 and n − k − 1 d.f.

  40. The Analysis of Covariance • This analysis can also be performed by using a package that can perform Analysis of Covariance (ANACOVA) • The package sets up the dummy variables automatically
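For instance, a sketch using statsmodels (assuming it is available); `C(treat)` generates the intercept dummies automatically, and `anova_lm` carries out the F test comparing the reduced and complete models:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical data: k = 3 treatments with different intercepts, common slope
rng = np.random.default_rng(2)
treat = np.repeat([1, 2, 3], 10)
x = rng.normal(size=30)
alpha = np.array([1.0, 2.0, 3.0])                 # treatment intercepts
y = alpha[treat - 1] + 1.5 * x + rng.normal(scale=0.5, size=30)
df = pd.DataFrame({"y": y, "x": x, "treat": treat})

complete = smf.ols("y ~ C(treat) + x", data=df).fit()  # intercepts differ
reduced = smf.ols("y ~ x", data=df).fit()              # common intercept (H0)
print(anova_lm(reduced, complete))                     # F test for equal intercepts
```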

  41. Another application of the use of dummy variables: the dependent variable Y is linearly related to X, but the slope changes at one or several known values of X (nodes). [Figure: a piecewise linear relationship between Y and X, with the slope changing at the nodes.]

  42. [Figure: a piecewise linear curve with slopes β1, β2, ..., βk on the successive segments and nodes at x1, x2, ..., xk on the X axis.] The model: the slope between Y and X takes a new value after each node; equivalently, it can be written in multiple regression form using the variables defined below.

  43. Now define X1 = X, X2 = (X − x1)+, X3 = (X − x2)+, etc., where (X − xi)+ = X − xi if X > xi and 0 otherwise.

  44. Then the model can be written Y = β0 + β1X1 + β2X2 + ... + ε, which is again a multiple linear regression model; the coefficient of each (X − xi)+ term is the change in slope at node xi.
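A sketch of constructing these variables for hypothetical nodes; the resulting columns can be fitted by ordinary multiple regression:

```python
import numpy as np

def piecewise_design(x, nodes):
    """Columns X1 = x and (x − xi)+ for each node xi; the coefficient of
    each (x − xi)+ column is the change in slope at that node."""
    cols = [x] + [np.maximum(x - c, 0.0) for c in nodes]
    return np.column_stack(cols)

nodes = [2.0, 5.0]                 # hypothetical known nodes
x = np.linspace(0.0, 8.0, 9)
X = piecewise_design(x, nodes)
# Regressing y on these columns (plus an intercept) fits a continuous
# piecewise linear function whose slope changes only at the nodes.
print(X)
```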

  45. Multiple Regression: Selecting the Best Equation

  46. Techniques for Selecting the "Best" Regression Equation • The best regression equation is not necessarily the equation that explains most of the variance in Y (the highest R²): that equation will be the one with all the variables included. • The best equation should also be simple and interpretable (i.e. contain a small number of variables). • Simple (interpretable) and reliable are opposing criteria; the best equation is a compromise between the two.

  47. We will discuss several strategies for selecting the best equation: • All Possible Regressions: uses R², s², Mallows Cp, where Cp = RSSp/s²complete − [n − 2(p + 1)]. • "Best Subset" Regression: uses R², Ra² (adjusted R²), Mallows Cp. • Backward Elimination. • Stepwise Regression.

  48. I. All Possible Regressions • Suppose we have the p independent variables X1, X2, ..., Xp. • Then there are 2^p subsets of variables.

  49.
  Variables in Equation   Model
  no variables            Y = β0 + ε
  X1                      Y = β0 + β1X1 + ε
  X2                      Y = β0 + β2X2 + ε
  X3                      Y = β0 + β3X3 + ε
  X1, X2                  Y = β0 + β1X1 + β2X2 + ε
  X1, X3                  Y = β0 + β1X1 + β3X3 + ε
  X2, X3                  Y = β0 + β2X2 + β3X3 + ε
  X1, X2, X3              Y = β0 + β1X1 + β2X2 + β3X3 + ε

  50. Use of R² 1. Assume we carry out 2^p runs, one for each of the subsets. Divide the runs into the following sets: Set 0: no variables. Set 1: one independent variable. ... Set p: p independent variables. 2. Order the runs in each set according to R². 3. Examine the leaders in each set, looking for consistent patterns, taking into account correlation between independent variables.
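A sketch of this all-possible-regressions scan on hypothetical data, computing R² and Mallows Cp for every subset:

```python
import numpy as np
from itertools import combinations

def rss(Xmat, y):
    """Residual sum of squares; an empty subset fits the intercept alone."""
    X1 = (np.column_stack([np.ones(len(y)), Xmat])
          if Xmat.shape[1] else np.ones((len(y), 1)))
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta_hat) ** 2)

rng = np.random.default_rng(3)
n, p = 30, 3
X = rng.normal(size=(n, p))
y = 1 + 2 * X[:, 0] + rng.normal(size=n)       # only X1 actually matters

ss_total = np.sum((y - y.mean()) ** 2)
s2_complete = rss(X, y) / (n - p - 1)          # s² from the complete model

for size in range(p + 1):                      # Set 0, Set 1, ..., Set p
    for subset in combinations(range(p), size):
        rss_p = rss(X[:, list(subset)], y)
        r2 = 1 - rss_p / ss_total
        cp = rss_p / s2_complete - (n - 2 * (size + 1))   # Mallows Cp
        print(subset, round(r2, 3), round(cp, 2))
```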
