
Review

Learn about the multiple linear regression model and its importance in fitting models to data. Discover the use of dummy variables and hypothesis testing in multiple regression analysis.


Presentation Transcript


  1. Review

  2. Fitting Equations to Data

  3. The Multiple Linear Regression Model: an important statistical model

  4. In Multiple Linear Regression we assume the following model: Y = β0 + β1X1 + β2X2 + ... + βpXp + ε. This model is called the Multiple Linear Regression Model, where β0, β1, β2, ..., βp are unknown parameters and ε is a random disturbance assumed to have a normal distribution with mean 0 and standard deviation σ.

  5. The importance of the Linear model 1. It is the simplest form of a model in which each independent variable has some effect on the dependent variable Y. When fitting models to data one tries to find the simplest form of a model that still adequately describes the relationship between the dependent variable and the independent variables. The linear model is often the first model to be fitted, and it is abandoned only if it turns out to be inadequate.

  6. 2. In many instances a linear model is the most appropriate model to describe the dependence relationship between the dependent variable and the independent variables. This will be true if the dependent variable increases at a constant rate as any of the independent variables is increased while holding the other independent variables constant.

  7. 3. Many non-linear models can be put into the form of a linear model by appropriately transforming the dependent variable and/or any or all of the independent variables (i.e., many non-linear models are linearizable). This important fact ensures the wide utility of the linear model.
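As a worked illustration (the exponential model here is my example, not one taken from the slides), taking logarithms linearizes an exponential model with multiplicative error:

```latex
Y = \alpha e^{\beta X} \varepsilon
\;\Longrightarrow\;
\ln Y = \ln \alpha + \beta X + \ln \varepsilon
```

which has the linear form Y′ = β0 + β1X + ε′ with Y′ = ln Y, β0 = ln α, β1 = β, and ε′ = ln ε.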

  8. Summary of the Statistics used in Multiple Regression

  9. The Least Squares Estimates: the least squares estimates of β0, β1, ..., βp are the values that minimize the residual sum of squares Σ(yi − ŷi)², where ŷi denotes the predicted value of yi obtained from the fitted equation.
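As a concrete sketch (the data here are hypothetical), the least squares estimates can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical data: n = 6 observations on p = 2 independent variables
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Prepend a column of 1's to carry the intercept β0
X1 = np.column_stack([np.ones(len(y)), X])

# The least squares estimates minimize Σ(yi − ŷi)²
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ beta_hat                  # predicted values ŷi
ss_error = np.sum((y - y_hat) ** 2)    # residual sum of squares
print(beta_hat, ss_error)
```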

  10. The Analysis of Variance Table Entries: a) Adjusted Total Sum of Squares: SSTotal = Σ(yi − ȳ)² b) Residual Sum of Squares: SSError = Σ(yi − ŷi)² c) Regression Sum of Squares: SSReg = Σ(ŷi − ȳ)² Note: SSTotal = SSReg + SSError.

  11. The Analysis of Variance Table

  Source      Sum of Squares   d.f.     Mean Square                      F
  Regression  SSReg            p        SSReg/p = MSReg                  MSReg/s²
  Error       SSError          n-p-1    SSError/(n-p-1) = MSError = s²
  Total       SSTotal          n-1
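A minimal self-contained sketch of the table's entries and the overall F statistic, using simulated (hypothetical) data:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(0)
n, p = 40, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, 2.0, -1.0, 0.5])       # hypothetical β0, β1, β2, β3
y = beta[0] + X @ beta[1:] + rng.normal(scale=1.5, size=n)

X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta_hat

ss_total = np.sum((y - y.mean()) ** 2)       # SSTotal
ss_reg = np.sum((y_hat - y.mean()) ** 2)     # SSReg
ss_error = ss_total - ss_reg                 # SSError

s2 = ss_error / (n - p - 1)                  # MSError = s², estimates σ²
F = (ss_reg / p) / s2                        # MSReg / s²
print(F, f_dist.sf(F, p, n - p - 1))         # F and its p-value
```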

  12. Testing Hypotheses Related to Multiple Regression

  13. When testing hypotheses there are two models of interest. 1. The Complete Model: Y = β0 + β1X1 + β2X2 + β3X3 + ... + βpXp + ε. 2. The Reduced Model: the model implied by H0. We are interested in whether the complete model can be simplified to the reduced model.

  14. Some Comments • The complete model contains more parameters and will always provide a better fit to the data than the reduced model. • The Residual Sum of Squares (R.S.S.) for the complete model will always be smaller than the R.S.S. for the reduced model. • If the reduction in the R.S.S. is small as we change from the reduced model to the complete model, the reduced model should be accepted as providing an adequate fit. • If the reduction in the R.S.S. is large, the reduced model should be rejected as providing an adequate fit and the complete model should be kept. These principles form the basis for the following test.

  15. Testing the General Linear Hypothesis The F-test for H0 is performed by carrying out two runs of a multiple regression package.

  16. Run 1: Fit the complete model, resulting in the following ANOVA table:

  Source            df       Sum of Squares
  Regression        p        SSReg
  Residual (Error)  n-p-1    SSError
  Total             n-1      SSTotal

  17. Run 2: Fit the reduced model (q parameters eliminated), resulting in the following ANOVA table:

  Source            df         Sum of Squares
  Regression        p-q        SS1Reg
  Residual (Error)  n-p+q-1    SS1Error
  Total             n-1        SSTotal

  18. The Test: The test is carried out using the test statistic F = [SSH0/q] / s², where SSH0 = SS1Error − SSError = SSReg − SS1Reg and s² = SSError/(n-p-1). The test statistic F has an F-distribution with ν1 = q d.f. in the numerator and ν2 = n − p − 1 d.f. in the denominator if H0 is true.

  19. The ANOVA Table for the Test:

  Source                      df       Sum of Squares   Mean Square             F
  Regression (reduced model)  p-q      SS1Reg           SS1Reg/(p-q) = MS1Reg   MS1Reg/s²
  Departure from H0           q        SSH0             SSH0/q = MSH0           MSH0/s²
  Residual (Error)            n-p-1    SSError          s²
  Total                       n-1      SSTotal
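The two-run procedure above translates directly into code. A sketch with hypothetical data, dropping the last q variables under H0:

```python
import numpy as np
from scipy.stats import f as f_dist

def rss(Xmat, y):
    """Residual sum of squares of a least squares fit (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), Xmat])
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta_hat) ** 2)

rng = np.random.default_rng(1)
n, p, q = 50, 4, 2                    # H0 eliminates q of the p parameters
X = rng.normal(size=(n, p))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)  # last q variables irrelevant

ss_error = rss(X, y)                  # Run 1: complete model
ss1_error = rss(X[:, :p - q], y)      # Run 2: reduced model
ss_h0 = ss1_error - ss_error          # SSH0 = SS1Error − SSError

s2 = ss_error / (n - p - 1)
F = (ss_h0 / q) / s2                  # ~ F(q, n − p − 1) under H0
print(F, f_dist.sf(F, q, n - p - 1))
```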

  20. The Use of Dummy Variables

  21. In the examples so far the independent variables are continuous numerical variables. • Suppose that some of the independent variables are categorical. • Dummy variables are artificially defined variables designed to convert a model including categorical independent variables to the standard multiple regression model.

  22. Example: Comparison of Slopes of k Regression Lines with Common Intercept

  23. Situation: • k treatments or k populations are being compared. • For each of the k treatments we have measured both Y (the response variable) and X (an independent variable). • Y is assumed to be linearly related to X, with the slope dependent on treatment (population) while the intercept is the same for each treatment.

  24. The Model: for an observation from treatment i (i = 1, 2, ..., k), Y = β0 + βiX + ε, so the slope βi depends on the treatment while the intercept β0 is common to all treatments.

  25. This model can be artificially put into the form of the multiple regression model by the use of dummy variables, artificially defined variables that handle the categorical independent variable Treatments.

  26. In this case we define a new variable for each category of the categorical variable. That is, we define Xi for each category i = 1, 2, ..., k of treatments as follows: Xi = X if the observation comes from treatment i, and Xi = 0 otherwise.

  27. Then the model can be written as follows. The Complete Model: Y = β0 + β1X1 + β2X2 + ... + βkXk + ε, where Xi is the dummy variable defined above.

  28. In this case: Dependent Variable: Y; Independent Variables: X1, X2, ..., Xk.

  29. In the above situation we would likely be interested in testing the equality of the slopes, namely the Null Hypothesis H0: β1 = β2 = ... = βk (q = k − 1).

  30. The Reduced Model: Dependent Variable: Y; Independent Variable: X = X1 + X2 + ... + Xk (the original covariate).
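A sketch of building these dummy variables for hypothetical data with k = 2 treatments; the F test of the previous section can then be applied to the complete and reduced designs:

```python
import numpy as np

def slope_dummies(x, treatment, k):
    """Xi carries the x value for observations from treatment i and is 0
    otherwise, giving each treatment its own slope with a shared intercept."""
    X = np.zeros((len(x), k))
    for i in range(k):
        X[:, i] = np.where(treatment == i + 1, x, 0.0)
    return X

# Hypothetical data: covariate x and a treatment label 1..k
x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
treatment = np.array([1, 1, 1, 2, 2, 2])

X_complete = slope_dummies(x, treatment, k=2)  # complete model: X1, X2
X_reduced = X_complete.sum(axis=1)             # reduced model: X = X1 + X2
print(X_complete)
```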

  31. Example: Comparison of Intercepts of k Regression Lines with a Common Slope (One-way Analysis of Covariance)

  32. Situation: • k treatments or k populations are being compared. • For each of the k treatments we have measured both Y (the response variable) and X (an independent variable). • Y is assumed to be linearly related to X, with the intercept dependent on treatment (population) while the slope is the same for each treatment. • Y is called the response variable, while X is called the covariate.

  33. The Model: for an observation from treatment i (i = 1, 2, ..., k), Y = αi + βX + ε, so the intercept αi depends on the treatment while the slope β is common to all treatments.

  34. In this case we define a new variable for each category of the categorical variable. That is, we define Xi for categories i = 1, 2, ..., (k − 1) of treatments as follows: Xi = 1 if the observation comes from treatment i, and Xi = 0 otherwise.

  35. Then the model can be written as follows. The Complete Model: Y = β0 + β1X1 + β2X2 + ... + βk-1Xk-1 + βX + ε, where β0 = αk is the intercept for treatment k and βi = αi − αk is the departure of the intercept for treatment i from it.

  36. In this case: Dependent Variable: Y; Independent Variables: X1, X2, ..., Xk-1, X.

  37. In the above situation we would likely be interested in testing the equality of the intercepts, namely the Null Hypothesis H0: β1 = β2 = ... = βk-1 = 0, i.e. α1 = α2 = ... = αk (q = k − 1).

  38. The Reduced Model: Dependent Variable: Y; Independent Variable: X.

  39. The F Test: as in the general linear hypothesis, F = [SSH0/(k − 1)] / s², compared with the F-distribution with k − 1 and n − k − 1 d.f.

  40. The Analysis of Covariance • This analysis can also be performed by using a package that can perform Analysis of Covariance (ANACOVA) • The package sets up the dummy variables automatically
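For instance, a sketch using statsmodels (assuming it is available); `C(treat)` generates the intercept dummies automatically, and `anova_lm` carries out the F test comparing the reduced and complete models:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical data: k = 3 treatments with different intercepts, common slope
rng = np.random.default_rng(2)
treat = np.repeat([1, 2, 3], 10)
x = rng.normal(size=30)
alpha = np.array([1.0, 2.0, 3.0])                 # treatment intercepts
y = alpha[treat - 1] + 1.5 * x + rng.normal(scale=0.5, size=30)
df = pd.DataFrame({"y": y, "x": x, "treat": treat})

complete = smf.ols("y ~ C(treat) + x", data=df).fit()  # intercepts differ
reduced = smf.ols("y ~ x", data=df).fit()              # common intercept (H0)
print(anova_lm(reduced, complete))                     # F test for equal intercepts
```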

  41. Another application of the use of dummy variables: the dependent variable Y is linearly related to X, but the slope changes at one or several known values of X (nodes). [Figure: a piecewise linear relationship between Y and X, with the slope changing at the nodes.]

  42. [Figure: a piecewise linear curve with slopes β1, β2, ..., βk on the successive segments and nodes at x1, x2, ..., xk on the X axis.] The model: the slope between Y and X takes a new value after each node; equivalently, it can be written in multiple regression form using the variables defined below.

  43. Now define X1 = X, X2 = (X − x1)+, X3 = (X − x2)+, etc., where (X − xi)+ = X − xi if X > xi and 0 otherwise.

  44. Then the model can be written Y = β0 + β1X1 + β2X2 + ... + ε, which is again a multiple linear regression model; the coefficient of each (X − xi)+ term is the change in slope at node xi.
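A sketch of constructing these variables for hypothetical nodes; the resulting columns can be fitted by ordinary multiple regression:

```python
import numpy as np

def piecewise_design(x, nodes):
    """Columns X1 = x and (x − xi)+ for each node xi; the coefficient of
    each (x − xi)+ column is the change in slope at that node."""
    cols = [x] + [np.maximum(x - c, 0.0) for c in nodes]
    return np.column_stack(cols)

nodes = [2.0, 5.0]                 # hypothetical known nodes
x = np.linspace(0.0, 8.0, 9)
X = piecewise_design(x, nodes)
# Regressing y on these columns (plus an intercept) fits a continuous
# piecewise linear function whose slope changes only at the nodes.
print(X)
```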

  45. Multiple Regression: Selecting the Best Equation

  46. Techniques for Selecting the "Best" Regression Equation • The best regression equation is not necessarily the equation that explains most of the variance in Y (the highest R²): that equation will be the one with all the variables included. • The best equation should also be simple and interpretable (i.e. contain a small number of variables). • Simple (interpretable) and reliable are opposing criteria; the best equation is a compromise between the two.

  47. We will discuss several strategies for selecting the best equation: • All Possible Regressions: uses R², s², Mallows Cp, where Cp = RSSp/s²complete − [n − 2(p + 1)]. • "Best Subset" Regression: uses R², Ra² (adjusted R²), Mallows Cp. • Backward Elimination. • Stepwise Regression.

  48. I. All Possible Regressions • Suppose we have the p independent variables X1, X2, ..., Xp. • Then there are 2^p subsets of variables.

  49.
  Variables in Equation   Model
  no variables            Y = β0 + ε
  X1                      Y = β0 + β1X1 + ε
  X2                      Y = β0 + β2X2 + ε
  X3                      Y = β0 + β3X3 + ε
  X1, X2                  Y = β0 + β1X1 + β2X2 + ε
  X1, X3                  Y = β0 + β1X1 + β3X3 + ε
  X2, X3                  Y = β0 + β2X2 + β3X3 + ε
  X1, X2, X3              Y = β0 + β1X1 + β2X2 + β3X3 + ε

  50. Use of R² 1. Assume we carry out 2^p runs, one for each of the subsets. Divide the runs into the following sets: Set 0: no variables. Set 1: one independent variable. ... Set p: p independent variables. 2. Order the runs in each set according to R². 3. Examine the leaders in each set, looking for consistent patterns, taking into account correlation between independent variables.
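A sketch of this all-possible-regressions scan on hypothetical data, computing R² and Mallows Cp for every subset:

```python
import numpy as np
from itertools import combinations

def rss(Xmat, y):
    """Residual sum of squares; an empty subset fits the intercept alone."""
    X1 = (np.column_stack([np.ones(len(y)), Xmat])
          if Xmat.shape[1] else np.ones((len(y), 1)))
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta_hat) ** 2)

rng = np.random.default_rng(3)
n, p = 30, 3
X = rng.normal(size=(n, p))
y = 1 + 2 * X[:, 0] + rng.normal(size=n)       # only X1 actually matters

ss_total = np.sum((y - y.mean()) ** 2)
s2_complete = rss(X, y) / (n - p - 1)          # s² from the complete model

for size in range(p + 1):                      # Set 0, Set 1, ..., Set p
    for subset in combinations(range(p), size):
        rss_p = rss(X[:, list(subset)], y)
        r2 = 1 - rss_p / ss_total
        cp = rss_p / s2_complete - (n - 2 * (size + 1))   # Mallows Cp
        print(subset, round(r2, 3), round(cp, 2))
```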
