1 / 80

# Chapter 13 - PowerPoint PPT Presentation

Chapter 13. Simple Linear Regression Analysis. Simple Linear Regression. 13.1 The Simple Linear Regression Model and the Least Square Point Estimates 13.2 Model Assumptions and the Standard Error 13.3 Testing the Significance of Slope and y-Intercept 13.4 Confidence and Prediction Intervals.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' Chapter 13' - janna-mcconnell

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Chapter 13

Simple Linear Regression Analysis

13.1 The Simple Linear Regression Model and the Least Square Point Estimates

13.2 Model Assumptions and the Standard Error

13.3 Testing the Significance of Slope and y-Intercept

13.4 Confidence and Prediction Intervals

Simple Linear Regression Continued

13.5 Simple Coefficients of Determination and Correlation

13.6 Testing the Significance of the Population Correlation Coefficient (Optional)

13.7 An F Test for the Model

13.8 Residual Analysis (Optional)

13.9 Some Shortcut Formulas (Optional)

The Simple Linear Regression Model and the Least Squares Point Estimates

• The dependent (or response) variable is the variable we wish to understand or predict

• The independent (or predictor) variable is the variable we will use to understand or predict the dependent variable

• Regression analysis is a statistical technique that uses observed data to relate the dependent variable to one or more independent variables

Objective of Regression Analysis Point Estimates

The objective of regression analysis is to build a regression model (or predictive equation) that can be used to describe, predict and control the dependent variable on the basis of the independent variable

Example 13.1: Fuel Consumption Point EstimatesCase #1

Example 13.1: Fuel Consumption Point EstimatesCase #2

Example 13.1: Fuel Consumption Point EstimatesCase #3

Example 13.1: Fuel Consumption Point EstimatesCase #4

• The values of β0 and β1 determine the value of the mean weekly fuel consumption μy|x

• Because we do not know the true values of β0 and β1, we cannot actually calculate the mean weekly fuel consumptions

• We will learn how to estimate β0 and β1 in the next section

• For now, when we say that μy|x is related to x by a straight line, we mean the different mean weekly fuel consumptions and average hourly temperatures lie in a straight line

Form of The Simple Linear Regression Model Point Estimates

• y = β0 + β1x + ε

• y = β0 + β1x + ε is the mean value of the dependent variable y when the value of the independent variable is x

• β0 is the y-intercept; the mean of y when x is 0

• β1 is the slope; the change in the mean of y per unit change in x

• εis an error term that describes the effect on y of all factors other than x

Regression Terms Point Estimates

• β0 and β1 are called regression parameters

• β0 is the y-intercept and β1 is the slope

• We do not know the true values of these parameters

• So, we must use sample data to estimate them

• b0 is the estimate of β0 and b1 is the estimate of β1

The Least Squares Estimates, and Point EstimatesPoint Estimation and Prediction

• The true values of β0 and β1 are unknown

• Therefore, we must use observed data to compute statistics that estimate these parameters

• Will compute b0 to estimate β0 and b1 to estimate β1

The Least Squares Point Estimates Point Estimates

• Estimation/prediction equationŷ = b0 + b1x

• Least squares point estimate of the slope β1

The Least Squares Point Estimates Point EstimatesContinued

• Least squares point estimate of the y-intercept 0

Example 13.3: Fuel Consumption Case #2 Point Estimates

• From last slide,

• Σyi = 81.7

• Σxi = 351.8

• Σx2i = 16,874.76

• Σxiyi = 3,413.11

• Once we have these values, we no longer need the raw data

• Calculation of b0 and b1 uses these totals

Example 13.3: Fuel Consumption Case #5 Point Estimates

• Prediction (x = 40)

• ŷ = b0 + b1x = 15.84 + (-0.1279)(28)

• ŷ = 12.2588 MMcf of Gas

Model Assumptions Experimental Region

• Mean of ZeroAt any given value of x, the population of potential error term values has a mean equal to zero

• Constant Variance AssumptionAt any given value of x, the population of potential error term values has a variance that does not depend on the value of x

• Normality AssumptionAt any given value of x, the population of potential error term values has a normal distribution

• Independence AssumptionAny one value of the error term ε is statistically independent of any other value of ε

Model Assumptions Illustrated Experimental Region

Sum of Squared Errors Experimental Region

Mean Square Error Experimental Region

• This is the point estimate of the residual variance σ2

• SSE is from last slide

Standard Error Experimental Region

• This is the point estimate of the residual standard deviation σ

• MSE is from last slide

Example 13.5: Fuel Consumption Experimental RegionCase

Testing the Significance of the Slope Experimental Region

• A regression model is not likely to be useful unless there is a significant relationship between x and y

• To test significance, we use the null hypothesis:H0: β1 = 0

• Versus the alternative hypothesis:Ha: β1 ≠ 0

Testing the Significance of the Slope #2 Experimental Region

If the regression assumptions hold, we can reject H0: 1 = 0 at the  level of significance (probability of Type I error equal to ) if and only if the appropriate rejection point condition holds or, equivalently, if the corresponding p-value is less than 

Testing the Significance of the Slope #4 Experimental Region

• Test Statistics

• 100(1-α)% Confidence Interval for β1[b1± t /2 Sb1]

• t, t/2 and p-values are based on n–2 degrees of freedom

Example 13.6: Fuel Consumption DataCase

• The p-value for testing H0 versus Ha is twice the area to the right of |t|=7.33 with n-2=6 degrees of freedom

• In this case, the p-value is 0.0003

• We can reject H0 in favor of Ha at level of significance 0.05, 0.01, or 0.001

• We therefore have strong evidence that x is significantly related to y and that the regression model is significant

• If the regression assumptions hold, a 100(1-) percent confidence interval for the true slope B1 is

• b1± t/2sb

• Here t is based on n - 2 degrees of freedom

Example 13.7: Fuel Consumption DataCase

• An earlier printout tells us:

• b1 = -0.12792

• sb1 = 0.01746

• We have n-2=6 degrees of freedom

• That gives us a t-value of 2.447 for a 95 percent confidence interval

• [b1± t0.025 · sb1] = [-0.12792 ± 0.01746] = [-0.1706, -0.0852]

Testing the Significance of the Datay-Intercept

If the regression assumptions hold, we can reject H0: 0 = 0 at the  level of significance (probability of Type I error equal to ) if and only if the appropriate rejection point condition holds or, equivalently, if the corresponding p-value is less than 

• The point on the regression line corresponding to a particular value of x0 of the independent variable x is ŷ = b0 + b1x0

• It is unlikely that this value will equal the mean value of y when x equals x0

• Therefore, we need to place bounds on how far the predicted value might be from the actual value

• We can do this by calculating a confidence interval mean for the value of y and a prediction interval for an individual value of y

Distance Value Data

• Both the confidence interval for the mean value of y and the prediction interval for an individual value of y employ a quantity called the distance value

• The distance value for a particular value x0 of x is

• The distance value is a measure of the distance between the value x0 of x and x

• Notice that the further x0 is from x, the larger the distance value

• Assume that the regression assumption holds

• The formula for a 100(1-a) confidence interval for the mean value of y is as follows:

• This is based on n-2 degrees of freedom

Example 13.9: Fuel Consumption DataCase

• From before:

• n = 8

• x0 = 40

• x = 43.98

• SSxx = 1,404.355

• The distance value is

Example 13.9: Fuel Consumption DataCase Continued

• From before

• x0 = 40 is 10.72 MMcf

• t = 2.447

• s = 0.6542

• Distance value is 0.1363

• The confidence interval is

A Prediction Interval for an Individual DataValue of y

• Assume that the regression assumption holds

• The formula for a 100(1-) prediction interval for an individual value of y is as follows:

• This is based on n-2 degrees of freedom

Example 13.9: Fuel Consumption DataCase

• From before

• x0 = 40 is 10.72 MMcf

• t = 2.447

• s = 0.6542

• Distance value is 0.1363

• The prediction interval is

Which to Use? Data

• The prediction interval is useful if it is important to predict an individual value of the dependent variable

• A confidence interval is useful if it is important to estimate the mean value

• The prediction interval will always be wider than the confidence interval

• How useful is a particular regression model?

• One measure of usefulness is the simple coefficient of determination

• It is represented by the symbol r2

Calculating The Simple Coefficient of Information in XDetermination

• Total variation is given by the formula (yi-ȳ)2

• Explained variation is given by the formula (ŷi-ȳ)2

• Unexplained variation is given by the formula (yi-ŷ)2

• Total variation is the sum of explained and unexplained variation

• r2 is the ratio of explained variation to total variation

What Does r Information in X2 Mean?

The coefficient of determination, r2, is the proportion of the total variation in the n observed values of the dependent variable that is explained by the simple linear regression model

The Simple Correlation Coefficient Information in X

• The simple correlation coefficient measures the strength of the linear relationship between y and x and is denoted by r

• Where b1 is the slope of the least squares line

Different Values of the Correlation Information in XCoefficient

The value of the simple correlation coefficient (r) is not the slope of the least square line

That value is estimated by b1

High correlation does not imply that a cause-and-effect relationship exists

It simply implies that x and y tend to move together in a linear fashion

Scientific theory is required to show a cause-and-effect relationship

Two Important Points

Testing the Significance of the the slope of the least square linePopulation Correlation Coefficient

• The simple correlation coefficient (r) measures the linear relationship between the observed values of x and y from the sample

• The population correlation coefficient (ρ) measures the linear relationship between all possible combinations of observed values of x and y

• r is an estimate of ρ

Testing ρ the slope of the least square line

• We can test to see if the correlation is significant using the hypothesesH0: ρ = 0Ha: ρ ≠ 0

• The statistic is

• This test will give the same results as the test for significance on the slope coefficient b1

An F Test for Model the slope of the least square line

• For simple regression, this is another way to test the null hypothesisH0: β1 = 0

• This is the only test we will use for multiple regression

• The F test tests the significance of the overall regression relationship between x and y

Mechanics of the F Test the slope of the least square line

• To test H0: β1= 0 versus Ha: β1 0 at the a level of significance

• Test statistics based on F

• Reject H0 if F(model) > Fa or p-value < a

• Fa is based on 1 numerator and n-2 denominator degrees of freedom

Mechanics of the F Test Graphically the slope of the least square line

Example 13.15: Fuel Consumption Case the slope of the least square line

F-test at  = 0.05 level of significance

Test Statistic

Reject H0 at  level of significance, since

Fais based on 1 numerator and 6 denominator degrees of freedom

Residual Analysis #1 the slope of the least square line

• Checks of regression assumptions are performed by analyzing the regression residuals

• Residuals (e) are defined as the difference between the observed value of y and the predicted value of y, e = y - ŷ

• Note that e is the point estimate of ε

• If the regression assumptions are valid, the population of potential error terms will be normally distributed with a mean of zero and a variance σ2

• Furthermore, the different error terms will be statistically independent

Residual Analysis #2 the slope of the least square line

• The residuals should look like they have been randomly and independently selected from normally distributed populations having mean zero and variance σ2

• With any real data, assumptions will not hold exactly

• Mild departures do not affect our ability to make statistical inferences

• In checking assumptions, we are looking for pronounced departures from the assumptions

• So, only require residuals to approximately fit the description above

Residual Plots the slope of the least square line

• Residuals versus independent variable

• Residuals versus predicted y’s

• Residuals in time order (if the response is a time series)

Constant Variance Assumptions the slope of the least square line

• To check the validity of the constant variance assumption, examine residual plots against

• The x values

• The predicted y values

• Time (when data is time series)

• A pattern that fans out says the variance is increasing rather than staying constant

• A pattern that funnels in says the variance is decreasing rather than staying constant

• A pattern that is evenly spread within a band says the assumption has been met

Constant Variance Visually the slope of the least square line

Example 13.16: The QHIC Case the slope of the least square lineResiduals Versus Value

Example 13.16: The QHIC Case the slope of the least square lineResiduals Versus the Fitted Values

Assumption of Correct Functional the slope of the least square lineForm

• If the relationship between x and y is something other than a linear one, the residual plot will often suggest a form more appropriate for the model

• For example, if there is a curved relationship between x and y, a plot of residuals will often show a curved relationship

Normality Assumption the slope of the least square line

• If the normality assumption holds, a histogram or stem-and-leaf display of residuals should look bell-shaped and symmetric

• Another way to check is a normal plot of residuals

• Order residuals from smallest to largest

• Plot e(i) on vertical axis against z(i)

• Z(i) is the point on the horizontal axis under the z curve so the area under this curve to the left is (3i-1)/(3n+1)

• If the normality assumption holds, the plot should have a straight-line appearance

Example 13.18: The QHIC Case the slope of the least square lineNormal Probability Plot of the Residuals

Independence Assumption the slope of the least square line

• Independence assumption is most likely to be violated when the data are time-series data

• If the data is not time series, then it can be reordered without affecting the data

• Changing the order would change the interdependence of the data

• For time-series data, the time-ordered error terms can be autocorrelated

• Positive autocorrelation is when a positive error term in time period i tends to be followed by another positive value in i+k

• Negative autocorrelation is when a positive error term in time period i tends to be followed by a negative value in i+k

• Either one will cause a cyclical error term over time

Independence Assumption Visually the slope of the least square linePositive Autocorrelation

Independence Assumption Visually the slope of the least square lineNegative Autocorrelation

Some Shortcut Formulas the slope of the least square line

where