Chapter 11: Simple Linear Regression

Where We’ve Been
• Presented methods for estimating and testing population parameters for a single sample
• Extended those methods to allow for a comparison of population parameters for multiple samples
McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

Where We’re Going
• Introduce the straight-line linear regression model as a means of relating one quantitative variable to another quantitative variable
• Introduce the correlation coefficient as a means of relating one quantitative variable to another quantitative variable
• Assess how well the simple linear regression model fits the sample data
• Use the simple linear regression model to predict the value of one variable given the value of another variable

11.1: Probabilistic Models

The relationship between home runs and runs in baseball seems at first glance to be deterministic …

But if you consider how many runners are on base when the home run is hit, or even how often the batter misses a base and is called out, the rigid model becomes more variable.

General Form of Probabilistic Models
y = Deterministic component + Random error
where y is the variable of interest, and the mean value of the random error is assumed to be 0, so that E(y) = Deterministic component.

• The goal of regression analysis is to find the straight line that comes closest to all of the points in the scatter plot simultaneously.

• A First-Order Probabilistic Model
$y = \beta_0 + \beta_1 x + \varepsilon$
where
y = dependent variable
x = independent variable
$\beta_0 + \beta_1 x = E(y)$ = deterministic component
$\varepsilon$ = random error component
$\beta_0$ = y-intercept
$\beta_1$ = slope of the line

11.1: Probabilistic Models 0, the y – intercept, and 1, theslope of the line, are population parameters, and invariably unknown. Regression analysis is designed to estimate these parameters. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

11.2: Fitting the Model: The Least Squares Approach
Step 1: Hypothesize the deterministic component of the probabilistic model: $E(y) = \beta_0 + \beta_1 x$
Step 2: Use sample data to estimate the unknown parameters in the model

Values on the line are the predicted values of total offerings given the average offering. The distances between the scattered dots and the line are the errors of prediction. The line’s estimated parameters are the values that minimize the sum of the squared errors of prediction, and the method of finding those values is called the method of least squares.

• Model: $y = \beta_0 + \beta_1 x + \varepsilon$
• Estimates: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
• Deviation: $y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$
• SSE: $\sum_{i=1}^{n} [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2$

• The least squares line is the line that has the following two properties:
  • The sum of the errors (SE) equals 0.
  • The sum of squared errors (SSE) is smaller than that for any other straight-line model.
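The two properties above can be checked numerically. The sketch below uses the standard closed-form least squares estimates, slope = SSxy / SSxx and intercept = ȳ − slope · x̄, on a small made-up data set. The data values and function names are hypothetical, not the home run data from this chapter.

```python
def least_squares(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(x, y, b0, b1):
    """Sum of squared errors of prediction for the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical illustrative data.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

b0, b1 = least_squares(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(f"least squares line: y-hat = {b0:.2f} + {b1:.2f}x")
print(f"sum of errors: {sum(residuals):.6f}")
print(f"SSE of least squares line: {sse(x, y, b0, b1):.3f}")
print(f"SSE after nudging the slope: {sse(x, y, b0, b1 + 0.1):.3f}")
```

Nudging either coefficient away from the least squares values increases SSE, which is the second property in action.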

Can home runs be used to predict errors? Is there a relationship between the number of home runs a team hits and the quality of its fielding?

These results suggest that teams that hit more home runs are (slightly) better fielders (maybe not what we expected). There are, however, only five observations in the sample. It is important to take a closer look at the assumptions we made and the results we got.

11.3: Model Assumptions
Assumptions
1. The mean of the probability distribution of $\varepsilon$ is 0.
2. The variance, $\sigma^2$, of the probability distribution of $\varepsilon$ is constant.
3. The probability distribution of $\varepsilon$ is normal.
4. The values of $\varepsilon$ associated with any two values of y are independent.

• The variance, $\sigma^2$, is used in every test statistic and confidence interval used to evaluate the model.
• Invariably, $\sigma^2$ is unknown and must be estimated.
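The usual estimator of $\sigma^2$ in simple linear regression is $s^2 = SSE/(n - 2)$, where two degrees of freedom are lost because two parameters are estimated. A minimal sketch follows, reusing a hypothetical five-point data set whose least squares line is ŷ = −0.1 + 0.7x. Both the data and the function name are illustrative assumptions.

```python
import math


def estimate_variance(x, y, b0, b1):
    """Estimate sigma^2 by s^2 = SSE / (n - 2) for the fitted line b0 + b1*x."""
    n = len(x)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return sse / (n - 2)


# Hypothetical illustrative data and its least squares coefficients.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
b0, b1 = -0.1, 0.7

s2 = estimate_variance(x, y, b0, b1)
s = math.sqrt(s2)  # estimated standard deviation of the random error

print(f"s^2 = {s2:.4f}, s = {s:.4f}")
```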

11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 Note: There may be many different patterns in the scatter plot when there is no linear relationship. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 A critical step in the evaluation of the model is to test whether 1 = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . y y y x x x Positive Relationship No Relationship Negative Relationship 1> 0 1 = 0 1 < 0 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 H0 : 1 = 0 Ha : 1 ≠ 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . y y y x x x Positive Relationship No Relationship Negative Relationship 1> 0 1 = 0 1 < 0 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 • The four assumptions described above produce a normal sampling distribution for the slope estimate: called the estimated standard error of the least squares slope estimate. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 • Since the t-value does not lead to rejection of the null hypothesis, we can conclude that • A different set of data may yield different results. • There is a more complicated relationship. • There is no relationship (non-rejection does not lead to this conclusion automatically). McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 • Interpreting p-Values for  • Software packages report two-tailed p-values. • To conduct one-tailed tests of hypotheses, the reported p-values must be adjusted: McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 • A Confidence Interval on 1 where the estimated standard error is and t/2 is based on (n – 2) degrees of freedom McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

11.4: Assessing the Utility of the Model: Making Inferences about the Slope 1 • In the home runs and errors example, the estimated 1was -.0521, and the estimated standard error was .569. With 3 degrees of freedom, t = 3.182. • The confidence interval is, therefore, which includes 0, so there may be no relationship between the two variables. McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression

11.5: The Coefficients of Correlation and Determination
• The coefficient of correlation, r, is a measure of the strength of the linear relationship between two variables. It is computed as follows:
$r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$

[Figure: three scatter plots of y against x. Positive linear relationship: r → +1. No linear relationship: r ≈ 0. Negative linear relationship: r → -1.]
Values of r equal to +1 or -1 require each point in the scatter plot to lie on a single straight line.
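A short sketch of the correlation computation, $r = SS_{xy} / \sqrt{SS_{xx}\, SS_{yy}}$, applied to three hypothetical data sets: a scattered one, one lying exactly on a rising line, and one lying exactly on a falling line. None of these are the home run data.

```python
import math


def correlation(x, y):
    """Coefficient of correlation r = SSxy / sqrt(SSxx * SSyy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


x = [1, 2, 3, 4, 5]
r_scattered = correlation(x, [1, 1, 2, 2, 4])       # points near, not on, a line
r_rising_line = correlation(x, [3, 5, 7, 9, 11])    # y = 1 + 2x exactly
r_falling_line = correlation(x, [10, 8, 6, 4, 2])   # y = 12 - 2x exactly

print(f"scattered: r = {r_scattered:.4f}")
print(f"exact rising line: r = {r_rising_line:.4f}")
print(f"exact falling line: r = {r_falling_line:.4f}")
```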

• In the example about home runs and errors, $SS_{xy}$ = -143.8 and $SS_{xx}$ = 2509.
• $SS_{yy}$ is computed as … so …

• An r value that close to zero suggests there may not be a linear relationship between the variables, which is consistent with our earlier look at the null hypothesis and the confidence interval on $\beta_1$.

• The coefficient of determination, $r^2$, represents the proportion of the total sample variability around the mean of y that is explained by the linear relationship between x and y.
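One common way to compute this proportion is $r^2 = (SS_{yy} - SSE) / SS_{yy}$: the total variability around the mean of y, minus what the line leaves unexplained, as a fraction of the total. The sketch below applies it to a hypothetical five-point data set. The data and function name are assumptions for illustration.

```python
def coefficient_of_determination(x, y):
    """Return r^2 = (SSyy - SSE) / SSyy for the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)  # total variability around y-bar
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    return (ss_yy - sse) / ss_yy


# Hypothetical illustrative data.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
r_squared = coefficient_of_determination(x, y)

print(f"r^2 = {r_squared:.4f}")
```

For this data, about 82% of the sample variability in y is explained by the straight-line relationship with x.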

11.6: Using the Model for Estimation and Prediction

• Based on our model results, a team that hits 140 home runs is expected to make 99.9 errors: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1(140) = 99.9$

• A 95% Prediction Interval for an Individual Team’s Errors
$\hat{y} \pm t_{.025}\, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{SS_{xx}}}$, where $x_p$ is the value of x used for the prediction and $t_{.025}$ is based on (n – 2) degrees of freedom

Prediction intervals for individual new values of y are wider than confidence intervals on the mean of y because an individual value carries an extra source of error: its own random error $\varepsilon$, in addition to the error in estimating the line.
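The difference in width can be seen by computing both half-widths side by side. The sketch below uses the standard formulas: the confidence interval for the mean of y uses $\sqrt{1/n + (x_p - \bar{x})^2 / SS_{xx}}$, and the prediction interval adds 1 under the square root. The data set is hypothetical, and the critical value 3.182 (3 degrees of freedom) is the one quoted earlier in this chapter.

```python
import math


def interval_half_widths(x, y, x_p, t_crit):
    """Return (y_hat, mean_half_width, prediction_half_width) at x = x_p."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    y_hat = b0 + b1 * x_p
    line_term = 1 / n + (x_p - x_bar) ** 2 / ss_xx     # error in estimating the line
    mean_half = t_crit * s * math.sqrt(line_term)
    pred_half = t_crit * s * math.sqrt(1 + line_term)  # extra 1: the new point's own error
    return y_hat, mean_half, pred_half


# Hypothetical illustrative data; n = 5, so n - 2 = 3 degrees of freedom.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
y_hat, mean_half, pred_half = interval_half_widths(x, y, x_p=4, t_crit=3.182)

print(f"y-hat at x = 4: {y_hat:.2f}")
print(f"95% CI for the mean of y: +/- {mean_half:.3f}")
print(f"95% PI for an individual y: +/- {pred_half:.3f}")
```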

• Estimating y beyond the range of values associated with the observed values of x can lead to large prediction errors.
• Beyond the range of observed x values, the relationship may look very different.
[Figure: estimated relationship versus true relationship, with the range of observed values of x marked between $x_i$ and $x_j$.]
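A toy illustration of the extrapolation risk: suppose the true relationship is curved (here y = x², an assumption made purely for the demonstration) but only x values from 1 to 5 are observed. A straight line fits tolerably inside that range and badly outside it.

```python
def fit_line(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1


def true_relationship(x):
    """Hypothetical curved truth, unknown to the analyst."""
    return x ** 2


observed_x = [1, 2, 3, 4, 5]
observed_y = [true_relationship(v) for v in observed_x]
b0, b1 = fit_line(observed_x, observed_y)


def prediction_error(x_new):
    """Absolute gap between the true value and the fitted line at x_new."""
    return abs(true_relationship(x_new) - (b0 + b1 * x_new))


worst_inside = max(prediction_error(v) for v in observed_x)
error_outside = prediction_error(10)  # well beyond the observed range

print(f"fitted line: y-hat = {b0:.1f} + {b1:.1f}x")
print(f"largest error within the observed range: {worst_inside:.1f}")
print(f"error at x = 10: {error_outside:.1f}")
```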

11.7: A Complete Example
• Step 1
  • How does the proximity of a fire house (x) affect the damages (y) from a fire?
  • y = f(x)
  • $y = \beta_0 + \beta_1 x + \varepsilon$
