## Chapter 13


**Chapter 13**

Simple Linear Regression & Correlation: Inferential Methods

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.

**Deterministic Models**

• Consider the two variables x and y. A deterministic relationship is one in which the value of y (the dependent variable) is described by some formula or mathematical notation such as $y = f(x)$, $y = 3 + 2x$, or $y = 5e^{-2x}$, where x is the independent variable.

**Probabilistic Models**

• A description of the relation between two variables x and y that are not deterministically related can be given by specifying a probabilistic model.
• The general form of an additive probabilistic model allows y to be larger or smaller than $f(x)$ by a random amount, e.
• The model equation is of the form

$$y = \text{deterministic function of } x + \text{random deviation} = f(x) + e$$

[Figure: deviations from the deterministic part of a probabilistic model; the highlighted point has e = -1.5.]

**Simple Linear Regression Model**

The simple linear regression model assumes that there is a line with vertical (y) intercept $\alpha$ and slope $\beta$, called the true or population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

$$y = \alpha + \beta x + e$$

Without the random deviation e, all observed (x, y) points would fall exactly on the population regression line. The inclusion of e in the model equation allows points to deviate from the line by random amounts.
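The role of the random deviation e can be illustrated with a short simulation. This is only a sketch: the intercept, slope, and spread below are hypothetical values chosen for illustration, and the deviations are drawn from a normal distribution as an assumption of the sketch. Setting the spread to zero puts every point exactly on the line; a positive spread lets the points scatter around it.

```python
import random


def simulate_observations(alpha, beta, sigma, xs, seed=0):
    """Draw one y per x from y = alpha + beta*x + e, with e ~ Normal(0, sigma)."""
    rng = random.Random(seed)
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]


# Hypothetical parameter values, chosen only for illustration.
ALPHA, BETA = 3.0, 2.0
xs = [0.0, 1.0, 2.0, 3.0, 4.0]

# sigma = 0: no random deviation, so every point lies on the line.
on_line = simulate_observations(ALPHA, BETA, 0.0, xs)
# sigma > 0: each point deviates from the line by a random amount e.
scattered = simulate_observations(ALPHA, BETA, 1.5, xs)

for x, y_line, y_obs in zip(xs, on_line, scattered):
    print(f"x={x:.1f}  line={y_line:.2f}  observed={y_obs:.2f}  e={y_obs - y_line:+.2f}")
```

The printed e column is the vertical distance of each simulated observation from the line, which is the quantity the model treats as random.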
[Figure: Simple Linear Regression Model. The population regression line (slope $\beta$, vertical intercept $\alpha$) with observations at $x = x_1$ and $x = x_2$ and their deviations $e_1$ and $e_2$ from the line.]

**Basic Assumptions of the Simple Linear Regression Model**

• The distribution of e at any particular x value has mean value 0 ($\mu_e = 0$).
• The standard deviation of e (which describes the spread of its distribution) is the same for any particular value of x. This standard deviation is denoted by $\sigma$.
• The distribution of e at any particular x value is normal.
• The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.

**More About the Simple Linear Regression Model**

For any fixed x value, y itself has a normal distribution, with

(mean y value for fixed x) = (height of the population regression line above x) = $\alpha + \beta x$

and

(standard deviation of y for fixed x) = $\sigma$.

**Interpretation of Terms**

The slope $\beta$ of the population regression line is the mean (average) change in y associated with a 1-unit increase in x.

The vertical intercept $\alpha$ is the height of the population line when x = 0.

The size of $\sigma$ determines the extent to which the (x, y) observations deviate from the population line.

[Figure: scatter about the population line for small $\sigma$ and for large $\sigma$.]

**Illustration of Assumptions**

[Figure: illustration of the model assumptions.]

**Estimates for the Regression Line**

The point estimates of $\beta$, the slope, and $\alpha$, the y intercept of the population regression line, are the slope and y intercept, respectively, of the least squares line. That is,

$$b = \text{point estimate of } \beta = \frac{S_{xy}}{S_{xx}}, \qquad a = \text{point estimate of } \alpha = \bar{y} - b\bar{x}$$

where $S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$ and $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$.

**Interpretation of y = a + bx**

Let x* denote a specific value of the predictor variable x. Then a + bx* has two interpretations:

• a + bx* is a point estimate of the mean y value when x = x*.
• a + bx* is a point prediction of an individual y value to be observed when x = x*.

**Example**

The data for this example were collected in a study of age and fatness in humans.* One of the questions was, “What is the relationship between age and fatness?”

* Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual-photon (153Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839.

A point estimate for the %Fat for a human who is 45 years old is

$$\hat{y} = 3.22 + 0.548(45) \approx 27.9$$

If 45 is put into the equation for x, we have both an estimated %Fat for a 45-year-old human and an estimated average %Fat for 45-year-old humans. The two interpretations are quite different.

[Figure: Minitab plot of the data points along with the least squares regression line.]

**Terminology**

The predicted (fitted) values are $\hat{y}_i = a + bx_i$, and the residuals are $y_i - \hat{y}_i$.

**Definition formulae**

The total sum of squares, denoted by SSTo, is defined as

$$\text{SSTo} = \sum (y - \bar{y})^2$$

The residual sum of squares, denoted by SSResid, is defined as

$$\text{SSResid} = \sum (y - \hat{y})^2$$

**Calculation Formulae Recalled**

SSTo and SSResid are generally found as part of the standard output from most statistical packages or can be obtained using the following computational formulas:

$$\text{SSTo} = \sum y^2 - \frac{(\sum y)^2}{n}, \qquad \text{SSResid} = \sum y^2 - a\sum y - b\sum xy$$

**Coefficient of Determination**

• The coefficient of determination, denoted by $r^2$, gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y:

$$r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}}$$
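The least squares estimates, the two sums of squares, and the coefficient of determination can all be computed directly from their definitions. The sketch below does so in Python on a small made-up data set (not the age and %Fat data, which are not reproduced in the text); the function names are hypothetical and chosen for this illustration.

```python
def least_squares(xs, ys):
    """Return (a, b): intercept and slope of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope: Sxy / Sxx
    a = y_bar - b * x_bar    # intercept: y-bar minus b times x-bar
    return a, b


def sums_of_squares(xs, ys, a, b):
    """Return (SSTo, SSResid) from their definition formulas."""
    y_bar = sum(ys) / len(ys)
    ss_to = sum((y - y_bar) ** 2 for y in ys)
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return ss_to, ss_resid


# Made-up illustrative data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
ss_to, ss_resid = sums_of_squares(xs, ys, a, b)
r_squared = 1 - ss_resid / ss_to

print(f"least squares line: y-hat = {a:.2f} + {b:.2f} x")
print(f"SSTo = {ss_to:.2f}, SSResid = {ss_resid:.2f}, r^2 = {r_squared:.3f}")
```

For these five points the line is y-hat = 2.20 + 0.60x, with SSTo = 6.00, SSResid = 2.40, and r² = 0.600, so 60% of the variation in the made-up y values is attributed to the linear relationship.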
**Estimated Standard Deviation, $s_e$**

The statistic for estimating the variance $\sigma^2$ is

$$s_e^2 = \frac{\text{SSResid}}{n - 2}$$

where $\text{SSResid} = \sum (y - \hat{y})^2 = \sum y^2 - a\sum y - b\sum xy$.

The estimate of $\sigma$ is the estimated standard deviation

$$s_e = \sqrt{s_e^2}$$

The number of degrees of freedom associated with estimating $\sigma^2$ or $\sigma$ in simple linear regression is n - 2.

**Example continued**

With $r^2$ = 0.627 or 62.7%, we can say that 62.7% of the observed variation in %Fat can be attributed to the probabilistic linear relationship with human age. The magnitude of a typical sample deviation from the least squares line is about 5.75 (%), which is reasonably large compared to the y values themselves. This suggests that the model is useful only in the sense of providing gross “ballpark” estimates of %Fat for humans based on age.

**Properties of the Sampling Distribution of b**

When the four basic assumptions of the simple linear regression model are satisfied, the following conditions are met:

• The standard deviation of the statistic b is $\sigma_b = \dfrac{\sigma}{\sqrt{S_{xx}}}$.
• The mean value of b is $\beta$.
Specifically, $\mu_b = \beta$, and hence b is an unbiased statistic for estimating $\beta$.
• The statistic b has a normal distribution (a consequence of the error e being normally distributed).

**Estimated Standard Deviation of b**

The estimated standard deviation of the statistic b is

$$s_b = \frac{s_e}{\sqrt{S_{xx}}}$$

When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable

$$t = \frac{b - \beta}{s_b}$$

is the t distribution with df = n - 2.

**Confidence Interval for $\beta$**

When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for $\beta$, the slope of the population regression line, has the form

$$b \pm (t \text{ critical value}) \cdot s_b$$

where the t critical value is based on df = n - 2.

**Example continued**

With b = 0.5480, $s_b$ = 0.1056, and df = n - 2 = 16 (t critical value 2.12), a 95% confidence interval estimate for $\beta$ is

$$0.548 \pm (2.12)(0.1056) = 0.548 \pm 0.224, \quad \text{or } (0.324,\ 0.772)$$

Based on sample data, we are 95% confident that the true mean increase in %Fat associated with a year of age is between 0.324% and 0.772%.
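The interval arithmetic can be checked with a few lines of Python. This is a minimal sketch: the slope estimate and its standard error are the values reported in the Minitab output for this example, the critical value 2.120 is the standard t table entry for 95% confidence with 16 degrees of freedom, and the function name is hypothetical.

```python
def slope_confidence_interval(b, s_b, t_critical):
    """Return (lower, upper) for b +/- (t critical value) * s_b."""
    margin = t_critical * s_b
    return b - margin, b + margin


B = 0.5480              # estimated slope from the Minitab output
S_B = 0.1056            # estimated standard deviation of b (SE Coef)
T_CRIT_95_DF16 = 2.120  # t table value, 95% confidence, df = 16

lower, upper = slope_confidence_interval(B, S_B, T_CRIT_95_DF16)
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

Rounded to three decimals, the limits agree with the interval stated above, 0.324 to 0.772.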
**Example continued**

Minitab output looks like

Regression Analysis: % Fat y versus Age (x)

The regression equation is
% Fat y = 3.22 + 0.548 Age (x)

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 3.221 | 5.076 | 0.63 | 0.535 |
| Age (x) | 0.5480 | 0.1056 | 5.19 | 0.000 |

S = 5.754   R-Sq = 62.7%   R-Sq(adj) = 60.4%

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 1 | 891.87 | 891.87 | 26.94 | 0.000 |
| Residual Error | 16 | 529.66 | 33.10 | | |
| Total | 17 | 1421.54 | | | |

In this output, "The regression equation is" gives the regression line; the Coef entry for Constant is the estimated y intercept a, and the Coef entry for Age (x) is the estimated slope b. The Residual Error row gives the residual df = n - 2 and SSResid, and the Total row gives SSTo.

**Hypothesis Tests Concerning $\beta$**

Null hypothesis: $H_0$: $\beta$ = hypothesized value

Test statistic: $t = \dfrac{b - \text{hypothesized value}}{s_b}$, based on df = n - 2.

Alternate hypothesis and finding the P-value:

• $H_a$: $\beta$ > hypothesized value. P-value = area under the t curve with n - 2 degrees of freedom to the right of the calculated t.
• $H_a$: $\beta$ < hypothesized value. P-value = area under the t curve with n - 2 degrees of freedom to the left of the calculated t.
• $H_a$: $\beta \neq$ hypothesized value.
  • If t is positive, P-value = 2 (area under the t curve with n - 2 degrees of freedom to the right of the calculated t).
  • If t is negative, P-value = 2 (area under the t curve with n - 2 degrees of freedom to the left of the calculated t).

Assumptions:

• The distribution of e at any particular x value has mean value 0 ($\mu_e = 0$).
• The standard deviation of e is $\sigma$, which does not depend on x.
• The distribution of e at any particular x value is normal.
• The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.

For the null hypothesis $H_0$: $\beta = 0$, the test statistic simplifies to $t = \dfrac{b}{s_b}$ and is called the t ratio.
**Hypothesis Tests Concerning $\beta$**

Quite often the test is performed with the hypotheses

$H_0$: $\beta = 0$ vs. $H_a$: $\beta \neq 0$

This particular form of the test is called the model utility test for simple linear regression. The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y.

**Example**

Consider data on percentage unemployment and suicide rates.*

* Smith, D. (1977) Patterns in Human Geography, Canada: Douglas David and Charles Ltd., 158.

[Figure: Minitab plot of the data points, suicide rate (y) versus percentage unemployed (x).]

[Slides with the summary statistics and the intermediate calculations were images and are not recoverable.]

**Example - Model Utility Test**

• $\beta$ = the true average change in suicide rate associated with an increase in the unemployment rate of 1 percentage point
• $H_0$: $\beta = 0$
• $H_a$: $\beta \neq 0$
• The significance level $\alpha$ has not been preselected. We shall interpret the observed level of significance (P-value).
• Test statistic: $t = \dfrac{b}{s_b}$
• Assumptions: The plot (Minitab) of the data shows a linear pattern, and the variability of points does not appear to be changing with x. Assuming that the distribution of errors (residuals) at any given x value is approximately normal, the assumptions of the simple linear regression model are appropriate.
• Calculation: $t = \dfrac{59.05}{14.24} = 4.15$
• P-value: The table of tail areas for t distributions only has t values up to 4, so we can see that the corresponding tail area is < 0.002. Since this is a two-tailed test, the P-value is < 0.004. (Actual calculation gives a P-value of 0.002.)
• Conclusion: Even though no specific significance level was chosen for the test, with the P-value being so small (< 0.004) one would generally reject the null hypothesis that $\beta = 0$ and conclude that there is a useful linear relationship between the % unemployed and the suicide rate.

**Example - Minitab Output**

Regression Analysis: Suicide Rate (y) versus Percentage Unemployed (x)

The regression equation is
Suicide Rate (y) = - 93.9 + 59.1 Percentage Unemployed (x)

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | -93.86 | 51.25 | -1.83 | 0.100 |
| Percenta | 59.05 | 14.24 | 4.15 | 0.002 |

S = 36.06   R-Sq = 65.7%   R-Sq(adj) = 61.8%

The T and P entries in the Percenta row are the t value and P-value for the model utility test of $H_0$: $\beta = 0$ versus $H_a$: $\beta \neq 0$.
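The t ratio and its two-tailed P-value can be approximated without a statistics package. The following is a sketch only: it integrates the t density numerically with Simpson's rule rather than calling a library routine, the function names are hypothetical, and the degrees of freedom (df = 9) are an assumption, since the sample size is not stated in the text; that value is inferred from the reported R-Sq and R-Sq(adj), and it is consistent with the P-value of 0.100 shown for the Constant row.

```python
import math


def t_density(u, df):
    """Density of the t distribution with df degrees of freedom at u."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + u * u / df) ** (-(df + 1) / 2)


def two_tailed_p_value(t, df, steps=2000):
    """Two-tailed P-value: 1 - 2 * (area under the t curve between 0 and |t|)."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = total * h / 3
    return 1.0 - 2.0 * central_area


B = 59.05     # estimated slope from the Minitab output
S_B = 14.24   # estimated standard deviation of b (SE Coef)
DF = 9        # ASSUMPTION: n - 2 inferred from R-Sq and R-Sq(adj), not stated in the text

t_ratio = B / S_B
p_value = two_tailed_p_value(t_ratio, DF)
print(f"t ratio = {t_ratio:.2f}, two-tailed P-value = {p_value:.4f}")
```

Under the assumed df, the computed P-value falls below the 0.004 bound obtained from the t table, in line with the conclusion of the test.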