1. Statistics with Economics and Business Applications Chapter 11 Linear Regression and Correlation
Correlation Coefficient, Least Squares, ANOVA Table, Test and Confidence Intervals, Estimation and Prediction
2. Review I. What's in the last two lectures?
Hypothesis tests for means and proportions (Chapter 8)
II. What's in the next three lectures?
Correlation coefficient and linear regression (read Chapter 11)
3. Introduction
4. Example: Age and Fatness
5. Example: Age and Fatness
6. Example: Advertising and Sales The following table contains sales (y) and advertising expenditures (x) for 10 branches of a retail store.
7. Describing the Scatterplot
8. Examples
9. Investigation of Relationship There are two approaches to investigating a linear relationship:
Correlation coefficient: a numerical measure of the strength and direction of the linear relationship between x and y.
Linear regression: a linear equation that expresses the relationship between x and y. It provides the form of the relationship.
10. The Correlation Coefficient The strength and direction of the relationship between x and y are measured using the correlation coefficient (Pearson product moment coefficient of correlation), r.
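(The slide's formula is not reproduced in this text; a standard statement of it, written with the Sxx, Syy, Sxy notation used later in the Key Concepts slides, is:)

\[
r \;=\; \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}},
\qquad
S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}),
\quad
S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2,
\quad
S_{yy} = \sum_{i=1}^{n}(y_i-\bar{y})^2 .
\]

r always lies between −1 and +1; values near ±1 indicate a strong linear relationship, and values near 0 indicate a weak one.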
11. Example The table shows the heights and weights of
n = 10 randomly selected college football players.
12. Football Players
13. Interpreting r
14. Example: Advertising and Sales Often we want to investigate the form of the relationship for the purpose of description, control and prediction. For the advertising and sales example, sales increase as advertising expenditure increases. The relationship is almost, but not exactly, linear.
Description: how sales depend on advertising expenditure
Control: how much to spend on advertising to reach certain sales goals
Prediction: how much in sales to expect if we spend a certain amount of money on advertising
15. Linear Deterministic Model Denote x as the independent variable and y as the dependent variable. For example, x = age and y = %fat in the first example; x = advertising expenditure and y = sales in the second example.
We want to find how y depends on x, or how to predict y using x.
One of the simplest deterministic mathematical relationships between two variables x and y is a linear relationship y = α + βx
16. Reality In the real world, things are never so clean!
Age influences fatness, but it is not the sole influence. There are other factors such as sex, body type and random variation (e.g. measurement error)
Other factors besides advertising expenditure, such as time of year, state of the economy and size of inventory, can influence sales
Observations of (x, y) do not fall on a straight line
17. Probabilistic Model Probabilistic model:
y = deterministic model + random error
Random error represents random fluctuation from the deterministic model
The probabilistic model is assumed for the population
Simple linear regression model:
y = α + βx + ε
Without the random deviation ε, all observed (x, y) points would fall exactly on the deterministic line. The inclusion of ε in the model equation allows points to deviate from the line by random amounts.
18. Simple Linear Regression Model
19. Basic Assumptions of the Simple Linear Regression Model The distribution of ε at any particular x value has mean value 0.
The standard deviation of ε is the same for any particular value of x. This standard deviation is denoted by σ.
The distribution of ε at any particular x value is normal.
The random errors are independent of one another.
20. Interpretation of Terms
21. The Distribution of y For any fixed x, y has a normal distribution with mean α + βx and standard deviation σ.
22. Data So far we have described the population probabilistic model.
Usually the three population parameters, α, β and σ, are unknown. We need to estimate them from the data.
Data: n pairs of observations of independent and dependent variables
(x1,y1), (x2,y2), …, (xn,yn)
Probabilistic model:
yi = α + βxi + εi,  i = 1, …, n
The εi are independent normal with mean 0 and standard deviation σ.
23. Steps in Regression Analysis
24. The Method of Least Squares The equation of the best-fitting line is calculated using n pairs of data (xi, yi).
25. Least Squares Estimators
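(The estimator formulas themselves did not survive in this text; the standard least squares solutions, consistent with the Key Concepts slides, are:)

\[
b \;=\; \frac{S_{xy}}{S_{xx}},
\qquad
a \;=\; \bar{y} - b\,\bar{x},
\qquad
\hat{y} \;=\; a + b\,x ,
\]

where a and b estimate the population intercept α and slope β, and ŷ = a + bx is the best-fitting (least squares) line.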
26. Example: Age and Fatness
27. Example: Age and Fatness
28. Example: Age and Fatness
29. The Analysis of Variance The total variation in the experiment is measured by the total sum of squares:
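(The sums of squares are not shown in this text; the standard decomposition, matching the Total SS = SSR + SSE identity in the Key Concepts slides, is:)

\[
\text{Total SS} \;=\; S_{yy} \;=\; \sum_{i=1}^{n}(y_i-\bar{y})^2
\;=\; \text{SSR} + \text{SSE},
\qquad
\text{SSR} = \frac{(S_{xy})^2}{S_{xx}},
\qquad
\text{SSE} = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 .
\]

SSR measures the variation explained by the regression line; SSE measures the unexplained (error) variation.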
30. The ANOVA Table
Source       df       SS         Mean Squares
Regression   1        SSR        MSR = SSR / 1
Error        n − 2    SSE        MSE = SSE / (n − 2)
Total        n − 1    Total SS
31. The F Test We can test the overall usefulness of the linear model using an F test. If the model is useful, MSR will be large compared to the unexplained variation, MSE.
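(A standard form of the test statistic, using the degrees of freedom from the ANOVA table above, is:)

\[
F \;=\; \frac{\text{MSR}}{\text{MSE}}
\;=\; \frac{\text{SSR}/1}{\text{SSE}/(n-2)} ,
\]

which under H0: β = 0 has an F distribution with 1 numerator and n − 2 denominator degrees of freedom; large values of F lead us to reject H0.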
32. Coefficient of Determination r² is the square of the correlation coefficient r.
r² is a number between zero and one, and a value close to zero suggests a poor model.
It gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.
A very high value of r² can arise even though the relationship between the two variables is non-linear. The fit of a model should never simply be judged from the r² value alone.
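(In terms of the ANOVA quantities, the standard expression is:)

\[
r^2 \;=\; \frac{\text{SSR}}{\text{Total SS}}
\;=\; 1 - \frac{\text{SSE}}{\text{Total SS}} .
\]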
33. Estimate of σ
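(The slide's formula is not preserved in this text; the usual estimate of σ, based on MSE from the ANOVA table, is:)

\[
s^2 \;=\; \text{MSE} \;=\; \frac{\text{SSE}}{n-2},
\qquad
s \;=\; \sqrt{\text{MSE}} .
\]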
34. Example: Age and Fatness
35. Example: Age and Fatness
36. Inference Concerning the Slope β
37. Sampling Distribution
38. Confidence Interval for β
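(The formulas on these two slides are not preserved in this text; the standard results for the least squares slope estimate b are:)

\[
b \;\sim\; N\!\left(\beta,\; \frac{\sigma}{\sqrt{S_{xx}}}\right),
\qquad
\frac{b-\beta}{s/\sqrt{S_{xx}}} \;\sim\; t_{n-2},
\qquad
\text{CI for } \beta:\;\; b \;\pm\; t_{\alpha/2,\,n-2}\,\frac{s}{\sqrt{S_{xx}}} .
\]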
39. Example: Age and Fatness
40. Hypothesis Tests Concerning β Step 1: Specify the null and alternative hypotheses
H0: β = β0 versus Ha: β ≠ β0 (two-sided test)
H0: β = β0 versus Ha: β > β0 (one-sided test)
H0: β = β0 versus Ha: β < β0 (one-sided test)
Step 2: Test statistic
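(The test statistic is not shown in this text; the standard form is:)

\[
t \;=\; \frac{b-\beta_0}{s/\sqrt{S_{xx}}} ,
\]

which has a t distribution with n − 2 degrees of freedom when H0 is true.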
41. Hypothesis Tests Concerning β Step 3: Find the p-value. Compute the observed value of the test statistic and find the p-value from the t distribution with n − 2 degrees of freedom.
42. Example: Age and Fatness
43. Interpreting a Significant Regression
44. Checking the Regression Assumptions
45. Residuals
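(A standard definition, consistent with the fitted line above, is:)

\[
e_i \;=\; y_i - \hat{y}_i \;=\; y_i - (a + b\,x_i),
\qquad i = 1,\dots,n .
\]

Plotting these residuals against the fitted values ŷi (or against xi), and on a normal probability plot, is the basis of the diagnostic tools on the next slides.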
46. Diagnostic Tools – Residual Plots
47. Normal Probability Plot
48. Residuals versus Fits
49. Estimation and Prediction
50. Estimation and Prediction
51. Estimation and Prediction
52. Estimation and Prediction
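(The interval formulas on these slides are not preserved in this text; the standard intervals at a given value x = x0 are:)

\[
\text{CI for the mean } E(y) \text{ at } x_0:\quad
\hat{y} \;\pm\; t_{\alpha/2,\,n-2}\, s\,
\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}},
\]
\[
\text{PI for a new observation } y \text{ at } x_0:\quad
\hat{y} \;\pm\; t_{\alpha/2,\,n-2}\, s\,
\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}},
\]

where ŷ = a + b x0. The extra 1 under the square root is why prediction intervals are always wider than the corresponding confidence intervals.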
53. Example: Age and Fatness
54. Example: Age and Fatness
55. Steps in Regression Analysis
56. Minitab Output
57. Key Concepts I. Scatter plot
Pattern and unusual observations
II. Correlation coefficient: a numerical measure of the strength and direction of the linear relationship between x and y.
Interpretation of r
58. Key Concepts III. A Linear Probabilistic Model
1. When the data exhibit a linear relationship, the appropriate model is y = α + βx + ε.
2. The random error ε has a normal distribution with mean 0 and variance σ².
IV. Method of Least Squares
1. Estimates a and b, for α and β, are chosen to minimize SSE, the sum of the squared deviations about the regression line.
2. The least squares estimates are b = Sxy / Sxx and a = ȳ − b x̄.
59. Key Concepts V. Analysis of Variance
1. Total SS = SSR + SSE, where Total SS = Syy and SSR = (Sxy)² / Sxx.
2. The best estimate of σ² is MSE = SSE / (n − 2).
VI. Testing, Estimation, and Prediction
1. A test for the significance of the linear regression, H0: β = 0, can be implemented using the test statistic t = b / (s / √Sxx), which has a t distribution with n − 2 degrees of freedom when H0 is true.
60. Key Concepts 2. The strength of the relationship between x and y can be measured using the coefficient of determination, r² = SSR / Total SS,
which gets closer to 1 as the relationship gets stronger.
3. Use residual plots to check for nonnormality, inequality of variances, and an incorrectly fit model.
4. Confidence intervals can be constructed to estimate the slope β of the regression line and to estimate the average value of y, for a given value of x.
5. Prediction intervals can be constructed to predict a particular observation, y, for a given value of x. For a given x, prediction intervals are always wider than confidence intervals.