Inference for Regression: Hypothesis Tests, Confidence & Prediction Intervals

Lecture 16 – Thurs, Oct. 30
• Inference for Regression (Sections 7.3-7.4):
  • Hypothesis Tests and Confidence Intervals for Intercept and Slope
  • Confidence Intervals for mean response
  • Prediction Intervals
• Next time: Robustness of least squares inferences, graphical tools for model assessment (8.1-8.3)

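Regression
• Goal of regression: Estimate the mean response of Y for subpopulations X=x, $\mu\{Y|X=x\}$
• Example: Y = neuron activity index, X = years playing stringed instrument
• Simple linear regression model: $\mu\{Y|X\} = \beta_0 + \beta_1 X$
• Estimate $\beta_0$ and $\beta_1$ by least squares: choose $\hat\beta_0$ and $\hat\beta_1$ to minimize the sum of squared residuals (prediction errors), $\sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$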
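The least squares estimates have a closed form, so the fit can be sketched in a few lines of Python. This is only an illustration: the helper name `least_squares` and the five data points are made up here and do not come from the lecture or the brain activity study.

```python
def least_squares(x, y):
    """Return (b0, b1), the intercept and slope minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


# Made-up illustrative data (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

The fitted line always passes through the point of sample averages, which is why the intercept is computed from the slope and the two means.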

Ideal Model
• Assumptions of the ideal simple linear regression model:
  • There is a normally distributed subpopulation of responses for each value of the explanatory variable.
  • The means of the subpopulations fall on a straight-line function of the explanatory variable.
  • The subpopulation standard deviations are all equal (to $\sigma$).
  • The selection of an observation from any of the subpopulations is independent of the selection of any other observation.

The standard deviation $\sigma$
• $\sigma$ is the standard deviation in each subpopulation.
• $\sigma$ measures the accuracy of predictions from the regression.
• If the simple linear regression model holds, then approximately
  • 68% of the observations will fall within $\sigma$ of the regression line
  • 95% of the observations will fall within $2\sigma$ of the regression line
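The 68% and 95% figures can be checked by simulation. The sketch below draws responses from normal subpopulations centered on a straight line and counts how many land within one and two standard deviations of that line. The values of `beta0`, `beta1`, and `sigma` are hypothetical, chosen only for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical parameter values, chosen only for illustration.
beta0, beta1, sigma = 2.0, 0.5, 1.5

n = 20000
within_1 = 0
within_2 = 0
for _ in range(n):
    x = random.uniform(0, 10)
    mean_y = beta0 + beta1 * x          # subpopulation mean at X = x
    y = random.gauss(mean_y, sigma)     # normal subpopulation with SD sigma
    distance = abs(y - mean_y)
    within_1 += distance <= sigma
    within_2 += distance <= 2 * sigma

frac_1 = within_1 / n
frac_2 = within_2 / n
print(f"within 1 SD: {frac_1:.3f}, within 2 SD: {frac_2:.3f}")  # roughly 0.68 and 0.95
```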

Estimating $\sigma$
• Residuals provide the basis for an estimate of $\sigma$: $\hat\sigma = \sqrt{\dfrac{\text{sum of squared residuals}}{n-2}}$
• Degrees of freedom for simple linear regression = n-2
• If the simple linear regression model holds, then approximately
  • 68% of the observations will fall within $\hat\sigma$ of the least squares line
  • 95% of the observations will fall within $2\hat\sigma$ of the least squares line
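As a rough sketch, the estimate of the subpopulation standard deviation can be computed from the residuals of the fitted line, dividing by the n-2 degrees of freedom. The function name `residual_sd` and the data are made up for illustration.

```python
import math


def residual_sd(x, y):
    """Fit the least squares line and return the residual SD on n - 2 degrees of freedom."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sum_sq = sum(r * r for r in residuals)
    return math.sqrt(sum_sq / (n - 2))  # divide by n - 2, not n


# Made-up illustrative data (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sigma_hat = residual_sd(x, y)
print(f"sigma_hat = {sigma_hat:.4f}")
```

Two degrees of freedom are lost because two parameters, the intercept and the slope, were estimated before the residuals could be formed.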

Inference for Simple Linear Regression
• Inference is based on the ideal simple linear regression model holding.
• Inference is based on taking repeated random samples ($y_1, \ldots, y_n$) from the same subpopulations ($x_1, \ldots, x_n$) as in the observed data.
• Types of inference:
  • Hypothesis tests for intercept and slope
  • Confidence intervals for intercept and slope
  • Confidence interval for mean of Y at X=X0
  • Prediction interval for future Y for which X=X0

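Hypothesis tests for $\beta_0$ and $\beta_1$
• Hypothesis test of $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$
  • Based on the t-test statistic, $t = \dfrac{\hat\beta_1}{SE(\hat\beta_1)}$, which has a t-distribution with n-2 degrees of freedom under the null hypothesis
  • The p-value has the usual interpretation: the probability under the null hypothesis that |t| would be at least as large as its observed value. A small p-value is evidence against the null hypothesis.
• Hypothesis test of $H_0: \beta_0 = 0$ vs. $H_a: \beta_0 \neq 0$ is based on an analogous test statistic.
• Test statistics and p-values can be found in the JMP output under Parameter Estimates, obtained by using Fit Line after Fit Y by X.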
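For readers without JMP, the slope test can be sketched in plain Python. The standard error used is $SE(\hat\beta_1) = \hat\sigma / \sqrt{\sum (X_i - \bar X)^2}$, and the two-sided p-value is obtained by numerically integrating the t density with Simpson's rule, which is an implementation shortcut chosen here to avoid external libraries. All function names and the data are made up for illustration.

```python
import math


def t_density(u, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + u * u / df))


def two_sided_p_value(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area = total * h / 3          # P(0 <= T <= |t|)
    return max(0.0, 1 - 2 * area)


def slope_t_test(x, y):
    """Return (t statistic, two-sided p-value) for H0: slope = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sum_sq = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sum_sq / (n - 2))
    se_b1 = sigma_hat / math.sqrt(sxx)
    t = b1 / se_b1
    return t, two_sided_p_value(t, n - 2)


# Made-up illustrative data (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

t_stat, p_value = slope_t_test(x, y)
print(f"t = {t_stat:.3f}, two-sided p-value = {p_value:.3f}")
```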

JMP output for example

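Confidence Intervals for $\beta_0$ and $\beta_1$
• Confidence intervals provide a range of plausible values for $\beta_0$ and $\beta_1$
• 95% Confidence Intervals:
  • $\hat\beta_0 \pm t_{n-2}(.975)\, SE(\hat\beta_0)$
  • $\hat\beta_1 \pm t_{n-2}(.975)\, SE(\hat\beta_1)$
• Finding CIs in JMP: Can find $\hat\beta_0$, $\hat\beta_1$ and their standard errors under Parameter Estimates after fitting the line. Can find $t_{n-2}(.975)$ in Table A.2.
• For brain activity study, CIs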
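A minimal sketch of the interval arithmetic follows. It assumes the t multiplier is looked up in a table, as the slide suggests, and passed in as `t_mult`. The function name `coefficient_cis` and the data are made up for illustration. The standard errors used are $SE(\hat\beta_1) = \hat\sigma/\sqrt{\sum (X_i-\bar X)^2}$ and $SE(\hat\beta_0) = \hat\sigma\sqrt{1/n + \bar X^2/\sum (X_i-\bar X)^2}$.

```python
import math


def coefficient_cis(x, y, t_mult):
    """Return confidence intervals for the intercept and slope as (low, high) tuples."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sum_sq = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sum_sq / (n - 2))
    se_b1 = sigma_hat / math.sqrt(sxx)
    se_b0 = sigma_hat * math.sqrt(1 / n + x_bar ** 2 / sxx)
    return {
        "intercept": (b0 - t_mult * se_b0, b0 + t_mult * se_b0),
        "slope": (b1 - t_mult * se_b1, b1 + t_mult * se_b1),
    }


# Made-up illustrative data (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# 97.5th percentile of the t-distribution with n - 2 = 3 degrees of freedom,
# read from a t table.
t_mult = 3.182

cis = coefficient_cis(x, y, t_mult)
print("95% CI for intercept:", cis["intercept"])
print("95% CI for slope:", cis["slope"])
```

With only five observations the slope interval includes zero, which agrees with a two-sided slope test that fails to reject at the 5% level.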

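Confidence Intervals for Mean of Y at X=X0
• What is a plausible range of values for $\mu\{Y|X_0\}$?
• 95% CI for $\mu\{Y|X_0\}$: $\hat\mu\{Y|X_0\} \pm t_{n-2}(.975)\, SE[\hat\mu\{Y|X_0\}]$, where
  • $\hat\mu\{Y|X_0\} = \hat\beta_0 + \hat\beta_1 X_0$,
  • $SE[\hat\mu\{Y|X_0\}] = \hat\sigma \sqrt{\dfrac{1}{n} + \dfrac{(X_0 - \bar X)^2}{\sum_{i=1}^{n} (X_i - \bar X)^2}}$
• Note about formula
  • Precision in estimating $\mu\{Y|X_0\}$ is not constant for all values of X0. Precision decreases as X0 gets farther away from the sample average of the X's.
• JMP implementation: Use the Confid Curves Fit command under the red triangle next to Linear Fit after using Fit Y by X, Fit Line. Use the crosshair tool to find the exact values of the confidence interval endpoints for a given X0.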
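The loss of precision away from the sample average can be seen numerically. The sketch below computes the interval for the mean response at two values of X0, one at the sample average and one at the edge of the data. The function name `ci_for_mean`, the data, and the table-lookup multiplier `t_mult` are illustrative assumptions.

```python
import math


def ci_for_mean(x, y, x0, t_mult):
    """Return (low, high) confidence limits for the mean response at X = x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sum_sq = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sum_sq / (n - 2))
    mean_hat = b0 + b1 * x0
    # The standard error grows with the distance of x0 from the sample average.
    se_mean = sigma_hat * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    return mean_hat - t_mult * se_mean, mean_hat + t_mult * se_mean


# Made-up illustrative data (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_mult = 3.182  # t table value for 3 degrees of freedom, 97.5th percentile

ci_center = ci_for_mean(x, y, 3, t_mult)  # x0 equal to the sample average
ci_edge = ci_for_mean(x, y, 5, t_mult)    # x0 at the edge of the data
print("95% CI at X0 = 3:", ci_center)
print("95% CI at X0 = 5:", ci_edge)
```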

Prediction Intervals
• What are likely values for a future value Y0 at some specified value of X (=X0)?
• The best single prediction of a future response at X0 is the estimated mean response: $\text{Pred}\{Y|X_0\} = \hat\mu\{Y|X_0\} = \hat\beta_0 + \hat\beta_1 X_0$
• A prediction interval is an interval of likely values along with a measure of the likelihood that the interval will contain the response.
• 95% prediction interval for X0: If repeated samples are obtained from the subpopulations and a prediction interval is formed each time, the prediction interval will contain the value of Y0 for a future observation from the subpopulation X0 95% of the time.

Prediction Intervals Cont.
• Prediction interval must account for two sources of uncertainty:
  • Uncertainty about the location of the subpopulation mean
  • Uncertainty about where the future value will be in relation to its mean
• Prediction Error = Random Sampling Error + Estimation Error

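Prediction Interval Formula
• 95% prediction interval at X0: $\hat\mu\{Y|X_0\} \pm t_{n-2}(.975) \sqrt{\hat\sigma^2 + SE[\hat\mu\{Y|X_0\}]^2}$
• Compare to 95% CI for mean at X0: $\hat\mu\{Y|X_0\} \pm t_{n-2}(.975)\, SE[\hat\mu\{Y|X_0\}]$, where $SE[\hat\mu\{Y|X_0\}]$ is the standard error of the estimated mean response
  • Prediction interval is wider due to random sampling error in the future response
  • As sample size n becomes large, the margin of error of the CI for the mean goes to zero but the margin of error of the PI doesn't.
• JMP implementation: Use the Confid Curves Indiv command under the red triangle next to Linear Fit after using Fit Y by X, Fit Line. Use the crosshair tool to find the exact values of the prediction interval endpoints for a given X0.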
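To make the comparison concrete, the sketch below computes both margins of error at the same X0. The prediction margin adds the estimated subpopulation variance to the squared standard error of the estimated mean. The function name `margins_at`, the data, and the table-lookup multiplier `t_mult` are illustrative assumptions.

```python
import math


def margins_at(x, y, x0, t_mult):
    """Return (prediction, CI margin, PI margin) at X = x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sum_sq = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sum_sq / (n - 2))
    se_mean = sigma_hat * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    # Estimation error plus random sampling error of the future response.
    se_pred = math.sqrt(sigma_hat ** 2 + se_mean ** 2)
    return b0 + b1 * x0, t_mult * se_mean, t_mult * se_pred


# Made-up illustrative data (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_mult = 3.182  # t table value for 3 degrees of freedom, 97.5th percentile

pred, ci_margin, pi_margin = margins_at(x, y, 3, t_mult)
print(f"prediction at X0 = 3: {pred:.2f}")
print(f"95% CI for mean:      +/- {ci_margin:.3f}")
print(f"95% prediction int.:  +/- {pi_margin:.3f}")
```

Because the estimated subpopulation variance does not shrink as more data are collected, the prediction margin stays bounded away from zero even when the margin for the mean becomes negligible.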

Example
• A building maintenance company is planning to submit a bid on a contract to clean 40 corporate offices scattered throughout an office complex. The costs incurred by the maintenance company are proportional to the number of crews needed for this task. Currently the company has 11 crews. Will 11 crews be enough?
• Recent data are available for the number of rooms that were cleaned by varying numbers of crews. The data are in cleaning.jmp.
• Assuming a simple linear regression model holds, which is more relevant for answering the question of interest: a confidence interval for the mean number of rooms cleaned by 11 crews or a prediction interval for the number of rooms cleaned on a particular day by 11 crews?

Correlation
• Section 7.5.4
• Correlation is a measure of the degree of linear association between two variables X and Y. For each unit in the population, both X and Y are measured.
• Population correlation = $\rho = \dfrac{\text{Mean}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$
• Correlation is between –1 and 1. A correlation of 0 indicates no linear association. Correlations near +1 indicate strong positive linear association; correlations near –1 indicate strong negative linear association.

Correlation and Regression
• Features of correlation
  • Dimension-free. Units of X and Y don’t matter.
  • Symmetric in X and Y. There is no “response” and “explanatory” variable.
  • Correlation only measures degree of linear association. It is possible for there to be an exact relationship between X and Y and yet for the sample correlation coefficient to be zero.
  • Correlation in JMP: Click Multivariate and put variables in Y, Columns.
• Connection to regression
  • Test of slope $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$ is identical to test of $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$. Test of the correlation coefficient only makes sense if the pairs (X,Y) are randomly sampled from the population.
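Two of the points above can be checked numerically. The sketch first confirms that the t statistic computed from the sample correlation r, using the standard identity $t = r\sqrt{n-2}/\sqrt{1-r^2}$, matches the slope t statistic. It then shows an exact quadratic relationship whose sample correlation is zero. Function names and data are made up for illustration.

```python
import math


def correlation(x, y):
    """Sample correlation coefficient of paired observations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def slope_t(x, y):
    """t statistic for H0: slope = 0 in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sum_sq = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sum_sq / (n - 2))
    return b1 / (sigma_hat / math.sqrt(sxx))


# Made-up illustrative data (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
t_from_r = r * math.sqrt(len(x) - 2) / math.sqrt(1 - r * r)
t_from_slope = slope_t(x, y)
print(f"r = {r:.4f}, t from r = {t_from_r:.4f}, t from slope = {t_from_slope:.4f}")

# Exact relationship Y = X^2 on symmetric X values: no linear association.
x_sym = [-2, -1, 0, 1, 2]
y_sym = [xi ** 2 for xi in x_sym]
r_curved = correlation(x_sym, y_sym)
print(f"correlation for Y = X^2: {r_curved:.4f}")
```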

Correlation in JMP
