Class 10: Tuesday, Oct. 12 • Hurricane data set, review of confidence intervals and hypothesis tests • Confidence intervals for mean response • Prediction intervals • Transformations • Upcoming: • Thursday: Finish transformations, Example Regression Analysis • Tuesday: Review for midterm • Thursday: Midterm • Fall Break!
Hurricane Data • Is there a trend in the number of hurricanes in the Atlantic over time (possibly an increase because of global warming)? • hurricane.JMP contains data on the number of hurricanes in the Atlantic basin from 1950-1997.
Inferences for Hurricane Data • Residual plots and normal quantile plots indicate that the assumptions of linearity, constant variance and normality in the simple linear regression model are reasonable. • 95% confidence interval for slope (change in mean hurricanes between year t and year t+1): (-0.086, 0.012) • Hypothesis test of the null hypothesis that the slope equals zero: test statistic = -1.52, p-value = 0.13. We fail to reject the null hypothesis since the p-value > 0.05. No evidence of a trend in hurricanes from 1950-1997.
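The slope test above can be sketched as follows. This is a minimal illustration on synthetic counts (not the real hurricane.JMP data), so the numbers it prints will not match the slide; the steps (fit, standard error, t statistic, 95% CI) are the same.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the annual hurricane counts, 1950-1997
# (generated with no built-in trend; NOT the hurricane.JMP data)
rng = np.random.default_rng(0)
year = np.arange(1950, 1998)
hurricanes = rng.poisson(6, size=year.size)

# Simple linear regression of count on year
res = stats.linregress(year, hurricanes)

# 95% CI for the slope: estimate +/- t-critical * SE, df = n - 2
t_crit = stats.t.ppf(0.975, df=year.size - 2)
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)

print(f"slope = {res.slope:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"t = {res.slope / res.stderr:.2f}, p-value = {res.pvalue:.3f}")
```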
Scale for interpreting p-values: • A large p-value is not strong evidence in favor of H0; it only shows that there is not strong evidence against H0.
Inference in Regression • Confidence intervals for slope • Hypothesis test for slope • Confidence intervals for mean response • Prediction intervals
Car Price Example • A used-car dealer wants to understand how odometer reading affects the selling price of used cars. • The dealer randomly selects 100 three-year-old Ford Tauruses that were sold at auction during the past month. Each car was in top condition and equipped with automatic transmission, AM/FM cassette tape player and air conditioning. • carprices.JMP contains the price and number of miles on the odometer of each car.
The used-car dealer has an opportunity to bid on a lot of cars offered by a rental company. The rental company has 250 Ford Tauruses, all equipped with automatic transmission, air conditioning and AM/FM cassette tape players. All of the cars in this lot have about 40,000 miles on the odometer. The dealer would like an estimate of the average selling price of all cars of this type with 40,000 miles on the odometer, i.e., E(Y|X=40,000). • The least squares estimate is b0 + b1(40,000), the height of the least squares line at X = 40,000.
Confidence Interval for Mean Response • Confidence interval for E(Y|X=40,000): A range of plausible values for E(Y|X=40,000) based on the sample. • Approximate 95% confidence interval: (b0 + b1·X0) ± 2·SE, where SE = RMSE × sqrt(1/n + (X0 − X̄)²/Σ(Xi − X̄)²). • Notes about formula for SE: Standard error becomes smaller as sample size n increases; standard error is smaller the closer X0 is to X̄. • In JMP, after Fit Line, click red triangle next to Linear Fit and click Confid Curves Fit. Use the crosshair tool by clicking Tools, Crosshair to find the exact values of the confidence interval endpoints for a given X0.
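The SE formula above can be computed by hand. The sketch below uses invented numbers in place of the carprices.JMP data (intercept, slope and noise level are made up), so only the structure of the calculation carries over.

```python
import numpy as np

# Made-up stand-in for carprices.JMP: 100 cars, price vs. odometer reading
rng = np.random.default_rng(1)
odometer = rng.uniform(20_000, 60_000, size=100)
price = 17_000 - 0.06 * odometer + rng.normal(0, 300, size=100)

# Least squares fit by hand
n = odometer.size
b1 = np.cov(odometer, price, ddof=1)[0, 1] / np.var(odometer, ddof=1)
b0 = price.mean() - b1 * odometer.mean()
resid = price - (b0 + b1 * odometer)
s = np.sqrt(np.sum(resid**2) / (n - 2))  # root mean square error

# SE of the estimated mean response at X0 = 40,000
x0 = 40_000
se_mean = s * np.sqrt(1/n + (x0 - odometer.mean())**2
                      / np.sum((odometer - odometer.mean())**2))
fit = b0 + b1 * x0
print(f"estimated E(Y|X=40,000): {fit:.0f} +/- {2*se_mean:.0f}")
```

Note how the (X0 − X̄)² term makes the interval widen as X0 moves away from the center of the observed odometer readings.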
A Prediction Problem • The used-car dealer is offered a particular 3-year old Ford Taurus equipped with automatic transmission, air conditioner and AM/FM cassette tape player and with 40,000 miles on the odometer. The dealer would like to predict the selling price of this particular car. • Best prediction based on least squares estimates: Ŷ = b0 + b1(40,000).
Range of Selling Prices for Particular Car • The dealer is interested in the range of selling prices that this particular car with 40,000 miles on it is likely to have. • Under the simple linear regression model, Y|X follows a normal distribution with mean β0 + β1X and standard deviation σe. A car with 40,000 miles on it will have a selling price in the interval β0 + β1(40,000) ± 2σe about 95% of the time. • Class 5: We substituted the least squares estimates b0, b1, RMSE for β0, β1, σe and said a car with 40,000 miles on it will be in the interval (b0 + b1(40,000)) ± 2·RMSE about 95% of the time. This is a good approximation, but it ignores potential error in the least squares estimates.
Prediction Interval • 95% Prediction Interval: An interval that has approximately a 95% chance of containing the value of Y for a particular unit with X=X0, where the particular unit is not in the original sample. • Approximate 95% prediction interval: (b0 + b1·X0) ± 2·RMSE × sqrt(1 + 1/n + (X0 − X̄)²/Σ(Xi − X̄)²). • In JMP, after Fit Line, click red triangle next to Linear Fit and click Confid Curves Indiv. Use the crosshair tool by clicking Tools, Crosshair to find the exact values of the prediction interval endpoints for a given X0.
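The contrast between the two intervals can be seen side by side. This sketch reuses the same invented car-price numbers as above (not carprices.JMP); the only change from the mean-response SE is the extra "1 +" under the square root, which accounts for the variability of a single new car around the line.

```python
import numpy as np

# Same invented stand-in for carprices.JMP as in the mean-response sketch
rng = np.random.default_rng(1)
odometer = rng.uniform(20_000, 60_000, size=100)
price = 17_000 - 0.06 * odometer + rng.normal(0, 300, size=100)

n = odometer.size
b1 = np.cov(odometer, price, ddof=1)[0, 1] / np.var(odometer, ddof=1)
b0 = price.mean() - b1 * odometer.mean()
s = np.sqrt(np.sum((price - (b0 + b1 * odometer))**2) / (n - 2))

x0 = 40_000
extra = 1/n + (x0 - odometer.mean())**2 / np.sum((odometer - odometer.mean())**2)
se_mean = s * np.sqrt(extra)       # SE for the mean response E(Y|X0)
se_pred = s * np.sqrt(1 + extra)   # SE for one new observation at X0

print(f"CI half-width (mean response): {2*se_mean:.0f}")
print(f"PI half-width (one new car):   {2*se_pred:.0f}")
```

The prediction interval is always wider than the confidence interval for the mean response, and it never shrinks below about ±2·RMSE no matter how large n gets.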
A Violation of Linearity • Y = Life Expectancy in 1999 • X = Per Capita GDP (in US Dollars) in 1999 • Data in gdplife.JMP • The linearity assumption of simple linear regression is clearly violated. The increase in mean life expectancy for each additional dollar of GDP is less for large GDPs than for small GDPs: decreasing returns to increases in GDP.
Transformations • Violation of linearity: E(Y|X) is not a straight line. • Transformations: Perhaps E(f(Y)|g(X)) is a straight line, where f(Y) and g(X) are transformations of Y and X, and a simple linear regression model holds for the response variable f(Y) and explanatory variable g(X).
The mean of Life Expectancy | Log Per Capita GDP appears to be approximately a straight line.
How do we use the transformation? • Testing for association between Y and X: If the simple linear regression model holds for f(Y) and g(X), then Y and X are associated if and only if the slope in the regression of f(Y) on g(X) does not equal zero. The p-value for the test that the slope is zero is <.0001: strong evidence that per capita GDP and life expectancy are associated. • Prediction and mean response: What would you predict the life expectancy to be for a country with a per capita GDP of $20,000?
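Mechanically, the prediction works by plugging the transformed X into the fitted line. The coefficients below are invented for illustration (they are not the gdplife.JMP estimates); the point is only that the fitted equation is on the Log X scale, so we transform X = 20,000 before plugging in.

```python
import numpy as np

# Hypothetical fitted line: Life Expectancy = b0 + b1 * log(Per Capita GDP)
# (b0 and b1 below are made-up values, not the actual gdplife.JMP estimates)
b0, b1 = 10.0, 6.0

gdp = 20_000
predicted_life_exp = b0 + b1 * np.log(gdp)  # transform X, then use the line
print(f"predicted life expectancy at GDP $20,000: {predicted_life_exp:.1f} years")
```

Since only X is transformed here, the prediction is already on the original Y scale; if Y had been transformed too, we would also need to back-transform the result.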
How do we choose a transformation? • Tukey’s Bulging Rule. • See Handout. • Match curvature in data to the shape of one of the curves drawn in the four quadrants of the figure in the handout. Then use the associated transformations, selecting one for either X, Y or both.
Transformations in JMP • Use Tukey’s Bulging rule (see handout) to determine transformations which might help. • After Fit Y by X, click red triangle next to Bivariate Fit and click Fit Special. Experiment with transformations suggested by Tukey’s Bulging rule. • Make residual plots of the residuals for transformed model vs. the original X by clicking red triangle next to Transformed Fit to … and clicking plot residuals. Choose transformations which make the residual plot have no pattern in the mean of the residuals vs. X. • Compare different transformations by looking for transformation with smallest root mean square error on original y-scale. If using a transformation that involves transforming y, look at root mean square error for fit measured on original scale.
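The comparison in the last bullet can be sketched outside JMP. The code below fits several candidate transformations to synthetic data (generated from a log relationship as a stand-in for gdplife.JMP) and computes each fit's root mean square error on the original y-scale, back-transforming predictions when y itself was transformed.

```python
import numpy as np

# Synthetic stand-in for gdplife.JMP: y depends on log(x) plus noise
rng = np.random.default_rng(2)
x = rng.uniform(500, 40_000, size=150)
y = 10 + 6 * np.log(x) + rng.normal(0, 3, size=150)

def rmse_original_scale(gx, y, fy=lambda v: v, back=lambda v: v):
    """Fit f(y) = b0 + b1*g(x) by least squares; report RMSE on the y-scale."""
    ty = fy(y)
    b1 = np.cov(gx, ty, ddof=1)[0, 1] / np.var(gx, ddof=1)
    b0 = ty.mean() - b1 * gx.mean()
    yhat = back(b0 + b1 * gx)  # back-transform predictions if f(y) != y
    return np.sqrt(np.mean((y - yhat)**2))

rmse_lin  = rmse_original_scale(x, y)
rmse_logx = rmse_original_scale(np.log(x), y)
rmse_sqrt = rmse_original_scale(np.sqrt(x), y)
rmse_logy = rmse_original_scale(x, y, fy=np.log, back=np.exp)

print(f"untransformed: {rmse_lin:.2f}")
print(f"log x:         {rmse_logx:.2f}")
print(f"sqrt x:        {rmse_sqrt:.2f}")
print(f"log y:         {rmse_logy:.2f}")
```

Because the RMSEs are all measured on the same (original) y-scale, they are directly comparable across transformations, which is exactly why the slide insists on that scale.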
By looking at the root mean square error on the original y-scale, we see that all of the transformations improve upon the untransformed model and that the transformation to Log X is by far the best.
The transformation to Log X appears to have mostly removed a trend in the mean of the residuals. This means that the linearity assumption is approximately satisfied for the regression of Life Expectancy on Log X. There is still a problem of nonconstant variance.