Regression

Presentation Transcript


  1. Regression FPP 10 kind of

  2. Plan of attack • Introduce regression model • Correctly interpret intercept and slope • Prediction • Pitfalls to avoid

  3. Regression line • The correlation coefficient is a nice numerical summary of the association between two quantitative variables • It indicates the direction and strength of the association • But does it quantify the association? • It would be of interest to do this for • Predictions • Understanding phenomena

  4. Regression line • Correlation measures the direction and strength of the straight-line (linear) relationship between two quantitative variables • If a scatter plot shows a linear relationship, we would like to summarize this overall pattern by drawing a line on the scatter plot • This line represents a mathematical model. Later we will make the mathematical model a statistical one.

  5. Slope intercept form review

  6. Regression line • Slope intercept form notation • Regression form notation
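
The two notations on this slide were images and did not survive in the transcript. A plausible reconstruction, assuming the usual algebra symbols m (slope) and b (intercept) for the first form, and the symbols α (intercept) and β (slope) that slides 10 and 12 use for the second:

```latex
% Slope-intercept form (assumed algebra notation)
y = m x + b

% Regression form: \hat{y} is the predicted value of y at a given x
\hat{y} = \alpha + \beta x
```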

  7. Regression • Price of Homes Based on Square Feet • Price = -90.2458 + 0.1598 × SQFT • r = 0.8718945

  8. Which line is best? • Price = -90.2458 + 0.1598 × SQFT (red) • Price = -300 + 0.3 × SQFT (blue) • Price = 0 + 0.1 × SQFT (green)

  9. Which model to use • Different people might draw different lines by eye on a scatterplot • What are some ways we can determine which model (line) out of all the possible models (lines) is the “best” one? • What are some ways that we can numerically rank the different models (i.e. the different lines)? • This will come later in the course
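
One common way to rank candidate lines numerically is the sum of squared residuals, where smaller is better. The slides defer their own answer to later in the course, so treat this criterion as an assumption. The sketch below scores the three lines from slide 8 against a handful of invented (sqft, price) pairs. The data are hypothetical and are not the housing data behind the slides.

```python
# Hypothetical (sqft, price in $1000s) pairs, invented for illustration only.
homes = [(1000, 74.6), (1500, 144.5), (2000, 232.4), (2500, 306.3), (3000, 389.2)]

# The three candidate lines from slide 8, stored as (intercept, slope).
lines = {
    "red": (-90.2458, 0.1598),
    "blue": (-300.0, 0.3),
    "green": (0.0, 0.1),
}

def sum_squared_residuals(intercept, slope, data):
    """Sum of (observed - predicted)^2 over all data points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in data)

scores = {name: sum_squared_residuals(a, b, homes) for name, (a, b) in lines.items()}
best = min(scores, key=scores.get)

# Print the lines from best to worst fit under this criterion.
for name in sorted(scores, key=scores.get):
    print(f"{name}: {scores[name]:.1f}")
```

With real data the ranking could of course differ; the point is only that each line receives a single number that can be compared.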

  10. Slope interpretation • The slope, β, of a regression line is almost always important for interpreting the data. • The slope is a rate of change. It is the mean amount of change in y-hat when x increases by 1

  11. Slope interpretation • Price of Homes Based on Square Feet • Price = -90.2458 + 0.1598 × SQFT (price in thousands of dollars) • r = 0.8718945 • For every 1 sqft increase in the size of a home, the house price increases by $159.80 on average

  12. Intercept interpretation • The intercept, α, of the regression line is the value of y-hat when x = 0. Although we need the value of the intercept to draw the line, it is statistically meaningful only when x can actually take values close to zero.

  13. Intercept interpretation • Price of Homes Based on Square Feet • Price = -90.2458 + 0.1598 × SQFT • r = 0.8718945 • If the size of a home were 0 sqft, the predicted house price would be -$90,245.80 • This doesn’t make much sense here because x (sqft) doesn’t take on values close to zero.

  14. Prediction • Price of Homes Based on Square Feet • Price = -90.2458 + 0.1598 × SQFT • r = 0.8718945 • For a 3500 sqft home we would predict the selling price to be price = -90.2458 + 0.1598 × 3500 = 469.0542 (thousands of dollars), i.e. $469,054.20
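
The prediction arithmetic can be checked with a short sketch. It assumes price is measured in thousands of dollars, which is inferred from the dollar amounts quoted on the slides rather than stated there. The function and constant names are illustrative.

```python
# Fitted line from the slides: price (in $1000s) = -90.2458 + 0.1598 * sqft
INTERCEPT = -90.2458   # alpha, in thousands of dollars (assumed unit)
SLOPE = 0.1598         # beta, thousands of dollars per square foot

def predict_price_thousands(sqft):
    """Predicted selling price in thousands of dollars."""
    return INTERCEPT + SLOPE * sqft

price_3500 = predict_price_thousands(3500)
print(f"${price_3500 * 1000:,.2f}")   # prints $469,054.20

# Slope check: one extra square foot changes the prediction by SLOPE.
change_per_sqft = predict_price_thousands(3501) - predict_price_thousands(3500)
```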

  15. OECD data: Income and unemployment in the U.S. • What is the relationship between households’ disposable income and the nation’s unemployment rate? • Data from the U.S. 1980 to 1998 • (data provided by the economics department at Duke)

  16. Disposable income vs unemployment rates

  17. Disposable income and unemployment rates regression output

  18. Facts about regression • There is a close relationship between the correlation coefficient and the slope of a regression line • They have the same sign • They are proportional to each other • The intercept has no relationship with the correlation coefficient but here is the formula
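
The formula on this slide was an image and is missing from the transcript. For reference, the standard least-squares expressions are given below. The symbols s_x and s_y (sample standard deviations) and x-bar and y-bar (sample means) are assumed notation, not taken from the slide.

```latex
% Slope: proportional to r, with the same sign because s_x and s_y are positive
\beta = r \, \frac{s_y}{s_x}

% Intercept: chosen so the line passes through the point of means
\alpha = \bar{y} - \beta \, \bar{x}
```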

  19. Facts about regression • The distinction between explanatory and response variable is essential in regression • If you have a slope computed using x as the explanatory and y as the response variable you can’t “back solve” to get a slope and intercept for the regression model with x being the response and y the explanatory variables. • If you want to predict x given a y then you must find the intercept and slope with y being the explanatory variable and x being the response
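
A small numerical check makes the "no back-solving" point concrete. The data below are invented, and the helper `least_squares` is a hypothetical name for a plain least-squares fit written out by hand.

```python
# Hypothetical paired data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]

def least_squares(explanatory, response):
    """Return (intercept, slope) of the least-squares line of response on explanatory."""
    n = len(explanatory)
    mean_e = sum(explanatory) / n
    mean_r = sum(response) / n
    s_er = sum((e - mean_e) * (r - mean_r) for e, r in zip(explanatory, response))
    s_ee = sum((e - mean_e) ** 2 for e in explanatory)
    slope = s_er / s_ee
    return mean_r - slope * mean_e, slope

a_yx, b_yx = least_squares(xs, ys)   # predict y from x
a_xy, b_xy = least_squares(ys, xs)   # predict x from y

# Algebraically inverting y = a + b*x would give slope 1/b for x on y.
back_solved_slope = 1 / b_yx

print(f"slope of x on y, refitted:    {b_xy:.4f}")
print(f"slope of x on y, back-solved: {back_solved_slope:.4f}")
```

The two slopes agree only when the points lie exactly on a line. Otherwise the refitted slope is smaller in magnitude than the back-solved one.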

  20. Facts about regression • R² (the coefficient of determination) provides a one-number summary of how well the regression line fits the data • R² is the proportion of the variation in the y values explained by the regression line, often quoted as a percentage • R² lies between 0 and 1 • Values near 1 indicate the regression predicts the y’s in the data set very closely • Values near 0 indicate the regression does not predict the y’s in the data set very closely

  21. Facts about regression • Example: • The correlation coefficient between sale price and square feet was r = 0.8718945 • Thus the coefficient of determination is R² = (0.8719)² ≈ 0.76 • So 76% of the variability in sale price is explained by (taken into account by) the regression line with square feet.
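
The arithmetic in this example is a one-liner, shown here as a sketch that squares the correlation reported on the slides. Variable names are illustrative.

```python
r = 0.8718945        # correlation between sale price and square feet (from the slides)
r_squared = r ** 2   # coefficient of determination for a single-predictor line

print(f"R^2 = {r_squared:.4f}, i.e. about {r_squared:.0%} of the variability in price")
```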

  22. Does regression fit data well? • A regression line is reasonable if • The association between the two variables is indeed linear • The points are randomly scattered around the line • The income/unemployment rate data are well described by a regression line.

  23. Regression of AIDS rates per 1000 people on GNP per capita • Line is too low for GNP values near zero and too high for large GNP values. • We shouldn’t use the line for predictions

  24. Changing the response variable • When the regression line fits the data badly, sometimes you can transform variables to obtain a better fitting line. • With monetary variables, typically this can be accomplished by taking logarithms.

  25. Regression of log(AIDS) on log(GNP) • Much better fit • Predict log(AIDS) from log(GNP). Exponentiate to estimate AIDS
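
The fit-on-logs, then exponentiate workflow can be sketched as follows. The data are invented to follow y = 2·x^1.5 exactly, so they are not the AIDS/GNP values from the slides, and the name `predict_y` is hypothetical.

```python
import math

# Hypothetical data following y = 2 * x**1.5 exactly, invented for illustration.
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [2.0 * x ** 1.5 for x in xs]

log_x = [math.log(x) for x in xs]
log_y = [math.log(y) for y in ys]

# Least-squares fit of log(y) on log(x).
n = len(log_x)
mean_lx = sum(log_x) / n
mean_ly = sum(log_y) / n
s_xy = sum((lx - mean_lx) * (ly - mean_ly) for lx, ly in zip(log_x, log_y))
s_xx = sum((lx - mean_lx) ** 2 for lx in log_x)
slope = s_xy / s_xx
intercept = mean_ly - slope * mean_lx

def predict_y(x):
    """Predict on the log scale, then exponentiate back to the original scale."""
    return math.exp(intercept + slope * math.log(x))

print(f"slope on the log-log scale: {slope:.3f}")
print(f"prediction at x = 9: {predict_y(9.0):.2f}")
```

With noisy real data, the exponentiated prediction is an estimate on the original scale rather than an exact recovery as it is here.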

  26. Birth and death rates in 74 countries

  27. Warnings about regression • Predicting y at values of x beyond the range of x in the data is called extrapolation • This is risky, because we have no evidence to believe that the association between x and y remains linear for unseen x values • Extrapolated predictions can be absolutely wrong
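
One simple safeguard against accidental extrapolation is to refuse predictions outside the observed range of x. The sketch below applies that idea to the housing line. The observed square-footage range is a made-up placeholder because the slides do not report it, and `predict_in_range` is a hypothetical helper.

```python
def predict_in_range(x, intercept, slope, x_min, x_max):
    """Predict y only when x lies inside the observed range of x values."""
    if not (x_min <= x <= x_max):
        raise ValueError(f"x={x} is outside the observed range [{x_min}, {x_max}]")
    return intercept + slope * x

# Placeholder range: assumed for illustration, not taken from the slides.
OBSERVED_MIN_SQFT, OBSERVED_MAX_SQFT = 800, 4000

# Inside the assumed range, so the prediction goes through.
price = predict_in_range(3500, -90.2458, 0.1598, OBSERVED_MIN_SQFT, OBSERVED_MAX_SQFT)

# A 0 sqft "home" is outside the range, so the call is rejected.
try:
    predict_in_range(0, -90.2458, 0.1598, OBSERVED_MIN_SQFT, OBSERVED_MAX_SQFT)
except ValueError as err:
    print(err)
```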

  28. Extrapolation • Diamond price and carat • Explanatory variable is size in carats and response variable is price in dollars • Predict the price of the Hope Diamond

  29. Extrapolation • The relationship between diamond carat and price doesn’t remain linear after a carat size of about 0.4

  30. Extrapolation • Green line is a linear fit using only diamonds of less than 0.4 carats • Blue line is a linear fit using all carat sizes • Red curve is a quadratic fit

  31. Lurking variable • A variable not being considered could be driving the relationship • In practice this is a difficult issue to tackle, especially when everything seems OK

  32. Influential point • An outlier in either the X or Y direction which, if removed, would markedly change the value of the slope and y-intercept. • applet
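
In place of the applet, a few lines of code can show the effect. The points below are invented: five lie exactly on y = x and one sits far out in the x direction. The helper name `fit_line` is hypothetical.

```python
def fit_line(points):
    """Return (intercept, slope) of the least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope

# Hypothetical data: five points on y = x, plus one point far out in x.
base_points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
outlier = (20, 0)

intercept_without, slope_without = fit_line(base_points)
intercept_with, slope_with = fit_line(base_points + [outlier])

print(f"without outlier: slope = {slope_without:.3f}, intercept = {intercept_without:.3f}")
print(f"with outlier:    slope = {slope_with:.3f}, intercept = {intercept_with:.3f}")
```

Here the single added point is enough to flip the sign of the slope, which is what makes it influential rather than merely unusual.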

  33. Causality • On its own, regression only quantifies an association between x and y • It does not prove causality • Under a carefully designed experiment (or in some cases observational studies) regression can be used to show causality.
