
Linear Regression



  1. Linear Regression CSC 576: Data Mining

  2. Today… • Linear Regression

  3. Advertising Dataset
  https://www.kaggle.com/sazid28/advertising.csv

  import pandas as pd
  advertising = pd.read_csv('../datasets/Advertising.csv')
  advertising.head(5)

  4. Simple Linear Regression Model for Advertising Dataset

  5. Advertising Dataset • Scatter plot visualization for TV and Sales.

  %matplotlib inline
  advertising.plot.scatter(x='TV', y='Sales');

  6. Advertising Dataset • Simple Linear Model in Python (using pandas and scikit-learn): • Predictor: x (TV) • Response: y (Sales)

  from sklearn import linear_model

  reg = linear_model.LinearRegression()
  # scikit-learn expects a 2-D feature matrix, so reshape the Series values
  reg.fit(advertising['TV'].values.reshape(-1, 1),
          advertising['Sales'].values.reshape(-1, 1))
  print('Coefficients: \n', reg.coef_)
  print('Intercept: \n', reg.intercept_)

  Coefficients: [[ 0.04753664]] Intercept: [ 7.03259355]
  Sales = 7.03259 + 0.04754 * TV

  7. Assessing the Accuracy of the Model • Trying to quantify the extent to which the model fits the data • Typically assessed with: • Residual standard error (RSE) • R2 statistic • This is different from measuring how well the model's predictions perform on a held-out test set, which is typically assessed with the Root Mean Squared Error (RMSE)

  8. Residual Standard Error (RSE) • RSE is the average amount that the response will deviate from the true regression line • For simple linear regression: RSE = sqrt(RSS / (n − 2)) • (we can never perfectly predict Y from X because of the error term ε)
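
  A minimal sketch of how RSE could be computed for the TV-only model, assuming the reg object and advertising DataFrame from slide 6 are still in scope:

  import numpy as np

  X = advertising['TV'].values.reshape(-1, 1)
  y = advertising['Sales'].values
  y_hat = reg.predict(X).ravel()          # fitted values
  rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
  n = len(y)
  rse = np.sqrt(rss / (n - 2))            # n - 2 degrees of freedom in simple regression
  print(rse)                              # roughly 3.26 on this dataset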

  9. Advertising Dataset • RSE = 3.26 • Actual sales in each market deviate from the true regression line by approximately 3.26 units, on average. • Is this error amount acceptable? • Business answer: depends on problem context • Worth noting the percentage error: RSE relative to the mean response
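
  One way to get that percentage error, reusing rse from the sketch on slide 8:

  mean_sales = advertising['Sales'].mean()
  print(rse / mean_sales)   # roughly 0.23 here, i.e. about a 23% error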

  10. Concluding Thoughts on RSE • RSE measures the “lack of fit” that a model may have. • Measured in the units of Y • Not always clear what constitutes a good RSE

  11. R2 Statistic • Proportion of variance explained • Always a value between 0 and 1 • Independent of the scale of Y (unlike RSE)

  12. R2 Statistic • TSS: total sum of squares, Σ(yi − ȳ)² — the total variance in the response Y • Amount of variability inherent in the response, before the regression is performed • RSS: residual sum of squares, Σ(yi − ŷi)² — the amount of variability left unexplained after performing the regression • TSS − RSS: the amount of variability that is explained • R2 = (TSS − RSS) / TSS = 1 − RSS / TSS
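
  A small sketch of both formulas, plus scikit-learn's built-in score, under the same assumptions as the earlier sketches:

  import numpy as np

  X = advertising['TV'].values.reshape(-1, 1)
  y = advertising['Sales'].values
  y_hat = reg.predict(X).ravel()
  rss = np.sum((y - y_hat) ** 2)
  tss = np.sum((y - y.mean()) ** 2)   # variability before the regression
  print((tss - rss) / tss)            # explained fraction, same as 1 - rss/tss
  print(reg.score(X, y))              # scikit-learn computes R² directly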

  13. Advertising Dataset • R2 = 0.61 • Just under two-thirds of the variability in sales is explained by a linear regression on TV.

  14. Q: What is a good R2 value? • A: Depends on the application. • (R2 does have an interpretational advantage over RSE, since it always lies between 0 and 1) • Example: in a physics problem where a linear relationship is known to exist, we can expect a high R2 value • Example: in other domains where the linear model is only a rough approximation, a much lower R2 may still be useful…

  15. R2 Statistic vs. Correlation • Correlation only quantifies the association between a single pair of variables. • Correlation is also a measure of the linear relationship between X and Y. • For simple linear regression (one predictor): R2 = r2 • Next: for multiple linear regression (more than one predictor), a single pairwise correlation no longer suffices; R2 (or RSE) is used instead
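
  A quick check of R2 = r2 for the single-predictor model (np.corrcoef gives the sample correlation r):

  import numpy as np

  r = np.corrcoef(advertising['TV'], advertising['Sales'])[0, 1]
  print(r ** 2)   # matches the model's R² of about 0.61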

  16. Multiple Linear Regression • In practice, often have more than one predictor • Yes, we can run three separate simple linear regressions for the Advertising dataset • But, • Unclear how to make single prediction of sales given all three predictor values • Each regression equation ignores the other two media • BAD! Media may be correlated with each other

  17. Multiple Linear Regression Model • Extend the simple linear regression model with a term for each predictor • Response variable Y is numeric (continuous) • For p predictor variables: Y = β0 + β1*x1 + β2*x2 + … + βp*xp + ε • Since the error ε has mean zero and variance σ2 (normally distributed), we usually omit it when writing the fitted model. • A one-unit change in a predictor variable xj changes the expected mean response by βj units, holding the other predictors fixed.

  18. Advertising Dataset

  19. Estimating the Parameters β0, β1, β2, … • Parameters (regression coefficients) are typically estimated through the method of least squares • Just like with simple linear regression • Automatic in R and Python (data mining toolkits) • We want to minimize the RSS
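
  A sketch of what "minimize the RSS" computes, using numpy's least-squares solver on an intercept-augmented design matrix (the 'Radio'/'Newspaper' column capitalization is an assumption; adjust to match the CSV):

  import numpy as np

  A = advertising[['TV', 'Radio', 'Newspaper']].values
  A = np.column_stack([np.ones(len(A)), A])     # prepend a column of 1s for the intercept
  y = advertising['Sales'].values
  beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # argmin over beta of ||y - A @ beta||²
  print(beta)   # intercept followed by the TV, radio, newspaper slopes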

  20. Advertising Dataset Sales = 2.938889 + 0.045765 * TV + 0.188530 * radio − 0.001037 * newspaper

  21. Simple and Multiple Linear Regression Coefficients Can Be Quite Different
  • In a simple regression, the slope term (newspaper coefficient) represents the average effect of a $1,000 increase in newspaper advertising, ignoring the other predictors (TV and radio):
  TV model: coefficient [[ 0.04753664]], intercept [ 7.03259355]
  Radio model: coefficient [[ 0.20249578]], intercept [ 9.3116381]
  Newspaper model: coefficient [[ 0.0546931]], intercept [ 12.35140707]
  • In the multiple regression, the coefficient for newspaper represents the average effect of increasing newspaper spending by $1,000 while holding TV and radio fixed:
  Coefficients: [[ 0.04576465 0.18853002 -0.00103749]] Intercept: [ 2.93888937]
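
  For reference, a sketch of the scikit-learn call that would produce the multiple-regression output above (again assuming the 'Radio'/'Newspaper' capitalization):

  from sklearn import linear_model

  reg_all = linear_model.LinearRegression()
  reg_all.fit(advertising[['TV', 'Radio', 'Newspaper']], advertising['Sales'])
  print('Coefficients: \n', reg_all.coef_)    # roughly [0.0458, 0.1885, -0.0010]
  print('Intercept: \n', reg_all.intercept_)  # roughly 2.9389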

  22. Correlation Matrix

  23. Correlation Matrix • Correlation between radio and newspaper is 0.35 • Barely any correlation (or "not correlated") for TV/radio and TV/newspaper • The 0.35 reveals a tendency to spend more on newspaper advertising in markets where more is spent on radio advertising. • Sales are higher in markets where more is spent on radio, but more also tends to be spent on newspaper in those markets. • So in the simple linear model, newspaper "gets credit" for the effect of radio on sales.
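
  The matrix itself is one pandas call (column names as assumed earlier):

  print(advertising[['TV', 'Radio', 'Newspaper', 'Sales']].corr())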

  24. Qualitative Predictors • So far have assumed that all variables in linear regression model are quantitative. • How to deal with qualitative variables?

  25. Credit Dataset • Response: • Balance (individual’s average credit card debt) • Quantitative Predictors: • Age (years) • Cards (number of credit cards) • Education (years of education) • Income (in thousands of dollars) • Limit (credit limit) • Rating (credit rating) • Qualitative Predictors: • Gender {Male, Female} • Student {Yes, No} • Married {Yes, No} • Ethnicity {Caucasian, African American, Asian}

  26. Qualitative Predictors: Two Levels • Levels (sometimes called factors): the possible values of a discrete variable • Solution: create a dummy variable (or indicator) that takes on two possible numerical values • Credit dataset, Gender variable: {Male, Female} • Create a new dummy variable: xi = 1 if person i is female, 0 if person i is male (a pandas sketch of this follows below)
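
  As referenced above, a minimal sketch in pandas/scikit-learn; the Credit.csv path, column names, and level labels are assumptions based on the ISLR Credit dataset:

  import pandas as pd
  from sklearn import linear_model

  credit = pd.read_csv('../datasets/Credit.csv')   # path is an assumption
  # dummy variable: 1 if female, 0 if male (strip() guards against stray spaces)
  credit['GenderFemale'] = (credit['Gender'].str.strip() == 'Female').astype(int)

  reg_g = linear_model.LinearRegression()
  reg_g.fit(credit[['GenderFemale']], credit['Balance'])
  print(reg_g.intercept_, reg_g.coef_)   # roughly 509.80 and [19.73] per the next slides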

  27. Qualitative Predictors: Two Levels • … for now assuming that Gender is the only predictor in the model … • Simple linear regression model: Balance = B0 + B1 * xi + εi • Estimate coefficients B0, B1 • The B1 * xi term zeros out for males (xi = 0)

  28. Qualitative Predictors: Two Levels • Interpretation: • B0: average credit card balance among males • B0 + B1: average credit card balance among females • B1: average difference in credit card balance between females and males

  29. Qualitative Predictors: Two Levels • Interpretation: • B0: average credit card balance among males • B0 + B1: average credit card balance among females • B1: average difference in credit card balance between females and males • Average credit card debt for males is estimated to be $509.80. • Females are estimated to carry $19.73 in additional debt, for a total of: • $509.80+$19.73=$529.53 Balance = 509.80 + 19.73 * xi

  30. Qualitative Predictors: Two Levels • Decision to code females as 1 and males as 0 is arbitrary. • It does alter the interpretation of the coefficients • What would happen if we coded males as 1 and females as 0?

  31. Qualitative Predictors: Two Levels • Interpretation: • B0: average credit card balance among females • B0 + B1: average credit card balance among males • B1: average difference in credit card balance between males and females • Average credit card debt for females is estimated to be $529.54. • Males are estimated to carry $19.73 less debt, for a total of: • $529.54 − $19.73 = $509.80 Balance = 529.54 - 19.73 * xi Same exact model!

  32. Qualitative Predictors: Two Levels • A third option: code females as xi = +1 and males as xi = −1 • Interpretation: • B0: overall average credit card balance (ignoring gender) • B1: amount that females are above the average, and males are below the average • Average credit card debt, ignoring gender, is $519.67. • The average difference between males and females is: • $9.865 * 2 = $19.73 Balance = 519.67 + 9.865 * xi • Same exact model! • It doesn't matter which coding scheme is used, as long as the coefficients are correctly interpreted.
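
  A sketch of the ±1 coding, reusing the hypothetical credit DataFrame from the earlier sketch; the fitted values (and hence the model) come out identical:

  import numpy as np

  x_pm = np.where(credit['Gender'].str.strip() == 'Female', 1, -1)
  reg_pm = linear_model.LinearRegression()
  reg_pm.fit(x_pm.reshape(-1, 1), credit['Balance'])
  print(reg_pm.intercept_, reg_pm.coef_)   # roughly 519.67 and [9.865]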

  33. Qualitative Predictors: More than Two Levels • A single dummy variable cannot represent all possible values of a qualitative predictor with more than two levels • Solution: create additional dummy variables • For the Ethnicity variable {African American, Asian, Caucasian}: xi1 = 1 if person i is Asian (0 otherwise), xi2 = 1 if person i is Caucasian (0 otherwise) • Simple linear model, ignoring all other predictors…. • Always one fewer dummy variable than the number of levels; the level with no dummy (African American) is the baseline.
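
  A sketch using pd.get_dummies, dropping the African American column so it becomes the baseline (same assumptions about the Credit data as in the earlier sketch):

  # one dummy per non-baseline level: Asian and Caucasian remain
  X_eth = pd.get_dummies(credit['Ethnicity']).drop(columns='African American')
  reg_e = linear_model.LinearRegression().fit(X_eth, credit['Balance'])
  print(reg_e.intercept_, reg_e.coef_)   # roughly 531.00 and [-18.69, -12.50]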

  34. Qualitative Predictors: More than Two Levels • Interpretation: • B0: average credit card balance for African Americans (the baseline) • B1: difference in average balance between Asians and African Americans • B2: difference in average balance between Caucasians and African Americans Balance = 531.00 – 18.69 * xi1 – 12.50 * xi2 • Estimated balance for African Americans is $531.00 • The Asian category is estimated to carry $18.69 less debt than the African American category • The Caucasian category is estimated to carry $12.50 less debt than the African American category • Once again, an arbitrary coding scheme.

  35. Qualitative Predictors: More than Two Levels • Alternative coding: one dummy per level, xi1 = African American, xi2 = Asian, xi3 = Caucasian Balance = 520.60 + 10.40 * xi1 – 8.29 * xi2 – 2.11 * xi3 Coefficients: [[ 10.39626236 -8.29001215 -2.10625021]] Intercept: [ 520.60373764] • Same fitted values as before: • Estimated balance for African Americans: 520.60 + 10.40 = $531.00 • The Asian category carries $18.69 less debt than the African American category • The Caucasian category carries $12.50 less debt than the African American category

  36. Multiple Quantitative and Qualitative Predictors • Not a problem • Use as many dummy variables as needed • pandas can create the dummy variables (e.g., pd.get_dummies); scikit-learn then fits on the resulting all-numeric feature matrix (sketch below)
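
  A sketch mixing quantitative and qualitative predictors; the choice of Income, Limit, and Student here is purely illustrative:

  # get_dummies passes numeric columns through and encodes the object columns
  X_mix = pd.get_dummies(credit[['Income', 'Limit', 'Student']], drop_first=True)
  reg_m = linear_model.LinearRegression().fit(X_mix, credit['Balance'])
  print(X_mix.columns.tolist())          # e.g. ['Income', 'Limit', 'Student_Yes']
  print(reg_m.intercept_, reg_m.coef_)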

  37. In conclusion… • Pros of the Linear Regression Model: • Provides nicely interpretable results • Works well on many real-world problems • Cons of the Linear Regression Model: • Assumes a linear relationship between response and predictors: • the change in the response Y due to a one-unit change in Xi is constant • Assumes an additive relationship: • the effect of changes in a predictor Xi on the response Y is independent of the values of the other predictors

  38. Extensions of the Linear Model • Beyond the scope of this course… • Can remove the additive assumption by specifying interaction terms • Can remove the linear assumption using polynomial regression

  39. References • Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al. • Data Science from Scratch, 1st edition, Grus • Data Mining and Business Analytics with R, 1st edition, Ledolter • An Introduction to Statistical Learning, 1st edition, James et al. • Discovering Knowledge in Data, 2nd edition, Larose et al. • Introduction to Data Mining, 1st edition, Tan et al.
