
Linear Regression with R


Presentation Transcript


  1. Linear Regression with R Mike Jadoo

  2. Purpose • Bring about an awareness of linear regression in R • Enable individuals to properly build an analysis • Select the most appropriate model(s)

  3. Software: R i386 3.2.3 (32-bit)

  4. Linear Regression Model • Review the topic's theory or use past experience • Formulate an initial model • Find the data • Check the data/hypothesis • Estimate the model • Reformulate the model • Check the parameter estimates • Interpret and report your results

  5. Example • Hypothesis: do national wage levels affect how much the population spends? If so, by how much? • Data source: BEA, personal consumption expenditures (dependent variable [Y]) and wages/salaries (independent variable [X])

  6. Create the hypothesis • What are you trying to analyze or predict? • Go over the topic's relevant theory - this may involve extensive reading, but it is a good first step.

  7. Finding the data sources • Government statistical agencies are a good place to start! • Can't find the data you're looking for? Call the agency; the staff is there to help. • There are other data providers; some charge a fee.

  8. Finding the data sources • Obtain the variable's methodology and ask yourself: "Is this acceptable?" • Number of observations: 20 to 25 minimum

  9. Finding the data sources • Construct the series yourself: take observations from big-data sources and apply the average (mean) or other calculations

  10. Fix the data series • Missing observations: imputation methods (not optimal, but sometimes there is no choice), e.g. interpolation

  11. Fix the data series Interpolation
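Linear interpolation can be done in base R with approx(). The short series below is a made-up illustration (not the BEA data from the example):

```r
# Fill a missing observation by linear interpolation with base R's approx().
# The series is hypothetical quarterly data for illustration only.
x   <- c(100, 104, NA, 112, 118)
idx <- seq_along(x)
filled <- approx(idx[!is.na(x)], x[!is.na(x)], xout = idx)$y
filled  # the NA at position 3 becomes 108, halfway between 104 and 112
```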

  12. Fix the data series • Other methods - Mean/median/mode imputation - Nearest neighbor • If there are outliers or influential observations, the preferred correction is the median

  13. Linear Regression Model • Simple model assumptions (two variables): 1. The model is linear 2. The observations are random 3. The X values are not all the same 4. The average of the residuals equals zero 5. The error variance around the regression line is constant (homoskedasticity)

  14. Construct initial model • This might not be the model you end with • Start by looking at your response and explanatory variables • You might need more than one explanatory variable (maybe 3)

  15. Scatter Plot Using the PCE and Wages and Salary estimates
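A scatter plot of Y against X is one call to plot(). The wage and consumption values below are hypothetical stand-ins for the BEA estimates the slide refers to:

```r
# Scatter plot of consumption (Y) against wages (X).
# These values are illustrative, not actual BEA estimates.
wages <- c(5.2, 5.8, 6.1, 6.7, 7.3, 7.9)   # trillions of dollars
pce   <- c(8.1, 8.6, 9.0, 9.5, 10.2, 10.8)
plot(wages, pce,
     xlab = "Wages and salaries",
     ylab = "Personal consumption expenditures",
     main = "PCE vs. Wages and Salaries")
```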

  16. Examine the Data • Import the data • Histogram command
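A minimal import-and-histogram sketch. The file name and column name are placeholders; simulated data are substituted here so the example runs on its own:

```r
# Import a data set and examine one column with a histogram.
# In practice: mydata <- read.csv("mydata.csv", header = TRUE)
# Simulated data stand in for the file so this runs standalone.
set.seed(1)
mydata <- data.frame(pce = rnorm(100, mean = 10, sd = 2))
hist(mydata$pce, main = "Histogram of PCE", xlab = "PCE")
```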

  17. Normality/Central Limit Theorem • Why are these topics important? • If your data are not normally distributed (or close to normal), your analysis will be poor - you can't predict or analyze a series that behaves unpredictably

  18. Signs of Normality • The mean and median are close in value

  19. Signs of Normality • Skewness: measures the tendency of your data to be spread out - Right-skewed: positive value; the mean is greater than the median; outlier-prone series - Left-skewed: negative value; the mean is less than the median • Kurtosis: measures the peak of the data and tail thickness; a value of 0 is normal - Negative kurtosis: the tails are light (platykurtic) - Positive kurtosis: the tails are heavier (leptokurtic), an outlier-prone distribution
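Both measures can be computed directly from their moment formulas in base R (packages such as moments offer ready-made functions). Simulated normal data are used for illustration:

```r
# Sample skewness and excess kurtosis from their moment formulas,
# using simulated normal data; both should be near 0 here.
set.seed(42)
x <- rnorm(1000)
n <- length(x)
m <- mean(x); s <- sd(x)
skewness        <- sum((x - m)^3) / (n * s^3)
excess_kurtosis <- sum((x - m)^4) / (n * s^4) - 3
c(skewness = skewness, excess_kurtosis = excess_kurtosis)
```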


  21. Examine the Data • Skewness (?) • Kurtosis (?)

  22. Examine the Data • Skewness (left skewed) • Kurtosis (not normally distributed)

  23. Examine the Data • Formal test for normality: the Jarque-Bera test • Hypotheses - Ho: normally distributed - Ha: not normal
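The Jarque-Bera statistic can be computed by hand as JB = n/6 * (S² + K²/4), where S is skewness and K is excess kurtosis; it is compared against a chi-squared distribution with 2 degrees of freedom (tseries::jarque.bera.test() packages the same calculation). Data here are simulated:

```r
# Jarque-Bera normality test computed from scratch on simulated data.
set.seed(7)
x <- rnorm(500)
n <- length(x)
m <- mean(x); s <- sd(x)
S  <- sum((x - m)^3) / (n * s^3)       # skewness
K  <- sum((x - m)^4) / (n * s^4) - 3   # excess kurtosis
JB <- n / 6 * (S^2 + K^2 / 4)
p_value <- 1 - pchisq(JB, df = 2)
# p_value > 0.05: do not reject Ho (normality)
```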

  24. Fix the data series At this point you should decide whether or not to transform your data and identify which variable might not be needed.

  25. Fix data series • Data transformations - Log - Quadratic (if your explanatory variable has a non-linear shape) • The model's interpretation will be different at this point! • Always document what you are doing (others want to know)!
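A log transformation is one call on a strictly positive series; remember that coefficients on logged variables are then read in percentage terms. The values are illustrative:

```r
# Log transformation of a strictly positive series (illustrative values).
pce     <- c(8.1, 8.6, 9.0, 9.5, 10.2, 10.8)
log_pce <- log(pce)   # natural log; coefficients now read as % changes
log_pce
```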

  26. Linear Regression Model • Simple regression: one response variable (y) and one explanatory variable (x)
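In R a simple regression is one call to lm(). The wage/consumption numbers are hypothetical stand-ins for the BEA series:

```r
# Simple linear regression y ~ x with lm() (illustrative data).
wages <- c(5.2, 5.8, 6.1, 6.7, 7.3, 7.9)
pce   <- c(8.1, 8.6, 9.0, 9.5, 10.2, 10.8)
fit <- lm(pce ~ wages)
summary(fit)   # coefficients, t-stats, p-values, R-squared
coef(fit)      # intercept and slope
```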

  27. Linear Regression Model • Test the model's assumptions • Test for heteroskedasticity: the residual variances are very different (at this point you need to either transform the data or re-evaluate your model's variables)

  28. Linear Regression Model • Test for heteroskedasticity (Breusch-Pagan) • The null hypothesis is that the error variance is constant (homoskedasticity); if it is rejected, heteroskedasticity is present • In our example, do not reject the null (p > 0.05)
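The studentized Breusch-Pagan test can be sketched by hand: regress the squared residuals on X and use LM = n * R² against a chi-squared(1) distribution (lmtest::bptest() does this in one call). Data are the same illustrative series as above:

```r
# Breusch-Pagan test by hand: auxiliary regression of squared residuals.
wages <- c(5.2, 5.8, 6.1, 6.7, 7.3, 7.9)
pce   <- c(8.1, 8.6, 9.0, 9.5, 10.2, 10.8)
fit <- lm(pce ~ wages)
aux <- lm(resid(fit)^2 ~ wages)          # auxiliary regression
LM  <- length(pce) * summary(aux)$r.squared
p_value <- 1 - pchisq(LM, df = 1)
# p_value > 0.05: do not reject homoskedasticity
```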

  29. Selecting the Best Model

  30. What to look for in your output • Parameter estimate tests: are they statistically significant? Look at the p-value (close to 0) and the t-stat (above 1.96) • Goodness of fit: R2, an important statistical measure when selecting the optimal model(s)

  31. Multiple Regression • Assumptions: 1. The model is linear 2. The observations are random 3. The variables in the model are independent (no perfect collinearity) 4. The average of the residuals equals zero 5. The error variance around the regression line is constant (homoskedasticity)

  32. Multiple Regression • Example using Trees data set in R
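The trees data set ships with R, so the slide's example can be reproduced directly: predict timber Volume from Girth and Height.

```r
# Multiple regression on R's built-in trees data set (31 observations).
data(trees)
fit <- lm(Volume ~ Girth + Height, data = trees)
summary(fit)   # both predictors, fit statistics
```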

  33. Multiple Regression • Homoskedasticity check • No perfect collinearity check
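A quick collinearity check is the correlation between the explanatory variables; a value near ±1 signals (near-)perfect collinearity (car::vif() is a common alternative if that package is installed):

```r
# Correlation between the two predictors in the trees data set.
data(trees)
r <- cor(trees$Girth, trees$Height)
r   # moderately positive, well below 1: no perfect collinearity
```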

  34. Multiple Regression • ANOVA TABLE
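The ANOVA table for the fitted model comes from anova(), giving sums of squares and F statistics for each term:

```r
# ANOVA table for the trees regression.
data(trees)
fit <- lm(Volume ~ Girth + Height, data = trees)
anova(fit)   # rows: Girth, Height, Residuals
```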

  35. Time Series Regression model • Assumptions: 1. The model is linear 2. No perfect collinearity 3. The average of the residuals equals zero 4. The error variance around the regression line is constant (homoskedasticity) 5. No serial correlation 6. The residuals are normally distributed

  36. Time Series Regression model • Create a trend variable • Test the assumptions • Use lm(), but there are other options
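A trend variable is just the observation index, e.g. from seq_along(). The series below is an illustrative one, not real data:

```r
# Time-series regression with a linear trend term (illustrative series).
y     <- c(100, 103, 107, 110, 115, 119, 124, 128)
trend <- seq_along(y)          # 1, 2, ..., 8
fit <- lm(y ~ trend)
coef(fit)["trend"]             # average change per period
```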

  37. Regression model • Selecting the best model(s): report more than one • Which statistical measures to use: - R2 vs. adjusted R2 - Beta coefficients (parameter estimates) - Coefficient of variation - RMSE

  38. Regression Model • How you say it counts! • Model interpretation: - Untransformed variables: a one-unit increase/decrease in X is associated with a coefficient-sized increase/decrease in Y - Logged variables: a unit increase/decrease corresponds to a percentage change - Report R2

  39. Hypothesis • Have you supported or refuted your hypothesis? • Are there any other noteworthy findings you wish to report?

  40. Reporting • The stargazer package
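stargazer produces publication-style regression tables. It is a CRAN package (install with install.packages("stargazer") if needed), so the call is guarded here:

```r
# Publication-style regression table with the stargazer package.
data(trees)
fit <- lm(Volume ~ Girth + Height, data = trees)
if (requireNamespace("stargazer", quietly = TRUE)) {
  stargazer::stargazer(fit, type = "text")   # plain-text table
}
```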

  41. THERE IS MORE TO EXPLORE !!!!!

  42. Recap

  43. Thank you
