
Linear Regression with R


Presentation Transcript


  1. Linear Regression with R Mike Jadoo

  2. Purpose • Bring about an awareness of linear regression in R • Enable individuals to properly build an analysis • Select the most appropriate model(s)

  3. Software: R i386 3.2.3 (32-bit)

  4. Linear Regression Model • Review the topic's theory or use past experience • Formulate an initial model • Find the data • Check the data/hypothesis • Estimate the model • Reformulate the model • Check the parameter estimates • Interpret and report your results

  5. Example • Hypothesis: do national wage levels affect how much the population spends? If so, by how much? • Data source: BEA, personal consumption expenditures (dependent variable [Y]) and wages/salaries (independent variable [X])

  6. Create the hypothesis • What are you trying to analyze or predict? • Go over the topic's relevant theory - this may involve extensive reading, but it is a good first step.

  7. Finding the data sources • Government statistical agencies are a good place to start! • Can't find the data you're looking for? Call the agency; the staff is there to help. • There are other data providers; some charge a fee.

  8. Finding the data sources • Obtain the variable's methodology and ask yourself: "Is this acceptable?" • Number of observations: 20 to 25 minimum

  9. Finding the data sources • Construct the series yourself: take observations from big-data sources and apply the average (mean) or other calculations

  10. Fix the data series • Missing observations: imputation methods (not optimal, but sometimes there is no choice), e.g. interpolation

  11. Fix the data series Interpolation
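Linear interpolation can be done in base R with approx(). The short series below is a made-up illustration (not the BEA data from the example):

```r
# Fill a missing observation by linear interpolation with base R's approx().
# The series is hypothetical quarterly data for illustration only.
x   <- c(100, 104, NA, 112, 118)
idx <- seq_along(x)
filled <- approx(idx[!is.na(x)], x[!is.na(x)], xout = idx)$y
filled  # the NA at position 3 becomes 108, halfway between 104 and 112
```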

  12. Fix the data series • Other methods - Mean/median/mode imputation - Nearest neighbor • If there are outliers or influential observations, the preferred correction is the median

  13. Linear Regression Model • Simple model assumptions (two variables): 1. The model is linear 2. The observations are random 3. The X values are not all the same 4. The average of the residuals equals zero 5. The error variance around the regression line is constant (homoskedasticity)

  14. Construct initial model • This might not be the model you end with • Start by looking at your response and explanatory variables • You might need more than one explanatory variable (maybe 3)

  15. Scatter Plot Using the PCE and Wages and Salary estimates
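A scatter plot of Y against X is one call to plot(). The wage and consumption values below are hypothetical stand-ins for the BEA estimates the slide refers to:

```r
# Scatter plot of consumption (Y) against wages (X).
# These values are illustrative, not actual BEA estimates.
wages <- c(5.2, 5.8, 6.1, 6.7, 7.3, 7.9)   # trillions of dollars
pce   <- c(8.1, 8.6, 9.0, 9.5, 10.2, 10.8)
plot(wages, pce,
     xlab = "Wages and salaries",
     ylab = "Personal consumption expenditures",
     main = "PCE vs. Wages and Salaries")
```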

  16. Examine the Data • Import the data • Histogram command
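A minimal import-and-histogram sketch. The file name and column name are placeholders; simulated data are substituted here so the example runs on its own:

```r
# Import a data set and examine one column with a histogram.
# In practice: mydata <- read.csv("mydata.csv", header = TRUE)
# Simulated data stand in for the file so this runs standalone.
set.seed(1)
mydata <- data.frame(pce = rnorm(100, mean = 10, sd = 2))
hist(mydata$pce, main = "Histogram of PCE", xlab = "PCE")
```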

  17. Normality/Central Limit Theorem • Why are these topics important? • If your data are not normally distributed (or close to normal), your analysis will be poor - you can't predict or analyze a series that behaves unpredictably

  18. Signs of Normality • The mean and median are close in value

  19. Signs of Normality • Skewness: measures the tendency of your data to be spread out - Right-skewed: positive value; the mean is greater than the median; outlier-prone series - Left-skewed: negative value; the mean is less than the median • Kurtosis: measures the peak of the data and tail thickness; a value of 0 is normal - Negative kurtosis: the tails are light (platykurtic) - Positive kurtosis: the tails are heavier (leptokurtic), an outlier-prone distribution
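Both measures can be computed directly from their moment formulas in base R (packages such as moments offer ready-made functions). Simulated normal data are used for illustration:

```r
# Sample skewness and excess kurtosis from their moment formulas,
# using simulated normal data; both should be near 0 here.
set.seed(42)
x <- rnorm(1000)
n <- length(x)
m <- mean(x); s <- sd(x)
skewness        <- sum((x - m)^3) / (n * s^3)
excess_kurtosis <- sum((x - m)^4) / (n * s^4) - 3
c(skewness = skewness, excess_kurtosis = excess_kurtosis)
```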


  21. Examine the Data • Skewness (?) • Kurtosis (?)

  22. Examine the Data • Skewness (left skewed) • Kurtosis (not normally distributed)

  23. Examine the Data • Formal test for normality: the Jarque-Bera test • Hypotheses - Ho: normally distributed - Ha: not normal
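The Jarque-Bera statistic can be computed by hand as JB = n/6 * (S² + K²/4), where S is skewness and K is excess kurtosis; it is compared against a chi-squared distribution with 2 degrees of freedom (tseries::jarque.bera.test() packages the same calculation). Data here are simulated:

```r
# Jarque-Bera normality test computed from scratch on simulated data.
set.seed(7)
x <- rnorm(500)
n <- length(x)
m <- mean(x); s <- sd(x)
S  <- sum((x - m)^3) / (n * s^3)       # skewness
K  <- sum((x - m)^4) / (n * s^4) - 3   # excess kurtosis
JB <- n / 6 * (S^2 + K^2 / 4)
p_value <- 1 - pchisq(JB, df = 2)
# p_value > 0.05: do not reject Ho (normality)
```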

  24. Fix the data series At this point you should decide whether or not to transform your data and identify which variable might not be needed.

  25. Fix data series • Data transformations - Log - Quadratic (if your explanatory variable has a non-linear shape) • The model's interpretation will be different at this point! • Always document what you are doing (others want to know)!
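A log transformation is one call on a strictly positive series; remember that coefficients on logged variables are then read in percentage terms. The values are illustrative:

```r
# Log transformation of a strictly positive series (illustrative values).
pce     <- c(8.1, 8.6, 9.0, 9.5, 10.2, 10.8)
log_pce <- log(pce)   # natural log; coefficients now read as % changes
log_pce
```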

  26. Linear Regression Model • Simple regression: one response variable (y) and one explanatory variable (x)
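In R a simple regression is one call to lm(). The wage/consumption numbers are hypothetical stand-ins for the BEA series:

```r
# Simple linear regression y ~ x with lm() (illustrative data).
wages <- c(5.2, 5.8, 6.1, 6.7, 7.3, 7.9)
pce   <- c(8.1, 8.6, 9.0, 9.5, 10.2, 10.8)
fit <- lm(pce ~ wages)
summary(fit)   # coefficients, t-stats, p-values, R-squared
coef(fit)      # intercept and slope
```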

  27. Linear Regression Model • Test the model's assumptions • Test for heteroskedasticity: the residual variances are very different (at this point you need to either transform the data or re-evaluate your model's variables)

  28. Linear Regression Model • Test for heteroskedasticity (Breusch-Pagan) • The null hypothesis is that the error variance is constant (homoskedasticity); if it is rejected, heteroskedasticity is present • In our example, do not reject the null (p > 0.05)
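The studentized Breusch-Pagan test can be sketched by hand: regress the squared residuals on X and use LM = n * R² against a chi-squared(1) distribution (lmtest::bptest() does this in one call). Data are the same illustrative series as above:

```r
# Breusch-Pagan test by hand: auxiliary regression of squared residuals.
wages <- c(5.2, 5.8, 6.1, 6.7, 7.3, 7.9)
pce   <- c(8.1, 8.6, 9.0, 9.5, 10.2, 10.8)
fit <- lm(pce ~ wages)
aux <- lm(resid(fit)^2 ~ wages)          # auxiliary regression
LM  <- length(pce) * summary(aux)$r.squared
p_value <- 1 - pchisq(LM, df = 1)
# p_value > 0.05: do not reject homoskedasticity
```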

  29. Selecting the Best Model

  30. What to look for in your output • Parameter estimate tests: are they statistically significant? Look at the p-value (close to 0) and the t-stat (above 1.96) • Goodness of fit: R2, an important statistical measure when selecting the optimal model(s)

  31. Multiple Regression • Assumptions: 1. The model is linear 2. The observations are random 3. The variables in the model are independent (no perfect collinearity) 4. The average of the residuals equals zero 5. The error variance around the regression line is constant (homoskedasticity)

  32. Multiple Regression • Example using Trees data set in R
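The trees data set ships with R, so the slide's example can be reproduced directly: predict timber Volume from Girth and Height.

```r
# Multiple regression on R's built-in trees data set (31 observations).
data(trees)
fit <- lm(Volume ~ Girth + Height, data = trees)
summary(fit)   # both predictors, fit statistics
```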

  33. Multiple Regression • Homoskedasticity check • No perfect collinearity check
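A quick collinearity check is the correlation between the explanatory variables; a value near ±1 signals (near-)perfect collinearity (car::vif() is a common alternative if that package is installed):

```r
# Correlation between the two predictors in the trees data set.
data(trees)
r <- cor(trees$Girth, trees$Height)
r   # moderately positive, well below 1: no perfect collinearity
```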

  34. Multiple Regression • ANOVA TABLE
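The ANOVA table for the fitted model comes from anova(), giving sums of squares and F statistics for each term:

```r
# ANOVA table for the trees regression.
data(trees)
fit <- lm(Volume ~ Girth + Height, data = trees)
anova(fit)   # rows: Girth, Height, Residuals
```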

  35. Time Series Regression model • Assumptions: 1. The model is linear 2. No perfect collinearity 3. The average of the residuals equals zero 4. The error variance around the regression line is constant (homoskedasticity) 5. No serial correlation 6. The residuals are normally distributed

  36. Time Series Regression model • Create a trend variable • Test the assumptions • Use lm(), but there are other options
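A trend variable is just the observation index, e.g. from seq_along(). The series below is an illustrative one, not real data:

```r
# Time-series regression with a linear trend term (illustrative series).
y     <- c(100, 103, 107, 110, 115, 119, 124, 128)
trend <- seq_along(y)          # 1, 2, ..., 8
fit <- lm(y ~ trend)
coef(fit)["trend"]             # average change per period
```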

  37. Regression model • Selecting the best model(s): report more than one • Which statistical measures to use: - R2 vs. adjusted R2 - Beta coefficients (parameter estimates) - Coefficient of variation - RMSE

  38. Regression Model • How you say it counts! • Model interpretation: - Untransformed variables: a one-unit increase/decrease in X is associated with a coefficient-sized increase/decrease in Y - Logged variables: a unit increase/decrease corresponds to a percentage change - Report R2

  39. Hypothesis • Have you supported or refuted your hypothesis? • Are there any other noteworthy findings you wish to report?

  40. Reporting • The stargazer package
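stargazer produces publication-style regression tables. It is a CRAN package (install with install.packages("stargazer") if needed), so the call is guarded here:

```r
# Publication-style regression table with the stargazer package.
data(trees)
fit <- lm(Volume ~ Girth + Height, data = trees)
if (requireNamespace("stargazer", quietly = TRUE)) {
  stargazer::stargazer(fit, type = "text")   # plain-text table
}
```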

  41. THERE IS MORE TO EXPLORE !!!!!

  42. Recap

  43. Thank you
