Regression


Regression

Jennifer Kensler

Laboratory for Interdisciplinary Statistical Analysis

LISA helps VT researchers benefit from the use of Statistics

Experimental Design • Data Analysis • Interpreting Results • Grant Proposals • Software (R, SAS, JMP, SPSS...)

Walk-In Consulting

Monday-Friday, 12-2 PM, for questions requiring <30 minutes

Collaboration

Request a meeting from our website for personalized statistical advice

Short Courses

Designed to help graduate students apply statistics in their research

All services are FREE for VT researchers. We assist with research—not class projects or homework.

www.lisa.stat.vt.edu

Topics
• Simple Linear Regression
• Multiple Linear Regression
• Regression with Categorical Variables
Simple Linear Regression
• Simple Linear Regression (SLR) is used to model the relationship between two continuous variables.
• Scatterplots are used to graphically examine the relationship between two quantitative variables.

Sullivan (pg. 193)

Types of Relationships Between Two Continuous Variables
• Positive and negative linear relationships
Correlation
• The Pearson correlation coefficient measures the strength of a linear relationship between two quantitative variables. The sample correlation coefficient is

r = Σ (x_i − x̄)(y_i − ȳ) / [(n − 1) s_x s_y]

where x̄ and ȳ are the sample means of the x and y variables respectively, and s_x and s_y are the sample standard deviations of the x and y variables respectively.
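The computation behind this formula can be sketched in a few lines of Python (a minimal illustration with made-up data, not an example from the slides):

```python
import math

def pearson_r(x, y):
    # r = sum((x_i - x̄)(y_i - ȳ)) / ((n - 1) * s_x * s_y)
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_x = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    cross = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)

# Toy data (illustrative only): a moderately strong positive relationship.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 4))  # → 0.7746
```

Because each deviation is standardized by its own standard deviation, r is unit-free, which is why its values are comparable across data sets.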

Properties of the Correlation Coefficient
• Positive values of r indicate a positive linear relationship.
• Negative values of r indicate a negative linear relationship.
• Values close to +1 or -1 indicate a strong linear relationship.
• Values close to 0 indicate that there is no linear relation between the variables.
• We only use r to discuss linear relationships between two variables.
• Note: Correlation does not imply causation.
Simple Linear Regression

Can we describe the behavior between the two variables with a linear equation?

• The variable on the x-axis is often called the explanatory or predictor variable.
• The variable on the y-axis is called the response variable.
Simple Linear Regression
• Objectives of Simple Linear Regression
• Determine the significance of the predictor variable in explaining variability in the response variable.
• (e.g., is per capita GDP useful in explaining the variability in life expectancy?)
• Predict values of the response variable for given values of the explanatory variable.
• (e.g., if we know the per capita GDP, can we predict life expectancy?)
• Note: The predictor variable does not necessarily cause the response.
Simple Linear Regression Model
• The Simple Linear Regression model is given by

y_i = β0 + β1 x_i + ε_i

where y_i is the response of the ith observation,

β0 is the y-intercept,

β1 is the slope,

x_i is the value of the predictor variable for the ith observation, and

ε_i is the random error.

SLR Estimation of Parameters
• The equation for the least-squares regression line is given by

ŷ = b0 + b1 x

where ŷ is the predicted value of the response for a given value of x. The least-squares estimates are b1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and b0 = ȳ − b1 x̄.
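As a concrete sketch of how the least-squares estimates are computed (standard formulas, made-up data for illustration):

```python
def fit_line(x, y):
    """Least-squares estimates for the line ŷ = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1

b0, b1 = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(b0, 2), round(b1, 2))  # → 2.2 0.6, i.e. ŷ = 2.2 + 0.6x
```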

The Residual
• The residual is the observed value of y minus the predicted value of y.
• The residual for observation i is given by e_i = y_i − ŷ_i.
Simple Linear Regression Assumptions
• Linearity
• Observations are independent
• Based on how the data were collected.
• Check by plotting residuals in the order in which the data were collected.
• Constant variance
• Check using a residual plot (plot residuals vs. ŷ).
• The error terms are normally distributed.
• Check by making a histogram or normal quantile plot of the residuals.
Diagnostics: Residual Plot
• A residual plot is used to check the assumption of constant variance and to check model fit (i.e., whether a line is a good fit).
• Good residual plot: no pattern
Diagnostics
• Left: Residuals show non-constant variance.
• Right: Residuals show non-linear pattern.
Diagnostics: Normal Quantile Plot
• Left: Residuals are not normal
• Right: Normality assumption appropriate
ANOVA Table for Simple Linear Regression

The F-test tests whether there is a linear relationship between the two variables.

Null hypothesis: H0: β1 = 0

Alternative hypothesis: Ha: β1 ≠ 0

Test for Parameters
• Test whether the true y-intercept is different from 0: H0: β0 = 0 vs. Ha: β0 ≠ 0.
• Test whether the true slope is different from 0: H0: β1 = 0 vs. Ha: β1 ≠ 0.
• Note: For simple linear regression the t-test for the slope is equivalent to the overall F-test (t² = F).
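The equivalence of the slope t-test and the overall F-test can be verified numerically: with one predictor, the square of the slope's t statistic equals the ANOVA F statistic. A pure-Python sketch on made-up data:

```python
import math

def slope_t_and_F(x, y):
    """t statistic for H0: β1 = 0, and the overall ANOVA F statistic."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)                    # error mean square
    ssr = sum((b0 + b1 * xi - ybar) ** 2 for xi in x)
    t = b1 / math.sqrt(mse / sxx)          # se(b1) = sqrt(MSE / Sxx)
    F = ssr / mse                          # MSR = SSR / 1 for one predictor
    return t, F

t, F = slope_t_and_F([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(t ** 2, 6), round(F, 6))  # → 4.5 4.5, i.e. t² = F
```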
Coefficient of Determination
• The coefficient of determination, R², is the percent of variation in the response variable explained by the least-squares regression line.
• Note: 0 ≤ R² ≤ 1.
• We also have R² = SSR/SST = 1 − SSE/SST, and for simple linear regression R² = r².
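R² can be computed directly from the sums of squares; a minimal pure-Python sketch with made-up data:

```python
def r_squared(x, y):
    """R² = 1 − SSE/SST for the least-squares line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual SS
    sst = sum((yi - ybar) ** 2 for yi in y)                        # total SS
    return 1 - sse / sst

print(round(r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]), 4))  # → 0.6
```

For these data the sample correlation is r ≈ 0.7746, and 0.7746² ≈ 0.6, illustrating that R² = r² in simple linear regression.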
Muscle Mass Example
• A nutritionist randomly selected 15 women from each ten-year age group, beginning with age 40 and ending with age 79. The nutritionist recorded the age and muscle mass of each woman, and would like to fit a model to explore the relationship between age and muscle mass. (Kutner et al. pg. 36)
JMP: Making a Scatterplot
• To analyze the data click Analyze and then select Fit Y by X.
JMP: Making a Scatterplot
• Enter the variables as follows:

Y, Response: Muscle Mass

X, Factor: Age

JMP: Scatterplot
• This results in a scatter plot.
JMP: Simple Linear Regression
• To perform the simple linear regression click on the Red Arrow and then select Fit Line.
Simple Linear Regression Results
• The results are displayed on the right.

JMP: Diagnostics
• Click on the Red Arrow next to Linear Fit and select Plot Residuals.

Diagnostic Plots
• The diagnostic plots are output on the right.

Multiple Linear Regression
• Similar to simple linear regression, except now there is more than one explanatory variable.
• Body fat can be difficult to measure. A researcher would like to come up with a model that uses the more easily obtained measurements of triceps skinfold thickness, thigh circumference and midarm circumference to predict body fat. (Kutner et al. pg. 256)
First Order Multiple Linear Regression Model
• The multiple linear regression model with p − 1 independent variables is given by

y_i = β0 + β1 x_{i,1} + β2 x_{i,2} + … + β_{p-1} x_{i,p-1} + ε_i

where β0, β1, …, β_{p-1} are parameters,

x_{i,1}, …, x_{i,p-1} are known constants, and

ε_i is the random error.

Multiple Linear Regression ANOVA Table

The ANOVA F-test tests

H0: β1 = β2 = … = β_{p-1} = 0 vs. Ha: not all of β1, …, β_{p-1} equal 0.

Tests can also be performed for individual parameters

(i.e. H0: βk = 0 vs. Ha: βk ≠ 0).

Coefficient of Multiple Determination
• The coefficient of multiple determination, R², is the percent of variation in the response y explained by the set of explanatory variables.
• The adjusted coefficient of determination, R²_a = 1 − ((n − 1)/(n − p))(SSE/SST), introduces a penalty for additional explanatory variables.
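The penalty is easy to see in a minimal sketch (the inputs are made-up numbers, and p counts the model parameters including the intercept):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R²: 1 − ((n − 1)/(n − p)) · (1 − R²)."""
    return 1 - (n - 1) / (n - p) * (1 - r2)

# Same R² = 0.90 with n = 20 observations:
print(round(adjusted_r2(0.90, 20, 5), 4))  # 5 parameters
print(round(adjusted_r2(0.90, 20, 6), 4))  # one extra (useless) predictor → lower value
```

If an added variable does not reduce SSE enough to offset the lost degree of freedom, the adjusted value drops even though plain R² can only increase.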
Assumptions of Multiple Linear Regression
• Observations are independent
• Based on how the data were collected (plot residuals in the order in which the data were collected).
• Constant variance
• Check using a residual plot (plot residuals vs. ŷ, and plot residuals vs. each predictor variable).
• The error terms are normally distributed.
• Check by making a histogram or normal quantile plot of the residuals.
Commercial Rental Rates
• A real estate company would like to build a model to help clients make decisions about properties. The company has information about rental rate (Y), age (X1), operating expenses and taxes (X2), vacancy rate (X3), and total square footage (X4) for luxury real estate in a specific location. (Kutner et al. pg. 251)
JMP: Commercial Rental Rates
• First, examine the data. Click Analyze, then Multivariate Methods, then Multivariate.
JMP: Scatterplot Matrix
• For Y, Columns enter Y, X1, X2, X3 and X4. Then click OK.
JMP: Fitting The Regression Model
• Click Analyze and then select Fit Model.
JMP: Fitting the Regression Model
• Y: Y. Then highlight X1, X2, X3 and X4 and click Add.

Then click Run.

Fitting the Model
• Examining the parameter estimates we see that X3 is not significant.
• Fit a new model, this time omitting X3.
JMP: Checking Assumptions
• The default output does not include the residuals. To save them:
• Click the red arrow next to Y Response → Save Columns → Residuals
JMP: Check Normality Assumption
• Analyze → Distribution → Y, Columns: Residual Y
• Click the red arrow next to Distribution Residual Y and select Normal Quantile Plot.
JMP: Checking Residuals vs. Independent Variables
• Analyze → Fit Y by X → Y, Columns: Residual Y; X, Factor: X1, X2, X4

Other Multiple Linear Regression Issues
• Outliers
• Higher Order Terms
• Interaction Terms
• Multicollinearity
• Model Selection
Regression with Categorical Variables
• Sometimes there are categorical explanatory variables that we would like to incorporate into our model.
• Suppose we would like to model the profit or loss of banks last year based on bank size and type of bank (commercial, mutual savings, or savings and loan). (Kutner et al. pg. 340)
Regression Model with Categorical Variables

y_i = β0 + β1 x_{i,1} + β2 x_{i,2} + β3 x_{i,3} + ε_i

where x_{i,1} is the size of bank i, and x_{i,2} and x_{i,3} are indicator variables coding the type of bank.

• Note: There are other ways the categorical variables could have been coded, but this is how JMP codes them.
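One common scheme can be sketched as follows: effect (+1/0/−1) coding, which JMP uses for nominal factors, though the exact level ordering JMP picks may differ. The function name and data below are illustrative only:

```python
def effect_code(categories):
    """Effect-code a categorical variable: k levels → k − 1 columns.
    Each non-last level gets its own +1 column; the last level is −1 in all columns."""
    levels = sorted(set(categories))
    cols = levels[:-1]                     # one column per level except the last
    rows = []
    for c in categories:
        if c == levels[-1]:
            rows.append([-1] * len(cols))  # last level: −1 everywhere
        else:
            rows.append([1 if c == lvl else 0 for lvl in cols])
    return cols, rows

cols, rows = effect_code(["commercial", "mutual savings", "savings and loan", "commercial"])
print(cols)  # → ['commercial', 'mutual savings']
print(rows)  # → [[1, 0], [0, 1], [-1, -1], [1, 0]]
```

With this coding the intercept estimates the average across levels, and each coefficient measures a level's deviation from that average.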
Regression with Categorical Variables
• A school district would like to determine if a new reading program improves student reading. The school district is also interested in the effect of days absent on reading improvement. Approximately half the students are assigned to the treatment group (new reading program) and half to the control group (traditional method). The students are tested at the beginning and end of the school year and the change in their score is recorded.
JMP Instructions
• Analyze  Fit Model

Y: Score Change

Days Absent

Run Model

Response Score Change  Estimates Show Prediction Expression

JMP Output
• Treatment and days absent had significant effects on improvement.
Diagnostics: Constant Variance
• Residual by Predicted plot produced automatically.
Diagnostics: Constant Variance
• Residual by Factor Plots
• First save residuals: Response Score Change → Save Columns → Residuals
• Produce plots: Analyze → Fit Y by X → Y, Response: Residuals Score Change; X, Factor: Treatment, Days Absent
Diagnostics: Normality
• Analyze  Distribution  Y, Columns: Residual Score Change
Conclusions
• Simple linear regression allows us to find the best-fit line between a continuous explanatory variable and a continuous response variable.
• Multiple linear regression allows us to explore the relationship between a continuous response variable and multiple explanatory variables. (It also allows higher-order terms to be introduced.)
• Regression with categorical variables allows us to incorporate categorical predictor variables into the model.
SAS, SPSS and R
• For information about using SAS, SPSS and R to do regression:

http://www.ats.ucla.edu/stat/sas/topics/regression.htm

http://www.ats.ucla.edu/stat/spss/topics/regression.htm

http://www.ats.ucla.edu/stat/r/sk/books_pra.htm

References
• Michael Sullivan III. Statistics Informed Decisions Using Data. Upper Saddle River, New Jersey: Pearson Education, 2004.
• Michael H. Kutner, Christopher J. Nachtsheim, John Neter and William Li. Applied Linear Statistical Models. New York: McGraw-Hill Irwin, 2005.