
# STA 106: Correlation and Linear Regression




Lecturer: Dr. Daisy Dai

Department of Medical Research

• Correlation

• Regression

• Simple Regression

• Multiple Regression

• Correlation and linear regression are techniques for dealing with the relationship between two or more continuous variables.

• In correlation we are looking for a linear association between two variables, and the strength of the association is summarized by the correlation coefficient (r) or the coefficient of determination ($r^2$).

• A survey was conducted on a sample of 20 women with anemia, randomly selected from a pre-defined geographical area. The participants had a blood sample taken and their hemoglobin (Hb) level and packed cell volume (PCV) measured. They were also asked their age and whether or not they had experienced menopause.

• The goals of the study were to determine whether Hb affects PCV or the other way around or whether Hb was associated with age.

• The Pearson product-moment correlation coefficient, also known as r, R, or Pearson's r, is a measure of the strength of the linear relationship between two variables. It is defined as the (sample) covariance of the variables divided by the product of their (sample) standard deviations.

Karl Pearson (1857 – 1936)

• The Pearson correlation coefficient, whether computed for a sample or for a population, ranges between -1 and 1.

• The absolute value of r indicates the strength of the correlation.

• The sign of r indicates the direction of the relationship. For r > 0, the two variables change in the same direction. For r < 0, the two variables are inversely related.
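The definition above (covariance divided by the product of the standard deviations) can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own computation: the function name `pearson_r` and the toy numbers are assumptions made for the example, and the data are not the Hb/PCV measurements from the study.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation: covariance / (sd_x * sd_y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and standard deviations, all with divisor n - 1.
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    return cov_xy / (sd_x * sd_y)

# Hypothetical toy data, chosen only to show the sign and range of r.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # increases with x
y_down = [10, 8, 6, 4, 2]   # decreases as x increases

r_up = pearson_r(x, y_up)       # close to +1: perfect positive linear relationship
r_down = pearson_r(x, y_down)   # close to -1: perfect inverse linear relationship
print(r_up, r_down)
```

The sign of the result gives the direction of the relationship and its absolute value gives the strength, matching the two bullets above.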

• The coefficient of determination, $r^2$, is the proportion of variation in the observed values of the response variable explained by the regression.

Coefficient of determination ($r^2$) = square of the correlation coefficient (r)

• The coefficient of determination always lies between 0 and 1 and is a descriptive measure of the utility of the regression equation for making predictions. A value near 0 indicates that the regression equation is not very useful for making predictions, whereas a value near 1 indicates that the regression equation is extremely useful for making predictions.

• Regression methods identify the associations between an outcome variable and one or more explanatory variables. The value of the outcome variable can be predicted from the values of the explanatory variables.

• The outcome variable, also called the dependent variable, appears on the left side of a regression model. The explanatory variable(s), also called independent variable(s), appear on the right side of the regression model.

Outcome variable  explanatory variable(s)

Birth weight = 0.2 + 0.4 * gestational age

• The relationship is summarized by a regression equation consisting of a slope and an intercept. The intercept is the constant term. The slope is the rate of change in the outcome variable with respect to the explanatory variable.

The following terms are used interchangeably:

• Outcome variable, dependent variable

Note: These are the phenomena whose variation we want to interpret and predict, for instance, response to treatment.

• Explanatory variable, independent variable, risk factor

Note: These are the variables that can be used to explain the variation in the outcome variable, for instance, demographics, environmental factors, genetic factors, and medical or educational interventions.

• To find the association between the age and price of Orion cars and to predict price from age, we randomly sampled 11 Orions and listed the data in the following table.

• Describe the apparent relationship between age and price of Orions.

Because the slope of the regression line is negative, price tends to decrease as age increases.

• Interpret the slope of the regression line in terms of prices for Orions.

Orions depreciate by an estimated \$2026 per year, at least in the 2- to 7-year-old range.

• Use the regression equation to predict the price of a 3-year-old Orion and a 4-year-old Orion.

• The regression involving one independent variable is called simple linear regression.

• Outcome variable  one explanatory variable

• y = a + b * x + error, where a is the intercept and b is the slope. When b = 0, y has no linear dependence on x (i.e., x and y are not correlated). When b > 0, x and y have a positive relationship. When b < 0, x and y have a negative (inverse) relationship.

• Height = 0.2 + 0.4 * weight

• The regression involving a set of independent variables is called multiple regression.

• Outcome variable  a set of explanatory variable

• y = a + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + … + error

• Weight = 0.2 + 0.4 * height + 0.3 * age
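As a rough sketch of how the coefficients a, b1, b2, … might be estimated, the code below fits a multiple regression by solving the least-squares normal equations with plain Gaussian elimination. The helper names `solve_linear_system` and `fit_multiple` are made up for this illustration, the (height, age) rows are hypothetical, and the outcome is generated exactly from the illustrative equation above with no error term, so the fit should simply recover 0.2, 0.4, and 0.3. In practice a statistics package would be used instead.

```python
def solve_linear_system(A, rhs):
    """Solve A * beta = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix, copied so the inputs are not modified.
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    beta = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (M[i][n] - tail) / M[i][i]
    return beta

def fit_multiple(rows, y):
    """Least-squares coefficients [a, b1, b2, ...] for y = a + b1*x1 + b2*x2 + ..."""
    X = [[1.0] + list(row) for row in rows]  # leading 1 carries the intercept
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical (height, age) rows; weights follow the illustrative equation exactly.
rows = [(150, 20), (160, 35), (170, 28), (180, 50), (165, 42)]
y = [0.2 + 0.4 * height + 0.3 * age for height, age in rows]

a, b1, b2 = fit_multiple(rows, y)
print(a, b1, b2)
```

With real data the outcome would include an error term, and the fitted coefficients would only approximate the underlying relationship.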

• Least-Squares criterion: The straight line that best fits a set of data points is the one having the smallest possible sum of squared errors.

• Regression line: The straight line that best fits a set of data points according to the least-squares criterion.

• Regression equation: The equation of the regression line.
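For a single explanatory variable, the least-squares line has a closed-form solution: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below uses hypothetical toy data, and the helper names `fit_line` and `sum_squared_errors` are assumptions for illustration. It also checks that moving the slope or intercept away from the fitted values increases the sum of squared errors.

```python
def fit_line(x, y):
    """Return (a, b) minimizing the sum of squared errors of y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx            # slope
    a = mean_y - b * mean_x    # intercept: line passes through (mean_x, mean_y)
    return a, b

def sum_squared_errors(x, y, a, b):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical toy data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
best = sum_squared_errors(x, y, a, b)

# Any other line does worse under the least-squares criterion.
worse_slope = sum_squared_errors(x, y, a, b + 0.1)
worse_intercept = sum_squared_errors(x, y, a + 0.1, b)
print(a, b, best, worse_slope, worse_intercept)
```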

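• Total sum of squares, SST: The variation in the observed values of the response variable: $SST = \sum_i (y_i - \bar{y})^2$

• Regression sum of squares, SSR: The variation in the observed values of the response variable explained by the regression: $SSR = \sum_i (\hat{y}_i - \bar{y})^2$

• Error sum of squares, SSE: The variation in the observed values of the response variable not explained by the regression: $SSE = \sum_i (y_i - \hat{y}_i)^2$

Here $y_i$ is an observed value, $\hat{y}_i$ is the corresponding predicted value, and $\bar{y}$ is the mean of the observed values.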

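The three sums of squares, SST, SSR, and SSE, can be obtained by using the following computing formulas (for simple linear regression with $n$ observations):

Total sum of squares: $SST = \sum_i y_i^2 - \left(\sum_i y_i\right)^2 / n$

Regression sum of squares: $SSR = \dfrac{\left[\sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)/n\right]^2}{\sum_i x_i^2 - \left(\sum_i x_i\right)^2 / n}$

Error sum of squares: $SSE = SST - SSR$

Regression identity: $SST = SSR + SSE$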
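The regression identity can be checked numerically. The sketch below computes the three sums of squares directly from their meanings (total variation about the mean, variation of the fitted values about the mean, and variation of the observations about the fitted values) for a small hypothetical data set. The function name `sums_of_squares` is an assumption for illustration.

```python
def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the least-squares line fitted to (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    y_hat = [a + b * xi for xi in x]                         # fitted values
    sst = sum((yi - mean_y) ** 2 for yi in y)                # total variation
    ssr = sum((fi - mean_y) ** 2 for fi in y_hat)            # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))    # unexplained variation
    return sst, ssr, sse

# Hypothetical toy data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sst, ssr, sse = sums_of_squares(x, y)
r_squared = ssr / sst   # proportion of variation explained by the regression

# For these numbers: SST = 6, SSR is about 3.6, SSE is about 2.4, r^2 is about 0.6
print(sst, ssr, sse, r_squared)
```

Here SSR/SST is the coefficient of determination, so about 60% of the variation in this toy outcome is explained by the fitted line.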

Case Study: Anemia in Women

• A random sample of 20 women with anemia, from a pre-defined geographical area, was investigated by a survey. They had a blood sample taken and their hemoglobin (Hb) level and packed cell volume (PCV) measured. They were also asked their age and whether or not they had experienced menopause.

• The goal of the study is to determine whether Hb affects PCV or the other way around.

Data

Outliers

• An outlier is a point that lies far from the regression line. Such points may represent measuring error, or may indicate heterogeneity in sampling.

• An outlier may skew the direction of the regression line and increase the variation in the data.

• Outliers need to be investigated and, where they reflect measurement or sampling error, removed from the analysis.

Influential Observations

• Influential observations are often points that lie far from the other data in the horizontal direction.

• Influential observations may have a significant impact on the slope of the regression line.

• One needs to compare the model fitted with the influential observations against the model fitted without them, and identify the reasons for the influential observations.

• Decide whether influential points need to be removed from studies.
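The comparison described above can be sketched by fitting the line twice, once with and once without the suspect point. The data below are hypothetical: the last point sits far to the right of the others, and the helper name `fit_line` is an assumption for illustration.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - b * mean_x, b

# Hypothetical toy data; the last observation is far from the rest in the x direction.
x = [1, 2, 3, 4, 5, 15]
y = [2.0, 2.9, 4.1, 5.0, 6.1, 6.0]

a_all, b_all = fit_line(x, y)                 # fitted with the distant point
a_trim, b_trim = fit_line(x[:-1], y[:-1])     # fitted without it

print("slope with the point:   ", b_all)
print("slope without the point:", b_trim)
```

In this toy example a single distant point pulls the slope from roughly 1.03 down to roughly 0.23, which is the kind of change that should prompt a closer look at why the observation differs.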

Residuals

• A residual is the discrepancy between the observed value and the predicted value.

• A residual plot is a useful diagnostic tool to check model assumptions and to detect outliers.
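A minimal sketch of computing residuals, assuming a simple least-squares line and hypothetical toy data in which one observation (at x = 5) sits well above the trend. The helper names are assumptions for illustration. Plotting the residuals against x would give the residual plot mentioned above; here the largest residual is simply located numerically.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - b * mean_x, b

def residuals(x, y, a, b):
    """Observed value minus predicted value for each data point."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Hypothetical toy data; the point at x = 5 is unusually high.
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.9, 8.0, 6.1]

a, b = fit_line(x, y)
res = residuals(x, y, a, b)

# Index of the observation with the largest absolute residual (a candidate outlier).
suspect = max(range(len(res)), key=lambda i: abs(res[i]))
print(res)
print("largest residual at x =", x[suspect])
```

For a least-squares line with an intercept, the residuals sum to zero, so a residual plot is centered on the horizontal axis.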

Extrapolation

• Whenever a linear regression model is fit to a group of data, the range of the data should be carefully observed. Attempting to use a regression equation to predict values outside of this range is often inappropriate, and may yield implausible answers.

• Consider, for example, a linear model which relates weight gain to age for young children. Applying such a model to adults, or even teenagers, would be absurd, since the relationship between age and weight gain is not consistent for all age groups.
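One way to guard against accidental extrapolation is to check each new value of the explanatory variable against the range seen when the model was fitted. The sketch below is only an illustration: the function name `predict_within_range`, the observed ages, and the coefficients are all hypothetical.

```python
def predict_within_range(x_new, x_observed, a, b):
    """Predict y = a + b * x_new, refusing to extrapolate beyond the observed x range."""
    low, high = min(x_observed), max(x_observed)
    if not low <= x_new <= high:
        raise ValueError(
            f"x = {x_new} is outside the observed range [{low}, {high}]"
        )
    return a + b * x_new

# Hypothetical ages (years) of the children the model was fitted on,
# with hypothetical coefficients chosen for illustration only.
observed_ages = [1, 2, 3, 4, 5, 6]
a, b = 2.0, 0.5

inside = predict_within_range(4, observed_ages, a, b)   # allowed: 4 lies within 1..6

try:
    predict_within_range(30, observed_ages, a, b)        # an adult: refused
    refused = False
except ValueError:
    refused = True

print(inside, refused)
```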

Correlation is not causation

• One of the most common errors in the medical literature is to assume that because two variables are correlated, one causes the other. Amusing examples include the positive correlation between the mortality rate in Victorian England and the number of Church of England marriages, and the negative correlation between monthly deaths from ischemic heart disease and monthly ice-cream sales. In each case, the fallacy is obvious because all the variables are time-related. In the former example, both the mortality rate and the number of Church of England marriages went down during the 19th century; in the latter, deaths from ischemic heart disease are higher in winter, when ice-cream sales are at their lowest. However, it is always worth trying to think of other variables, known as confounding factors, which may be related to both of the variables under study.
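The effect of a shared time trend can be illustrated with made-up numbers. In the sketch below, two hypothetical series both decline over time but are otherwise unrelated. Their raw correlation is strongly positive, yet once the linear time trend is removed from each series (a simple detrending step, introduced here as an assumption and not part of the original slides) the correlation largely disappears. The series names only echo the example above; the values are not historical data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def detrend(series, t):
    """Residuals of series after removing its least-squares linear trend in t."""
    n = len(t)
    mean_t = sum(t) / n
    mean_s = sum(series) / n
    b = (sum((ti - mean_t) * (si - mean_s) for ti, si in zip(t, series))
         / sum((ti - mean_t) ** 2 for ti in t))
    a = mean_s - b * mean_t
    return [si - (a + b * ti) for ti, si in zip(t, series)]

# Made-up series: both fall over time, with unrelated fluctuations on top.
t = list(range(8))
wiggle_a = [1, -1, 1, -1, 1, -1, 1, -1]
wiggle_b = [1, 1, -1, -1, 1, 1, -1, -1]
mortality = [30 - ti + w for ti, w in zip(t, wiggle_a)]
marriages = [80 - 2 * ti + w for ti, w in zip(t, wiggle_b)]

r_raw = pearson_r(mortality, marriages)
r_detrended = pearson_r(detrend(mortality, t), detrend(marriages, t))
print(r_raw, r_detrended)
```

Here time plays the role of the confounding factor: it is related to both series and accounts for nearly all of their apparent association.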

Points when performing correlation or regression

• Plot the data to see whether the relationship is likely to be linear.

• Are the variables normally distributed? If not, consider transforming the variables or switching to other models.

• Correlation does not necessarily imply causation.

• Think about confounding factors. If a significant correlation is obtained and the causation inferred, could there be a third factor, not measured, which is jointly correlated with the other two, and so accounts for their association?

• If a scatter plot is given to support a linear regression, is the variability of the points about the line roughly the same over the range of the independent variable? If not, then perhaps some transformation of the variables is necessary before computing the regression line.

• If predictions are given, are any made outside the range of the observed values of the independent variable?

• Outliers need to be investigated and, where they reflect measurement or sampling error, removed from the analysis.

Software

• The open source correlation coefficient calculator: http://www.easycalculation.com/statistics/correlation.php

• We will offer an SPSS workshop for correlation, linear and logistic regression analysis in April.
