
Lecture 4 – Linear Regression Analysis



  1. Lecture 4 – Linear Regression Analysis

  2. Steps in Data Analysis • Step 1: Collect and clean data (spreadsheet from heaven) • Step 2: Calculate descriptive statistics • Step 3: Explore graphics • Step 4: Choose outcome(s) and potential predictive variables (covariates) • Step 5: Pick an appropriate statistical procedure & execute • Step 6: Evaluate fitted model, make adjustments as needed

  3. Choice of Analysis Four Considerations 1) Purpose of the investigation • Descriptive orientation 2) The mathematical characteristics of the variables • Level of measurement (nominal, ordinal, continuous) and Distribution 3) The statistical assumptions made about these variables • Distribution, Independence, etc. 4) How the data are collected • Random sample, cohort, case control, etc.

  4. Simple Linear Regression • Purpose of analysis: To relate two variables, where we designate one as the outcome of interest (Dependent Variable or DV) and one or more as predictor variables (Independent Variables or IVs) • In general, we will consider k to represent the number of IVs; here k=1. • Given a sample of n individuals, we observe pairs of values for 2 variables (Xi, Yi) for each individual i. • Type of variables: Continuous (interval or ratio)

  5. Straight Line Regression Analysis

  6. Straight Line Regression Analysis

  7. Regression Analysis - Some Possible Goals • Characterize relationship by determining extent, direction, and strength of association between IVs and DV. • Predict DV as a function of IVs • Describe relationship between IVs and DV controlling for other variables (confounders) • Determine which IVs are important for predicting a DV and which ones are not. • Determine the best mathematical model for describing the relationship between IVs and a DV

  8. Regression Analysis - Some Possible Goals • Assess the interactive effects (effect modification) of 2 or more IVs with regard to a DV • Obtain a valid and precise estimate of 1 or more regression coefficients from a larger set of regression coefficients in a given model. NOTE: When we find statistically significant associations between IVs and a DV this does not imply that the particular IVs caused the DV to occur.

  9. Seven Criteria for Causation • Strength of association - does the association appear strong for a number of different studies? • Dose-response effect - The DV changes in a meaningful manner with changes in the IV • Lack of temporal ambiguity - The cause precedes the effect • Consistency of findings - Most studies show similar results • Biological and theoretical plausibility - The causal relationship is consistent with current biological and theoretical knowledge • Coherence of evidence - The findings do not seriously conflict with accepted facts about the DV being studied. • Specificity of association - The study factor is associated with only one effect

  10. Simple Linear Regression Model Yi = β0 + β1Xi + εi, i = 1,...,n, where: • Yi is the value of the response (outcome, dependent) variable for the ith unit (e.g., SBP) • β0 and β1 are parameters which represent the intercept and slope, respectively • Xi is the value of the predictor (independent) variable (e.g., age) for the ith unit. X is considered fixed - not random. • εi is a random error term that has mean 0 and variance σ²; εi and εj are uncorrelated for all i ≠ j, i=1,...,n
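As an illustration of the model's structure, here is a minimal Python (NumPy) sketch that simulates data from Yi = β0 + β1Xi + εi; the parameter values and sample size are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, chosen only for illustration
beta0, beta1, sigma = 50.0, 1.7, 9.0
n = 30

x = rng.uniform(20, 70, size=n)        # fixed predictor values (e.g., age)
eps = rng.normal(0.0, sigma, size=n)   # random errors with mean 0, variance sigma^2
y = beta0 + beta1 * x + eps            # Y_i = beta0 + beta1 * X_i + eps_i

print(y[:5])
```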

  11. Simple Linear Regression Model • Model is "simple" because there is only one independent variable. • Model is "linear in the parameters" because the parameters β0 and β1 do not appear as an exponent and they are not multiplied or divided by another parameter. • Model is also "linear in the independent variable" because this variable (Xi) appears only to the first power.

  12. Features of the model • The observed value of Y for the ith unit is the sum of 2 components: (1) the constant term β0 + β1Xi and (2) the random error term εi. Hence, Yi is a random variable. • Since εi has mean 0, Yi must have mean β0 + β1Xi: E(Yi|Xi) = E(β0 + β1Xi + εi) = β0 + β1Xi + E(εi) = β0 + β1Xi, where E = "expected value" = mean

  13. The fitted (or estimated) regression line is the expected value of Y at the given value of X, i.e., E(Y|X). [Figure: fitted regression line of Y on X]

  14. Residuals [Figure: residuals ε shown as vertical deviations of the observed Y values from the fitted line over X]

  15. Interpreting the Coefficients • Slope: expected change in Y per unit change in X • Intercept: expected value of Y when X = 0

  16. Linear Model Assumptions • Linear relationship between Y and X (i.e., only allow linear β’s) • Independent observations • Normally distributed residuals, in particular εi ~ N(0, σ²) • Equal variances across values of X (homogeneity of variance)
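Continuing from the simulated x and y above, a rough sketch of how these assumptions might be checked in Python; the residual-vs-fitted plot speaks to linearity and equal variance, and the Shapiro-Wilk test gives an informal check of normality.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# assumes x, y from the simulation sketch above
b1, b0 = np.polyfit(x, y, deg=1)     # least-squares slope and intercept
resid = y - (b0 + b1 * x)            # residuals e_i = y_i - yhat_i

# Residuals vs. fitted: look for curvature (nonlinearity) or a fan shape (unequal variance)
plt.scatter(b0 + b1 * x, resid)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Informal normality check of the residuals
print(stats.shapiro(resid))
```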

  17. Normality Assumption [Figure: at a given value of X (e.g., X1 = 10), Y is normally distributed around the line β0 + β1X1]

  18. Assumptions of the model • Homoscedasticity - the variance of Y is the same for any X [Figure: equal spread of Y about the regression line across values of X]

  19. Departures from Normality Assumption • If the normality assumption is not “badly” violated, the model is generally robust to departures from normality • If the normality assumption is badly violated, try a transformation of Y (e.g., the natural log) • If you transform the data, check whether the transformed Y is normally distributed and whether the variance homogeneity assumption holds – the two often go together (see the sketch below)
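A minimal sketch of the log-transformation idea, continuing from x and y above and assuming Y is strictly positive; after refitting on log(Y), the residuals should be re-examined for both normality and equal variance.

```python
import numpy as np

# assumes y > 0; transform the outcome and refit on the log scale
log_y = np.log(y)
b1_log, b0_log = np.polyfit(x, log_y, deg=1)
resid_log = log_y - (b0_log + b1_log * x)

# Re-examine spread of residuals on the transformed scale
print(resid_log.std(), np.abs(resid_log).max())
```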

  20. Other Assumptions • The “correct” model is fitted • All IVs included are truly related to the DV • No (conceivable) IVs related to the DV have been left out Violation of either of these assumptions can lead to “model misspecification bias”

  21. Model Hypothesis Tests • Null Hypothesis: The simple linear regression model does not fit the data better than the baseline model (β1 = 0) • Alternative Hypothesis: The simple linear regression model fits the data better than the baseline model (β1 ≠ 0)

  22. Fitting data to a linear model Linear Regression – determine the values of β0 and β1 that minimize Σi (Yi − β0 − β1Xi)²: the LEAST-SQUARES solution

  23. The Least-Squares Method • For each pair of observations (Xi, Yi), the method of least squares considers the deviation of Yi from its expected value, Yi − (β0 + β1Xi); the least-squares method finds the estimates of β0 and β1 that minimize the sum of these squared deviations. The least-squares regression line of Y on X is the line that makes the sum of the squares of the vertical distances of the data points from the line the smallest.
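Continuing from the simulated x and y, a short Python sketch of the closed-form least-squares estimates, slope = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and intercept = Ȳ − slope · X̄.

```python
import numpy as np

# Closed-form least-squares estimates for simple linear regression
xbar, ybar = x.mean(), y.mean()
b1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0_hat = ybar - b1_hat * xbar

print(b0_hat, b1_hat)   # should agree with np.polyfit(x, y, 1) (slope first)
```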

  24. The Least-Squares Method

  25. The Least Squares Method • The method of least squares is purely mathematical • However, statistically the least squares estimators are very appealing because they are the Best Linear Unbiased Estimators (BLUE) • This means that among all of the equations we could have picked to estimate β0 and β1, the least squares equations will give us estimates: • That have expectation β0 and β1 (unbiased) • That have minimum variance among all of the possible linear estimators for β0 and β1 (most efficient)

  26. Quality of Fit SSE = Σi (Yi − Ŷi)² is the sum of squares due to error (i.e., the sum of the squared residuals), the quantity we wish to “minimize”.

  27. Explained versus Unexplained Variability [Figure: for the response Y plotted against the predictor X, the total variability of Y about Ȳ splits into explained (regression) and unexplained (residual) components]

  28. Estimate of σ² • If SSE = 0, then the model is a perfect fit • SSE is affected by (1) large σ² (a lot of variability) and (2) nonlinearity • Need to look at both (1) and (2) • For now assume linearity, and estimate σ² as SSE/(n − 2). We use n − 2 because we estimate 2 parameters, β0 and β1 • SSE/(n − 2) is also known as the “mean squared error” or MSE
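Continuing from the estimates above, a short sketch computing SSE, MSE = SSE/(n − 2), and the root MSE (the estimate of σ).

```python
import numpy as np

# SSE = sum of squared residuals; MSE = SSE / (n - 2) estimates sigma^2
y_hat = b0_hat + b1_hat * x
sse = np.sum((y - y_hat) ** 2)
mse = sse / (len(y) - 2)        # n - 2 because two parameters are estimated
root_mse = np.sqrt(mse)         # estimate of sigma (the "Root MSE" in regression output)

print(sse, mse, root_mse)
```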

  29. Simple Linear Regression: How do I build my model? Using the tools of statistics… • First I use estimation, in particular least squares, to estimate β0 and β1 • Then I use my distributional assumptions to make inference about the estimates • Hypothesis testing, e.g., is the slope ≠ 0? • Interpretation – interpret in light of assumptions

  30. Hypothesis Testing for Regression Parameters Hypothesis testing: To test the hypothesis H0: β1 = β1(0), where β1(0) is some hypothesized value for β1, the test statistic used is T = (β̂1 − β1(0)) / SE(β̂1). This test statistic has a t distribution with n − 2 degrees of freedom. The CI is given by β̂1 ± t(n−2, 1−α/2) · SE(β̂1)
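Continuing from the quantities above (b1_hat, mse, x, y), a sketch of the slope test and confidence interval with SciPy; the hypothesized value β1(0) = 0 and the 95% level are illustrative choices.

```python
import numpy as np
from scipy import stats

# Standard error of the slope, t statistic for H0: beta1 = 0, and a 95% CI
se_b1 = np.sqrt(mse / np.sum((x - x.mean()) ** 2))
t_stat = (b1_hat - 0.0) / se_b1            # hypothesized value beta1(0) = 0
df_resid = len(y) - 2
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)

t_crit = stats.t.ppf(0.975, df_resid)      # two-sided 95% critical value
ci = (b1_hat - t_crit * se_b1, b1_hat + t_crit * se_b1)
print(t_stat, p_value, ci)
```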

  31. Timeout: The T-distribution • The t distribution (or Student’s t distribution) arises when we use an estimated variance to construct the test statistic • As n→∞, T→Z ~ N(0,1) • We have to pay a penalty for estimating σ²
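A small SciPy sketch of that penalty: the t critical value for a two-sided 5% test shrinks toward the standard normal's 1.96 as the degrees of freedom grow.

```python
from scipy import stats

# Penalty for estimating sigma^2: t critical values exceed the normal's,
# but converge to it as the degrees of freedom increase
for df in (5, 10, 30, 100, 1000):
    print(df, round(stats.t.ppf(0.975, df), 3))
print("normal", round(stats.norm.ppf(0.975), 3))   # 1.96
```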

  32. Timeout: The T-distribution Can think of the t distribution as a thick-tailed normal

  33. Inference concerning the Intercept To test the hypothesis H0: β0 = β0(0) we use the statistic T = (β̂0 − β0(0)) / SE(β̂0), which also has the t distribution with n − 2 degrees of freedom when H0: β0 = β0(0) is true. The CI is given by β̂0 ± t(n−2, 1−α/2) · SE(β̂0)

  34. Model Hypothesis Test • Null Hypothesis: The simple linear regression model does not fit the data better than the baseline model (β1 = 0) • Alternative Hypothesis: The simple linear regression model does fit the data better than the baseline model (β1 ≠ 0)

  35. Interpretations of Tests for Slope Failure to reject H0: β1 = 0 could mean: • Ȳ is essentially as good as the fitted line for predicting Y [Figure: scatterplot where the fitted line is essentially flat at Ȳ]

  36. Interpretations of Tests for Slope Failure to reject H0: β1 = 0 could mean: • The true relationship between Y and X is not linear (i.e., it could be quadratic or some other higher power) [Figure: curved scatterplot poorly described by a straight line] Dude, that’s why you always plot Y vs. X!

  37. Interpretations of Tests for Slope Failure to reject H0: β1 = 0 could mean: • We do not have enough power to detect a significant slope. Not rejecting H0: β1 = 0 implies that a straight-line model in X is not the best model to use, and does not provide much help for predicting Y (ignoring power)

  38. The Intercept We often leave the intercept, β0, in the model regardless of whether the hypothesis H0: β0 = 0 is rejected or not. This is because setting the intercept to zero forces the regression line through the origin (0,0), and rarely is this true.

  39. Example: SBP and Age • Regression of SBP on age: SBPi = β0 + β1 agei + εi

  40. Example: Regression of SBP on Age

Analysis of Variance
Source             DF   Sum of Squares   Mean Square
Model               1       4008.12372    4008.12372
Error              28       2319.37628      82.83487
Corrected Total    29       6327.50000

Root MSE  9.10137

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             54.21462         13.08530      4.14     0.0003
age          1              1.70995          0.24582      6.96     0.0001
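The lecture's SBP data are not reproduced in the transcript; the following is a sketch of how comparable output could be produced in Python with statsmodels, using a small made-up data frame (the sbp and age values below are hypothetical and will not reproduce the numbers above).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame standing in for the lecture's SBP/age data
df = pd.DataFrame({
    "age": [20, 25, 30, 35, 40, 45, 50, 55, 60, 65],
    "sbp": [118, 120, 126, 124, 132, 130, 138, 140, 144, 150],
})

fit = smf.ols("sbp ~ age", data=df).fit()
print(fit.summary())   # coefficient estimates, SEs, t statistics, p-values, R^2
```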

  41. Multiple Linear Regression • Response Variable: Y • Explanatory Variables: X1,..., Xk • Model (extension of simple regression): E(Y) = β0 + β1X1 + … + βkXk, V(Y) = σ² • Partial regression coefficients (βi): the effect of increasing Xi by 1 unit, holding all other predictors constant • Computer packages fit these models; hand calculation is very tedious

  42. Prediction Equation & Residuals • Model Parameters: b0, b1,…, bk, s • Estimators: • Least squares prediction equation: • Residuals: • Error Sum of Squares: • Estimated conditional standard deviation:

  43. Multiple Regression - Graphical Display • When there are 2 independent variables (X1 and X2) we can view the regression as fitting the best plane to the 3 dimensional set of points (as compared to the best line in simple linear regression) • When there are more than 2 IVs plotting becomes much more difficult

  44. Graphical Display – 2 IVs

  45. Standard Regression Output • Analysis of Variance: • Regression sum of squares: SSR = Σ(ŷ − ȳ)² • Error sum of squares: SSE = Σ(y − ŷ)² • Total sum of squares: TSS = Σ(y − ȳ)² • Coefficient of (multiple) determination: R² = SSR/TSS (the % of variation explained by the model) • Least Squares Estimates • Regression coefficients • Estimated standard errors • t-statistics • P-values (significance levels for 2-sided tests)
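Continuing from the multiple-regression sketch above (y2, sse2), a short illustration of the sums-of-squares decomposition and R² = SSR/TSS.

```python
import numpy as np

# R^2 = SSR / TSS = 1 - SSE / TSS, the proportion of variation explained by the model
tss = np.sum((y2 - y2.mean()) ** 2)   # total sum of squares
ssr = tss - sse2                      # regression sum of squares
r_squared = ssr / tss

print(r_squared)
```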

  46. Example: Brachial Reactivity Data

  47. Example: Brachial Reactivity [Figure: histograms, boxplots, and normal probability plots for Max Diameter Dilation Phase (mm) and Pre-cuff Baseline (mm)]

  48. Example: Brachial Reactivity [Figure: stem-and-leaf plots, boxplots, and normal probability plots for Time to Max Diameter (sec) and Age (years)]

  49. Max Diameter Dilation (mm) vs. Pre-cuff (mm)

  50. Regression of Max Diameter (mm) on Pre-cuff Baseline (mm), with R² reported
