PM 515 Behavioral Epidemiology Generalized Linear Regression Analysis

Ping Sun, Ph.D. and Jennifer Unger, Ph.D.

Topics
• Review
• Probit regression
• Introduction to logistic analysis
• Logistic analysis with data from a case-control study
• Graphical presentation of logistic regression results
• Polychotomous logistic regression
• Unconditional and conditional maximum likelihood methods in logistic regression
• Survival analysis
• Multi-level random coefficients modeling with a binary outcome
• Multi-level random coefficients modeling with a count outcome

Review: What is a linear relationship?
A function y = f(x) is linear if it satisfies two requirements:
• 1) Homogeneity: if y = f(x), then c * y = f(c * x), where c is a constant
• 2) Additivity: if y1 = f(x1) and y2 = f(x2), then y1 + y2 = f(x1 + x2)
• Example: fat intake amount and caloric intake from fat

Review: The formula for linear regression
Y = α + β x
If we define Y’ = Y – α, then Y’ = β * x. It follows that
1) c * Y’ = c * β * x = β * (c * x)
2) if Y’1 = β * x1 and Y’2 = β * x2, then Y’1 + Y’2 = β * x1 + β * x2 = β * (x1 + x2)
Y’ and x are linearly related, while Y and x in a regression formula are related by an affine function (a linear function plus a constant).
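The two requirements can be spot-checked numerically. The sketch below is only an illustration: the helper `is_linear`, the test points, and the coefficient values are all hypothetical, and checking a few points is not a proof of linearity.

```python
def is_linear(f, x1=2.0, x2=5.0, c=3.0, tol=1e-9):
    """Spot-check homogeneity and additivity of f at a few sample points."""
    homogeneous = abs(f(c * x1) - c * f(x1)) < tol
    additive = abs(f(x1 + x2) - (f(x1) + f(x2))) < tol
    return homogeneous and additive


# Hypothetical intercept and slope, chosen only for illustration.
alpha, beta = 1.5, 0.8


def pure_linear(x):
    """Y' = beta * x, the regression line after removing the intercept."""
    return beta * x


def affine(x):
    """Y = alpha + beta * x, the usual regression line."""
    return alpha + beta * x


print("Y' = beta*x passes:", is_linear(pure_linear))
print("Y = alpha + beta*x passes:", is_linear(affine))
```

The affine form fails both checks because the intercept is counted once on one side of each equation and more than once on the other.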

Review
The regular linear regression (Y = α + β x + ζ) does not, strictly speaking, model a linear association, because of the intercept α. It becomes a ‘pure’ linear relationship after a simple shift of Y, for example Y1 = Y – mean(Y), with x also centered at its mean.

Review: Linear regression analysis
Y = α + β x + ζ

Review: Straight-line linear regression analysis
Y = α + β x + ζ
Assumptions:
• ζ is independent and identically distributed (i.i.d.)
• ζ follows a Gaussian distribution
Simply put: the model assumes that ζ is normally distributed, and thus could theoretically range from -∞ to +∞.

Review: Another linear regression with a small transformation: log-linear regression analysis
• Log(Y) = α + β1 x1 + β2 x2 + ζ
• Y = exp(α + β1 x1 + β2 x2 + ζ) = exp(α) * exp(β1 x1) * exp(β2 x2) * exp(ζ)
• Two major differences from the previous linear model:
• 1) Y is proportional to exponential functions of x1 and x2, instead of x1 and x2 themselves
• 2) The contributions of x1 and x2 to Y are multiplicative, not additive
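A small numerical sketch of the multiplicative property, with the error term left out. The coefficients and the function name `predict_y` are hypothetical and serve only to illustrate the point.

```python
import math


def predict_y(alpha, beta1, beta2, x1, x2):
    """Y on the original scale for log(Y) = alpha + beta1*x1 + beta2*x2 (no error term)."""
    return math.exp(alpha + beta1 * x1 + beta2 * x2)


# Hypothetical coefficients, not estimates from any data set.
alpha, beta1, beta2 = 0.5, 0.2, -0.1

base = predict_y(alpha, beta1, beta2, x1=1.0, x2=3.0)
plus_one = predict_y(alpha, beta1, beta2, x1=2.0, x2=3.0)

# A one-unit increase in x1 multiplies Y by exp(beta1), whatever the value of x2.
ratio = plus_one / base
print("ratio of predictions:", round(ratio, 4))
print("exp(beta1):          ", round(math.exp(beta1), 4))
```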

Review: Straight-line linear regression analysis
Y = α + β x + ζ, where ζ follows a Gaussian distribution
What if Y is dichotomous (binary)? The assumptions are violated!

Review: Straight-line linear regression analysis
Y = α + β x + ζ
Now Y is dichotomous (binary), so ζ is certainly bounded.
How do we deal with this kind of violation of the assumptions of basic regression?

Method #1: Just treat it as a continuous outcome
• When the mean of the binary outcome is not extreme (not too close to 0 or 1)
• In large-scale preliminary exploratory analysis

[Figure: Prevalence of daily smoking among Chinese youth and middle-aged adults, by gender and age group, CSCS pilot survey conducted in 2002]

Review: Straight-line linear regression analysis
Y = α + β x + ζ, where ζ follows a Gaussian distribution
What if Y is dichotomous (binary)? How do we deal with this kind of violation of the assumptions of basic regression?
Conduct a transformation to make it linear!

Generalized Linear Regression: Probit Conversion of Y
• Can we somehow convert the binary indicator Y to another variable, and then conduct the linear regression analysis?
• Answer: Yes and no
• Yes: conceptually
• No: algorithmically (each observed Y of 0 or 1 would map to -∞ or +∞, so the model is fitted by maximum likelihood instead)

Basic Requirements for the Candidate Transformations
Y = α + β x + ζ becomes η = a + b x + ζ (Y → η)
where Y is dichotomous with possible values of 1 or 0, and η is the transformed Y.
• η needs to be a monotonic function of p(Y=1): the higher p(Y=1) is, the larger the value of η
• η needs to have a possible range of (-∞, +∞)
• Presumably η = -∞ when p(Y=1) = 0, and η = +∞ when p(Y=1) = 1
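As a rough check, the two transformations used later in this deck (the probit, which is the inverse of the standard normal cumulative distribution, and the logit) can be evaluated on a grid of probabilities. This is a sketch, not part of the original slides; the function names and the grid are illustrative choices.

```python
import math
from statistics import NormalDist


def probit(p):
    """Inverse of the standard normal cumulative distribution function."""
    return NormalDist().inv_cdf(p)


def logit(p):
    """Log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))


probs = [0.001, 0.1, 0.5, 0.9, 0.999]
probit_values = [probit(p) for p in probs]
logit_values = [logit(p) for p in probs]

# Both transformations increase with p and grow without bound as p nears 0 or 1.
for p, z, l in zip(probs, probit_values, logit_values):
    print(f"p={p:<6} probit={z:8.3f} logit={l:8.3f}")
```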

Method 1: Probit Regression
[Figure: normal distribution of η, with regions labeled Y=1 and Y=0]
• The binary variable Y is a ‘manifestation’ of another variable η
• η may be measurable, or it may be latent and not directly measurable
• Examples: Y = obese, η = age- and gender-adjusted BMI; Y = CVD, η = disease process

Method 1: Probit Regression
[Figure: normal distribution of η, labeled Y=1, Y=0 and Y=0.5]
As before, the binary variable Y is a ‘manifestation’ of η; here η is scaled to have mean = 0 and variance = 1.

Method 1: Probit Regression
[Figure: horizontal axis η]

Method 1: Probit Regression
Instead of modeling P(Y=1) = α1 + β1 x directly (Y is binary), model
η = α2 + β2 x (η is continuous)
• With α2 and β2 estimated, η can be computed for each value of x
• From the estimated η, a probability p can then be computed for each value of x by inverting the z score, that is, p = Φ(η), where Φ is the standard normal cumulative distribution function

Method 1: Probit Regression: Results Interpretation
[Figure: distribution of η, with regions labeled Y=1 and Y=0]
Pr(Y=1) = α1 + β1 x (Y is binary)
η = α2 + β2 x (η is continuous)
How do we compare the estimates for two values of x (x1 vs. x2)?
x1 → η1 → P1
x2 → η2 → P2
With two z-score table lookups, we can then compare P1 and P2.

Method 1: Probit Regression Example
proc probit data=CSCS.youth;
  class B_monthly_smoking;
  model B_monthly_smoking = allowance /* other covariates */;
run;

Method 1: Probit Regression Example
SAS output for male students:

                                Standard     95% Confidence       Chi-
Parameter          DF  Estimate    Error         Limits         Square  Pr > ChiSq
Intercept           1   -1.4661   0.0490   -1.5620   -1.3701    896.63      <.0001
allowance           1    0.0056   0.0011    0.0035    0.0077     27.28      <.0001
perceived_smoking   1    0.1674   0.0276    0.1134    0.2215     36.88      <.0001
friends_smoking     1    0.5015   0.0311    0.4405    0.5625    259.40      <.0001
(other covariates) …

In other words: η(monthly_smoking) = -1.4661 + 0.0056 * allowance + …

Method 1: Probit Regression Example: Smoking and allowance in male Chinese youth
η(smoking) = -1.4661 + 0.0056 * allowance + …
If the other covariates were centered to mean = 0 before the analysis, then for the average youth in the sample:
• When allowance = 20 yuan/wk: η(smoking) = -1.35, so p(smoking) = 0.0879
• When allowance = 60 yuan/wk: η(smoking) = -1.13, so p(smoking) = 0.1292
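The z-score lookup can also be done in code. The sketch below uses the intercept and allowance coefficient from the probit output and assumes, as the slide does, that all other covariates are held at 0; the function name is illustrative. It should reproduce the probabilities quoted on the slide up to rounding.

```python
from statistics import NormalDist

# Estimates taken from the probit output for male students.
INTERCEPT = -1.4661
BETA_ALLOWANCE = 0.0056


def probit_probability(allowance):
    """Return (eta, p) for a given weekly allowance, other covariates held at 0."""
    eta = INTERCEPT + BETA_ALLOWANCE * allowance
    p = NormalDist().cdf(eta)  # standard normal CDF replaces the z-table lookup
    return eta, p


eta20, p20 = probit_probability(20)
eta60, p60 = probit_probability(60)

print(f"allowance=20: eta={eta20:.2f}, p={p20:.4f}")
print(f"allowance=60: eta={eta60:.2f}, p={p60:.4f}")
```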

Method 1: Probit Regression Example: Smoking and allowance in Chinese youth
[Figure; annotation: P=0.24]

Method 1: Probit Regression Example
When allowance = 20 → p(smoking) = 0.09
When allowance = 60 → p(smoking) = 0.13
Likely presentation in a paper: Allowance was significantly and positively related to monthly smoking in Chinese male adolescents (p<0.0001). Among adolescents who received a weekly allowance of 20 yuan, an estimated 9% smoked during the 30 days before the survey; among those who received a weekly allowance of 60 yuan, an estimated 13% did.

Is there any other way that will make it a little easier to interpret and perceive the results?

Method 2: Logistic Regression
• Probability P: ranges from 0 to 1
• Odds P/(1-P): ranges from 0 to +∞
• Log of odds Log(P/(1-P)): ranges from -∞ to +∞

Method 2: Logistic Regression
η = Log(P/(1-P))
Exp(η) = P/(1-P)
(1-P) * Exp(η) = P
Exp(η) = P + P * Exp(η)
P = Exp(η) / (1 + Exp(η)) = 1 / (1 + Exp(-η))
Or: P = logistic(η) = 1 / (1 + e^(-η))
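The derivation says that the logit and the logistic function undo each other. A minimal sketch of that round trip follows; the function names and the probabilities tried are illustrative.

```python
import math


def logit(p):
    """eta = log(p / (1 - p))"""
    return math.log(p / (1 - p))


def logistic(eta):
    """p = 1 / (1 + exp(-eta))"""
    return 1 / (1 + math.exp(-eta))


def logistic_alt(eta):
    """Equivalent form p = exp(eta) / (1 + exp(eta))"""
    return math.exp(eta) / (1 + math.exp(eta))


for p in (0.1, 0.5, 0.9):
    eta = logit(p)
    # Converting p to eta and back should return the original p.
    print(f"p={p}: eta={eta:.4f}, back to p={logistic(eta):.4f}")
```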

Method 2: Logistic Regression
[Figure: logistic distribution of η]
Mean = 0, variance = π²/3 ≈ 3.29

Similarities and differences between the Logit function for logistic regression and Gaussian probability function for Probit regression
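One way to see how close the two curves are is to compare the logistic cumulative distribution with a normal cumulative distribution that has the same mean (0) and the same variance (π²/3). The sketch below does this on an arbitrary grid; the grid and the function names are assumptions made for illustration.

```python
import math
from statistics import NormalDist

# Standard deviation of the standard logistic distribution: sqrt(pi^2 / 3), about 1.81.
LOGISTIC_SD = math.pi / math.sqrt(3)


def logistic_cdf(eta):
    """Cumulative distribution function of the standard logistic distribution."""
    return 1 / (1 + math.exp(-eta))


def matched_normal_cdf(eta):
    """Normal CDF with mean 0 and the same variance as the standard logistic."""
    return NormalDist(0, LOGISTIC_SD).cdf(eta)


grid = [x / 2 for x in range(-12, 13)]  # eta from -6 to 6 in steps of 0.5
max_gap = max(abs(logistic_cdf(e) - matched_normal_cdf(e)) for e in grid)

print("largest gap between the two curves on the grid:", round(max_gap, 4))
```

Both curves are symmetric S-shapes through p = 0.5 at η = 0; on this grid they differ by only a few percentage points at most, while the logistic curve has somewhat heavier tails.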

Method 2: Logistic Regression
Instead of modeling P(Y=1) = α1 + β1 x directly (Y is binary), model
Logit(p) = log[p/(1-p)] = η = α2 + β2 x (η is continuous)
• With α2 and β2 estimated, logit(p) can be computed for each value of x
• From the estimated η = logit(p), a probability p can then be computed for each value of x

Method 2: Logistic Regression
Logit(p) = log(p/(1-p)) = η = α2 + β2 x
• With logit(p) calculated, what can be inferred from the results?
• log(p/(1-p)) = α2 + β2 x
• p/(1-p) = exp(α2 + β2 x)
• → The odds (p/q, where q = 1-p) can be calculated for each value of x

Method 2: Logistic Regression
Odds = p/(1-p) = exp(α2 + β2 x)
• Odds1 = p1/(1-p1) = exp(α2 + β2 x1)
• Odds2 = p2/(1-p2) = exp(α2 + β2 x2)
• Odds Ratio = Odds1 / Odds2
  = exp(α2 + β2 x1) / exp(α2 + β2 x2)
  = exp(β2 x1) / exp(β2 x2)
  = exp(β2 (x1 - x2))
• The OR can be readily calculated for any two values of x, and it is a function only of x1 - x2, that is, the change in x (Δx). Thus, we only need to say something like: for an increase of 1 unit in x, the OR is …
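A short numerical sketch of the last point: the intercept and the starting value of x cancel out of the odds ratio. All coefficient values below are hypothetical.

```python
import math


def odds(alpha, beta, x):
    """Odds p/(1-p) under logit(p) = alpha + beta * x."""
    return math.exp(alpha + beta * x)


def odds_ratio(alpha, beta, x1, x2):
    """Odds at x1 divided by odds at x2."""
    return odds(alpha, beta, x1) / odds(alpha, beta, x2)


beta = 0.3  # hypothetical slope

# Same difference x1 - x2 = 2, but different intercepts and different starting points.
or_a = odds_ratio(alpha=-2.0, beta=beta, x1=5, x2=3)
or_b = odds_ratio(alpha=1.0, beta=beta, x1=12, x2=10)

print(round(or_a, 4), round(or_b, 4), round(math.exp(beta * 2), 4))
```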

Question 1
If the OR for the onset of smoking is 2 for a one-year increase in age, what will the OR be for a 3-year increase in age?
OR(Δx=1) = exp(β Δx) = exp(β) = 2
OR(Δx=3) = exp(β Δx) = exp(3β) = (exp(β))^3 = 2^3 = 8
Answer: For each 3-year increase in age, the OR for the onset of smoking will be 8.
Remember: logit(p) is a linear function of x, but the odds [p/(1-p)] and the odds ratio [p1/(1-p1)] / [p2/(1-p2)] are NOT linear functions of x!
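The same rescaling works for any change in x. In the sketch below, `rescale_odds_ratio` is a hypothetical helper and the baseline odds of 0.1 is an arbitrary value used only to show that the odds grow by multiplication, not by equal steps.

```python
import math


def rescale_odds_ratio(or_per_unit, delta_x):
    """Convert an OR for a 1-unit change in x into the OR for a delta_x-unit change."""
    beta = math.log(or_per_unit)
    return math.exp(beta * delta_x)


or_3_years = rescale_odds_ratio(2.0, 3)
print("OR for a 3-year increase:", round(or_3_years, 4))

# Hypothetical baseline odds; each extra year doubles the odds,
# so the year-to-year increments keep growing.
base_odds = 0.1
odds_by_year = [base_odds * rescale_odds_ratio(2.0, k) for k in range(4)]
print([round(o, 3) for o in odds_by_year])
```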

Case-Control Study: a special case of logistic regression
OR (male vs. female) = Odds(male) / Odds(female) = 2/3

Case-Control Study: a special case of logistic regression
data t;
  x=0; y=0; weight=100; output;
  x=1; y=0; weight=200; output;
  x=0; y=1; weight=300; output;
  x=1; y=1; weight=400; output;
run;
proc logistic descending data=t;
  model y = x;
  freq weight;
run;

Output:

                        Standard        Wald
Parameter  DF  Estimate    Error  Chi-Square  Pr > ChiSq
Intercept   1    1.0986   0.1155     90.5160      <.0001
x           1   -0.4054   0.1443      7.8899      0.0050

Odds Ratio Estimates
           Point        95% Wald
Effect  Estimate   Confidence Limits
x          0.667     0.502     0.885
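With a single binary predictor, the same numbers can be recovered by hand from the four cell counts. The sketch below uses the counts from the DATA step, the usual standard error of a log odds ratio from a 2×2 table (square root of the sum of reciprocal cell counts), and 1.96 as the normal critical value; the variable names are illustrative.

```python
import math

# Cell counts from the DATA step, keyed by (x, y).
counts = {(0, 0): 100, (1, 0): 200, (0, 1): 300, (1, 1): 400}

odds_x0 = counts[(0, 1)] / counts[(0, 0)]  # odds of y=1 when x=0
odds_x1 = counts[(1, 1)] / counts[(1, 0)]  # odds of y=1 when x=1

intercept = math.log(odds_x0)   # log odds in the reference group (x=0)
odds_ratio = odds_x1 / odds_x0  # OR for x=1 vs. x=0
log_or = math.log(odds_ratio)   # corresponds to the coefficient of x

# Standard error of the log OR from a 2x2 table.
se_log_or = math.sqrt(sum(1 / n for n in counts.values()))

z = 1.96  # approximate 97.5th percentile of the standard normal
ci_low = math.exp(log_or - z * se_log_or)
ci_high = math.exp(log_or + z * se_log_or)

print(f"intercept={intercept:.4f}, log OR={log_or:.4f}, SE={se_log_or:.4f}")
print(f"OR={odds_ratio:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```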

Logistic Regression Example
proc logistic descending data=CSCS.youth;
  model B_monthly_smoking = allowance /* other covariates */;
run;

Logistic Regression Example: Smoking and allowance in male Chinese youth
SAS Output
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates

                                Standard        Wald
Parameter          DF  Estimate    Error  Chi-Square  Pr > ChiSq
Intercept           1   -2.6092   0.0939    772.9237      <.0001
allowance           1    0.0103  0.00195     27.7795      <.0001
perceived_smoking   1    0.2976   0.0495     36.1555      <.0001
friends_smoking     1    0.9048   0.0579    244.0508      <.0001
(other covariates) …

Logistic Regression Example: Smoking and allowance in male Chinese youth
logit(smoking) = -2.61 + 0.01 * allowance + …
Whether or not the covariates were centered to mean 0, we can always calculate:
Log(OR(allowance=60 vs. allowance=20)) = (60-20) * 0.01 = 0.4
OR(allowance=60 vs. allowance=20) = exp(0.4) = 1.49

Logistic Regression Example: Smoking and allowance in male Chinese youth
logit(smoking) = -2.61 + 0.01 * allowance + …
OR(allowance=60 vs. allowance=20) = 1.49
Likely wording in a paper: Weekly allowance among Chinese boys was positively related to cigarette smoking in the last 30 days (p<0.0001). The odds of monthly smoking were 1.49 times as high (95% CI: 1.28-1.74) for each additional 40 yuan of weekly allowance.

Question 2
How was the 95% CI calculated? Show it in Excel: beta ± se for 1 yuan is 0.0103 ± 0.00195.
Question: Somebody reported in a manuscript an OR and its 95% CI as OR (95% CI) = 2 (1-3). Is there anything wrong with the numbers?
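One way to do the calculation outside Excel is sketched below. It assumes a Wald interval built on the log-odds scale and then exponentiated, and it uses the rounded coefficient 0.01 together with the standard error 0.00195, which reproduces the interval quoted for a 40 yuan difference; using the unrounded 0.0103 gives slightly different values. The function name is illustrative.

```python
import math
from statistics import NormalDist


def odds_ratio_with_ci(beta, se, delta_x, level=0.95):
    """Wald OR and confidence limits for a delta_x-unit change in x."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    log_or = beta * delta_x
    half_width = z * se * delta_x
    return (math.exp(log_or),
            math.exp(log_or - half_width),
            math.exp(log_or + half_width))


or_40, lower, upper = odds_ratio_with_ci(beta=0.01, se=0.00195, delta_x=40)
print(f"OR={or_40:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")

# An interval built this way is symmetric on the log scale, so the OR is the
# geometric mean of its limits. For reported limits of 1 and 3 that would be:
implied_or = math.sqrt(1 * 3)
print("OR implied by limits 1 and 3:", round(implied_or, 2))
```

Because the limits are symmetric around log(OR) rather than around the OR itself, the reported point estimate should equal the square root of the product of the two limits, which gives a quick consistency check on published numbers.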

Logistic vs. Probit Regression: calculating the OR from probit output
Probit:
• P(allowance=60) = 0.13, so Odds(allowance=60) = p/q = 0.13/(1-0.13) = 0.15
• P(allowance=20) = 0.09, so Odds(allowance=20) = p/q = 0.09/(1-0.09) = 0.10
• OR(allowance=60 vs. allowance=20) = Odds(allowance=60) / Odds(allowance=20) = 0.15 / 0.10 = 1.50, p<0.0001
Logistic:
• OR = 1.49, p<0.0001
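The conversion from probit probabilities to an odds ratio is easy to script. The sketch below runs it twice: once with the rounded probabilities shown here (0.09 and 0.13) and once with the four-decimal probabilities from the earlier probit example (0.0879 and 0.1292), to show how much rounding moves the result. The function name is illustrative.

```python
def odds(p):
    """Odds p / q with q = 1 - p."""
    return p / (1 - p)


# Rounded probit probabilities for allowance = 20 and 60 yuan per week.
or_rounded = odds(0.13) / odds(0.09)

# The same probabilities carried to four decimals.
or_four_decimals = odds(0.1292) / odds(0.0879)

print("OR from rounded probabilities:     ", round(or_rounded, 2))
print("OR from four-decimal probabilities:", round(or_four_decimals, 2))
```

Either way the probit-based odds ratio lands close to the logistic estimate of 1.49.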

Logistic vs. Probit Regression: estimating percentages from logistic output
Logistic: logit(smoking) = -2.61 + 0.01 * allowance + …
OR(allowance=60 vs. allowance=20) = 1.49
Lab Exercise 1: Calculate the percentages when allowance = 20 and when allowance = 60. Then compare the percentages with the results from the probit analysis.

Graphical Presentation of Results from Logistic Regression
[Figure: Log(Odds) plotted against X; annotations: P<0.0001, P=0.17]
Log(Odds) is a linear function of X

Graphical Presentation of Results from Logistic Regression Odds is no longer a linear function of X

Graphical Presentation of Results from Logistic Regression
[Figure: OR of monthly smoking among boys (compared with allowance = 20)]
When talking about an OR, remember that it is a comparison of two odds.

Graphical Presentation of Results from Logistic Regression
[Figure: OR of monthly smoking among boys (compared with allowance = 60)]

Graphical Presentation of Results from Logistic Regression: SAS Statements
Convert a continuous X to a categorical one
data d2;
  set d1_out;
  allowance_1=0; allowance_2=0; allowance_3=0; allowance_4=0; allowance_5=0;
  /* Note: the indicators below are coded 2, 3 and 4 rather than 1, so each
     estimate is the log odds ratio vs. the reference group (allowance_1)
     divided by that code. */
  if 0.0 <= allowance <= 2.5 then allowance_1=1;
  else if 7.5 <= allowance <= 7.5 then allowance_2=2;
  else if 15.0 <= allowance <= 15.0 then allowance_3=3;
  else if 25.0 <= allowance <= 35.0 then allowance_4=4;
  else if 45.0 <= allowance then allowance_5=1;
  if male=1; /* keep male students only */
run;
proc logistic descending data=d2;
  model monthcig1 = allowance_2 allowance_3 allowance_4 allowance_5 /* other covariates */;
run;


Graphical Presentation of Results from Logistic Regression: Output
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates

                          Standard        Wald
Parameter    DF  Estimate    Error  Chi-Square  Pr > ChiSq
Intercept     1   -2.7529   0.1337    424.0091      <.0001
allowance_2   1    0.0258   0.0898      0.0827      0.7736
allowance_3   1    0.1780   0.0543     10.7411      0.0010
allowance_4   1    0.1254   0.0398      9.9170      0.0016
ALLOWANCE_5   1    0.8207   0.1623     25.5796      <.0001
(other terms for the covariates)
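To plot these results as odds ratios against the lowest allowance group, each estimate has to be multiplied by the value its indicator takes before exponentiating. The sketch below assumes the output came from the DATA step as shown, in which the indicators for groups 2, 3 and 4 were set to 2, 3 and 4 (not 1) and the indicator for group 5 was set to 1; if the indicators had been coded 0/1, the multipliers would all be 1. The dictionary layout is illustrative.

```python
import math

# (estimate from the output, value assigned to the indicator in the DATA step)
estimates = {
    "allowance_2": (0.0258, 2),
    "allowance_3": (0.1780, 3),
    "allowance_4": (0.1254, 4),
    "allowance_5": (0.8207, 1),
}

# OR vs. the reference group = exp(estimate * indicator value).
odds_ratios = {name: math.exp(beta * code) for name, (beta, code) in estimates.items()}

for name, value in odds_ratios.items():
    print(f"{name}: OR vs. allowance_1 = {value:.2f}")
```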
