
Design and Analysis of Clinical Study 9. Analysis of Cross-sectional Study



  1. Design and Analysis of Clinical Study 9. Analysis of Cross-sectional Study Dr. Tuan V. Nguyen Garvan Institute of Medical Research Sydney, Australia

  2. Overview • Estimate of prevalence • Analysis of difference between two proportions • Analysis of difference among proportions: Chi-square • Analysis of difference between two means • Analysis of association I: simple linear regression analysis • Analysis of association II: multiple regression analysis

  3. Prevalence of Disease • Prevalence is NOT incidence • It measures the number of people in a population who have the disease at a given point in time • This measure is called point prevalence, in contrast to the infrequently used period prevalence, which adds new cases occurring during a time period to the cases existing at the start of that period • Prevalence is a measure of disease status and disease burden • In contrast, incidence measures disease onset events

  4. [Figure: follow-up timelines of 5 subjects up to time T] Prevalence: at time T, 2 out of 5 subjects had the disease; P = 2/5 = 0.4

  5. Sampling Variability in Prevalence • Prevalence in the population (π) is UNKNOWN • The sample prevalence (p) is an unbiased estimate of π • x = number of diseased individuals in the sample, N = sample size • Estimate: p = x/N • Variance of p: p(1 - p)/N • Standard error of p: SE(p) = sqrt[p(1 - p)/N] • 95% CI of p: p ± 1.96 × SE(p)

  6. An Example of Calculation of Prevalence • The prevalence of ABO hemolytic disease in a population is estimated from 43 cases among 3584 subjects • Estimated prevalence: p = 43/3584 = 0.012 • Standard error of the prevalence: SE = sqrt[0.012 × (1 - 0.012)/3584] ≈ 0.0018 • 95% confidence interval: 0.012 ± (1.96 × 0.0018) = 0.008 to 0.016
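A minimal R sketch of this calculation (the counts 43 and 3584 come from the slide; binom.test is shown only as a cross-check and is not part of the original slides):

x <- 43                          # number of diseased individuals
N <- 3584                        # sample size
p <- x / N                       # estimated prevalence
se <- sqrt(p * (1 - p) / N)      # standard error of the prevalence
ci <- p + c(-1.96, 1.96) * se    # approximate 95% confidence interval
round(c(prevalence = p, se = se, lower = ci[1], upper = ci[2]), 4)
binom.test(x, N)$conf.int        # exact binomial interval, for comparison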

  7. Test for Difference Between Two Proportions • p1 = proportion for group 1, p2 = proportion for group 2 • N1 = sample size for group 1, N2 = sample size for group 2 • d = p1 - p2 • Variance of d: s² = p1(1 - p1)/N1 + p2(1 - p2)/N2 • z-test: z = d / sqrt(s²) • Example: d = 0.268 - 0.211 = 0.057; variance of d: s² = 0.000238 + 0.000152 = 0.00039; z-test: z = 0.057 / sqrt(0.00039) ≈ 2.9. Significant!
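A hedged R sketch of the same two-proportion z-test; the proportions 0.268 and 0.211 are from the slide, but the sample sizes n1 and n2 below are hypothetical placeholders chosen only to give variance terms of roughly the size shown above:

p1 <- 0.268; n1 <- 824     # n1 and n2 are assumed values, not reported on the slide
p2 <- 0.211; n2 <- 1095
d <- p1 - p2                                        # difference between the proportions
var.d <- p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2    # variance of the difference
z <- d / sqrt(var.d)                                # z-statistic
p.value <- 2 * (1 - pnorm(abs(z)))                  # two-sided p-value
round(c(d = d, z = z, p = p.value), 4)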

  8. Test for Difference Among Proportions

Observed frequencies (caffeine consumption):

Marital status    None   1-150   151-300   300-900   Total
Married            652    1537       598       242    3029
Divorced            36      46        38        21     141
Single             218     327       106        67     718
Total              906    1910       742       330    3888

Row proportions (each cell divided by its row total, e.g. 652/3029 = 0.22):

Marital status    None   1-150   151-300   300-900   Total
Married           0.22    0.51      0.20      0.08    1.00
Divorced          0.26    0.33      0.27      0.15    1.00
Single            0.30    0.46      0.15      0.09    1.00
Total             0.23    0.49      0.19      0.08    1.00

  9. Test for Difference Among Proportions (continued)

Expected frequencies under the null hypothesis of no association: (row total × column total) / grand total, e.g. for Married/None: 3029 × 906 / 3888 = 705.8

Expected frequencies:

Marital status    None    1-150   151-300   300-900   Total
Married          705.8   1488.0     578.1     257.1    3029
Divorced          32.9     69.3      26.9      12.0     141
Single           167.3    352.7     137.0      60.9     718
Total              906     1910       742       330    3888

  10. Test for Difference Among Proportions (continued)

Observed (expected) frequencies:

Marital status    None           1-150           151-300        300-900
Married           652 (705.8)    1537 (1488.0)   598 (578.1)    242 (257.1)
Divorced           36 (32.9)       46 (69.3)      38 (26.9)      21 (12.0)
Single            218 (167.3)     327 (352.7)    106 (137.0)     67 (60.9)

Each cell contributes (O - E)² / E, e.g. (652 - 705.8)² / 705.8 = 4.11 and (1537 - 1488)² / 1488 = 1.61:

Marital status    None    1-150   151-300   300-900   Total
Married           4.11     1.61      0.69      0.89     7.30
Divorced          0.30     7.82      4.57      6.82    19.51
Single           15.30     1.88      7.02      0.60    24.86
Total            19.77    11.31     12.28      8.31    51.66

Chi-square = 51.66 with df = (3 - 1) × (4 - 1) = 6; the critical value at α = 0.05 is 12.59, so the differences among the proportions are statistically significant.
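The same chi-square test in R: chisq.test on the observed counts reproduces both the expected frequencies and the statistic (the row and column labels are added only for readability):

caffeine <- matrix(c(652, 1537, 598, 242,
                      36,   46,  38,  21,
                     218,  327, 106,  67),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("Married", "Divorced", "Single"),
                                   c("None", "1-150", "151-300", "300-900")))
chisq.test(caffeine)            # X-squared about 51.6 on 6 df, p far below 0.05
chisq.test(caffeine)$expected   # expected frequencies, as computed above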

  11. Normal Distribution [Figure: distribution of height in Vietnamese women, with mean 156 cm and standard deviation 4.6 cm; the horizontal axis is height and the vertical axis is the probability of each height]

  12. Application of the Normal Distribution • The serum cholesterol levels of Californian children have a mean of 175 mg/100ml and a standard deviation of 30 mg/100ml. The distribution of the cholesterol levels is normal. • 95% of the children should have cholesterol levels between 175 ± (1.96 × 30), i.e. 116 to 234 mg/100ml. • If we let X be the cholesterol level of any child, then X can be converted to a variable with mean 0 and SD 1: Z = (X - 175) / 30 • [Figure: normal curve with cut-offs at 116 and 234 mg/100ml (Z = -1.96 and 1.96); values outside this range may be regarded as abnormal]
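A small R sketch of the same reference-range calculation (mean 175 and SD 30 are from the slide; the cholesterol value of 250 used for the z-score is a hypothetical example):

mu <- 175; sigma <- 30
qnorm(c(0.025, 0.975), mean = mu, sd = sigma)   # about 116 and 234 mg/100ml
z <- (250 - mu) / sigma    # z-score for a hypothetical child with cholesterol 250
1 - pnorm(z)               # probability of a level at least this high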

  13. Two-group comparison: unpaired t-test • Data: group 1 has observations x11, x12, ..., x1n1 with sample size n1, mean x̄1 and SD s1; group 2 has observations x21, x22, ..., x2n2 with sample size n2, mean x̄2 and SD s2 • Mean difference: D = x̄1 - x̄2 • Pooled variance: sp² = [(n1 - 1)s1² + (n2 - 1)s2²] / (n1 + n2 - 2) • Variance of D: var(D) = sp² × (1/n1 + 1/n2) • t-statistic: t = D / sqrt(var(D)), with n1 + n2 - 2 degrees of freedom • 95% confidence interval: D ± t0.975 × sqrt(var(D))

  14. Two-group comparison: an example

Group A: 100, 108, 119, 127, 132, 135, 136, 164   (n = 8, mean = 127.6, SD = 19.6)
Group B: 122, 130, 138, 142, 152, 154, 176        (n = 7, mean = 144.9, SD = 17.8)

Mean difference: d = 127.6 - 144.9 = -17.3
Variance of D: sp² = (7 × 19.6² + 6 × 17.8²) / 13 ≈ 353, so var(D) = 353 × (1/8 + 1/7) ≈ 94.6
t-statistic: t = -17.3 / sqrt(94.6) ≈ -1.78 on 13 df (not significant at the 5% level)
95% confidence interval: -17.3 ± (2.16 × 9.73) = -38.3 to 3.7
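The same comparison in R, using t.test on the raw data from the slide; var.equal = TRUE requests the classical pooled-variance unpaired t-test assumed by the formulas above:

A <- c(100, 108, 119, 127, 132, 135, 136, 164)
B <- c(122, 130, 138, 142, 152, 154, 176)
t.test(A, B, var.equal = TRUE)   # pooled-variance unpaired t-test
# The mean difference is about -17.3 and the 95% CI includes 0,
# so the difference is not significant at the 5% level.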

  15. Analysis of Correlation

ID   Age   Chol (mg/100ml)
 1    46   3.5
 2    20   1.9
 3    52   4.0
 4    30   2.6
 5    57   4.5
 6    25   3.0
 7    28   2.9
 8    36   3.8
 9    22   2.1
10    43   3.8
11    57   4.1
12    33   3.0
13    22   2.5
14    63   4.6
15    40   3.2
16    48   4.2
17    28   2.3
18    49   4.0

  16. Variance, Covariance and Correlation: Theory • Let x and y be two random variables from a sample of n observations • Measure of variability of x and y: variance, e.g. var(x) = Σ(xi - x̄)² / (n - 1) • Measure of covariation between x and y: covariance, cov(x, y) = Σ(xi - x̄)(yi - ȳ) / (n - 1) • Coefficient of correlation: r = cov(x, y) / (sx × sy), where sx and sy are the standard deviations of x and y
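These quantities correspond directly to R's var(), cov() and cor() functions; a brief sketch using the age and cholesterol data from slide 15:

age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
var(age)                               # variance of x
cov(age, chol)                         # covariance, about 10.7
cor(age, chol)                         # coefficient of correlation r
cov(age, chol) / (sd(age) * sd(chol))  # same value, computed from the definition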

  17. Positive and Negative Correlation [Figure: two scatter plots, one with r = 0.9 (positive correlation) and one with r = -0.9 (negative correlation)]

  18. Test of Hypothesis of Correlation • Hypothesis: H0: ρ = 0 versus H1: ρ ≠ 0 • Step 1: Fisher's z-transformation: z = 0.5 × ln[(1 + r) / (1 - r)] • Step 2: calculate the standard error of z: SE(z) = 1 / sqrt(n - 3) • Step 3: calculate the test statistic z / SE(z) and compare it with the critical value

  19. An Example of Correlation Analysis • Data: the age and cholesterol values from slide 15 (n = 18), with mean age = 38.83 (SD = 13.60), mean cholesterol = 3.33 mg/100ml (SD = 0.84) and Cov(x, y) = 10.68 • r = Cov(x, y) / (sx × sy) = 10.68 / (13.60 × 0.84) ≈ 0.93 • Fisher's z = 0.5 × ln[(1 + 0.93) / (1 - 0.93)] ≈ 1.7; SE(z) = 1 / sqrt(18 - 3) = 0.26 • Test statistic ≈ 1.7 / 0.26 ≈ 6.5, well above the critical t-value of 2.11 (17 df, alpha = 5%) • Conclusion: there is a significant association between age and cholesterol.
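In R the corresponding hypothesis test is available through cor.test; a sketch using the age and chol vectors defined earlier, with Fisher's z also computed by hand:

cor.test(age, chol)                # tests H0: rho = 0 using a t-statistic on n - 2 df
r  <- cor(age, chol)
z  <- 0.5 * log((1 + r) / (1 - r)) # Fisher's z-transformation
se <- 1 / sqrt(length(age) - 3)    # standard error of z
z / se                             # z-statistic; large values reject H0: rho = 0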

  20. Simple Linear Regression Analysis • Only two variables are of interest: one response variable and one predictor variable • No adjustment is needed for confounding or other covariates • Uses: • Assessment: quantify the relationship between two variables • Prediction: make predictions and validate a test • Control: adjust for a confounding effect (in the case of multiple variables)

  21. Linear Regression: Model • Y: random variable representing a response • X: random variable representing a predictor (risk factor) • Both Y and X can be categorical variables (e.g., yes/no) or continuous variables (e.g., age) • If Y is categorical, the model is a logistic regression model; if Y is continuous, it is a simple linear regression model • Model: Y = α + βX + ε • α: intercept • β: slope / gradient • ε: random error (variation between subjects in Y even if X is constant, e.g., variation in cholesterol among patients of the same age)

  22. Linear Regression: Assumptions • The relationship is linear in terms of the parameters • X is measured without error • The values of Y are independent of each other (e.g., Y1 is not correlated with Y2) • The random error term (ε) is normally distributed with mean 0 and constant variance • If the assumptions are tenable, then: • The expected value of Y is: E(Y | x) = α + βx • The variance of Y is: var(Y) = var(ε) = σ²

  23. Estimation of Model Parameters • Given two points A(x1, y1) and B(x2, y2) in a two-dimensional space, we can derive an equation connecting the points • Gradient: m = (y2 - y1) / (x2 - x1) = dy/dx • Equation: y = mx + a, where a is the intercept on the y-axis • What happens if we have more than 2 points?

  24. Method of Least Squares • For a series of pairs: (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn) • Let a and b be sample estimates of the parameters α and β • We then have a sample equation: Y* = a + bx • Aim: find the values of a and b so that the differences (Y - Y*) are as small as possible • Let SSE = Σ(Yi - a - bxi)² • The values of a and b that minimise SSE are called the least-squares estimates

  25. Criteria of Estimation [Figure: scatter plot of cholesterol (y) against age, with the vertical distance d from each observed yi to the fitted line] The goal of the least-squares estimator (LSE) is to find a and b such that the sum of the d² is minimal.

  26. Least-squares Estimates • After some calculus operations, the results can be shown to be: b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = cov(x, y) / var(x), and a = ȳ - b × x̄ • where x̄ and ȳ are the sample means of x and y • When the regression assumptions are valid, the estimators of α and β have the following properties: • Unbiased • Uniformly minimal variance (i.e., efficient)
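A short check in R that these closed-form estimates agree with lm(), again using the age and chol vectors defined earlier:

b <- cov(age, chol) / var(age)    # least-squares slope
a <- mean(chol) - b * mean(age)   # least-squares intercept
c(intercept = a, slope = b)
coef(lm(chol ~ age))              # lm() gives the same estimates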

  27. Goodness-of-fit • Now we have the equation Y = a + bX • Question: how well does the regression equation describe the actual data? • Answer: the coefficient of determination (R²): the amount of variation in Y that is explained by the variation in X.

  28. Partitioning of variations: geometry [Figure: scatter of cholesterol (Y) against age (X), showing the fitted line, the mean of Y, and the SST, SSR and SSE distances] • SST = sum of squared differences between yi and the mean of y • SSR = sum of squared differences between the predicted values of y and the mean of y • SSE = sum of squared differences between the observed and predicted values of y • SST = SSR + SSE • The coefficient of determination is: R² = SSR / SST

  29. Linear Regression Analysis by R

age <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,2.5,4.6,3.2,4.2,2.3,4.0)
lipid <- data.frame(age, chol)
attach(lipid)
results <- lm(chol ~ age)
summary(results)

Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08
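The SST = SSR + SSE partition and R² = SSR/SST from the previous slide can be verified numerically from this fit (a sketch assuming the results object above):

y.hat <- fitted(results)                 # predicted cholesterol values
SST <- sum((chol - mean(chol))^2)        # total sum of squares
SSR <- sum((y.hat - mean(chol))^2)       # regression sum of squares
SSE <- sum((chol - y.hat)^2)             # residual (error) sum of squares
c(SST = SST, SSR.plus.SSE = SSR + SSE)   # identical, since SST = SSR + SSE
SSR / SST                                # coefficient of determination, about 0.88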

  30. Interpretation of Model Estimates

Cholesterol = 1.089 + 0.0578 × Age

            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***

• Interpretation: cholesterol increases by 0.0578 mg/100ml for each one-year increase in age. The association between age and cholesterol is statistically significant (p = 1.06e-08).
• R-squared = 0.8775
• Interpretation: variation in age "explains" about 88% of the variation in cholesterol.

  31. Prediction plot(chol ~ age) abline(results) Regression line: Chol = 1.089 + 0.0578(Age)

  32. Checking Assumptions par(mfrow=c(2,2)) plot(results)

  33. The Importance of Assumption: BMI and Sexual Attractiveness

bmi <- c(11.0, 12.0, 12.5, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.8, 15.0, 15.0,
         15.5, 16.0, 16.5, 17.0, 17.0, 18.0, 18.0, 19.0, 19.0, 20.0, 20.0, 20.0,
         20.5, 22.0, 23.0, 23.0, 24.0, 24.5, 25.0, 25.0, 26.0, 26.0, 26.5, 28.0,
         29.0, 31.0, 32.0, 33.0, 34.0, 35.5, 36.0, 36.0)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5, 3.2, 3.7, 5.5, 5.2, 5.1,
        5.7, 5.6, 4.8, 5.4, 6.3, 6.5, 4.9, 5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7,
        3.5, 4.0, 3.7, 3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1, 2.1, 2.0, 1.8, 1.7)
beauty <- data.frame(bmi, sa)
attach(beauty)
results <- lm(sa ~ bmi)
summary(results)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.92512    0.64489   7.637 1.81e-09 ***
bmi         -0.05967    0.02862  -2.084   0.0432 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.354 on 42 degrees of freedom
Multiple R-Squared: 0.09376, Adjusted R-squared: 0.07218
F-statistic: 4.345 on 1 and 42 DF, p-value: 0.04323

  34. Incorrect Functional Form

  35. Cubic Regression

results <- lm(sa ~ poly(bmi, 3))
summary(results)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.6500     0.1193  30.587  < 2e-16 ***
poly(bmi, 3)1  -2.8228     0.7915  -3.566 0.000957 ***
poly(bmi, 3)2  -5.9749     0.7915  -7.548 3.27e-09 ***
poly(bmi, 3)3   4.0324     0.7915   5.094 8.76e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7915 on 40 degrees of freedom
Multiple R-Squared: 0.7051, Adjusted R-squared: 0.683
F-statistic: 31.88 on 3 and 40 DF, p-value: 1.077e-10

Fitted cubic model: SA = 3.65 - 2.82 × P1(BMI) - 5.97 × P2(BMI) + 4.03 × P3(BMI), where P1, P2 and P3 are the orthogonal polynomial terms produced by poly(bmi, 3)
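Because poly() fits orthogonal polynomials by default, the coefficients above are not the raw coefficients of BMI, BMI² and BMI³; a small follow-up sketch (not part of the original slides) refits the same curve with raw = TRUE to obtain the equation in ordinary powers of BMI:

results.raw <- lm(sa ~ poly(bmi, 3, raw = TRUE))
coef(results.raw)   # intercept and raw coefficients of BMI, BMI^2 and BMI^3
# The fitted curve is identical to the orthogonal-polynomial fit;
# only the parameterisation of the coefficients differs.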

  36. Sexual Attractiveness and BMI: Cubic Function

bmi.new <- 10:40
sa.pred <- predict(results, data.frame(bmi = bmi.new))
plot(sa ~ bmi)
lines(bmi.new, sa.pred, col = "blue", lwd = 3)
