Section II Descriptive stats for continuous data

Descriptive stats for binary data and bivariate associations in binary data

Types of data
Numerical: continuous (age, SBP, glucose); interval (parity, number of infections)
Ordinal (ranks): cancer stage, Apgar score
Nominal (no order): gender, ethnicity, treatment

Dataset used to illustrate some statistics in this section: stomach cancer survival times in controls (Cameron & Pauling, PNAS, Oct 1976). Days from end of treatment to death: 4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45 (n = 13 subjects).

Measures of central tendency (middle). Data: 4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45. mean = 17.5 days, median = 15 days, mode = 8 days. Geometric mean: GM = $\sqrt[13]{4 \times 6 \times 8 \times 8 \times \dots \times 45}$ = 14.26 days. If we delete the most extreme value, 45, the mean is now 15.25, the median is 14.5, and GM = 13.0. The median changes least.
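
The summary measures above can be recomputed with a short script. This is a minimal sketch using only the Python standard library; the helper name `summarize` is made up for illustration and is not from the slides.

```python
import math
import statistics

# Stomach cancer survival times in days (n = 13), as listed above.
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]

def summarize(values):
    """Return the mean, median and geometric mean of positive numbers."""
    # The geometric mean is the exponential of the mean of the logs.
    log_mean = sum(math.log(v) for v in values) / len(values)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "geometric_mean": math.exp(log_mean),
    }

full = summarize(days)
trimmed = summarize(days[:-1])  # drop the most extreme value, 45

for name in full:
    shift = full[name] - trimmed[name]
    print(f"{name:15s} all 13: {full[name]:6.2f}   "
          f"without 45: {trimmed[name]:6.2f}   shift: {shift:5.2f}")
print("mode:", statistics.mode(days))
```

The printed shift column shows how far each measure moves when the single largest value is removed.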

Mean versus median (lesson #1 in how to lie with statistics). Yearly income data from n = 11 persons: one income is for Dr Brilliant, the other 10 incomes are from her 10 graduate students. Yearly incomes in dollars: 950, 960, 970, 980, 990, 1010, 1020, 1030, 1040, 1050 and 100,000, for a total of $110,000. mean = 110,000/11 = $10,000; median = $1,010 (the sixth ordered value). Which is the better summary of a “typical” value?

Example: survival times in women with advanced breast cancer
Survival time in days after end of radiotherapy
woman     after 275 days f/u   after 305 days f/u
1              14                   14
2              26                   26
3              43                   43
4              45                   45
5              50                   50
6              58                   58
7              60                   60
8              62                   62
9              70                   70
10             70                   70
11             83                   83
12             98*                 128*
13            104*                 134*
14            124*                 154*
15            125*                 155*
16            275*                 305*
mean           75.6                 83.1
median         66.0                 66.0
SD             55.8                 66.3
* still alive (censored)
The median is still a valid measure when less than half the data are censored.

Cumulative frequencies & survival
Days     num dead   pct dead   cum dead   cum pct dead   cum pct alive = S
1-10        4         30.8        4           30.8             69.2
11-20       5         38.5        9           69.2             30.8
21-30       2         15.4       11           84.6             15.4
31-40       1          7.7       12           92.3              7.7
41-50       1          7.7       13          100.0              0
total      13
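
The cumulative frequency table can be rebuilt from the 13 raw survival times. The sketch below assumes the ten-day intervals used in the table and that every subject died (no censoring); the dictionary keys are illustrative names only.

```python
# Stomach cancer survival times in days (n = 13), all deaths observed.
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]
n = len(days)

# Ten-day intervals 1-10, 11-20, ..., 41-50, as in the table above.
intervals = [(lo, lo + 9) for lo in range(1, 50, 10)]

rows = []
cum_dead = 0
for lo, hi in intervals:
    dead = sum(lo <= d <= hi for d in days)  # deaths inside this interval
    cum_dead += dead
    rows.append({
        "interval": f"{lo}-{hi}",
        "dead": dead,
        "pct_dead": 100 * dead / n,
        "cum_dead": cum_dead,
        "cum_pct_dead": 100 * cum_dead / n,
        # Percent still alive at the end of the interval (survival S).
        "pct_alive": 100 * (n - cum_dead) / n,
    })

for r in rows:
    print(f'{r["interval"]:>6} {r["dead"]:3d} {r["pct_dead"]:6.1f} '
          f'{r["cum_dead"]:3d} {r["cum_pct_dead"]:6.1f} {r["pct_alive"]:6.1f}')
```

Plotting `pct_alive` against the interval end points gives the step-shaped survival curve.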

Stomach cancer survival time in days

Bevacizumab & ovarian cancer: Burger et al., NEJM, Dec 2011

Why survival curves?

Summarizing mortality: hazard rates. Hazard rate = h = (number of persons with outcome) / (total person-time of follow up in all at risk). This is a rate per person-time. It is NOT a probability (not a risk). In stomach cancer, n = 13 with 13 deaths, and total follow up is 4+6+8+8+12+14+15+17+19+22+24+34+45 = 228 person-days. Hazard rate = mortality rate = 13/228 = 0.057, or 5.7 deaths per 100 person-days of follow up. Do NOT report this as 5.7%; that is wrong.

Example: why hazard rates?
Group   n     num dead   mean f/u   total f/u   rate per 1000
A       100   7          36         3600        7/3600 = 1.94
B       100   2          3          300         2/300 = 6.67
The mortality rate is higher for B than for A, even though the number of persons in each group is the same and more people died in group A. The hazard rate ratio for A/B is 1.94/6.67 = 0.291. When ALL patients are followed to the endpoint (no censoring), mean time to event = 1/hazard.
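
The rate calculations above can be checked in a few lines. This is a sketch only; the function name `hazard_rate` is made up for illustration.

```python
def hazard_rate(events, person_time):
    """Events per unit of person-time (a rate, not a probability)."""
    return events / person_time

# Stomach cancer controls: every subject was followed to death.
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]
h = hazard_rate(len(days), sum(days))
print(f"stomach cancer: {100 * h:.1f} deaths per 100 person-days")

# With no censoring, 1/h equals the mean time to event.
mean_days = sum(days) / len(days)
print(f"1/h = {1 / h:.2f} days, mean survival = {mean_days:.2f} days")

# Groups A and B from the example: deaths and total follow-up time.
h_a = hazard_rate(7, 3600)
h_b = hazard_rate(2, 300)
hr = h_a / h_b  # unrounded ratio; the slide divides the rounded rates
print(f"A: {1000 * h_a:.2f} per 1000, B: {1000 * h_b:.2f} per 1000, "
      f"HR A/B = {hr:.4f}")
```

Because the script divides the unrounded rates, the ratio it prints can differ from the slide value in the last decimal.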

Hazard rates & survival curves: $-\log_e(S)$ = cumulative hazard = $h\,t$; h is the (average) slope of $-\log_e(S)$ vs t.

Hazard rate ratios & survival curves. $h_a$ = hazard rate in group A, $h_b$ = hazard rate in group B. The hazard rate ratio (HR) for A compared to B is HR = $h_a/h_b$. If HR is constant over time, one can compute the survival in group A from the survival in group B: $S_a = S_b^{HR}$. Ex: HR = 0.291 and S at t = 12 months is 90% in group B, so $S_a = 0.90^{0.291} = 0.970$, or 97.0%, in group A at t = 12 months. A “protective” HR < 1 increases survival. HR > 1 decreases survival.
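
The worked example above can be reproduced directly. The sketch assumes a constant hazard ratio over time (proportional hazards); the function name `survival_in_a` is made up for illustration. The second calculation uses the standard identity that cumulative hazard is minus the natural log of survival.

```python
import math

def survival_in_a(survival_b, hazard_ratio):
    """S_a = S_b ** HR; valid only if HR is constant over time."""
    return survival_b ** hazard_ratio

# Example from the slide: HR = 0.291, 12-month survival in group B = 90%.
s_a = survival_in_a(0.90, 0.291)
print(f"S_a at 12 months = {s_a:.3f}")

# Same result on the cumulative hazard scale: H = -log(S), H_a = HR * H_b.
cum_haz_b = -math.log(0.90)
s_a_check = math.exp(-0.291 * cum_haz_b)
print(f"via cumulative hazard = {s_a_check:.3f}")
```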

Cumulative hazard rate: $-\log_e(S)$ = cumulative hazard = $\sum_t h_i = \int h(t)\,dt$. If h is constant over time, cumulative hazard = $h\,T$, where T is the follow up time. In this case h = cumulative hazard / T, so h is the slope of the cumulative hazard vs t plot.

From: Risks and Benefits of Estrogen Plus Progestin in Healthy Postmenopausal Women: Principal Results From the Women's Health Initiative Randomized Controlled Trial. JAMA. 2002;288(3):321-333. HR indicates hazard ratio; nCI, nominal confidence interval; and aCI, adjusted confidence interval. Global index = first occurrence of CHD, cancer, stroke, pulmonary embolism, hip fracture or death.

Distribution skewness Long right tailed distribution median < mean (common for survival data)

Example: ICU length of stay (Howard). n = 94, mean = 11.3 days, median = 6 days, min = 1 day, max = 80 days

Skewness Long left tailed distribution median > mean (not as common in biology/medicine)

Symmetric (common in biology): mean = median. A distribution can be symmetric, with one mode, without being bell-curve shaped. When data have a skewed distribution, must use “non parametric” methods.

Measures of variation, spread IQR – interquartile range

Box-whisker plot (figure): shows min, Q1, median, Q3 and max, with the mean also marked.
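
The quantities drawn in a box-whisker plot can be computed for the stomach cancer data as sketched below. Quartile conventions differ between software packages; the `method="inclusive"` option of `statistics.quantiles` is an assumption made here, not something the slides specify, and other conventions can give slightly different Q1 and Q3.

```python
import statistics

# Stomach cancer survival times in days (n = 13).
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]

# Quartiles: Q1, median (Q2) and Q3 under the "inclusive" convention.
q1, q2, q3 = statistics.quantiles(days, n=4, method="inclusive")

five_number = (min(days), q1, q2, q3, max(days))
iqr = q3 - q1  # interquartile range: spread of the middle half of the data

print("min, Q1, median, Q3, max:", five_number)
print("IQR:", iqr)
print("mean:", round(statistics.mean(days), 2))
```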

Variation: variance & SD
Mean = $\bar{Y}$ = 17.54 days
$Y$     $Y - \bar{Y}$   $(Y - \bar{Y})^2$
4       -13.54          183.3
6       -11.54          133.2
8        -9.54           91.0
8        -9.54           91.0
12       -5.54           30.7
14       -3.54           12.5
15       -2.54            6.5
17       -0.54            0.3
19        1.46            2.1
22        4.46           19.9
24        6.46           41.7
34       16.46          270.9
45       27.46          754.1
sum       0            1637.2
Variance = $\sum_i (Y_i - \bar{Y})^2 / (n-1)$ = 1637.2/12 = 136.4. SD = $\sqrt{\text{Variance}} = \sqrt{136.4}$ = 11.7 days.
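
The deviation table can be reproduced step by step. This is a minimal sketch; the last line cross-checks the hand calculation against the standard library's sample SD, which also divides by n-1.

```python
import math
import statistics

# Stomach cancer survival times in days (n = 13).
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]
n = len(days)
mean = sum(days) / n

# Deviations from the mean always sum to zero (up to rounding error).
deviations = [y - mean for y in days]
sum_sq = sum(d ** 2 for d in deviations)

variance = sum_sq / (n - 1)  # sample variance divides by n - 1
sd = math.sqrt(variance)

for y, d in zip(days, deviations):
    print(f"{y:3d} {d:8.2f} {d ** 2:8.1f}")
print(f"sum of squares = {sum_sq:.1f}, variance = {variance:.1f}, SD = {sd:.2f}")
print(f"statistics.stdev gives {statistics.stdev(days):.2f}")
```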

Variation: interpreting the SD. Rule of thumb from Gaussian (“Normal”) theory (will study more shortly); the rule is OK if the data have a unimodal, symmetric distribution. Range of the middle 2/3 of the data: mean +/- SD. Range of the middle 95% of the data: mean +/- 2 SD. This implies SD ≈ range/4 (after extreme values are removed from the range).

SD of differences: paired data (chol in mmol/L)
person   chol at start   chol at end   difference
1        12.6            10.0          2.6
2         8.5             7.5          1.0
3         7.0             5.8          1.2
4         6.9             4.9          2.0
5         5.8             4.0          1.8
6         4.1             3.8          0.3
mean      7.48            6.00         1.48
SD        2.90            2.38         0.82
Correlation of start vs end: r = 0.971

If authors only report (mmol/L)
         start   end    change
mean     7.48    6.00   ??
SD       2.90    2.38   ??
It is easy to get the mean difference = 7.48 - 6.00 = 1.48. But one can't get the SD of the differences by subtraction: 2.90 - 2.38 = 0.52 ≠ 0.82. The 1.48 mean difference is the average response. The 0.82 SD of the differences is the variation in response. $SD_{diff} = \sqrt{SD_{start}^2 + SD_{end}^2 - 2\,r\,SD_{start}\,SD_{end}}$, where r = correlation coefficient between start and end.
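
The cholesterol example can be used to confirm that the formula reproduces the SD of the differences. In this sketch the correlation is computed by hand from the sample covariance, so only the basic `statistics` functions are needed.

```python
import math
import statistics

# Cholesterol (mmol/L) for the six persons in the paired-data table.
start = [12.6, 8.5, 7.0, 6.9, 5.8, 4.1]
end = [10.0, 7.5, 5.8, 4.9, 4.0, 3.8]
diffs = [s - e for s, e in zip(start, end)]

sd_start = statistics.stdev(start)
sd_end = statistics.stdev(end)
sd_diff_direct = statistics.stdev(diffs)  # needs the raw paired data

# Pearson correlation from the sample covariance.
mean_s, mean_e = statistics.mean(start), statistics.mean(end)
cov = sum((s - mean_s) * (e - mean_e)
          for s, e in zip(start, end)) / (len(start) - 1)
r = cov / (sd_start * sd_end)

# Formula from the slide: needs only the two SDs and r.
sd_diff_formula = math.sqrt(sd_start ** 2 + sd_end ** 2
                            - 2 * r * sd_start * sd_end)

print(f"r = {r:.3f}")
print(f"SD of differences: direct {sd_diff_direct:.2f}, "
      f"formula {sd_diff_formula:.2f}")
print(f"naive subtraction of SDs: {sd_start - sd_end:.2f}")
```

The formula needs r, so when a paper reports only the two SDs the SD of the change cannot be recovered without an assumed correlation.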

SD of differences: two independent groups. Comparing ages in groups A vs B (data table).

Rule for SD of differences: two independent groups. Var(Y - X) = Var(Y) + Var(X) and Var(Y + X) = Var(Y) + Var(X), so $SD(Y-X) = \sqrt{SD^2(Y) + SD^2(X)}$ and $SD(Y+X) = \sqrt{SD^2(Y) + SD^2(X)}$. Diagram labels: SD(Y-X), SD(Y), SD(X).

BINARY DATA Statistics

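Associations for binary data. risk = P, odds = O. With a = exposed with disease, b = exposed without disease, c = unexposed with disease, d = unexposed without disease: Pe = a/(a+b), Oe = a/b; Pu = c/(c+d), Ou = c/d; RR = Pe/Pu, OR = Oe/Ou.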

Risk vs odds. P = risk, O = odds. O = P/(1-P), P = O/(1+O). Example: P = 1/10 gives O = 1/9. Risk = num sick / total; odds = num sick / num not sick. RR = OR/(1 - Pu + OR × Pu). When Pu is small, RR ≈ OR. In general, the OR is more extreme (further from 1) than the RR.
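
The conversions above are easy to tabulate. In this sketch the function names are made up for illustration, and the OR of 2 with unexposed risks of 1% and 30% are hypothetical values chosen only to show how the gap between OR and RR grows as Pu increases.

```python
def odds_from_risk(p):
    """O = P / (1 - P)."""
    return p / (1 - p)

def risk_from_odds(o):
    """P = O / (1 + O)."""
    return o / (1 + o)

def rr_from_or(odds_ratio, p_unexposed):
    """RR = OR / (1 - Pu + OR * Pu)."""
    return odds_ratio / (1 - p_unexposed + odds_ratio * p_unexposed)

# Example from the slide: a risk of 1/10 corresponds to odds of 1/9.
print(odds_from_risk(1 / 10), 1 / 9)

# Hypothetical illustration: the same OR = 2 at two unexposed risks.
for p_u in (0.01, 0.30):
    print(f"Pu = {p_u:.2f}: OR = 2.00, RR = {rr_from_or(2, p_u):.3f}")
```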

Oral Contraceptive exposure vs Cancer Prospective study (unbiased est of pop)

Ratios and differences For rare events or diseases Pe=1/10,000, Pu= 1/100,000 RR = 10, risk difference = 9/100,000 Misleading to only report ratio and not actual risks.

Odds-case control study

Why use ORs? 1. In a prospective study, we usually quote the disease risk and the risk ratio (RR). In a case-control study, we always quote the OR, not the RR. The case-control OR of exposure in disease/no disease equals the prospective OR of disease in exposed/unexposed in the population, if the probability of exposure is the same as in the target population (not necessarily true if there is confounding or bias). 2. The OR is more “stable” (universal) across studies. If the unexposed risk = 20% and RR = 2, the exposed risk = 40%. If the unexposed risk = 60%, the RR can't be 2, since the exposed risk would have to be 120%.

Independence rule for ORs. ORs for heart attack (MI): for smokers/non smokers, OR = 4; for alcohol/no alcohol, OR = 2. If independent, the OR for those who smoke AND drink alcohol is 4 x 2 = 8 (relative to no smoking, no alcohol). This is only true if smoking and drinking are independent influences on MI. However, smoking and drinking can be correlated with each other.

NNT: number needed to treat (or harm) (clinical trials)
Pc (like Pu) = proportion with disease in the control group
Pt (like Pe) = proportion with disease in the treatment group
ARR = absolute risk reduction = risk difference = RD = Pc - Pt
RRR = relative risk reduction = (Pc - Pt)/Pc = ARR/Pc = 1 - RR
NNT = number needed to treat = 1/ARR

NNT example. Pc = 0.36 = 36%, Pt = 0.34 = 34%. ARR = RD = 0.02 = 2%. RRR = 0.02/0.36 = 5.6% (a percent of a percent). NNT = 1/0.02 = 50. So 50 patients must be given the treatment to prevent one additional case of disease. Can be extended to more complex stats.
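
The NNT example can be checked with a small helper. This is a sketch; the function name `trial_summary` and its dictionary keys are made up for illustration.

```python
def trial_summary(p_control, p_treat):
    """Absolute and relative effect measures from two disease proportions."""
    arr = p_control - p_treat  # absolute risk reduction (risk difference)
    if arr == 0:
        raise ValueError("no risk difference: NNT is undefined")
    return {
        "ARR": arr,
        "RR": p_treat / p_control,
        "RRR": arr / p_control,  # equals 1 - RR
        "NNT": 1 / arr,
    }

# Example from the slide: 36% diseased in control, 34% in treatment.
s = trial_summary(0.36, 0.34)
print(f"ARR = {s['ARR']:.2%}, RRR = {s['RRR']:.1%}, NNT = {s['NNT']:.0f}")
```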

NNT–Ovarian Ca screening “Tests commonly recommended to screen healthy women for ovarian cancer do more harm than good and should not be performed, a panel of medical experts said on Monday. The screenings —blood tests for a substance linked to cancer and ultrasound scans to examine the ovaries — do not lower the death rate from the disease, and they yield many false-positive results that lead to unnecessary operations with high complication rates, the panel said. … “To find one case of ovarian cancer, 20 women had to undergo surgery. “ (NY Times–10 Sept 2012)

Summary: ratios
          Risk          Odds          Hazard
Symbol    P             O             h
Ratio     RR = Pe/Pu    OR = Oe/Ou    HR = he/hu
All have the null value of 1.0 when there is no association. The distribution of the logs of these ratios from study to study is usually bell-curve shaped around the true log-scale value.

Sensitivity and specificity. With a = true positives, b = false positives, c = false negatives, d = true negatives:
Sensitivity = a/(a+c), false negative rate = c/(a+c)
Specificity = d/(b+d), false positive rate = b/(b+d)
Positive predictive value = PPV = a/(a+b) *
Negative predictive value = NPV = d/(c+d) *
* Depends on disease prevalence, not just an attribute of the test

Sensitivity, specificity, accuracy. Accuracy = W × Sensitivity + (1-W) × Specificity, where 0 < W < 1. Often W = 0.5 (unweighted accuracy). We wish to maximize accuracy, which is the same as minimizing misclassification = 1 - Accuracy. Choose W depending on “costs”.

ROC curve–choose continuous data cutpoint (threshold) for highest accuracy, best “separation”

“Modern” format for ROC Highest accuracy is NOT necessarily where sens=spec, only when SD1=SD2

“Traditional” ROC(not recommended-hard to label cutpoints)

C (concordance) statistic for ROC. C = area under the “traditional” ROC curve; 0.5 (bad) < C < 1.0 (good). Let nd = a+c be the true number with disease and nnd = b+d the true number without disease. From all possible nd x nnd pairs with one diseased and one not, call a pair “concordant” if the diseased member has the more positive test value than the non diseased member. C is the proportion of the pairs that are concordant.
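
The pair-counting definition of C can be sketched as follows. The test values are hypothetical numbers invented for illustration, higher values are assumed to point toward disease, and counting a tied pair as half concordant is a common convention assumed here rather than something the slides state.

```python
from itertools import product

def c_statistic(diseased, non_diseased):
    """Proportion of (diseased, non diseased) pairs that are concordant."""
    concordant = 0.0
    for d, nd in product(diseased, non_diseased):
        if d > nd:
            concordant += 1.0  # diseased member has the higher value
        elif d == nd:
            concordant += 0.5  # assumed convention: a tie counts as half
    return concordant / (len(diseased) * len(non_diseased))

# Hypothetical continuous test values (not from the slides).
diseased_values = [3.1, 4.0, 5.2, 6.5]
healthy_values = [1.0, 2.2, 3.5, 4.0, 2.9]

c = c_statistic(diseased_values, healthy_values)
print(f"{len(diseased_values) * len(healthy_values)} pairs, C = {c:.3f}")
```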

Positive and negative predictive value. Positive predictive value (PPV) and negative predictive value (NPV) depend on sensitivity (sens), specificity (spec) and disease prevalence (P). Sensitivity and specificity do NOT depend on disease prevalence. One can only compute PPV = a/(a+b) and NPV = d/(c+d) directly from the table when the disease prevalence is P = (a+c)/(a+b+c+d) = (a+c)/n.
Bayes formulas for PPV and NPV. Let P = prevalence of disease.
PPV = test true pos / (test true pos + test false pos) = sens x P / [ sens x P + (1-spec) x (1-P) ]
NPV = test true neg / (test true neg + test false neg) = spec x (1-P) / [ spec x (1-P) + (1-sens) x P ]
But don't use these formulas; there is an easier way.

Example. Sens = 95/100 = 0.95, Spec = 1980/2000 = 0.99, disease prevalence = P = 100/2100 = 0.0476.
PPV = (0.95 x 0.0476) / [ 0.95 x 0.0476 + 0.01 x 0.9524 ] = 0.826, or directly PPV = 95/115 = 0.826.
NPV = (0.99 x 0.9524) / [ 0.99 x 0.9524 + 0.05 x 0.0476 ] = 0.9975, or directly NPV = 1980/1985 = 0.9975.
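
Both routes to PPV and NPV in the example can be compared in code. This is a sketch: the function names are made up for illustration, and the second function takes the counts-based route of filling in the 2x2 table for a population with the stated numbers of diseased and non diseased persons.

```python
def predictive_values_bayes(sens, spec, prevalence):
    """PPV and NPV from the Bayes formulas."""
    p = prevalence
    ppv = sens * p / (sens * p + (1 - spec) * (1 - p))
    npv = spec * (1 - p) / (spec * (1 - p) + (1 - sens) * p)
    return ppv, npv

def predictive_values_counts(n_diseased, n_healthy, sens, spec):
    """PPV and NPV by filling in the 2x2 table for a given population."""
    a = sens * n_diseased  # true positives
    c = n_diseased - a     # false negatives
    d = spec * n_healthy   # true negatives
    b = n_healthy - d      # false positives
    return a / (a + b), d / (c + d)

# Example from the slide: 100 diseased, 2000 not diseased.
ppv_b, npv_b = predictive_values_bayes(0.95, 0.99, 100 / 2100)
ppv_c, npv_c = predictive_values_counts(100, 2000, 0.95, 0.99)
print(f"Bayes:  PPV = {ppv_b:.3f}, NPV = {npv_b:.4f}")
print(f"Counts: PPV = {ppv_c:.3f}, NPV = {npv_c:.4f}")
```

Changing `n_diseased` relative to `n_healthy` shows how PPV and NPV move with prevalence while sensitivity and specificity stay fixed.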
