Introduction to Biostatistics for Clinical and Translational Researchers

KUMC Departments of Biostatistics & Internal Medicine • University of Kansas Cancer Center • FRONTIERS: The Heartland Institute of Clinical and Translational Research

Course Information • Jo A. Wick, PhD • Office Location: 5028 Robinson • Email: jwick@kumc.edu • Lectures are recorded and posted at http://biostatistics.kumc.edu under ‘Events and Opportunities’

Inferences: Hypothesis Testing

Experiment • An experiment is a process whose results are not known until after it has been performed. • The range of possible outcomes is known in advance. • We do not know the exact outcome, but would like to know the chances of its occurrence. • The probability of an outcome E, denoted P(E), is a numerical measure of the chances of E occurring. • 0 ≤ P(E) ≤ 1

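Probability • The most common definition of probability is the relative frequency view: the probability of an outcome is the proportion of times it occurs over a long run of repeated experiments. • Probabilities for the outcomes of a random variable x are represented through a probability distribution. [Figure: probability distribution of length of stay (LOS); the value at length of stay = 6 days is the probability of LOS = 6 days]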
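
The relative frequency view can be sketched in a few lines. The length-of-stay values below are hypothetical, invented only for illustration, and the function name `relative_frequencies` is likewise an assumption rather than anything from the lecture.

```python
from collections import Counter

# Hypothetical lengths of stay (days), invented purely for illustration.
los_days = [3, 6, 4, 6, 2, 5, 6, 4, 3, 6]


def relative_frequencies(values):
    """Return {outcome: proportion of observations equal to that outcome}."""
    counts = Counter(values)
    n = len(values)
    return {outcome: count / n for outcome, count in sorted(counts.items())}


# The empirical probability distribution of length of stay.
distribution = relative_frequencies(los_days)

# Estimated probability that length of stay equals 6 days.
p_los_6 = distribution[6]
print(distribution)
print(p_los_6)
```

With more observations, these proportions settle toward the underlying probabilities.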

Population Parameters • Most often our research questions involve unknown population parameters: What is the average BMI among 5th graders? What proportion of hospital patients acquire a hospital-based infection? • To determine these values exactly would require a census. • However, due to a prohibitively large population (or other considerations) a sample is taken instead.

Sample Statistics • Statistics describe or summarize sample observations. • They vary from sample to sample, making them random variables. • We use statistics generated from samples to make inferences about the parameters that describe populations.

Sampling Variability [Figure: repeated samples drawn from a population with parameters μ and σ, and the resulting sampling distribution]
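
A small simulation can make sampling variability concrete. This is a sketch under stated assumptions: the population is taken to be normal with a hypothetical mean of 25 and standard deviation of 4, and the sample size and number of repetitions are arbitrary choices, none of which come from the lecture.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

MU, SIGMA = 25.0, 4.0    # hypothetical population mean and standard deviation
N, N_SAMPLES = 30, 2000  # size of each sample, number of repeated samples


def sample_mean(n):
    """Draw one sample of size n from the population and return its mean."""
    return statistics.mean(random.gauss(MU, SIGMA) for _ in range(n))


# Each repeated sample yields a different mean: the statistic is a random variable.
sample_means = [sample_mean(N) for _ in range(N_SAMPLES)]

center = statistics.mean(sample_means)    # close to the population mean
spread = statistics.stdev(sample_means)   # close to sigma / sqrt(n)
theoretical_se = SIGMA / N ** 0.5
print(center, spread, theoretical_se)
```

The spread of the simulated means should sit near σ/√n, which is why larger samples give more stable statistics.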

Recall: Hypotheses • Null hypothesis “H0”: statement of no difference or association between variables • This is the hypothesis we test: the first step in the ‘recipe’ for hypothesis testing is to assume H0 is true • Alternative hypothesis “H1”: statement of a difference or association between variables • This is what we are (usually) trying to prove

Hypothesis Testing • One-tailed hypothesis: outcome is expected in a single direction (e.g., administration of experimental drug will result in a decrease in systolic BP) • H1 includes ‘<’ or ‘>’ • Two-tailed hypothesis: the direction of the effect is unknown (e.g., experimental therapy will result in a different response rate than that of current standard of care) • H1 includes ‘≠’

Hypothesis Testing • The statistical hypotheses are statements concerning characteristics of the population(s) of interest: • Population mean: μ • Population variability: σ • Population rate (or proportion): π • Population correlation: ρ • Example: It is hypothesized that the response rate for the experimental therapy is greater than that of the current standard of care. • πExp > πSOC ← This is H1.

Recall: Decisions • Type I Error (α): a true H0 is incorrectly rejected • “An innocent man is proven GUILTY in a court of law” • Commonly accepted rate is α = 0.05 • Type II Error (β): failing to reject a false H0 • “A guilty man is proven NOT GUILTY in a court of law” • Commonly accepted rate is β = 0.2 • Power (1 – β): correctly rejecting a false H0 • “Justice has been served” • Commonly accepted rate is 1 – β = 0.8

Decisions

Basic Recipe for Hypothesis Testing • State H0 and H1 • Assume H0 is true ← Fundamental assumption!! • Collect the evidence: from the sample data, compute the appropriate sample statistic and the test statistic • Test statistics (e.g., t, chi-square, F) quantify the level of evidence within the sample and provide the information for computing a p-value • Determine whether the test statistic is large enough to meet the a priori determined level of evidence necessary to reject H0 (or, equivalently, is p < α?)

Example: Carbon Monoxide • An experiment is undertaken to determine the concentration of carbon monoxide in air. • It is a concern that the actual concentration is significantly greater than 10 mg/m3. • Eighteen air samples are obtained and the concentration for each sample is measured. • The outcome x is carbon monoxide concentration in samples. • The characteristic (parameter) of interest is μ—the true average concentration of carbon monoxide in air.

Step 1: State H0 & H1 • H1: μ > 10 mg/m3 ← We suspect! • H0: μ ≤ 10 mg/m3 ← We assume in order to test! Step 2: Assume μ = 10 [Figure: distribution centered at μ = 10]

Step 3: Evidence • Sample statistic: x̄ = 10.43 mg/m3 • Test statistic: t = (x̄ − μ0) / (s / √n) = 1.79, with μ0 = 10 and n = 18 • What does 1.79 mean? How do we use it?
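
The one-sample t statistic for this example can be sketched as follows. The sample mean of 10.43, the null value of 10, and n = 18 come from the example itself; the sample standard deviation is not given in the slides, so the value 1.02 below is an assumption chosen only because it reproduces t ≈ 1.79.

```python
import math


def one_sample_t(sample_mean, mu0, sample_sd, n):
    """t = (sample mean - hypothesized mean) / (standard error of the mean)."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - mu0) / standard_error


# ASSUMPTION: the slides do not report the sample SD; 1.02 is a hypothetical
# value picked so that the result matches the reported t of about 1.79.
ASSUMED_SD = 1.02

t_stat = one_sample_t(sample_mean=10.43, mu0=10.0, sample_sd=ASSUMED_SD, n=18)
print(round(t_stat, 2))
```

The statistic expresses how many standard errors the sample mean lies above the hypothesized mean.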

Student’s t Distribution • Remember when we assumed H0 was true? Step 2: Assume μ = 10 [Figure: distribution centered at μ = 10]

Student’s t Distribution • What we were actually doing was setting up this theoretical Student’s t distribution from which the p-value can be calculated. [Figure: Student’s t distribution centered at t = 0]

Student’s t Distribution • Assuming the true air concentration of carbon monoxide is actually 10 mg/m3, how likely is it that we would get evidence in the form of a sample mean of 10.43 or larger? Step 2: Assume μ = 10 [Figure: distribution centered at μ = 10]

Student’s t Distribution • We can say how likely by framing the statement in terms of the probability of an outcome: p = P(t ≥ 1.79) = 0.0456 [Figure: Student’s t distribution centered at t = 0, with t = 1.79 marked]
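
The upper-tail probability P(t ≥ 1.79) can be approximated with the standard library alone. This is a sketch, not the method used in the lecture: it integrates the Student's t density numerically with Simpson's rule, and it assumes 17 degrees of freedom (n − 1 for the 18 air samples). In practice a statistics package would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """P(T >= t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    if t < 0 or steps % 2:
        raise ValueError("t must be non-negative and steps must be even")
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # The distribution is symmetric about 0, so P(T >= 0) = 0.5.
    return 0.5 - area_zero_to_t


# df = 17 is an assumption based on n = 18 samples in the example.
p_value = t_upper_tail(1.79, df=17)
print(round(p_value, 4))
```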

Step 4: Make a Decision • Decision rule: if p ≤ α, the chances of getting the actual collected evidence from our sample, given that the null hypothesis is true, are very small. • The observed data conflict with the null ‘theory.’ • The observed data support the alternative ‘theory.’ • Since the evidence (data) was actually observed and our theory (H0) is unobservable, we choose to believe that our evidence is the more accurate portrayal of reality and reject H0 in favor of H1.

Step 4: Make a Decision • What if our evidence had not been in as great a degree of conflict with our theory? • p > α: the chances of getting the actual collected evidence from our sample, given that the null hypothesis is true, are pretty high • We fail to reject H0.
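
The decision rule itself is a one-line comparison. The function name `decide` and its return strings are illustrative choices, not part of the lecture; α = 0.05 is the commonly accepted rate mentioned earlier.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when p <= alpha, otherwise fail to reject."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    return "reject H0" if p_value <= alpha else "fail to reject H0"


# The carbon monoxide example reported p = 0.0456.
print(decide(0.0456))
# A hypothetical larger p-value, for contrast.
print(decide(0.20))
```

Note that the second outcome is "fail to reject H0", not "accept H0": a large p-value is an absence of evidence against the null, not proof of it.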

Decision • How do we know if the decision we made was the correct one? • We don’t! • If α = 0.05, the chance of incorrectly rejecting H0 when it is actually true is no greater than 5%. • We have no way of knowing whether we made this kind of error; we only know that our chances of making it in this setting are relatively small.
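
The meaning of α can be checked by simulation: generate many samples from a world in which H0 is true and count how often the test rejects. This is a sketch under assumptions: the data are taken to be normal with μ = 10 and a hypothetical σ = 1, the sample size of 18 mirrors the carbon monoxide example, and 1.740 is the tabled upper 5% point of the t distribution with 17 degrees of freedom.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

MU0 = 10.0       # H0 is true in this simulation: the real mean equals 10
SIGMA = 1.0      # hypothetical population standard deviation
N = 18           # sample size, as in the carbon monoxide example
T_CRIT = 1.740   # upper 5% critical value of t with 17 degrees of freedom
N_TRIALS = 5000  # number of simulated experiments


def rejects_h0(sample):
    """One-sided, one-sample t-test of H0: mu <= MU0 at alpha = 0.05."""
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    t = (statistics.mean(sample) - MU0) / se
    return t >= T_CRIT


rejections = sum(
    rejects_h0([random.gauss(MU0, SIGMA) for _ in range(N)])
    for _ in range(N_TRIALS)
)
type_i_rate = rejections / N_TRIALS
print(type_i_rate)
```

The proportion of false rejections should land close to 0.05, even though no single experiment can reveal whether its own decision was one of them.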

Which test do I use? • What kind of outcome do you have? • Nominal? Ordinal? Interval? Ratio? • How many samples do you have? • Are they related or independent?

Types of Tests

Types of Tests • Parametric methods: make assumptions about the distribution of the data (e.g., normally distributed) and are suited for sample sizes large enough to assess whether the distributional assumption is met • Nonparametric methods: make no assumptions about the distribution of the data and are suitable for small sample sizes or large samples where parametric assumptions are violated • Use ranks of the data values rather than actual data values themselves • Loss of power when parametric test is appropriate

Types of Tests

Comparing Central Tendency

Two-Sample Test of Means • Clotting times (minutes) of blood for subjects given one of two different drugs: [Table: clotting times by drug] • It is hypothesized that the two drugs will result in different blood-clotting times. • H1: μB ≠ μG • H0: μB = μG

Two-Sample Test of Means • What we’re actually hypothesizing: H0: μB − μG = 0 [Figure: distribution centered at μB − μG = 0, with the observed evidence marked]

Two-Sample Test of Means • What we’re actually hypothesizing: H0: μB − μG = 0 • p = P(|t| > 2.475) = 0.03 [Figure: t distribution centered at t = 0, with t = −2.48 and t = +2.48 marked] ***Two-sided tests detect ANY evidence in EITHER direction that the null difference is unlikely!
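
A pooled-variance two-sample t statistic can be sketched as below. The clotting-time table did not survive in this transcript, so the two small groups used here are hypothetical values invented for illustration, and the result is not meant to reproduce the t of −2.475 reported above.

```python
import math
import statistics


def pooled_two_sample_t(group_1, group_2):
    """Return (t, df) for the equal-variance two-sample t-test of mean_1 - mean_2."""
    n1, n2 = len(group_1), len(group_2)
    df = n1 + n2 - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * statistics.variance(group_1)
                  + (n2 - 1) * statistics.variance(group_2)) / df
    se_difference = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(group_1) - statistics.mean(group_2)) / se_difference
    return t, df


# Hypothetical data, invented for illustration only.
group_1 = [8.0, 9.0, 10.0]
group_2 = [10.0, 11.0, 12.0]

t_stat, df = pooled_two_sample_t(group_1, group_2)
print(round(t_stat, 3), df)
```

For a two-sided test, the p-value is the probability of a t at least this far from zero in either direction.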

Assumptions of t • In order to use the parametric Student’s t test, we have a few assumptions that need to be met: • Approximate normality of the observations • In the case of two samples, approximate equality of the variances in the two groups

Assumption Checking • To assess the assumption of normality, a simple histogram would show any issues with skewness or outliers:

Assumption Checking • Skewness

Assumption Checking • Other graphical assessments include the QQ plot:

Assumption Checking • Violation of normality:

Assumption Checking • To assess the assumption of equal variances (when there are two groups), simple boxplots would show any issues with heteroscedasticity:

Assumption Checking • Rule of thumb: if the larger variance is more than 2 times the smaller, the assumption has been violated
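
The rule of thumb translates directly into a ratio check. The function names and the two example groups below are hypothetical, introduced only for illustration; the threshold of 2 is the one stated in the rule.

```python
import statistics


def variance_ratio(group_1, group_2):
    """Return the larger sample variance divided by the smaller one."""
    v1, v2 = statistics.variance(group_1), statistics.variance(group_2)
    if min(v1, v2) == 0:
        raise ValueError("a group with zero variance cannot be compared")
    return max(v1, v2) / min(v1, v2)


def equal_variances_plausible(group_1, group_2, max_ratio=2.0):
    """Rule of thumb: the assumption is violated when the ratio exceeds max_ratio."""
    return variance_ratio(group_1, group_2) <= max_ratio


# Hypothetical groups: same mean, very different spread.
narrow = [9.0, 10.0, 11.0]   # sample variance 1
wide = [7.0, 10.0, 13.0]     # sample variance 9

print(variance_ratio(narrow, wide))
print(equal_variances_plausible(narrow, wide))
```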

Now what? • If you have enough observations (20? 30?) to be able to determine that the assumptions are feasible, check them. • If violated, either: • Try a transformation (e.g., natural log) to correct the violated assumptions and reassess; proceed with the t-test if fixed, or with a non-parametric test if not • Or skip the transformation altogether and proceed to the non-parametric test • If okay, proceed with the t-test.

Now what? • If you have too small a sample to adequately assess the assumptions, perform the non-parametric test instead. • For the one-sample t, we typically substitute the Wilcoxon signed-rank test • For the two-sample t, we typically substitute the Mann-Whitney test
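
The choice between the t-test and its nonparametric substitute can be written as a small lookup. This is only a sketch of the guidance above: the function name, arguments, and return strings are illustrative assumptions, and it covers just the one-sample and two-sample cases named in the slides.

```python
def choose_test(n_samples, can_assess_assumptions, assumptions_met=False):
    """Pick the t-test or its nonparametric substitute.

    n_samples: 1 for a one-sample problem, 2 for two independent samples.
    can_assess_assumptions: True if the sample is large enough to check them.
    assumptions_met: result of the check (possibly after a transformation).
    """
    tests = {
        1: ("one-sample t-test", "Wilcoxon signed-rank test"),
        2: ("two-sample t-test", "Mann-Whitney test"),
    }
    if n_samples not in tests:
        raise ValueError("this sketch only covers one or two samples")
    parametric, nonparametric = tests[n_samples]
    # Use the t-test only when the assumptions could be checked and held.
    if can_assess_assumptions and assumptions_met:
        return parametric
    return nonparametric


print(choose_test(1, can_assess_assumptions=False))
print(choose_test(2, can_assess_assumptions=True, assumptions_met=True))
```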

Consequences of Nonparametric Testing • Robust! • Less powerful because they are based on ranks which do not contain the full level of information contained in the raw data • When in doubt, use the nonparametric test—it will be less likely to give you a ‘false positive’ result.

Speaking of Power • “How many subjects do we need?” • Statistical methods can be used to determine the required number of patients to meet the trial’s principal scientific objectives. • Other considerations that must be accounted for include availability of patients and resources and the ethical need to prevent any patient from receiving inferior treatment. • We want the minimum number of patients required to achieve our principal scientific objective.

The Size of a Clinical Trial • For the chosen level of significance (type I error rate, α), a clinically meaningful difference (Δ) between two groups can be detected with a minimally acceptable power (1 – β) with n subjects.

Example: Detecting a Difference • Primary objective: To compare pain improvement in knee OA for new treatment A compared to standard treatment S. • Primary outcome: Change in pain score from baseline to 24 weeks (continuous). • Data analysis: Comparison of mean change in pain score of patients on treatment A (μ1) versus standard (μ2) using a two-sided t-test at the α = 0.05 level of significance.

Example: Detecting a Difference • Difference to detect (Δ): It has been determined that a difference of 10 on this pain scale is clinically meaningful. • If standard therapy results in a 5 point decrease, our new therapy would need to show a decrease of at least 15 (5 + 10) to be declared clinically different from the standard. • We would like to be 80% sure that we detect this difference as statistically significant.

Example: Detecting a Difference • What usually occurs on the standard? • This is important information because it tells us about the behavior of the outcome (pain scale) in these patients. • If the pain scale has great variability, it may be difficult to detect small to moderate changes (signal-to-noise)!

‘Signal-to-Noise’

Example: Detecting a Difference • We have: • H0: μ1 = μ2 (Δ = 0) versus H1: μ1 ≠ μ2 • α = 0.05 • 1 – β = 0.80 • Δ = 10 • For continuous outcomes we need to determine what difference would be clinically meaningful, but specified in the form of an effect size, which takes into account the variability of the data.

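Example: Detecting a Difference • Effect size is the difference in the means divided by the standard deviation, usually of the control or comparison group, or the pooled standard deviation of the two groups, where s_pooled = √[((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)] for group sample sizes n1, n2 and sample standard deviations s1, s2.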
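
The pieces of this example can be combined into a rough sample size sketch. Several things here are assumptions rather than content from the lecture: the standard deviation of 20 for the pain score is hypothetical, the pooled standard deviation uses the usual degrees-of-freedom weighting, and the sample size formula n = 2(z₁₋α/₂ + z₁₋β)² / d² is the common normal approximation for a two-sided, two-group comparison. An exact t-based calculation gives a slightly larger n.

```python
import math
from statistics import NormalDist


def pooled_sd(sd_1, n_1, sd_2, n_2):
    """Pooled standard deviation of two groups (usual df-weighted definition)."""
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return math.sqrt(pooled_var)


def effect_size(delta, sd):
    """Difference in means divided by the standard deviation."""
    return delta / sd


def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


# ASSUMPTION: the slides do not give the SD of the pain score; 20 is hypothetical.
ASSUMED_SD = 20.0

d = effect_size(delta=10.0, sd=ASSUMED_SD)   # clinically meaningful difference of 10
n = n_per_group(d, alpha=0.05, power=0.80)
print(d, n)
```

Halving the assumed standard deviation doubles the effect size and cuts the required n to roughly a quarter, which is the signal-to-noise point in numbers.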
