Evaluating Results

The Best Evidence So far, we’ve learned that a good experiment or clinical trial is:
• Randomized
• Double-blind
• Controlled
This is often abbreviated ‘RCT’: Randomized Controlled Trial.

Randomization An experiment or trial is randomized when each person participating in it has a fair and equal chance of ending up in either the control group or the experimental group.
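As a minimal sketch of what randomization looks like in practice, the Python snippet below shuffles a list of participants and splits it in half. The function name `randomize`, the participant IDs 1 to 40, and the fixed seed are illustrative assumptions, not part of any particular trial protocol.

```python
import random

def randomize(participants, seed=None):
    """Shuffle the participants, then split them into two equal-sized groups.

    Because the order is random, every participant has the same chance of
    landing in either group, and the experimenter has no say in it.
    """
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs 1..40
control, experimental = randomize(range(1, 41), seed=0)
print(len(control), len(experimental))  # 20 20
```

Contrast this with assigning the first 20 people to enroll to one group and the next 20 to the other: there the order of enrollment, not chance, decides who goes where.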

Benefits of Randomization Proper randomization:
• Minimizes experimenter bias: the experimenter can’t influence who goes into which group.
• Minimizes allocation bias: it lowers the chance that the control group and experimental group differ in important ways.

Experimenter Bias Experimenter bias is a type of selection bias, that is, bias in who gets “selected” to be in which group. If the experimenter has a role in who goes into which group, she can (consciously or unconsciously) unfairly bias the experiment.

Selection Bias Randomization cannot get rid of all selection bias. For example, many psychology experiments are performed only on American undergraduates by their professors. This means both groups over-represent young Westerners. This kind of selection bias is called “sampling bias”.

Selection Bias Selection bias can also occur when people drop out of a study. Suppose you are studying whether a certain pill has side-effects. If everyone who experiences side-effects quits the study before results are examined, it will look like there are no side-effects.

Allocation Bias Randomization also guards against allocation bias, where the control group and experimental group are different in important ways. For example, if you assign the first 20 people to enroll in the experiment to the control and the next 20 to the experimental group, there may be allocation bias: the first to enroll may be more eager to take part, because they are sicker.

The Importance of Randomization Last time we saw that improper randomization procedures on average exaggerated effects by 41%. This is an average result, so improper randomization often leads to exaggerations that are even larger than 41%.

Blinding Single-blind experiments are ones where the participants don’t know whether they are in the control group or the experimental group. Double-blind experiments are ones where, in addition, the experimenters don’t know which participants are in the control group and which participants are in the experimental group.

Blinding Participants Blinding the participants is important for several reasons. People tend to conform to expectations. So if they think they should get better, or behave in a certain way, they will try harder, behave in that way, and report that they’ve gotten better or behaved correctly (even if they haven’t).

Blinding Participants Last time we also saw the power of the placebo effect. When people think they will get better by undergoing treatment, they do get better, even if the treatment is ineffectual.

Blinding Experimenters Blinding experimenters matters for two main reasons. First, if experimenters are not blind, they may accidentally indicate to participants which group they are in. Second, if experimenters are not blind, they may bias the study by measuring or treating participants differently.

Controls An experiment with no controls is useless. It tells us what happens when we do X, but not what happens when we don’t do X (control). Maybe the same results would happen from not doing X. Maybe X does nothing. Or a lot. Or a little. With no controls, it is impossible to tell.

Experiments and Correlations As we learned before, experimental studies are just a special way of observing correlations between two variables. They are specially designed to avoid confounding variables, and to make sure any correlations found are between the variables we are studying.

Significance But once we find a correlation we need to ask: how strong is the correlation? What does this study tell me about real life? If drinking wine and living longer are correlated, does that mean if I drink a glass of wine a day I’ll live one day longer? One year longer? 10 years longer?

Hypothetico-Deductive Method Remember that science is the process of using our theories to generate hypotheses which are then tested against the world by gathering data from observations and experiments. In an RCT, the hypothesis is that there is a causal connection between two variables (for example, taking a drug and getting better).

Null Hypothesis In this context, the hypothesis that there is no causal connection between the variables being studied is called the null hypothesis. Our goal is to reject the null hypothesis when it is false, and to retain it (that is, fail to reject it) when it is true.

Rejecting the Null Hypothesis All experimental data is consistent with the null hypothesis. Any correlation can always be due entirely to chance. But sometimes the null hypothesis doesn’t fit the data very well. When our actual observations would be very unlikely if the null hypothesis were true, we reject the null hypothesis.

P-Values One way to characterize the significance of an observed correlation is with a p-value. The p-value is the probability that we would observe data like ours (or data even more extreme) on the assumption that the null hypothesis is true. p = P(observations/ null hypothesis = true)

P-Values Lower p-values are stronger evidence against the null hypothesis: the lower the p-value, the less likely it is that we would see our data if chance alone were at work. In science we have an arbitrary cut-off point, 5%. We say that an experimental result with p < .05 is statistically significant.

Statistical Significance What does p < .05 mean? It means that the probability that our experimental results would happen if the null hypothesis is true is less than 5%. According to the null hypothesis, there is less than a 1 in 20 chance that we would obtain these results.

Note Importantly, p-values are not measures of how likely the null hypothesis is, given the data. They are measures of how likely the data is, given the null hypothesis. p = P(data/ null hypothesis = true) ≠ P(null hypothesis = true/ data)

Example Suppose I have a coin, and I hypothesize that the coin is biased toward heads. The null hypothesis might be “this is a fair coin, it is equally likely to land heads or tails”. Suppose I then flip it 5 times and it lands HHHHH: heads 5 times in a row.

Example We know that the probability of this happening if the coin is fair is 1/(2^5) = 1/32 = 0.03125 or about 3%. P(HHHHH/ the coin is fair) = P(HHHHH/ null hypothesis = true) = p = 3%

Example So p = .03 < .05, and we can reject the null hypothesis. The bias toward heads is statistically significant.
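The coin calculation can be checked with a short Python sketch. The helper name `p_value_at_least` is an assumption made for illustration; it computes a one-sided p-value, the probability of getting at least the observed number of heads if the coin is fair.

```python
from math import comb

def p_value_at_least(heads, flips, p_heads=0.5):
    """Probability of `heads` or more heads in `flips` tosses,
    assuming each toss lands heads with probability `p_heads`."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )

p = p_value_at_least(5, 5)   # HHHHH under the null hypothesis (fair coin)
print(p)                     # 0.03125
print(p < 0.05)              # True: statistically significant
```

With only 4 heads out of 5, the same function gives 6/32 ≈ 0.19, which is above the 5% cut-off, so that result would not be statistically significant.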

Notice that we cannot estimate the likelihood of the null hypothesis: P(n.h. = true/ data) = [P(data/ n.h. = true) x P(n.h. = true)] ÷ P(data) We know P(data/ n.h. = true) = 3%. But what are P(HHHHH) and P(n.h.)?
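To see why the missing quantities matter, here is a sketch that plugs made-up numbers into Bayes’ theorem. Everything beyond the 1/32 figure is an assumption for illustration only: the only alternative considered is a coin that lands heads 90% of the time, and the two prior probabilities for the fair coin (50% and 99%) are invented.

```python
def posterior_null(prior_null, p_data_given_null, p_data_given_alt):
    """P(null / data) by Bayes' theorem, with a single alternative hypothesis."""
    p_data = (p_data_given_null * prior_null
              + p_data_given_alt * (1 - prior_null))
    return p_data_given_null * prior_null / p_data

p_fair = 0.5 ** 5     # P(HHHHH / fair coin) = 1/32, from the example
p_biased = 0.9 ** 5   # P(HHHHH / biased coin); the 90% bias is an assumption

for prior in (0.5, 0.99):   # assumed prior probabilities that the coin is fair
    print(prior, round(posterior_null(prior, p_fair, p_biased), 2))
```

The same data and the same p-value give very different probabilities for the null hypothesis depending on the prior, which is exactly why p cannot be read as P(null hypothesis = true/ data).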

Importance Just because the results of an experiment (or observational study) are “statistically significant” does not mean the revealed correlations are important. The effect size, that is, the strength of the correlation, also matters.

Effect Size One NAEP analysis of 100,000 American students found that science test scores for men were higher than the test scores for women, and this effect was statistically significant (unlikely if the null hypothesis, that gender plays no role in science scores, were true).

Effect Size However, the average difference between men and women on the test was just 4 points out of 300, or 1.3% of the total score. Yes, there was a real (statistically significant) difference. It was just a very, very small difference.

Effect Size One way to put the point might be: “p-values tell you when to reject the null hypothesis. But they do not tell you when to care about the results.”

Measures of Effect Size There are lots of measures of effect size: Pearson’s r, Cohen’s f, Cohen’s d, Hedges’ g, Cramér’s V,… Here we will just talk about two measures that are commonly reported: odds ratios and relative risks.

Odds Ratio First, let’s introduce the idea of a binary variable. A binary variable is a variable that can have only two values. “height” is not a binary variable, because there are more than two heights people can have. “got an A” is a binary variable, because either you got an A or you didn’t.

Odds Whenever you have a binary variable, you can ask about the odds of that variable: what are the odds of getting an A? If 10 students out of 50 got A’s, then 10 students got an A and 40 did not. The odds of getting an A are 10:40 or 1:4 or 25%.

Odds vs. Probabilities Odds are not probabilities. There are 50 students and 10 of them got A’s.
The probability of getting an A: 10/50 = 20%
The odds of getting an A: 10/40 = 25%
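The difference between the two is just a difference in the denominator, as this small sketch shows. The function names are chosen here for illustration; `Fraction` is used so the arithmetic stays exact.

```python
from fractions import Fraction

def probability(successes, total):
    """Successes divided by everyone."""
    return Fraction(successes, total)

def odds(successes, total):
    """Successes divided by non-successes."""
    return Fraction(successes, total - successes)

def odds_to_probability(o):
    """Convert odds back into a probability."""
    return o / (1 + o)

print(probability(10, 50))                # 1/5, i.e. 20%
print(odds(10, 50))                       # 1/4, i.e. 25%
print(odds_to_probability(odds(10, 50)))  # 1/5 again
```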

Odds Ratios Suppose I have another binary variable “studied”– students either studied for the exam or they didn’t. I can ask about the odds that a student who studied got an A, and the odds that a student who didn’t study got an A.

In Table Format

               Got an A   No A   Total
Studied            6       15     21
Didn’t study       4       25     29
Total             10       40     50

Odds Ratio So the odds of getting an A among studiers are 6:15 or 40%. And the odds of getting an A among non-studiers are 4:25 or 16%.

Odds Ratio The odds ratio is the ratio of these odds, or 40%:16% = 2.5. This means that (in our example) studying raises the odds that someone will get an A by 150%. Alternatively: a student who studies has two and a half times the odds of getting an A.
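Here is the same odds ratio as a short sketch, using the counts from the example (6 studiers got an A and 15 did not; 4 non-studiers got an A and 25 did not). The variable names are illustrative.

```python
from fractions import Fraction

# Counts from the example
a_studied, no_a_studied = 6, 15
a_not_studied, no_a_not_studied = 4, 25

odds_studied = Fraction(a_studied, no_a_studied)              # 6:15 = 2/5
odds_not_studied = Fraction(a_not_studied, no_a_not_studied)  # 4:25

odds_ratio = odds_studied / odds_not_studied
print(odds_ratio, float(odds_ratio))  # 5/2 2.5
```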

Relative Risk While odds ratios are appropriate when we have two correlated binary variables in an observational study (as when I observe the effects of studying on getting an A), the effect sizes in RCTs are usually reported by relative risks, which are also called risk ratios.

Relative Risk Relative risks are just like odds ratios except they compare probabilities and not odds. The odds that a studying student gets an A are 6:15 = 40%. The probability is 6/(6 + 15) = 6/21 ≈ 29%.

Example The odds that a non-studying student gets an A are 4:25 = 16%. The probability is 4/(4 + 25) = 4/29 ≈ 14%.

Example Whereas the odds ratio was 40:16 = 250%, we get a relative risk of: 29%:14% ≈ 2.07 = 207%. These numbers are similar, but obviously not the same. The risk ratio tells you that a student who studies is about twice as likely to get an A.
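The relative risk can be computed from the same counts. In this sketch (variable names are illustrative), the only change from an odds calculation is that each group’s A’s are divided by the whole group rather than by the students who did not get an A.

```python
from fractions import Fraction

# Counts from the example: (got an A, did not get an A)
studied = (6, 15)
not_studied = (4, 25)

def risk(with_outcome, without_outcome):
    """Probability of the outcome within one group."""
    return Fraction(with_outcome, with_outcome + without_outcome)

risk_studied = risk(*studied)          # 6/21, about 29%
risk_not_studied = risk(*not_studied)  # 4/29, about 14%

relative_risk = risk_studied / risk_not_studied
print(round(float(relative_risk), 2))  # 2.07
```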

Relation As the probabilities of events get smaller, the odds approach the probabilities, and odds ratios and relative risks are similar. However, as the probabilities of the events get higher, odds ratios and relative risks become very different.
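A quick sketch makes the pattern visible. The two pairs of probabilities below are hypothetical, chosen only so that the relative risk is exactly 2 in both cases; the function names are also just for illustration.

```python
from fractions import Fraction

def relative_risk(p1, p2):
    return p1 / p2

def odds_ratio(p1, p2):
    # odds = probability / (1 - probability)
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

# Rare events (hypothetical): 2% versus 1%
rare = (Fraction(2, 100), Fraction(1, 100))
# Common events (hypothetical): 90% versus 45%
common = (Fraction(90, 100), Fraction(45, 100))

for p1, p2 in (rare, common):
    print(float(relative_risk(p1, p2)), round(float(odds_ratio(p1, p2)), 2))
```

For the rare events the odds ratio is about 2.02, almost the same as the relative risk of 2. For the common events the relative risk is still 2, but the odds ratio is 11.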

Here’s our table again…

Odds Ratio for High Probability Events The probability of not getting an A is much higher than the probability of getting an A: 40/50 >> 10/50.
The odds of A = no when study = no: 25/4 = 6.25
The odds of A = no when study = yes: 15/6 = 2.5
Odds ratio: 6.25/2.5 = 2.5 = 250%.
Not studying raises the odds of A = no by 150%, to two and a half times the odds for students who studied.

Relative Risk for High Probability Events What about probabilities?
P(A = no/ study = no) = 25/29 ≈ 86%
P(A = no/ study = yes) = 15/21 ≈ 71%
Relative risk = 86/71 ≈ 121%
So not studying increases your risk of not getting an A by 21%.
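Putting both measures side by side for the high-probability outcome (not getting an A) shows how far apart they are. This sketch reuses the counts from the example; the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the example: (did not get an A, got an A)
not_studied = (25, 4)
studied = (15, 6)

def odds(hit, miss):
    return Fraction(hit, miss)

def risk(hit, miss):
    return Fraction(hit, hit + miss)

or_no_a = odds(*not_studied) / odds(*studied)
rr_no_a = risk(*not_studied) / risk(*studied)

print(float(or_no_a))            # 2.5
print(round(float(rr_no_a), 2))  # 1.21
```

The same data can be described as “two and a half times the odds” or as “21% more likely”, depending on which measure is reported.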

What This Means What this means is that if you see an effect size reported in the news you must know whether it is an odds ratio or a risk ratio. Otherwise a seemingly very big difference might actually be a very small difference.

Real Life Case Here’s a real headline from the NY Times: “Doctors are only 60% as likely to order cardiac catheterization for women and blacks as for men and whites.” This sounds like a risk ratio. Doctors refer white men n% of the time and blacks and women 60% of n% of the time. Right?
