Reading and reporting evidence from trial-based evaluations

Professor David Torgerson, Director, York Trials Unit, www.rcts.org

Background • Good quality randomised controlled trials (RCTs) are the best form of evidence to inform policy and practice. • However, poorly conducted RCTs may be more misleading than other types of evidence.

RCTs – a reminder • Randomised controlled trials (RCTs) provide the strongest basis for causal inference by: • Controlling for regression to the mean effects; • Controlling for temporal changes; • Providing a basis for statistical inference; • Removing selection bias.

Selection Bias • Selection bias can occur in non-randomised studies when group selection is related to a known or unknown prognostic variable. • If the variable is either unknown or imperfectly measured then it is not possible to control for this confound and the observed effect may be biased.

Randomisation • Randomisation ONLY ensures removal of selection bias if all those who are randomised are retained in the analysis within the groups to which they were originally allocated. • If we lose participants, or the analyst moves participants out of their original randomised groups, this violates the randomisation and can introduce selection bias.
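
A minimal sketch of the point above, using made-up data: an analysis "as randomised" keeps every participant in the arm they were allocated to, whereas an analysis "as received" regroups them by what they actually got. The records, the column layout and the function name `mean_difference` are hypothetical illustrations, not taken from any trial.

```python
from statistics import mean

# Hypothetical records: (allocated arm, arm actually received, outcome score).
# The two intervention participants who did not receive it have low scores.
participants = [
    ("intervention", "intervention", 12),
    ("intervention", "intervention", 14),
    ("intervention", "control", 6),
    ("intervention", "control", 8),
    ("control", "control", 10),
    ("control", "control", 9),
    ("control", "control", 11),
    ("control", "control", 10),
]

def mean_difference(records, group_index):
    """Intervention mean minus control mean, grouping on the given column."""
    intervention = [r[2] for r in records if r[group_index] == "intervention"]
    control = [r[2] for r in records if r[group_index] == "control"]
    return mean(intervention) - mean(control)

as_randomised_difference = mean_difference(participants, 0)  # original groups
as_received_difference = mean_difference(participants, 1)    # regrouped
```

In this toy data the as-randomised difference is 0 and the as-received difference is 4. The apparent effect arises only because the participants who moved out of the intervention arm had poorer outcomes.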

Is it randomised? • “The students were assigned to one of three groups, depending on how revisions were made: exclusively with computer word processing, exclusively with paper and pencil or a combination of the two techniques.” Grejda and Hannafin, J Educ Res 1992;85:144.

The ‘Perfect’ Trial • Does not exist. • All trials can be criticised methodologically, but it is best to be transparent in trial reporting so that we can interpret the results in light of the quality of the trial.

Types of randomisation • Simple randomisation • Stratified randomisation • Matched design • Minimisation

Simple randomisation • Use of a coin toss or random number tables. • Characteristics: will tend to produce some numerical imbalance (e.g., for a total n = 30 we might get 14 vs 16); exact numerical balance is unlikely. For sample sizes of <50 units it is less efficient than restricted randomisation. However, it is more resistant to subversion effects in a sequentially recruiting trial.
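
A short sketch of simple randomisation, assuming two arms labelled "A" and "B" and Python's pseudo-random generator standing in for the coin. The function names are illustrative. The second function estimates by simulation how often the two groups come out exactly equal in size; for n = 30 the binomial distribution puts this at about 14%.

```python
import random

def simple_randomise(n_units, seed=None):
    """Allocate each unit independently to 'A' or 'B' with probability 1/2."""
    rng = random.Random(seed)
    return [rng.choice("AB") for _ in range(n_units)]

def proportion_exactly_balanced(n_units, n_trials=10_000, seed=0):
    """Estimate how often simple randomisation gives equal group sizes."""
    rng = random.Random(seed)
    balanced = 0
    for _ in range(n_trials):
        n_a = sum(rng.random() < 0.5 for _ in range(n_units))
        if n_a * 2 == n_units:
            balanced += 1
    return balanced / n_trials
```

Calling `simple_randomise(30, seed=1)` returns one allocation list of length 30; most such lists will show a split such as 14 vs 16 rather than 15 vs 15.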

Stratified randomisation • To ensure balance on known covariates, restrictions on randomisation are used. Blocks of allocation are used: ABBA, AABB, etc. • Characteristics: ensures numerical balance within the block size; increases subversion risk in sequentially recruiting trials; small trials with numerous covariates can result in imbalances.
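
The blocks described above can be sketched as follows, assuming two arms, a block size of four and one independent allocation sequence per stratum. The function names and the stratum labels in the usage note are illustrative assumptions.

```python
import random

def blocked_sequence(n_blocks, block=("A", "A", "B", "B"), seed=None):
    """Return an allocation sequence built from randomly permuted blocks."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_blocks):
        permuted = list(block)
        rng.shuffle(permuted)  # e.g. ABBA, AABB, BABA
        sequence.extend(permuted)
    return sequence

def stratified_sequences(strata, n_blocks, seed=None):
    """Generate one independent blocked sequence for each stratum."""
    rng = random.Random(seed)
    return {
        stratum: blocked_sequence(n_blocks, seed=rng.randrange(2**32))
        for stratum in strata
    }
```

For example, `stratified_sequences(["low", "medium", "high"], n_blocks=2)` gives each stratum eight allocations with exactly four in each arm. Balance is only guaranteed at the end of a completed block, which is why a stratum that recruits very few participants can still end up unbalanced.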

Matched Designs • Here participants are matched on some characteristic (e.g., pre-test score) and then one member of each pair (or triplet) is allocated to the intervention. • Characteristics: numerical equivalence; loss of numbers if the total is not divisible by the number of groups; can lose power if matched on a weak covariate; difficult to match on numerous covariates; can reduce power in small samples.
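
A minimal sketch of a matched-pairs allocation, assuming matching on a single pre-test score by pairing neighbours in the ranked list. The function name and the participant ids in the usage note are hypothetical.

```python
import random

def matched_pair_allocation(pretest_scores, seed=None):
    """Pair participants on pre-test score, then randomise within each pair.

    pretest_scores: dict mapping participant id -> pre-test score.
    Returns (allocation dict, list of ids left unmatched).
    """
    rng = random.Random(seed)
    ranked = sorted(pretest_scores, key=pretest_scores.get)
    allocation = {}
    # Walk down the ranking two at a time so each pair has similar scores.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)
        allocation[pair[0]] = "intervention"
        allocation[pair[1]] = "control"
    # With an odd total, the last ranked participant has no partner.
    unmatched = ranked[len(ranked) - len(ranked) % 2:]
    return allocation, unmatched
```

With five participants, `matched_pair_allocation({"a": 10, "b": 30, "c": 12, "d": 29, "e": 50})` forms the pairs (a, c) and (d, b) and leaves "e" unmatched, which illustrates the loss of numbers when the total is not divisible by the number of groups.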

Minimisation • Rarely used in social science trials. Balance is achieved across several covariates using a simple arithmetical algorithm. • Characteristics: numerical and known covariate balance. Good for small trials with several important covariates. Increases risk of subversion in sequentially recruiting trials; increases risk of technical error.
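
One way the arithmetical algorithm can look is sketched below. This is a simplified, deterministic form of minimisation with random tie-breaking, chosen for illustration; it is not the only variant, and the function name and covariate labels are assumptions.

```python
import random

def minimise(participants, arms=("A", "B"), seed=None):
    """Allocate participants one at a time by simple minimisation.

    participants: list of dicts of covariate -> level, in recruitment order.
    Returns the list of arms allocated, in the same order.
    """
    rng = random.Random(seed)
    # counts[arm][(covariate, level)] = number already allocated to that arm
    counts = {arm: {} for arm in arms}
    allocations = []
    for person in participants:
        # Score per arm: how many earlier participants in that arm share
        # each of this person's covariate levels, summed over covariates.
        scores = {
            arm: sum(counts[arm].get(item, 0) for item in person.items())
            for arm in arms
        }
        lowest = min(scores.values())
        candidates = [arm for arm in arms if scores[arm] == lowest]
        chosen = rng.choice(candidates)  # randomness only breaks ties
        for item in person.items():
            counts[chosen][item] = counts[chosen].get(item, 0) + 1
        allocations.append(chosen)
    return allocations
```

Because the next allocation is largely determined by the allocations already made, anyone who knows the earlier allocations can often predict it, which is the subversion risk noted above for sequentially recruiting trials.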

Characteristics of a rigorous trial • Once randomised all participants are included within their allocated groups. • Random allocation is undertaken by an independent third party. • Outcome data are collected blindly. • Sample size is sufficient to exclude an important difference. • A single analysis is prespecified before data analysis.

Problems with RCTs • Failure to keep to random allocation • Attrition can introduce selection bias • Unblinded ascertainment can lead to ascertainment bias • Small samples can lead to Type II error • Multiple statistical tests can give Type I errors • Poor reporting of uncertainty (e.g., lack of confidence intervals).

Are these RCTs? • “We took two groups of schools – one group had high ICT use and the other low ICT use – we then took a random sample of pupils from each school and tested them”. • “We put the students into two groups, we then randomly allocated one group to the intervention whilst the other formed the control” • “We formed the two groups so that they were approximately balanced on gender and pretest scores” • “We identified 200 children with a low reading age and then randomly selected 50 to whom we gave the intervention. They were then compared to the remaining 150”.

Examples • “Of the eight [schools] two randomly chosen schools served as a control group”[1] • “From the 51 children… we formed 17 sets of triplets…One child from each triplet was randomly assigned to each of the 3 experimental groups”[2] • “Stratified random assignment was used in forming 2 treatment groups, with strata (low, medium, high) based on kindergarten teachers’ estimates of reading”[3] 1 Kim et al. J Drug Ed 1993;23:67. 2 Torgesen et al, J Ed Psychology 1992;84:364 3 Uhry and Shepherd, RRQ, 1993;28:219

What is the problem here? • “A random-block technique was used to ensure greater homogeneity among the groups. We attempted to match age, sex, and diagnostic category of the subjects. The composition of the final 3 treatment groups is summarized in Table 1.” Roberts and Samuels. J Ed Res 1993;87:118.

Stratifying variables • [Diagram of the stratifying variables not reproduced.] • Plus 3 groups for each bottom cell = 24 groups in all; sample size = 36.

Blocking • With so many stratifying variables and such a small sample size, blocked allocation results in an average of 1.5 children per cell (36 children across 24 cells). It is likely that some cells will be empty, and this technique can result in greater imbalances than less restricted allocation.
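
A rough simulation of the situation described above. It assumes 8 strata (24 cells divided by 3 treatment groups), that each child is equally likely to fall in any stratum, and that permuted blocks of 3 are used within each stratum. None of these details are reported for the original study, so this is only an illustration of why empty cells are likely.

```python
import random

N_CHILDREN = 36
N_STRATA = 8  # assumption: 24 cells / 3 treatment groups
GROUPS = ("A", "B", "C")

def simulate_cell_counts(rng):
    """Return {(stratum, group): count} for one simulated trial."""
    counts = {(s, g): 0 for s in range(N_STRATA) for g in GROUPS}
    pending = {s: [] for s in range(N_STRATA)}  # unused part of current block
    for _ in range(N_CHILDREN):
        stratum = rng.randrange(N_STRATA)  # assumption: strata equally likely
        if not pending[stratum]:
            block = list(GROUPS)
            rng.shuffle(block)
            pending[stratum] = block
        counts[(stratum, pending[stratum].pop())] += 1
    return counts

def mean_empty_cells(n_sims=2000, seed=0):
    """Average number of empty cells across simulated trials."""
    rng = random.Random(seed)
    total_empty = 0
    for _ in range(n_sims):
        counts = simulate_cell_counts(rng)
        total_empty += sum(1 for c in counts.values() if c == 0)
    return total_empty / n_sims
```

Each simulated trial spreads 36 children over 24 cells, so the mean per cell is always 1.5; `mean_empty_cells()` reports how many cells are left with no children at all under the stated assumptions.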

Mixed allocation • “Students were randomly assigned to either Teen Outreach participation or the control condition either at the student level (i.e., sites had more students sign up than could be accommodated and participants and controls were selected by picking names out of a hat or choosing every other name on an alphabetized list) or less frequently at the classroom level” Allen et al, Child Development 1997;64:729-42.

Is it randomised? • “The groups were balanced for gender and, as far as possible, for school. Otherwise, allocation was randomised.” Thomson et al. Br J Educ Psychology 1998;68:475-91.

Class or Cluster Allocation • Randomising intact classes is a useful approach to undertaking trials. However, to balance out class-level covariates we must have several units per group (a minimum of 5 classes per group is recommended); otherwise randomisation cannot balance out potential confounders.

What is wrong here? • “the remaining 4 classes of fifth-grade students (n = 96) were randomly assigned, each as an intact class, to the [4] prewriting treatment groups;” Brodney et al. J Exp Educ 1999;68,5-20.

Misallocation issues • “We used a matched pairs design. Children were matched on gender and then 1 of each pair was then allocated to the intervention whilst the remaining child acted as a control. 31 children were included in the study: 15 in the control group and 16 in the intervention.” • “23 offenders from the treatment group could not attend the CBT course and they were then placed in the control group”.

Attrition • Rule of thumb: 0-5%, not likely to be a problem; 6-20%, worrying; >20%, selection bias. • How to deal with attrition? • Sensitivity analysis. • Dropping the remaining participant in a matched design does NOT deal with the problem.

What about matched pairs? • We can only match on observable variables, and we trust to randomisation to ensure that unobserved covariates or confounders are equally distributed between groups. • If we lose a participant, dropping the other member of the matched pair does not address the unobservable confounders, and balancing these is one of the main reasons we randomise.

Matched Pairs on Gender

Drop-out of 1 girl

Removing matched pair does not balance the groups!

Dropping matched pairs • In that example, by dropping the matched pair we make the situation worse. • The groups are balanced on gender but imbalanced on high/low. • We can correct for gender in the statistical analysis as it is an observable variable; we cannot correct for high/low as this is unobservable. • Removing the matched pair reduces our statistical power but does not solve our problem.

Sensitivity analysis • In the presence of attrition we can see if our results change because of this. For example, for the group that has a good outcome, we can give the worst possible scores to the missing participants and vice versa. • If the difference still remains significant we can be reassured that attrition did not make a difference to the findings.
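
The extreme-case imputation described above can be sketched as follows. The scores, the 0 to 100 scale and the function name are made up for illustration, and the sketch compares point estimates only; in practice the significance test would be rerun on the imputed data.

```python
from statistics import mean

def worst_case_difference(intervention, control, n_missing_intervention,
                          n_missing_control, worst_score, best_score):
    """Difference in means after extreme-case imputation.

    Assumes the intervention group has the better observed outcome and that
    higher scores are better: its missing participants are given the worst
    possible score and the control group's are given the best.
    """
    imputed_intervention = (list(intervention)
                            + [worst_score] * n_missing_intervention)
    imputed_control = list(control) + [best_score] * n_missing_control
    return mean(imputed_intervention) - mean(imputed_control)

# Hypothetical observed scores; one participant lost from each group.
observed_intervention = [70, 75, 80, 75]
observed_control = [60, 65, 55, 60]

observed_difference = mean(observed_intervention) - mean(observed_control)
extreme_difference = worst_case_difference(
    observed_intervention, observed_control,
    n_missing_intervention=1, n_missing_control=1,
    worst_score=0, best_score=100,
)
```

In this toy example the observed difference of 15 points becomes a difference of 8 points in the opposite direction once the missing scores are imputed, so the finding would not be robust to attrition.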

Flow Diagrams Hatcher et al. 2005 J Child Psych Psychiatry: online

Flow Diagram • In health care trials reported in the main medical journals, authors are required to produce a CONSORT flow diagram. • The trial by Hatcher et al. clearly shows the fate of the participants from randomisation until analysis.

Poorly reported attrition • In an RCT of foster carers, extra training was given. • “Some carers withdrew from the study once the dates and/or location were confirmed; others withdrew once they realized that they had been allocated to the control group” “117 participants comprised the final sample” • No split between groups is given except in one table, which shows 67 in the intervention group and 50 in the control group. The control group is about 25% smaller than the intervention group (17/67). Unequal attrition is a hallmark of potential selection bias, but we cannot be sure. Macdonald & Turner, Brit J Social Work (2005) 35,1265

Recent Blocked Trial “This was a block randomised study (four patients to each block) with separate randomisation at each of the three centres. Blocks of four cards were produced, each containing two cards marked with "nurse" and two marked with "house officer." Each card was placed into an opaque envelope and the envelope sealed. The block was shuffled and, after shuffling, was placed in a box.” Kinley et al., BMJ 325:1323.

What is wrong here? Kinley et al., BMJ 325:1323.

Type I error issues • 3-group trial: “Pre-test to posttest scores improved for most of the 14 variables”. • With 14 variables and 3 pairs of groups there are 42 potential between-group comparisons. • The authors actually did more, reporting one-group pretest-posttest tests as well as between-group tests, which gives 82 tests. Roberts and Samuels. J Ed Res 1993;87:118.
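
To see why so many tests are a problem, the chance of at least one false positive can be approximated as below. The calculation assumes a 5% significance level and independent tests; outcomes measured in the same trial are usually correlated, so the figures are indicative only. The function names are illustrative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false positives if every null hypothesis is true."""
    return n_tests * alpha
```

Under these assumptions `familywise_error_rate(42)` is about 0.88 and `familywise_error_rate(82)` is about 0.99, with roughly 2 and 4 false positives expected respectively.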

Type II errors • Most social science interventions show small effect sizes (typically 0.5 or lower). To have an 80% chance of detecting an effect size of 0.5 we need 128 participants in total. For smaller effects we need much larger studies (e.g., 512 participants for an effect size of 0.25).
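
The sample sizes quoted above are consistent with the common rule of thumb of about 16/d² participants per group for 80% power at a two-sided 5% significance level, where d is the standardised effect size. The sketch below shows that rule alongside the normal-approximation formula it is derived from. It assumes two equal-sized arms, individual randomisation and no adjustment for attrition or clustering; the function names are illustrative.

```python
from math import ceil
from statistics import NormalDist

def total_sample_size(effect_size, alpha=0.05, power=0.80):
    """Total n for a two-arm trial with equal groups (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n_per_group = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2
    return 2 * ceil(n_per_group)

def rule_of_thumb_total(effect_size):
    """About 16 / d^2 per group, so 32 / d^2 in total."""
    return ceil(32 / effect_size ** 2)
```

`rule_of_thumb_total(0.5)` gives 128 and `rule_of_thumb_total(0.25)` gives 512; the normal approximation gives slightly smaller totals because the rule of thumb rounds the constant up.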

Analytical Errors • Many studies do the following: • Do paired tests of pre-test versus post-test scores within each group. This is unnecessary and misleading in an RCT, as we should compare group means. • Do not take into account cluster allocation. • Use gain scores without adjusting for baseline values. • Do multiple tests.

Pre-treatment differences • A common approach is to statistically test baseline covariates: • “The first issue we examined was whether there were pretreatment differences between the experimental groups and the control groups on the following independent variables” “There were two pretreatment differences that attained statistical significance” “However, since they were statistically significant these 2 variables are included as covariates in all statistical tests”. Davis & Taylor Criminology 1997;35:307-33.

What is wrong with that? • If randomisation has been carried out properly then the null hypothesis is true: any baseline differences have occurred by chance. • Statistical significance of a baseline difference gives no clue as to the importance of the covariate for the analysis. Including a significant but unimportant covariate reduces power, whilst ignoring a balanced but important covariate also reduces power.

The CONSORT statement • Many journals require authors of RCTs to conform to the CONSORT guidelines. • This is a useful approach to deciding whether or not trials are of good quality.

Modified CONSORT quality criteria • Was the study population adequately described? (i.e. were the important characteristics of the participants described, e.g. age, gender?) • Was the minimum important difference described? (i.e. was the smallest clinically important effect size described?) • Was the target sample size adequately determined? • Was intention to treat analysis used? • Was the unit of randomisation described (i.e. individuals or groups)? • Were the participants allocated using random number tables, coin flip, computer generation? • Was the randomisation process concealed from the investigators? • Were follow-up measures administered blind? • Was estimated effect on primary and secondary outcome measures stated? • Was precision of effect size estimated (confidence intervals)? • Were summary data presented in sufficient detail to permit alternative analyses or replication? • Was the discussion of the study findings consistent with the data?

Review of Trials • In a review of RCTs in health care and education, the quality of the trial reports was compared over time. Torgerson CJ, Torgerson DJ, Birks YF, Porthouse J. Br Ed Res J. 2005;31:761-85.

Study Characteristics

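Change in concealed allocation • [Chart not reproduced.] Reported P values: P = 0.04; P = 0.70. • NB: No education trial used concealed allocation.

Blinded Follow-up • [Chart not reproduced.] Reported P values: P = 0.03; P = 0.54; P = 0.13.

Underpowered • [Chart not reproduced.] Reported P values: P = 0.22; P = 0.01; P = 0.76.

Mean Change in Items • [Chart not reproduced.] Reported P values: P = 0.03; P = 0.001; P = 0.07.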

Summary • There is a lot of evidence from health care trials that poor quality studies give different results compared with high quality studies. • Social science trials tend to be poorly reported. It is often difficult to distinguish between poor quality and poor reporting. • Reporting quality can easily be increased.
