Difficulties in analysing non-randomised trials (…and ways forward?)

RCTs in the Social Sciences: challenges and prospects. York University, 13-15 Sept. 2006
Paul Marchant, Leeds Metropolitan University, p.marchant@leedsmet.ac.uk
(Paul Baxter, from the Department of Statistics, University of Leeds, is involved in developing some of this work.)

The Basic Point
• My thoughts:
• If non-RCTs are used, we need a good understanding of the system being studied and a quantitative model to work out what is lost and what the effect is.
• The effects being sought may be small, so the impact of small systematic errors can be important.
• It is probably best just to use RCTs, especially when the policy implications are costly.

The problem
• In crime research there is a 5-point ‘Maryland Scientific Methods Scale’, which orders trial designs (the RCT is at the top).
• While the ordering may be fine, there is no formal indication of what is lost by using a 4 rather than a 5.
• There would seem to be a large potential for drawing false inferences.

The Randomised Controlled Trial (a truly marvellous scientific invention)
Flow: Population → take sample → randomise to 2 groups (old treatment, new treatment) → compare outcomes (averages), recognising that these are sample results and subject to sampling variation when applying back to the population.
Note, to avoid ‘bias’:
• Allocation is best made tamper-proof (e.g. use ‘concealment’).
• Use multiple blinding of patients, physicians, assessors, analysts …

Counts of those cured and not cured under the two treatments
By comparing the ratios of the numbers ‘cured’ to ‘not cured’ in the 2 arms of the trial, through the cross-product ratio CPR = (ad)/(cb) of the four cell counts a, b, c, d, it is possible to tell if the new treatment is better.

Confidence Intervals
• However, there is sampling variability, because we don’t study everybody of interest, just our random sample.
• So we cannot have perfect knowledge of the effect of interest, only an estimate of it within a confidence interval (CI).
• We need to know how to calculate the CI appropriately. This can be done under assumptions which seem reasonable for the case of a clinical RCT, and leads to a simple formula for the approximate CI of ln(CPR): the estimate ± 1.96 standard errors, where
$$\left(\mathrm{s.e.}(\ln \mathrm{CPR})\right)^2 = \mathrm{Var}(\ln \mathrm{CPR}) = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}$$
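The CI formula on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the function name `cpr_with_ci`, the assignment of a, b, c, d to particular cells, and the counts themselves are assumptions made for the example, not data from any trial.

```python
import math

def cpr_with_ci(a, b, c, d, z=1.96):
    """Cross-product ratio (ad)/(cb) with an approximate CI.

    The CI is built on the log scale as ln(CPR) +/- z * s.e.,
    with s.e.^2 = 1/a + 1/b + 1/c + 1/d, then exponentiated.
    """
    cpr = (a * d) / (c * b)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_cpr = math.log(cpr)
    lower = math.exp(log_cpr - z * se_log)
    upper = math.exp(log_cpr + z * se_log)
    return cpr, lower, upper

# Hypothetical 2x2 counts, for illustration only.
cpr, lower, upper = cpr_with_ci(a=40, b=60, c=25, d=75)
print(f"CPR = {cpr:.2f}, approx 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval is symmetric on the log scale, not on the CPR scale, which is why the limits are computed before exponentiating.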

Crime counts before and after in two areas, one of which gets a CRI (crime reduction initiative); a 4 on the Methods Scale
A similar table results. But this is not the same as the RCT set-up, because:
1 It is not randomised, so no statistical equivalence exists at the start.
2 The unit is the area, rather than the crime event.

Lighting and crime
There seem to be many ‘theoretical suggestions’ as to why lighting might increase or decrease crime. The meta-analysis HORS251, by Farrington and Welsh, strongly suggests that lighting beats crime. However, my contention is that this study remains flawed, and so we are ignorant of the effect of lighting on crime. (Note also HORS252 on CCTV.)

Forest plot of the HORS251 meta-analysis, reconstructed

But this can’t be right.
• The assumptions for calculating the CIs cannot be correct in this case. The unit is the area, not the crime, and the events are not statistically independent.
• Too much variation (heterogeneity) exists between the individual study results compared with the uncertainty indicated by the confidence intervals (if the lighting has the same effect on crime in every study).
• Note that there is great variation in crime counts between periods in the comparison areas, where nothing is changed, so the heterogeneity is inherent in the natural variation of crime.
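One common way to put a number on ‘too much variation between studies’ is Cochran's Q statistic, sketched below. This is a standard meta-analysis diagnostic offered as an illustration; it is not claimed to be the calculation used in HORS251 or in this talk, and the effect sizes and standard errors are hypothetical.

```python
def cochran_q(log_effects, std_errors):
    """Cochran's Q for a set of study estimates on the log scale.

    Each study is weighted by 1/se^2. If every study shared the same
    true effect and the standard errors were right, Q would be roughly
    chi-squared with (number of studies - 1) degrees of freedom, so a
    Q far above its degrees of freedom signals heterogeneity.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_effects))
    dof = len(log_effects) - 1
    return q, dof, pooled

# Hypothetical ln(CPR) values and standard errors for three studies.
q, dof, pooled = cochran_q([0.0, 0.2, 0.4], [0.1, 0.1, 0.1])
print(f"Q = {q:.2f} on {dof} degrees of freedom, pooled ln(CPR) = {pooled:.2f}")
```

Here Q is four times its degrees of freedom, which is the kind of discrepancy the slide describes: the spread of results is larger than the stated confidence intervals allow.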

Pointing out the problem
• Marchant (2004), a 7-page article in the British Journal of Criminology, drew attention to the problem: the formula used for the CIs must be inappropriate (it also mentioned other shortcomings).
• The authors of HORS251 had a 20-page response, starting on the next page, justifying the claim that lighting reduces crime.
• But I remain unconvinced by the claim.

Fixing the Heterogeneity Problem
• A way of making the problem go away is simply to increase the uncertainty, i.e. stretch the CIs (‘a quasi-Poisson model’).
• Here the CIs are stretched by a factor of 2.1 (equivalent to reducing the events counted in every setting by a factor of 2.1² ≈ 4.4). This adjustment has been made by the authors.
• Problem solved... or is it? Is such a model plausible? It assumes every study should have its CI stretched by the same factor. This cannot be guaranteed.
• There are only relatively few (13) studies.
• A sensitivity analysis is needed.
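The equivalence between stretching the CIs by 2.1 and dividing every count by 2.1² can be checked directly from the standard-error formula for ln(CPR). The sketch below uses hypothetical counts and hypothetical function names; it illustrates the arithmetic only, not the authors' actual adjustment procedure.

```python
import math

def se_ln_cpr(a, b, c, d):
    """Standard error of ln(CPR) under the simple independence model."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

def stretched_ci(a, b, c, d, stretch=2.1, z=1.96):
    """CI for the CPR with its log-scale half-width multiplied by `stretch`."""
    log_cpr = math.log((a * d) / (c * b))
    half_width = z * stretch * se_ln_cpr(a, b, c, d)
    return math.exp(log_cpr - half_width), math.exp(log_cpr + half_width)

# Hypothetical crime counts (before/after, intervention/comparison).
counts = (400.0, 300.0, 350.0, 360.0)

# Dividing every count by 2.1**2 multiplies the standard error by 2.1.
shrunk = tuple(n / 2.1 ** 2 for n in counts)
ratio = se_ln_cpr(*shrunk) / se_ln_cpr(*counts)

plain = stretched_ci(*counts, stretch=1.0)
wide = stretched_ci(*counts, stretch=2.1)
print(f"s.e. ratio = {ratio:.3f}")
print(f"unstretched CI = {plain}, stretched CI = {wide}")
```

A simple sensitivity analysis would rerun `stretched_ci` over a range of stretch factors, or with a different factor per study, to see how far the pooled conclusion depends on the single value 2.1.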

Time Variation in Crime
• It appears that little is known about how crime varies on various scales.
• Much more needs to be known about the occurrence of crime events in order to know how to analyse them properly and so be able to find effects.
• Access to suitable data sets is needed to examine this issue. This is ongoing research in which my colleagues and I are engaged.
• A general point: one needs to have knowledge about the system in order to understand whether an intervention changes things (and in order to design studies).

The Bristol Study (Shaftoe 1994)
Shaftoe said ‘no discernable lighting benefit’, but HORS251 said z = 6.6.
Note: had the data for the year immediately prior to the introduction of the relighting (i.e. periods 2 and 3) been used, rather than unnaturally using periods 1 and 2, which leaves a gap of ½ year, the effect found would have been half of that claimed. (This shows the large variability.)

Household studies
• In a couple of instances, instead of just counting recorded crimes a, b, c, d in the 4 cells (before, after, intervention, comparison), a household survey of recalled crimes is carried out before and after within the 2 areas (intervention, comparison).
• One problem, unrecognised by the authors Painter and Farrington, is that spatial correlation between the occurrences of crime needs to be considered. This gives rise to a Design Effect, familiar in clustered designs, which reduces the precision of the estimate of effect.
• There are other problems, e.g. differential change of composition between periods.
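To give a feel for how a Design Effect reduces precision, the sketch below uses the textbook approximation for equal-sized clusters, deff = 1 + (m - 1) × ICC, where m is the cluster size and ICC the intra-cluster correlation. Both the use of this particular approximation and the numbers are assumptions for illustration; the surveys discussed here do not report these quantities.

```python
import math

def design_effect(cluster_size, icc):
    """Textbook design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1.0) * icc

def inflated_se(naive_se, cluster_size, icc):
    """Standard error after allowing for clustering: naive s.e. * sqrt(deff)."""
    return naive_se * math.sqrt(design_effect(cluster_size, icc))

# Hypothetical values: 10 households per cluster, ICC of 0.05.
deff = design_effect(cluster_size=10, icc=0.05)
se = inflated_se(naive_se=0.20, cluster_size=10, icc=0.05)
print(f"design effect = {deff:.2f}, s.e. rises from 0.20 to {se:.3f}")
```

Even a modest correlation between neighbouring households widens the CI noticeably, and the effective sample size is the nominal one divided by the design effect.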

Lack of Equivalence between Areas
Invariably it is the most crime-ridden area that gets the lighting, whereas the relatively crime-free ‘control’ area is not re-lit. So there is a lack of equivalence at the start. One effect of this is to allow ‘regression towards the mean’ to operate. The name ‘Control Area’ is a misnomer; ‘Comparison Area’ is a better name.

[Figure: Regression towards the mean. The after measurement (Y) plotted against the before measurement (X), both axes running from 0 to 100, showing the cloud of data points, the line of equality, and the line of the mean of Y for a given X.]

The response given to the lack of equivalence between the 2 areas (RTM)
• Farrington and Welsh (2006) claim that RTM is not a problem, because counted crimes in 250 Police ‘Basic Command Units’ (BCUs), going from 2002/3 to 2003/4, showed only a small effect (a few %). This is hardly surprising, as the areas, and hence the numbers of crimes counted, are an order of magnitude larger than in HORS251, so the year-to-year correlation is expected to be higher than for the small lighting study areas.
• Note Wrigley (1995): “This tendency for correlation coefficients to increase in magnitude as the size of the areal unit involved increases has been known since the work of Gehlke and Biehl (1934)”.

Log crime rates in successive periods

Estimating the effect of RTM
On the basis of log-normal crime rates, it can be shown that, if the intervention has no effect,
$$\mathrm{E}[\ln \mathrm{CPR}] = \left(1 - \rho\,\frac{\sigma_y}{\sigma_x}\right) \ln\frac{x_1}{x_2}$$
$$\mathrm{Var}(\ln \mathrm{CPR}) = 2\,\sigma_y^2\,(1 - \rho^2)$$
where x_1/x_2 is the crime rate ratio, σ_x and σ_y are the standard deviations on the log scale, and ρ is the correlation on the log scale.

Estimation of the effect of RTM
• The simple model of crime rates suggests that the high year-to-year correlation, typically 0.95 for the BCU data, would indeed give an effect of a few %.
• However, the smaller areas used in CRI evaluation would be expected to have a lower correlation.
• Burglary data from a study of 124 areas have a correlation of about 0.8, giving, all else being equal, an expected effect 4 times larger, comparable to the claimed lighting effect.
• Note: in general we know neither the correlation nor the rates being compared for the lighting studies. However, we do know that, whereas the household crime rate ratio at the start was 1.40 for Dudley, that for Stoke was 2.51, giving a much larger expected RTM effect.
• Without better knowledge we can’t be definite about the impact of RTM, but the indications are that the bias could be serious and the uncertainty large.
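The ‘4 times larger’ figure follows from the expected ln CPR formula: with equal standard deviations in the two periods the RTM term is proportional to (1 - ρ), and (1 - 0.8)/(1 - 0.95) = 4. The sketch below works this through in Python. The assumption σ_y = σ_x, the function names, and the use of the same correlation for Dudley and Stoke are illustrative choices, not facts about those studies.

```python
import math

def expected_ln_cpr(rate_ratio, rho, sd_ratio=1.0):
    """Expected ln(CPR) from RTM alone: (1 - rho * sigma_y/sigma_x) * ln(x1/x2).

    sd_ratio is sigma_y / sigma_x; taking it as 1.0 is an assumption.
    """
    return (1.0 - rho * sd_ratio) * math.log(rate_ratio)

def variance_ln_cpr(sigma_y, rho):
    """Variance of ln(CPR) under the same model: 2 * sigma_y^2 * (1 - rho^2)."""
    return 2.0 * sigma_y ** 2 * (1.0 - rho ** 2)

# Same starting rate ratio, two correlations quoted on the slide.
large_areas = expected_ln_cpr(rate_ratio=1.40, rho=0.95)
small_areas = expected_ln_cpr(rate_ratio=1.40, rho=0.80)
correlation_factor = small_areas / large_areas

# Same (assumed) correlation, the two starting ratios quoted on the slide.
dudley = expected_ln_cpr(rate_ratio=1.40, rho=0.80)
stoke = expected_ln_cpr(rate_ratio=2.51, rho=0.80)
ratio_factor = stoke / dudley

print(f"rho 0.80 vs 0.95: expected RTM effect x{correlation_factor:.1f}")
print(f"Stoke vs Dudley starting ratio: expected RTM effect x{ratio_factor:.2f}")
```

The two factors multiply: a lower correlation and a larger initial imbalance both push the expected ln CPR away from zero even when the intervention does nothing.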

Expected natural log of CPR and its CI for a set of burglary data.

Potential consequences of weak methods
• Because there is a tendency to find ‘positive effects’, probably even more so with less rigorous work, one is likely to end up with an even more distorted research record.
• This might lead to a bad policy being justified on dubious grounds through flimsy cost-benefit analyses.
• While it might be possible to estimate the effect of the excess variability or the effect of RTM discussed, it would seem problematic to be confident about adequately adjusting for them.
• RCTs would avoid many problems and may be very cheap relative to policy costs.

Some conclusions
• A ‘Methods Scale’ seems to suggest that designs weaker than RCTs might suffice, without indicating what is lost.
• I have indicated some of the problems which result.
• We need to ‘foster scepticism’ (Gorard 2002).
• I remain to be convinced that the deficiencies can be adequately overcome through estimating quantitatively the consequences of using a weaker design.
• Weaker designs might be useful in preliminary research but should not be considered adequate when there are expensive consequences.
• RCTs can be problematic enough! (We need registered trials, published protocols, blinding etc.)
• Evaluations of policies need to be done to a high scientific standard.

References
Farrington D.P. and Welsh B.C. (2002) The Effects of Improved Street Lighting on Crime: A Systematic Review, Home Office Research Study 251, http://www.homeoffice.gov.uk/rds/pdfs2/hors251.pdf
Farrington D.P. and Welsh B.C. (2004) Measuring the Effects of Improved Street Lighting on Crime: A Reply to Dr. Marchant, The British Journal of Criminology 44, 448-467, http://bjc.oupjournals.org/cgi/content/abstract/44/3/448
Farrington D.P. and Welsh B.C. (2006) How Important is Regression to the Mean in Area-Based Crime Prevention Research?, Crime Prevention and Community Safety 8, 50
Gorard S. (2002) Fostering Scepticism: The Importance of Warranting Claims, Evaluation and Research in Education 16(3), p136
Marchant P.R. (2004) A Demonstration that the Claim that Brighter Lighting Reduces Crime is Unfounded, The British Journal of Criminology 44, 441-447, http://bjc.oupjournals.org/cgi/content/abstract/44/3/441

References continued
Marchant P.R. (2005) What Works? A Critical Note on the Evaluation of Crime Reduction Initiatives, Crime Prevention and Community Safety 7, 7-13
Painter K. and Farrington D.P. (1997) The Crime Reducing Effect of Improved Street Lighting: The Dudley Project, in R.V. Clarke (ed.) Situational Crime Prevention: Successful Case Studies, 209-226, Harrow and Heston, Guilderland NY
Shaftoe H. (1994) Easton/Ashley, Bristol: Lighting Improvements, in S. Osborn (ed.) Housing Safe Communities: An Evaluation of Recent Initiatives, 72-77, Safe Neighbourhoods Unit, London
Tilley N., Pease K., Hough M. and Brown R. (1999) Burglary Prevention: Early Lessons from the Crime Reduction Programme, Crime Reduction Research Series Paper 1, Home Office, London
Wrigley N. (1995) Revisiting the Modifiable Areal Unit Problem and Ecological Fallacy, pp49-71 in P.R. Gould, A.G. Hoare and A.D. Cliff (eds) Diffusing Geography: Essays for Peter Haggett

The RTM problem
• The effect of RTM depends on the correlation (the weaker the correlation, the bigger the effect) and increases with the size of the initial difference between groups.
• The authors attempt to justify having no concern about RTM using large-area crime data, which show only a small RTM effect. But this is wrong, as the correlation won’t be as high in the smaller areas used in the trials. We also don’t, in general, know the rates in the areas; for the 2 studies where we do, they are quite different (1.4X and 2.5X).
