Experimental / Survey Designs

"Truth is sought for its own sake… Finding the truth is difficult, and the road to it is rough. ..." (Ibn Al-Haytham, 965 – 1039, a pioneer of scientific methods) Peter Shaw

First: Risk Assessments Not academic, but utterly vital! H&S law is clear that you can do quite a lot of dangerous-sounding things, providing that you have undertaken a risk assessment and found them to be acceptably safe. This requires you to do a risk assessment – for just about anything! The risk framework requires you to specify what you are doing, how you are doing it, what the worst case scenario is, and what you will do under that scenario. You must complete this with a supervisor!

Introduction • Without a good design, a project is worthless • Rothamsted had to throw away 50 years' worth of meticulously collected data from an experiment because the design was faulty • No amount of analysis or literature review can save a bad design - this is utterly crucial Sir R. A. Fisher: "to call in a statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of".

‘…finding the truth is difficult…’ – ‘Clever Hans’ • Owner Wilhelm von Osten started displaying Hans in 1891 • Oskar Pfungst solved the mystery in 1907

People who don’t like designed experiments: Many people don’t like properly designed trials. These are quacks, fraudsters and snake oil merchants, and generally deserve holding in contempt. The archetypal example is the vitamin salesman Matthias Rath, who made a lot of money by persuading the South African government to scrap white western anti-retrovirals and instead give HIV victims his vitamin mixes. He sued the Guardian when Ben Goldacre brought this to public attention (and, happily, lost with a million pound legal bill).

The evidence that homeopathy works is quite good, as long as you compare it to leaving the patient alone unattended. When compared with a placebo, homeopathy fails to show any benefit. Homeopaths have taken to avoiding randomised trials, saying pathetic things like ‘homeopathy is unsuited for randomised trials’. (In other words, it’s a placebo fraud). Likewise chiropractors, and the college of chiropractors sued Simon Singh for pointing out that there is no evidence that chiropractic works, still less that it can cure colic, ear infections or asthma. Singh won his appeal Oct 2009, thankfully.

Always remember: H0 No evidence No belief!!

You intervene: tend to use ANOVA
You observe: tend to use correlation / regression

The media habitually get this wrong, and deserve to get picked up more. Correlation does not establish causality! Just because X co-occurs with Y does not mean that X causes Y, nor vice versa. Under a prohibition model, prohibited drugs are inevitably going to be associated with organised crime. Politicians then see the association between drugs and crime and use that to justify the prohibition; the correlation is valid but the causality is not: the argument is circular hence invalid. You establish, by dietary survey and health records, that there is a clear trend for people who only eat organic food to have better health (after correcting for age etc) than people who never do. Does this prove that organic veg are good for you?

4 golden principles to make an experimental design work • Control • Balance • Replication • Randomness

Control • This requires you to have a group of samples / subjects which are not subjected to the treatment of interest • Often in experiments this is easy: Don’t add fertiliser, don’t add lead etc • Is actually linked at a deep level with your H0 and can conceal serious pitfalls

Automatically, a design without a control is unable to ascertain the effects of an intervention. I blocked a PhD here some years ago who sought to use hypnotism to improve athletic performance. The subject area would have been fine for a PhD, except that for some reason (that I am still at a loss to explain) they insisted on not needing a control group. They’ve still not got a PhD, and haven’t talked to me since. Similarly, I have seen a disturbing number of Psych.D. applications be approved here whose model to investigate subject area X is to interview 5 practitioners who have experience of area X. This is, of course, utterly worthless for establishing whether any of their interventions actually did any good. (I’ve heard a homeopath talking about her successes – even though her profession is based on the placebo effect.)

Problematic control situations • A Russian experiment added live earthworms to a tundra soil – there was a growth spurt compared to worm-free controls • Why? Worms died and rotted, releasing N/P/K. • Needed a second control, with dead worms. • Similarly, experiments adding nitrogen to rainwater need 2 controls: • Natural rain • Simulated rain with no nitrogen

John Cade explored the effects of uric acid on manic depression. Around 1940 he explored the application of a soluble salt (uric acid is tricky to dissolve) – he chose lithium urate. This indeed proved effective in calming bipolar disorder, against a control of no intervention. Did this prove uric acid was the answer? He had the good scientific sense to include a second control of lithium carbonate (now called “the vehicle”), which had exactly the same effect. In 1949 this was published, revolutionising psychiatry.

I’ve a design that needs no control. (Really?) Real example from a project: Miss X. wants to test the effect of caffeine on student alertness. She gives students an alertness test, then a cup of coffee, then a 2nd test. The difference between the 2 tests has H0: random noise, and if H0 is rejected that shows caffeine has an effect – no control needed. Where’s the flaw?!

Miss N wants to get a dose-response curve for the bactericidal power of ‘Dettol’. She added Dettol at 3 concentrations to bacterial cultures, then measured the turbidity of each culture as a measure of cell density. Here’s the result: [Figure: culture turbidity at high, medium and low Dettol concentrations] So the broth was unexpectedly cloudy with a high conc. of Dettol. How does this link to the need for a control?

Control • Needs thought with some factors: temperature, pH • The concept needs modification for surveys, esp. sociological ones • use a contrast (M/F, etc) • find a steady gradient and follow it

Replication • Never just sample anything once! • The minimum # reps is 3 • The more replicates the better, but check your time/equipment budgets. • 80 soil samples = 2 weeks • There are deep, nasty traps here. Under certain H0s plants in the same pot are not replicates, nor are people in the same family (or even school - depends on H0) • Check design with me if there is any doubt!

Balance • Analyses are more reliable + powerful if you have equal sample numbers in each of your sample classes • Some analyses are impossible without perfect balance • Hence it is NOT a waste of time to devote 1/2 effort to an untreated control

Randomisation If you don’t randomise you cannot make a link between cause and effect. (Or rather you can, but it may not be safe). True example: Councils use speed cameras to catch speeding motorists, and wanted to use them to reduce death rates at accident black spots. An experiment was set up in conjunction with DoT to add speed cameras to selected bad junctions. They decided to identify the 2000 WORST accident black spots in the study area, and added speed cameras to them. 18 months later these showed that there was a 60% reduction in death rate. This design is so deeply flawed that the study was barely worth performing, certainly not the £100,000 of public money involved. Why?

The study was flawed because the selection of locations was not random: they chose the worst accident black spots. H0: No systematic differences, variation is random noise. This implies that the worst black spots, next time, will simply revert closer to the population mean, whatever intervention is used. I created 2 columns of random numbers of the same mean and sd, then identified the 5 highest values in column 1. I then graphed these pre-intervention data with their matched random values in column 2 (“post-intervention”). What happened?

Regression to the mean: why you mustn’t pick out the ‘best’ sites to study. A stunning difference, significant at p<0.01. But this is random noise – with a non-random sampling method.
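The two-column exercise described above can be repeated in a few lines. This is only a sketch: the number of sites (200), the mean and sd, and the function name `regression_to_mean_demo` are assumptions chosen for illustration, not the values used in the original demonstration.

```python
import random
import statistics

def regression_to_mean_demo(n_sites=200, n_worst=5, mean=10.0, sd=3.0, seed=1):
    """Select the n_worst highest 'before' values and compare them with
    their matched, completely unrelated 'after' values."""
    rng = random.Random(seed)
    # Two columns of random numbers with the same mean and sd.
    # No intervention effect is built in anywhere.
    before = [rng.gauss(mean, sd) for _ in range(n_sites)]
    after = [rng.gauss(mean, sd) for _ in range(n_sites)]
    # Non-random selection: pick the sites that looked worst beforehand.
    worst = sorted(range(n_sites), key=lambda i: before[i], reverse=True)[:n_worst]
    mean_before = statistics.mean(before[i] for i in worst)
    mean_after = statistics.mean(after[i] for i in worst)
    return mean_before, mean_after

mean_before, mean_after = regression_to_mean_demo()
print(f"worst sites before: {mean_before:.1f}, same sites after: {mean_after:.1f}")
```

The "after" mean falls back towards the population mean even though nothing was done to the sites; the apparent improvement comes entirely from how the sites were chosen.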

The Lanarkshire milk trials, 1930 Pre WW2 we had kids who really went hungry, and there was a real concern that maybe additional milk would help them. The Lanarkshire milk trials were run to examine this. Teachers in Lanarkshire were given enough milk for half their class and told to give it to a random half of the kids. Would you give milk to an overweight boy while a genuinely hungry, genuinely malnourished girl looked on enviously, every day for weeks? No, the teachers didn’t either. NOTHING AT ALL could be concluded from any data collected from this design once created and no useful conclusions have ever been drawn from it. The allocation of children should have been randomised.
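One way to take the choice out of the teacher's hands is to let a random shuffle decide. The sketch below is illustrative only: `randomise_allocation`, the group labels and the pupil identifiers are hypothetical names, not part of the original trial.

```python
import random

def randomise_allocation(pupils, seed=None):
    """Split a class list into two halves by random shuffle, so that nobody's
    judgement (or sympathy) influences who receives the treatment."""
    rng = random.Random(seed)
    shuffled = list(pupils)          # copy, so the class list itself is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"milk": sorted(shuffled[:half]), "control": sorted(shuffled[half:])}

class_list = [f"pupil_{i:02d}" for i in range(30)]   # hypothetical identifiers
groups = randomise_allocation(class_list, seed=42)
print(len(groups["milk"]), len(groups["control"]))
```

Fixing the seed makes the allocation reproducible, which lets someone else check that the list was not adjusted afterwards.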

This example is also from education, and concerns a ‘cure’ for dyslexia. It was heavily promoted by its millionaire founder Wynford Dore, calling it the ‘Dore Method’. It was supported by Richard and Judy, and Trevor Macdonald on newsnight. It even published a paper in a Dyslexia journal in 2006 showing how well it worked, but 9 experts objected to the paper on the grounds that the trial was not randomised. Parents paid £2000 per child per course. It collapsed in May 2008, having never been validated by a randomised trial. http://www.guardian.co.uk/science/2006/nov/04/badscience.uknews http://www.dore.co.uk/miraclecure.aspx http://www.timesonline.co.uk/tol/news/uk/health/article4022998.ece

Randomness • Underlying any analyses you perform is the assumption of random sampling • With soil / veg samples use random number tables to define a coordinate • People are more difficult - by definition people stopping to talk to you are not a random sample • How to mix this with a systematic survey along (say) a transect? Use STRATIFIED RANDOM sampling.

Practicalities I prefer to avoid the nitty-gritty of methodologies in this lecture, for the obvious reason that methodologies are subject-specific. There are clear cases where a methodological flaw ruins an analysis, eg the researcher who claimed that Coca-Cola caused brain bleeding in experimental animals. Under cross-questioning it turned out the animals had been killed by being hit on the head. The warning is more for your coursework assignments: these are real examples from submitted work. Q: Design a survey to investigate whether eggshell thickness in birds varies in relation to soil calcium levels. A: “I would identify 50 chicken farms and collect every egg they produce for two weeks continuously. I would collect 1 sample of soil for analysis.” (How many cubic metres of eggs?)

Q: Design a survey to investigate the influence of human trampling on plant species in a downland sward. A:“The trampling treatment would involve the turf being trampled continuously for 12 hours a day. This would carry on for a year”. (Call that a job opportunity?) Q: Design an experiment to investigate the influence of weed and slug control on plant yield in garden plots. A: “In the weeded treatments the lettuces would be sprayed with weedkiller every day” (No weeds, but don’t you think there might be some collateral damage?)

Your turn… • You design an experiment with balance, replication and multiple treatments. Your topic is set by your surname range:
• Fertilisers on lawn growth: A-D
• Slug pellets on lettuce yield: D-I
• Effect of soil salt on plants (eg roadside spp): J-O
• Coffee on human blood pressure: P-S
• Dietary lead shot on ducks (measure their blood [Pb]): T-V
• Sleep deprivation on human reaction time: W-Z

Controlling extraneous variation Sometimes you know that there is inherent variability in your sample population (more -> less fertile soils, smaller -> larger animals, poorer -> richer patients), and that this reduces your statistical power (= ability to discern an effect of the treatment when compared against unexplained random noise in your data). This leads on to randomised block designs. You identify 2+ sections (areas, size categories) and impose the balanced randomised design equally in all of them. [Figure: blocks on rich soil, medium soil and poor soil]
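A minimal sketch of laying out a randomised block design, assuming each block receives every treatment the same number of times and only the order within a block is randomised. The function name and the treatment labels are hypothetical.

```python
import random

def randomised_block_design(blocks, treatments, reps_per_block=1, seed=None):
    """Return {block: [treatment for each plot, in randomised order]}.
    Every block gets the same balanced set of treatments."""
    rng = random.Random(seed)
    layout = {}
    for block in blocks:
        plots = list(treatments) * reps_per_block   # balanced within the block
        rng.shuffle(plots)                          # randomised within the block
        layout[block] = plots
    return layout

layout = randomised_block_design(
    blocks=["rich soil", "medium soil", "poor soil"],
    treatments=["control", "low dose", "high dose"],   # illustrative treatments
    reps_per_block=2,
    seed=0,
)
for block, plots in layout.items():
    print(block, plots)
```

Because each block contains the full set of treatments, differences between blocks (rich versus poor soil) can be separated from differences between treatments in the analysis.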

Latin squares A Latin square is a square layout experiment where all rows and all columns have the same treatments, but in a randomised order. A lot like a sudoku! Note that any permutation of this in which whole rows or whole columns are moved en bloc will remain a valid Latin square. (Think about it!) Eg: [Figure: example Latin square]
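The row and column permutation property gives a simple way to build a randomised square: start from a cyclic square and shuffle whole rows, then whole columns. This is a sketch only; it does not sample uniformly from all possible Latin squares, and the function names are hypothetical.

```python
import random

def latin_square(treatments, seed=None):
    """Build a Latin square by shuffling the rows and columns of a cyclic square."""
    rng = random.Random(seed)
    n = len(treatments)
    # Cyclic starting square: each row is the previous one shifted by one place.
    square = [[treatments[(r + c) % n] for c in range(n)] for r in range(n)]
    rng.shuffle(square)                      # move whole rows en bloc
    col_order = list(range(n))
    rng.shuffle(col_order)                   # move whole columns en bloc
    return [[row[c] for c in col_order] for row in square]

def is_latin_square(square):
    """True if every row and every column contains each symbol exactly once."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({row[c] for row in square} == symbols for c in range(n))
    return len(symbols) == n and rows_ok and cols_ok

for row in latin_square(["A", "B", "C", "D"], seed=3):
    print(" ".join(row))
```

Running `is_latin_square` on the result is a quick check that the shuffling has not broken the layout.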

Factorial designs: • These are the most powerful designs available, and allow one to probe the effects of multiple treatments and their interactions. • The essence of these is that whatever treatments are imposed, all combinations of treatments are imposed with the same level of replication. • You want to study fertiliser and pesticide on plant yield. The basic design is +/- fertiliser and +/- pesticide, each replicated (say) 4 times. This design is 2*2*4 reps = 16 plots. How many plots are needed for 3*3 with 5 reps?
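A sketch of how the plot list for a factorial design could be enumerated: every combination of factor levels, each with the same number of replicates. The function name and the dictionary layout are assumptions made for illustration.

```python
from itertools import product

def factorial_design(factors, reps):
    """factors maps factor name -> list of levels.
    Returns one dict per plot: every combination of levels, 'reps' times each."""
    names = list(factors)
    plots = []
    for combo in product(*(factors[name] for name in names)):
        for rep in range(1, reps + 1):
            plot = dict(zip(names, combo))
            plot["rep"] = rep
            plots.append(plot)
    return plots

design = factorial_design({"fertiliser": ["-", "+"], "pesticide": ["-", "+"]}, reps=4)
print(len(design))   # number of plots = product of the level counts * reps
```

Calling `len()` on the result for three levels of each factor and 5 reps checks your answer to the question above. In the field the plots would still need to be assigned to positions at random.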

Surveys: Random, stratified random, and transects To survey populations in a site you go out (having completed the H&S paperwork!) and collect samples. But what design?
Ad-hoc: grabbing material as is convenient. +: Easy, quick, OK for a preliminary scoping exercise. -: essentially unusable as the method cannot readily be repeated.
Random: Set up a coordinate system and sample from random coordinates. +: utterly objective and unbiased, readily repeated. -: you may well miss sampling from strata or sites that really interest you.

Stratified random: Here you identify strata within your site (could be specific habitat zones, or plants of a particular height range, or any other convenient subdivision), then sample randomly with equal sampling effort in each stratum. (Remember Balance?) +: This is an excellent compromise, allowing directed randomness! You select the variation which you believe matters, but still maintain some objectivity. -: no big drawbacks, but the choice of strata is subjective and can bias the outcome.
Transects: Here you draw a line along a region of variation that interests you and sample at regular intervals. +: This is the best tool to quantify spatial variation such as edge effects, oxygen deficit curves etc. -: successive points along a transect are not independent of each other, and the correct way to handle the data varies according to what question you are asking. Formally 1 transect = 1 observation, however many points it contains.
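Stratified random sampling can be sketched as random coordinates drawn separately, and in equal numbers, within each stratum. The code assumes each stratum can be described by a rectangular bounding box in site coordinates, which is a simplification; the stratum names and dimensions are hypothetical.

```python
import random

def stratified_random_sample(strata, samples_per_stratum, seed=None):
    """strata maps stratum name -> (x_min, x_max, y_min, y_max), in metres.
    Returns {stratum: [(x, y), ...]} with equal sampling effort per stratum."""
    rng = random.Random(seed)
    points = {}
    for name, (x_min, x_max, y_min, y_max) in strata.items():
        points[name] = [
            (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))
            for _ in range(samples_per_stratum)
        ]
    return points

site = {                                   # hypothetical habitat zones
    "woodland edge": (0, 20, 0, 50),
    "open grassland": (20, 80, 0, 50),
    "pond margin": (80, 100, 0, 50),
}
for stratum, coords in stratified_random_sample(site, 5, seed=2).items():
    print(stratum, [(round(x, 1), round(y, 1)) for x, y in coords])
```

The number of samples per stratum is the same regardless of how large the stratum is, which keeps the design balanced.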

You design a survey on one of the following topics. The survey should include balance and replication. Your topic is set by your Christian name range:
• Distribution of gorilla nests in relation to habitat features: A-D
• Attitudes to conservation in two contrasting populations: D-I
• Levels of damage caused by crop-raiding animals: J-O
• Gibbon population densities in two contrasting habitats: P-S
• Aquatic pollution downstream of a silage leak: T-Z

Replication and pseudo-replication The horror stories come when the design and the hypothesis don’t match up. This is easier than you might think – here we meet the demon of pseudoreplication. When you quote an F value you must specify its DFs. Always, the more DFs you have the lower your F value needs to be in order to be p<0.05. Thus F3,2 = 6.0 is non-significant but F3,20 = 6.0 is clearly significant (p<0.01). Pseudoreplication is the error created by getting these DFs wrong. The F value is OK (usually!), but its significance is incorrect.
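To see how strongly the significance of one F value depends on its DFs, the upper-tail probability of the F distribution can be computed directly. The sketch below uses only the standard library and integrates the F density numerically with Simpson's rule; `f_upper_tail` is a hypothetical helper, and in practice a statistics package would normally do this for you.

```python
import math

def f_upper_tail(f_value, df1, df2, steps=2000):
    """P(F >= f_value) for an F(df1, df2) distribution.
    Integrates the density from 0 to f_value by Simpson's rule (steps must be
    even), after substituting x = u**2 so the integrand is finite at zero."""
    log_const = (
        math.lgamma((df1 + df2) / 2) - math.lgamma(df1 / 2) - math.lgamma(df2 / 2)
        + (df1 / 2) * math.log(df1 / df2)
    )
    const = math.exp(log_const)

    def integrand(u):
        # F density at x = u**2, multiplied by dx/du = 2u
        return 2 * const * u ** (df1 - 1) * (1 + df1 * u * u / df2) ** (-(df1 + df2) / 2)

    upper = math.sqrt(f_value)
    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return 1.0 - total * h / 3

for df2 in (2, 20):
    print(f"F(3,{df2}) = 6.0  ->  p = {f_upper_tail(6.0, 3, df2):.4f}")
```

The same F value of 6.0 comes out well above 0.05 with 2 error DFs and well below it with 20, which is why claiming DFs you do not really have changes the conclusion.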

Yawn…but hang on a minute! Pseudoreplication led to one lecturer here having to spend 18 months of his life re-writing his entire PhD thesis. And re-plot his graphs!! Pseudoreplication led to another colleague (Ian Rothertham) having to re-do bits of his thesis. As an editor of a research journal, the commonest statistical reason I have for sending papers back to be re-analysed is pseudoreplication. This is a serious statistical sin! It’s rather equivalent to lying on oath about your DFs.

Independence Formally, the key problem comes down to the notion of independence. One of the assumptions of ANY statistical test is that the observations being presented may be assumed to be independent of each other. Re-offering the same data over and over again does not reinforce its value. I can measure your blood pressures 100 times each and enter these 200 datapoints into Excel. I can do an ANOVA showing that the drug makes a big difference - or can I? [Figure: Person A + Drug, Person B - Drug]

Fertile sources of pseudo-replication:
• Time series. (Number 1 offender). 1000 repeat observations = 1 DF!
• Repeated re-sampling from one site to test a general hypothesis. (Ubiquitous in field ecology, unavoidable).
• Leaves within 1 tree: these are independent tests on H0: no difference BETWEEN 2 OR MORE NAMED INDIVIDUAL TREES, but for H0: response to treatment (eg fertiliser treatment) 1 tree = 1 DF, and you MUST use one value (usually a mean) per tree.
• Plants in pots are NOT independent estimates of a treatment applied to that pot.
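The "one value per tree" rule can be sketched as a small aggregation step carried out before any analysis. The example assumes a one-way design in which the error DF is the number of values minus the number of treatments; the data, tree labels and function names are invented purely for illustration.

```python
import statistics

def collapse_to_units(measurements):
    """measurements: iterable of (unit_id, treatment, value).
    Returns {treatment: [one mean per experimental unit]}."""
    by_unit = {}
    for unit_id, treatment, value in measurements:
        by_unit.setdefault((unit_id, treatment), []).append(value)
    by_treatment = {}
    for (unit_id, treatment), values in by_unit.items():
        by_treatment.setdefault(treatment, []).append(statistics.mean(values))
    return by_treatment

def error_df(groups):
    """Error degrees of freedom for a one-way ANOVA: N values minus k groups."""
    return sum(len(values) for values in groups.values()) - len(groups)

# Invented data: 4 trees (2 fertilised, 2 control), 50 leaves measured on each.
trees = {"T1": "fertilised", "T2": "fertilised", "T3": "control", "T4": "control"}
leaves = [(tree, trt, 10.0 + 0.01 * leaf)
          for tree, trt in trees.items() for leaf in range(50)]

pseudo = {}
for _, trt, value in leaves:               # every leaf treated as a replicate
    pseudo.setdefault(trt, []).append(value)

print("pseudo-replicated error DF:", error_df(pseudo))
print("correct error DF:", error_df(collapse_to_units(leaves)))
```

With one person per treatment, as in the blood pressure example, the same calculation leaves no error DF at all, however many times each person is measured.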

Pseudo-replication and leopard growls An example of how science progresses furthest by ever-deeper cynicism. The story is as follows: When some monkeys hear a leopard growl they emit a distinct alarm call. In other cases they make a noise, but this is a more general noise that they make at other times too. So the experiment is to record monkeys’ vocalisations in the field, and play them a leopard growl. Is that an experiment? How about a control?? So there was a control loud noise (a tree falling over), or a taped leopard growl, and the confirmation of a difference between the two was taken as evidence of a general response to a leopard’s call. Actually, as a referee pointed out while rejecting the paper, this is pseudo-replicated for H1: a general response to leopards. You have shown a response to THAT ONE TAPE RECORDING. To generalise you must have multiple leopard tapes.
