How to do Experiments: Empirical Methods for AI & CS

Paul Cohen (cohen@cs.umass.edu), Ian P. Gent (ipg@dcs.st-and.ac.uk), Toby Walsh (tw@4c.ucc.ie)

Empirical Methods for CS Can you do Empirical AI?

Can you do empirical AI? • See if you can spot a pattern in the following real empirical data • (326 dp (T 1 0)) • (327 dp (T 1 0)) • (328 dp (T 1 0)) • (329 dp (T 1 0)) • (330 dp (T 1 0)) • (331 dp (T 2 1)) • (332 dp (T 1 0)) • (333 dp (T 1 0)) • (334 dp (T 3 2)) • (335 dp (T 350163776 62)) • This led to an Artificial Intelligence journal paper • Gent & Walsh, “Easy Problems are Sometimes Hard”, 1994

Experiments are Harder than you think! • That pattern was pretty easy to spot but… • To see the pattern you have to not • kill the experiment in the middle of its run • assuming that the pipe to the output had got lost! • That’s what I did, but fortunately the effect occurred again. • that instance took a week to run

Overview of Tutorial • Can you do Empirical AI? yes! • Experiments are Harder than you think! • What are empirical methods? • Experiment design • Some Problem Issues • Data analysis & Hypothesis Testing • Summary

Supplementary Material • How not to do it • Case Study: Gregory, Gao, Rosenberg & Cohen • Eight Basic Lessons • The t-test • Randomization

Our objectives • Outline some of the basic issues • exploration, experimental design, data analysis, ... • Encourage you to consider some of the pitfalls • we have fallen into all of them! • Raise standards • encouraging debate • identifying “best practice” • Learn from your & our experiences • experimenters get better as they get older!

Empirical Methods for CS Experiments are Harder than you think!

Experiments are Harder than you think! • Flawed problems: • A case study from Constraint Satisfaction • 40+ experimental papers over 5 years • papers on the nature of hard random CSPs • Authors include … (in alphabetical order!) • Fahiem Bacchus, Christian Bessiere, Rina Dechter, Gene Freuder, Ian Gent, Pedro Meseguer, Patrick Prosser, Barbara Smith, Edward Tsang, Toby Walsh, and many more • Achlioptas et al. spotted a flaw • asymptotically almost all problems are trivial • brings into doubt many experimental results • some experiments at typical sizes affected • fortunately not many

Flawed random constraints? • e.g. “Model B”, domain size d. Parameters p1 and p2 • Pick exactly p1·C constraints (if there are C possible) • For each one • pick exactly p2·d² pairs of values as disallowed • e.g. d=3, p2=4/9 • Constraints C1 & C2 • C2 is flawed • it makes X=2 impossible • For any p2 ≥ 1/d, p1 > 0 • as n → ∞, there will always be one variable with all its values removed • asymptotically, all problems are trivial!
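The Model B recipe and the flaw it permits can be sketched in a few lines of Python. This is an illustrative sketch, not the generator used in the papers: the function names (`model_b`, `flawed_values`, `flawed_variables`), the dictionary representation, and the use of `round` to get whole numbers of constraints and disallowed pairs are all assumptions made for the example.

```python
import itertools
import random

def model_b(n, d, p1, p2, rng):
    """Model B sketch: n variables, domain size d.

    Returns {(i, j): set of disallowed (a, b) value pairs} with i < j.
    Exactly round(p1 * C) of the C possible binary constraints are chosen,
    each with exactly round(p2 * d * d) disallowed pairs.
    """
    scopes = list(itertools.combinations(range(n), 2))
    pairs = list(itertools.product(range(d), repeat=2))
    n_constraints = round(p1 * len(scopes))
    n_nogoods = round(p2 * d * d)
    return {scope: set(rng.sample(pairs, n_nogoods))
            for scope in rng.sample(scopes, n_constraints)}

def flawed_values(csp, d):
    """(variable, value) pairs that some constraint makes impossible."""
    flawed = set()
    for (i, j), nogoods in csp.items():
        for a in range(d):
            if all((a, b) in nogoods for b in range(d)):
                flawed.add((i, a))   # value a of variable i has no support
            if all((b, a) in nogoods for b in range(d)):
                flawed.add((j, a))   # value a of variable j has no support
    return flawed

def flawed_variables(csp, n, d):
    """Variables whose every value is flawed, so the problem is trivially unsolvable."""
    flawed = flawed_values(csp, d)
    return [v for v in range(n) if all((v, a) in flawed for a in range(d))]

# The slide's example: d=3, p2=4/9 gives 4 disallowed pairs, enough to
# wipe out a whole row of 3 and so make one value impossible.
example = {(0, 1): {(2, 0), (2, 1), (2, 2), (0, 0)}}
print(flawed_values(example, 3))
```

Counting how often `flawed_variables` is non-empty as n grows, for fixed p1 and p2 ≥ 1/d, is one way to watch the asymptotic flaw appear empirically.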

Flawless random problems • [Gent et al.] fix flaw ... • introduce “flawless” model B • choose d squares which must always be allowed • all in different rows & columns • choose p2·d² X’s to disallow in other squares • For model B, I proved that these problems are not flawed asymptotically • any p2 < ½ • so we think that we understand how to generate random problems
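A single “flawless” constraint, as described above, might be generated as follows. This is a sketch under assumptions: the function name `flawless_constraint` is hypothetical, the protected squares are chosen as a random permutation (one way to get d squares in different rows and columns), and `round` is used to get a whole number of disallowed pairs.

```python
import random

def flawless_constraint(d, p2, rng):
    """Return a set of disallowed (a, b) pairs for one 'flawless' constraint.

    d protected squares, one per row and per column, are never disallowed,
    so every value of both variables keeps at least one support.
    """
    perm = list(range(d))
    rng.shuffle(perm)
    protected = {(a, perm[a]) for a in range(d)}   # different rows & columns
    candidates = [(a, b) for a in range(d) for b in range(d)
                  if (a, b) not in protected]
    n_nogoods = round(p2 * d * d)
    if n_nogoods > len(candidates):
        raise ValueError("p2 too large: not enough unprotected squares")
    return set(rng.sample(candidates, n_nogoods))

# Same parameters as the flawed example: d=3, p2=4/9.
nogoods = flawless_constraint(3, 4 / 9, random.Random(1))
print(sorted(nogoods))
```

By construction no row or column of the d×d grid can be entirely disallowed, which is exactly the property the flawed Model B constraint lacked.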

But it wasn’t that simple… • Originally we had two different definitions of “flawless” problems • An undergraduate student showed they were inequivalent! • after a paper about it had appeared on the web • (the journal paper in the reference that follows is correct)

Experiments are harder than you think! • This tripped up all constraints researchers who thought about it • It concerned the most fundamental part of the experiments • i.e. generating the input data • closely analogous flaw has turned up in SAT and in QBF • The flaw was not found by constraints researchers • fruitful (in the end!) interaction between theory and experiment • experimental method justified theoretically • Even the fix was wrong at first • Most experiments still use “flawed” models • (which is ok if you know what you’re doing: • if you make a positive decision with a good reason )

Further reading • D. Achlioptas, L.M. Kirousis, E. Kranakis, D. Krizanc, M. Molloy, and Y. Stamatiou, Random Constraint Satisfaction: A More Accurate Picture, Constraints, 6 (4), (2001), pp. 329-344. • I.P. Gent, E. MacIntyre, P. Prosser, B.M. Smith and T. Walsh, Random Constraint Satisfaction: Flaws and Structures, Constraints, 6 (4), (2001), pp. 345-372. • Coincidence of title and publication details not at all coincidental

Empirical Methods for CS What are Empirical Methods?

What does “empirical” mean? • Relying on observations, data, experiments • Empirical work should complement theoretical work • Theories often have holes (e.g., How big is the constant term? Is the current problem a “bad” one?) • Theories are suggested by observations • Theories are tested by observations • Conversely, theories direct our empirical attention • In addition (in this tutorial at least) empirical means “wanting to understand behavior of complex systems”

Why We Need Empirical Methods • Cohen, 1990: Survey of 150 AAAI Papers • Roughly 60% of the papers gave no evidence that the work they described had been tried on more than a single example problem. • Roughly 80% of the papers made no attempt to explain performance, to tell us why it was good or bad and under which conditions it might be better or worse. • Only 16% of the papers offered anything that might be interpreted as a question or a hypothesis. • Theory papers generally had no applications or empirical work to support them, empirical papers were demonstrations, not experiments, and had no underlying theoretical support. • The essential synergy between theory and empirical work was missing

Theory, not Theorems • Theory based science need not be all theorems • otherwise science would be mathematics • Consider theory of QED • based on a model of behaviour of particles • predictions accurate to 10 decimal places • (distance from LA to NY to within 1 human hair) • most accurate theory in the whole of science? • success derived from accuracy of predictions • not the depth or difficulty or beauty of theorems • QED is an empirical theory!

Empirical CS/AI • Computer programs are formal objects • so let’s reason about them entirely formally? • Two reasons why we can’t or won’t: • theorems are hard • some questions are empirical in nature • e.g. are Horn clauses adequate to represent the sort of knowledge met in practice? • e.g. even though our problem is intractable in general, are the instances met in practice easy to solve?

Empirical CS/AI • Treat computer programs as natural objects • like fundamental particles, chemicals, living organisms • Build (approximate) theories about them • construct hypotheses • e.g. greedy hill-climbing is important to GSAT • test with empirical experiments • e.g. compare GSAT with other types of hill-climbing • refine hypotheses and modelling assumptions • e.g. greediness not important, but hill-climbing is!

Empirical CS/AI • Many advantages over other sciences • Cost • no need for expensive super-colliders • Control • unlike the real world, we often have complete command of the experiment • Reproducibility • in theory, computers are entirely deterministic • Ethics • no ethics panels needed before you run experiments

Types of hypothesis • My search program is better than yours • not very helpful: beauty competition? • Search cost grows exponentially with number of variables for this kind of problem • better, as we can extrapolate to data not yet seen? • Constraint systems are better at handling over-constrained systems, but OR systems are better at handling under-constrained systems • even better, as we can extrapolate to new situations?
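To illustrate why the second kind of hypothesis lets us extrapolate, here is a minimal sketch that fits cost ≈ c·bⁿ by least squares on log(cost) and predicts the cost at a size not yet measured. The function names and the synthetic data are assumptions for the example; real search costs are noisy and the fit should be checked before trusting any extrapolation.

```python
import math

def fit_exponential(ns, costs):
    """Least-squares fit of log(cost) = log(c) + n * log(b); returns (c, b)."""
    logs = [math.log(y) for y in costs]
    n_mean = sum(ns) / len(ns)
    l_mean = sum(logs) / len(logs)
    sxx = sum((x - n_mean) ** 2 for x in ns)
    sxy = sum((x - n_mean) * (l - l_mean) for x, l in zip(ns, logs))
    slope = sxy / sxx
    intercept = l_mean - slope * n_mean
    return math.exp(intercept), math.exp(slope)

def predict(c, b, n):
    """Predicted cost at problem size n under the fitted model."""
    return c * b ** n

# Synthetic, noise-free data (an assumption): cost = 3 * 2^n.
sizes = [10, 12, 14, 16]
costs = [3 * 2 ** n for n in sizes]
c, b = fit_exponential(sizes, costs)
print(b, predict(c, b, 20))
```

A hypothesis of the form “my program is better than yours” offers nothing comparable to fit or to extrapolate from.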

A typical conference conversation • What are you up to these days? • I’m running an experiment to compare the MAC-CBJ algorithm with Forward Checking. • Why? • I want to know which is faster. • Why? • Lots of people use each of these algorithms. • How will these people use your result? ...

Keep in mind the BIG picture • What are you up to these days? • I’m running an experiment to compare the MAC-CBJ algorithm with Forward Checking. • Why? • I have this hypothesis that neither will dominate. • What use is this? • A portfolio containing both algorithms will be more robust than either algorithm on its own.

Keep in mind the BIG picture ... • Why are you doing this? • Because many real problems are intractable in theory but need to be solved in practice. • How does your experiment help? • It helps us understand the difference between average and worst case results. • So why is this interesting? • Intractability is one of the BIG open questions in CS!

Why is empirical CS/AI in vogue? • Inadequacies of theoretical analysis • problems often aren’t as hard in practice as theory predicts in the worst-case • average-case analysis is very hard (and often based on questionable assumptions) • Some “spectacular” successes • phase transition behaviour • local search methods • theory lagging behind algorithm design

Why is empirical CS/AI in vogue? • Compute power ever increasing • even “intractable” problems coming into range • easy to perform large (and sometimes meaningful) experiments • Empirical CS/AI perceived to be “easier” than theoretical CS/AI • often a false perception as experiments easier to mess up than proofs • experiments are harder than you think!

Empirical Methods for CS Experiment design

Experimental Life Cycle • Exploration • Hypothesis construction • Experiment • Data analysis • Drawing of conclusions

Checklist for experiment design* • Consider the experimental procedure • making it explicit helps to identify spurious effects and sampling biases • Consider a sample data table • identifies what results need to be collected • clarifies dependent and independent variables • shows whether data pertain to hypothesis • Consider an example of the data analysis • helps you to avoid collecting too little or too much data • especially important when looking for interactions • *From Chapter 3, “Empirical Methods for Artificial Intelligence”, Paul Cohen, MIT Press

Guidelines for experiment design • Consider possible results and their interpretation • may show that experiment cannot support/refute hypotheses under test • unforeseen outcomes may suggest new hypotheses • What was the question again? • easy to get carried away designing an experiment and lose the BIG picture • Run a pilot experiment to calibrate parameters (e.g., number of processors in Rosenberg experiment)

Types of experiment • Manipulation experiment • Observation experiment • Factorial experiment

Manipulation experiment • Independent variable, x • x=identity of parser, size of dictionary, … • Dependent variable, y • y=accuracy, speed, … • Hypothesis • x influences y • Manipulation experiment • change x, record y

Observation experiment • Predictor, x • x=volatility of stock prices, … • Response variable, y • y=fund performance, … • Hypothesis • x influences y • Observation experiment • classify according to x, compute y

Factorial experiment • Several independent variables, xi • there may be no simple causal links • data may come that way e.g. individuals will have different sexes, ages, ... • Factorial experiment • every possible combination of xi considered • expensive as its name suggests!

Designing factorial experiments • In general, stick to 2 to 3 independent variables • Solve same set of problems in each case • reduces variance due to differences between problem sets • If this not possible, use same sample sizes • simplifies statistical analysis • As usual, default hypothesis is that no influence exists • much easier to fail to demonstrate influence than to demonstrate an influence
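A full factorial design that solves the same problem set in every cell can be sketched as below. Everything here is illustrative: `run_factorial`, the factor names, and the toy `measure` function are assumptions standing in for a real solver and real measurements.

```python
import itertools

def run_factorial(factors, problems, measure):
    """Run every combination of factor levels on the SAME set of problems.

    factors:  dict mapping factor name -> list of levels
    problems: list of problem instances, reused in every cell
    measure:  function (config dict, problem) -> numeric result
    Returns {tuple of levels: list of results, one per problem}.
    """
    names = list(factors)
    results = {}
    for levels in itertools.product(*(factors[name] for name in names)):
        config = dict(zip(names, levels))
        results[levels] = [measure(config, p) for p in problems]
    return results

# Toy stand-in for a real measurement (an assumption, not real data).
def toy_measure(config, problem):
    return problem * config["restarts"] + (1 if config["heuristic"] == "a" else 2)

factors = {"heuristic": ["a", "b"], "restarts": [0, 1, 2]}
problems = [5, 7, 11]
table = run_factorial(factors, problems, toy_measure)
print(len(table), "cells,", len(problems), "observations each")
```

The number of cells is the product of the numbers of levels, which is why the slides advise keeping to two or three independent variables.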

Empirical Methods for CS Some Problem Issues

Some problem issues • Control • Ceiling and Floor effects • Sampling Biases

Control • A control is an experiment in which the hypothesised variation does not occur • so the hypothesised effect should not occur either • BUT remember • placebos cure a large percentage of patients!

Control: a cautionary tale • Macaque monkeys given vaccine based on human T-cells infected with SIV (relative of HIV) • macaques gained immunity from SIV • Later, macaques given uninfected human T-cells • and macaques still gained immunity! • Control experiment not originally done • and not always obvious (you can’t control for all variables)

Ceiling and Floor Effects • Well designed experiments (with good controls) can still go wrong • What if all our algorithms do particularly well? • Or they all do badly? • We’ve got little evidence to choose between them

Ceiling and Floor Effects • Ceiling effects arise when test problems are insufficiently challenging • floor effects the opposite, when problems too challenging • A problem in AI because we often repeatedly use the same benchmark sets • most benchmarks will lose their challenge eventually? • but how do we detect this effect?

Machine learning example • 14 datasets from UCI corpus of benchmarks • used as mainstay of ML community • Problem is learning classification rules • each item is vector of features and a classification • measure classification accuracy of method (max 100%) • Compare C4 with 1R*, two competing algorithms • Rob Holte, Machine Learning, vol. 11, pp. 63-91, 1993 • www.site.uottawa.edu/~holte/Publications/simple_rules.ps

Floor effects: machine learning example
DataSet:  BC    CH    GL    G2    HD    HE    ...  Mean
C4        72    99.2  63.2  74.3  73.6  81.2  ...  85.9
1R*       72.5  69.2  56.4  77    78    85.1  ...  83.8
• Is 1R* above the floor of performance? • How would we tell?

Floor effects: machine learning example
DataSet:  BC    CH    GL    G2    HD    HE    ...  Mean
C4        72    99.2  63.2  74.3  73.6  81.2  ...  85.9
1R*       72.5  69.2  56.4  77    78    85.1  ...  83.8
Baseline  70.3  52.2  35.5  53.4  54.5  79.4  ...  59.9
• “Baseline rule” puts all items in more popular category. • 1R* is above baseline on most datasets • A bit like the prime number joke? • 1 is prime. 3 is prime. 5 is prime. So, baseline rule is that all odd numbers are prime.
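The baseline rule is simple enough to state in code, which makes the floor check concrete. The sketch below computes the baseline accuracy from a list of class labels and then compares 1R* against the baseline using only the six dataset columns shown on the slide; the function name `baseline_accuracy` and the toy label list are assumptions for illustration.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy (in %) of always predicting the most popular category."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return 100.0 * majority_count / len(labels)

# Toy labels (an assumption): 7 of 10 items are in the popular category.
print(baseline_accuracy(["pos"] * 7 + ["neg"] * 3))

# Accuracies copied from the slide; only the six visible columns.
one_r = {"BC": 72.5, "CH": 69.2, "GL": 56.4, "G2": 77.0, "HD": 78.0, "HE": 85.1}
baseline = {"BC": 70.3, "CH": 52.2, "GL": 35.5, "G2": 53.4, "HD": 54.5, "HE": 79.4}

# Margin of 1R* over the floor on each visible dataset.
margin = {name: round(one_r[name] - baseline[name], 1) for name in one_r}
above_floor = [name for name, m in margin.items() if m > 0]
print(margin)
```

A small margin, as on BC, signals that the dataset leaves little room between the floor and the methods being compared.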

Ceiling Effects: machine learning
DataSet:  BC    GL    HY    LY    MU     ...  Mean
C4        72    63.2  99.1  77.5  100.0  ...  85.9
1R*       72.5  56.4  97.2  70.7  98.4   ...  83.8
• How do we know that C4 and 1R* are not near the ceiling of performance? • Do the datasets have enough attributes to make perfect classification? • Obviously for MU, but what about the rest?

Ceiling Effects: machine learning
DataSet:         BC    GL    HY    LY    MU     ...  Mean
C4               72    63.2  99.1  77.5  100.0  ...  85.9
1R*              72.5  56.4  97.2  70.7  98.4   ...  83.8
max(C4,1R*)      72.5  63.2  99.1  77.5  100.0  ...  87.4
max([Buntine])   72.8  60.4  99.1  66.0  98.6   ...  82.0
• C4 achieves only about 2% better than 1R* • Best of the C4/1R* achieves 87.4% accuracy • We have only weak evidence that C4 is better • Both methods appear to be performing near the ceiling of what is possible, so comparison is hard!
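The max(C4,1R*) row and its distance from an estimated ceiling can be recomputed from the visible columns. This sketch uses only the five dataset columns shown on the slide, so it cannot reproduce the 14-dataset means. Treating the max([Buntine]) row as an independent estimate of achievable accuracy is our reading of the slide, not something it states.

```python
# Accuracies copied from the slide; only the five visible columns.
datasets = ["BC", "GL", "HY", "LY", "MU"]
c4 = [72.0, 63.2, 99.1, 77.5, 100.0]
one_r = [72.5, 56.4, 97.2, 70.7, 98.4]
buntine_max = [72.8, 60.4, 99.1, 66.0, 98.6]

def best_of(*rows):
    """Element-wise maximum across several rows of accuracies."""
    return [max(column) for column in zip(*rows)]

best = best_of(c4, one_r)

# Gap between the best of C4/1R* and the reference row, per dataset.
# Values near zero or above suggest the methods are already at the ceiling.
gap = [round(b - ref, 1) for b, ref in zip(best, buntine_max)]

for name, b, g in zip(datasets, best, gap):
    print(f"{name}: best={b} gap_to_reference={g}")
```

When both methods sit this close to the reference row, a 2% difference between them says little about which method is better.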

Ceiling Effects: machine learning • In fact 1R* only uses one feature (the best one) • C4 uses on average 6.6 features • 5.6 features buy only about 2% improvement • Conclusion? • Either real world learning problems are easy (use 1R*) • Or we need more challenging datasets • We need to be aware of ceiling effects in results

Sampling bias • Data collection is biased against certain data • e.g. teacher who says “Girls don’t answer maths questions” • observation might suggest: • not that girls don’t answer many questions • but that the teacher doesn’t ask them many questions • Experienced AI researchers don’t do that, right?

Sampling bias: Phoenix case study • AI system to fight (simulated) forest fires • Experiments suggest that wind speed uncorrelated with time to put out fire • obviously incorrect as high winds spread forest fires
