Artificial Intelligence Empirical Evaluation of AI Systems Ian Gent email@example.com
Artificial Intelligence Empirical Evaluation of Computer Systems • Part I: Philosophy of Science • Part II: Experiments in AI • Part III: Basics of Experimental Design with AI case studies
Science as Refutation • Modern view of the progress of Science based on Popper. (Sir Karl Popper, that is) • A scientific theory is one that can be refuted • I.e. it should make testable predictions • If these predictions are incorrect, the theory is false • theory may still be useful, e.g. Newtonian physics • Therefore science is hypothesis testing • Artificial intelligence aspires to be a science
Empirical Science • Empirical = “Relying upon or derived from observation or experiment” • Most (all) of Science is empirical. • Consider theoretical computer science • study based on Turing machines, lambda calculus, etc • Founded on empirical observation that computer systems developed to date are Turing-complete • Quantum computers might challenge this • if so, an empirically based theory of quantum computing will develop
Theory, not Theorems • Theory based science need not be all theorems • otherwise science would be Mathematics • Compare Physics theory “QED” • most accurate theory in the whole of science? • based on a model of behaviour of particles • predictions accurate to many decimal places (9?) • success derived from accuracy of predictions • not the depth or difficulty or beauty of theorems • I.e. QED is an empirical theory • AI/CS has too many theorems and not enough theory • compare advice on how to publish in JACM
Empirical CS/AI • Computer programs are formal objects • so some use only theory that can be proved by theorems • but theorems are hard • Treat computer programs as natural objects • like quantum particles, chemicals, living objects • perform empirical experiments • We have a huge advantage over other sciences • no need for supercolliders (expensive) or animal experiments (ethical problems) • we should have complete command of experiments
What are our hypotheses? • My search program is better than yours • Search cost grows exponentially with number of variables for this kind of problem • Constraint search systems are better at handling overconstrained systems, but OR systems are better at handling underconstrained systems • My company should buy an AI search system rather than an OR one
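A hypothesis such as "search cost grows exponentially with number of variables" makes a testable prediction: the logarithm of cost should be roughly linear in the number of variables. The sketch below is one minimal way to check that, assuming cost is measured as nodes searched. The measurements are made up purely for illustration, and `fit_line` is a hypothetical helper, not part of any library.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical measurements: (number of variables, nodes searched).
# These numbers are invented for illustration only.
measurements = [(10, 130), (20, 1_900), (30, 31_000), (40, 520_000), (50, 8_100_000)]

ns = [n for n, _ in measurements]
log_costs = [math.log2(cost) for _, cost in measurements]
intercept, slope = fit_line(ns, log_costs)

# If cost ~ 2**(slope * n), log2(cost) is linear in n and the residuals
# from the fitted line should be small with no systematic curvature.
growth_per_variable = 2 ** slope
residuals = [lc - (intercept + slope * n) for n, lc in zip(ns, log_costs)]

print(f"cost multiplies by about {growth_per_variable:.2f} per extra variable")
print("residuals (log2 units):", [round(r, 2) for r in residuals])
```

Large or curved residuals would count against the exponential hypothesis, which is what makes it refutable in the sense of the earlier slides.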
Why do experiments? • Too often AI experimenters might talk like this: • What is your experiment for? • Is my algorithm better than his? • Why? • I want to know which is faster • Why? • Lots of people use each kind … • How will these people use your result? • ?
Why do experiments? • Compare experiments on identical twins: • What is your experiment for? • I want to find out whether twins reared apart differ from those reared together, and nonidentical twins too. • Why? • We can get estimates of the genetic and social contributors to performance • Why? • Because the role of genetics in behavior is one of the great unsolved questions. • Experiments should address research questions • otherwise they can just be “track meets”
Basic issues in Experimental Design • From Paul R Cohen, Empirical Methods for Artificial Intelligence, MIT Press, 1995, Chapter 3 • Control • Ceiling and Floor effects • Sampling Biases
Control • A control is an experiment in which the hypothesised variation does not occur • so the hypothesised effect should not occur either • e.g. Macaque monkeys given vaccine based on human T-cells infected with SIV (relative of HIV) • macaques gained immunity from SIV • Later, macaques given uninfected human T-cells • and macaques still gained immunity! • Control experiment not originally done • and not always obvious (you can’t control for all variables)
Case Study: MYCIN • MYCIN was a medical expert system • recommended therapy for blood/meningitis infections • How to evaluate its recommendations? • Shortliffe used • 10 sample problems • 8 other therapy recommenders • 5 faculty at Stanford Med. School, 1 senior resident, 1 senior postdoctoral researcher, 1 senior student • 8 impartial judges gave 1 point per problem • Max score was 80 • MYCIN: 65, Faculty: 40-60, Fellow: 60, Resident: 45, Student: 30
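The scoring arithmetic on this slide can be sketched as follows. The counts and scores are taken from the slide; the faculty entry is kept as a range because the slide reports only 40-60 for the five faculty members. The names `scores` and `as_fraction` are illustrative.

```python
JUDGES = 8
PROBLEMS = 10
MAX_SCORE = JUDGES * PROBLEMS  # one point per judge per problem

# Scores as reported on the slide, stored as (low, high) ranges.
scores = {
    "MYCIN": (65, 65),
    "Faculty": (40, 60),
    "Fellow": (60, 60),
    "Resident": (45, 45),
    "Student": (30, 30),
}

def as_fraction(score_range):
    """Convert a (low, high) score range to fractions of the maximum."""
    low, high = score_range
    return low / MAX_SCORE, high / MAX_SCORE

fractions = {who: as_fraction(r) for who, r in scores.items()}

for who, (low, high) in fractions.items():
    label = f"{low:.0%}" if low == high else f"{low:.0%}-{high:.0%}"
    print(f"{who:9s} {label}")
```

Expressed this way, MYCIN's score is above the top of the faculty range, which is the comparison the evaluation was designed to make.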
Case Study: MYCIN • What were controls? • Control for judge’s bias for/against computers • judges did not know who recommended each therapy • Control for easy problems • medical student did badly, so problems not easy • Control for our standard being low • e.g. random choice should do worse • Control for factor of interest • e.g. hypothesis in MYCIN that “knowledge is power” • have groups with different levels of knowledge
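The first control above (judges did not know who recommended each therapy) can be sketched as a small blinding step. This is only an illustration of the idea, not a description of how the MYCIN study was actually run. The function name `blind`, the labels `R1`, `R2`, … and the placeholder therapies are all assumptions.

```python
import random

def blind(recommendations, seed=0):
    """Return (anonymised items, secret key) for blind judging.

    recommendations: dict mapping source name -> recommended therapy text.
    Judges see only the anonymised items; the key stays with the
    experimenter until scoring is finished.
    """
    rng = random.Random(seed)
    sources = list(recommendations)
    rng.shuffle(sources)  # presentation order carries no information
    items, key = [], {}
    for i, source in enumerate(sources, start=1):
        label = f"R{i}"
        items.append((label, recommendations[source]))
        key[label] = source
    return items, key

# Hypothetical recommendations for one problem (placeholder text only).
recs = {"MYCIN": "therapy A", "Faculty 1": "therapy B", "Student": "therapy C"}
items, key = blind(recs)

for label, therapy in items:
    print(label, therapy)  # what a judge would see
```

After the judges have scored each label, the experimenter uses `key` to map scores back to sources.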
Ceiling and Floor Effects • Well designed experiments can go wrong • What if all our algorithms do particularly well (or they all do badly)? • We’ve got little evidence to choose between them • Ceiling effects arise when test problems are insufficiently challenging • floor effects the opposite, when problems too challenging • A problem in AI because we often use benchmark sets • But how do we detect the effect?
Ceiling Effects: Machine Learning • 14 datasets from UCI corpus of benchmarks • used as mainstay of ML community • Problem is learning classification rules • each item is vector of features and a classification • measure classification accuracy of method (max 100%) • Compare C4 with 1R*, two competing algorithms:
DataSet   BC     CH     GL     G2     HD     HE     …   Mean
C4        72     99.2   63.2   74.3   73.6   81.2   …   85.9
1R*       72.5   69.2   56.4   77     78     85.1   …   83.8
Ceiling Effects
DataSet   BC     CH     GL     G2     HD     HE     …   Mean
C4        72     99.2   63.2   74.3   73.6   81.2   …   85.9
1R*       72.5   69.2   56.4   77     78     85.1   …   83.8
Max       72.5   99.2   63.2   77     78     85.1   …   87.4
• C4 achieves only about 2% better than 1R* • If we take the best of the C4/1R* in each case, we can only achieve 87.4% accuracy • We have only weak evidence that C4 better • both methods performing near ceiling of possible • Ceiling effect is that we can’t compare the two methods well because both are achieving near the best practicable
Ceiling Effects • In fact 1R* only uses one feature (the best one) • C4 uses on average 6.6 features • 5.6 features buy only about 2% improvement • Conclusion? • Either real world learning problems are easy (use 1R*) • Or we need more challenging datasets • We need to be aware of ceiling effects in results
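The arithmetic behind the ceiling argument can be checked with a short sketch. It uses only the six datasets listed in the table. The three means cover all 14 datasets, so they are entered as reported rather than recomputed. Variable names are illustrative.

```python
datasets = ["BC", "CH", "GL", "G2", "HD", "HE"]
c4 = [72.0, 99.2, 63.2, 74.3, 73.6, 81.2]
r1 = [72.5, 69.2, 56.4, 77.0, 78.0, 85.1]  # 1R*

# Best of the two methods on each listed dataset (the "Max" row).
best = [max(a, b) for a, b in zip(c4, r1)]

# Means over all 14 datasets, as reported (only 6 datasets are listed).
mean_c4, mean_1r, mean_best = 85.9, 83.8, 87.4

gap = round(mean_c4 - mean_1r, 1)         # C4's advantage over 1R*
headroom = round(mean_best - mean_c4, 1)  # gain from picking the winner per dataset
extra_features = round(6.6 - 1, 1)        # C4's average features minus 1R*'s single feature

# Datasets among those listed where the one-feature method wins.
wins_1r = [d for d, a, b in zip(datasets, c4, r1) if b > a]

print("Max row:", best)
print(f"C4 - 1R* = {gap} points for {extra_features} extra features")
print(f"best-of-both - C4 = {headroom} points")
print("1R* wins on:", wins_1r)
```

Both gaps are small compared with the spread of accuracies across datasets, which is why the comparison gives only weak evidence in favour of C4.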
Sampling Bias • Sampling bias is when data collection is biased against certain data • e.g. teacher who says “Girls don’t answer maths questions” • observation might suggest that … • indeed girls don’t answer many questions • but also that the teacher doesn’t ask them many questions • Experienced AI researchers don’t do that, right?
Case Study: Phoenix • Phoenix = AI system to fight (simulated) forest fires • Experiments suggested that wind speed uncorrelated with time to put out fire • obviously incorrect (high winds spread forest fires) • Wind Speed vs containment time (max 150 hours): • 3: 120 55 79 10 140 26 15 110 12 54 10 103 • 6: 78 61 58 81 71 57 21 32 70 • 9: 62 48 21 55 101 • What’s the problem?
Sampling bias in Phoenix • The cut-off of 150 hours introduces sampling bias • Many high-wind fires get cut off, not many low wind • On remaining data, there is no correlation between wind speed and time (r = -0.053) • In fact, data shows that: • a lot of high wind fires take > 150 hours to contain • those that don’t are similar to low wind fires • You wouldn’t do this, right? • You might if you had automated data analysis.
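The effect can be checked on the containment times listed for Phoenix. The sketch below computes a Pearson correlation over the surviving trials and prints the number of trials and mean time at each wind speed. The `pearson` helper is written out by hand to keep the example self-contained. The number of trials started at each wind speed is not given here, so the shrinking counts are only suggestive of fires lost to the cut-off.

```python
import math

# Containment times (hours) for fires contained within the 150-hour cut-off.
times_by_wind = {
    3: [120, 55, 79, 10, 140, 26, 15, 110, 12, 54, 10, 103],
    6: [78, 61, 58, 81, 71, 57, 21, 32, 70],
    9: [62, 48, 21, 55, 101],
}

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Flatten to paired observations: one (wind speed, time) pair per fire.
winds = [w for w, ts in times_by_wind.items() for _ in ts]
times = [t for ts in times_by_wind.values() for t in ts]
r = pearson(winds, times)

for wind, ts in times_by_wind.items():
    print(f"wind {wind}: n={len(ts):2d}  mean={sum(ts) / len(ts):.1f}")
print(f"r = {r:.3f}")
```

On these values the correlation comes out close to zero and the group means are nearly equal, while fewer trials survive at higher wind speeds. An automated analysis that reports only r would miss that pattern.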