Artificial Intelligence

Artificial Intelligence. Empirical Evaluation of AI Systems. Ian Gent ipg@cs.st-and.ac.uk. Artificial Intelligence. Empirical Evaluation of Computer Systems. Part I : Philosophy of Science Part II: Experiments in AI Part III: Basics of Experimental Design with AI case studies.


Presentation Transcript

  1. Artificial Intelligence Empirical Evaluation of AI Systems Ian Gent ipg@cs.st-and.ac.uk

  2. Artificial Intelligence Empirical Evaluation of Computer Systems Part I : Philosophy of Science Part II: Experiments in AI Part III: Basics of Experimental Design with AI case studies

  3. Science as Refutation • Modern view of the progress of Science based on Popper. (Sir Karl Popper, that is) • A scientific theory is one that can be refuted • I.e. it should make testable predictions • If these predictions are incorrect, the theory is false • theory may still be useful, e.g. Newtonian physics • Therefore science is hypothesis testing • Artificial intelligence aspires to be a science

  4. Empirical Science • Empirical = “Relying upon or derived from observation or experiment” • Most (all) of Science is empirical. • Consider theoretical computer science • study based on Turing machines, lambda calculus, etc • Founded on empirical observation that computer systems developed to date are Turing-complete • Quantum computers might challenge this • if so, an empirically based theory of quantum computing will develop

  5. Theory, not Theorems • Theory based science need not be all theorems • otherwise science would be Mathematics • Compare Physics theory “QED” • most accurate theory in the whole of science? • based on a model of behaviour of particles • predictions accurate to many decimal places (9?) • success derived from accuracy of predictions • not the depth or difficulty or beauty of theorems • I.e. QED is an empirical theory • AI/CS has too many theorems and not enough theory • compare advice on how to publish in JACM

  6. Empirical CS/AI • Computer programs are formal objects • so some use only theory that can be proved by theorems • but theorems are hard • Treat computer programs as natural objects • like quantum particles, chemicals, living objects • perform empirical experiments • We have a huge advantage over other sciences • no need for supercolliders (expensive) or animal experiments (ethical problems) • we should have complete command of experiments

  7. What are our hypotheses? • My search program is better than yours • Search cost grows exponentially with number of variables for this kind of problem • Constraint search systems are better at handling overconstrained systems, but OR systems are better at handling underconstrained systems • My company should buy an AI search system rather than an OR one
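
The second hypothesis on this slide is a refutable prediction: if search cost grows exponentially with the number of variables n, then log(cost) should be close to a straight line in n. The following is a minimal sketch of that check, not a method taken from the slides. The helper name `fit_log_linear` and the cost figures are invented for illustration; the costs are exactly exponential by construction rather than measured.

```python
import math

def fit_log_linear(ns, costs):
    """Least-squares fit of log2(cost) = a + b*n; returns (a, b, r_squared)."""
    ys = [math.log2(c) for c in costs]
    n = len(ns)
    mx = sum(ns) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in ns)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(ns, ys))
    b = sxy / sxx                    # slope: doublings of cost per extra variable
    a = my - b * mx                  # intercept
    r2 = (sxy * sxy) / (sxx * syy)   # how well a straight line fits log2(cost)
    return a, b, r2

# Hypothetical data, invented for illustration: number of variables
# versus search nodes visited. Exactly exponential by construction.
ns = [10, 20, 30, 40, 50]
costs = [3 * 2 ** (n / 10) for n in ns]

a, b, r2 = fit_log_linear(ns, costs)
print(f"slope = {b:.3f} doublings per variable, r^2 = {r2:.3f}")
```

On real measurements, a poor straight-line fit (low r^2, or a slope that drifts as n grows) would count as evidence against the exponential-growth hypothesis.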

  8. Why do experiments? • Too often AI experimenters might talk like this: • What is your experiment for? • is my algorithm better than his? • Why? • I want to know which is faster • Why? • Lots of people use each kind … • How will these people use your result? • ?

  9. Why do experiments? • Compare experiments on identical twins: • What is your experiment for? • I want to find out how identical twins reared apart compare to those reared together, and to nonidentical twins too. • Why? • We can get estimates of the genetic and social contributors to performance • Why? • Because the role of genetics in behavior is one of the great unsolved questions. • Experiments should address research questions • otherwise they can just be “track meets”

  10. Basic issues in Experimental Design • From Paul R Cohen, Empirical Methods for Artificial Intelligence, MIT Press, 1995, Chapter 3 • Control • Ceiling and Floor effects • Sampling Biases

  11. Control • A control is an experiment in which the hypothesised variation does not occur • so the hypothesised effect should not occur either • e.g. Macaque monkeys given vaccine based on human T-cells infected with SIV (relative of HIV) • macaques gained immunity from SIV • Later, macaques given uninfected human T-cells • and macaques still gained immunity! • Control experiment not originally done • and not always obvious (you can’t control for all variables)

  12. Case Study: MYCIN • MYCIN was a medical expert system • recommended therapy for blood/meningitis infections • How to evaluate its recommendations? • Shortliffe used • 10 sample problems • 8 other therapy recommenders • 5 faculty at Stanford Med. School, 1 senior resident, 1 senior postdoctoral researcher (fellow), 1 senior student • 8 impartial judges gave 1 point per problem • Max score was 80 • MYCIN: 65, Faculty: 40-60, Fellow: 60, Resident: 45, Student: 30

  13. Case Study: MYCIN • What were controls? • Control for judge’s bias for/against computers • judges did not know who recommended each therapy • Control for easy problems • medical student did badly, so problems not easy • Control for our standard being low • e.g. random choice should do worse • Control for factor of interest • e.g. hypothesis in MYCIN that “knowledge is power” • have groups with different levels of knowledge

  14. Ceiling and Floor Effects • Well designed experiments can go wrong • What if all our algorithms do particularly well (or they all do badly)? • We’ve got little evidence to choose between them • Ceiling effects arise when test problems are insufficiently challenging • floor effects the opposite, when problems too challenging • A problem in AI because we often use benchmark sets • But how do we detect the effect?

  15. Ceiling Effects: Machine Learning • 14 datasets from UCI corpus of benchmarks • used as mainstay of ML community • Problem is learning classification rules • each item is vector of features and a classification • measure classification accuracy of method (max 100%) • Compare C4 with 1R*, two competing algorithms (accuracy in %):
      DataSet   BC     CH     GL     G2     HD     HE     …   Mean
      C4        72     99.2   63.2   74.3   73.6   81.2   …   85.9
      1R*       72.5   69.2   56.4   77     78     85.1   …   83.8

  16. Ceiling Effects
      DataSet   BC     CH     GL     G2     HD     HE     …   Mean
      C4        72     99.2   63.2   74.3   73.6   81.2   …   85.9
      1R*       72.5   69.2   56.4   77     78     85.1   …   83.8
      Max       72.5   99.2   63.2   77     78     85.1   …   87.4
      • C4's mean accuracy is only about 2 percentage points better than 1R*'s • If we take the best of C4/1R* in each case, we can only achieve 87.4% accuracy • We have only weak evidence that C4 is better • both methods are performing near the ceiling of what is possible • Ceiling effect is that we can’t compare the two methods well because both are achieving near the best practicable
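
The Max row on this slide is simply the per-dataset best of the two methods. As a minimal sketch, the snippet below recomputes it for the six datasets whose figures are listed; the other eight datasets are elided on the slide, so the means are taken from the slide as reported and not recomputed. Variable names are chosen here for illustration only.

```python
# Accuracy (%) for the six datasets listed on the slide.
datasets = ["BC", "CH", "GL", "G2", "HD", "HE"]
c4    = [72.0, 99.2, 63.2, 74.3, 73.6, 81.2]
one_r = [72.5, 69.2, 56.4, 77.0, 78.0, 85.1]

# Per-dataset best of the two methods (the "Max" row).
best = [max(a, b) for a, b in zip(c4, one_r)]

# C4's advantage over 1R* on each dataset, in percentage points.
gap = [round(a - b, 1) for a, b in zip(c4, one_r)]

# How many of the listed datasets 1R* actually wins.
one_r_wins = sum(1 for g in gap if g < 0)

# Means over all 14 datasets, as reported on the slide (not recomputed).
MEAN_C4, MEAN_1R, MEAN_MAX = 85.9, 83.8, 87.4
c4_lead = round(MEAN_C4 - MEAN_1R, 1)       # C4 over 1R*
headroom = round(MEAN_MAX - MEAN_C4, 1)     # best-of-both over C4

for name, b, g in zip(datasets, best, gap):
    print(f"{name}: best = {b}, C4 - 1R* = {g:+}")
print(f"1R* wins on {one_r_wins} of {len(datasets)} listed datasets")
print(f"C4 leads by {c4_lead} points; best-of-both adds {headroom} more")
```

On the listed datasets 1R* wins four of six, and C4's lead in the mean rests largely on one dataset (CH), which is why the table gives only weak evidence that C4 is the better method.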

  17. Ceiling Effects • In fact 1R* only uses one feature (the best one) • C4 uses on average 6.6 features • the extra 5.6 features buy only about 2 percentage points of accuracy • Conclusion? • Either real world learning problems are easy (use 1R*) • Or we need more challenging datasets • We need to be aware of ceiling effects in results

  18. Sampling Bias • Sampling bias is when data collection is biased against certain data • e.g. teacher who says “Girls don’t answer maths questions” • observation might show that • indeed girls don’t answer many questions • but also that the teacher doesn’t ask them many questions • Experienced AI researchers don’t do that, right?

  19. Case Study: Phoenix • Phoenix = AI system to fight (simulated) forest fires • Experiments suggested that wind speed is uncorrelated with time to put out fire • obviously incorrect (high winds spread forest fires) • Wind speed vs containment time in hours (max 150):
      Wind speed 3: 120 55 79 10 140 26 15 110 12 54 10 103
      Wind speed 6: 78 61 58 81 71 57 21 32 70
      Wind speed 9: 62 48 21 55 101
      • What’s the problem?
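
To see what "uncorrelated" means for these numbers, here is a minimal sketch that computes the Pearson correlation between wind speed and containment time, using only the values listed on this slide. The function and variable names are chosen here for illustration.

```python
import math

# Containment times (hours) by wind speed, as listed on the slide.
containment = {
    3: [120, 55, 79, 10, 140, 26, 15, 110, 12, 54, 10, 103],
    6: [78, 61, 58, 81, 71, 57, 21, 32, 70],
    9: [62, 48, 21, 55, 101],
}

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Flatten to one (wind speed, time) pair per fire.
xs = [wind for wind, times in containment.items() for _ in times]
ys = [t for times in containment.values() for t in times]

r = pearson_r(xs, ys)
means = {wind: sum(times) / len(times) for wind, times in containment.items()}
counts = {wind: len(times) for wind, times in containment.items()}

print(f"r = {r:.3f}")
print("mean hours by wind speed:", {w: round(m, 1) for w, m in means.items()})
print("fires recorded by wind speed:", counts)
```

The correlation comes out close to zero (roughly -0.04) and the group means are almost identical, but note how the number of recorded fires falls from 12 to 9 to 5 as wind speed rises.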

  20. Sampling bias in Phoenix • The cut-off of 150 hours introduces sampling bias • Many high-wind fires get cut off, not many low-wind ones • On the remaining data, there is no correlation between wind speed and time (r ≈ -0.04 for the values listed on the previous slide) • In fact, data shows that: • a lot of high wind fires take > 150 hours to contain • those that don’t are similar to low wind fires • You wouldn’t do this, right? • You might if you had automated data analysis.
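
The effect of the cut-off can be reproduced on made-up data. The sketch below is an illustration under stated assumptions, not the Phoenix experiment: `base_hours` and `slowdown` are invented so that higher wind really does lengthen containment time, and trials over 150 hours are then discarded as the cut-off would discard them.

```python
import math

CUTOFF = 150  # hours, the cut-off mentioned on the slide

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic, invented data (NOT Phoenix results): seven fires of
# increasing difficulty, each slowed by a wind-dependent factor.
base_hours = [10, 30, 50, 70, 90, 110, 130]
slowdown = {3: 1.0, 6: 1.5, 9: 2.0}

trials = [(wind, hours * factor)
          for wind, factor in slowdown.items()
          for hours in base_hours]

# Apply the cut-off: trials that exceed it never reach the data set.
kept = [(wind, t) for wind, t in trials if t <= CUTOFF]
dropped = {wind: sum(1 for w, t in trials if w == wind and t > CUTOFF)
           for wind in slowdown}

r_full = pearson_r(*zip(*trials))
r_kept = pearson_r(*zip(*kept))

print(f"all trials:    r = {r_full:.2f}")
print(f"after cut-off: r = {r_kept:.2f}")
print("trials dropped by wind speed:", dropped)
```

With these invented numbers the correlation falls from about 0.42 on all trials to about 0.10 once the cut-off is applied, and every discarded trial is a medium- or high-wind fire. An automated analysis that only ever sees the surviving trials has no way to notice this.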
