Empirical Methods for AI & CS

Paul Cohen (cohen@cs.umass.edu), Ian P. Gent (ipg@dcs.st-and.ac.uk), Toby Walsh (tw@cs.york.ac.uk)

Overview • Introduction • What are empirical methods? • Why use them? • Case Study • Eight Basic Lessons • Experiment design • Data analysis • How not to do it • Supplementary material

Resources • Web www.cs.york.ac.uk/~tw/empirical.html www.cs.amherst.edu/~dsj/methday.html • Books “Empirical Methods for AI”, Paul Cohen, MIT Press, 1995 • Journals Journal of Experimental Algorithmics, www.jea.acm.org • Conferences Workshop on Empirical Methods in AI (at ECAI-2000) Workshop on Algorithm Engineering and Experiments, ALENEX 01 (alongside SODA)

Empirical Methods for CS Part I : Introduction

What does “empirical” mean? • Relying on observations, data, experiments • Empirical work should complement theoretical work • Theories often have holes (e.g., How big is the constant term? Is the current problem a “bad” one?) • Theories are suggested by observations • Theories are tested by observations • Conversely, theories direct our empirical attention • In addition (in this tutorial at least) empirical means “wanting to understand behavior of complex systems”

Why We Need Empirical Methods Cohen, 1990 Survey of 150 AAAI Papers • Roughly 60% of the papers gave no evidence that the work they described had been tried on more than a single example problem. • Roughly 80% of the papers made no attempt to explain performance, to tell us why it was good or bad and under which conditions it might be better or worse. • Only 16% of the papers offered anything that might be interpreted as a question or a hypothesis. • Theory papers generally had no applications or empirical work to support them; empirical papers were demonstrations, not experiments, and had no underlying theoretical support. • The essential synergy between theory and empirical work was missing

Theory, not Theorems • Theory based science need not be all theorems • otherwise science would be mathematics • Consider theory of QED (Quantum Electro Dynamics) • based on a model of behaviour of particles • predictions accurate to many decimal places (9?) • most accurate theory in the whole of science? • success derived from accuracy of predictions • not the depth or difficulty or beauty of theorems • QED is an empirical theory!

Empirical CS/AI • Computer programs are formal objects • so let’s reason about them entirely formally? • Two reasons why we can’t or won’t: • theorems are hard • some questions are empirical in nature e.g. are Horn clauses adequate to represent the sort of knowledge met in practice? e.g. even though our problem is intractable in general, are the instances met in practice easy to solve?

Empirical CS/AI • Treat computer programs as natural objects • like fundamental particles, chemicals, living organisms • Build (approximate) theories about them • construct hypotheses e.g. greedy hill-climbing is important to GSAT • test with empirical experiments e.g. compare GSAT with other types of hill-climbing • refine hypotheses and modelling assumptions e.g. greediness not important, but hill-climbing is!

Empirical CS/AI • Many advantages over other sciences • Cost • no need for expensive super-colliders • Control • unlike the real world, we often have complete command of the experiment • Reproducibility • in theory, computers are entirely deterministic • Ethics • no ethics panels needed before you run experiments

Types of hypothesis • My search program is better than yours • not very helpful beauty competition? • Search cost grows exponentially with number of variables for this kind of problem • better as we can extrapolate to data not yet seen? • Constraint systems are better at handling over-constrained systems, but OR systems are better at handling under-constrained systems • even better as we can extrapolate to new situations?

A typical conference conversation What are you up to these days? I’m running an experiment to compare the Davis-Putnam algorithm with GSAT. Why? I want to know which is faster. Why? Lots of people use each of these algorithms. How will these people use your result?...

Keep in mind the BIG picture What are you up to these days? I’m running an experiment to compare the Davis-Putnam algorithm with GSAT. Why? I have this hypothesis that neither will dominate. What use is this? A portfolio containing both algorithms will be more robust than either algorithm on its own.

Keep in mind the BIG picture ... Why are you doing this? Because many real problems are intractable in theory but need to be solved in practice. How does your experiment help? It helps us understand the difference between average and worst case results So why is this interesting? Intractability is one of the BIG open questions in CS!

Why is empirical CS/AI in vogue? • Inadequacies of theoretical analysis • problems often aren’t as hard in practice as theory predicts in the worst-case • average-case analysis is very hard (and often based on questionable assumptions) • Some “spectacular” successes • phase transition behaviour • local search methods • theory lagging behind algorithm design

Why is empirical CS/AI in vogue? • Compute power ever increasing • even “intractable” problems coming into range • easy to perform large (and sometimes meaningful) experiments • Empirical CS/AI perceived to be “easier” than theoretical CS/AI • often a false perception as experiments easier to mess up than proofs

Empirical Methods for CS Part II: A Case Study Eight Basic Lessons

Rosenberg study • “An Empirical Study of Dynamic Scheduling on Rings of Processors” Gregory, Gao, Rosenberg & Cohen Proc. of 8th IEEE Symp. on Parallel & Distributed Processing, 1996 Linked to from www.cs.york.ac.uk/~tw/empirical.html

Problem domain • Scheduling processors on ring network • jobs spawned as binary trees • KOSO • keep one, send one to my left or right arbitrarily • KOSO* • keep one, send one to my least heavily loaded neighbour
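The two policies can be sketched as a toy simulation. This is only an illustration under assumptions that are not taken from the paper: every job takes one time step, a finished job spawns two children with probability `spawn_prob`, the total number of jobs is capped at `max_jobs`, and "load" means current queue length. The function name `simulate_ring` and all parameter names are hypothetical.

```python
import random


def simulate_ring(policy, num_procs, max_jobs, spawn_prob, seed=0):
    """Simulate one run on a ring; return (time_steps, jobs_executed).

    Assumed model (not from the paper): unit-time jobs, binary spawning
    with probability spawn_prob, at most max_jobs jobs ever created.
    """
    rng = random.Random(seed)
    queues = [0] * num_procs          # jobs waiting at each processor
    queues[0] = 1                     # the root job starts at processor 0
    budget = max_jobs - 1             # jobs that may still be created
    steps = executed = 0
    while any(queues):
        steps += 1
        arrivals = [0] * num_procs    # children delivered at the end of the step
        for i in range(num_procs):
            if queues[i] == 0:
                continue              # this processor starves for one step
            queues[i] -= 1
            executed += 1
            if budget >= 2 and rng.random() < spawn_prob:
                budget -= 2
                left, right = (i - 1) % num_procs, (i + 1) % num_procs
                if policy == "KOSO":  # send to left or right arbitrarily
                    target = rng.choice((left, right))
                else:                 # "KOSO*": least heavily loaded neighbour
                    load_left = queues[left] + arrivals[left]
                    load_right = queues[right] + arrivals[right]
                    target = left if load_left <= load_right else right
                arrivals[i] += 1       # keep one
                arrivals[target] += 1  # send one
        queues = [q + a for q, a in zip(queues, arrivals)]
    return steps, executed


for policy in ("KOSO", "KOSO*"):
    steps, executed = simulate_ring(policy, num_procs=10, max_jobs=1001, spawn_prob=1.0)
    print(policy, "steps:", steps, "jobs:", executed)
```

The sketch is meant to make the two rules concrete, not to reproduce the study's results; the real experiment used a probabilistic tree generator and startup costs that are not modelled here.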

Theory • On complete binary trees, KOSO is asymptotically optimal • So KOSO* can’t be any better? • But assumptions unrealistic • tree not complete • asymptotically not necessarily the same as in practice! • Thm: Using KOSO on a ring of p processors, a binary tree of height n is executed within (2^n-1)/p + low order terms
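The leading term of the theorem can be read as total work divided by number of processors. The worked step below assumes that "height n" means the tree has n levels and that each node costs one unit of work; neither assumption is stated on the slide.

```latex
\underbrace{\sum_{i=0}^{n-1} 2^{i}}_{\text{nodes in a complete binary tree with } n \text{ levels}} = 2^{n} - 1
\qquad\Longrightarrow\qquad
T_{\mathrm{KOSO}}(n, p) \;\le\; \frac{2^{n} - 1}{p} + \text{low order terms}
```

Since p processors can finish at most p units of work per time step, no schedule can take fewer than (2^n - 1)/p steps, which is why matching that term up to low order terms counts as asymptotically optimal.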

Benefits of an empirical study • More realistic trees • probabilistic generator that makes shallow trees, which are “bushy” near root but quickly get “scrawny” • similar to trees generated when performing Trapezoid or Simpson’s Rule calculations • binary trees correspond to interval bisection • Startup costs • network must be loaded

Lesson 1: Evaluation begins with claims. Lesson 2: Demonstration is good, understanding better • Hypothesis (or claim): KOSO takes longer than KOSO* because KOSO* balances loads better • The “because phrase” indicates a hypothesis about why it works. This is a better hypothesis than the beauty contest demonstration that KOSO* beats KOSO • Experiment design • Independent variables: KOSO v KOSO*, no. of processors, no. of jobs, probability(job will spawn) • Dependent variable: time to complete jobs

Criticism 1: This experiment design includes no direct measure of the hypothesized effect • Hypothesis: KOSO takes longer than KOSO* because KOSO* balances loads better • But experiment design includes no direct measure of load balancing: • Independent variables: KOSO v KOSO*, no. of processors, no. of jobs, probability(job will spawn), • Dependent variable: time to complete jobs

Lesson 3: Exploratory data analysis means looking beneath immediate results for explanations • T-test on time to complete jobs: t = (2825-2935)/587 = -.19 • KOSO* apparently no faster than KOSO (as theory predicted) • Why? Look more closely at the data: • Outliers create excessive variance, so test isn’t significant [Figure: time to complete jobs, panels KOSO and KOSO*]
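The arithmetic on this slide, and the way outliers can swamp a real difference, can be sketched as follows. The slide does not say which form of t-test was used; the `welch_t` function below (unpooled variances) is an assumption, and the sample values are invented purely for illustration.

```python
import math
import statistics


def t_from_summary(mean_a, mean_b, std_err):
    """t statistic from two means and the standard error of their difference."""
    return (mean_a - mean_b) / std_err


def welch_t(sample_a, sample_b):
    """Two-sample t statistic with unpooled variances (an assumed choice)."""
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    std_err = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_err


# The numbers quoted on the slide.
slide_t = t_from_summary(2825, 2935, 587)
print(round(slide_t, 2))

# Hypothetical run times: a consistent difference of 10 time units.
clean_a = [100, 102, 98, 101, 99, 100]
clean_b = [110, 112, 108, 111, 109, 110]

# The same data with one extreme trial added to each group.
noisy_a = clean_a + [1000]
noisy_b = clean_b + [1000]

print(welch_t(clean_a, clean_b))   # large in magnitude
print(welch_t(noisy_a, noisy_b))   # close to zero: the outliers inflate the variance
```

A single extreme trial per group is enough to hide a difference that is obvious in the remaining data, which is the reason for looking at the raw distributions before trusting the test.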

Lesson 4: The task of empirical work is to explain variability [Diagram: algorithm (KOSO/KOSO*), number of processors, number of jobs and “random noise” (e.g., outliers) all contribute to run-time] Empirical work assumes the variability in a dependent variable (e.g., run time) is the sum of causal factors and random noise. Statistical methods assign parts of this variability to the factors and the noise. Number of processors and number of jobs explain 74% of the variance in run time. Algorithm explains almost none.
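One simple way to assign variance to a factor is the ratio of between-group to total sum of squares. The sketch below uses that measure as an assumption; the slide does not say which method produced the 74% figure, and the records are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trials: (num_procs, algorithm, run_time).
records = [
    (3, "KOSO", 310), (3, "KOSO", 305), (3, "KOSO*", 300), (3, "KOSO*", 290),
    (10, "KOSO", 110), (10, "KOSO", 105), (10, "KOSO*", 100), (10, "KOSO*", 95),
]


def variance_explained(records, factor_index, value_index=2):
    """Fraction of total variation accounted for by one factor (SS_between / SS_total)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[factor_index]].append(record[value_index])
    values = [record[value_index] for record in records]
    grand_mean = mean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(group) * (mean(group) - grand_mean) ** 2 for group in groups.values()
    )
    return ss_between / ss_total


by_procs = variance_explained(records, factor_index=0)
by_algorithm = variance_explained(records, factor_index=1)
print(f"processors: {by_procs:.3f}, algorithm: {by_algorithm:.3f}")
```

In this made-up table the algorithm has a consistent effect, yet it accounts for a tiny share of the variation because the number of processors dominates run time.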

Lesson 3 (again): Exploratory data analysis means looking beneath immediate results for explanations [Figure: queue length at processor i, panels KOSO and KOSO*] • Why does the KOSO/KOSO* choice account for so little of the variance in run time? • Unless processors starve, there will be no effect of load balancing. In most conditions in this experiment, processors never starved. (This is why we run pilot experiments!)

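Lesson 5: Of sample variance, effect size, and sample size – control the first before touching the last $t = \frac{\bar{x} - \mu}{s / \sqrt{N}}$ where $\bar{x} - \mu$ is the magnitude of the effect, $s$ is the background variance (the sample standard deviation) and $N$ is the sample size. This intimate relationship holds for all statistics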
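The trade-off in this lesson can be checked numerically. The sketch assumes the standard one-sample form of the statistic, t = (sample mean - mu) / (s / sqrt(N)); the function name `t_statistic` and the numbers are illustrative only.

```python
import math


def t_statistic(sample_mean, mu, s, n):
    """One-sample t: effect size divided by the standard error s / sqrt(n)."""
    return (sample_mean - mu) / (s / math.sqrt(n))


baseline = t_statistic(105, 100, s=20, n=16)     # effect 5, s = 20, N = 16
halved_s = t_statistic(105, 100, s=10, n=16)     # halve the background variation
quadrupled_n = t_statistic(105, 100, s=20, n=64) # or run four times as many trials

print(baseline, halved_s, quadrupled_n)
```

Halving s buys the same increase in t as quadrupling N, which is why it usually pays to reduce variance before collecting more data.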

Lesson 5 illustrated: A variance reduction method • Let N = num-jobs, P = num-processors, T = run time • Then $T = k\,(N / P)$, or k multiples of the theoretical best time • And $k = \frac{1}{N / (P\,T)} = \frac{P\,T}{N}$ [Figure: panels k(KOSO) and k(KOSO*)]
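A short sketch of the recoding: raw run time varies mostly with N and P, whereas k = P T / N factors that out. The trial values are hypothetical, and the coefficient of variation is used here only as a convenient way to compare spread on the two scales.

```python
from statistics import mean, stdev


def k_ratio(run_time, num_jobs, num_procs):
    """Run time as a multiple of the theoretical best time N / P."""
    return run_time / (num_jobs / num_procs)


# Hypothetical trials: (num_jobs, num_procs, run_time).
trials = [
    (1000, 10, 120), (1000, 20, 58),
    (5000, 10, 560), (5000, 20, 290),
    (20000, 10, 2150), (20000, 20, 1100),
]

raw_times = [t for _, _, t in trials]
ks = [k_ratio(t, n, p) for n, p, t in trials]


def coefficient_of_variation(values):
    """Standard deviation relative to the mean."""
    return stdev(values) / mean(values)


print("raw run time:", round(coefficient_of_variation(raw_times), 2))
print("recoded as k:", round(coefficient_of_variation(ks), 2))
```

Once the variation due to problem size and ring size is removed, a small difference between algorithms has a much better chance of showing up.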

Where are we? • KOSO* is significantly better than KOSO when the dependent variable is recoded as percentage of optimal run time • The difference between KOSO* and KOSO explains very little of the variance in either dependent variable • Exploratory data analysis tells us that processors aren’t starving so we shouldn’t be surprised • Prediction: The effect of algorithm on run time (or k) increases as the number of jobs increases or the number of processors increases • This prediction is about interactions between factors

Lesson 6: Most interesting science is about interaction effects, not simple main effects • Data confirm prediction • KOSO* is superior on larger rings where starvation is an issue • Interaction of independent variables • choice of algorithm • number of processors • Interaction effects are essential to explaining how things work [Figure: multiples of optimal run-time (1 to 3) against number of processors (3, 6, 10, 20), for KOSO and KOSO*]
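An interaction between algorithm and ring size means that the effect of the algorithm differs from one ring size to another. A minimal sketch of that idea as a difference of differences follows; the cell means are invented, not read from the figure, and the function names are hypothetical.

```python
import math

# Hypothetical mean k (multiples of optimal run time) per design cell.
cell_means = {
    ("KOSO", 3): 1.10, ("KOSO*", 3): 1.08,
    ("KOSO", 20): 2.50, ("KOSO*", 20): 1.60,
}


def algorithm_effect(cell_means, num_procs):
    """How much worse KOSO is than KOSO* at a given ring size."""
    return cell_means[("KOSO", num_procs)] - cell_means[("KOSO*", num_procs)]


def interaction(cell_means, small_ring, large_ring):
    """Difference of differences: zero means the two factors do not interact."""
    return algorithm_effect(cell_means, large_ring) - algorithm_effect(cell_means, small_ring)


print("effect on 3 processors:", round(algorithm_effect(cell_means, 3), 2))
print("effect on 20 processors:", round(algorithm_effect(cell_means, 20), 2))
print("interaction:", round(interaction(cell_means, 3, 20), 2))
```

A main-effect comparison would average the two ring sizes together and hide the fact that, in this made-up table, almost all of the benefit appears on the large ring.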

Lesson 7: Significant and meaningful are not synonymous. Is a result meaningful? • KOSO* is significantly better than KOSO, but can you use the result? • Suppose you wanted to use the knowledge that the ring is controlled by KOSO or KOSO* for some prediction. • Grand median k = 1.11; Pr(trial i has k > 1.11) = .5 • Pr(trial i under KOSO has k > 1.11) = 0.57 • Pr(trial i under KOSO* has k > 1.11) = 0.43 • Predict for trial i whether its k is above or below the median: • If it’s a KOSO* trial you’ll say no with (.43 * 150) = 64.5 errors • If it’s a KOSO trial you’ll say yes with ((1 - .57) * 160) = 68.8 errors • If you don’t know you’ll make (.5 * 310) = 155 errors • 155 - (64.5 + 68.8) = 21.7, about 22 fewer errors • Knowing the algorithm reduces error rate from .5 to .43. Is this enough???
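The error counts on this slide can be reproduced directly from the quoted probabilities and trial counts. Variable names are illustrative; the numbers are the slide's own.

```python
# Trial counts and probabilities quoted on the slide.
n_koso_star, n_koso = 150, 160
p_above_koso_star, p_above_koso = 0.43, 0.57

# KOSO* trial: predict "not above the median", wrong whenever k is above it.
errors_koso_star = p_above_koso_star * n_koso_star

# KOSO trial: predict "above the median", wrong whenever k is not above it.
errors_koso = (1 - p_above_koso) * n_koso

# Without knowing the algorithm, any guess is wrong half of the time.
total_trials = n_koso_star + n_koso
errors_blind = 0.5 * total_trials

errors_informed = errors_koso_star + errors_koso
errors_saved = errors_blind - errors_informed
informed_error_rate = errors_informed / total_trials

print(round(errors_koso_star, 1), round(errors_koso, 1), round(errors_blind, 1))
print(round(errors_saved, 1), round(informed_error_rate, 2))
```

The statistically significant difference between the algorithms removes only about 22 of 155 prediction errors, which is the sense in which significance and usefulness come apart.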

Lesson 8: Keep the big picture in mind Why are you studying this? Load balancing is important to get good performance out of parallel computers Why is this important? Parallel computing promises to tackle many of our computational bottlenecks How do we know this? It’s in the first paragraph of the paper!

Case study: conclusions • Evaluation begins with claims • Demonstrations of simple main effects are good, understanding the effects is better • Exploratory data analysis means using your eyes to find explanatory patterns in data • The task of empirical work is to explain variability • Control variability before increasing sample size • Interaction effects are essential to explanations • Significant ≠ meaningful • Keep the big picture in mind

Empirical Methods for CS Part III : Experiment design

Experimental Life Cycle • Exploration • Hypothesis construction • Experiment • Data analysis • Drawing of conclusions

Checklist for experiment design* • Consider the experimental procedure • making it explicit helps to identify spurious effects and sampling biases • Consider a sample data table • identifies what results need to be collected • clarifies dependent and independent variables • shows whether data pertain to hypothesis • Consider an example of the data analysis • helps you to avoid collecting too little or too much data • especially important when looking for interactions • *From Chapter 3, “Empirical Methods for Artificial Intelligence”, Paul Cohen, MIT Press

Guidelines for experiment design • Consider possible results and their interpretation • may show that experiment cannot support/refute hypotheses under test • unforeseen outcomes may suggest new hypotheses • What was the question again? • easy to get carried away designing an experiment and lose the BIG picture • Run a pilot experiment to calibrate parameters (e.g., number of processors in Rosenberg experiment)

Types of experiment • Manipulation experiment • Observation experiment • Factorial experiment

Manipulation experiment • Independent variable, x • x=identity of parser, size of dictionary, … • Dependent variable, y • y=accuracy, speed, … • Hypothesis • x influences y • Manipulation experiment • change x, record y

Observation experiment • Predictor, x • x=volatility of stock prices, … • Response variable, y • y=fund performance, … • Hypothesis • x influences y • Observation experiment • classify according to x, compute y

Factorial experiment • Several independent variables, xi • there may be no simple causal links • data may come that way e.g. individuals will have different sexes, ages, ... • Factorial experiment • every possible combination of xi considered • expensive as its name suggests!
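Enumerating every combination of factor levels is straightforward to sketch, and it shows how quickly the cost grows. The factor names follow the case study; the ring sizes 3, 6, 10 and 20 appear on the case-study plot, while the levels for jobs and spawn probability are made up for illustration.

```python
from itertools import product

# Levels per independent variable (num_jobs and spawn_prob levels are hypothetical).
factors = {
    "algorithm": ["KOSO", "KOSO*"],
    "num_procs": [3, 6, 10, 20],
    "num_jobs": [1000, 5000, 20000],
    "spawn_prob": [0.3, 0.6],
}


def factorial_design(factors):
    """Return one dict per cell of the full factorial design."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


design = factorial_design(factors)
print(len(design), "cells; first cell:", design[0])
```

The number of cells is the product of the level counts, so every extra factor or level multiplies the number of runs needed.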

Designing factorial experiments • In general, stick to 2 to 3 independent variables • Solve same set of problems in each case • reduces variance due to differences between problem sets • If this is not possible, use same sample sizes • simplifies statistical analysis • As usual, default hypothesis is that no influence exists • much easier to fail to demonstrate influence than to demonstrate an influence
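Why solving the same problems in each condition helps can be shown with a paired comparison. This is a sketch under assumptions: the run times are invented, and the paired and unpaired t statistics are one common way to make the point, not something prescribed by the slides.

```python
import math
from statistics import mean, stdev, variance

# Hypothetical run times of two algorithms on the SAME five problems.
# Problem difficulty varies enormously; algorithm B is slightly faster on each.
times_a = [10, 200, 35, 1200, 80]
times_b = [9, 190, 33, 1150, 76]


def unpaired_t(sample_a, sample_b):
    """Treats the two samples as if they came from different problem sets."""
    std_err = math.sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / std_err


def paired_t(sample_a, sample_b):
    """Works on per-problem differences, so problem difficulty cancels out."""
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


print("unpaired t:", round(unpaired_t(times_a, times_b), 3))
print("paired t:", round(paired_t(times_a, times_b), 3))
```

The mean difference is identical in both analyses; only the noise term changes, because pairing removes the variation between problems.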

Some problem issues • Control • Ceiling and Floor effects • Sampling Biases

Control • A control is an experiment in which the hypothesised variation does not occur • so the hypothesized effect should not occur either • BUT remember • placebos cure a large percentage of patients!

Control: a cautionary tale • Macaque monkeys given vaccine based on human T-cells infected with SIV (relative of HIV) • macaques gained immunity from SIV • Later, macaques given uninfected human T-cells • and macaques still gained immunity! • Control experiment not originally done • and not always obvious (you can’t control for all variables)

Control: MYCIN case study • MYCIN was a medical expert system • recommended therapy for blood/meningitis infections • How to evaluate its recommendations? • Shortliffe used • 10 sample problems, 8 therapy recommenders • 5 faculty, 1 resident, 1 postdoc, 1 student • 8 impartial judges gave 1 point per problem • max score was 80 • MYCIN 65, faculty 40-60, postdoc 60, resident 45, student 30

Control: MYCIN case study • What were controls? • Control for judges’ bias for/against computers • judges did not know who recommended each therapy • Control for easy problems • medical student did badly, so problems not easy • Control for our standard being low • e.g. random choice should do worse • Control for factor of interest • e.g. hypothesis in MYCIN that “knowledge is power” • have groups with different levels of knowledge

Ceiling and Floor Effects • Well designed experiments (with good controls) can still go wrong • What if all our algorithms do particularly well? • Or they all do badly? • We’ve got little evidence to choose between them

Ceiling and Floor Effects • Ceiling effects arise when test problems are insufficiently challenging • floor effects the opposite, when problems too challenging • A problem in AI because we often repeatedly use the same benchmark sets • most benchmarks will lose their challenge eventually? • but how do we detect this effect?

Ceiling Effects: machine learning • 14 datasets from UCI corpus of benchmarks • used as mainstay of ML community • Problem is learning classification rules • each item is vector of features and a classification • measure classification accuracy of method (max 100%) • Compare C4 with 1R*, two competing algorithms
