
Empirical Methods for AI (and CS)

CS1573: AI Application Development

(after a tutorial by Paul Cohen, Ian P. Gent, and Toby Walsh)


Overview

  • Introduction

    • What are empirical methods?

    • Why use them?

  • Case Study

  • Experiment design

  • Data analysis



Empirical Methods for CS

Part I: Introduction


What does “empirical” mean?

  • Relying on observations, data, experiments

  • Empirical work should complement theoretical work

    • Theories often have holes

    • Theories are suggested by observations

    • Theories are tested by observations

    • Conversely, theories direct our empirical attention

  • In addition (in this course at least) empirical means “wanting to understand behavior of complex systems”


Why We Need Empirical Methods: Cohen’s 1990 Survey of 150 AAAI Papers

  • Roughly 60% of the papers gave no evidence that the work they described had been tried on more than a single example problem.

  • Roughly 80% of the papers made no attempt to explain performance, to tell us why it was good or bad and under which conditions it might be better or worse.

  • Only 16% of the papers offered anything that might be interpreted as a question or a hypothesis.

  • Theory papers generally had no applications or empirical work to support them; empirical papers were demonstrations, not experiments, and had no underlying theoretical support.

  • The essential synergy between theory and empirical work was missing.


Empirical CS/AI (Theory, not Theorems)

  • Computer programs are formal objects

    • so let’s reason about them entirely formally?

  • Two reasons why we can’t or won’t:

    • theorems are hard

    • some questions are empirical in nature

      e.g. are finite state automata adequate to represent the sort of knowledge met in practice?

      e.g. even though our problem is intractable in general, are the instances met in practice easy to solve?


Empirical CS/AI

  • Treat computer programs as natural objects

    • like fundamental particles, chemicals, living organisms

  • Build (approximate) theories about them

    • construct hypotheses

      e.g. greedy search is important to chess

    • test with empirical experiments

      e.g. compare search with planning

    • refine hypotheses and modelling assumptions

      e.g. greediness not important, but search is!


Empirical CS/AI

  • Many advantages over other sciences

  • Cost

    • no need for expensive super-colliders

  • Control

    • unlike the real world, we often have complete command of the experiment

  • Reproducibility

    • in theory, computers are entirely deterministic

  • Ethics

    • no ethics panels needed before you run experiments


Types of hypothesis

  • My search program is better than yours

    not very helpful: a beauty contest?

  • Search cost grows exponentially with number of variables for this kind of problem

    better, as we can extrapolate to data not yet seen?

  • Constraint systems are better at handling over-constrained systems, but OR systems are better at handling under-constrained systems

    even better, as we can extrapolate to new situations?


A typical conference conversation

What are you up to these days?

I’m running an experiment to compare the Davis-Putnam algorithm with GSAT

Why?

I want to know which is faster

Why?

Lots of people use each of these algorithms

How will these people use your result?...


Keep in mind the BIG picture

What are you up to these days?

I’m running an experiment to compare the Davis-Putnam algorithm with GSAT

Why?

I have this hypothesis that neither will dominate

What use is this?

A portfolio containing both algorithms will be more robust than either algorithm on its own


Keep in mind the BIG picture

...

Why are you doing this?

Because many real problems are intractable in theory but need to be solved in practice.

How does your experiment help?

It helps us understand the difference between average and worst case results

So why is this interesting?

Intractability is one of the BIG open questions in CS!


Why is empirical CS/AI in vogue?

  • Inadequacies of theoretical analysis

    • problems often aren’t as hard in practice as theory predicts in the worst-case

    • average-case analysis is very hard (and often based on questionable assumptions)

  • Some “spectacular” successes

    • phase transition behaviour

    • local search methods

    • theory lagging behind algorithm design


Why is empirical CS/AI in vogue?

  • Compute power ever increasing

    • even “intractable” problems coming into range

    • easy to perform large (and sometimes meaningful) experiments

  • Empirical CS/AI perceived to be “easier” than theoretical CS/AI

    • often a false perception, as experiments are easier to mess up than proofs



Empirical Methods for CS

Part II: A Case Study


Case Study

  • Scheduling processors on ring network

    • jobs spawned as binary trees

  • KOSO

    • keep one, send one to my left or right arbitrarily

  • KOSO*

    • keep one, send one to my least heavily loaded neighbour
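
As a rough illustration of the two dispatch rules (this is not the simulator used in the study; the ring contents and the random tie-breaking below are invented), a minimal Python sketch:

```python
import random

def koso_target(ring, i):
    """KOSO: keep one job, send the other to the left or right neighbour, chosen arbitrarily."""
    left, right = (i - 1) % len(ring), (i + 1) % len(ring)
    return random.choice([left, right])

def koso_star_target(ring, i):
    """KOSO*: send the spawned job to the least heavily loaded neighbour."""
    left, right = (i - 1) % len(ring), (i + 1) % len(ring)
    return min((left, right), key=lambda j: ring[j])

# ring[j] = current queue length at processor j (made-up state for illustration)
ring = [4, 1, 7, 2, 5]
print(koso_target(ring, 2))       # 1 or 3, chosen arbitrarily
print(koso_star_target(ring, 2))  # 1, the less loaded of the two neighbours
```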


Theory

  • On complete binary trees, KOSO is asymptotically optimal

    • so KOSO* can’t be any better?

  • But the assumptions are unrealistic

    • the tree is not complete

    • asymptotic behaviour is not necessarily the same as behaviour in practice!


Lesson: Evaluation begins with claims
Lesson: Demonstration is good, understanding better

  • Hypothesis (or claim): KOSO takes longer than KOSO* because KOSO* balances loads better

    • The “because” phrase indicates a hypothesis about why it works. This is a better hypothesis than a beauty-contest demonstration that KOSO* simply beats KOSO

  • Experiment design

    • Independent variables: algorithm (KOSO vs KOSO*), no. of processors, no. of jobs, probability that a job will spawn

    • Dependent variable: time to complete jobs


Criticism: This experiment design includes no direct measure of the hypothesized effect

  • Hypothesis: KOSO takes longer than KOSO* because KOSO* balances loads better

  • But experiment design includes no direct measure of load balancing:

    • Independent variables: algorithm (KOSO vs KOSO*), no. of processors, no. of jobs, probability that a job will spawn

    • Dependent variable: time to complete jobs


Lesson: The task of empirical work is to explain variability

[Diagram: run-time as the dependent variable, influenced by the algorithm (KOSO/KOSO*), the number of processors, the number of jobs, and “random noise” (e.g., outliers)]

Empirical work assumes the variability in a dependent variable (e.g., run time) is the sum of causal factors and random noise. Statistical methods assign parts of this variability to the factors and the noise.

Number of processors and number of jobs explain 74% of the variance in run time. Algorithm explains almost none.
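
A minimal sketch of this kind of variance accounting on synthetic data (the data-generating model, coefficients, and noise level are invented for illustration and are not the KOSO data, so it will not reproduce the 74% figure): fit run time on each set of factors by least squares and compare the R² values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
procs = rng.integers(2, 32, n)        # number of processors
jobs = rng.integers(100, 1000, n)     # number of jobs
algo = rng.integers(0, 2, n)          # 0 = KOSO, 1 = KOSO* (hypothetical coding)
# Invented model: run time mostly driven by jobs and processors, tiny algorithm effect, plus noise
runtime = jobs / procs + 0.5 * algo + rng.normal(0, 5, n)

def r_squared(X, y):
    """Fraction of variance in y explained by a least-squares fit on the columns of X."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ beta).var() / y.var()

print("processors + jobs:", r_squared(np.column_stack([procs, jobs]), runtime))
print("algorithm alone:  ", r_squared(algo.reshape(-1, 1), runtime))
```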


Lesson: Keep the big picture in mind

Why are you studying this?

Load balancing is important to get good performance out of parallel computers

Why is this important?

Parallel computing promises to tackle many of our computational bottlenecks

How do we know this? It’s in the first paragraph of the paper!


Empirical Methods for CS

Part III: Experiment design


Experimental Life Cycle

  • Exploration

  • Hypothesis construction

  • Experiment

  • Data analysis

  • Drawing of conclusions


Types of experiment designs

  • Manipulation experiment

  • Observation experiment

  • Factorial experiment


Manipulation experiment

  • Independent variable, x

    • x=identity of parser, size of dictionary, …

  • Dependent variable, y

    • y=accuracy, speed, …

  • Hypothesis

    • x influences y

  • Manipulation experiment

    • change x, record y


Observation experiment

  • Predictor, x

    • x=volatility of stock prices, …

  • Response variable, y

    • y=fund performance, …

  • Hypothesis

    • x influences y

  • Observation experiment

    • classify according to x, compute y
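
A toy sketch of the classify-then-compute pattern (the fund records and the low/high volatility split below are invented purely for illustration):

```python
from statistics import mean

# (volatility of the stocks held, fund performance %) -- invented records
funds = [(0.12, 5.1), (0.35, 7.9), (0.08, 4.2), (0.41, 9.3), (0.22, 6.0)]

# Classify according to x (low vs high volatility), then compute y within each class
groups = {"low": [], "high": []}
for volatility, performance in funds:
    groups["high" if volatility >= 0.2 else "low"].append(performance)

for label, perf in groups.items():
    print(label, round(mean(perf), 2))
```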


Factorial experiment

  • Several independent variables, xi

    • there may be no simple causal links

    • data may come that way

      e.g. programs have different languages, algorithms, ...

  • Factorial experiment

    • every possible combination of xi considered

    • expensive as its name suggests!
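
A minimal sketch of why full factorial designs get expensive (the factor names and levels are invented): the number of runs is the product of the number of levels of each xi.

```python
from itertools import product

# Invented factors and levels
factors = {
    "language":   ["C", "Lisp", "Prolog"],
    "algorithm":  ["KOSO", "KOSO*"],
    "processors": [2, 4, 8, 16],
}

runs = list(product(*factors.values()))
print(len(runs), "runs")               # 3 * 2 * 4 = 24 runs, before any repetitions
for combo in runs[:3]:
    print(dict(zip(factors, combo)))
```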


Some problem issues

  • Control

  • Ceiling and Floor effects

  • Sampling Biases


Control

  • A control is an experiment in which the hypothesised variation does not occur

    • so the hypothesized effect should not occur either

  • BUT remember

    • placebos cure a large percentage of patients!


Control: MYCIN case study

  • MYCIN was a medical expert system

    • recommended therapy for blood/meningitis infections

  • How to evaluate its recommendations?

  • Shortliffe used

    • 10 sample problems, 8 therapy recommenders

      • 5 faculty, 1 resident, 1 postdoc, 1 student

    • 8 impartial judges gave 1 point per problem

    • max score was 80

    • MYCIN 65, faculty 40-60, postdoc 60, resident 45, student 30


Control: MYCIN case study

  • What were controls?

  • Control for judge’s bias for/against computers

    • judges did not know who recommended each therapy

  • Control for easy problems

    • medical student did badly, so problems not easy

  • Control for our standard being low

    • e.g. random choice should do worse

  • Control for factor of interest

    • e.g. hypothesis in MYCIN that “knowledge is power”

    • have groups with different levels of knowledge


Ceiling and Floor Effects

  • Well designed experiments (with good controls) can still go wrong

  • What if all our algorithms do particularly well?

    • Or they all do badly?

  • We’ve got little evidence to choose between them


Ceiling and Floor Effects

  • Ceiling effects arise when test problems are insufficiently challenging

    • floor effects are the opposite, when problems are too challenging

  • A problem in AI because we often repeatedly use the same benchmark sets

    • most benchmarks will lose their challenge eventually?

    • but how do we detect this effect?


Ceiling Effects: machine learning

  • 14 datasets from UCI corpus of benchmarks

    • used as mainstay of ML community

  • Problem is learning classification rules

    • each item is a vector of features and a classification

    • measure classification accuracy of method (max 100%)

  • Compare C4 with 1R*, two competing algorithms


Ceiling Effects: machine learning

DataSet:  BC    CH    GL    G2    HD    HE    ...  Mean
C4:       72    99.2  63.2  74.3  73.6  81.2  ...  85.9
1R*:      72.5  69.2  56.4  77    78    85.1  ...  83.8
Max:      72.5  99.2  63.2  77    78    85.1  ...  87.4

  • C4 achieves only about 2% higher accuracy than 1R*

  • The best of C4/1R* on each dataset achieves 87.4% mean accuracy

  • We have only weak evidence that C4 is better

  • Both methods perform near the ceiling of what is possible, so comparison is hard!
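
One quick check for a ceiling effect, using only the six columns visible above (so the printed means will not match the 14-dataset means on the slide): compare each learner with the best accuracy achieved on each dataset.

```python
from statistics import mean

datasets = ["BC", "CH", "GL", "G2", "HD", "HE"]
c4    = [72.0, 99.2, 63.2, 74.3, 73.6, 81.2]
one_r = [72.5, 69.2, 56.4, 77.0, 78.0, 85.1]

best = [max(a, b) for a, b in zip(c4, one_r)]   # per-dataset "ceiling" estimate
print("C4 mean:         ", round(mean(c4), 1))
print("1R* mean:        ", round(mean(one_r), 1))
print("best-of-two mean:", round(mean(best), 1))
# If both learners already sit close to the best accuracy achieved on these
# datasets, the benchmark leaves little room to separate them: a ceiling effect.
```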


Ceiling Effects: machine learning

  • In fact 1R* only uses one feature (the best one)

  • C4 uses on average 6.6 features

  • 5.6 features buy only about 2% improvement

  • Conclusion?

    • Either real world learning problems are easy (use 1R*)

    • Or we need more challenging datasets

    • We need to be aware of ceiling effects in results


Sampling bias

  • Data collection is biased against certain data

    • e.g. a teacher who says “Girls don’t answer maths questions”

    • observation might suggest:

      • that girls don’t answer many questions

      • but only because the teacher doesn’t ask them many questions

  • Experienced AI researchers don’t do that, right?


Sampling bias: Phoenix case study

  • AI system to fight (simulated) forest fires

  • Experiments suggested that wind speed was uncorrelated with the time to put out the fire

    • obviously incorrect, as high winds spread forest fires


Sampling bias: Phoenix case study

  • Wind Speed vs containment time (max 150 hours):

    3: 120 55 79 10 140 26 15 110 12 54 10 103

    6: 78 61 58 81 71 57 21 32 70

    9: 62 48 21 55 101

  • What’s the problem?


Sampling bias: Phoenix case study

  • The cut-off of 150 hours introduces sampling bias

    • many high-wind fires get cut off, but few low-wind fires do

  • On the remaining data, there is essentially no correlation between wind speed and containment time (r ≈ -0.05)

  • In fact, data shows that:

    • a lot of high wind fires take > 150 hours to contain

    • those that don’t are similar to low wind fires

  • You wouldn’t do this, right?

    • you might if you had automated data analysis.
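
For instance, feeding the censored table above straight into a correlation routine reproduces the misleading near-zero result; a small sketch using just the values listed on the earlier slide (which may be only an excerpt of the full Phoenix logs):

```python
from math import sqrt

# Wind speed vs containment time, as listed on the earlier slide (times capped at 150 hours)
fires = ([(3, t) for t in [120, 55, 79, 10, 140, 26, 15, 110, 12, 54, 10, 103]]
         + [(6, t) for t in [78, 61, 58, 81, 71, 57, 21, 32, 70]]
         + [(9, t) for t in [62, 48, 21, 55, 101]])

def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    n = len(pairs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

print(round(pearson(fires), 3))   # close to zero: the 150-hour cut-off hides the real effect
```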


Empirical Methods for CS

Part IV: Data analysis


Kinds of data analysis

  • Exploratory data analysis (EDA) – looking for patterns in data

  • Statistical inferences from sample data

    • Testing hypotheses

    • Estimating parameters

  • Building mathematical models of datasets

  • Machine learning, data mining…

  • We will introduce hypothesis testing and computer-intensive methods


The logic of hypothesis testing

  • Example: toss a coin ten times, observe eight heads. Is the coin fair (i.e., what is its long-run behavior?), and what is your residual uncertainty?

  • You say, “If the coin were fair, then eight or more heads is pretty unlikely, so I think the coin isn’t fair.”

  • Like proof by contradiction: assert the opposite (the coin is fair), show that the sample result (≥ 8 heads) has low probability p, then reject the assertion, with residual uncertainty related to p.

  • Estimate p with a sampling distribution.


The logic of hypothesis testing

[Chart: the sampling distribution of r, the number of heads in 10 tosses of a fair coin, for r = 0, 1, …, 10]

  • Establish a null hypothesis: H0: p = .5, the coin is fair

  • Establish a statistic: r, the number of heads in N tosses

  • Figure out the sampling distribution of r given H0

  • The sampling distribution will tell you the probability p of a result at least as extreme as your sample result, r = 8

  • If this probability is very low, reject the null hypothesis H0

  • Residual uncertainty is p
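
For this example the sampling distribution is Binomial(10, 0.5), so p can be computed exactly or estimated by simulation; a minimal sketch:

```python
from math import comb
import random

# Exact tail probability under H0 (fair coin, 10 tosses): P(r >= 8)
p_exact = sum(comb(10, r) for r in range(8, 11)) / 2 ** 10
print(round(p_exact, 4))          # 0.0547

# The same quantity estimated by simulating the sampling distribution
trials = 100_000
extreme = sum(sum(random.random() < 0.5 for _ in range(10)) >= 8 for _ in range(trials))
print(extreme / trials)           # about 0.055
```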


A common statistical test: The Z test for different means

  • A sample of N = 25 computer science students has mean IQ m = 135. Are they “smarter than average”?

  • Population mean is 100 with standard deviation 15

  • The null hypothesis, H0, is that the CS students are “average”, i.e., the mean IQ of the population of CS students is 100.

  • What is the probability p of drawing the sample if H0 were true? If p small, then H0 probably false.

  • Find the sampling distribution of the mean of a sample of size 25, from population with mean 100
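
A minimal check of the arithmetic, using the standard normal distribution from Python's standard library instead of a printed table:

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, sample_mean = 100, 15, 25, 135
se = sigma / sqrt(n)                   # standard error of the mean = 3
z = (sample_mean - mu0) / se           # Z = 11.67
p = 1 - NormalDist().cdf(z)            # Pr(Z >= 11.67) under H0
print(round(z, 2), p)                  # 11.67, effectively zero
```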


Reject the null hypothesis?

  • Commonly we reject the H0 when the probability of obtaining a sample statistic (e.g., mean = 135) given the null hypothesis is low, say < .05.

  • A test statistic value, e.g. Z = 11.67, recodes the sample statistic (mean = 135) to make it easy to find the probability of sample statistic given H0.

  • We find the probabilities by looking them up in tables, or statistics packages provide them.

    • For example, Pr(Z ≥ 1.67) ≈ .05; Pr(Z ≥ 2.33) ≈ .01.

  • Pr(Z ≥ 11.67) is essentially zero, so reject H0.


Summary of hypothesis testing

  • H0 negates what you want to demonstrate; find probability p of sample statistic under H0 by comparing test statistic to sampling distribution; if probability is low, reject H0 with residual uncertainty proportional to p.

  • Example: Want to demonstrate that CS graduate students are smarter than average. H0 is that they are average. t = 2.89, p ≤ .022

  • Have we proved CS students are smarter? NO!

  • We have only shown that mean = 135 is unlikely if they aren’t. We never prove what we want to demonstrate, we only reject H0, with residual uncertainty.

  • And failing to reject H0 does not prove H0, either!


Computer-intensive Methods

  • Basic idea: Construct sampling distributions by simulating on a computer the process of drawing samples.

  • Three main methods:

    • Monte Carlo simulation, when one knows the population parameters;

    • Bootstrap, when one doesn’t;

    • Randomization, which also assumes nothing about the population.

  • Enormous advantage: Works for any statistic and makes no strong assumptions (e.g., normality)
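
A minimal bootstrap sketch (the sample values are invented): resample the data with replacement many times to build the sampling distribution of a statistic, here the mean, and read off an interval.

```python
import random
from statistics import mean

sample = [135, 128, 141, 122, 130, 138, 126, 144, 133, 129]   # invented IQ scores

def bootstrap_means(data, reps=10_000):
    """Sampling distribution of the mean, built by resampling with replacement."""
    return sorted(mean(random.choices(data, k=len(data))) for _ in range(reps))

dist = bootstrap_means(sample)
lo, hi = dist[int(0.025 * len(dist))], dist[int(0.975 * len(dist))]
print(f"95% bootstrap interval for the mean: ({lo:.1f}, {hi:.1f})")
```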


Summary

  • Empirical CS and AI are exacting sciences

    • There are many ways to do experiments wrong

  • We are experts in doing experiments badly

  • As you perform experiments, you’ll make many mistakes

  • Learn from those mistakes, and ours!

