Empirical Research Methods in Computer Science
Empirical Research Methods in Computer Science Lecture 4 November 2, 2005 Noah Smith
Today • Review bootstrap estimate of se (from homework). • Review sign and permutation tests for paired samples. • Lots of examples of hypothesis tests.
Recall ... • There is a true value of the statistic. But we don’t know it. • We can compute the sample statistic. • We know sample means are normally distributed (as n gets big): x̄ ≈ N(μ, σ²/n).
But we don’t know anything about the distribution of other sample statistics (medians, correlations, etc.)!
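This is easy to check by simulation: draw many samples from a decidedly non-normal distribution and watch the sample means concentrate around μ with spread σ/√n. A minimal sketch (the exponential distribution is an illustrative choice, not from the lecture):

```python
import random
import statistics

rng = random.Random(0)   # fixed seed for reproducibility
n, B = 100, 2000         # sample size, number of repeated samples

# Exponential(1) has mean 1 and sd 1, and is strongly skewed --
# yet its sample means should look like N(1, 1/n).
means = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(B)]

print(statistics.mean(means))   # close to mu = 1
print(statistics.stdev(means))  # close to sigma / sqrt(n) = 0.1
```

The same simulation with the sample *median* as the statistic would give a spread with no such simple closed form, which is exactly why the bootstrap is useful.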
Bootstrap world • real world: unknown distribution F → observed random sample x → statistic of interest θ̂ • bootstrap world: empirical distribution F̂ → bootstrap random sample x* → bootstrap replication θ̂* • the spread of the replications gives statistics about the estimate (e.g., standard error)
Bootstrap estimate of se • Run B bootstrap replicates, and compute the statistic each time: θ*[1], θ*[2], θ*[3], ..., θ*[B] • compute the mean of θ* across replications • ŝe = the sample standard deviation of θ* across replications
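The recipe above is only a few lines of code. A sketch (function and variable names are mine, not from the lecture): resample with replacement B times, recompute the statistic each time, and take the standard deviation of the replications.

```python
import random
import statistics

def bootstrap_se(sample, stat, B=2000, seed=0):
    """Bootstrap estimate of the standard error of stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    # theta*[1..B]: the statistic recomputed on B resamples (with replacement)
    reps = [stat([sample[rng.randrange(n)] for _ in range(n)])
            for _ in range(B)]
    # se-hat = sample standard deviation of the replications
    return statistics.stdev(reps)

data = list(range(100))
# For the mean we can sanity-check against the closed form sigma/sqrt(n) ~ 2.89,
# but the same call works for medians, correlations, etc.
print(bootstrap_se(data, statistics.mean))
```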
Paired-Sample Design • pairs (xi, yi) • x ~ distribution F • y ~ distribution G • How do F and G differ?
Sign Test • H0: F and G have the same median, i.e., median(F) – median(G) = 0 • under H0, Pr(x > y) = 0.5 • sign(x – y) ~ binomial distribution • compute the tail probability of N+ (the number of pairs with x > y) under Binomial(n, 0.5)
Sign Test • nonparametric (no distributional assumptions about the data) • closed form (no random sampling needed)
Example: gzip speed • build gzip with -O2 or with -O0 • on about 650 files out of 1000, gzip -O2 was faster • binomial distribution, p = 0.5, n = 1000 • p < 3 × 10⁻²⁴
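The p-value on this slide is just a binomial tail sum, computable exactly with integer arithmetic. A sketch (650 is the slide’s approximate count):

```python
from math import comb

def sign_test_p(n_plus, n):
    """One-sided sign-test p-value: P(X >= n_plus) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(n_plus, n + 1)) / 2**n

# ~650 of 1000 files favored gzip -O2
print(sign_test_p(650, 1000))  # astronomically small
print(sign_test_p(500, 1000))  # a 50/50 split: p around one half
```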
Permutation Test • H0: F = G • Suppose the difference in sample means is d. • How likely is a difference this large (or larger) under H0? • For i = 1 to P: • randomly swap the elements within each pair (xi, yi) • compute the difference in sample means • The p-value is the fraction of permutations whose difference is at least as extreme as d.
Permutation Test • nonparametric (no distributional assumptions about the data) • randomized test (uses random sampling)
Example: gzip speed • 1000 permutations: the difference of sample means under H0 is centered on 0 • the observed difference, -1579, is very extreme; p ≈ 0
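For paired data, “permuting” a pair means swapping its x and y, which simply flips the sign of that pair’s difference. A sketch of the whole test (function and variable names are mine, not from the lecture):

```python
import random

def paired_permutation_p(x, y, P=10000, seed=0):
    """Two-sided permutation test for paired samples."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    observed = sum(diffs) / len(diffs)
    extreme = 0
    for _ in range(P):
        # swapping within a pair == flipping the sign of its difference
        perm = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(perm) >= abs(observed):
            extreme += 1
    return extreme / P

x = list(range(1, 21))
y = [v - 1 for v in x]             # every pair differs by exactly 1
print(paired_permutation_p(x, y))  # tiny: so consistent a shift is rare under H0
print(paired_permutation_p(x, x))  # identical samples: p = 1.0
```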
Comparing speed is tricky! • It is very difficult to control for everything that could affect runtime. • Solution 1: do the best you can. • Solution 2: many runs, and then do ANOVA tests (or their nonparametric equivalents). “Is there more variance between conditions than within conditions?”
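The quoted question maps directly onto the one-way ANOVA F statistic: the mean square *between* conditions divided by the mean square *within* conditions. A hand-rolled sketch (the data here are illustrative, not the lecture’s measurements):

```python
def f_statistic(groups):
    """One-way ANOVA F: between-group over within-group mean square."""
    k = len(groups)                          # number of conditions
    n = sum(len(g) for g in groups)          # total observations
    grand = sum(sum(g) for g in groups) / n
    # variance of the group means around the grand mean ("between")
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # variance of observations around their own group mean ("within")
    ssw = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Conditions with very different means: F >> 1
print(f_statistic([[1, 2, 1, 2], [10, 11, 10, 11], [20, 21, 20, 21]]))
```

A large F says the between-condition variance swamps the within-condition noise; the nonparametric analogue mentioned later in these slides is Kruskal-Wallis.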
Sampling method 1 • for r = 1 to 10 • for each file f • for each program p • time p on f
Result (gzip first) student 2’s program faster than gzip!
Result (student first) student 2’s program is slower than gzip!
Sampling method 1 • for r = 1 to 10 • for each file f • for each program p • time p on f
Order effects • Well-known in psychology. • What the subject does at time t will affect what she does at time t+1.
Sampling method 2 • for r = 1 to 10 • for each program p • for each file f • time p on f
Result gzip wins
Sign and Permutation Tests • [Diagram, built up over four slides: the space of all distribution pairs (F, G), with one region where the sign test rejects H0 (median(F) ≠ median(G)) and one where the permutation test rejects H0; the two rejection regions overlap but neither contains the other.]
There are other tests! • We have chosen two that are • nonparametric • easy to implement • Others include: • Wilcoxon Signed Rank Test • Kruskal-Wallis (nonparametric “ANOVA”)
Pre-increment? • Conventional wisdom: “Better to use ++x than to use x++.” • Really, with a modern compiler?
Two (toy) programs

for (i = 0; i < (1 << 30); ++i)
    j = ++k;

for (i = 0; i < (1 << 30); i++)
    j = k++;

• ran each 200 times (interleaved) • mean runtimes were 2.835 and 2.735 • the difference is significant well below .05
What?

++k:
    leal -8(%ebp), %eax
    incl (%eax)
    movl -8(%ebp), %eax

k++:
    movl -8(%ebp), %eax
    leal -8(%ebp), %edx
    incl (%edx)

• %edx is not used anywhere else
Conclusion • Compile with -O and the assembly code is identical!
Pre-increment, take 2 • Take gzip source code. • Replace all post-increments with pre-increments, in places where semantics won’t change. • Run on 1000 files, 10 times each. • Compare average runtime by file.
Sign test • p = 8.5 × 10⁻⁸
Conclusion • Pre-incrementing is faster! • ... but what about -O? • sign test: p = 0.197 • permutation test: p = 0.672 • Pre-increment matters only without an optimizing compiler.
Your programs ... • 8 students had a working program both weeks. • 6 people changed their code. • 1 person changed nothing. • 1 person changed to -O3. • 3 people’s programs were lossy in week 1. • Everyone’s was lossy in week 2!
Your programs! • Was there an improvement on compression between the two versions? • H0: No. • Find sampling distribution of difference in means, using permutations.
Homework Assignment 2 6 experiments: • Does your program compress text or images better? • What about variance of compression? • What about gzip’s compression? • Variance of gzip’s compression? • Was there a change in the compression of your program from week 1 to week 2? • In the runtime?
Remainder of the course • 11/9: EDA • 11/16: Regression and learning • 11/23: Happy Thanksgiving! • 11/30: Statistical debugging • 12/7: Review, Q&A • Saturday 12/17, 2-5pm: Exam