Empirical Research Methods in Computer Science
Empirical Research Methods in Computer Science Lecture 4 November 2, 2005 Noah Smith
Today • Review bootstrap estimate of se (from homework). • Review sign and permutation tests for paired samples. • Lots of examples of hypothesis tests.
Recall ... • There is a true value of the statistic. But we don’t know it. • We can compute the sample statistic. • We know sample means are normally distributed (as n gets big): x̄ ≈ N(μ, σ²/n).
But we don’t know anything about the distribution of other sample statistics (medians, correlations, etc.)!
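This is easy to check by simulation: draw many samples from a decidedly non-normal distribution and watch the sample means concentrate around μ with spread σ/√n. A minimal sketch (the exponential distribution is an illustrative choice, not from the lecture):

```python
import random
import statistics

rng = random.Random(0)   # fixed seed for reproducibility
n, B = 100, 2000         # sample size, number of repeated samples

# Exponential(1) has mean 1 and sd 1, and is strongly skewed --
# yet its sample means should look like N(1, 1/n).
means = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(B)]

print(statistics.mean(means))   # close to mu = 1
print(statistics.stdev(means))  # close to sigma / sqrt(n) = 0.1
```

The same simulation with the sample *median* as the statistic would give a spread with no such simple closed form, which is exactly why the bootstrap is useful.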
Bootstrap world • real world: unknown distribution F → observed random sample x → statistic of interest θ̂ • bootstrap world: empirical distribution F̂ → bootstrap random sample x* → bootstrap replication θ̂* • the spread of the replications gives statistics about the estimate (e.g., standard error)
Bootstrap estimate of se • Run B bootstrap replicates, and compute the statistic each time: θ*[1], θ*[2], θ*[3], ..., θ*[B] • compute the mean of θ* across replications • ŝe = the sample standard deviation of θ* across replications
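The recipe above is only a few lines of code. A sketch (function and variable names are mine, not from the lecture): resample with replacement B times, recompute the statistic each time, and take the standard deviation of the replications.

```python
import random
import statistics

def bootstrap_se(sample, stat, B=2000, seed=0):
    """Bootstrap estimate of the standard error of stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    # theta*[1..B]: the statistic recomputed on B resamples (with replacement)
    reps = [stat([sample[rng.randrange(n)] for _ in range(n)])
            for _ in range(B)]
    # se-hat = sample standard deviation of the replications
    return statistics.stdev(reps)

data = list(range(100))
# For the mean we can sanity-check against the closed form sigma/sqrt(n) ~ 2.89,
# but the same call works for medians, correlations, etc.
print(bootstrap_se(data, statistics.mean))
```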
Paired-Sample Design • pairs (xi, yi) • x ~ distribution F • y ~ distribution G • How do F and G differ?
Sign Test • H0: F and G have the same median, i.e., median(F) – median(G) = 0 • under H0, Pr(x > y) = 0.5 • sign(x – y) ~ binomial distribution • compute the tail probability of N+ (the number of pairs with x > y) under Binomial(n, 0.5)
Sign Test • nonparametric (no distributional assumptions about the data) • closed form (no random sampling needed)
Example: gzip speed • build gzip with -O2 or with -O0 • on about 650 files out of 1000, gzip -O2 was faster • binomial distribution, p = 0.5, n = 1000 • p < 3 × 10⁻²⁴
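The p-value on this slide is just a binomial tail sum, computable exactly with integer arithmetic. A sketch (650 is the slide’s approximate count):

```python
from math import comb

def sign_test_p(n_plus, n):
    """One-sided sign-test p-value: P(X >= n_plus) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(n_plus, n + 1)) / 2**n

# ~650 of 1000 files favored gzip -O2
print(sign_test_p(650, 1000))  # astronomically small
print(sign_test_p(500, 1000))  # a 50/50 split: p around one half
```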
Permutation Test • H0: F = G • Suppose the difference in sample means is d. • How likely is a difference this large (or larger) under H0? • For i = 1 to P: • randomly swap the elements within each pair (xi, yi) • compute the difference in sample means • The p-value is the fraction of permutations whose difference is at least as extreme as d.
Permutation Test • nonparametric (no distributional assumptions about the data) • randomized test (uses random sampling)
Example: gzip speed • 1000 permutations: the difference of sample means under H0 is centered on 0 • the observed difference, -1579, is very extreme; p ≈ 0
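For paired data, “permuting” a pair means swapping its x and y, which simply flips the sign of that pair’s difference. A sketch of the whole test (function and variable names are mine, not from the lecture):

```python
import random

def paired_permutation_p(x, y, P=10000, seed=0):
    """Two-sided permutation test for paired samples."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    observed = sum(diffs) / len(diffs)
    extreme = 0
    for _ in range(P):
        # swapping within a pair == flipping the sign of its difference
        perm = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(perm) >= abs(observed):
            extreme += 1
    return extreme / P

x = list(range(1, 21))
y = [v - 1 for v in x]             # every pair differs by exactly 1
print(paired_permutation_p(x, y))  # tiny: so consistent a shift is rare under H0
print(paired_permutation_p(x, x))  # identical samples: p = 1.0
```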
Comparing speed is tricky! • It is very difficult to control for everything that could affect runtime. • Solution 1: do the best you can. • Solution 2: many runs, and then do ANOVA tests (or their nonparametric equivalents). “Is there more variance between conditions than within conditions?”
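The quoted question maps directly onto the one-way ANOVA F statistic: the mean square *between* conditions divided by the mean square *within* conditions. A hand-rolled sketch (the data here are illustrative, not the lecture’s measurements):

```python
def f_statistic(groups):
    """One-way ANOVA F: between-group over within-group mean square."""
    k = len(groups)                          # number of conditions
    n = sum(len(g) for g in groups)          # total observations
    grand = sum(sum(g) for g in groups) / n
    # variance of the group means around the grand mean ("between")
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # variance of observations around their own group mean ("within")
    ssw = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Conditions with very different means: F >> 1
print(f_statistic([[1, 2, 1, 2], [10, 11, 10, 11], [20, 21, 20, 21]]))
```

A large F says the between-condition variance swamps the within-condition noise; the nonparametric analogue mentioned later in these slides is Kruskal-Wallis.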
Sampling method 1 • for r = 1 to 10 • for each file f • for each program p • time p on f
Result (gzip first) student 2’s program faster than gzip!
Result (student first) student 2’s program is slower than gzip!
Sampling method 1 • for r = 1 to 10 • for each file f • for each program p • time p on f
Order effects • Well-known in psychology. • What the subject does at time t will affect what she does at time t+1.
Sampling method 2 • for r = 1 to 10 • for each program p • for each file f • time p on f
Result gzip wins
Sign and Permutation Tests • [Diagram, built up over four slides: the space of all distribution pairs (F, G), with one region where the sign test rejects H0 (median(F) ≠ median(G)) and one where the permutation test rejects H0; the two rejection regions overlap but neither contains the other.]
There are other tests! • We have chosen two that are • nonparametric • easy to implement • Others include: • Wilcoxon Signed Rank Test • Kruskal-Wallis (nonparametric “ANOVA”)
Pre-increment? • Conventional wisdom: “Better to use ++x than to use x++.” • Really, with a modern compiler?
Two (toy) programs

for (i = 0; i < (1 << 30); ++i)
    j = ++k;

for (i = 0; i < (1 << 30); i++)
    j = k++;

• ran each 200 times (interleaved) • mean runtimes were 2.835 and 2.735 • the difference is significant well below .05
What?

++k:
    leal -8(%ebp), %eax
    incl (%eax)
    movl -8(%ebp), %eax

k++:
    movl -8(%ebp), %eax
    leal -8(%ebp), %edx
    incl (%edx)

• %edx is not used anywhere else
Conclusion • Compile with -O and the assembly code is identical!
Pre-increment, take 2 • Take gzip source code. • Replace all post-increments with pre-increments, in places where semantics won’t change. • Run on 1000 files, 10 times each. • Compare average runtime by file.
Sign test • p = 8.5 × 10⁻⁸
Conclusion • Pre-incrementing is faster! • ... but what about -O? • sign test: p = 0.197 • permutation test: p = 0.672 • Pre-increment matters only without an optimizing compiler.
Your programs ... • 8 students had a working program both weeks. • 6 people changed their code. • 1 person changed nothing. • 1 person changed to -O3. • 3 people’s programs were lossy in week 1. • Everyone’s was lossy in week 2!
Your programs! • Was there an improvement on compression between the two versions? • H0: No. • Find sampling distribution of difference in means, using permutations.
Homework Assignment 2 6 experiments: • Does your program compress text or images better? • What about variance of compression? • What about gzip’s compression? • Variance of gzip’s compression? • Was there a change in the compression of your program from week 1 to week 2? • In the runtime?
Remainder of the course • 11/9: EDA • 11/16: Regression and learning • 11/23: Happy Thanksgiving! • 11/30: Statistical debugging • 12/7: Review, Q&A • Saturday 12/17, 2-5pm: Exam