
Evaluation and Methodology For Experimental Computer Science


Presentation Transcript


  1. Evaluation and Methodology for Experimental Computer Science
     Steve Blackburn, Research School of Computer Science, Australian National University
     PhD Workshop, May 2012

  2. Research: Solving problems without known answers

  3. (Image-only slide.)

  4. Quantitative Experimentation [Blackburn, Diwan, Hauswirth, Sweeney et al. 2012]
     • Experiment: measure A and B in the context of C
     • Claim: “A is better than B”
     Does the experiment support the claim?

  5. Quantitative Experimentation: Scope of Claim & Experiment [Blackburn, Diwan, Hauswirth, Sweeney et al. 2012]
     • A claim with broad scope is hard to satisfy
       • “We improve Java programs by 10%”
       • Implicitly, this means all Java programs in all circumstances
       • The scope of the experiment is limited by resources
     • A claim with narrow scope is uninteresting
       • “We improve Java on lusearch on an i7 on … by 10%”
     The scope of the claim is the key tension.

  6. Quantitative Experimentation: Components of an Experiment [Blackburn, Diwan, Hauswirth, Sweeney et al. 2012]
     • Measurement context: the software and hardware components varied or held constant
     • Workloads: the benchmarks and their inputs used in the experiment
     • Metrics: the properties to measure and how to measure them
     • Data analysis and interpretation: how to analyze the data and how to interpret the results
     (The measurement context and workloads are the control / independent variables; the metrics are the dependent variables.)
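
To make these components concrete, here is a minimal sketch (not from the slides) of recording all four components of an experiment explicitly before running it; the benchmark names, metric, and field values are hypothetical placeholders, not results.

```python
# Minimal sketch: write down every component of an experiment explicitly,
# so the claim, context, workloads, and metric are all part of the record.
from dataclasses import dataclass, field

@dataclass
class Experiment:
    claim: str        # the claim the experiment is meant to support
    context: dict     # hardware/software varied or held constant (independent variables)
    workloads: list   # benchmarks and their inputs (independent variables)
    metric: str       # what is measured and how (dependent variable)
    results: dict = field(default_factory=dict)

# Hypothetical example values, for illustration only.
exp = Experiment(
    claim="System A is faster than system B on these Java workloads",
    context={"cpu": "Intel i7", "os": "Linux (fixed distro image)", "heap": "2x minimum"},
    workloads=["lusearch", "xalan", "pmd"],
    metric="wall-clock execution time (seconds), 20 invocations per configuration",
)
print(exp.claim)
```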

  7. Quantitative Experimentation: Experimental Pitfalls (the four “I”s) [Blackburn, Diwan, Hauswirth, Sweeney et al. 2012]
     • Inappropriate: experiments that are inappropriate (or surplus) to the claim
     • Ignored: elements relevant to the claim, but omitted
     • Inconsistent: elements are treated inconsistently
     • Irreproducible: others cannot reproduce the experiment

  8. Quantitative Experimentation: Components × Pitfalls [Blackburn, Diwan, Hauswirth, Sweeney et al. 2012]
     • A measurement context is inconsistent when an experiment compares two systems and uses different measurement contexts for each system. The different contexts may produce incomparable results for the two systems. Unfortunately, the more disparate the objects of comparison, the more difficult it is to ensure consistent measurement contexts. Even a benign-looking difference in contexts can introduce bias and make measurement results incomparable. For this reason, it may be important to randomize experimental parameters (e.g., memory layout in experiments that measure performance); see the sketch after this slide.
     • If the measurement context is irreproducible then the experiment is also irreproducible. Measurement contexts may be irreproducible because either they are not public or they are not documented.
     • A measurement context is inappropriate when it is flawed or does not reflect the measurement context that is implicit in the claim. This may become manifest as an error or as a distraction (a “red herring”).
     • An aspect of the measurement context is ignored when an experiment design does not consider it even when it is necessary to support the claim.
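
As one illustration of the randomization point above, here is a minimal sketch (not from the slides): it shuffles the run order of the two systems and perturbs memory layout between runs by varying the size of the process environment. The ./run_benchmark harness, the LAYOUT_PADDING variable, and the configuration names are hypothetical.

```python
# Minimal sketch: randomize run order and perturb the environment so a
# fixed, accidental memory layout does not silently favour one system.
import os
import random
import subprocess

CONFIGS = ["system-A", "system-B"]   # the two systems being compared (placeholders)
TRIALS = 20

runs = CONFIGS * TRIALS
random.shuffle(runs)                 # interleave the systems in random order

for config in runs:
    # Padding the environment shifts the initial stack position, which
    # perturbs memory layout from run to run.
    env = dict(os.environ, LAYOUT_PADDING="x" * random.randint(0, 4096))
    subprocess.run(["./run_benchmark", "--config", config],   # hypothetical harness
                   env=env, check=True)
```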

  9. Advice

  10. Advice: Ingrained, Systematic Skepticism
     Too good to be true? Probably.
     • Is the result repeatable? If it is not, it’s nothing more than noise.
     • Is the result plausible? You need to possess a clear idea of what is plausible.
     • Can you explain the result? Plausible support for the hypothesis is essential.
     See also: Street-Fighting Mathematics, MIT OpenCourseWare 18.098 / 6.099.
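
A minimal sketch (not from the slides) of the “is the result repeatable?” check: repeat the same measurement and look at its spread before believing a difference; measure_once is a hypothetical stand-in for whatever the experiment actually times.

```python
# Minimal sketch: repeat a measurement and inspect its spread before
# trusting it. measure_once() is a hypothetical placeholder workload.
import statistics
import time

def measure_once():
    start = time.perf_counter()
    sum(range(1_000_000))            # dummy work standing in for the real benchmark
    return time.perf_counter() - start

samples = [measure_once() for _ in range(30)]
mean = statistics.mean(samples)
cv = statistics.stdev(samples) / mean     # coefficient of variation
print(f"mean={mean:.6f}s  cv={cv:.1%}  min={min(samples):.6f}s  max={max(samples):.6f}s")
# If the spread is comparable to the effect being claimed, the "result"
# may be nothing more than noise.
```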

  11. Advice: Clean Environment
     Just as essential as a clean lab is for a bio scientist.
     • Clean OS & distro: all machines run the same image of the same distro
     • Clean hardware: buy machines in pairs (redundancy & sanity checks)
     • Know what is running: no NFS mounts, no non-essential daemons
     • Machine reservation system: ensure only you are using the machine
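
A minimal sketch (not from the slides) of the “know what is running” point: snapshot the kernel, process list, and mounts just before a run, so an unexpected daemon or NFS mount shows up in the record. The commands are standard Linux tools; the output directory is a hypothetical choice.

```python
# Minimal sketch: record what the machine looked like just before a run.
import subprocess
from pathlib import Path

def snapshot_environment(out_dir="env-snapshots"):        # hypothetical output directory
    Path(out_dir).mkdir(exist_ok=True)
    for name, cmd in [("uname", ["uname", "-a"]),
                      ("processes", ["ps", "-eo", "pid,comm"]),
                      ("mounts", ["mount"])]:
        result = subprocess.run(cmd, capture_output=True, text=True)
        (Path(out_dir) / f"{name}.txt").write_text(result.stdout)

snapshot_environment()
```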

  12. Advice: Repeatability and Accountability
     Disk is cheap: don’t throw anything away.
     • All experiments should be scripted
     • Log every experiment: capture the environment and output in the log
     • Keep logs (forever)
     • Publish your raw data, downloadable from your web site
     • If you’re not comfortable with this, you probably should not be publishing
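
A minimal sketch (not from the slides) of scripting and logging every run: each invocation leaves behind a timestamped log that captures the command, the environment, and the raw output, and nothing is ever overwritten. The ./run_benchmark command and logs/ directory are hypothetical placeholders.

```python
# Minimal sketch: every experiment is scripted, and every run leaves a
# permanent, timestamped log of its command, environment, and output.
import datetime
import json
import os
import subprocess
from pathlib import Path

def run_and_log(cmd, log_dir="logs"):                      # hypothetical log directory
    Path(log_dir).mkdir(exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    log_file = Path(log_dir) / f"run-{stamp}.json"
    result = subprocess.run(cmd, capture_output=True, text=True)
    log_file.write_text(json.dumps({
        "command": cmd,
        "environment": dict(os.environ),                   # capture the environment verbatim
        "stdout": result.stdout,
        "stderr": result.stderr,
        "returncode": result.returncode,
    }, indent=2))
    return result

run_and_log(["./run_benchmark", "--config", "system-A"])   # hypothetical harness
```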

  13. Advice: Statistics
     Lies, damn lies, and statistics.
     • Understand basic statistics
     • Are your results statistically significant?
     • Report confidence intervals
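
A minimal sketch (not from the slides) of reporting a confidence interval rather than a bare mean; it assumes SciPy is available, and the sample values are purely illustrative.

```python
# Minimal sketch: report a 95% confidence interval on the mean of
# repeated measurements, not just the mean itself.
import statistics
from scipy import stats                      # assumes SciPy is installed

samples = [102.1, 99.8, 101.5, 100.2, 103.0, 98.7, 101.1, 100.9]  # illustrative timings (ms)

mean = statistics.mean(samples)
sem = stats.sem(samples)                     # standard error of the mean
low, high = stats.t.interval(0.95, df=len(samples) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f} ms, 95% CI = [{low:.1f}, {high:.1f}] ms")
# If two systems' confidence intervals overlap heavily, the measured
# difference may not be statistically significant.
```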

  14. Advice: Good Tools
     Good evaluation infrastructure gives you an edge.
     • Good data management system: easy manipulation and recovery of data
     • Good data analysis tools: see results that others can’t, and share them with your collaborators
     • Good workloads: realistic workloads are key to credibility
     • Good teamwork: resist the temptation to write your own tools; work as a team

  15. Advice: Good Tools (continued; image slide)

  16. Questions?
