Explore the concepts of replication and reproducibility in scientific research, learn key organizational techniques and documentation methods for efficient and transparent projects. Get insights on directory structures, file management, and collaboration tips to enhance research impact and productivity. Discover literate programming, R Markdown, and valuable resources for structuring your scientific work effectively.
Reproducible Research Jonathon LeFaive, University of Michigan Big Data Summer Institute June 18, 2019
Replicating vs. Reproducing 2019 Big Data Summer Institute
Replication
• “Replication is the ultimate standard by which scientific claims are judged” —Roger Peng
• Come to the same conclusions with different data and protocols
• Replicable = results are consistent and reliable
Reproduction
• “An attainable minimum standard for assessing the value of scientific claims, particularly when full independent replication of a study is not feasible” —Roger Peng
• Obtain the same results with the same data and the same protocols
• Reproducible = methods are consistent and reliable
Reproducible != True
Motivation for reproducibility
The reproducibility crisis
Why make my work reproducible?
• Transparency
• Save time
• Keep track of complex projects
• Higher impact
• More efficient collaboration
http://science.sciencemag.org/content/334/6060/1226.full
The Reproducibility Toolkit
• Well-organized projects
• Documentation
• Version control
• Capturing the computational environment
• Automation
Organizing your projects
Exercise: Evaluate the structure of past projects
• On your own:
  • Find a project you have worked on in the past
  • Sketch out the directory structure
  • (If applicable) sketch out the flow of information
  • Write down any naming conventions you used for files/folders
• With a partner:
  • A collaborator wants to reproduce your project—what instructions would you give them?
  • What would you change about your organizational structure to make this easier?
http://pgbovine.net/research-directory-structure.htm
Challenges of organization
• Developing an intuitive directory structure
• Coming up with good names for things
• How long/when/where to keep intermediate data files?
• Dealing with the clutter of old stuff/temp files
• Keeping track of dependencies
• Managing backups/previous versions
• Collaborating effectively with shared project content
  • Different local versions, determining the structure of the shared directory, content spread across computers, cloud apps, etc.
Tips for organizing your projects
• Develop your own system and be consistent!
• Think about where to put files you haven’t yet created
• Separate raw from processed data
• Separate code from data
• Use file shortcuts to avoid unnecessary duplicates
• File/folder names and paths should be self-explanatory
• Code as verbs, data as nouns
Documentation
Documentation techniques
• Including README files in project directories
• Source code comments
• Self-documenting code
• Literate programming
README

project-name/
+- data/
|  +- README.md
|  +- genotypes.data
|  +- phenotypes.data
+- scripts/
|  +- README.md
|  +- normalize-input.sh
|  +- run-association.sh
+- README.md
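As a sketch of what one of these README files might contain (the specific contents here are hypothetical, not from the original slides), data/README.md could look like:

```markdown
# data/

Raw input data for the association analysis. Do not edit these files by hand.

- genotypes.data   — genotype calls (record the source, download date, and version)
- phenotypes.data  — phenotype measurements (record the source and collection protocol)

See scripts/README.md for the processing steps applied to these files.
```

Even a few lines like this tell a collaborator where each file came from and which scripts consume it.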
Documenting code

// Cryptic: what are n, icnt, and gs?
n = (icnt + gs - 1) / gs;

// Self-documenting code: a named helper with named arguments
uint ceil_divide(uint x, uint y) { return (x + y - 1) / y; }
group_count = ceil_divide(item_count, max_group_size);

/* Calculates number of groups by ceil-dividing item count by max group size. */
group_count = (item_count + max_group_size - 1) / max_group_size;

// Descriptive names alone can make the comment unnecessary
group_count = (item_count + max_group_size - 1) / max_group_size;
Doxygen-style documentation (roxygen2 in R)

#' Adds together two numbers.
#'
#' @param x Left hand number.
#' @param y Right hand number.
#' @return The sum of x and y.
#' @examples
#' add(1, 1)
#' add(10, 1)
add <- function(x, y) {
  x + y
}
“Literate programming is a methodology that combines a programming language with a documentation language, thereby making programs more robust, more portable, more easily maintained, and arguably more fun to write than programs that are written only in a high-level language.” —Donald Knuth
Literate programming
• Literate programming allows you to encapsulate & share every aspect of your analysis in an interactive and descriptive way:
  • Data provenance
  • Dependencies
  • Code
  • Documentation
  • Tables/Figures/References etc.
http://www.datacarpentry.org/rr-literate-programming/02-literate-programming/
R Markdown
• Literate programming language implemented in RStudio
• Integrates standard R code with Markdown text formatting (+ other stuff!)
• knitr: typesetting package used to render your R Markdown document as a web page, PDF, Word document, etc.
Structure of R Markdown documents
[Screenshot: an R Markdown source file with its three parts labeled: a YAML-formatted header, Markdown-formatted text, and a code chunk]
https://blogs.uoregon.edu/rclub/2016/04/26/r-markdown-resources/
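A minimal R Markdown source file, as a sketch (the title, author, and chunk contents here are hypothetical), combines those three parts:

````markdown
---
title: "My Analysis"
author: "Your Name"
output: html_document
---

## Results

The mean of the measurements is computed below.

```{r}
x <- c(1, 2, 3)
mean(x)
```
````

When rendered, the header fills in the document metadata, the Markdown becomes formatted text, and the code chunk is executed with its output shown inline.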
A rendered R Markdown document (HTML)
[Screenshot: the rendered page, showing information from the YAML header, the text, and the evaluated R code]
https://blogs.uoregon.edu/rclub/2016/04/26/r-markdown-resources/
R Notebooks
• Like R Markdown documents, but you can evaluate code chunks independently, without needing to render the entire document
Exercise: Create a new R Markdown document or an R Notebook
• Create a new document/notebook & try the commands listed in the tutorial
Jupyter notebooks
• Similar to R Notebooks: a literate & interactive programming environment combining Markdown documentation with code chunks
• Popular with Python users
• Supports many different kernels (e.g., you can run R code in Jupyter notebooks)
Version Control
Version control gone wrong
Version control system (VCS)
• System for managing changes to files
• No more duplicate files!
• Full history of revisions is accessible
• Revisions can be compared, restored, and merged
• Multiple team members can edit files; the VCS handles merging
Git: a flexible VCS
• Git is a distributed version control system
  • i.e., it does not require a central server
• Flexible branching design
• Faster than non-distributed systems
GitHub: a cloud-based VCS hosting platform
• Central repository for storing Git-enabled projects
• Issue tracking, wiki, package hosting, …
• Extremely popular and community-oriented!
  • >36 million users, >100 million code repositories
Setting up GitHub
• Create a GitHub account: https://github.com/join
• Windows users: install GitHub Desktop from https://desktop.github.com/
  • This will install Git Shell on your PC—check that you have this in your programs
• Mac/Linux users: type git in your terminal (Windows users use Git Shell); you should see usage info—if not, you should be prompted to install it
• Detailed instructions: https://help.github.com/articles/set-up-git/
Git terminology
• A repository is a directory containing your project files and the Git metadata
• A branch is a development arm of a project
• A commit is a checkpoint for changes you have made
• Your commits are added to a remote repository by pushing
• You can grab someone else’s revisions by pulling
[Diagram: commits along branches within a repository]
History tracking
• Git tracks every committed change—you can access and revert to any commit in the repository’s history
• No need to manually save old versions!
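As a minimal sketch of history tracking (the file name, commit messages, and identity below are made up for illustration), the following shell session makes two commits and then restores the file as it was at the first one:

```shell
set -e
repo=$(mktemp -d)           # throwaway directory for the demo
cd "$repo"
git init -q
git config user.email "demo@example.com"   # local identity for this repo only
git config user.name "Demo User"

echo "version 1" > notes.txt
git add notes.txt
git commit -qm "First draft"

echo "version 2" > notes.txt
git commit -qam "Second draft"

# Every commit is recoverable: restore notes.txt as it was one commit ago.
git checkout HEAD~1 -- notes.txt
cat notes.txt   # prints: version 1
```

No manually saved copies were needed; the earlier version lives in the repository’s history and can be retrieved at any time.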
Basic Git commands

git init
git add <source_file>
git status
git diff <file|directory>
git commit
git log

git remote add origin <github_url>
git push -u origin master
git push
git pull
Git demo
Controlling the Computational Environment
Capturing the computational environment
• Computational projects for big data can be extremely complex and make use of multiple languages, libraries, and dependencies
• Reproducibility encompasses the entire computational environment, not just your data & code
• Debugging dependency issues can be time-consuming & impede reproducibility
Package managers
• Language specific
  • pip, cget, Cargo, npm, etc.
  • Built-in (R, Julia, etc.)
• System specific
  • apt, rpm, MacPorts, etc.
• Conda

Example Conda environment file:
name: my-project
channels:
  - conda-forge
  - bioconda
dependencies:
  - cyvcf2=0.8
  - pyfaidx=0.5
  - joblib=0.11
Conda environments
• Conda is a cross-platform package manager, very popular with Python users
• All dependencies (OS applications, Python libraries, R packages, etc.) can be specified in a YAML file:

conda env create --name "my-env" -f env.yml
conda activate my-env
conda deactivate

• Provides an easy way for other users to reproduce your analyses with exactly the same dependencies
Containers (Singularity/Docker)
• Frameworks for isolating computational environments & applications in standalone virtual “containers”
• Containers include a barebones operating system and all dependencies specified by the creator
• Containers can run on nearly any system, ensuring your code is widely reproducible
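As a sketch of what such a container definition might look like (the base image, file names, and script are hypothetical, not from the original slides), a minimal Dockerfile for a Python-based analysis could be:

```dockerfile
# Start from a fixed, versioned base image so the OS layer is reproducible.
FROM python:3.7-slim

# Install pinned dependencies first, so this layer is cached across rebuilds.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Copy the analysis code into the image.
COPY scripts/ /opt/scripts/

# Default command: run the analysis.
CMD ["python", "/opt/scripts/run-analysis.py"]
```

Anyone with Docker installed could then build and run the analysis with `docker build -t my-analysis .` followed by `docker run my-analysis`, regardless of what is installed on their own machine.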
Conda & Singularity demo
What about hardware?
• Conda environments & Docker containers only control the OS & software environment/dependencies
• Hardware requirements are harder to control—e.g., not everyone can reproduce analyses that require 100s of CPUs
• Best practices:
  • Document the specific hardware environment you used (e.g., processor model, available RAM, required disk space)
  • Keep track of program performance (e.g., RAM usage, runtime, CPUs required)
  • Always provide your code, even if it’s not runnable on every system
Automation
Why automate your workflow?
• Complex projects are often spread across several stages of data (pre)processing and analyses
• Not only must each stage be reproducible, but also the flow of information between stages
• When data or code changes, automation tools ensure all steps are rerun in the proper order and all downstream data is updated properly
Tools for automation
• GNU make (build system)
• Workflow engines
  • Snakemake, Nextflow, CWL, etc.
• A series of “recipes” for controlling the flow of your analyses and managing data dependencies
Snakemake example

rule step_one:
    output: "tmp/{letter}.txt"
    shell: "echo 'Hello' > {output}"

rule step_two:
    input: "tmp/a.txt", "tmp/b.txt", "tmp/c.txt"
    output: "out/merged.txt"
    shell: "cat {input} > out/merged.txt"
Snakemake demo
Summary
• Starting with a mindset of reproducibility will pay off in the future!
• Pay attention to project organization
• Convey the ideas behind your code with documentation
• Take advantage of version control systems
• Capture environment and software dependencies
• Automate whenever possible
Things you can do today
• Think through how to effectively organize your projects and create a folder structure
• Install RStudio, Conda & Jupyter and try to reproduce some examples
• Create a GitHub account and add your notes, code, etc. to a repository
• Install Singularity or Docker and try to containerize your software
• Install Snakemake (or another engine) and explore how you can port existing pipelines into a portable workflow