
Reproducibility: On computational processes, dynamic data, and why we should bother

Explore the challenges in achieving reproducibility, why it matters, and strategies for dealing with complex processes and dynamic data. Gain insights from research studies and learn about the benefits of reproducibility.

schull




Presentation Transcript


  1. Reproducibility: On computational processes, dynamic data, and why we should bother

  2. Outline What are the challenges in reproducibility? What do we gain from reproducibility? (and: why is non-reproducibility interesting?) How to address the challenges of complex processes? How to deal with “Big Data”? Summary

  3. Challenges in Reproducibility http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234

  4. Challenges in Reproducibility Excursion: Scientific Processes

  5. Challenges in Reproducibility Excursion: scientific processes set1_freq440Hz_Am11.0Hz set1_freq440Hz_Am12.0Hz set1_freq440Hz_Am05.5Hz Java Matlab

  6. Challenges in Reproducibility Excursion: Scientific Processes • Bug? • Psychoacoustic transformation tables? • Forgetting a transformation? • Different implementation of filters? • Limited accuracy of calculation? • Difference in FFT implementation? • ...?

  7. Challenges in Reproducibility • Workflows Taverna

  8. Challenges in Reproducibility

  9. Challenges in Reproducibility • Large-scale quantitative analysis • Obtain workflows from MyExperiment.org • March 2015: almost 2,700 WFs (approx. 300-400/year) • Focus on Taverna 2 WFs: 1,443 WFs • Published by authors -> should be “better quality” • Try to re-execute the workflows • Record data on the reasons for failure • Analyse the most common reasons for failures

  10. Challenges in Reproducibility Re-Execution results • Majority of workflows fails • Only 23.6 % are successfully executed • No analysis yet on correctness of results… Rudolf Mayer, Andreas Rauber, “A Quantitative Study on the Re-executability of Publicly Shared Scientific Workflows”, 11th IEEE Intl. Conference on e-Science, 2015.

  11. Computer Science • 613 papers in 8 ACM conferences • Process: • download paper and classify • search for a link to code (paper, web, email twice) • download code • build and execute Christian Collberg and Todd Proebsting. “Repeatability in Computer Systems Research,” CACM 59(3):62-69, 2016.

  12. Challenges in Reproducibility In a nutshell – and another aspect of reproducibility: Source: xkcd

  13. Outline What are the challenges in reproducibility? What do we gain by aiming for reproducibility? How to address the challenges of complex processes? How to deal with dynamic data? Summary

  14. Reproducibility – solved! (?) • Provide source code, parameters, data, … • Wrap it up in a container / virtual machine, … • Why do we want reproducibility? • Which levels of reproducibility are there? • What do we gain by different levels of reproducibility? LXC

  15. Reproducibility – solved! (?) • Dagstuhl Seminar: Reproducibility of Data-Oriented Experiments in e-Science, January 2016, Dagstuhl, Germany

  16. Types of Reproducibility • The PRIMAD1 model: which attributes can we “prime”? • Data • Parameters • Input data • Platform • Implementation • Method • Research Objective • Actors • What do we gain by priming one or the other? [1] Juliana Freire, Norbert Fuhr, and Andreas Rauber. Reproducibility of Data-Oriented Experiments in e-Science. Dagstuhl Reports, 6(1), 2016.
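The PRIMAD checklist can be made concrete as a tiny data structure. A minimal Python sketch (class, field and function names are my own, not from the model's authors) recording which attributes a reproducibility attempt changes, i.e. "primes":

```python
from dataclasses import dataclass, fields

@dataclass
class PrimadRun:
    """Which PRIMAD attributes a reproducibility attempt changes ('primes')."""
    platform: bool = False
    research_objective: bool = False
    implementation: bool = False
    method: bool = False
    actors: bool = False
    data_parameters: bool = False
    data_input: bool = False

def primed(run: PrimadRun):
    """Names of the attributes changed relative to the original experiment."""
    return [f.name for f in fields(run) if getattr(run, f.name)]

# Re-running the identical experiment on a different machine primes only the platform.
rerun = PrimadRun(platform=True)
```

Illustratively, a run that primes only the platform probes portability, while a re-implementation by different actors tests the method itself; each combination yields a different gain.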

  17. Types of Reproducibility and Gains

  18. Reproducibility Papers • Aim for reproducibility: for one’s own sake – and as chairs of conference tracks, editors, reviewers, supervisors, … • Review of reproducibility of submitted work (material provided) • Encouraging reproducibility studies • (Messages to stakeholders in Dagstuhl Report) • Consistency of results, not identity! • Reproducibility studies and papers • Not just re-running code / a virtual machine • When is a reproducibility paper worth the effort / worth being published?

  19. Reproducibility Papers • When is a Reproducibility paper worth being published?

  20. Learning from Non-Reproducibility • Do we always want reproducibility? • Scientifically speaking: yes! • Research is addressing challenges: • Looking for and learning from non-reproducibility! • Non-reproducibility if • Some (unknown) aspect of a study influences results • Technical: parameter sweep, bug in code, OS, … -> fix it! • Non-technical: input data! (specifically: “the user”)

  21. Learning from Non-Reproducibility Challenges in MIR – “things don’t seem to work” • VirtualBox, GitHub, <your favourite tool> are starting points • Same features, same algorithm, different data -> • Same data, different listeners -> • Understanding “the rest”: • Isolating unknown influence factors • Generating hypotheses • Verifying these to understand the “entire system”, cultural and other biases, … • Benchmarks and Meta-Studies

  22. Outline What are the challenges in reproducibility? What do we gain by aiming for reproducibility? How to address the challenges of complex processes? How to deal with “Big Data”? Summary

  23. Déjà vu… http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234

  24. And the solution is… Standardization and Documentation Standardized components, procedures, workflows Documenting complete system set-up across entire provenance chain How to do this – efficiently? Alexander Graham Bell’s Notebook, March 9 1876 https://commons.wikimedia.org/wiki/File:Alexander_Graham_Bell's_notebook,_March_9,_1876.PNG Pieter Bruegel the Elder: De Alchemist (British Museum, London)

  25. Documenting a Process • Context Model: establish what to document and how • Meta-model for describing process & context • Extensible architecture integrated by core model • Reusing existing models as much as possible • Based on ArchiMate, implemented using OWL • Extracted by static and dynamic analysis

  26. Context Model – Static Analysis • Analyses steps, platforms, services, tools called • Dependencies (packages, libraries) • HW, SW licenses, … Example script:
#!/bin/bash
# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip
# fix encoding
#iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r
# generate references
R --vanilla < iq_utf8.r > IQout.txt
# create pdf
pdflatex iq.tex
pdflatex iq.tex
(Figure labels: Script, Context Model (OWL ontology), ArchiMate model, Taverna Workflow)
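The kind of information static analysis pulls out of the script above can be sketched in a few lines of Python. This is an illustrative toy (the KNOWN_TOOLS map and function names are mine, not the actual Context Model extractor, which also captures packages, libraries and licenses):

```python
# Known tool names (illustrative; the real extractor resolves far more dependencies)
KNOWN_TOOLS = {"java": "JRE", "unzip": "unzip", "R": "R runtime", "pdflatex": "TeX Live"}

def extract_tools(script: str):
    """Return the external tools a shell script calls, ignoring comments."""
    tools = []
    for line in script.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, comments, shebang
            continue
        cmd = line.split()[0]
        if cmd in KNOWN_TOOLS and cmd not in tools:
            tools.append(cmd)
    return tools

script = """#!/bin/bash
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip
R --vanilla < iq_utf8.r > IQout.txt
pdflatex iq.tex
"""
```

From such a tool list, the context model can then record the corresponding software dependencies (JRE, R runtime, TeX Live, …).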

  27. Context Model – Dynamic Analysis • Process Migration Framework (PMF) • designed for automatic redeployments into virtual machines • uses strace to monitor system calls • complete log of all accessed resources (files, ports) • captures and stores process instance data • analyse resources (file formats via PRONOM, PREMIS)

  28. Context Model – Dynamic Analysis Taverna Workflow

  29. Process Capture • Preservation and re-deployment • “Encapsulate” as complex Research Object (RO) • DP: re-deployment beyond original environment • Format migration of elements of ROs • Cross-compilation of code • Emulation-as-a-Service • Verification upon re-deployment

  30. VFramework Original environment Repository Redeployment environment Preserve Redeploy Are these processes the same?
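One way to approximate the "are these processes the same?" check is to compare the significant outputs of the original and the redeployed run. A sketch, assuming we hash corresponding output files (file names, contents and the comparison policy are hypothetical):

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_redeployment(original: dict, redeployed: dict):
    """Compare significant outputs of the original and redeployed run.
    Both arguments map output name -> file content (bytes)."""
    report = {}
    for name in sorted(set(original) | set(redeployed)):
        if name not in redeployed:
            report[name] = "missing in redeployment"
        elif name not in original:
            report[name] = "only in redeployment"
        elif digest(original[name]) == digest(redeployed[name]):
            report[name] = "identical"
        else:
            report[name] = "differs"
    return report

orig = {"IQout.txt": b"ref-42\n", "iq.pdf": b"%PDF-1.5 ..."}
redep = {"IQout.txt": b"ref-42\n", "iq.pdf": b"%PDF-1.5 ...!"}
```

Note that a byte-level hash enforces identity; in practice verification may need to tolerate consistent-but-not-identical results (cf. slide 18: consistency of results, not identity).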

  31. VFramework

  32. VFramework
#!/bin/bash
# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip
# fix encoding
#iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r
# generate references
R --vanilla < iq_utf8.r > IQout.txt
# create pdf
pdflatex iq.tex
pdflatex iq.tex




  36. VFramework – same script, annotated “ADDED” / “NOT USED”
#!/bin/bash
# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip
# fix encoding
#iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r
# generate references
R --vanilla < iq_utf8.r > IQout.txt
# create pdf
pdflatex iq.tex
pdflatex iq.tex

  37. Outline What are the challenges in reproducibility? What do we gain by aiming for reproducibility? How to address the challenges of complex processes? How to deal with “Big Data”? Summary

  38. Data and Data Citation • So far focus on the process • Processes work with data • Data as a “1st-class citizen” in science • We need to be able to • preserve data and keep it accessible • cite data to give credit and show which data was used • identify precisely the data used in a study/process for reproducibility, evaluating progress, … • Why is this difficult? (after all, it’s being done…)

  39. Data and Data Citation • Common approaches to data management…(from PhD Comics: A Story Told in File Names, 28.5.2010) Source: http://www.phdcomics.com/comics.php?f=1323

  40. Identification of Dynamic Data • Citable datasets have to be static • Fixed set of data, no changes: no corrections to errors, no new data being added • But: (research) data is dynamic • Adding new data, correcting errors, enhancing data quality, … • Changes sometimes highly dynamic, at irregular intervals • Current approaches • Identifying entire data stream, without any versioning • Using “accessed at” date • “Artificial” versioning by identifying batches of data (e.g. annual), aggregating changes into releases (time-delayed!) • Would like to identify precisely the data as it existed at a specific point in time
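The "time-stamped & versioned" requirement can be illustrated with a toy append-only store (all names are mine): corrections never overwrite history, so the dataset "as of" any earlier timestamp stays reconstructible:

```python
import time

class VersionedStore:
    """Append-only store: every insert/update/delete is time-stamped,
    so the dataset as of any past timestamp can be reconstructed."""
    def __init__(self):
        self.history = []  # (timestamp, key, value_or_None)

    def put(self, key, value, ts=None):
        self.history.append((ts if ts is not None else time.time(), key, value))

    def delete(self, key, ts=None):
        self.put(key, None, ts)  # deletion = time-stamped tombstone

    def as_of(self, ts):
        """Reconstruct the dataset as it existed at timestamp ts."""
        state = {}
        for t, key, value in sorted(self.history, key=lambda e: e[0]):
            if t <= ts:
                state[key] = value
        return {k: v for k, v in state.items() if v is not None}

store = VersionedStore()
store.put("m1", 440.0, ts=1)
store.put("m2", 11.0, ts=2)
store.put("m2", 12.0, ts=5)  # correction: old value stays in history
```

This is the prerequisite for the query-based citation approach on the following slides: a citation can then pin the data to a timestamp instead of freezing the whole dataset.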

  41. Granularity of Data Identification • What about the granularity of data to be identified? • Databases collect enormous amounts of data over time • Researchers use specific subsets of data • Need to identify precisely the subset used • Current approaches • Storing a copy of the subset as used in the study -> scalability • Citing the entire dataset, providing a textual description of the subset -> imprecise (ambiguity) • Storing a list of record identifiers in the subset -> scalability, not for arbitrary subsets (e.g. when not the entire record is selected) • Would like to be able to identify precisely the subset of (dynamic) data used in a process

  42. RDA WG Data Citation • Research Data Alliance • WG on Data Citation: Making Dynamic Data Citeable • WG officially endorsed in March 2014 • Concentrating on the problems of large, dynamic (changing) datasets • Focus! Identification of data! Not: PID systems, metadata, citation string, attribution, … • Liaise with other WGs and initiatives on data citation (CODATA, DataCite, Force11, …) • https://rd-alliance.org/working-groups/data-citation-wg.html

  43. Making Dynamic Data Citeable Data Citation: Data + Means-of-access • Data -> time-stamped & versioned (aka history) • Researcher creates working set via some interface: • Access -> assign PID to QUERY, enhanced with • Time-stamping for re-execution against versioned DB • Re-writing for normalization, unique sort, mapping to history • Hashing result set: verifying identity/correctness, leading to landing page • Andreas Rauber, Ari Asmi, Dieter van Uytvanck and Stefan Pröll. Identification of Reproducible Subsets for Data Citation, Sharing and Re-Use. Bulletin of IEEE Technical Committee on Digital Libraries (TCDL), vol. 12, 2016. http://www.ieee-tcdl.org/Bulletin/current/papers/IEEE-TCDL-DC-2016_paper_1.pdf • Stefan Pröll and Andreas Rauber. Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation. In IEEE Intl. Conf. on Big Data 2013 (IEEE BigData2013), 2013. http://www.ifs.tuwien.ac.at/~andi/publications/pdf/pro_ieeebigdata13.pdf • Prototype for CSV: http://datacitation.eu/
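The core mechanics of this approach — time-stamp the query, normalize it, hash the uniquely sorted result set, assign a PID — can be sketched as follows (class and method names are illustrative, not taken from the reference implementations):

```python
import hashlib
import json
import time

class QueryStore:
    """Sketch: cite a time-stamped query plus a hash of its (uniquely
    sorted) result set instead of storing a copy of the data."""
    def __init__(self):
        self.queries = {}

    @staticmethod
    def result_hash(rows):
        # Unique sort makes the hash independent of retrieval order
        canonical = json.dumps(sorted(rows), separators=(",", ":"))
        return hashlib.sha256(canonical.encode()).hexdigest()

    def cite(self, query, rows, ts=None):
        pid = "pid-%d" % (len(self.queries) + 1)  # stand-in for a real PID/DOI
        self.queries[pid] = {"query": query,
                             "timestamp": ts if ts is not None else time.time(),
                             "hash": self.result_hash(rows)}
        return pid

    def verify(self, pid, rows):
        """Does a re-executed result set match the one originally cited?"""
        return self.queries[pid]["hash"] == self.result_hash(rows)

qs = QueryStore()
pid = qs.cite("SELECT id, freq FROM sets WHERE freq = 440",
              [["s1", 440], ["s2", 440]], ts=1000)
```

Because the hash is computed over the sorted result set, re-executing the stored query against the versioned DB can be verified even when rows come back in a different order.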

  44. Data Citation – Deployment • Researcher uses workbench to identify subset of data • Upon executing selection („download“) user gets • Data (package, access API, …) • PID (e.g. DOI) (Query is time-stamped and stored) • Hash value computed over the data for local storage • Recommended citation text (e.g. BibTeX) • PID resolves to landing page • Provides detailed metadata, link to parent data set, subset,… • Option to retrieve original data OR current version OR changes • Upon activating PID associated with a data citation • Query is re-executed against time-stamped and versioned DB • Results as above are returned • Query store aggregates data usage

  45. Data Citation – Deployment • Note: query string provides excellent provenance information on the dataset!

  46. Data Citation – Deployment • Note: query string provides excellent provenance information on the dataset! • This is an important advantage over traditional approaches relying on, e.g., storing a list of identifiers or a DB dump!

  47. Data Citation – Deployment • Note: query string provides excellent provenance information on the dataset! • This is an important advantage over traditional approaches relying on, e.g., storing a list of identifiers or a DB dump! • Identify which parts of the data are used. If data changes, identify which queries (studies) are affected.

  48. Data Citation – Output • 14 Recommendations grouped into 4 phases: • Preparing data and query store • Persistently identifying specific data sets • Resolving PIDs • Upon modifications to the data infrastructure • 2-page flyer: https://rd-alliance.org/system/files/documents/RDA-DC-Recommendations_151020.pdf • More detailed Technical Report: http://www.ieee-tcdl.org/Bulletin/current/papers/IEEE-TCDL-DC-2016_paper_1.pdf • Reference implementations (SQL, CSV, XML) and Pilots

  49. Join RDA and Working Group If you are interested in joining the discussion, contributing a pilot, or wish to establish a data citation solution, … Register for the RDA WG on Data Citation: Website: https://rd-alliance.org/working-groups/data-citation-wg.html Mailing list: https://rd-alliance.org/node/141/archive-post-mailinglist Web Conferences: https://rd-alliance.org/webconference-data-citation-wg.html List of pilots: https://rd-alliance.org/groups/data-citation-wg/wiki/collaboration-environments.html

  50. 3 Take-Away Messages Message 1: Aim at achieving reproducibility at different levels Re-run, ask others to re-run Re-implement Port to different platforms Test on different data, vary parameters (and report!) If something is not reproducible -> investigate! (you might be onto something!) Encourage reproducibility studies!
