We investigated the effects of data processing variables such as FreeSurfer version (v4.3.1, v4.5.0, and v5.0.0), workstation (Macintosh and Hewlett-Packard), and Macintosh operating system version (OSX 10.5 and OSX 10.6). Significant differences were revealed between FreeSurfer version v5.0.0 and the two earlier versions. [...] About a factor two smaller differences were detected between Macintosh and Hewlett-Packard workstations and between OSX 10.5 and OSX 10.6.
Workload & Time Challenges It is a time commitment to get data and code ready to share, and to share it
Otherwise known as...
the Incentive Problem Reproducibility takes time, and is not always valued by the academic reward structure
"Insufficient time is the main reason why scientists do not make their data and experiment available and reproducible." Carol Tenopir, Beyond the PDF2 Conference
"77% claim that they do not have time to document and clean up the code." Victoria Stodden, Survey of the Machine Learning Community – NIPS 2010
ReproZip tries to solve...
Technical Obsolescence Technology changes affect reproducibility
Normative Dissonance Espoused values don’t always match practice
Otherwise known as...
the Pipeline Problem Reproducibility requires skills that are not included in most curriculums!
"It would require huge amount of effort to make our code work with the latest versions of these tools." Collberg et al., Repeatability and Benefaction in Computer Systems Research, University of Arizona TR 14-04
But the article, code, and data are really just the tip of the iceberg
What does this mean for...
Librarians
We now have to help researchers license code + data + computer environments, select or build repositories to reliably store those objects, and preserve them forever.
Researchers
They have to clean up code and data, learn how to capture computer environments and make them shareable, without spending all their time + research budget on it.
But what if I told you that you could put code + data + applications + environment in one small file that has a ton of automatically captured metadata and provenance?
Tool to Help: ReproZip!
ReproZip is a tool aimed at simplifying the process of creating reproducible...whatever. It can be research, it can be applications, it can be databases, it can be websites...if you can do it on a computer, chances are we can pack it!
2 Steps to Reproducibility
Step 1: Trace & Pack
reprozip trace [command]
reprozip pack package-name.rpz
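As a minimal sketch of scripting Step 1, the Python snippet below assembles the two commands shown above and, by default, only prints them. The experiment command (`python analysis.py data.csv`), the package name, and the helper names `trace_and_pack_commands` and `run_commands` are illustrative assumptions, not part of ReproZip.

```python
import shlex
import subprocess


def trace_and_pack_commands(experiment_cmd, package_name):
    """Build the two ReproZip command lines from Step 1."""
    # `reprozip trace` is followed by the command you would normally run.
    trace = ["reprozip", "trace"] + list(experiment_cmd)
    # `reprozip pack` writes the traced experiment into an .rpz file.
    pack = ["reprozip", "pack", package_name]
    return [trace, pack]


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical experiment: a Python script with one input file.
commands = trace_and_pack_commands(
    ["python", "analysis.py", "data.csv"], "my-experiment.rpz"
)
run_commands(commands)  # dry run: prints the commands without executing them
```

Calling `run_commands(commands, dry_run=False)` would execute both steps, which assumes `reprozip` is installed and on the PATH.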
Before you pack, you can edit the config.yml (but it's not recommended). This is what a config looks like.
Step 2: Set up & Run
Double click on the RPZ file, and choose your unpacker!
Not just simple reproduction...
When you unpack your .rpz package with the GUI, you'll come to this screen.
Download the results or add your own input!
Download Output
Upload New Inputs
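Step 2 can also be driven from the command line with `reprounzip` instead of the GUI. The sketch below only builds and prints the command lines for setup, upload, run, and download; it assumes the Docker unpacker plugin is installed, and the package name, directory, file paths, and input/output ids are hypothetical. The argument order reflects our reading of the ReproZip documentation, so check `reprounzip docker --help` before relying on it.

```python
import shlex


def reprounzip_commands(package, target_dir, unpacker="docker",
                        uploads=None, downloads=None):
    """Return the reprounzip command lines for one reproduction run."""
    base = ["reprounzip", unpacker]
    # Unpack the .rpz package into target_dir using the chosen unpacker.
    commands = [base + ["setup", package, target_dir]]
    # Replace packed inputs with new files: <local-path>:<input-id>.
    for local_path, input_id in (uploads or {}).items():
        commands.append(base + ["upload", target_dir, f"{local_path}:{input_id}"])
    # Re-run the packed experiment.
    commands.append(base + ["run", target_dir])
    # Copy results out of the environment: <output-id>:<destination>.
    for output_id, dest in (downloads or {}).items():
        commands.append(base + ["download", target_dir, f"{output_id}:{dest}"])
    return commands


# Hypothetical names throughout; nothing is executed here.
for cmd in reprounzip_commands(
    "my-experiment.rpz",
    "my-experiment-dir",
    uploads={"new_data.csv": "input_data"},
    downloads={"results": "results.csv"},
):
    print(shlex.join(cmd))
```

Swapping `unpacker="docker"` for `"vagrant"` changes only the second word of each command, which mirrors the plugin model for unpackers.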
ReproZip can pack
Data analysis scripts / software (any language, you name it!)
Graphical tools
Interactive tools
Client-server applications (including databases)
Jupyter notebooks
MPI experiments (setting up the experiment is involved though…)
Liaison librarians have an excellent opportunity as the first line of contact with patrons to encourage and disseminate information on reproducible practices (e.g. using ReproZip).
The library gains an excellent way to build collections of diverse, preservation-ready research outputs, and patrons easily adopt reproducible practices.
Data Services
By adding ReproZip to the curriculum of data services classes and workshops, and to the list of supported software, data services teams can become a center on campus for assisting the reproducibility of their patrons' scholarship.
Potential for ReproZip in (Digital) Libraries
Digital Libraries
ReproZip isn't reliant on Docker or Vagrant; it uses a plugin model for unpackers, so new ones can be added for forward compatibility.
If no containers or VMs exist in the future, then the archivist can still read and use the robust technical and administrative metadata in ReproZip's config.yml.
Repository Management
ReproZip contains extensive technical and administrative metadata, which can be exported as a JSON file. This allows for extensible metadata models, such as crosswalking to Resource Description Framework (RDF) or Dublin Core, and for automating ingest workflows.
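To illustrate what such a crosswalk could look like, here is a minimal Python sketch that maps a few fields of an exported JSON record onto Dublin Core elements. The input keys (`package_name`, `author`, `command_line`, `packed_on`) and the mapping itself are hypothetical and do not reflect ReproZip's actual export schema; a real ingest workflow would start from the fields the repository actually receives.

```python
import json

# Hypothetical crosswalk: exported metadata key -> Dublin Core element.
CROSSWALK = {
    "package_name": "dc:title",
    "author": "dc:creator",
    "command_line": "dc:description",
    "packed_on": "dc:date",
}


def to_dublin_core(metadata, mapping=CROSSWALK):
    """Return a Dublin Core record holding only the fields that map over."""
    return {
        dc_element: metadata[key]
        for key, dc_element in mapping.items()
        if key in metadata
    }


# Hypothetical exported record; the values are placeholders.
exported = json.loads("""
{
  "package_name": "my-experiment.rpz",
  "author": "A. Researcher",
  "command_line": "python analysis.py data.csv",
  "architecture": "x86_64"
}
""")

record = to_dublin_core(exported)
print(json.dumps(record, indent=2))
```

Fields without a mapping (here `architecture`) are dropped rather than guessed, so a curator can extend `CROSSWALK` deliberately as the repository's metadata model grows.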
ReproZip plugin for Jupyter Notebooks (beginning summer 2017)
Better MPI/HPC support
Conclusion
ReproZip is extensible enough to be used for reproducibility across research domains as well as across library services.
Because ReproZip is open-source software, the community drives its development – others can contribute, modify, and reuse it for a variety of purposes.
The library community can leverage ReproZip in instruction, consultation, repository services, digital archiving, and in their own research.
Prof. Juliana Freire, ReproZip PI and reproducibility master.
Dr. Nicholas Wolf for his help in editing our paper.
The Gordon and Betty Moore Foundation & the Alfred P. Sloan Foundation, who support The Moore-Sloan Data Science Environment at NYU, which was vital to the development of ReproZip.