5 Packing for reproducibility

At the end of a project, it’s a good idea to take all your great data and project management skills and make what’s called a research compendium or a reproducible package of all your work! This is a package that contains all of the things necessary to reproduce your work, taking even the computational environment into account.

So far in the course, we’ve addressed the top three layers of reproducibility. In this penultimate class, we’ll take a look at how we can achieve reproducibility fully, with the computational environment. We call this aspect of reproducibility computational reproducibility, and it’s in my opinion the biggest blocker of reproducibility in research. Our computers can’t talk to each other easily!

We see that even if research is rerunnable the results can be different!

The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements

We investigated the effects of data processing variables such as FreeSurfer version (v4.3.1, v4.5.0, and v5.0.0), workstation (Macintosh and Hewlett-Packard), and Macintosh operating system version (OSX 10.5 and OSX 10.6). Significant differences were revealed between FreeSurfer version v5.0.0 and the two earlier versions. […] About a factor two smaller differences were detected between Macintosh and Hewlett-Packard workstations and between OSX 10.5 and OSX 10.6.

Challenge: Environments are Hard to Capture

Gap: tools that can automatically capture all the dependencies in the original environment and automatically set them up in another environment.

There are a few tools that try to address this gap in slightly different ways:

Containers: lightweight virtual operating systems you can send around to other people.

  • Singularity
  • Docker

Packaging Systems: auto-capture of dependencies & source code used at time of running.

  • ReproZip
  • o2r (takes an R workspace and gives you back a Dockerfile!)
  • packrat (R library; only captures source code of R packages, not everything)

The research community has been increasingly using and sharing containers (especially Docker) to try to mitigate this problem. However, there are a few problems with containers:

  • No idea of provenance. If I got a container and your research, I’d still need to know what to run first, which data is input/output, etc.
  • Not trivial for new users to make or to use; they have a steep learning curve.
  • Not sustainable; I can only use a Dockerfile with Docker. This is a big problem thinking long-term.

We are going to look at ReproZip today, which has a lower barrier to entry and captures richer information.

5.1 ReproZip

ReproZip is an open source tool for research reproducibility. It has been actively developed and maintained since 2012. It is a tool developed in Python (the tracing part in C) aimed at simplifying the process of creating reproducible… whatever. It can be research, it can be applications, it can be databases, it can be websites… if you can do it on a computer, chances are ReproZip can pack it!

ReproZip works in two parts: packing and unpacking. Basically, one researcher packs their research (currently, you can only pack on Linux), and another unpacks it.

Packing with ReproZip

5.1.1 Packing

ReproZip captures so much information because it runs at the same time as your script, application, web server, or whatever, traces everything that the process touches, and then captures the source code, data, and metadata about those things.
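To make the idea of "tracing everything that the process touches" concrete, here is a toy sketch in Python. It is not how ReproZip works: ReproZip traces at the operating-system level, while this example only listens for CPython's "open" audit event (Python 3.8+) inside a single Python process. The function names, the `analysis` stand-in, and the file names are all made up for illustration.

```python
import os
import sys
import tempfile
from pathlib import Path

_opened = []      # file paths seen while a trace is active
_active = False   # audit hooks cannot be removed, so we toggle a flag instead


def _hook(event, args):
    # CPython raises the "open" audit event whenever a file is opened.
    if _active and event == "open" and isinstance(args[0], (str, bytes, os.PathLike)):
        _opened.append(os.fsdecode(args[0]))


sys.addaudithook(_hook)


def trace(func, *args):
    """Call func(*args) and return the sorted set of file paths it opened."""
    global _active
    _opened.clear()
    _active = True
    try:
        func(*args)
    finally:
        _active = False
    return sorted(set(_opened))


def analysis(workdir):
    # Hypothetical stand-in for a research script: read an input, write an output.
    values = (workdir / "input.csv").read_text().split()
    (workdir / "output.txt").write_text(str(sum(int(v) for v in values)))


workdir = Path(tempfile.mkdtemp())
(workdir / "input.csv").write_text("1\n2\n3\n")
touched = trace(analysis, workdir)
print([Path(p).name for p in touched])
```

The list of touched files is the kind of raw record a packing tool starts from. ReproZip goes much further, since it also sees other programs, libraries, and system files that the process uses.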

To start, say I have a Python script. It could be anything – ReproZip has packed websites with databases, interactive desktop applications – pretty much whatever you can run on a computer or server, ReproZip can trace it. So to run a Python script normally, one would type into the terminal:

$ python myscript.py

To trace the process with ReproZip, you’d just prepend reprozip trace to the original command:

reprozip trace python myscript.py

Packing with ReproZip

If you want to run multiple scripts, you can have them in the same package as well using reprozip trace --continue [command]. When you have finished running everything, you get a compressed .rpz file by typing:

reprozip pack package-name.rpz

This package is typically small, and can be sent in with a paper for others to seamlessly review your work, shared in a repository for others to use at will, or simply kept as an archival snapshot of your research.
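The trace, continue, and pack steps above can be strung together when you have several scripts. The following is a minimal sketch that only builds and prints the command lines. The helper name `reprozip_commands` and the script `clean.py` are hypothetical, and actually running the commands assumes ReproZip is installed on a Linux machine.

```python
import shlex


def reprozip_commands(scripts, package_name):
    """Return the commands that trace each script in turn, then pack."""
    if not scripts:
        raise ValueError("need at least one command to trace")
    commands = []
    for index, script in enumerate(scripts):
        argv = ["reprozip", "trace"]
        if index > 0:
            argv.append("--continue")  # add this run to the existing trace
        argv += shlex.split(script)
        commands.append(argv)
    commands.append(["reprozip", "pack", package_name])
    return commands


# clean.py is a made-up first step; myscript.py is the script from the text.
for argv in reprozip_commands(["python clean.py", "python myscript.py"],
                              "package-name.rpz"):
    print(shlex.join(argv))
```

To run the steps instead of printing them, each `argv` list could be passed to `subprocess.run(argv, check=True)`.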

The package contains all the workflow information, detailed metadata about everything ReproZip traced, and all the data and source code necessary to reproduce the work.

Let’s look at the metadata file for a minute: bechdel-config.yml

Packing with ReproZip currently only works on Linux, but is in development for macOS. It’s harder on locked-down operating systems because of how much information ReproZip collects about the computational environment where the research is happening. Currently, you can only pack on the command line as well. But we have a student this summer who is building a user interface for packing, so hopefully that will make packing easier.

5.1.2 Unpacking

Unpacking the .rpz works on any computer, with any operating system that has ReproUnzip installed. You simply get the .rpz file, double-click on it, and in two clicks in the user interface, you can re-execute the original user’s work (you can use the terminal as well, but the graphical interface is nice too).

Unpacking with ReproZip

ReproUnzip works on the plugin model, so while it uses Docker and Vagrant as ‘unpackers’, it doesn’t rely on them. The .rpz is general enough that it can be used by the majority of virtual machines and container software. This summer, we have another student working on a Singularity unpacker too.

Unpacking isn’t just simple reproduction of the work though – you can extend the research via ReproUnzip too:

Secondary users who want to test a process with their own data can upload new inputs:

Uploading a new input file

Other users looking to verify results, or just see the results, can also download all the output data right from ReproUnzip:

Downloading the output files
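If you prefer the terminal to the graphical interface, the same unpack, upload, run, and download steps map onto reprounzip subcommands. Here is a minimal sketch that builds those command lines for the Docker unpacker. The helper name `reprounzip_commands`, the directory name `unpacked`, and the input/output names `input_image` and `output_image` are assumptions for illustration; the real input and output names depend on the package (reprounzip showfiles should list them).

```python
import shlex


def reprounzip_commands(package, directory, unpacker="docker",
                        upload=None, download=None):
    """Return the reprounzip commands to set up, run, and exchange files.

    upload is an optional (local_path, input_name) pair;
    download is an optional (output_name, local_path) pair.
    """
    base = ["reprounzip", unpacker]
    commands = [base + ["setup", package, directory]]
    if upload is not None:
        local_path, input_name = upload
        # Replace a packed input file before running the experiment.
        commands.append(base + ["upload", directory, f"{local_path}:{input_name}"])
    commands.append(base + ["run", directory])
    if download is not None:
        output_name, local_path = download
        # Copy an output file out of the unpacked experiment.
        commands.append(base + ["download", directory, f"{output_name}:{local_path}"])
    return commands


steps = reprounzip_commands(
    "package-name.rpz", "unpacked",
    upload=("photo_2.jpg", "input_image"),        # hypothetical input name
    download=("output_image", "new_output.jpg"),  # hypothetical output name
)
for argv in steps:
    print(shlex.join(argv))
```

Passing `unpacker="vagrant"` would build the same sequence for the Vagrant unpacker instead of Docker.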

You can also visualize the workflow/execution of the research using our provenance graph plugin (CLI only), or VisTrails, a scientific workflow management system (which, if we had more time, I’d get into how to use!).

5.1.2.1 ReproServer

We know that unpacking a .rpz locally isn’t always an option, so we made ReproServer to complement ReproUnzip!

ReproServer runs ReproZip packages in the browser, no local software needed! You can either pass ReproServer a link to a ReproZip package, or upload one. It has all the functionality of ReproUnzip on your desktop. You can change configurations, see the execution of the .rpz, download outputs, and upload inputs.

We built this so you would have no lock-in: build on your laptop, pack automatically, reproduce anywhere. ReproServer even gives you a URL to include in papers to reproduce your work, though the actual .rpz file should go into a repository (just pass ReproServer the link!).

Unpacking, now with ReproServer!

We are going to test out ReproServer! Go there now: https://server.reprozip.org

CHALLENGE 1:

  1. Give the link to this .rpz file to ReproServer: https://osf.io/5ztp2/
  2. Go through the steps on ReproServer to reproduce the work.
  3. BONUS: upload a new input – photo_2.jpg
  4. BONUS: download the new output image to your computer.
  5. Raise your hand to show you’ve finished!


5.2 BONUS: Pack Jupyter Notebooks

We also have a plugin to pack Jupyter notebooks, which unfortunately we couldn’t get onto the class Jupyter Notebook installation in time.

On the terminal, you run the following commands in order:

pip install reprozip-jupyter
jupyter nbextension install --py reprozip_jupyter --user
jupyter nbextension enable --py reprozip_jupyter --user
jupyter serverextension enable --py reprozip_jupyter --user

Then, the next time you open Jupyter notebooks, you should have a little button which automatically runs the notebook from top-to-bottom, and packs it with ReproZip!

5.3 CONGRATS

You know more about computational reproducibility AND reproduced someone else’s work with a ReproZip package!

5.4 Further Reading

ReproZip: