5 Packing for reproducibility
At the end of a project, it’s a good idea to take all your great data and project management skills and make what’s called a research compendium or a reproducible package of all your work! This is a package that contains all of the things necessary to reproduce your work, taking even the computational environment into account.
So far in the course, we’ve addressed the top three layers of reproducibility. In this penultimate class, we’ll take a look at how we can achieve reproducibility fully, with the computational environment. We call this aspect of reproducibility computational reproducibility, and it’s in my opinion the biggest blocker of reproducibility in research. Our computers can’t talk to each other easily!
We see that even if research is rerunnable, the results can be different!
We investigated the effects of data processing variables such as FreeSurfer version (v4.3.1, v4.5.0, and v5.0.0), workstation (Macintosh and Hewlett-Packard), and Macintosh operating system version (OSX 10.5 and OSX 10.6). Significant differences were revealed between FreeSurfer version v5.0.0 and the two earlier versions. […] About a factor two smaller differences were detected between Macintosh and Hewlett-Packard workstations and between OSX 10.5 and OSX 10.6.
Challenge: Environments are Hard to Capture
Gap: tools that can automatically capture all the dependencies in the original environment and automatically set them up in another environment.
There are a few tools that try to address this gap in slightly different ways:
Containers: lightweight virtual operating systems you can send around to other people.
- Singularity
- Docker
Packaging Systems: auto-capture of dependencies & source code used at time of running.
- ReproZip
- o2r (takes an R workspace and gives you back a dockerfile!)
- packrat (R library; only captures source code of R packages, not everything)
The research community has been increasingly using and sharing containers (especially Docker) to try to mitigate this problem. However, there are a few problems with containers:
- No idea of provenance. If I got a container and your research, I’d still need to know what to run first, which data is input/output, etc.
- Not trivial for new users to make or to use; they have a steep learning curve.
- Not sustainable; I can only use a Dockerfile with Docker. This is a big problem thinking long-term.
We are going to look at ReproZip today, which has a lower barrier to entry and captures richer information.
5.1 ReproZip
ReproZip is an open source tool for research reproducibility. It has been actively developed and maintained since 2012. It is a tool developed in Python (the tracing part in C) aimed at simplifying the process of creating reproducible… whatever. It can be research, it can be applications, it can be databases, it can be websites… if you can do it on a computer, chances are ReproZip can pack it!
ReproZip works in two parts: packing and unpacking. Basically, one researcher packs their research (currently, you can only pack on Linux) and another researcher unpacks it on their own machine.

Packing with ReproZip
5.1.1 Packing
ReproZip captures so much information because it runs at the same time as a script, application, web server, or whatever else you execute. It traces everything that the process touches, and then captures the source code, data, and metadata about those things.
To start, say I have a Python script. It could be anything: ReproZip has packed websites with databases and interactive desktop applications, and pretty much whatever you can run on a computer or server, ReproZip can trace. To run a Python script normally, one would type into the terminal:
$ python myscript.py
To trace the process with ReproZip, you’d just prepend reprozip trace to the original command:
reprozip trace python myscript.py

If you want to run multiple scripts, you can include them in the same package using reprozip trace --continue [command]. When you have finished running everything, you get a compressed .rpz file by typing:
reprozip pack package-name.rpz
This package is typically small, and can be sent in with a paper for others to seamlessly review your work, shared in a repository for others to use at will, or simply kept as an archival snapshot of your research.
The package contains all the workflow information, detailed metadata about everything ReproZip traced, and all the data and source code necessary to reproduce the work.
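To tie the packing steps together, here is a minimal shell sketch. It only prints the commands you would run (the first step with reprozip trace, later steps with reprozip trace --continue, then reprozip pack), so you can read them over before executing anything. The helper name pack_plan and the script and package names are made up for illustration; the reprozip subcommands are the ones described above.

```shell
#!/bin/sh
# Hypothetical helper: print the ReproZip commands for packing several steps.
# Usage: pack_plan PACKAGE.rpz "command 1" "command 2" ...
pack_plan() {
    pkg="$1"
    shift
    flag=""                      # the first step starts a fresh trace
    for step in "$@"; do
        echo "reprozip trace ${flag}${step}"
        flag="--continue "       # later steps are added to the same trace
    done
    echo "reprozip pack $pkg"    # bundle everything traced into one .rpz
}

# Example with made-up script names:
pack_plan my-analysis.rpz "python clean_data.py" "python make_figures.py"
```

If the printed commands look right, you could pipe them to sh on a Linux machine with ReproZip installed to actually run them.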
Let’s look at the metadata file for a minute: bechdel-config.yml
Packing with ReproZip currently only works on Linux, but support for macOS is in development. Packing is harder on locked-down operating systems because of how much information ReproZip collects about the computational environment where the research is happening. Currently, you can also only pack on the command line, but we have a student this summer who is building a user interface for packing, so hopefully that will make packing easier as well.
5.1.2 Unpacking
Unpacking the .rpz works on any computer, with any operating system, that has ReproUnzip installed. You simply get the .rpz file, double-click on it, and in two clicks in the user interface, you can re-execute the original user’s work (you can use the terminal as well, but the graphical interface is nice too).

Unpacking with ReproZip
ReproUnzip works on a plugin model, so while it uses Docker and Vagrant as ‘unpackers’, it doesn’t rely on them. The .rpz is general enough that it can be used by the majority of virtual machine and container software. This summer, we have another student working on a Singularity unpacker too.
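For those who prefer the terminal, the plugin model shows up directly in the command line: the unpacker name is just the first argument to reprounzip. The sketch below prints the setup and run commands for a chosen unpacker rather than executing them. The helper name unpack_plan, the package name, and the target directory are hypothetical; the list of accepted unpackers reflects the ones I am aware of (docker, vagrant, directory, chroot) and may differ depending on which plugins you have installed.

```shell
#!/bin/sh
# Hypothetical helper: print the reprounzip commands that set up and run
# a package with a given unpacker.
# Usage: unpack_plan UNPACKER PACKAGE.rpz TARGET_DIR
unpack_plan() {
    unpacker="$1"; pkg="$2"; dir="$3"
    case "$unpacker" in
        docker|vagrant|directory|chroot) ;;   # unpackers assumed to be available
        *) echo "unknown unpacker: $unpacker" >&2; return 1 ;;
    esac
    echo "reprounzip $unpacker setup $pkg $dir"   # build the environment
    echo "reprounzip $unpacker run $dir"          # re-execute the packed work
}

# Same package, two different unpackers:
unpack_plan docker package-name.rpz ./unpacked-docker
unpack_plan vagrant package-name.rpz ./unpacked-vagrant
```

Only the unpacker name changes between the two calls, which is the point: the .rpz itself is not tied to Docker or Vagrant.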
Unpacking isn’t just simple reproduction of the work though – you can extend the research via ReproUnzip too:
Secondary users who want to test a process with their own data can upload new inputs:

Uploading a new input file
Other users looking to verify results, or just see the results, can also download all the output data right from ReproUnzip:

Downloading the output files
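On the command line, the same upload and download steps might look like the sketch below. It prints the commands for swapping in a new input, rerunning, and copying out the result, using the Docker unpacker as an example. The helper name rerun_plan, the file names, and the identifiers input_image and output_image are all made up for illustration; running reprounzip showfiles on your package lists the real input and output identifiers.

```shell
#!/bin/sh
# Hypothetical helper: print the commands to rerun an unpacked experiment
# with a replacement input and then fetch the fresh output.
# Usage: rerun_plan DIR NEW_INPUT INPUT_ID OUTPUT_ID DEST
rerun_plan() {
    dir="$1"; new_input="$2"; input_id="$3"; output_id="$4"; dest="$5"
    echo "reprounzip docker upload $dir $new_input:$input_id"   # replace an input
    echo "reprounzip docker run $dir"                           # re-execute
    echo "reprounzip docker download $dir $output_id:$dest"     # copy the output out
}

# Example with made-up names and identifiers:
rerun_plan ./unpacked photo_2.jpg input_image output_image result.jpg
```

Note the argument order: upload takes local-file:input-id, while download takes output-id:local-file.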
You can also visualize the workflow/execution of the research using our provenance graph plugin (CLI only), or VisTrails, a scientific workflow management system (which, if we had more time, I’d get into how to use!).
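As a rough sketch of the provenance graph route: reprounzip graph writes a Graphviz .dot file, which you can then render to an image with the dot tool. The helper below just prints the two commands, deriving the output file names from the package name. The helper name graph_plan and the package name are hypothetical, and this assumes Graphviz is installed on your machine.

```shell
#!/bin/sh
# Hypothetical helper: print the commands to draw a provenance graph
# for a package. Usage: graph_plan PACKAGE.rpz
graph_plan() {
    pkg="$1"
    base="${pkg%.rpz}"                          # strip the .rpz extension
    echo "reprounzip graph $base.dot $pkg"      # write the graph description
    echo "dot -Tpng $base.dot -o $base.png"     # render it with Graphviz
}

# Example with a made-up package name:
graph_plan package-name.rpz
```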
5.1.2.1 ReproServer
We know that unpacking a .rpz locally isn’t always practical. So, we made ReproServer to complement ReproUnzip!
ReproServer runs ReproZip packages in the browser, no local software needed! You can either pass ReproServer a link to a ReproZip package, or upload one. It has all the functionality of ReproUnzip on your desktop. You can change configurations, see the execution of the .rpz, download outputs, and upload inputs.
We built this so you would have no lock-in: build on your laptop, pack automatically, reproduce anywhere. ReproServer even gives you a URL to include in papers to reproduce your work, though the actual .rpz file should go into a repository (just pass ReproServer the link!).

Unpacking, now with ReproServer!
We are going to test out ReproServer! Go there now: https://server.reprozip.org
CHALLENGE 1:
- Give the link to this .rpz file to ReproServer: https://osf.io/5ztp2/
- Go through the steps on ReproServer to reproduce the work.
- BONUS: upload a new input – photo_2.jpg
- BONUS: download the new output image to your computer.
- Raise your hand to show you’ve finished!
5.2 BONUS: Pack Jupyter Notebooks
We also have a plugin to pack Jupyter notebooks, which unfortunately we couldn’t get onto the class Jupyter Notebook installation in time.
On the terminal, you run the following commands in order:
pip install reprozip-jupyter
jupyter nbextension install --py reprozip_jupyter --user
jupyter nbextension enable --py reprozip_jupyter --user
jupyter serverextension enable --py reprozip_jupyter --user
Then, the next time you open Jupyter notebooks, you should have a little button which automatically runs the notebook from top-to-bottom, and packs it with ReproZip!
5.3 CONGRATS
You know more about computational reproducibility AND reproduced someone else’s work with a ReproZip package!
5.4 Further Reading
ReproZip:
- Official website: reprozip.org
- Examples of folks using ReproZip: examples.reprozip.org
- YouTube channel with tutorials: youtube.com/channel/UCG_yo1KKvhWSygxCBQqd_vQ
- Documentation: docs.reprozip.org/en/1.0.x/