Creating Reproducible Experiments with ReproZip


Remi Rampin, Vicky Steeves, Fernando Chirigati

@remram44, @VickySteeves, @fchirigati

Presentation: vickysteeves.gitlab.io/2017-SciPy | July 13, 2017


In collaboration with Juliana Freire & Dennis Shasha

Obligatory PhD Comics strip...

Why Reproducibility?

"If I have seen further, it is by standing on the shoulders of giants." - Sir Isaac Newton

To build on top of previous work – science is incremental!

To verify the correctness of results

To defeat self-deception¹

To help newcomers

To increase impact, visibility² and research quality³

Reproducibility exists on a spectrum


  • Reviewable Research: Sufficient detail for peer review & assessment.
  • Replicable Research: Tools are available to duplicate the author’s results using their data.
  • Confirmable Research: Main conclusions can be attained independently without author’s software.
  • Auditable Research: Process & tools archived such that it can be defended later if necessary.
  • Open/Reproducible Research: Auditable research made openly available.

Stodden et al ICERM report (2013)

Another way to look at the spectrum...

Challenge: Environments are Hard to Capture

Gap: tools that can automatically capture all the dependencies in the original environment and automatically set them up in another environment.

Even if runnable, results may differ

The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements | June 1, 2012

We investigated the effects of data processing variables such as FreeSurfer version (v4.3.1, v4.5.0, and v5.0.0), workstation (Macintosh and Hewlett-Packard), and Macintosh operating system version (OSX 10.5 and OSX 10.6). Significant differences were revealed between FreeSurfer version v5.0.0 and the two earlier versions. [...] About a factor two smaller differences were detected between Macintosh and Hewlett-Packard workstations and between OSX 10.5 and OSX 10.6.

ReproZip tries to solve...

Workload & Time Challenges
Getting data and code ready to share, and then actually sharing them, takes time

Otherwise known as...

the Incentive Problem
Reproducibility takes time, and is not always valued by the academic reward structure

"Insufficient time is the main reason why scientists do not make their data and experiment available and reproducible."
Carol Tenopir, Beyond the PDF2 Conference

"77% claim that they do not have time to document and clean up the code."
Victoria Stodden, Survey of the Machine Learning Community – NIPS 2010

ReproZip tries to solve...

Technical Obsolescence
Changes in technology affect reproducibility

Normative Dissonance¹
Espoused values don’t always match practice

Otherwise known as...

the Pipeline Problem
Reproducibility requires skills that are not included in most curriculums!

"It would require huge amount of effort to make our code work with the latest versions of these tools."
Collberg et al., Repeatability and Benefaction in Computer Systems Research, University of Arizona TR 14-04

¹ https://www.ncbi.nlm.nih.gov/pubmed/19385804

So, what is ReproZip?

ReproZip is a tool written in Python (with the tracing part in C) that aims to simplify the process of creating reproducible... whatever. It can be research, applications, databases, websites... if you can do it on a computer, chances are we can pack it!

2 Steps to Reproducibility

Step 1: Trace & Pack

reprozip trace [command]
reprozip pack package-name.rpz
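
For scripted workflows, the two commands above can be assembled and launched from Python. The following is a minimal sketch, not part of ReproZip itself: the helper names and the package name bechdel.rpz are assumptions made for illustration, and `python fetch.py` is borrowed from the sample configuration shown later in this deck. The only ReproZip features relied on are the `trace` and `pack` subcommands and the `--continue` flag of `reprozip trace`, which adds a run to an existing trace.

```python
import shlex
import shutil
import subprocess


def trace_and_pack_commands(experiment, package, continue_trace=False):
    """Return the argv lists for `reprozip trace` followed by `reprozip pack`."""
    trace = ["reprozip", "trace"]
    if continue_trace:
        # --continue appends a new run to the existing trace instead of replacing it
        trace.append("--continue")
    trace += shlex.split(experiment)
    pack = ["reprozip", "pack", package]
    return [trace, pack]


def run_all(commands):
    """Run each command in order and stop at the first failure (hypothetical helper)."""
    if shutil.which("reprozip") is None:
        raise RuntimeError("reprozip is not installed or not on PATH")
    for argv in commands:
        subprocess.run(argv, check=True)


# Only print the commands here; call run_all(commands) to actually execute them.
commands = trace_and_pack_commands("python fetch.py", "bechdel.rpz")
for argv in commands:
    print(" ".join(argv))
```

Keeping command construction separate from execution makes it possible to inspect or log the commands before anything is traced.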

Before you pack, you can edit the generated configuration file, .reprozip-trace/config.yml, to control what goes into the package (optional, advanced usage of the tool).

pack_id: "1a9b02a5-ea7f-4dc0-9d31-176e70acaf92"
version: "0.8"
runs:
# Run 0
- architecture: x86_64
  argv: [python, fetch.py]
  binary: /home/vagrant/bechdel/venv/bin/python
  distribution: [debian, '8.3']
  environ: {HOME: /home/vagrant, LANG: en_US.UTF-8, TERM: xterm-color, USER: vagrant}
  exitcode: 0
  gid: 1000
  hostname: debian-83-amd64
  id: collectdata
  system: [Linux, 3.16.0-4-amd64]
  uid: 1000
  workingdir: /home/vagrant/bechdel

# Run 1
- architecture: x86_64
  ...
# Inputs are files that are only read by a run; reprounzip can replace these
# files on demand to run the experiment with custom data.
# Outputs are files that are generated by a run; reprounzip can extract these
# files from the experiment on demand, for the user to examine.
# The name field is the identifier the user will use to access these files.
inputs_outputs:
- name: bechdel.json
  path: /home/vagrant/bechdel/bechdel.json
  written_by_runs: [0]
  read_by_runs: [1]
- name: revenue.csv
  path: /home/vagrant/bechdel/revenue.csv
  written_by_runs: [0]
  read_by_runs: [1]
- name: revenue.png
  path: /home/vagrant/bechdel/revenue.png
  written_by_runs: [1]
  read_by_runs: []
- ...
# These files come from packages; we can thus choose not to include them, as it
# will simply be possible to install that package on the destination system
# They are included anyway by default
packages:
  - name: "dash"
    version: "0.5.7-4+b1"
    size: 195584
    packfiles: true
    files:
      # Total files used: 122.46 KB
      # Installed package size: 191.00 KB
      - "/bin/dash" # 122.46 KB
      - "/bin/sh" # Link to /bin/dash
  - name: "libblas3"
    version: "1.2.20110419-10"
    size: 569344
    packfiles: true
    files:
      # Total files used: 511.23 KB
      # Installed package size: 556.00 KB
      - "/usr/lib/libblas/libblas.so.3" # Link to /usr/lib/libblas/libblas.so.3.0
      - "/usr/lib/libblas/libblas.so.3.0" # 511.23 KB
  - name: "libc-bin"
    version: "2.19-18+deb8u2"
    size: 3341312
    packfiles: true
    files:
      # Total files used: 870.03 KB
      # Installed package size: 3.19 MB
      - "/etc/gai.conf" # 2.52 KB
      - "/sbin/ldconfig" # 387.0 bytes
      - "/sbin/ldconfig.real" # 867.13 KB
  - ...

# These files do not appear to come with an installed package -- you probably
# want them packed
other_files:
  - "/etc/hosts"
  - "/etc/resolv.conf"
  - "/home/vagrant/.cache/matplotlib/fontList.cache"
  - "/home/vagrant/.cache/matplotlib/tex.cache"
  - "/home/vagrant/.config/matplotlib"
  - "/home/vagrant/bechdel/bechdel.json"
  - "/home/vagrant/bechdel/bechdel.py"
  - "/home/vagrant/bechdel/cpi.csv"
  - "/home/vagrant/bechdel/fetch.py"
  - "/home/vagrant/bechdel/imdb_data.json"
  - "/home/vagrant/bechdel/revenue.csv"
  - "/home/vagrant/bechdel/venv/bin/python"
  - "/home/vagrant/bechdel/venv/lib/python-wheels/chardet-2.3.0-py2.py3-none-any.whl"
  - "/home/vagrant/bechdel/venv/lib/python-wheels/pip-1.5.6-py2.py3-none-any.whl"
  - "/home/vagrant/bechdel/venv/lib/python-wheels/requests-2.4.3-py2.py3-none-any.whl"
  - "/home/vagrant/bechdel/venv/lib/python-wheels/setuptools-5.5.1-py2.py3-none-any.whl"
  - "/home/vagrant/bechdel/venv/lib/python-wheels/urllib3-1.9.1-py2.py3-none-any.whl"
  - "/home/vagrant/bechdel/venv/lib/python2.7/abc.py"
  - "/home/vagrant/bechdel/venv/lib/python2.7/codecs.py"
  - "/home/vagrant/bechdel/venv/lib/python2.7/copy_reg.py"
  - "/home/vagrant/bechdel/venv/lib/python2.7/distutils/__init__.py"
  - "/home/vagrant/bechdel/venv/lib/python2.7/distutils/distutils.cfg"
  - "/home/vagrant/bechdel/venv/lib/python2.7/encodings/__init__.py"
  - "/home/vagrant/bechdel/venv/lib/python2.7/encodings/ascii.py"
  - "/home/vagrant/bechdel/venv/lib/python2.7/encodings/base64_codec.py"
  - "/home/vagrant/bechdel/venv/lib/python2.7/encodings/charmap.py"
  - "/home/vagrant/bechdel/venv/lib/python2.7/fnmatch.py"
  - ...
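
To see how the inputs_outputs section of the configuration above ties files to runs, the sketch below lists what each run reads and writes. It is only an illustration: the three entries are copied by hand from the sample config (the elided "- ..." entries are left out), and `files_for_run` is a hypothetical helper, not a ReproZip function.

```python
# Entries copied from the inputs_outputs section of the sample config.yml;
# the elided entries ("- ...") are intentionally not reconstructed.
inputs_outputs = [
    {"name": "bechdel.json", "written_by_runs": [0], "read_by_runs": [1]},
    {"name": "revenue.csv", "written_by_runs": [0], "read_by_runs": [1]},
    {"name": "revenue.png", "written_by_runs": [1], "read_by_runs": []},
]


def files_for_run(entries, run):
    """Return (names read by the run, names written by the run)."""
    reads = [e["name"] for e in entries if run in e["read_by_runs"]]
    writes = [e["name"] for e in entries if run in e["written_by_runs"]]
    return reads, writes


for run in (0, 1):
    reads, writes = files_for_run(inputs_outputs, run)
    print(f"run {run}: reads {reads}, writes {writes}")
```

In this sample, run 0 produces bechdel.json and revenue.csv, and run 1 consumes both to produce revenue.png.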

Tracing

Packing

Step 2: Set up & Run

Double click on the RPZ file, and choose your unpacker!
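
The same two actions are also available from the command line through reprounzip, where the unpacker is given as a subcommand. The sketch below builds those commands in Python. The helper name, the package name bechdel.rpz and the target directory bechdel-unpacked are assumptions for illustration; which unpackers are actually available depends on the reprounzip plugins installed.

```python
# Unpackers provided by reprounzip and its docker/vagrant plugins.
UNPACKERS = ("directory", "chroot", "docker", "vagrant")


def setup_and_run_commands(unpacker, package, target):
    """Return argv lists for `reprounzip <unpacker> setup` and `... run`."""
    if unpacker not in UNPACKERS:
        raise ValueError(f"unknown unpacker: {unpacker!r}")
    return [
        # setup extracts the package into the target directory
        ["reprounzip", unpacker, "setup", package, target],
        # run re-executes the packed experiment from that directory
        ["reprounzip", unpacker, "run", target],
    ]


for argv in setup_and_run_commands("docker", "bechdel.rpz", "bechdel-unpacked"):
    print(" ".join(argv))
```

Each argv list can be passed to `subprocess.run(argv, check=True)` on a machine where reprounzip is installed.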

Setting Up

Running

Not just simple reproduction...

When you unpack your .rpz package with the GUI, you'll see:

Downloading Outputs

Uploading Inputs
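
Outside the GUI, outputs and inputs are exchanged with the `download` and `upload` subcommands of reprounzip, which identify files by the names listed under inputs_outputs in the package configuration. The sketch below builds both commands. The helper names, the target directory bechdel-unpacked and the local file names are assumptions for illustration; revenue.png and revenue.csv are names taken from the sample configuration.

```python
def download_command(unpacker, target, output_name, local_path):
    """Build the command that copies a named output out of an unpacked experiment."""
    # download takes <output-name>:<local-path>
    return ["reprounzip", unpacker, "download", target, f"{output_name}:{local_path}"]


def upload_command(unpacker, target, local_path, input_name):
    """Build the command that replaces a named input with a local file."""
    # upload takes <local-path>:<input-name>
    return ["reprounzip", unpacker, "upload", target, f"{local_path}:{input_name}"]


print(" ".join(download_command("docker", "bechdel-unpacked", "revenue.png", "plot.png")))
print(" ".join(upload_command("docker", "bechdel-unpacked", "my_revenue.csv", "revenue.csv")))
```

Note that the order of the two halves of the file argument is reversed between the two subcommands: the source always comes first.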

Bonus: Jupyter Notebooks

pip install reprozip-jupyter
jupyter nbextension install --py reprozip_jupyter --user
jupyter nbextension enable --py reprozip_jupyter --user
jupyter serverextension enable --py reprozip_jupyter --user

ReproZip can pack

Data analysis scripts / software (any language, you name it!)

Graphical tools

Interactive tools

Client-server applications (including databases)

Jupyter notebooks

MPI experiments (setting up the experiment is involved though…)

… and much more!

Current Use Cases

Recommended by the Information Systems Journal, Reproducibility Section

Recommended by the ACM SIGMOD Reproducibility Review

Listed on the ACM Artifact Evaluation Process Guidelines

Integrated as a component of CoRR

Archiving data journalism apps like StackedUp

… and many more!

Future Development Work

Other Resources for ReproZip

ReproZip Website: reprozip.org

ReproZip Examples: examples.reprozip.org

ReproZip GitHub: github.com/ViDA-NYU/reprozip

ReproZip Mailing list: users@reprozip.org

ReproZip YouTube Demos:

ReproZip on Twitter: twitter.com/search?q=reprozip

Thank You:

Prof. Juliana Freire, ReproZip PI.

The Gordon and Betty Moore Foundation & the Alfred P. Sloan Foundation, who support The Moore-Sloan Data Science Environment at NYU, which was vital to the development of ReproZip.

Questions?

Get this Presentation: vickysteeves.gitlab.io/2017-SciPy

Email us: users@reprozip.org, vicky.steeves@nyu.edu, remi.rampin@nyu.edu, or fchirigati@nyu.edu