Reproducing and Preserving Research with ReproZip


Remi Rampin, Vicky Steeves, Fernando Chirigati

@remram44, @VickySteeves, @fchirigati

IASSIST 2017 | May 25, 2017


Obligatory PhD Comics strip...

Reproducibility exists on a spectrum


  • Reviewable Research: Sufficient detail for peer review & assessment.
  • Replicable Research: Tools are available to duplicate the author’s results using their data.
  • Confirmable Research: Main conclusions can be attained independently without author’s software.
  • Auditable Research: Process & tools archived such that it can be defended later if necessary.
  • Open/Reproducible Research: Auditable research made openly available.

Stodden et al., ICERM report (2013)

Challenge 1: Everyone Messes Up

Gap: a tool to seamlessly review whole research projects without the reviewer having to manually install and debug all dependencies, code, and data.

Excel is Terrible

Challenge 2: Environments are Hard to Capture

Gap: tools that can automatically capture all the dependencies in the original environment and automatically set them up in another environment.

Even if runnable, results may differ

The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements | June 1, 2012

We investigated the effects of data processing variables such as FreeSurfer version (v4.3.1, v4.5.0, and v5.0.0), workstation (Macintosh and Hewlett-Packard), and Macintosh operating system version (OSX 10.5 and OSX 10.6). Significant differences were revealed between FreeSurfer version v5.0.0 and the two earlier versions. [...] About a factor two smaller differences were detected between Macintosh and Hewlett-Packard workstations and between OSX 10.5 and OSX 10.6.

The New Traditional Model

  • Publish a paper.
  • Publish the underlying code and data.
  • Link the paper + code/data.
  • Bump up your H-Index.

And I have slides on that model here and here.

ReproZip tries to solve...

Workload & Time Challenges
It is a time commitment to get data and code ready to share, and to share it

Otherwise known as...

the Incentive Problem
Reproducibility takes time, and is not always valued by the academic reward structure

"Insufficient time is the main reason why scientists do not make their data and experiment available and reproducible."
Carol Tenopir, Beyond the PDF2 Conference

"77% claim that they do not have time to document and clean up the code."
Victoria Stodden, Survey of the Machine Learning Community – NIPS 2010

ReproZip tries to solve...

Technical Obsolescence
Changes in technology affect reproducibility

Normative Dissonance [1]
Espoused values don’t always match practice

Otherwise known as...

the Pipeline Problem
Reproducibility requires skills that are not included in most curriculums!

"It would require huge amount of effort to make our code work with the latest versions of these tools."
Collberg et al., Repeatability and Benefaction in Computer Systems Research, University of Arizona TR 14-04

[1] https://www.ncbi.nlm.nih.gov/pubmed/19385804

But the article, code, and data are really just the tip of the iceberg

What does this mean for...


Librarians

We now have to help researchers license code + data + computer environments, select or build repositories to reliably store those objects, and preserve them forever.

Researchers

They have to clean up code and data, and learn how to capture computer environments and make them shareable, without spending all their time + research budget on it.

But what if I told you that you could put code + data + applications + environment in one small file that has a ton of automatically captured metadata and provenance?

Tool to Help: ReproZip!

ReproZip is a tool aimed at simplifying the process of creating reproducible...whatever. It can be research, it can be applications, it can be databases, it can be websites...if you can do it on a computer, chances are we can pack it!

2 Steps to Reproducibility

Step 1: Trace & Pack

reprozip trace [command]
reprozip pack package-name.rpz
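A concrete Step 1 session might look like the sketch below. The script name analysis.py and the package name my_experiment.rpz are made-up placeholders, and the small run helper is only there so the commands are printed rather than executed unless DRY_RUN=0 is set on a machine where ReproZip is installed.

```shell
#!/bin/sh
# Illustrative Step 1 walkthrough. analysis.py and my_experiment.rpz are
# hypothetical placeholders, not names from the original slides.

# Print each command; execute it only when DRY_RUN=0 is set.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Trace the experiment: ReproZip watches the command while it runs and
# records the files and dependencies it uses.
run reprozip trace python analysis.py

# Bundle what the trace recorded into a single .rpz package.
run reprozip pack my_experiment.rpz
```

Run as-is, this only prints the two commands; with DRY_RUN=0 it performs the trace and writes the package.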

Before you pack, you can edit the config.yml (but it's not recommended). This is what a config looks like.

Step 2: Set up & Run

Double click on the RPZ file, and choose your unpacker!

Not just simple reproduction...

When you unpack your .rpz package with the GUI, you'll come to this screen.

Download the results or add your own input!


Download Output

Upload New Inputs
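The same steps are also available from the command line through reprounzip. The sketch below assumes the Docker unpacker plugin is installed. The package, directory, and file names and the input/output identifiers (input_data, results) are hypothetical, and the exact upload/download argument order is worth confirming against reprounzip's help output for your version.

```shell
#!/bin/sh
# Illustrative command-line version of Step 2. All names below are
# hypothetical placeholders.
PACKAGE=my_experiment.rpz
WORKDIR=my_experiment_dir

# Print each command; execute it only when DRY_RUN=0 is set.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# List the input and output files the package knows about.
run reprounzip showfiles "$PACKAGE"

# Set up the experiment with the Docker unpacker, then run it.
run reprounzip docker setup "$PACKAGE" "$WORKDIR"
run reprounzip docker run "$WORKDIR"

# Swap in a new input file (assumed form: <local path>:<input id>) and rerun.
run reprounzip docker upload "$WORKDIR" new_data.csv:input_data
run reprounzip docker run "$WORKDIR"

# Copy an output back out (assumed form: <output id>:<local path>).
run reprounzip docker download "$WORKDIR" results:results.csv
```

Swapping docker for another installed unpacker (for example vagrant) should leave the rest of the sequence unchanged, since unpackers are plugins behind the same commands.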

ReproZip can pack

Data analysis scripts / software (any language, you name it!)

Graphical tools

Interactive tools

Client-server applications (including databases)

Jupyter notebooks

MPI experiments (setting up the experiment is involved though…)

… and much more!

Current Use Cases

Recommended by the Information Systems Journal, Reproducibility Section

Recommended by the ACM SIGMOD Reproducibility Review

Listed on the ACM Artifact Evaluation Process Guidelines

Integrated as a component of CoRR

Archiving data journalism apps like StackedUp

… and many more!

Potential for ReproZip in (Academic) Libraries

Liaison Librarians

  • Liaison librarians are the first line of contact with patrons, which gives them an excellent opportunity to encourage and share information on reproducible practices (e.g. using ReproZip).
  • The library gains an excellent way to build collections of diverse, preservation-ready research outputs, and patrons easily adopt reproducible practices.

Data Services

By adding ReproZip to the curriculum of data services classes and workshops, and to the list of supported software, data services teams can become a campus hub for supporting the reproducibility of their patrons' scholarship.

Potential for ReproZip in (Digital) Libraries

Digital Libraries

  • ReproZip isn't reliant on Docker or Vagrant; it uses a plugin model for unpackers, so new ones can be added for forward compatibility.
  • If no containers or VMs exist in the future, then the archivist can still read and use the robust technical and administrative metadata in ReproZip's config.yml.

Repository Management

ReproZip packages contain extensive technical and administrative metadata, which can be exported as a JSON file. This allows for extensible models of metadata, such as crosswalking to Resource Description Framework (RDF) or Dublin Core, and for automating ingest workflows.

Future Development Work

  • Packing on macOS (beginning summer 2017)
  • Improvements to provenance graph visualization (beginning summer 2017), because right now we've got...
  • ReproZip plugin for Jupyter Notebooks (beginning summer 2017)
  • Better MPI/HPC support

Conclusion


ReproZip is extensible enough to be used for reproducibility across research domains as well as across library services.

Because ReproZip is open-source software, the community drives its development – others can contribute, modify, and reuse it for a variety of purposes.

The library community can leverage ReproZip in instruction, consultation, repository services, digital archiving, and in their own research.

Other Resources for ReproZip

ReproZip Website: https://reprozip.org

ReproZip Examples: https://examples.reprozip.org

ReproZip GitHub: https://github.com/ViDA-NYU/reprozip

ReproZip Mailing list: reprozip-users@vgc.poly.edu

ReproZip YouTube Demos:

ReproZip on Twitter: https://goo.gl/d6NXoH

Thank You:

Prof. Juliana Freire, ReproZip PI and reproducibility master.

Dr. Nicholas Wolf for his help in editing our paper.

The Gordon and Betty Moore Foundation & the Alfred P. Sloan Foundation, who support The Moore-Sloan Data Science Environment at NYU, which was vital to the development of ReproZip.

Questions?

Get this Presentation: https://vickysteeves.gitlab.io/2017-IASSIST-ReproZip

Email us:
reprozip-users@vgc.poly.edu
or
vicky.steeves@nyu.edu
or
remi.rampin@nyu.edu
or
fchirigati@nyu.edu