1 Version Control

Video of this chapter’s lecture can be viewed on YouTube: Part 1 and Part 2

From Wikipedia: “the management of changes to documents, computer programs, large web sites, and other collections of information.”

Basically, it’s a way for us to compare, restore, and merge changes to our stuff. Rather than emailing documents with tracked changes and comments, and renaming different versions of files (example.txt, exampleV2.txt, exampleV3.txt) to tell them apart, we can use version control to save all that information with the document itself.

We want to avoid this.

This makes it easy to get an overview of all changes made to a file over time by looking at a log of all the changes that have been made. And all earlier versions of each file still remain in their original form: they are not overwritten, should we ever wish to “roll back” to them.

The most important benefits of version control are:

  • Collaboration - Version control allows us to define formalized ways we can work together and share writing and code. For example, merging together sets of changes from different parties enables co-creation of documents and software across distributed teams. When several people collaborate in the same project, it’s possible to accidentally overlook or overwrite someone’s changes. The version control system automatically notifies users whenever there’s a conflict between one person’s work and another’s.
  • Versioning - Having a robust and rigorous log of changes to a file, without renaming files (v1, v2, final_copy).
  • Rolling Back - Version control allows us to quickly undo a set of changes. This can be useful when new writing or new additions to code introduce problems. Since all old versions of files are saved, it’s always possible to go back in time to see exactly who wrote what on a particular day.
  • Understanding - Version control can help you understand how the code or writing came to be, who wrote or contributed particular parts, and who you might ask to help understand it better. As we have this record of who made what changes when, we know who to ask if we have questions later on, and, if needed, revert to a previous version, much like the “undo” feature in an editor.
  • Backup - While not meant to be a backup solution, using a version control system means that your code and writing can be stored on multiple other computers. Nothing that is committed to version control is ever lost, unless you work really, really hard at it.

1.1 Git

Git is one of the most widely used revision control systems in the world. It is a free, open source tool that can be downloaded to your local machine or server and used for logging all changes made to a group of designated files (referred to as a “git repository” or “repo” for short) over time. It can be used to control file versions locally by you alone on your computer, but is perhaps most powerful when employed to coordinate simultaneous work on a group of files shared among distributed groups of people. A git repository contains, among other things, the following:

  • Snapshots of your files (text, images, or any other file type)
  • References to these snapshots, called heads

The git repository is a hidden sub-folder in your project folder, called .git. You probably won’t have to touch this ever, but definitely don’t delete it.

Once installed, interaction with Git is done through the Command Prompt in Windows, or the Terminal on Mac/Linux. Because Word documents and PDFs are binary formats with special formatting, Git unfortunately cannot show meaningful line-by-line changes for them, though both file types can be stored in Git repositories. Git works best with plain text files such as .csv, .py, .json, and more.

1.1.1 How git does version control

Git works on branches, which represent independent lines of changes. Each snapshot of files is linked to the ‘parent’ snapshot that it is built upon. By default, every repository starts on a “master” branch, although many hosting platforms, GitLab and GitHub included, now name the default branch “main” for new repositories (Instructor’s note: I hate the “master” terminology. Computer science has a long way to go…). We won’t go over branches in-depth for this session, but a few good tutorials on branches can be found on House of Hades and the Atlassian guides.

There are three states that your git project can be in, which will ultimately be your workflow when you interact with git locally:

  1. You are just working normally in your working directory.

  2. You want to stage your work, so git knows it could potentially become the next version.

  3. Your changes become the newest version in the repository!

As you work, you move between these three states many, many times throughout the life of a project. These moves are done with some simple commands in the terminal, OR in RStudio!

Git stages from https://git-scm.com/about
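The three states can be watched from the terminal. The following is a minimal sketch, not part of the original lesson: it assumes git is installed, works in a throwaway temporary folder, and uses a made-up file name (notes.txt) and a placeholder identity that applies only to this demo repository.

```shell
# Sandbox: a throwaway repository in a temporary folder (all names are placeholders).
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Example Person"        # placeholder identity, this repository only
git config user.email "person@example.com"

# State 1: a new file exists only in the working directory.
echo "first draft" > notes.txt
state1=$(git status --short)    # "?? notes.txt" means untracked

# State 2: the file is staged, so it could become part of the next version.
git add notes.txt
state2=$(git status --short)    # "A  notes.txt" means added to the staging area

# State 3: the staged snapshot becomes the newest version in the repository.
git commit -q -m "Add notes.txt"
state3=$(git status --short)    # empty: nothing left to stage or commit

printf '%s\n' "$state1" "$state2" "[$state3]"
```

The short status codes are the same information that plain git status prints in sentences, just compressed to two characters per file.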

1.2 Project & data management

Some basic tenets of good project etiquette across domains:

  • Put each project in its own directory, which is named after the project.
  • Put text documents associated with the project in the doc folder.
  • Put raw data and metadata in the data folder. These data are read-only!
  • Put files generated during cleanup and analysis in the results folder.
  • Put any code or scripts for the project in the src folder.
  • Name all files to reflect their content or function, with NO special characters (!@#$%^*) or spaces! Use underscores or dashes, A-Z, and numbers!

A good general outline for project structure.
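The layout above can be created in one go from the terminal. This is only a sketch: "my_project" and the data file name are placeholders, and the folders are created under a temporary directory so that nothing on your machine is touched.

```shell
# "my_project" is a placeholder; use your own project's name and location.
project=$(mktemp -d)/my_project

# One folder each for documents, raw data, results, and code.
mkdir -p "$project/doc" "$project/data" "$project/results" "$project/src"

# A made-up raw data file, with write permission removed so it is not edited by accident.
echo "id,value" > "$project/data/raw_measurements.csv"
chmod a-w "$project/data/raw_measurements.csv"

ls "$project"
```

Removing write permission is one simple way to honour the "raw data are read-only" rule; it can be undone with chmod u+w if a file ever needs replacing.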

1.3 Hosting Platforms

We are going to work backwards today and interact with platforms that host git before getting into the nitty-gritty of git on the command line.

There are many platforms that host git; I’ve listed the most popular four below:

Table 1.1: A comparison of the four most popular git repo hosting platforms.

Name | Manager | Est. | Free Software | Open Source | Private Repos | Ad Free
--- | --- | --- | --- | --- | --- | ---
GitLab | GitLab B.V. | 2011 | Yes (partial on server) | Yes | Yes | Yes
GitHub | Microsoft | 2008 | No | No | Paid | Yes
BitBucket | Atlassian | 2008 | No | No | Yes | Yes
SourceForge | BizX LLC | 1999 | Yes | No | No | No

1.3.1 Examples

How are people using GitHub and GitLab now? Well, for open science, open humanities, open data, open code – all things open! Here are a few different types of examples:

Open Data

Dissertation Writing

Building a professional website

Random interesting stuff

1.4 GitLab & GitHub

GitLab is an open source git hosting platform that is rapidly rising in popularity, for a few key reasons: all features are free to all users, to a very reasonable degree, and there are many features that set GitLab apart from other services. It has continuous integration built into each repository, and free LFS (Large File Storage), so we can share larger files within a repository. Another big plus – GitLab integrates with a lot of great tools and services, like JIRA, Kubernetes, and the Open Science Framework.

GitHub is a commercial code sharing platform that has gained lots of popularity in the academic community. It offers a web interface and a mixture of free and paid services for working with git repositories. The majority of the content that GitHub hosts is open source software, though increasingly it is being used for other projects. It also integrates with third party software, like the Open Science Framework and Travis CI.

For the sake of our workshop, we are going to work in GitLab. You can actually log into GitLab with your GitHub account, so head to gitlab.com/users/sign_in and log in to get ready to go!

Side note: if you want to collaborate between platforms, you can! GitLab has an automatic mirroring function to sync changes between GitHub and GitLab, so you can work on GitLab and make your work discoverable on GitHub, or collaborate with your community on both platforms.

1.4.1 Working in GitLab

When you are logged into gitlab.com, you should be able to see a + sign in the top right-hand corner. This will let you create a new empty repository! You can choose the permission level of the repository – 100% private, internal (private but visible to folks logged into GitLab), or 100% public. You can start out 100% private and switch to public whenever you feel comfortable, for free!

New GitLab repository

We can keep a copy of our code locally and in this central repository on GitLab. This helps us make sure our code isn’t only stored in one place (our laptops) at any given time. But it also lets us collaborate on code with both our colleagues and strangers!

For our colleagues, we can add them as collaborators within our repository with varying levels of permission - we can even give them an expiration date, if their term on a project ends on a certain date!

Looking at our collaborators in GitLab

CHALLENGE 1:

  1. Break into pairs
  2. Decide who is Person A and who is Person B.
  3. Person A: make a repository on GitLab and add Person B as a collaborator.
  4. Raise your hand to show you’ve finished!


1.4.2 Getting started in the repository

There are a few key things everyone needs in their repository. The first important file is the README.md file. A README file broadly contains information about other files in a directory. It is usually in a plain text format, like markdown (.md) or text (.txt). A good README contains:

HEADING | CONTENTS | QUESTIONS TO ANSWER
--- | --- | ---
TITLE/SUMMARY | General information | What does your project do? How is it used? Share your vision!
AUTHORS | Credits for maintainers | Who is responsible for this project?
GETTING STARTED | Installation & dependency instructions | If someone were to pick up your project today, what dependencies would they need to install to get it to work?
LICENSE | Copyright and licensing information | How can others extend, use, remix, and distribute your work? Is there a particular citation format to use?
CONTRIBUTING | Guide for prospective contributors | How can others help? Make it easy for others to get involved by letting them know how to submit new features, report issues, or offer other assistance.
THANKS | Acknowledgments | OSS can sometimes be thankless. Don’t be that person! Acknowledge the entities who help you. You can even provide a link to your say thanks inbox to pay that effort forward.

Choosing a license

Choosing a license is an important part of openly sharing your creative work online. For help in wading through the many types of open source licenses, please visit https://choosealicense.com/.
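As a starting point, the README headings listed above could be laid out like this. The sketch below writes a skeleton README.md into a temporary folder; every line of text inside it is a placeholder to be replaced with your own project's details, and the choice of Markdown headings is an assumption about how you want the file to look.

```shell
# Write a starter README.md into a scratch folder; all text below is placeholder content.
readme_dir=$(mktemp -d)
cat > "$readme_dir/README.md" <<'EOF'
# Project Title

One or two sentences on what the project does and how it is used.

## Authors

Who is responsible for this project?

## Getting Started

Dependencies to install and how to get the project running.

## License

How others may use, remix, distribute, and cite this work.

## Contributing

How to submit new features, report issues, or otherwise help.

## Thanks

Acknowledge the people and organisations who helped.
EOF

# Count the section headings that were written.
grep -c '^## ' "$readme_dir/README.md"
```

The same text can be pasted straight into the GitLab web editor when creating README.md there.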

CHALLENGE 2:

  1. Person A: Make a README.md file in your repository in the GitLab interface (you don’t need all the above sections! Pick a few).
  2. Save (‘commit’) the new file in the GitLab interface with a good message!
  3. Person B: make a change after Person A to add some extra description.
  4. Raise your hand to show you’ve finished!


After completing this challenge, you will have used many key features of version control: seeing who did what and when, with descriptive messages, and with each change uniquely identified. Each change you save, either in the GitLab interface or with git locally, is recorded as a commit.

A commit records changes to the repository, and is assigned a unique hash that users can leverage for many purposes, like rewriting history! We’ll look at this later. In the GitLab interface, when you click the Commits (X) button, you will be able to see a visual timeline of all the commits from a given project.

CHALLENGE 3:

  1. Look at your commit history – right now, it should have two changes for the README.md
  2. Try to find out how to see the difference between your commits in the GitLab interface.
  3. Raise your hand to show you’ve finished!


How do your commit messages look?

The messages we attach to our commits are extremely important. Past us can’t answer emails, so our commit messages have to do the explaining.

A good commit message is concise, descriptive, and informative. Aim for 50 characters or fewer (and try to avoid screaming!).

Commit messages from XKCD

1.4.3 Collaborating

As we discussed when we started this lesson, one of the best things about using version control is that it gives us the ability to collaborate reliably. There are a few ways we can encourage and facilitate collaboration in our repositories!

Issues

Issues help you keep track of the work happening on your project - they act much like a to-do list mixed with a discussion forum. In GitLab and GitHub, you can link issues to specific commits or merge requests, or even close issues with specific relevant commits.

What’s more, we can label the issues in our repository to attract outside contributions. One great example is the Hacktoberfest issue label on GitHub! This label was created for the Hacktoberfest event, an annual online celebration of open source where folks get a prize for contributing at least four pull requests between October 1–31 in any timezone. Maintainers use the label to solicit collaborations and contributions from those who want to make their four pull requests!

There are other such labels: help-wanted, beginner-friendly, and the list goes on. If you need help on an issue, a label is a good way to solicit that!

Side note: make sure you and the maintainers on the repository agree on labels! You don’t want a lot of duplicates confusing folks!

Forks & Merge Requests

Everyone we don’t want to give direct access to a repository (complete strangers who want to help us, for instance!) must fork our repository and submit a merge request to get their contributions integrated into ours! Sometimes we even want to fork projects where we do have access, to get peer review before we integrate our contributions!

A fork is a copy of a repository in your namespace (under your account). Forking a repository allows you to freely experiment with changes without affecting the original project.

A merge request is when you want to integrate the changes you made into the original repository you forked. You describe the changes you made and make sure your changes don’t conflict with the original repo’s code.

I have a friend who asked me to look over his R package to format articles for Copernicus journals. I need to fork the repository, make changes, and then contribute my review and changes back to him.

The first step in this process is to fork a repository. GitLab has made this as easy as a button click:

Forking a repository in GitLab

You can then choose where you want to put the new repository – into your own account, or a group account! I am going to put this repository into my own account.

Forking a repository in GitLab

From here, I can make edits and commits, everything the same as with my own repository, since it is my own repository (notice it’s under my account now)!

Forking a repository in GitLab

CHALLENGE 4:

  1. Person B: Fork Person A’s repository.
  2. Person B: Make a change and save it (with a good commit message).
  3. Raise your hand to show you’ve finished!


1.5 Working Locally

I can do a lot in the GitLab interface, but to work on Daniel’s project with any success, I need to work in RStudio and test out the way his new Copernicus package works. Now, I need to get the contents of the repository on my local computer!

This means we have to use git on the command line!

1.5.1 Configure git

Before being able to use git to work on projects, you first need to configure git with your name and email address. In a project, everyone needs to see what exactly other collaborators have been doing. In a version control system like git, this is done through two commands run in the Terminal.

On Windows, you can search cmd to get to the terminal, and on Mac, you can search Terminal in the spotlight search. You should see a small black window show up. Type:

$ git config --global user.name "Your Name"

Replace ‘Your Name’ with your given name and your family name in any order (if applicable; if not, use the name that uniquely identifies you!). Hit enter after you’ve typed the full line. Next, type the following:

$ git config --global user.email "your@email.com"

Replace ‘your@email.com’ with your email address. Next, we’ll need to tell git which text editor we favour in case we ever need to deal with merge conflicts (we won’t in this class, but in the future…).

$ git config --global core.editor "gedit"

Replace ‘gedit’ with your favourite plain text editor. This could be simply Notepad (Windows) or TextWrangler (Mac), but NOT Microsoft Word, LibreOffice Writer, or other rich text editors. We need it to be as plain as can be!

At any point in the process, you can double check everything you’ve just put in with:

$ git config --list

In the end, this is sort-of what you should be aiming for:

Configure git in the terminal

CHALLENGE 5:

  1. Everyone: open up the terminal and configure git.
  2. Raise your hand to show you’ve finished!


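Always remember that if you forget a git command, you can list the most common commands with the --help argument. To get help on one specific command, add the -h flag to it for a short usage summary, or use git help to open its manual page.

$ git --help
$ git commit -h
$ git help commit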

1.5.2 Getting the hosted repository

After telling git who we are, we can get the contents of our repository on GitLab on our local computer via cloning the repository! We do that by using the git clone command along with the URL of the repository we want to clone, plus .git. Let’s just put the repository on the desktop, using the change directory command, cd.

$ cd Desktop
$ git clone https://gitlab.com/username/repositoryname.git

You will be asked to put in your GitLab username and password, and then you’ve just cloned your repository! Now you can work like normal; in this case, it means I try out Daniel’s R package to see if it works for me!

CHALLENGE 6:

  1. Everyone clone your own repository.
  2. Create a project structure in your git repository from yesterday that resembles the best practice from the project management section.
  3. Raise your hand to show you’ve finished!


1.5.3 Making changes locally

So far, to make a commit on GitLab, we just edit the file in the platform and click the commit button.

Locally, git uses a two-stage commit process. Changes must first be added to the staging area, then committed from there. This two-stage process gives us a lot of control over what should and should not be included in a particular commit. This is the workflow you’ll use over and over again locally:

  1. git add filename.extension
  2. git commit -m "super descriptive commit message"

These two commands, git add and git commit, are required to record all our local changes. They let us track a single file, a select group of files, or everything in the repository (not necessarily recommended).

Let’s Track a File

Please follow along with me as I type the commands in this next section only if you are Person B! Person A, sit tight for now and follow along on Person B’s computer (your quiz time will come!). Let’s begin with:

  1. Open your plain text editor.
  2. Type out: “hi there everyone, I am learning Git.”
  3. Save this file as hi.txt in the cloned repository folder.
  4. Go back to the command line.

On the command line, let’s look at the status of our project:

$ git status

This status is telling us that git has noticed a new file in our directory that we are not yet tracking – the filename hi.txt should be red. To tell git that we want to track any changes we make to hi.txt, we use git add. This adds the new txt file to the staging area (where git checks for file changes). Type the following as separate commands:

$ git add hi.txt
$ git status

The filename hi.txt should be green now, which is git visually cueing us to the fact that there is a new file waiting for us to commit to it! Before we do that, let’s just make one more change.

  1. Open hi.txt in our plain text editor again.
  2. Add a new line to the file (be creative!).
  3. Save the file.
  4. Go back to the command line.

Let’s see what git thinks about our latest change with the status command:

$ git status

Git tells us that we’ve indeed made another change to our file which isn’t staged. We can add the new version of the file to the staging area with the same command from before:

$ git add hi.txt

When we think it’s ready, we can commit to our new version of the text file!

$ git commit -m "Created hi.txt"

Having made a commit, we now have a permanent record of what was changed, along with metadata about who made the commit and at what time.
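The add-then-commit cycle, and the control the staging area gives us, can be sketched end to end as follows. This is an illustrative script rather than part of the lesson: it runs in a temporary folder, sets a placeholder identity for that repository only, and introduces a hypothetical second file (todo.txt) to show that only staged files end up in the commit.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Person"        # placeholder identity, this repository only
git config user.email "person@example.com"

echo "hi there everyone, I am learning Git." > hi.txt
echo "scratch notes" > todo.txt              # hypothetical file we are not ready to commit

# Stage only hi.txt, then commit: todo.txt stays out of this version.
git add hi.txt
git commit -q -m "Created hi.txt"

committed=$(git ls-files)                    # files git is now tracking
leftover=$(git status --short)               # what is still uncommitted
commit_id=$(git rev-parse --short HEAD)      # abbreviated hash of the new commit

echo "committed:   $committed"
echo "uncommitted: $leftover"
echo "commit id:   $commit_id"
```

Because todo.txt was never added, it is reported as untracked and does not appear in the commit.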

CHALLENGE 7:

  1. Person B: make a change to a file.
  2. Add the file and commit it.
  3. Raise your hand to show you’ve finished!


1.5.4 Viewing and rewriting history

To review what you’ve been up to, type this in the terminal:

$ git log

This will list your commits with their IDs, date/time of creation, associated person, and commit messages. If you want to only look at the changes to a specific file, enter this command in the terminal:

$ git log filename.extension

This will list changes as before, but only those affecting this file, such as the one we just created! Remember that weird number from git log next to commits? This unique hash allows you to refer to that version, and you can use it to view, rewrite, and overwrite your history! The checkout command tells Git to revert files back to the version listed: give it the hash first, then the file name. So, to restore hi.txt to the way it was in a given commit, we use:

$ git checkout <hash> hi.txt
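Here is one way the round trip could look: going back to an older version of a file and then returning to the newest one. It is a sketch under a few assumptions: a throwaway repository in a temporary folder, a placeholder identity, and two commits made on the spot so that there is some history to move through. The hash is looked up with git log rather than typed by hand.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Person"        # placeholder identity, this repository only
git config user.email "person@example.com"

# Two versions of the same file, one commit each.
echo "hi there everyone, I am learning Git." > hi.txt
git add hi.txt
git commit -q -m "Created hi.txt"
echo "a second line" >> hi.txt
git add hi.txt
git commit -q -m "Add a second line to hi.txt"

# Look up the abbreviated hash of the oldest commit that touched hi.txt.
first_hash=$(git log --reverse --format=%h -- hi.txt | head -n 1)

# Restore hi.txt as it was in that commit (hash first, then the file).
git checkout "$first_hash" -- hi.txt
old_lines=$(wc -l < hi.txt | tr -d ' ')

# Bring back the most recent committed version.
git checkout HEAD -- hi.txt
new_lines=$(wc -l < hi.txt | tr -d ' ')

echo "old version: $old_lines line(s), newest version: $new_lines line(s)"
```

The -- separator is optional here, but it makes explicit that what follows is a file name and not a branch or hash.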

CHALLENGE 8:

  1. Person B: view the history of our hi.txt file.
  2. Revert to the first saved version of hi.txt.
  3. Switch back to the most recent version of hi.txt.
  4. Raise your hand to show you’ve finished!


Though I’m not showing you how to do this right now, you can actually revert to a version of a file permanently. This is a bit beyond scope, but it’s do-able as you get more comfortable with git!

At the moment our changes are only recorded locally, on our computer. If we wanted to work collaboratively with someone else they would have no way of seeing what we’ve done.

1.6 Syncing local changes to your hosting platform!

When we’ve added and committed to our heart’s content, it’s time to put our changes on the Internet! We will have to “push” our local changes to the GitLab or GitHub repository. We do this using the git push command:

$ git push -u origin master

Note: For the sake of our beginner class, only push once everything has been committed and the working directory is clean. When you get more advanced, you can get fancy about what to push when, but not right now!

The nickname of our remote repository is “origin” and the default branch name is “master” (if git status reports that you are on a branch called “main” instead, use that name in the command). The -u flag tells git to remember the parameters, so that next time we can simply run git push and git will know what to do.

You may be prompted to enter your GitLab username and password to complete the command. After the command is finished running, go to your repository on GitLab, hit refresh, and see your changes reflected there!
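To try a push without touching a real GitLab project, a local "bare" repository can stand in for the hosted one. That stand-in is an assumption made purely for illustration, as are the folder names and the placeholder identity; with a real project, origin would already point at your GitLab URL after cloning.

```shell
work=$(mktemp -d)
cd "$work"

# A bare repository stands in for the project hosted on GitLab.
git init -q --bare hosted.git

# A local repository with one commit, linked to the stand-in under the nickname "origin".
git init -q local
cd local
git config user.name "Example Person"        # placeholder identity, this repository only
git config user.email "person@example.com"
echo "hi there everyone, I am learning Git." > hi.txt
git add hi.txt
git commit -q -m "Created hi.txt"
git remote add origin "$work/hosted.git"

# Use whatever the default branch is called on this machine ("master" or "main").
branch=$(git rev-parse --abbrev-ref HEAD)
git push -q -u origin "$branch"

# After the push, the hosted copy points at the same commit as the local one.
local_head=$(git rev-parse HEAD)
hosted_head=$(git --git-dir="$work/hosted.git" rev-parse "$branch")
echo "local:  $local_head"
echo "hosted: $hosted_head"
```

Reading the branch name from git instead of hard-coding it keeps the sketch working whichever default branch name is configured.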

So back to our scenario

I have finished making changes to Daniel’s work, and I want him to get my review and contribution. I added, committed, and pushed all my changes to my repository on GitLab. Now I want my changes to be integrated into the official/original repository, so I make a merge request! This too, GitLab has made easy for us. Click the ‘Merge Request’ tab on the GitLab sidebar. From there, it’s a simple button click to start your merge request:

Starting your merge request in GitLab

Then, GitLab will show you all the changes made and the differences between the code in each repository. You can compare to make sure you don’t have any conflicts! Then, you’ll have to describe all the changes you’ve made to the code:

Forking a repository in GitLab

The last step is simply to submit the merge request and await feedback!

CHALLENGE 9:

  1. Person B: Push your changes to your repository on GitLab.
  2. Make a merge request back to Person A’s original repository.
  3. Person A: add a line of discussion to the merge request thanking Person B for their contributions, and click the merge button!
  4. Raise your hand to show you’ve finished!


In Person A’s GitLab repository, y’all should be able to see the commit history from Person B!

1.6.1 Pulling changes

Let’s say Person A wants to keep the copy of their project up-to-date with the version on GitLab, especially after such a great and important merge request! Person A can now use the pull command to bring changes from a remote repository to the local repository. It’s different from cloning because it only gets the changes we don’t currently have; cloning gets the whole repository!

To pull changes from the project hosted on GitLab to our local computer, we open the terminal, navigate to our project folder using the cd (change directory) command, and type:

$ git pull

Git tells us that we have fast-forwarded our local repository to include the most recent changes from GitLab.
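The difference between cloning and pulling can be seen with two local copies of one shared repository. In this sketch a bare repository in a temporary folder plays the role of GitLab, and the folders person_a and person_b are stand-ins for two collaborators' computers; all names and identities are placeholders.

```shell
work=$(mktemp -d)
cd "$work"
git init -q --bare hosted.git                # stand-in for the repository on GitLab

# Person B clones the (still empty) project, commits, and pushes.
git clone -q hosted.git person_b             # git warns that the repository is empty; expected
cd person_b
git config user.name "Person B"              # placeholder identity
git config user.email "b@example.com"
echo "hi there everyone, I am learning Git." > hi.txt
git add hi.txt
git commit -q -m "Created hi.txt"
branch=$(git rev-parse --abbrev-ref HEAD)
git push -q -u origin "$branch"

# Person A clones: this fetches the whole repository as it stands now.
cd "$work"
git clone -q hosted.git person_a

# Person B pushes one more change afterwards.
cd "$work/person_b"
echo "a second line" >> hi.txt
git add hi.txt
git commit -q -m "Add a second line to hi.txt"
git push -q

# Person A pulls: only the missing commit is transferred and applied.
cd "$work/person_a"
before=$(wc -l < hi.txt | tr -d ' ')
git pull -q
after=$(wc -l < hi.txt | tr -d ' ')
echo "lines before pull: $before, after pull: $after"
```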

CHALLENGE 10:

  1. Person A: Pull the changes from your repository on GitLab to your local computer.
  2. Add, commit, and push a file.
  3. Raise your hand to show you’ve finished!


1.6.2 Sync a fork with the original repo!

Your local repository is currently set up to get information from your repository hosted on GitLab. BUT - your repository hosted on GitLab is a fork of someone else’s repository! So let’s learn how to sync a fork and the original repository. The first step is to see the current configured remote repository:

$ git remote -v

The output should look something like this:

origin    https://gitlab.com/YOUR_USERNAME/YOUR_FORK.git (fetch)
origin    https://gitlab.com/YOUR_USERNAME/YOUR_FORK.git (push)

This basically tells us what we already know – your local repository is linked to your GitLab repository. Origin here is just the name for the URL to your repository on GitLab. Let’s add another link to the original person’s repository! We do this by specifying a new remote upstream repository that will be synced with the fork.

$ git remote add upstream https://gitlab.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY.git

Verify the new upstream repository you’ve specified for your fork by using that previous remote -v command:

$ git remote -v
origin    https://gitlab.com/YOUR_USERNAME/YOUR_FORK.git (fetch)
origin    https://gitlab.com/YOUR_USERNAME/YOUR_FORK.git (push)
upstream  https://gitlab.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY.git (fetch)
upstream  https://gitlab.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY.git (push)
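With both remotes in place, syncing comes down to pulling from upstream and pushing to origin. The sketch below rehearses that with local stand-ins: original.git plays Person A's repository, fork.git plays Person B's fork, and the folder names and identities are placeholders. The exact commands you need on a real project may differ, for example if your branch is not the default one.

```shell
work=$(mktemp -d)
cd "$work"

# Stand-ins: original.git is Person A's repository, fork.git is Person B's fork of it.
git init -q --bare original.git
git clone -q original.git person_a           # git warns that the repository is empty; expected
cd person_a
git config user.name "Person A"              # placeholder identity
git config user.email "a@example.com"
echo "# Demo project" > README.md
git add README.md
git commit -q -m "Add README"
branch=$(git rev-parse --abbrev-ref HEAD)
git push -q -u origin "$branch"
cd "$work"
git clone -q --bare original.git fork.git    # "forking" the original
git clone -q fork.git person_b               # Person B's local copy of the fork

# Person A adds a change to the original after the fork was made.
cd "$work/person_a"
echo "More detail." >> README.md
git commit -q -a -m "Expand README"
git push -q

# Person B: link the original as "upstream", pull from it, then update the fork.
cd "$work/person_b"
git remote add upstream "$work/original.git"
git pull -q upstream "$branch"
git push -q origin "$branch"

original_head=$(git --git-dir="$work/original.git" rev-parse "$branch")
fork_head=$(git --git-dir="$work/fork.git" rev-parse "$branch")
echo "original: $original_head"
echo "fork:     $fork_head"
```

Once both hashes match, the fork has caught up with the original repository.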

CHALLENGE 11:

  1. Person B: Add Person A’s original repository as your upstream repository.
  2. Pull their changes.
  3. Push the changes you received from the upstream repository to your origin repository.
  4. Raise your hand to show you’ve finished!


1.7 CONGRATS!

You have all just collaborated with someone using Git and GitLab!!

1.8 Further Reading

Git:

  • Pro Git book: The entire Pro Git book, written by Scott Chacon and Ben Straub and published by Apress (available in many languages!).
  • TryGit: enter git commands in-browser to help reaffirm beginner git skills!
  • Git: The Simple Guide: step-by-step Git tutorial.
  • Think Like A Git: for someone who’s been using Git, but doesn’t feel they really understand it.

GitLab:

GitHub: