Introduction to Git

Why do we care about version control?

Version control is “the management of changes to documents, computer programs, large web sites, and other collections of information” (from Wikipedia).

Basically, it’s a way for us to compare, restore, and merge changes to our stuff. Rather than emailing documents with tracked changes and comments, and renaming different versions of files (example.txt, exampleV2.txt, exampleV3.txt) to tell them apart, we can use version control to save all that information alongside the document itself. We want to avoid this:

PhD comics – a tale of many versions

Version control makes it easy to get an overview of every change made to a file over time by looking at a log. All earlier versions of each file remain in their original form: they are not overwritten, and we can always go back in time to view their contents. So we don’t need v3, v4, etc. suffixes on our files; we can just time travel.

The most important benefits of version control are:

Collaboration - When several people collaborate on the same project, it’s possible to accidentally overlook or overwrite someone’s changes. Version control systems flag any conflict between one person’s work and another’s, and the people involved then decide together what to keep or overwrite. Using version control also makes it easier to contribute to open source software, which can be a very large-scale collaboration!
Versioning - Having a robust and rigorous log of changes to a file, without renaming files with suffixes like v1, v2, final_copy.
Rolling Back - Version control allows us to quickly undo a set of changes. This can be useful when new writing or new additions to code introduce problems. Since all old versions of files are saved, it’s always possible to go back in time to see exactly who wrote what on a particular day.
Understanding - Version control can help you understand how the code or writing came to be, who wrote or contributed particular parts, and who you might ask to help understand it better. We know who to ask if we have questions later on about the codebase!
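
To make the "time travel" idea concrete, here is a minimal sketch in the terminal. The temporary folder, the file name notes.txt, and the "Demo User" identity are all made up for illustration; the Git commands themselves are standard.

```shell
# Hypothetical demo repository in a throwaway temporary folder
repo="$(mktemp -d)"
cd "$repo"
git init -q
# Placeholder identity, set for this demo repository only
git config user.name "Demo User"
git config user.email "demo@example.com"

# Save a first version of the file
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add first draft"

# Save a second version under the same file name (no V2 suffix needed)
echo "second draft" > notes.txt
git add notes.txt
git commit -q -m "Revise draft"

# Versioning: the log lists every saved version, newest first
git log --oneline

# Rolling back: view the file as it was one version ago
git show HEAD~1:notes.txt    # prints "first draft"

# Understanding: see who last changed each line, and when
git blame notes.txt
```

The file on disk still says "second draft"; the older version is only printed to the screen, so nothing is lost by looking back.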

What is Git?

Git is one of the most widely used version control systems in the world. Git is not the same thing as GitHub. Git is a free, open source tool that can be downloaded to your local machine or server and used for logging all changes made to a folder (referred to as a “Git repository” or “repo” for short) over time. Git works best for plain-text formats like .csv, .py, .json, and more. Binary files such as Microsoft Word documents (.docx) and PDFs can be stored in a Git repository, but Git cannot show meaningful line-by-line differences for them or merge changes to them.

Git can be used to control file versions locally by you alone on your computer, but is perhaps most powerful when employed to coordinate simultaneous work on a group of files shared among distributed groups of people. A Git repository contains, among other things, the following:

Snapshots of your files (text works best, but images and other binary files can be stored too)
References to these snapshots, called heads
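
As a rough sketch of what "heads" look like in practice, the commands below build a tiny repository and then ask Git what its references point to. The folder, the file readme.txt, and the identity are placeholders invented for this example.

```shell
# Hypothetical demo repository in a throwaway temporary folder
repo="$(mktemp -d)"
cd "$repo"
git init -q
# Placeholder identity, set for this demo repository only
git config user.name "Demo User"
git config user.email "demo@example.com"

# Take one snapshot
echo "hello" > readme.txt
git add readme.txt
git commit -q -m "First snapshot"
# Name the branch "main" whatever the local default happens to be
git branch -M main

# HEAD points at the branch we are currently on
git symbolic-ref HEAD            # prints "refs/heads/main"

# Each branch head is a reference to one snapshot (a commit id)
git for-each-ref refs/heads

# HEAD and the branch head resolve to the same commit id
git rev-parse HEAD
git rev-parse refs/heads/main
```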

Once installed, interaction with Git is done through the terminal <3 You may be able to interact with Git from your favorite IDE (like VS Code or Atom), but I’ve found that working with Git on the terminal is a more transferable skill to have. You can open a terminal on almost any machine you are ever likely to work with, but you may not always have access to your favorite IDE (sorry 😓).

Examples

How are people using Git and Git hosting platforms now? Well, for open science, open humanities, open data, open code – all things open! Here are a few different types of examples:

Open Data

US Congressional data: github.com/unitedstates/congress-legislators
NYU HSL Data Catalog: github.com/nyuhsl/data-catalog
OpenBenches.org - open data for memorial benches: gitlab.com/edent/openbenches

Dissertation Writing

“How the environment of a simulated swarm affects evolved flocking behaviours” by Jacob Causon: gitlab.com/jake314159/Y3_final_report
“Open Source Code and Low Resource Languages” by Richard Litt: github.com/RichardLitt/thesis
“Homology of Moduli Spaces” by Felix Jonathan Boes: gitlab.com/DerFelix/phd_thesis

Building a professional website

Wootton Cybersecurity Club: gitlab.com/wsec/wsec.gitlab.io
Luc Sarzyniec: gitlab.com/olbat/olbat.gitlab.io
Adam Sparks: github.com/adamhsparks/adamhsparks.github.io

Random interesting stuff

Open Powerlifting Dataset & Website: gitlab.com/openpowerlifting/opl-data
bussard - a spaceflight programming adventure: gitlab.com/technomancy/bussard
is-thirteen? An npm package to check if a number is equal to 13: github.com/jezen/is-thirteen

How Git Controls Versions

Git works on branches, which represent independent lines of development. Each snapshot of files is linked to the ‘parent’ snapshot that it is built upon. By default, a repository starts on a single branch, here called “main”; other version control systems call this the trunk, because all other branches come from it. New repositories created on GitHub since October 1, 2020 use “main” as the default branch, so we’ll use that terminology to keep everything consistent. We’ll learn more about branches later in the tutorial; for now, let’s just examine how Git works.
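
The parent link between snapshots can be inspected directly. This is a small sketch under made-up assumptions (a temporary folder, a file called data.csv, a placeholder identity), not part of the session's exercises.

```shell
# Hypothetical demo repository in a throwaway temporary folder
repo="$(mktemp -d)"
cd "$repo"
git init -q
# Placeholder identity, set for this demo repository only
git config user.name "Demo User"
git config user.email "demo@example.com"

# First snapshot: it has no parent
echo "one" > data.csv
git add data.csv
git commit -q -m "First snapshot"
# Name the branch "main" whatever the local default happens to be
git branch -M main

# Second snapshot: built on top of the first
echo "two" >> data.csv
git commit -q -am "Second snapshot"

# The branch we are on
git symbolic-ref --short HEAD        # prints "main"

# The raw commit record includes a "parent" line naming the earlier snapshot
git cat-file -p HEAD

# The parent of the newest commit is the very first (root) commit
git rev-parse HEAD^
git rev-list --max-parents=0 HEAD
```

The last two commands print the same commit id, which is the link that chains the second snapshot back to the first.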

There are three states that the files in your repository can be in locally:

You are just working normally in your working directory.
You move files to the staging area so Git knows they could potentially become part of the next version.
After you commit your changes, they become the newest version in the repository!

As you work, you move between these three states many, many times throughout the life of a project. You do this with a few simple terminal commands, which we’ll go over today!

Git stages from https://git-scm.com/about
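
As a preview, here is one way the three states might look for a single file. The folder, the file paper.txt, and the identity are invented for this sketch; `git status --short` prints a compact two-letter code showing which state each file is in.

```shell
# Hypothetical demo repository in a throwaway temporary folder
repo="$(mktemp -d)"
cd "$repo"
git init -q
# Placeholder identity, set for this demo repository only
git config user.name "Demo User"
git config user.email "demo@example.com"

# State 1: working directory (the file exists, but Git is not tracking it yet)
echo "draft" > paper.txt
git status --short        # prints "?? paper.txt"

# State 2: staging area (the file is lined up for the next version)
git add paper.txt
git status --short        # prints "A  paper.txt"

# State 3: committed (the file is saved as the newest version)
git commit -q -m "Add paper draft"
git status --short        # prints nothing: there is nothing left to save
```

Editing paper.txt again would put it back in the first state, and the cycle repeats.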

Ok, so now that you have a feel for the basics of Git, let’s move on to the practical part of the session and get on the terminal!