2 Cleaning Data

**Video of this chapter’s lecture can be viewed on YouTube: https://www.youtube.com/watch?v=bzEwuPjnHPQ**

Now that we know how to track changes to all our project files, we can start cleaning our data, doing our analyses, and more! If you haven’t already, please download the data from osf.io/39fus/download and save it somewhere you can easily find it.

2.1 What makes data messy??

“Messy data” is data that’s full of weird inconsistencies, either because of human error or poorly designed systems (like Excel!). These inconsistencies make it nigh on impossible to do any meaningful searching, statistics, or other analysis!

Table 2.1: “Using OpenRefine by Ruben Verborgh and Max De Wilde, September 2013”

| Dates         | Price        | State  |
|---------------|--------------|--------|
| 2015-10-14    | $1000        | ID     |
| 10/14/2015    | 1000         | I.D.   |
| 10/14/15      | 1000         | US-ID  |
| Oct 14, 2015  | 1000 dollars | idaho  |
| Wed, Oct 14th | US$1000      | idaho, |
| 42291         | $1K          | Idhaho |

CONVERSATION BREAK 1:

What’s wrong with the data in this table?


Messy data has in fact caused a lot of problems in many research domains. These are perhaps the two most famous examples:

Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics

A little detective work traced the problem to default date format conversions and floating-point format conversions in the very useful Excel program package. The date conversions affect at least 30 gene names; the floating-point conversions affect at least 2,000 if Riken identifiers are included. These conversions are irreversible; the original gene names cannot be recovered.

Growth in a Time of Debt

This famous economics paper showed that countries with debt over 90% of their gross domestic product (GDP) have a negative growth rate; it was published at the same time that Greece was having an economic crisis. But researchers could not actually replicate the paper’s results. Eventually, researchers from UMass asked the authors for their data (an Excel spreadsheet), and it turned out that there was a mistake in one of their Excel formulas: they had erroneously excluded 5 countries from their analysis. If the results had been reproducible from the beginning, this mistake would have been discovered much earlier, perhaps by reviewers in time for publication, which would have avoided the bad publicity.

CONVERSATION BREAK 2:

Have you had similar experiences with messy data? With your data or other people’s?


But there’s no easy conversion standard for other kinds of data, such as names. If you wanted to get all rows with the name “Nicholas Wolf”, you’d miss the rows that had “NICHOLAS WOLF” and “Wolf, Nicholas”. Even differences in capitalization throw off most programs.

It’s easy enough to write a piece of software that ignores capitalization and punctuation. It takes a little more thinking to have that software ignore middle initials and names if they exist in certain rows but not others. You can see how doing all of this would be tedious. Enter OpenRefine!
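
As a sneak peek at how OpenRefine tackles this, its expression language GREL (which we’ll meet later in this chapter) has a fingerprint() function that does precisely this kind of normalization. This is a sketch, assuming a reasonably recent OpenRefine version:

value.fingerprint()

This lowercases the text, strips punctuation, and sorts the words alphabetically, so “Nicholas Wolf”, “NICHOLAS WOLF”, and “Wolf, Nicholas” all come out as nicholas wolf and can be matched as one.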

2.2 OpenRefine

OpenRefine, formerly Google Refine and before that Freebase Gridworks, is an open source tool that allows users to load data, clean it quickly and accurately, transform it, and even geocode it. The main uses of OpenRefine are data cleanup and transformation to other formats. What’s more, all the actions performed on a dataset are stored in the project and can be replayed on another dataset!

Why Use OpenRefine?

  • Simple installation
  • Lots of great import formats: TSV, CSV, XML, RDF Triples, JSON, Google Sheets, Excel
  • Upload from local drive or import from URL
  • Many export formats: TSV, CSV, Excel, HTML table
  • Works with large-ish datasets (100,000 rows). Can adjust memory allocation to accommodate larger datasets.
  • Useful extensions: geoXtension, Opentree for phylogenetic trees from Open Tree of Life, and many more (listed here, scroll to ‘extensions’)!
  • Active development community

Getting Help

All the documentation, books, and tutorial videos are hosted at openrefine.org. There is also a Google Group where you can both search past questions (and get answers!) and post your own questions (and get answers!). There is also a very detailed Wiki that helps you find relevant functions quickly.

2.2.1 Installation & Running

To make sure your installation goes smoothly, ensure you have the most up-to-date Java JRE installed (get it from java.com/en/download/). You can then download and install OpenRefine from openrefine.org/download.html.

On Windows, you can start OpenRefine by double-clicking on the openrefine.exe file. Java services will start automatically on your machine, and a terminal window pops up! You don’t need to do anything with this terminal; just let it run in the background. A window will open in your default browser to begin your OpenRefine session. On a Mac, OpenRefine can be launched from your Applications folder. If you are using Linux, you will need to navigate to your OpenRefine directory in the command line and run ./refine.

Note: If a new browser window does not open, then go to your favourite browser and visit the URL 127.0.0.1:3333/. Even though OpenRefine works in the browser, we are using it 100% locally.

2.2.2 Starting a Project

Once OpenRefine is launched in your browser, you’ll see three options on the left sidebar: Create Project, Open Project, and Import Project. We are going to start a new project!

  1. Click Create Project and select Get data from This Computer.
  2. Click Choose Files and select the dataset you’ve downloaded for this class. Click Open or double-click on the filename.
  3. Click Next>> under the browse button to upload the data into OpenRefine.

Before you fully import the file into your project, OpenRefine lets you preview it to make sure everything is OK: set the number of header rows, set the encoding (utf-8, y’all), and/or pick the right file type (if CSV vs. TSV gets confused, for instance!). If this is the wrong file, click <<Start Over in the upper left corner of the screen. If all looks well, click Create Project>> in the upper right corner of the screen.

Settings when we preview our data.

CHALLENGE 1:

  1. Create a project in OpenRefine.
  2. Import the raw data you’ve downloaded for the class, checking to make sure that you’ve configured the import correctly in the Preview.
  3. Raise your hand to show you’ve finished!


2.2.3 Exploring Data

Now that our data is in OpenRefine, let’s check it out!

Faceting

Facets are a great way to see the big picture of your data: when you look at facets for a given column, you see all of its unique entries along with their frequencies. You can use that to get a feel for how messy (or maybe not!) your data is. You can also use facets to subset rows that you want to change in bulk.

One type of facet is called a ‘text facet’. This groups all the identical text values in a column and lists each value with the number of records it appears in. The facet information always appears in the left-hand panel of the OpenRefine interface. As well as ‘Text facets’, OpenRefine also supports a range of other facet types. These include:

  • Numeric facets
  • Timeline facets (for dates)
  • Custom facets
  • Scatterplot facets

Some of the default custom facets are:

  • Word facet - this breaks down text into words and counts the number of records each word appears in
  • Duplicates facet - this results in a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row
  • Text length facet - creates a numeric facet based on the length (number of characters) of the text in each row for the selected column. This can be useful for spotting incorrect or unusual data in a field where specific lengths are expected (e.g. if the values are expected to be years, any row with a text length more than 4 for that column is likely to be incorrect)
  • Facet by blank - a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for rows missing key data.
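
Several of these custom facets are, under the hood, just GREL expressions (we’ll properly meet GREL shortly). For instance, something like the following, entered via Facet > Custom text facet…, reproduces the duplicates facet; this is a sketch using the university column from our dataset:

facetCount(value, 'value', 'university') > 1

facetCount() counts how many rows share the current value in the named column, so the expression returns ‘true’ exactly when a value appears more than once.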

RECAP: ‘Faceting’ groups similar values that appear in a column, and then allows you to filter the data by these values and edit values across many records at the same time.

Here we will use faceting to fix all the ridiculous characters in the university column.

  1. Scroll over to the university column.
  2. Click the down arrow and choose Facet > Text facet.
  3. In the left panel, you’ll now see a box containing every unique value in the university column along with a number representing how many times that value occurs in the column.

CHALLENGE 2:

  1. Try sorting this facet by name and by count. Do you notice any problems with the data? What are they?
  2. Hover the mouse over one of the names in the Facet list. You should see that you have an edit function available.
  3. Raise your hand to show you’ve finished!


If you find an error you want to fix, you can do it this way! OpenRefine gives you the option to edit when you hover over a facet – you could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we’ll use instead. We’ll learn about these when we talk about clustering.

BEFORE that, let’s fix some of the data! In the university column you may have noticed values that look like Lumi%C3%A8re University Lyon 2. And there are many more of these!! This is kind of ugly and neither very human- nor machine-readable. Let’s fix it with something called GREL!

The text facet of the university column

GREL stands for the Google Refine Expression Language, and it’s a way we can automate changes. You can use GREL to query APIs, change data formats, split columns, and a whole lot more. OpenRefine lets you choose between GREL, Python or Jython (an implementation of Python designed to run on the Java platform), and Clojure (a dialect of the Lisp programming language). GREL does what we need most simply, so let’s test GREL out on the university column!

Click the down arrow to the left of the university column header. Click Edit column, then Add column based on this column. A window will pop up waiting for you to input a command! The first thing we’ll do, since it’s the first box, is name our new column (let’s call it fixed_uniName, which we’ll use later). The GREL snippet we’ll use is:

value.unescape('url')

We are telling GREL that for every value in the column, apply the unescape function in its 'url' mode. You might notice that the weird symbols in the university column follow a similar pattern: a % followed by some capital letters and/or numbers. This is URL percent-encoding, and unescape('url') decodes it.
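
For example, Lumi%C3%A8re becomes Lumière after unescaping: %C3%A8 is the URL percent-encoding of the UTF-8 bytes for the character è.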

Adding a column based on a column in OpenRefine

One of my favourite things about OpenRefine is that when you are doing these advanced operations, it gives you a preview before you execute! The Preview tab in the window is the default, but you can also go to the Help tab to view every single command available to you: sortable, searchable, and star-able!

Now that all the names are cleaned up, we can find all the nuanced spellings and other semantic mistakes that would otherwise take forever to find. In OpenRefine, we do this with clustering. Clustering means “finding groups of different values that might be alternative representations of the same thing”. Think back to my “NICHOLAS WOLF”, “Wolf, Nicholas”, and “Nicholas Wolf” example from before. Clustering is a very powerful tool for cleaning datasets which contain misspelled or mistyped entries. OpenRefine has several clustering algorithms built in; I basically click through each one until I’m sure I’ve caught all the misspellings.

Let’s try it out!

  1. In the fixed_uniName column text facet, click the Cluster button.
  2. In the resulting pop-up window, you can flip between the algorithm configurations.
  3. You should find at least two clusters.
  4. Click the Merge? box beside each cluster, then click Merge Selected and Recluster to apply the corrections to the dataset.
  5. Try selecting different Methods and Keying Functions again, to see what new merges are suggested.

One type of clustering to find duplicates/misspellings

Another type of clustering to find duplicates/misspellings

CHALLENGE 3:

  1. Pick another column of text from our dataset.
  2. Try out faceting and clustering to make sure the cells are consistent.
  3. Raise your hand to show you’ve finished!


Regular expressions?!

You can use regular expressions in GREL to powerfully repurpose and redefine your data! A regular expression (regex) is a sequence of characters that defines a search pattern. There’s tons of escaping to do, lots of special characters to watch out for, and it’s generally really hard to build the pattern. HOWEVER! Once we find the patterns to use (and there are websites that can help us…), it’s pretty powerful.

CHALLENGE 4:

  1. Using faceting, find out how many different established values there are in the survey results.
  2. Raise your hand to show you’ve finished!


You may have noticed in the established column, there is a mix of dates and formats, such as 1963 - university status, 1918-05-01, and Chartered 1984. So, let’s try to normalize all the information in this column using GREL and regular expressions!

The one thing that every format has in common is a year in YYYY format. So, in the established column, click Edit cells > Transform…. You can run the GREL code below to find the four-digit year in each cell and transform the cells to only have that value!

value.match(/.*(\d{4}).*/)[0]
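
A quick note on what this does: in GREL, match() returns an array of the regular expression’s capturing groups (or null if the pattern doesn’t match), so [0] picks out the first and only group here, the four digits captured by (\d{4}). For example, "1963 - university status".match(/.*(\d{4}).*/) returns [ "1963" ], and [0] takes that single element.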

CHALLENGE 5:

  1. Is the column formatted as Text or Date?
  2. Use faceting to produce a timeline display for established. You will need to use Edit cells > Common transforms > To date to convert this column to dates.
  3. During what period were most of the universities established?
  4. Raise your hand to show you’ve finished!

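If the common transform has trouble parsing some cells, you can be explicit about the expected format with GREL via Edit cells > Transform…. This is a sketch, assuming the cells now contain bare four-digit years:

value.toDate('yyyy')
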

Extra special Find & Replace

Regular expressions are only one way we can transform the information in a cell. We can also use the replace() and contains() functions to normalize our data!

CONVERSATION BREAK 3:

In the endowment columns, there are lots of stray values. Can you see a few we need to look out for?


There are a fair few things we’d like to replace. In the endowment column, click Edit cells > Transform…. You can run the GREL code below to strip away the currency notation and leave just the numbers; each line builds on the one before it:

value.replace("USD","")
value.replace("US $","").replace("US$", "")
value.replace("US $","").replace("US$", "").replace("$","")

Oh no! We missed one part of the endowment column that still needs to be normalized: the original owner of this dataset added the word ‘million’ instead of adding the necessary zeroes. Let’s replace that so we can have all numerical values in the endowment column. We need to find all the values that contain ‘million’ (the contains() function!) and replace it (the replace() function!), multiplying by 1,000,000 to add those six zeroes. We can use the GREL statement below in the endowment column > Edit cells > Transform….

if(value.contains("million"), toNumber(value.replace("million", "").trim()) * 1000000, value)
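
If you’d rather do the whole endowment cleanup in a single pass (the currency stripping plus the ‘million’ expansion), you can chain everything together using with(), which binds the result of an expression to a temporary variable. This is a sketch, assuming the stray values are exactly the ones we spotted above:

with(value.replace("USD", "").replace("US $", "").replace("US$", "").replace("$", "").trim(), v, if(v.contains("million"), toNumber(v.replace("million", "").trim()) * 1000000, v))

We’ll see with() again in the geocoding example below.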

Geocoding with Google’s API

We can do a lot more than just cleaning with OpenRefine! I use it a lot for geocoding, and we can do that in two lines of GREL! Let’s give it a go with our dataset. Navigate to the country column and click Edit column and then Add column by fetching URLs. This gives us the same window we’ve seen many times over by now. Name the new column geocode and run the following GREL code:

"http://maps.google.com/maps/api/geocode/json?sensor=false&address=" + escape(value, "url")

This takes the contents of the country column, queries the Google Maps API with each value, and puts the resulting JSON response in the new column. We can’t do much with the raw JSON, though. What we really want is the lat and long coordinates! To do this, navigate to the geocode column in OpenRefine and click Edit column and then Add column based on this column. This gives us the same window we’ve seen many times over by now. Name the new column latlong and run the following GREL code:

with(value.parseJson().results[0].geometry.location, pair, pair.lat + ", " + pair.lng)

This takes each value of geocode and uses the parseJson() function to literally parse all the JSON. The Google Maps API returns this bit of JSON from the query:

 "location" : { "lat" : 46.227638, "lng" : 2.213749 }

And we used parseJson() to find the lat and lng in the location field of the JSON! Now we have a column that contains the lat and long separated by a comma. This is easier to deal with: in OpenRefine, you can click Edit column and then Split into several columns…, making sure you tell OpenRefine that the separator is a comma. You can then just rename the columns to lat and long respectively.
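
Alternatively, you could skip the split-and-rename step entirely and pull each coordinate straight out of the geocode column with two separate Add column based on this column operations. A sketch:

value.parseJson().results[0].geometry.location.lat
value.parseJson().results[0].geometry.location.lng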

Undo some of your work

  1. Click where it says Undo / Redo on the left side of the screen. All the changes you have made so far are listed here.
  2. Click on the step that you want to go back to, in this case go back several steps to before you had done any text transformation.
  3. Visually confirm that those columns now contain the special characters that we had removed previously.
  4. Notice that you can still click on the later steps to Redo the actions. Before moving on to the next lesson, redo all the steps in your analysis so that all of the columns you modified are fully cleaned up again.

2.2.4 Documenting our work

As you conduct your data cleaning and preliminary analysis, OpenRefine saves every change you make to the dataset. These changes are saved in a format known as JSON (JavaScript Object Notation). You can export this JSON script and apply it to other data files. If you had 20 files to clean, and they all had the same type of errors (e.g. genes encoded as dates, name misspellings, leading white spaces), and all files had the same column names, you could save the JSON script, open a new file to clean in OpenRefine, paste in the script and run it. This gives you a quick and reproducible way to clean all of your related data, across operating systems!
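
To give you a sense of what this looks like, here is roughly the shape of a single step in the extracted JSON history (a sketch of a text transform applying our unescape expression; the exact fields depend on your OpenRefine version and the operations you ran):

[
  {
    "op": "core/text-transform",
    "engineConfig": { "facets": [], "mode": "row-based" },
    "columnName": "university",
    "expression": "grel:value.unescape('url')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column university"
  }
]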

  1. In the Undo / Redo section, click Extract..., and select the steps that you want to apply to other datasets by clicking the check boxes.

The panel that shows us our data cleaning history as JSON

  2. Copy the code from the right-hand panel and paste it into a text editor (like Notepad on Windows or TextEdit on Mac). Make sure it saves as a plain text file.

Let’s practice running these steps on a new dataset. We’ll test this on an uncleaned version of the dataset we’ve been working with.

  3. Create a new OpenRefine project and upload the uncleaned version of the dataset which you saved to your computer (in a folder in your git repository!).
  4. Make sure that you name the new project something different from the one we’ve been working in!
  5. Click the Undo / Redo tab > Apply and paste in the contents of the txt file with the JSON code.
  6. Click Perform operations. The dataset should go through the same data cleaning operations as the processed dataset from class!

For the sake of time, we used the same dataset, but in reality you could use that script to clean any related datasets (as long as folks use the same column headers!).

2.2.5 Save & Export

When we save our OpenRefine project, we are saving not just the data, but also all the information about the cleaning and data transformation steps you’ve done. Once you’ve saved a project, you can open it up again and be just where you stopped before (and others can extend your existing work, and see the provenance of how your data was cleaned!).

OpenRefine by default autosaves your projects. If you close OpenRefine and open it up again, you’ll see a list of your projects. You can click on any one of them to open it up again.

You can also export a project, which is quite helpful for sending to collaborators (alongside your raw data) to follow (or critique…) your data cleaning steps, since the OpenRefine project contains provenance of your work. You could even elect to share this information as a supplement to a publication!

CHALLENGE 6:

  1. Click the Export button in the top right and select Export project.
  2. A tar.gz file will download to your default Download directory. The tar.gz extension tells you that this is a compressed archive, which means that this file contains multiple files. Extract it to a folder so you can look inside.
  3. Look at the files that appear in this folder. What files are here? What information do you think these files contain?


You can also export just your cleaned data, rather than the entire project:

  1. Click Export in the top right and select the file type you want to export the data in. Comma-separated values (csv) is typically the best choice.
  2. That file will be exported to your default Download directory, from where you can share it out!

You might be thinking, “why is CSV the best format??” Well, widely supported, non-proprietary file formats like CSV improve the ability of yourself and others to use your data, improving the chances that your work can be reproduced. CSV files can be imported into any statistical program, are usable from 99% of programming languages, and are plain text and therefore trackable by git.

2.3 CONGRATS!

You can reproducibly clean your data with OpenRefine!

2.4 Further Resources

These materials are based on the following tutorials:

Other great tutorials include: