I was recently looking for information on the death penalty in the US and came across the website of Texas Department of Criminal Justice and their full list of offenders executed since 1982. Grim topic, without question, yet good example of interesting web data ready to be scraped. Here’s how I extracted data from the TDCJ website.
To get started, open this web page, select and copy all entries in the table and paste it in an Excel spreadsheet.
This should give you a nice table, including data and links to “Offender Information” and “Last Statement” pages for each record. These subpages contain additional information that we want. You can extract hyperlinks from “Link” column in Excel using a macro described in this tutorial.
Now, let’s extract the data hiding behind the linked subpages. I used Google Refine (it’s the same as Open Refine, by the way), which is a free download available on Google’s website. Once installed and executed it will run in your browser (you are running Chrome, aren’t you?). Create a project, choose to get data from This Computer and select the spreadsheet file you just created. Choose to parse data as “Excel file” and click “Create Project” button. Your imported data should look similar to this:
Each Info Link leads to the “Offender Information” page with additional data about each offender. Some of these pages contain just a scanned image of a hard copy document but many contain information in a nicely structured and consistent format like this:
Let’s say that we want to extract Date of Birth for each offender. Go to one of the sub-pages and view its source code (in Chrome, right-click on the page and select “View Page Source”. Search for “Date of Birth” and note its location in the HTML table:
The “Date of Birth” record is stored in the third <tr> and its first <td> element, and the main table’s class is “tabledata_deathrow_table”. Make a note of this and switch to Google Refine. Click on the drop-down menu of the column “Info Link”, select “Edit column” and “Add column by fetching URLs…”, give a new column a name and click OK. We have over 500 pages to fetch so grab a coffee, and when you come back you’ll have a new column ready with a bunch of HTML code in each cell. Click again on the drop-down menu of the newly created column, select “Edit column” and “Add column based on this column…”, name the column and put this code in the Expression window:
Note: remember that indexing starts at zero in Google Refine.
For the detailed explanation of this code refer to GREL (Google Refine Expression Language) documentation.
This should give you an idea on how to extract data from structured HTML pages with Google refine. Once you are done, export your table to Excel or text file and have fun with it in Tableau! I’ll post my interpretation of this data soon.