Scraping Data with import.io and Tabula
If you have ever found yourself beginning a data journalism project and looking for the data that would power your narrative, you've probably wondered where this fabled "data" actually lives. We live in a digital age where datasets are hosted on government and NGO websites.
But despite the best efforts of these well-meaning organizations, you sometimes find that the data you need is not shared in an accessible form, whether it is locked in a table on a website or in a PDF published in a failed attempt at "open data". This is where you need to employ a hacker skill called "data scraping".
So what is data scraping? Wikipedia defines it as "a technique in which a computer program extracts data from human-readable output coming from another program". Essentially, it means taking data from one machine and putting it on another, e.g. from a website into a spreadsheet.
There are a number of techniques you can employ to scrape data, from writing Python, R or Ruby scripts to using browser-based apps that do the job for you. But as a hack you might have neither the time nor the skill to use this plethora of techniques. This is where Hacks/Hackers comes into play.
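To give a sense of what the scripting route looks like, here is a minimal Python sketch for the simplest case: a schedule published as a plain HTML table. The URL and filenames are placeholders, not a real endpoint.

    # A minimal scraping sketch: pull an HTML table into a spreadsheet.
    # The URL below is a placeholder, not a real load-shedding page.
    # pandas.read_html needs an HTML parser such as lxml installed.
    import pandas as pd

    url = "https://example.com/loadshedding-schedule"
    tables = pd.read_html(url)      # one DataFrame per <table> on the page
    schedule = tables[0]            # assume the first table holds the schedule
    schedule.to_csv("schedule.csv", index=False)

Of course, writing and maintaining even this much code is a barrier for many hacks, which is exactly the gap point-and-click tools like import.io aim to fill.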
On Tuesday 18 November, Hacks/Hackers hosted the first in a series of workshops taking members through the data pipeline. The aim of this workshop was to teach a group of hacks and hackers data scraping using import.io and Tabula, tools that let participants scrape data from websites and PDFs respectively.
Import.io is the flagship product of the London-based company of the same name. The beauty of this tool is that it lets you scrape web data across a wide range of layouts, from a single table on a webpage to multiple tables and blocks of content spread over many pages of a site, using Extractors and Crawlers, or combinations of both called Connectors.
Tabula was the brainchild of Knight-Mozilla Fellow Manuel Aristarán who, along with fellow fellows Mike Tigas and Jeremy B. Merrill, developed the tool to meet a newsroom need: freeing data trapped in PDF documents, a format designed for printing rather than data reuse. Tabula was recently awarded a grant from the Knight Prototype Fund and continues to be used by news organizations such as ProPublica and The New York Times.
Manuel joined the group over Skype and laid out Tabula's capabilities, including the ability to scrape data from PDFs of more than 100 pages. Considering how prone government organizations in South Africa are to releasing datasets as PDFs, this capability is more than a necessity in our local newsrooms.
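For those who do want to script the same job, Tabula's extraction engine is also reachable from Python via the tabula-py wrapper. The sketch below is a minimal example under a few assumptions: the PDF filename is a placeholder, and tabula-py requires a Java runtime to be installed.

    # A minimal sketch using tabula-py, a Python wrapper around Tabula's
    # extraction engine (it requires Java to be installed).
    # The PDF filename here is a placeholder.
    import tabula

    # Pull every table out of a multi-page PDF as pandas DataFrames
    tables = tabula.read_pdf("loadshedding_schedule.pdf", pages="all")
    for i, table in enumerate(tables):
        table.to_csv("schedule_table_{}.csv".format(i), index=False)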
Just done my 1st data hack with @siyafrica at @JoziHub using @importio. Pretty trivial example we used, but what a great feeling! #boom
— Darren Smith (@DazMSmith) November 18, 2014
During the workshop the participants used import.io to scrape the list of areas being load-shed in the City of Cape Town during the recent electricity fiasco and turn it into a simple API. They also used Tabula to scrape Mogale City's load-shedding schedule, which was locked in a PDF and would otherwise have taken a while to either copy over by hand or scrape with a custom Python script.
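To see why the machine-readable version pays off, consider what becomes possible once the schedule is out of the PDF. The snippet below is purely illustrative: the CSV filename, column names and area value are hypothetical stand-ins for whatever Tabula actually exported.

    # Hypothetical follow-on: filter the extracted schedule for one area.
    # "mogale_schedule.csv", the column names and the area value are all
    # illustrative placeholders, not the municipality's real headings.
    import pandas as pd

    schedule = pd.read_csv("mogale_schedule.csv")
    my_area = schedule[schedule["Area"] == "Krugersdorp"]
    print(my_area[["Stage", "Start", "End"]])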
At the end of the evening the participants walked away with a new and valuable skill that they can now take to their respective newsrooms.
If you are interested in picking up skills along the data pipeline, we urge you to join the Hacks/Hackers meetup page, where we'll be posting announcements about the next set of workshops, which will teach skills such as cleaning datasets with OpenRefine and exploratory data analysis.