Open Gazettes: We don't usually ask but we'd really appreciate some help.
Code for South Africa and SAFLII (South African Legal Information Institute) are currently creating the largest digital repository of freely available gazettes online. We’re making some excellent progress but now we need your help digitising a critical part of South Africa’s history: we need support with computing power, cloud storage, and OCR. If you can help, please email us at firstname.lastname@example.org.
It might be surprising to know but you can only access gazettes at a number of libraries and only in paper format. Digital versions are only available behind paywalls and cost a fortune to access. It might also be surprising that even government pays for expensive subscriptions to gazettes, even though they are not subject to copyright and should be available to all South Africans.
The reality is that these documents, which record the history of our country stretching back to the turn of the previous century, are starting to moulder and crumble. Many libraries are simply getting rid of their paper copies due to space constraints. It’s unlikely that we’ll lose all copies but as the number of complete sets dwindles, the risk of losing them to a single devastating fire increases. Even more immediate is lack of access. While they are an asset that belongs to every South African, it is becoming increasingly difficult to find them. For journalists, researchers and lawyers trying to retrieve a particular version of a law, information about a liquidation, an international treaty or even liquor licences, searching through millions of pages is impossible.
You can read more about the project and why we think it is worth doing here.
The only answer to this problem is digitisation so that we can preserve the gazettes for future generations. Unfortunately it’s expensive to do. Old books are hard to scan: they’re fragile, and the paper is thin and tears easily. For us, it’s a labour of love and we hope that we’re using our skills for the common good. To date, we have raised R110,000 which is going towards scanning 50 years of the Transvaal gazettes (for young or international readers, this apartheid-era province covered the same area as Mpumalanga, Gauteng, Limpopo and parts of North West and some of KwaZulu-Natal). We are also receiving donations of gazettes from various parties. Our hope is to eventually create a complete collection of the gazettes. For the moment we are scanning one gazette at a time.
It’s important to note that we don’t believe that we are the owners of this corpus, nor should it be managed by us. Unfortunately, we haven’t found anyone else interested in this project, either at libraries, parliament or the government printer. In the meantime, we are acting as custodians until a more suitable owner can be found. Regardless of who hosts them, these gazettes will always be free, as they should be.
As I mentioned above, we are scanning the Transvaal gazettes. There are two ways to do this. The first is to scan in full colour at archival quality, where you can see the texture of the page, wrinkles and 50-year-old coffee stains. It’s the closest we can get to capturing the original documents and preserving them for future generations.
Download the entire file here
The second is to scan at a much lower quality which still allows us to OCR the text and make it searchable but is a poor reproduction of the original and doesn’t meet ISO archival standards.
Download the entire file here
This option is perfectly serviceable for our purposes but won’t be accepted by libraries looking to preserve the past. The former is the better option but it comes at a price. An archival-quality scan of a 70-page gazette weighs in at around 300MB - there are thousands of gazettes and millions of pages that need to be scanned. The lower quality versions compress to a hundredth of the size, which is an attractive compromise from the technical perspective but does not fully address our goal of replacing the paper format with a digital reproduction and helping preserve this important historical artefact.
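To give a feel for the scale, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above (about 300MB for a 70-page archival scan, and roughly a hundredth of that for the lower quality version). The function name and the page count of one million are our own illustrative assumptions, not a measured total for the collection.

```python
# Figures taken from the text above.
ARCHIVAL_MB_PER_GAZETTE = 300   # approximate size of one archival-quality scan
PAGES_PER_GAZETTE = 70          # pages in that example gazette
COMPRESSION_RATIO = 100         # lower quality is roughly 1/100 of the size


def estimate_storage_gb(total_pages):
    """Return (archival_gb, low_quality_gb) for a given number of pages.

    Uses decimal units (1 GB = 1000 MB) and assumes every page costs the
    same as the average page in the 70-page example.
    """
    mb_per_page = ARCHIVAL_MB_PER_GAZETTE / PAGES_PER_GAZETTE
    archival_gb = total_pages * mb_per_page / 1000
    low_quality_gb = archival_gb / COMPRESSION_RATIO
    return archival_gb, low_quality_gb


if __name__ == "__main__":
    # Hypothetical round figure: the real page count is not known here.
    archival, low = estimate_storage_gb(1_000_000)
    print(f"archival: {archival:,.0f} GB, low quality: {low:,.0f} GB")
```

On these assumptions, every million pages needs a little over 4TB at archival quality, against a few tens of gigabytes at the lower quality.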
It’s not just the storage which is a problem. We are currently running at full capacity and our developers are not able to spend the time processing the raw data for use on the OpenGazettes website while archiving the larger files for posterity. In addition to expensive scanning, high-quality OCR is not cheap.
We’re currently short on capacity, short on computing power, short on cloud storage, short on OCR capabilities and, of course, short on cash. We’d really appreciate help with any of these but most urgently, we need assistance with taking high-quality scans and processing them. Given our time constraints, we are currently planning on scanning at the lower quality as “good enough”.
We don’t usually ask for support but it would be a shame to spend a ton of effort scanning without getting the best possible quality. If you’re able to help out or know someone who is, please contact us at email@example.com. Even if you can’t help, feel free to browse around the gazettes to get a feel for what it is that we are trying to save.