Some time ago, we developed the SABC Truth Commission Special Report website on behalf of the South African History Archive. Recently, the site went live. While there are other websites that contain the transcripts and/or Final Report documents in various formats, this site is the first to contain a search engine that can search the details of all TRC hearings and the TRC Final Report, as well as the Victims List, TV series transcripts and a glossary.
The primary focus of this website is to make the SABC’s TRC Special Report series available to the world, with some context. All the available TRC documents were collated, encoded, and linked to relevant episodes. We began with three sources: a DVD, produced by the SABC, of all the episodes in the TRC Special Report series; the transcripts of all the TRC hearings (about 4000 HTML files); and the TRC Final Report in 80 PDF files.
The Final Report was a difficult beast to tame. We had the entire report in PDF format, but we needed to be able to produce results down to a specific paragraph. The best we could do with the PDFs themselves was a page number within a chapter and, even then, the PDF page numbers did not correspond to the published page numbers. We had to extract the information from the PDFs. To do this:
- We split the PDFs into subsections. This involved manually paging through each PDF adding bookmarks, then splitting each document by those bookmarks.
- We then converted those PDFs into HTML using the built-in Acrobat converter. It turned out that the original documents had been written in old versions of Word and then converted with a very early version of the PDF encoder. The combination meant that the HTML we ended up with only vaguely resembled what we needed. Paragraphs and heading levels were all muddled up, lists and tables were jumbled, images and graphs were mis-sized and/or incorrectly positioned, etc.
- We built an editor to fix the converted code, and we were quite proud of it. It allowed us to split, combine or move paragraphs; promote or demote headings; add or edit images, tables and lists; and do all the other little tasks we needed to clean up the Final Report code.
- Of course, someone then needed to go through all 6 volumes of the Final Report manually and fix everything. This was a long and laborious job, but the worst part was that we had to read the entire report over and over again as we checked and rechecked the formatting.
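To give a feel for what "results down to a specific paragraph" means in practice, here is a minimal sketch in Python. It is not the code used on the project; the `index_paragraphs` function, the `ParagraphIndexer` class and the `v1c2` section identifier are all made up for illustration. It assumes the HTML has already been cleaned so that each paragraph sits in its own `<p>` element, and it gives every non-empty paragraph an ID that a search result could point to.

```python
from html.parser import HTMLParser


class ParagraphIndexer(HTMLParser):
    """Collects the text of each <p> element, in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not inside a <p>"

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse runs of whitespace left over from the conversion.
            text = " ".join("".join(self._buffer).split())
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None


def index_paragraphs(html, section_id):
    """Map hypothetical IDs like 'v1c2-p1' to paragraph text."""
    parser = ParagraphIndexer()
    parser.feed(html)
    parser.close()
    return {
        f"{section_id}-p{n}": text
        for n, text in enumerate(parser.paragraphs, start=1)
    }


sample = "<h2>Heading</h2><p>First  paragraph.</p><p></p><p>Second <b>bold</b> one.</p>"
index = index_paragraphs(sample, "v1c2")
```

Headings and empty paragraphs are skipped here, so the numbering only counts paragraphs that a reader would actually see. Whether that matches the numbering in the published report is something you would have to check by hand, which is exactly the kind of work the editor was for.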
The transcripts of the TRC hearings were even more difficult to import. These came to us as a set of HTML files, but this didn't help us very much. The files were encoded into HTML by a number of different people*, who in turn were working on text transcripts from a previous group. No particular standards seemed to have been set for either group to follow. Some transcripts held multiple hearings, each separated by dashes, dots, hyphens or simply an extra line break. Conversely, some hearings spanned several transcripts, with little metadata in the documents to identify which hearing was covered. There was no standard naming convention for files, nor was there any consistency in the formatting of the content.
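As an illustration of the separator problem, the sketch below splits one transcript into hearings wherever it finds a line made up only of dashes, dots or similar characters. It is a simplified example, not the project's code: the `SEPARATOR` pattern, the minimum run length of five characters and the `split_hearings` name are assumptions. It deliberately ignores the hardest case mentioned above, where hearings are separated by nothing more than an extra line break.

```python
import re

# Assumed separator: a line containing only 5 or more of - . _ = or *
SEPARATOR = re.compile(r"^\s*[-._=*]{5,}\s*$")


def split_hearings(text):
    """Split transcript text into chunks at separator lines."""
    chunks = []
    current = []
    for line in text.splitlines():
        if SEPARATOR.match(line):
            # Close the current chunk, unless it is empty or all whitespace.
            if any(part.strip() for part in current):
                chunks.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        chunks.append("\n".join(current).strip())
    return chunks


sample = "HEARING A\nline one\n----------\nHEARING B\nline two\n..........\n\nHEARING C"
hearings = split_hearings(sample)
```

A rule this simple produces false positives as soon as a transcript uses a row of dots for something else, which is one reason a single pattern was never going to be enough.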
Our approach was to write a parser script that would use pattern recognition to extract the relevant information from the transcripts. We identified 7 major formats, and wrote parsers for each of them. Within each of those formats there were dozens of variations, and countless exceptions and complications. Over a period of weeks, we would run the scripts (which would take several hours), then check the data for issues and errors. We’d update the parsers to cater to those new issues and then run it all again. This process continued until we’d achieved an acceptable level of accuracy. By then we’d identified all the variations, and most of the exceptions and complications.
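The general shape of that approach can be sketched as follows. This is a toy version under stated assumptions, not the actual parsers: the two formats shown (`SPEAKER: words` and `[Speaker] words`) are invented for the example, as are the function names and the sample text. Each parser uses a regular expression to pull out speaker and speech pairs, and a dispatcher keeps whichever parser recognises the most lines.

```python
import re

# Hypothetical format A: "SPEAKER: words" on one line.
COLON_LINE = re.compile(r"^([A-Z][A-Z .]+):[ \t]*(.+)$", re.MULTILINE)
# Hypothetical format B: "[Speaker] words" on one line.
BRACKET_LINE = re.compile(r"^\[([^\]\n]+)\][ \t]*(.+)$", re.MULTILINE)


def parse_colon_format(text):
    return [(m.group(1).strip(), m.group(2).strip())
            for m in COLON_LINE.finditer(text)]


def parse_bracket_format(text):
    return [(m.group(1).strip(), m.group(2).strip())
            for m in BRACKET_LINE.finditer(text)]


# One entry per known format; new formats are added here as they turn up.
PARSERS = [
    ("colon", parse_colon_format),
    ("bracket", parse_bracket_format),
]


def parse_transcript(text):
    """Try every parser and keep the one that recognises the most lines."""
    best_name, best_rows = None, []
    for name, parser in PARSERS:
        rows = parser(text)
        if len(rows) > len(best_rows):
            best_name, best_rows = name, rows
    return best_name, best_rows


sample = "CHAIRPERSON: Good morning.\nWITNESS: Thank you."
format_name, rows = parse_transcript(sample)
```

A transcript that no parser recognises comes back as `(None, [])`, which gives you a list of files to inspect by hand before the next run. That check-and-extend loop is the part that took weeks.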
The process was a technical challenge, but also a difficult task emotionally – many of the stories that came through the TRC were terrifying and heart-breaking, and working through the project required us to reread these stories repeatedly. However, we are immensely proud of the part Black Square played in making these stories, along with the SABC’s excellent TV series, so accessible. Congratulations to SAHA and the SABC for making this marvelous resource available in anticipation of the 10th anniversary (on 21 March 2013) of the presentation to Government of the TRC Final Report.
*One of whom was Steve Crawford, who encoded his name into the HTML code as comments one letter at a time.