One of the British Library Labs datasets I have been working on is a collection of 40,000 digitised playbills (posters announcing a theatrical performance, pasted up or distributed on the street) from British theatres between 1750 and 1900. The data consists of metadata (theatre name, date, etc.) and PDFs containing the scanned playbills.
I’m interested in visualising cultural collections over time. A quick D3 visualisation of the metadata gives a sense of the collection as a whole. Each turquoise rectangle in this diagram represents a bound volume of playbills (the collection is stored in about 260 bound volumes in the British Library; the scanned images are organised as one PDF per volume, each around 100MB). The vertical thickness of each rectangle corresponds to the number of playbills per year contained in that volume.
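As a rough illustration of the aggregation behind a diagram like this, here is a minimal Python sketch. It is not the code used for the D3 visualisation: the record format (a volume identifier paired with a year), the function name `volume_rectangles` and the sample values are all assumptions made for the example, and "playbills per year" is read here as the volume's playbill count divided by the number of calendar years it spans.

```python
from collections import defaultdict


def volume_rectangles(records):
    """Summarise (volume_id, year) records into one rectangle per volume.

    Returns a dict mapping volume_id to (first_year, last_year, bills_per_year).
    The record format is hypothetical, chosen only for this sketch.
    """
    years_by_volume = defaultdict(list)
    for volume_id, year in records:
        years_by_volume[volume_id].append(year)

    rectangles = {}
    for volume_id, years in years_by_volume.items():
        first, last = min(years), max(years)
        span = last - first + 1  # inclusive count of calendar years covered
        rectangles[volume_id] = (first, last, len(years) / span)
    return rectangles


# Made-up sample data, not taken from the collection.
sample = [("vol-A", 1800), ("vol-A", 1800), ("vol-A", 1801), ("vol-B", 1850)]
print(volume_rectangles(sample))
```

Each resulting tuple could then be handed to D3 as one rectangle, with the first and last year setting its horizontal extent and the per-year figure setting its thickness.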
I was interested in tracing the histories of repeatedly performed plays and successful actors through these playbills. However, this sort of tagging information was not already available for each scanned playbill, so I began to explore how I might extract it myself.
This led to a couple of small experiments in using Python to extract the text from the playbill images. After running the playbills through leading open-source and commercial OCR engines (Tesseract and ABBYY FineReader respectively), I soon realised there are significant challenges in extracting text from this sort of historical document. Both OCR engines struggled with the non-standard typefaces and wide letter-spacing, so the accuracy was disappointing. Using the Python PDF parser PDFMiner, I extracted the OCR text from the playbills and output it as HTML. This makes it easy to compare the original designs with their ‘computer-read’ counterparts and pick out where they deviate.
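The extraction step might look something like the sketch below. It is an illustration under assumptions, not the script used for the project: it assumes the `pdfminer.six` fork of PDFMiner (which provides `pdfminer.high_level.extract_text`), it assumes the PDFs already carry an OCR text layer, and the helper names `extract_ocr_text` and `text_to_html` are invented for the example.

```python
import html


def extract_ocr_text(pdf_path):
    """Return the text layer embedded in a PDF as a plain string.

    Assumes the pdfminer.six package; imported lazily so that the HTML
    helper below can be used without it.
    """
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def text_to_html(text, title="Playbill"):
    """Wrap extracted text in a minimal HTML page, one <p> per non-empty line."""
    paragraphs = [
        f"<p>{html.escape(line.strip())}</p>"
        for line in text.splitlines()
        if line.strip()
    ]
    body = "\n".join(paragraphs)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body>\n{body}\n</body></html>\n"
    )
```

Calling `text_to_html(extract_ocr_text("volume.pdf"))`, where `volume.pdf` is a placeholder path, would give a page of the recognised text that can be opened next to the scanned original. PDFMiner can also emit HTML that preserves the page layout; this plain version keeps only the text, which is enough to spot misread words.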
We recently had a ‘Work in Progress’ exhibition at the Royal College of Art, where I exhibited some of these ‘computer-read’ playbills. I wanted to point up some of the challenges of applying computer techniques in the humanities.