Improving Open-Source OCR of Historical Books


See this project’s code on GitHub!

We propose to develop a system that will automate the transcription of eight French volumes printed in the 17th century. This project is interdisciplinary in nature: it will draw on the expertise of faculty and students from the Computer Science (CS) and French departments, as well as library representatives. The project will build upon and extend the work undertaken by Laura Burch (Assistant Professor of French, College of Wooster) on a separate project, also funded by the library’s Digital Scholarship grant. Laura Burch’s project has produced hand-transcribed and lightly encoded text files of Madeleine de Scudéry’s Conversations Sur Divers Sujets (1680). Because hand transcription is time-consuming and laborious, we propose to work with two CS students on an independent research assignment in which they will establish an automated method for computers to “read” the PDF pages of the remainder of de Scudéry’s corpus. The students will start from the knowledge base created by the Early Modern OCR Project in establishing a process by which machines read the texts using optical character recognition, or OCR.

Laura Burch uses these volumes in the classroom, but students have difficulty deciphering some fonts and following long descriptions that span many pages. An electronic version of these texts will help students both with the reading itself and with a broader and deeper understanding of the texts. Since the ultimate goal of the project is a digital edition, the text produced by these OCR processes will serve as the basis for student annotations of the Scudéry books. It will also allow students to analyze the texts at a larger scale using a research method known as “distant reading.” For instance, they will be able to use existing tools such as Voyant, which visualizes the frequency and distribution of particular words across chapters or even pages. Besides being useful in the classroom, a digital version of these texts can support future I.S. research and, through our library, will eventually be available to everyone interested in these texts.

An additional pedagogical dimension of this research involves the CS student participants. They will apply learning algorithms (such as neural networks) studied in the CS 310 Machine Intelligence class to real-world problems (such as text recognition). This is a valuable learning experience for them, as they will develop critical thinking skills by applying classroom theory to problem solving.

Below, we identify four major steps for this project.

  • Parsing and training The first step is to train the Tesseract optical character recognition (OCR) engine to recognize the distinctive fonts of these 17th-century texts. For this, we will use the Aletheia software to segment the PDF pages into individual letters. Fig. 1 illustrates the three major steps required for this training: (i) the Aletheia parser converts the PDF input into individual characters; (ii) the Franken+ tool transforms these character images into training inputs recognized by the Tesseract system (whose recognizer is a neural network); (iii) finally, Tesseract learns to associate the correct letter with each input.
  • Transcribing the volumes As illustrated in Fig. 2, the scanned pages of the six volumes will be processed by the trained OCR engine, producing plain text files.
  • Postprocessing In this step we correct errors in the resulting text files, such as incomplete words caused by ink spots or other imperfections in the original pages (see Fig. 2).
  • Exporting to various useful formats The plain text will be converted to the Text Encoding Initiative (TEI) format, a stable and well-supported standard from which other formats (e.g., HTML for web readability) can be generated. In this way we ensure portability and future usability.
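To make the postprocessing step concrete, here is a minimal sketch of the kind of cleanup pass the students might write. The specific substitution rules (normalizing the long s, undoing ligatures, rejoining words hyphenated across line breaks) are illustrative assumptions about early modern French typography, not the project's actual rule set.

```python
import re

# Illustrative character fixes for raw OCR output of early modern French.
# These rules are assumptions for the sketch, not the project's final list.
REPLACEMENTS = {
    "ſ": "s",   # long s, ubiquitous in 17th-century printing
    "ﬁ": "fi",  # typographic ligatures the OCR may emit as single glyphs
    "ﬂ": "fl",
}

def postprocess(raw: str) -> str:
    """Clean one page of raw OCR text."""
    # Rejoin words hyphenated across line breaks: "converſa-\ntions" -> "converſations"
    text = re.sub(r"-\s*\n\s*", "", raw)
    # Normalize archaic glyphs and ligatures
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    # Collapse runs of spaces introduced by layout noise
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text

print(postprocess("les converſa-\ntions"))  # -> "les conversations"
```

A pass like this runs over each plain text file before human proofreading, so the proofreaders only see errors that genuinely require judgment.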
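The export step can likewise be sketched in a few lines. The function below wraps cleaned paragraphs in a minimal TEI skeleton; a real TEI header would also need `publicationStmt` and `sourceDesc` elements, and the function name and signature are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

def to_tei(title: str, paragraphs: list[str]) -> str:
    """Wrap plain text paragraphs in a minimal TEI document (sketch only)."""
    ET.register_namespace("", TEI_NS)
    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    # Header: a full TEI header needs more, but a title suffices for the sketch
    header = ET.SubElement(tei, f"{{{TEI_NS}}}teiHeader")
    file_desc = ET.SubElement(header, f"{{{TEI_NS}}}fileDesc")
    title_stmt = ET.SubElement(file_desc, f"{{{TEI_NS}}}titleStmt")
    ET.SubElement(title_stmt, f"{{{TEI_NS}}}title").text = title
    # Body: one <p> element per transcribed paragraph
    text = ET.SubElement(tei, f"{{{TEI_NS}}}text")
    body = ET.SubElement(text, f"{{{TEI_NS}}}body")
    for para in paragraphs:
        ET.SubElement(body, f"{{{TEI_NS}}}p").text = para
    return ET.tostring(tei, encoding="unicode")

doc = to_tei("Conversations Sur Divers Sujets", ["Premiere partie."])
```

Because TEI is XML, the HTML mentioned above can then be generated from this output with an off-the-shelf XSLT stylesheet rather than custom code.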