

“Between Alexandria and Babel”: Dr. Paul Dilley presents at OWU

A couple of weeks ago, Dr. Paul Dilley visited Ohio Wesleyan as a capstone (of sorts) to Dr. David Eastman’s grant-funded endeavors. Dr. Dilley is himself involved in a number of digital humanities projects at the University of Iowa and beyond, including multi-spectral imaging of ancient books (think here of the discovery of the Archimedes Palimpsest), the creation of Big Ancient Mediterranean (or BAM, an open-access project allowing the exploration and visualization of ancient texts in new ways), and the painstaking transcription and digitization of the Chester Beatty Kephalaia Codex.

[Banner image for BAM.]

Especially in the case of the latter, we might imagine Dilley as any traditional scholar of (and in) the archives, laboriously attending to the minute details of the (barely) visible word, identified only as a trace. In his talk, Dilley led the audience from this imagined space into the virtual space afforded by computational technologies. Indeed, he made this connection explicit in framing his particular brand of digital scholarship as a kind of “digital philology,” which he defines in his forthcoming publication as a set of “[n]ew scholarly interpretive practices that both produce, and are enacted by, the transfer of texts from manuscripts and the printed page to digital files subject to computational analysis and visualization” (“Between Alexandria and Babel,” forthcoming).

The proliferation of these digital textual files affords scholars across disciplines the opportunity to push against the boundaries of traditional scholarly inquiry. Before discussing two ways in which we might do this – rich annotation and distant reading – Dilley underlined the importance of open, clean, and linked datasets. As a case in point he noted how proprietary licenses on some collections of digital data may limit a researcher to “keyword search, which reproduces the functionality of the print index.” This, he adds, is to say nothing of the potential unreliability of the search: textual data produced by OCR (optical character recognition) is often suspect, and one never knows which search results may be missing as a result of bad OCR. [Full disclosure: my last job was with the Early Modern OCR Project at Texas A&M, so bad OCR and/in old books are subjects near to my heart.]
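To see why bad OCR quietly undermines keyword search, consider a small sketch (the page snippets and error pattern below are invented for illustration): an exact substring search behaves like a print index, so any occurrence the OCR engine has garbled simply vanishes from the results. The example mimics a common early-modern confusion, the long s (“ſ”) misread as “f”.

```python
# Hypothetical sketch: exact keyword search over OCR'd text silently
# drops hits that the OCR engine has garbled.

# Clean transcriptions of two (invented) pages...
clean_pages = [
    "the said author was suspected of error",
    "a most suspect passage in the treatise",
]
# ...and the same pages as an OCR engine might render them,
# with the early-modern long s misread as "f".
ocr_pages = [
    "the faid author was fufpected of error",
    "a moft fufpect paffage in the treatife",
]

def keyword_search(pages, term):
    """Exact substring search -- the 'print index' functionality."""
    return [i for i, page in enumerate(pages) if term in page]

# Against the clean transcription the search finds both pages...
print(keyword_search(clean_pages, "suspect"))  # -> [0, 1]
# ...but against the raw OCR output it finds none.
print(keyword_search(ocr_pages, "suspect"))    # -> []
```

The researcher has no way of knowing, from the empty result, whether the corpus lacks the term or the OCR simply mangled it.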

Nevertheless, as large corpora proliferate, so do research opportunities, allowing researchers to encode analytical readings into texts and to read them distantly. The latter, you may know, is a way of interpreting, from a 10,000-foot view, large collections of text that are otherwise unreadable by humans. While distant reading does not depend upon the former – encoded text – it and other digital research are deepened by this brand of “rich annotation.” On this point Dilley cites the “Gentle Introduction to XML”:

“Encoding a text for computer processing is, in principle, like transcribing a manuscript from scriptio continua; it is a process of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted.”
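As a concrete (and entirely invented) illustration of the quote above: in TEI-style markup, an editor can record an abbreviation alongside its expansion, making an interpretive decision explicit that the manuscript leaves implicit. The element names (`choice`, `abbr`, `expan`, `w`) follow common TEI conventions, but the tiny fragment here is my own; a program can then recover the regularized reading from the encoding.

```python
import xml.etree.ElementTree as ET

# Invented TEI-style fragment: the <choice> pair records, in markup,
# an editorial judgment (expanding an abbreviation) that the
# manuscript page itself leaves implicit.
fragment = """
<p>
  <choice><abbr>dns</abbr><expan>dominus</expan></choice>
  <w>dixit</w>
</p>
"""

root = ET.fromstring(fragment)

# Recover the regularized (expanded) reading from the encoding.
expanded = [e.text for e in root.iter() if e.tag in ("expan", "w")]
print(" ".join(expanded))  # -> "dominus dixit"
```

The same fragment could just as easily yield the diplomatic reading (selecting `abbr` instead of `expan`); the point is that both interpretations now live in the file, available to computation.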

To Professor Dilley’s mind, the kind of large-scale work that one might do on a digital corpus – be it mark-up (i.e. encoding), distant reading, or any other interpretive practice afforded by computational methods – may best be thought of as “extended intelligence”: in no way “artificial,” it is a very real (and heretofore impossible) supplement to conventional literary criticism.
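At its simplest, the kind of distant reading gestured at above is aggregate counting rather than page-by-page reading. The two-text “corpus” below is invented for illustration; real distant reading operates at a scale no individual could read, but the mechanics are the same.

```python
from collections import Counter

# Invented mini-corpus: distant reading trades close, page-by-page
# reading for aggregate patterns across many texts at once.
corpus = {
    "text_a": "in the beginning was the word and the word was with god",
    "text_b": "the word became flesh and dwelt among us",
}

# One of the simplest distant readings: term frequencies per text.
freqs = {name: Counter(body.split()) for name, body in corpus.items()}
print(freqs["text_a"]["word"])  # -> 2
print(freqs["text_b"]["word"])  # -> 1
```

Scaled up to thousands of texts, such counts become the raw material for the visualizations and comparisons that make an unreadably large corpus interpretable.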

I’m sure Professor Dilley would love to hear from you if you have questions about his specific projects. If you’d like to know more about text encoding or distant reading of textual data (or bad OCR for that matter!), feel free to contact me!