Digital Humanist: Tesseract OCR

Optical character recognition is a useful capability to have, and although I have it on my Mac thanks to OmniPage SE, the bundled software on the CD that came with my flatbed scanner, I naturally wanted to be able to do the same on my Linux machine. A search quickly turned up Tesseract as the best option for Linux, and although for some reason the new-to-Karmic Ubuntu Software Center didn't find it, Synaptic Package Manager did, and it helpfully made sure I had everything I needed in the way of related packages.

There's a passage in Suetonius in which Hyginus' life gets a paragraph's treatment. The English has been transcribed at Lacus Curtius from an old Loeb, but the Latin hasn't been, at least not there. At first I didn't search too far for a text version, because I wanted an excuse to try out Tesseract on a text image, and Google Books has kindly put that same Loeb volume online with both Latin and English.

I started off using GIMP to take a screenshot of the text at the Lacus Curtius site. There are a bunch of options out there for partial and full screenshots, but GIMP lets you fiddle with the image to make it more readable and save it as the .tif that Tesseract requires. It was nice and easy in GIMP, too: File -> Create -> Screen Shot, followed by saving as an uncompressed .tif.

Tesseract is a command-line utility with a simple syntax: tesseract [image file] [what you want to call the file with the OCR results]. In Ubuntu you can drag the file's icon from a File Browser window (the equivalent of Mac OS X's Finder) into Terminal to supply the correct path. The file with the results is put in your current working directory (i.e., your home folder, unless you've changed directories since starting this Terminal session), and the output name you choose is given a .txt suffix automatically.
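For anyone who would rather script this step than type it, here is a minimal Python sketch of the same two-argument call. The file names hyginus.tif and hyginus are hypothetical, the helper functions are my own invention for illustration, and run_ocr assumes the tesseract binary is installed and on your PATH.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base):
    """Return the argument list for: tesseract <image file> <output base name>."""
    return ["tesseract", str(image_path), str(output_base)]


def expected_output_file(output_base, workdir="."):
    """Tesseract adds the .txt suffix itself, so we only append it to locate the result."""
    return Path(workdir) / (str(output_base) + ".txt")


def run_ocr(image_path, output_base):
    """Run Tesseract and return the recognized text.

    Assumes tesseract is on PATH and output_base is a bare name,
    so the result lands in the current working directory.
    """
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    return expected_output_file(output_base).read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_tesseract_command("hyginus.tif", "hyginus"))
    print(expected_output_file("hyginus"))
```

Calling run_ocr("hyginus.tif", "hyginus") would then leave hyginus.txt in the working directory and hand back its contents as a string.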

I did all this and got, well, dismal results, although by squinting you can see some resemblance between the original and what it produced. Here's the first line:

111.:.22 ¤|:Z ln» |1mH·| (3ai11s -111`Ii11s I—Iy*gi1111s., a f1`eecl111a11 c»f 4\11g11st11s a

I remembered that in my researches I'd come across some advice on processing a text image before running it through Tesseract. I'd thought it wouldn't be necessary, but clearly something had to be done, and this looked like a promising next step. I followed steps 4.2-3 (4.1 wasn't necessary), ran the image through again, and this time:

20 EI Gaius Julius Hrginus, a freednian of Augustus and a Spaniard by birth (some think that he was a native of Ale·<andria and was brought to Rome when a boy by

Not bad, not bad at all.

And how did it work on the Latin from the Loeb image?

XX. C. lulius Hyginus, Augusti libertus, naticnc
Hispanus, (nnnnulli Alexandrinum pumnt et .1
Caesarc pucmm Rnmmn adductum Alexandria cupta)

Not so great. Maybe undoing and redoing the processing, plus increasing the Threshold (Tools -> Color Tools -> Threshold...) as suggested at the page above?

XX. C. Iulius Hyginus, Augusti libertus, nntiune
Hispmus, (nnunulli Alcxandrinum putunt ct a
Caesar: pucrum Rumsm adductum Alcxandrin capta)

Some gains, some losses; overall about a wash. A little disappointing, since the text was fairly clear as scans of hundred-year-old books go, but still probably a little better than retyping de novo.
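One rough way to put a number on "about a wash" is to compare each OCR pass against a clean reading of the passage. The sketch below uses Python's difflib for a character-level similarity ratio. The two OCR strings are the outputs quoted above with line breaks joined by spaces; the REFERENCE string is my own hand-typed reading of the Latin, not something produced in this experiment, so treat it as an assumption. The ratio is only a crude proxy for how much retyping a pass would save.

```python
from difflib import SequenceMatcher


def similarity(ocr_text, reference):
    """Character-level similarity between 0.0 and 1.0 (1.0 means identical)."""
    return SequenceMatcher(None, ocr_text, reference, autojunk=False).ratio()


# Assumed clean reading of the passage, typed by hand for this sketch.
REFERENCE = (
    "XX. C. Iulius Hyginus, Augusti libertus, natione "
    "Hispanus, (nonnulli Alexandrinum putant et a "
    "Caesare puerum Romam adductum Alexandria capta)"
)

# The two Tesseract outputs quoted in the post, line breaks joined with spaces.
FIRST_PASS = (
    "XX. C. lulius Hyginus, Augusti libertus, naticnc "
    "Hispanus, (nnnnulli Alexandrinum pumnt et .1 "
    "Caesarc pucmm Rnmmn adductum Alexandria cupta)"
)
SECOND_PASS = (
    "XX. C. Iulius Hyginus, Augusti libertus, nntiune "
    "Hispmus, (nnunulli Alcxandrinum putunt ct a "
    "Caesar: pucrum Rumsm adductum Alcxandrin capta)"
)

for label, text in (("first pass", FIRST_PASS), ("second pass", SECOND_PASS)):
    print(f"{label}: {similarity(text, REFERENCE):.3f}")
```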

All in all, I'm satisfied if not delighted.

Friday, January 15, 2010
