Saturday, January 23, 2010

Firefox 3.6 in Ubuntu

One problem with Linux is that the latest releases of cross-platform applications often aren't as readily available or as easily installed as they are on Mac and Windows computers, as Preston Gralla recently observed. Although I would dearly like to have disagreed with him, after two unsuccessful attempts to install the recently released Firefox 3.6 using two different sets of instructions, I have to agree he's got a point. Although I succeeded with the directions here, I didn't know that I had until I logged out and back in, something I guessed at rather than was told. Other than that it was, as promised, "super easy": three simple steps with a few "y"s and a password.

My consistent experience is that Ubuntu is excellent for most purposes for "ordinary" users if you don't want to do anything unusual, including installing the latest versions of some software (Firefox and TeX Live among them). If you do, you may get lucky and be able to find instructions that will work the first time around; if not, possibly the second or third. But — and this is something I think about both Mac OS X and Windows — it really doesn't seem like it should be this hard. There are certain things I prefer about Ubuntu's software installation: Opening up a Terminal window and typing sudo apt-get install [whatever] works surprisingly often and is even easier than the process of finding and installing new software in Mac OS X and (in my very limited experience) Windows. But why can't it always be easy?

In my work as de facto tech support at my job (the one-eyed man among the more-or-less blind), when I look at what I have to do to fix even some simple things, things I do so commonly that they've become second nature to me, I realize that to ordinary users they're entirely unintuitive. Admittedly complexity is a price that must be paid for power, but does so much have to be that complex? Well, maybe someday; here we are in 2010 and I still don't have my jet pack, after all.

Friday, January 15, 2010

Tesseract OCR

Optical character recognition is a useful capability to have, and although I have it on my Mac thanks to OmniPage SE, the bundled software included on the CD that came with my flatbed scanner, I naturally wanted to be able to do the same on my Linux machine too. A search quickly turned up Tesseract as the best option for Linux, and although for some reason the new-to-Karmic Ubuntu Software Center didn't list it, Synaptic Package Manager did and helpfully made sure I had everything I needed in the way of related packages.

There's a passage in Suetonius in which Hyginus' life gets a paragraph treatment, and although the English has been transcribed at Lacus Curtius from an old Loeb, the Latin hasn't been - at least there; at first I didn't search too far for a text version because I wanted an excuse to try out Tesseract on a text image and Google Books has kindly put that same Loeb volume on-line with both Latin and English.

I started off using GIMP to take a screen shot of the text at the Lacus Curtius site. There are a bunch of options out there for partial and full screen shots, but GIMP lets you fiddle with the image to make it more readable and save it as the .tif that Tesseract requires. It was nice and easy in GIMP, too: File -> Create -> Screen Shot, followed by saving as an uncompressed .tif.

Tesseract is a command line utility with a simple syntax: tesseract [image file] [what you want to call the file with the OCR results]; in Ubuntu you can drag the file's icon from a File Browser (=Mac OS X's Finder) to provide Terminal with the correct path. The file with the results is put in your current working directory (i.e., your user folder unless you've changed directories since starting this Terminal session) and the file name you choose for your output will be provided with a .txt suffix automatically.
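
For anyone who would rather script the call than type it, here is a minimal Python sketch of the syntax just described. It only builds the command and works out where the results should land; the file names are hypothetical, and actually running the command (for instance with subprocess.run) assumes Tesseract is installed.

```python
import shlex
from pathlib import Path


def tesseract_command(image_path, output_base):
    """Build the argument list for: tesseract [image file] [output name]."""
    return ["tesseract", str(image_path), str(output_base)]


def expected_output_file(output_base, cwd="."):
    """Tesseract adds the .txt suffix itself, so only the base name is given."""
    return Path(cwd) / (str(output_base) + ".txt")


# Hypothetical file names, used only for illustration.
cmd = tesseract_command("/home/user/Desktop/hyginus.tif", "hyginus")

# shlex.join quotes any path containing spaces, much as dragging an icon
# into Terminal does.
print(shlex.join(cmd))
print(expected_output_file("hyginus"))
```

Passing the list to subprocess.run(cmd, check=True) would run the OCR and leave the text file in the current working directory.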

I did all this and got, well, dismal results, although by squinting you can see some resemblance between the original and what it produced. Here's the first line:

111.:.22 ¤|:Z ln» |1mH·| (3ai11s -111`Ii11s I—Iy*gi1111s., a f1`eecl111a11 c»f 4\11g11st11s a

I remembered that in my researches I'd come across some advice on processing a text image before running it through Tesseract. Although I'd thought it wouldn't be necessary, clearly something had to be done, and I decided this was a promising next step. I followed steps 4.2-3 (4.1 wasn't necessary) and ran it through again, and this time:

20 EI Gaius Julius Hrginus, a freednian of Augustus and a Spaniard by birth (some think that he was a native of Ale·<andria and was brought to Rome when a boy by

Not bad, not bad at all.

And how did it work on the Latin from the Loeb image?

XX. C. lulius Hyginus, Augusti libertus, naticnc
Hispanus, (nnnnulli Alexandrinum pumnt et .1
Caesarc pucmm Rnmmn adductum Alexandria cupta)

Not so great. Maybe undoing and redoing, plus increasing the Threshold (Tools->Color Tools->Threshold...) as suggested at the page above?

XX. C. Iulius Hyginus, Augusti libertus, nntiune
Hispmus, (nnunulli Alcxandrinum putunt ct a
Caesar: pucrum Rumsm adductum Alcxandrin capta)

Some gains, some losses. Overall about a wash. That's a little disappointing, since the text was fairly clear as scans of hundred-year-old books go, but it's still probably a little better than retyping de novo.
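
As I understand it, the Threshold adjustment used above boils down to picking a cutoff and turning every pixel either pure black or pure white. The Python sketch below shows that idea on a made-up patch of grayscale values; it is an illustration of the principle, not GIMP's actual code, and the numbers are invented.

```python
def threshold(pixels, cutoff):
    """Map each grayscale value (0-255) to pure black (0) or pure white (255)."""
    return [[255 if value >= cutoff else 0 for value in row] for row in pixels]


# A made-up 3x4 patch: dark ink, light paper, and a column of gray smudge.
patch = [
    [30, 200, 120, 240],
    [45, 210, 135, 250],
    [20, 190, 110, 245],
]

# Raising the cutoff pushes the mid-gray column from white to black,
# which thickens faint strokes but can also fill in letters.
for cutoff in (100, 140):
    print(cutoff, threshold(patch, cutoff))
```

That trade-off is probably why raising the threshold fixed some letters in the Latin sample and damaged others.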

All in all, I'm satisfied if not delighted.

Saturday, January 2, 2010

Punctuation and Diacriticals in Ubuntu

One hope — one might even call it a resolution — for this year is to get my Hyginus project published in some form. I'm fairly close to finishing my mark-up of the text into XeLaTeX, which is tedious but can't easily be automated. After that I plan to do a thorough revision with more comments in the source file and to check for additions to the literature on the Fabulae published since I finished the original version in AppleWorks (which was long enough ago that the application had stopped being developed but Apple hadn't yet totally pulled the plug on it).

Since I'm also hoping to learn more about Linux, I'm working on this project mostly on my ThinkPad rather than my Mac. Since I'm also continuing my quest to make the key commands of Vim second nature, so that it will be faster than a conventional text editor in practice and not just in theory, I've been doing my mark-up in that editor; some time ago I added useful macros for LaTeX mark-up to my .vimrc file so I can add a variety of tags and move to the next line with easy pairs of keystrokes.

A few days ago while writing HTML mark-up in Vim for my day job, I was reminded again that I wasn't able to do my usual find-and-replace for single and double curly (a.k.a. "smart") quotes; in the past I've shrugged my shoulders and opened the file with gedit and used its more obvious interface to switch them. In Mac OS X you can type those curly characters easily with combinations of option, shift, and square brackets, but those particular key combinations don't work the same way in Ubuntu (or possibly Gnome or Linux in general). The obvious solution of copying and pasting that I use in my favorite mouse-centric Mac text editor, Smultron (sadly no longer under development), is foiled by Vim's modal nature: hitting p to paste simply types "p" into the find space in a substitute command. What to do?

After using gedit to clean up my work code and uploading the file, I remembered that sometime in the past I'd seen a Vim command that would provide information about a character under the cursor. A few seconds' search turned it up: ga. Now, would it be possible to use that information to type a character not transparently accessible?

As it turns out, yes, although it wasn't as obvious as it might have been. Using ga prints a line at the bottom of the window like this one for a question mark:

63, Hex 3f, Octal 077

Checking charts of code points, I found that 63 is the decimal value of the question mark. I eventually found in the Vim documentation that typing Control-V, with or without an additional letter, followed by the character's code would let me enter any character I wanted if I knew its code point. Control-V u followed by up to four hexadecimal digits (a 16-bit value) is the most useful because, if my understanding of Unicode code ranges is correct, it'll work for any Western language.
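
To save a trip to the code charts, a short Python sketch can report the same numbers. The first function imitates the line shown above for the question mark; the second gives the four hexadecimal digits to type after Control-V u. Both function names are my own inventions for illustration.

```python
def describe(char):
    """Report a character's code point in decimal, hex, and octal."""
    code = ord(char)
    return f"{code}, Hex {code:02x}, Octal {code:03o}"


def ctrl_v_u_digits(char):
    """Four hex digits to type after Control-V u, or None if 16 bits aren't enough."""
    code = ord(char)
    return f"{code:04x}" if code <= 0xFFFF else None


print(describe("?"))

# The single and double curly quotes that started this whole investigation.
for quote in "\u2018\u2019\u201c\u201d":
    print(quote, ctrl_v_u_digits(quote))
```

So the left double curly quote, for example, should be Control-V u 201c.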

Turning to Hyginus, the challenge there lies in producing macrons. One of the first things I did when I installed Ubuntu was to try to figure out how to add diacritical marks like accents, and I learned quickly that it's necessary to select a modifier key in System->Preferences->Keyboard->Layouts->Layout Options->Compose Key Position, which gives you seven options (although my keyboard lacks two of the choices). I picked the right control key and that's worked well for me. Key combinations for common (and some uncommon, like the dot-less i, ı, used in Turkish) Latin alphabet diacriticals are explained at this helpful page but macrons are not among those listed.

I found a partial answer by looking inside my /usr/share/X11/locale/en_US.UTF-8/Compose file: macrons are made by using the compose key and underscores. Oddly, however, the same keystrokes that successfully overline e, i, u, and y instead turn a and o into the feminine and masculine ordinal indicators (ª and º), small superscript letters that are underlined in some fonts as well. Unfortunately I'm not able to interpret the Compose file, which is not transparent in its meaning.

Interestingly, e, i, and u are all overlined by right-control--hyphen as well as by right-control--underscore (i.e., the shift key isn't necessary), but to overline y the shift key has to be used or the Japanese yen currency character is produced.
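
To see exactly which characters are in play, here is a small Python sketch using the standard unicodedata module. It does not read or imitate the Compose file; it simply attaches the combining macron (U+0304) to each vowel, lets Unicode normalization pick the precomposed letter, and then names the two ordinal indicators for comparison.

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def with_macron(letter):
    """Return the precomposed macron form of a letter, where Unicode has one."""
    return unicodedata.normalize("NFC", letter + COMBINING_MACRON)


# The six vowels discussed above, each with its code point and official name.
for vowel in "aeiouy":
    long_vowel = with_macron(vowel)
    print(vowel, long_vowel, f"U+{ord(long_vowel):04X}", unicodedata.name(long_vowel))

# The look-alikes that compose-underscore produced for a and o.
for char in "\u00aa\u00ba":
    print(char, f"U+{ord(char):04X}", unicodedata.name(char))
```

The code points it prints are also the ones to use with Vim's Control-V u if no keyboard layout cooperates.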

Fortunately, there's a simple solution for macrons for the 5 common vowels: switching to the Maori keyboard, a trick I learned from an early version of Mac OS X. Adding the Maori keyboard is pretty simple: System->Preferences->Keyboard->Layouts->Add...

And that's a handy ellipsis, because it inspired me to think to check whether Ubuntu might include something comparable to OS X's US Extended keyboard, and in fact there's a "USA International (with dead keys)" which creates macrons for all 5 vowels the way I'd expected. On the other hand, it also requires use of the right ALT key to create single and double straight quotes, so it's a mixed blessing.

Next layout research project? Polytonic Greek, which is apparently not well-served in its ancient form by the built-in "Greece Polytonic" keyboard. I'll probably start here.