Saturday, January 2, 2010

Punctuation and Diacriticals in Ubuntu

One hope — one might even call it a resolution — for this year is to get my Hyginus project published in some form. I'm fairly close to finishing my mark-up of the text into XeLaTeX, which is tedious but can't easily be automated. After that I plan to do a thorough revision with more comments in the source file and to check for additions to the literature on the Fabulae published since I finished the original version in AppleWorks (which was long enough ago that the application had stopped being developed but Apple hadn't yet totally pulled the plug on it).

Since I'm also hoping to learn more about Linux, I'm working on this project mostly on my ThinkPad rather than my Mac. I'm also continuing my quest to make the key commands of Vim second nature, so that it will be faster than a conventional text editor in practice and not just in theory, and I've been doing my mark-up in that editor. Some time ago I added useful macros for LaTeX mark-up to my .vimrc file, so I can add a variety of tags and move to the next line with easy pairs of keystrokes.

A few days ago while writing HTML mark-up in Vim for my day job, I was reminded again that I wasn't able to do my usual find-and-replace for single and double curly (a.k.a. "smart") quotes; in the past I've shrugged my shoulders and opened the file with gedit and used its more obvious interface to switch them. In Mac OS X you can type those curly characters easily with combinations of option, shift, and square brackets, but those particular key combinations don't work the same way in Ubuntu (or possibly Gnome or Linux in general). The obvious solution of copying and pasting that I use in my favorite mouse-centric Mac text editor, Smultron (sadly no longer under development), is foiled by Vim's modal nature: hitting p to paste simply types "p" into the find space in a substitute command. What to do?

After using gedit to clean up my work code and uploading the file, I remembered that sometime in the past I'd seen a Vim command that would provide information about a character under the cursor. A few seconds' search turned it up: ga. Now, would it be possible to use that information to type a character not transparently accessible?

As it turns out, yes, although it wasn't as obvious as it might have been. Using ga prints a line at the bottom of the window like this one for a question mark:

<?> 63, Hex 3f, Octal 077

Checking charts of code points, I confirmed that 63 is the decimal value of the question mark. I eventually found in the Vim documentation that using Control-V, with or without an additional letter, would let me type any character I wanted if I knew its code point. Control-V u followed by four hexadecimal digits (a 16-bit value) is the most useful form because, if my understanding of Unicode code ranges is correct, it'll work for any Western language.
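As a rough cross-check outside Vim, the same numbers can be computed in a few lines of Python. This is only a sketch: the helper names `describe` and `vim_keys` are made up for illustration, and the output imitates just the numeric part of the ga line.

```python
import unicodedata


def describe(ch):
    """Return the decimal, hex, and octal values of one character."""
    cp = ord(ch)
    return "%d, Hex %x, Octal %03o" % (cp, cp, cp)


def vim_keys(ch):
    """Return the four hex digits to type after Control-V u.

    Hypothetical helper: it refuses anything above U+FFFF, since
    four hex digits cannot express a larger code point.
    """
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("code point needs more than four hex digits")
    return "%04x" % cp


if __name__ == "__main__":
    print(describe("?"))
    # The four curly quotes that prompted the search.
    for ch in "\u2018\u2019\u201c\u201d":
        print(vim_keys(ch), unicodedata.name(ch))
```

So a left double curly quote, for instance, should be Control-V u 201c.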

Turning to Hyginus, the challenge there lies in producing macrons. One of the first things I did when I installed Ubuntu was to try to figure out how to add diacritical marks like accents, and I learned quickly that it's necessary to select a modifier key in System->Preferences->Keyboard->Layouts->Layout Options->Compose Key Position, which gives you seven options (although my keyboard lacks two of the choices). I picked the right control key and that's worked well for me. Key combinations for common (and some uncommon, like the dot-less i, ı, used in Turkish) Latin alphabet diacriticals are explained at this helpful page but macrons are not among those listed.

I found a partial answer by looking inside my /usr/share/X11/locale/en_US.UTF-8/Compose file: macrons are made by using the compose key and underscores. Oddly, though, the keystrokes that successfully overline e, i, u, and y instead turn a and o into the feminine and masculine ordinal indicators, small superscript letters that are also underlined in some fonts. Unfortunately I'm not able to interpret the Compose file, whose meaning is not transparent.
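Whatever the Compose file does, the macron vowels can still be entered by code point with the Control-V u method. A small Python sketch (the helper name `macron_letter` is hypothetical) looks up those code points by their Unicode names and, for comparison, names the two ordinal indicators that a and o turn into:

```python
import unicodedata


def macron_letter(vowel):
    """Look up the precomposed lower-case vowel with macron by name."""
    name = "LATIN SMALL LETTER %s WITH MACRON" % vowel.upper()
    return unicodedata.lookup(name)


if __name__ == "__main__":
    for vowel in "aeiouy":
        ch = macron_letter(vowel)
        print(ch, "U+%04X" % ord(ch))

    # The two characters produced instead of a-macron and o-macron.
    for ch in "\u00aa\u00ba":
        print(ch, "U+%04X" % ord(ch), unicodedata.name(ch))
```

The hex digits printed after "U+" are the ones to type after Control-V u.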

Interestingly, e, i, and u are all overlined by right-control--hyphen as well as by right-control--underscore (i.e., the shift key isn't necessary), but to overline y the shift key has to be used or the Japanese yen currency character is produced.

Fortunately, there's a simple solution for macrons for the 5 common vowels: switching to the Maori keyboard, a trick I learned from an early version of Mac OS X. Adding the Maori keyboard is pretty simple: System->Preferences->Keyboard->Layouts->Add...

And that's a handy ellipsis, because it inspired me to think to check whether Ubuntu might include something comparable to OS X's US Extended keyboard, and in fact there's a "USA International (with dead keys)" which creates macrons for all 5 vowels the way I'd expected. On the other hand, it also requires use of the right ALT key to create single and double straight quotes, so it's a mixed blessing.

Next layout research project? Polytonic Greek, which is apparently not well-served in its ancient form by the built-in "Greece Polytonic" keyboard. I'll probably start here.

3 comments:

Simon said...

In Linux and all GTK+ applications, you can type arbitrary Unicode characters with the combination

Ctrl+Shift+u hex-unicode-value Space

This covers the whole of Unicode, including Plane 1 (the Vim shortcut covers only up to codepoint 0xFFFF).
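To see whether a given character falls inside the four-hex-digit range or needs the longer form, a short Python check is enough. This is a sketch only; the function names are made up for illustration, and the Gothic letter is just a convenient Plane 1 sample.

```python
def plane(ch):
    """Return the Unicode plane number (0 is the Basic Multilingual Plane)."""
    return ord(ch) >> 16


def fits_four_hex_digits(ch):
    """True if the code point is at most 0xFFFF."""
    return ord(ch) <= 0xFFFF


if __name__ == "__main__":
    # U+1E69 (s with dot below and dot above) and U+10330 (a Gothic letter).
    for ch in ["\u1e69", "\U00010330"]:
        print("%x" % ord(ch), "plane", plane(ch), fits_four_hex_digits(ch))
```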

If you want to type all sorts of Latin characters with diacritics, I think the best choice is to select a keyboard layout that has dead keys for all those diacritics. I happen to have a UK keyboard, and the default GB keyboard layout has 13 of them. For example, áâäãȩạȧǎőåāąṣṡṩ. Note the ṩ; diacritics can be stacked if Unicode says so.

How do you get ṩ? AltGr + /, then AltGr + Shift + /, then s = ṩ

If you use the US keyboard layout, select “USA - Alternative international (former us_intl)” in the keyboard layout settings. This is the equivalent of the default UK layout with all those dead keys.

The keyboard settings show where these dead key characters are located on your keyboard. You can also see them in /usr/share/X11/xkb/symbols/gb and /usr/share/X11/xkb/symbols/us.

Note that for Latin and Greek and the instructions above, we produce 'pre-composed' Unicode characters instead of several Unicode characters (that is, a base Unicode character followed by one or more diacritics). It's the proper thing to produce pre-composed Unicode characters if they are available for the character we want to type. All systems do this.
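The difference between the two forms can be inspected with Python's standard unicodedata module. The sketch below (the helper `code_points` is hypothetical) splits the pre-composed ṩ into its base letter plus combining marks with NFD normalization, then joins them again with NFC:

```python
import unicodedata


def code_points(text):
    """List each character of text as a U+XXXX string."""
    return ["U+%04X" % ord(c) for c in text]


# One pre-composed character: s with dot below and dot above.
precomposed = "\u1e69"
# Canonical decomposition: base letter followed by combining marks.
decomposed = unicodedata.normalize("NFD", precomposed)
# Canonical composition turns the sequence back into one character.
recomposed = unicodedata.normalize("NFC", decomposed)

if __name__ == "__main__":
    print(code_points(precomposed))
    print(code_points(decomposed))
    print(recomposed == precomposed)
```

Both forms display the same way, but the pre-composed one is a single code point while the decomposed one is three.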

Compose sequences are also interesting, with ⑩ⓔ etc. Personally I would use them for those characters that cannot be served by a keyboard layout. The compose file you mention in the post is not the file used verbatim in your distribution. This file is processed and compressed, and a few sequences are different, http://git.gnome.org/browse/gtk+/tree/gtk/gtkimcontextsimpleseqs.h

Greek and Greek Polytonic are now served by the default Greek keyboard layout (much like the default British keyboard layout). The link you mention describes how to set it up. In the layout settings you may see a 'Greece Polytonic' variant; this one is deprecated. We prefer users to try the default Greek keyboard layout for Greek, Greek Polytonic and Ancient Greek characters (ασδλκάἕᾅϡϲϟϐʹϖϛ).

Greek Polytonic and XeLaTeX work together well. One issue is that it may not work out of the box on your distro; you may need to perform some extra steps. The latest issue of the Eutypon journal (p. 45) has some more information, http://eutypon.gr/e-blog/

krok said...

"Greek and Greek Polytonic are now served by the default Greek keyboard layout"

I am very frustrated! Where is the centuries old layout? I cannot type like that! HELP!

proteusx said...

http://www.vim.org/scripts/script.php?script_id=2743