Compose sequence standard

Wed Jun 4 10:51:54 EEST 2003

As modern Unix desktops are moving toward Unicode, it'd be nice if there were standard
compose sequences available to access a reasonable subset of Unicode on Latin alphabet
keyboards. (I'm thinking especially of American ones, simply because there are quite a lot
of them out there, and mine is one of them.)

A while back I took a stab at implementing a wider set of compose sequences in gtk, but
for this to really be useful it ought to be a standard across all X desktop programs,
which is why I am appealing to this forum.

I expect that it should be possible to expand my list a little, and I make no guarantees
that there aren't any outright bugs, but it's a starting point. My version of the file in
gtk with my expanded list of compose sequences is at:

http://www.xprt.net/~munizao/hacks/gtkimcontextsimple.c

The compose sequence table should be pretty self-explanatory even for non-gtk folks.
(This worked with gtk+ 2.0.6 and 2.0.7. I haven't had time to keep it up to date.)

I tried to follow these rules in coming up with my set of compose sequences.

1) Only two-character compose sequences are allowed.

2) Preëxisting compose sequences used for the various Latin-n character sets in X should
be retained.

3) Mnemonics for compose sequences should not be based on a particular language. One
example of a language-based mnemonic is 'v' + 'b' for vertical bar (which is retained
because of rule 2).

4) Mnemonics for compose sequences should make sense visually. However, I'm willing to
bend this for the sake of coverage. (For example, I have 'o' + ':' → 'ő'. I hope
Hungarians will forgive me, but it was the best I could do.)

5) Mnemonics for compose sequences should be consistent. '<vowel>' + '-' should always
produce the <vowel> WITH MACRON character. But coverage should be weighed against
consistency. I added 's' + '-' → 'ſ', which might be a little inconsistent, but it was the
best sequence I could find to cover that character.

6) Compose sequences should produce the same character as the corresponding reversed
sequences, unless reversing them breaks the mnemonic. For example, ':' + ')' → '☺'
(WHITE SMILING FACE) but ')' + ':' → '☹' (WHITE FROWNING FACE).

7) Sequences should cover:
a) All characters in some ISO Latin-n set.
b) All of the Latin Extended-A range.
c) Characters for which there are compose sequences with mnemonics that are analogous to
those used in a) or b). For example: '2' + '^' → '²', so '9' + '^' should produce '⁹'.
d) Characters that fill out the character gamuts of languages partially covered by a, b,
and c.
e) Other characters which one might expect to be reasonably widely used and for which
reasonably intuitive mnemonics exist. These include mathematical symbols like '≥', arrows,
characters in the Letterlike Symbols range like '℞' and '™', and a few others.
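To make the rules above concrete, here is a minimal sketch in Python of how a two-key
table following rules 1, 5, and 6 might be built. It is not the table from my gtk file;
the names (BASE_SEQUENCES, build_table, compose) are made up for illustration, and the
entries are only the examples already mentioned in the rules.

```python
import unicodedata

# Illustrative subset only: these pairs are the examples quoted in the rules,
# not a complete or authoritative table.
BASE_SEQUENCES = {
    ("o", ":"): "ő",   # rule 4: visual mnemonic bent for coverage
    ("s", "-"): "ſ",   # rule 5: coverage weighed against consistency
    ("2", "^"): "²",   # rule 7c: superscript digits by analogy
    ("9", "^"): "⁹",
    (":", ")"): "☺",   # rule 6 exception: the reversed pair is listed
    (")", ":"): "☹",   # explicitly, so it is not mirrored
}


def add_macron_vowels(table):
    """Rule 5: '<vowel>' + '-' always gives <vowel> WITH MACRON."""
    for vowel in "aeiouAEIOU":
        case = "CAPITAL" if vowel.isupper() else "SMALL"
        name = f"LATIN {case} LETTER {vowel.upper()} WITH MACRON"
        table[(vowel, "-")] = unicodedata.lookup(name)


def build_table(base):
    """Return a table where every pair also works reversed (rule 6),
    unless the reversed pair already has its own explicit entry."""
    table = dict(base)
    add_macron_vowels(table)
    for (first, second), char in list(table.items()):
        table.setdefault((second, first), char)
    return table


def compose(table, first, second):
    """Look up a two-key sequence (rule 1); None if it is not defined."""
    return table.get((first, second))


TABLE = build_table(BASE_SEQUENCES)
```

With this sketch, compose(TABLE, "-", "a") and compose(TABLE, "a", "-") give the same
character, while the two smiley sequences stay distinct because both directions are
listed explicitly.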

In some cases there will be conflicts where a given compose sequence might reasonably
apply to multiple characters. These should be decided on an ad hoc basis where the
wideness of use of the character, the intuitiveness of the compose sequence, and the
presence of reasonable alternative compose sequences for the character that loses the
conflict are all considered.

So, I want to know: would a standard expanded compose sequence list be used if I were to
make one (with, of course, the input of concerned parties)? Are there other factors that
should be taken into account in developing such a list? I know that each character added
would add a few bytes of memory to the compose sequence tables. Is this likely to be a
problem?
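
For a rough sense of scale, here is a back-of-the-envelope sketch in Python. The row
layout is an assumption made only for the arithmetic (two 16-bit key codes plus one
16-bit result per sequence); it is not necessarily how gtk or Xlib store their tables,
and the function name is made up.

```python
import struct

# Assumed row layout (hypothetical): two 16-bit key codes + one 16-bit result.
ROW = struct.Struct("<3H")


def estimate_table_bytes(num_sequences, include_reversed=False):
    """Estimate table size in bytes under the assumed row layout.

    If include_reversed is True, every sequence is assumed to be stored a
    second time with its keys swapped (rule 6), doubling the row count.
    """
    rows = num_sequences * (2 if include_reversed else 1)
    return rows * ROW.size


# ROW.size is 6, so each added character costs 6 bytes, or 12 with its mirror.
per_character = estimate_table_bytes(1)
per_character_mirrored = estimate_table_bytes(1, include_reversed=True)
```

Under these assumptions the cost grows linearly, so even a table of a thousand
sequences stored in both directions would come to about 12,000 bytes.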

**Ali