[utf-8] You better add Texinfo to your Bad Software list

Wed Mar 17 02:07:12 PST 2004

Hi Behdad,

Behdad Esfahbod <behdad at cs.toronto.edu> writes:

> :).  So add LaTeX and friends too.  [yes, yes, I know about Omega
> and friends enough...]

Don't be so sure -- TeX itself can be made to process UTF-8 almost
completely naturally.  The problem comes on the "font side", which is
not directly related to UTF-8.  I generally feel that Unicode is not
the right thing to use as a map for font creation, especially given
the dumb decisions to support "ffi" ligatures and yet not to support
things like accented Cyrillic.

I remember starting work on that in April 2002, and I may send you
some simple Plain TeX code which works at least for Serbian (both
Cyrillic and Latin [some of the codes otherwise present in
ISO-8859-2]) with specially constructed fonts (it was a simple 1-1
correspondence with Unicode code points, with one font file
containing 256 glyphs).  The idea was very simple: make the first
bytes of all UTF-8 sequences \active (category code 13, i.e. make
them commands), and calculate the wanted Unicode code point from the
following bytes.
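
To illustrate just the arithmetic, here is a rough sketch in Python
rather than TeX.  It is not the Plain TeX code mentioned above; the
function names are made up for this example, and it assumes
well-formed input (it does not reject overlong forms or surrogates).
The first byte tells you how many bytes follow, and each following
byte contributes six bits:

```python
from typing import Tuple


def sequence_length(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1          # plain ASCII, no special handling needed
    if 0xC2 <= lead <= 0xDF:
        return 2          # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3          # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4          # 11110xxx
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")


def decode_one(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (code point, next position) for the sequence at data[pos]."""
    lead = data[pos]
    n = sequence_length(lead)
    if pos + n > len(data):
        raise ValueError("truncated UTF-8 sequence")
    if n == 1:
        return lead, pos + 1
    # Keep only the payload bits of the first byte: 5, 4 or 3 of them.
    cp = lead & (0xFF >> (n + 1))
    for b in data[pos + 1 : pos + n]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte (10xxxxxx)")
        cp = (cp << 6) | (b & 0x3F)   # append six payload bits
    return cp, pos + n
```

In the TeX version the role of sequence_length() is played by which
macro the active first byte expands to: a two-byte starter grabs one
more byte as its argument, a three-byte starter grabs two, and so on.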

The problem I never actually started on was hyphenation.  Yet, since
TeX is a "Turing-complete" programming language :), you can be certain
that it can be resolved as well -- though maybe not in a nice way.  It
would be much better if it were supported natively.  Perhaps it's
possible to do it with TeX hyphenation patterns as well, simply by
re-declaring those starting UTF-8 bytes as category "other" (12) or
"letter" (11, I think) before the patterns.

This approach also leaves the problem of command names which cannot
contain UTF-8, unless you come up with totally brutal hacks where TeX
wouldn't be doing anything anymore.

Of course, anything else (like bi-di, paragraph rules etc.) would
have to be reprogrammed, and thus any real benefits TeX provides for 
"simple" ltr languages are not available in those cases.

I also believe LaTeX is able to treat UTF-8 fairly well (look at the
examples provided with the bare distribution of "xmltex", which uses
LaTeX for processing), though I don't use it, so don't hold me to
that :)

Still, I *do* agree that there should be a note regarding (La)TeX's
lack of native UTF-8 support and the ways to work around it (perhaps
mentioning Omega and Lambda, though they have not seen much work in
recent years [I've been subscribed to the list for a long time, and
had even forgotten that I am]).

So, by all means, add the relevant info to the Bad Software page --
others will complement it with what they're aware of.

Cheers,
Danilo