COMPOUND_TEXT versus UTF8_STRING

Thu Sep 23 06:31:07 PDT 2004

Roland Mainz wrote on 2004-09-22 17:01 UTC:
> ... or "enhance" the ISO2022/COMPOUND_TEXT spec (see my other email in
> this thread) to include an escape-sequence which says: "UTF-8 starts
> here..." :)

You can't use UTF-8 inside COMPOUND_TEXT for any character that already
has a COMPOUND_TEXT encoding (e.g., JIS). This would
break all the COMPOUND_TEXT applications that do not yet know about
UTF-8. It is therefore far more attractive to support UTF-8 on its own,
cleanly and completely separate from COMPOUND_TEXT, namely in
UTF8_STRING.

You don't make the world any simpler by adding yet more mechanics to
COMPOUND_TEXT, whereas UTF8_STRING is almost as simple as STRING.

I don't object to adding UTF-8 to COMPOUND_TEXT, but only with the
strict restriction that it is exclusively used for those characters for
which there is no existing support in COMPOUND_TEXT. A UTF-8 sequence
encapsulated with ESC %G and ESC %@ (the official ISO IR codes for
switching between ISO 2022 and UTF-8) is still better than just a
question mark. But that is no substitute for introducing UTF8_STRING as
the eventual replacement for COMPOUND_TEXT. It is just a hack to help
avoid information loss in case some Unicode text ever got mangled by
accident through COMPOUND_TEXT.
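
To illustrate the restriction described above, here is a minimal Python
sketch, not a real COMPOUND_TEXT encoder. It assumes, purely for
illustration, that ISO 8859-1 (the initial COMPOUND_TEXT state) is the
only "existing support"; a real encoder would first try all the other
registered character sets (JIS etc.) before falling back to UTF-8. The
function names are hypothetical, and the input is assumed to contain no
control characters other than tab and newline.

```python
ESC = b"\x1b"
UTF8_START = ESC + b"%G"  # ISO 2022: switch to UTF-8
UTF8_END = ESC + b"%@"    # ISO 2022: return from UTF-8 to ISO 2022


def has_legacy_encoding(ch: str) -> bool:
    """Hypothetical stand-in for 'COMPOUND_TEXT can already encode this'.

    Assumption: only printable ISO 8859-1 plus tab and newline count.
    """
    cp = ord(ch)
    return ch in "\t\n" or 0x20 <= cp <= 0x7E or 0xA0 <= cp <= 0xFF


def encode_with_utf8_fallback(text: str) -> bytes:
    """Use UTF-8 only for characters that have no legacy encoding."""
    out = bytearray()
    in_utf8 = False
    for ch in text:
        if has_legacy_encoding(ch):
            if in_utf8:
                out += UTF8_END  # leave the UTF-8 segment first
                in_utf8 = False
            out += ch.encode("latin-1")
        else:
            if not in_utf8:
                out += UTF8_START  # open a UTF-8 segment
                in_utf8 = True
            out += ch.encode("utf-8")
    if in_utf8:
        out += UTF8_END  # always end in the ISO 2022 state
    return bytes(out)
```

With this sketch, text that Latin-1 can represent passes through without
any escape sequence, so applications that know nothing about UTF-8 see
exactly what they saw before. Only a character such as the euro sign
ends up wrapped between ESC %G and ESC %@.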

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__