*-ISO10646-C encoding

Thu Sep 23 06:44:38 PDT 2004

Roland Mainz wrote on 2004-09-22 20:45 UTC:
> This functionality is already broken since the introduction of the
> *-iso10646-1 encoding (that's one of the reasons why none of the Unix
> vendors like Sun etc. have adopted *-iso10646-1 yet) ... we already need
> a "smarter" mechanism. The COMPOUND-TEXT vs. UTF-8 issue doesn't change
> that.

The *-iso10646-1 font encoding has its own bag of big problems and is
not a long-term solution:

  - The Xlib API and X11 protocol data structures used for representing
    font metric information are extremely inefficient when handling
    sparsely populated fonts. The most common way of accessing a font
    in an X client is a call to XLoadQueryFont(), which allocates
    memory for an XFontStruct and fetches its content from the server.
    XFontStruct contains an array of XCharStruct entries (12 bytes each).
    The size of this array is the code position of the last character
    minus the code position of the first character plus one.
    Therefore, any "*-iso10646-1" font that contains both U+0020 and U+FFFD
    will cause an XCharStruct array with 65502 elements to be allocated
    (even for CharCell fonts), which requires 786 kilobytes (!) of client-side
    memory and data transmission, even if the font contains only a
    thousand characters. No fun for anyone with less than a 10 Mbit/s
    connection.

  - *-iso10646-1 fonts cannot contain characters above U+FFFF, e.g.
    the extended mathematical alphabets that a number of people would like
    to use cannot be added at present (and adding *-iso10646-2 as originally
    envisioned would not solve the other two problems).

  - *-iso10646-1 fonts assume that characters and glyphs are the same
    thing, which makes them useless in particular for Indic scripts.
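To make the arithmetic in the first point concrete, here is a small
sketch in Python. It makes no Xlib calls; it only reproduces the size
calculation described above (last code position minus first code
position plus one, at 12 bytes per XCharStruct). The function name
per_char_array_size is made up for this illustration.

```python
# Illustrative arithmetic only; nothing here talks to an X server.
XCHARSTRUCT_BYTES = 12  # per-entry size as stated in the text above


def per_char_array_size(first_code, last_code):
    """Return (entries, bytes) for a per-character metrics array that
    spans every code position from first_code to last_code inclusive."""
    if last_code < first_code:
        raise ValueError("last_code must not be below first_code")
    entries = last_code - first_code + 1
    return entries, entries * XCHARSTRUCT_BYTES


# A font that contains both U+0020 and U+FFFD, however sparse it is:
print(per_char_array_size(0x0020, 0xFFFD))  # (65502, 786024)

# For comparison, a font covering only U+0020..U+00FF:
print(per_char_array_size(0x0020, 0x00FF))  # (224, 2688)
```

The cost depends only on the first and last code positions, not on
how many characters the font actually contains.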

They will eventually have to be replaced by something that has been
discussed before as the "*-iso10646-c" encoding. There the position in
the font is really a font-dependent glyph number, and the font
properties encode compact tables to map between Unicode characters and
glyphs. This will solve all of the above problems, without requiring any
changes to X servers. It is therefore an attractive solution, especially
for old X terminals with fixed server firmware over slow connections,
where none of the newer font technologies such as the RENDER extension
can be used. There are still plenty of these in use with modern X11
clients. The "*-iso10646-c" fonts can be specified to be "*-iso8859-1"
backwards compatible by preserving the 1:1 character-to-glyph
relationship for anything below U+0100.
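
The table format for "*-iso10646-c" is not specified in this message,
so the following Python sketch is only one hypothetical layout: a
sorted list of (first code point, last code point, first glyph) ranges
searched by bisection. The names RANGES, STARTS and glyph_for and the
glyph numbers are all invented for illustration. It shows how a sparse
font can keep its glyph numbers dense, reach beyond U+FFFF, and still
keep the identity mapping below U+0100. It is a plain one-to-one
lookup and does not attempt the many-to-many character/glyph mapping
that Indic scripts need.

```python
from bisect import bisect_right

# Hypothetical range table: (first_code_point, last_code_point, first_glyph).
# Sorted by first_code_point; ranges do not overlap.
RANGES = [
    (0x0000, 0x00FF, 0x0000),    # identity for anything below U+0100
    (0x0391, 0x03A1, 0x0100),    # 17 code points, glyphs 0x100..0x110
    (0x1D400, 0x1D419, 0x0111),  # 26 code points above U+FFFF
]
STARTS = [r[0] for r in RANGES]


def glyph_for(code_point):
    """Return the font-dependent glyph number for code_point,
    or None if the font has no glyph for it."""
    i = bisect_right(STARTS, code_point) - 1  # last range starting at or before
    if i < 0:
        return None
    first, last, first_glyph = RANGES[i]
    if code_point > last:
        return None  # falls in a gap between ranges
    return first_glyph + (code_point - first)


print(hex(glyph_for(0x0041)))   # 0x41
print(hex(glyph_for(0x1D400)))  # 0x111
print(glyph_for(0x0100))        # None
```

With this layout the glyph positions run from 0 to 0x12A without
gaps, so a metrics array sized from first to last position has no
empty entries.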

http://www.cl.cam.ac.uk/~mgk25/unicode.html#x11

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__