[cairo] non-ascii output
Bill Spitzak
spitzak at d2.com
Wed May 18 13:20:11 PDT 2005
Evan Martin wrote:
>>Similarly I recommend that all errors be drawn as though the error
>>bytes are in ISO-8859-1 or even the Microsoft CP1252 character sets.
>>This will allow virtually all 8-bit text to be drawn unchanged and thus
>>remove the need to preserve an ASCII data path and duplicate interfaces
>>through the software. Without this programs are forced to convert ASCII
>>data to UTF-8 and again they lose the incentive to convert to UTF-8
>>only. I know this runs into extreme resistance because it is considered
>>US/Euro-centric and thus politically incorrect, but I still want to try
>>to fight for this.
>
>
> While I agree with some of your other points, this is just completely
> wrong, and it's already been discussed to death.
Yes, it does not sound like this is going to happen.
An acceptable solution would be to draw an error glyph for each byte of
invalid UTF-8. This would allow a programmer to substitute UTF-8 output
calls for all their ASCII output without the display breaking
completely. They could then work on gradually adapting their code to fix
any other difficulties they have handling UTF-8. In this scheme the
error glyphs may even help, by encouraging the programmer to actually
fix their character strings.
My proposal makes switching to UTF-8 trivial. But it also allows a sort
of "compressed" UTF-8 where the illegal encodings are used on purpose.
It has certainly been proven that programmers will take advantage of
this, even when told it is a bad idea. Using characters in the 0x80 to
0xff range as any kind of parsing delimiter could then be dangerous, and
string compares would be messed up.
I do fear that the error glyphs will be considered unacceptable, and
programmers will insert code to translate from ISO-8859-1 to UTF-8. This
would be very bad as it would make it impossible for them to ever
correctly change their code to UTF-8.
My own personal experience was with Xft. I tried to substitute the UTF-8
output calls, but was stymied by copyright symbols in about boxes, which
caused the entire copyright notice to disappear. There may have been
some accented letters, too, but copyrights were the big problem. This
forced me to replace the UTF-8 Xft call with a translation from "my"
UTF-8 to UCS-4 and a call to the 32-bit Xft entry point instead. It is
possible that if the copyrights had printed an error box I would have
instead tried to fix the strings. But the no-output result was so
immediately objectionable that I was forced to abandon the UTF-8
interface.
> Additionally, you're using ASCII and CP1252 interchangably here, which
> muddles your point; an API that accepts UTF-8 already accepts ASCII,
> so there's no need to "convert ASCII data to UTF-8".
By "ASCII" I mean 8-bit characters, not just 7-bit ones. In fact this is
more accurately called ISO-8859-1 or CP1252.
> I don't know what "preserve an ASCII data path" means, either.
Code with an "ASCII data path":

    if (the_text_is_ascii) {
        foo_ascii(text);
    } else {
        foo_utf8(text);
    }
Same code without the "ASCII data path":

    foo(text);
You should be able to see why programmers are going to be reluctant to
port to UTF-8 if they have to add an ASCII data path. Not only do they
have to add the "if" everywhere, they also have to keep track of the
extra bit of data called "the_text_is_ascii". I would also worry that
foo_utf8() may never be tested and could contain horrible bugs, thus
making use of UTF-8 impossible, if the code is written such that you
cannot force the_text_is_ascii to true.
The other possible solution by programmers is going to be:

    foo_utf8(convert_my_stuff_to_utf8(text));
The problem with this is that "text" is not in UTF-8, so we have done
nothing to get UTF-8 into files or other interfaces. In fact we have
made it worse: on current systems you can often get UTF-8 through a lot
of interfaces, but if something treats that UTF-8 as ISO-8859-1 and
converts it to UTF-8 again, it will be destroyed.