[cairo] non-ascii output
Bill Spitzak
spitzak at d2.com
Wed May 18 13:20:11 PDT 2005
Evan Martin wrote:
>>Similarly I recommend that all errors be drawn as though the error
>>bytes are in ISO-8859-1 or even the Microsoft CP1252 character sets.
>>This will allow virtually all 8-bit text to be drawn unchanged and thus
>>remove the need to preserve an ASCII data path and duplicate interfaces
>>through the software. Without this programs are forced to convert ASCII
>>data to UTF-8 and again they lose the incentive to convert to UTF-8
>>only. I know this runs into extreme resistance because it is considered
>>US/Euro-centric and thus politically incorrect, but I still want to try
>>to fight for this.
>
>
> While I agree with some of your other points, this is just completely
> wrong, and it's already been discussed to death.
Yes, it does not sound like this is going to happen.
An acceptable solution would be to draw an error glyph for each byte of
invalid UTF-8. This would allow a programmer to substitute UTF-8 output
calls for all their ASCII output without the display breaking
completely. They could then work on gradually adapting their code to fix
any other difficulties they have handling UTF-8. In this scheme the
error glyphs may even help, by encouraging the programmer to actually
fix their character strings.
My proposal makes switching to UTF-8 trivial. But it also allows a sort
of "compressed" UTF-8 where the illegal encodings are used on purpose.
It has certainly been proven that programmers will take advantage of
this, even when told it is a bad idea. Using characters in the 0x80 to
0xff range as any kind of parsing delimiter could then be dangerous, and
string compares would be messed up.
I do fear that the error glyphs will be considered unacceptable, and
programmers will insert code to translate from ISO-8859-1 to UTF-8. This
would be very bad as it would make it impossible for them to ever
correctly change their code to UTF-8.
My own personal experience was with Xft. I tried to substitute the UTF-8
output calls, but was stymied by copyright symbols in about boxes, which
caused the entire copyright notice to disappear. There may have been
some accented letters, too, but copyrights were the big problem. This
forced me to replace the UTF-8 Xft call with a translation from "my"
UTF-8 to UCS-4 and a call to the 32-bit Xft entry point instead. It is
possible that if the copyrights had printed an error box I would have
instead tried to fix the strings. But the no-output result was so
immediately objectionable that I was forced to abandon the UTF-8
interface.
> Additionally, you're using ASCII and CP1252 interchangably here, which
> muddles your point; an API that accepts UTF-8 already accepts ASCII,
> so there's no need to "convert ASCII data to UTF-8".
By "ASCII" I mean 8-bit characters, not just 7-bit ones. In fact this is
more accurately called ISO-8859-1 or CP1252.
> I don't know what "preserve an ASCII data path" means, either.
Code with an "ASCII data path":

    if (the_text_is_ascii) {
        foo_ascii(text);
    } else {
        foo_utf8(text);
    }
Same code without the "ASCII data path":

    foo(text);
You should be able to see why programmers are going to be reluctant to
port to UTF-8 if they have to add an ASCII data path. Not only do they
have to add the "if" everywhere, they also have to keep track of the
extra bit of data called "the_text_is_ascii". I would also worry that
foo_utf8() may never be tested and could contain horrible bugs, thus
making use of UTF-8 impossible, if the code is written such that you
cannot force the_text_is_ascii to true.
The other possible solution by programmers is going to be:

    foo_utf8(convert_my_stuff_to_utf8(text));
The problem with this is that "text" is not in UTF-8, so we have done
nothing to get UTF-8 into files or other interfaces. In fact we have
made it worse: on current systems you can often get UTF-8 through a lot
of interfaces, but if something treats that UTF-8 as ISO-8859-1 and
converts it to UTF-8 again, it will be destroyed.