[cairo] [PATCH 3/3] [test] Use UTF-8 in test files

Bill Spitzak spitzak at gmail.com
Tue Mar 10 11:39:11 PDT 2015



On 03/10/2015 04:10 AM, Andrea Canciani wrote:
> From: Andrea Canciani <ranma42 at gmail.com>
>
> On MacOSX, the sed utility errors out when parsing non-UTF8
> files.

Holy crap! Sorry, but I have been ranting against this sort of stupidity 
for years, and nobody seems to pay attention.

Note that with a sed like this it is impossible to write a sed script 
that converts the non-UTF-8 into UTF-8. Therefore the authors are 
actually HURTING the transition to UTF-8, not helping as they so 
foolishly believe.

The Apple or BSD engineers who wrote this are idiots.

Text stream reading should NEVER NEVER NEVER throw an error on any 
unexpected bytes, and should be able to deal with any byte pattern and 
distinguish it from any different byte pattern.

The best way to do this is to stop using UTF-16 or UTF-32 internally 
and just deal with UTF-8 directly. It is not hard at all. You can parse 
a UTF-8 stream in both directions with very little code, even a stream 
containing errors. Don't panic: sed and every other text tool has been 
dealing with words, lines, sentences and paragraphs for 50 years despite 
the horrific fact that they are "variable length", and will have NO 
trouble dealing with variable-length "characters". You may even start to 
handle combining characters correctly once you get over the fixed-size 
delusion.
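
To illustrate how little code that takes, here is a rough sketch in 
Python (not anything from sed or cairo). The names decode_at, 
prev_start and forward_starts are made up for this example, and the 
rule that an invalid byte counts as a one-byte unit of its own is an 
assumption of the sketch, not something a standard mandates.

```python
def decode_at(buf, i):
    """Decode one unit starting at buf[i].

    Returns (value, length, ok). For a valid sequence, value is the code
    point. For an invalid byte, value is the raw byte, length is 1 and
    ok is False (assumption: each error byte is its own unit).
    """
    b0 = buf[i]
    if b0 < 0x80:
        return b0, 1, True
    if 0xC2 <= b0 <= 0xDF:
        n, cp, lowest = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:
        n, cp, lowest = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:
        n, cp, lowest = 4, b0 & 0x07, 0x10000
    else:
        return b0, 1, False          # stray continuation or invalid lead
    if i + n > len(buf):
        return b0, 1, False          # sequence truncated at end of buffer
    for c in buf[i + 1:i + n]:
        if c & 0xC0 != 0x80:
            return b0, 1, False      # expected a continuation byte
        cp = (cp << 6) | (c & 0x3F)
    if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return b0, 1, False          # overlong, out of range, or surrogate
    return cp, n, True


def prev_start(buf, i):
    """Return the start of the unit that ends just before boundary i (i >= 1)."""
    j = i - 1
    # Step back over at most three continuation bytes to find a lead byte.
    while j > 0 and i - j < 4 and buf[j] & 0xC0 == 0x80:
        j -= 1
    _, n, ok = decode_at(buf, j)
    if ok and j + n == i:
        return j                     # a valid sequence ends exactly at i
    return i - 1                     # otherwise the previous unit is one byte


def forward_starts(buf):
    """List the start offset of every unit, scanning left to right."""
    starts, i = [], 0
    while i < len(buf):
        starts.append(i)
        i += decode_at(buf, i)[1]
    return starts


# "a", EURO SIGN, a stray 0xFF byte, then "e" with acute accent.
data = "a\u20ac".encode("utf-8") + b"\xff" + "\u00e9".encode("utf-8")

backward, i = [], len(data)
while i > 0:
    i = prev_start(data, i)
    backward.append(i)
```

Scanning backward visits the same boundaries as scanning forward, just 
in reverse order, and the stray byte is carried along as its own unit 
instead of causing an error.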

If you really can't stand that, please make your converter from UTF-8 to 
the internal form just turn each error byte into a replacement character 
(a different one for each of the 128 possible error bytes; the high bit 
is set on all of them). For UTF-16 turn them into 0xDC80..0xDCFF, which 
are nice because, as unpaired low surrogates, they are technically 
invalid UTF-16. For UTF-32 you have the option of turning them into some 
value greater than 0x10FFFF so you can distinguish them from 
correctly-encoded 0xDC80..0xDCFF.
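
As one existing example of the 0xDC80..0xDCFF mapping, Python's 
"surrogateescape" error handler decodes each undecodable byte to a lone 
surrogate in that range and turns it back into the original byte when 
encoding. The sample bytes below are made up for illustration.

```python
# Latin-1 "e acute" (0xE9) stranded in otherwise valid UTF-8 text.
raw = b"caf\xe9 \xe2\x82\xac"

# The bad byte 0xE9 becomes U+DCE9; the valid EURO SIGN decodes normally.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
```

Because the round trip is lossless, a tool built this way can read, 
edit and write a file containing bad bytes without erroring out and 
without silently changing them.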

In any case, fixing text files so they are UTF-8 is a good idea, so this 
is a good patch. But it would be nice not to be forced into it by bugs.

