[PATCH 0/5] Improve text protocol
Bill Spitzak
spitzak at gmail.com
Tue Apr 16 09:06:25 PDT 2013
On 04/16/2013 01:16 AM, Jan Arne Petersen wrote:
> But we still need to think about how to handle invalid byte sequences
> anyways. What do we expect a toolkit to do when text with invalid byte
> sequences is inserted with commit_string? How to handle
> delete_surrounding_text with the byte offsets not matching code points?
> Should the toolkit ignore such requests or should we leave that as
> undefined behavior?
You seem to be under the impression that it is impossible to edit text
unless it is converted from UTF-8 to some other form? You do know that
there can be encoding errors in UTF-16, right?
My recommendation is that the editor store UTF-8 and preserve error
bytes. Handling of errors is a *DISPLAY* problem, not a storage problem.
Errors should show a single error glyph for each byte in the error. For
instance the sequence 0xE0,0xC0,0x20 is two error bytes followed by a
space (not a single error followed by a space, and not one error
covering all three bytes, as some systems do). The reason for this rule
is that it allows bi-directional parsing of text with errors in it
without looking ahead more than 4 bytes, and it matches the UTF-16
encoding I describe below.
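
A rough sketch of a decoder that follows this one-error-per-byte rule
(untested; the function name and signature are my own invention, not
anything from the protocol):

    /* Rough sketch: decode the next UTF-8 sequence so that every byte that
     * is not part of a well-formed sequence counts as exactly one error,
     * e.g. 0xE0,0xC0,0x20 decodes to error, error, U+0020.  Returns the
     * code point, or -1 for an error byte; *len is the number of bytes
     * consumed (always 1 for an error).  Caller guarantees n >= 1. */
    #include <stdint.h>
    #include <stddef.h>

    static int32_t decode_utf8(const uint8_t *s, size_t n, size_t *len)
    {
        uint8_t b = s[0];
        size_t need, i;
        int32_t cp;

        *len = 1;
        if (b < 0x80)
            return b;                        /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) {   /* 2-byte lead (0xC0/0xC1 invalid) */
            need = 1; cp = b & 0x1F;
        } else if (b >= 0xE0 && b <= 0xEF) { /* 3-byte lead */
            need = 2; cp = b & 0x0F;
        } else if (b >= 0xF0 && b <= 0xF4) { /* 4-byte lead */
            need = 3; cp = b & 0x07;
        } else {
            return -1;                       /* stray continuation or bad lead */
        }

        if (need >= n)
            return -1;                       /* truncated: lead byte is one error */

        for (i = 1; i <= need; i++) {
            if ((s[i] & 0xC0) != 0x80)
                return -1;                   /* bad continuation: flag only the
                                                lead byte, re-scan the rest */
            cp = (cp << 6) | (s[i] & 0x3F);
        }

        /* overlong forms, surrogates and out-of-range values are errors too */
        if ((need == 2 && cp < 0x800) || (need == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return -1;

        *len = need + 1;
        return cp;
    }

Because a bad continuation only consumes the lead byte, the following
bytes get re-scanned on their own and each produces its own error, which
is what gives the two-errors-plus-space result for 0xE0,0xC0,0x20.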
If you have old code that cannot handle Unicode unless it is translated
to UTF-16 or UTF-32, then I recommend each error byte be turned into
0xDCxx, where xx is the error byte. This is the scheme used by Python
(its "surrogateescape" error handler). The nice thing is that the
transformation is somewhat invertible, and the result is invalid UTF-16
as well. It cannot be made fully invertible unless you disallow the
UTF-8 encoding of these code points, but that would mean you could no
longer store invalid UTF-16 in UTF-8, which is a much more serious
problem, as Windows allows filenames with invalid UTF-16 in them.
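
In code the mapping in each direction is roughly this (just a sketch;
the names are mine):

    /* Sketch of the 0xDCxx escape (the same idea as Python's
     * "surrogateescape"): each error byte b becomes the lone low surrogate
     * 0xDC00 | b, which is itself invalid UTF-16 and can be turned back into
     * the original byte.  Only 0x80..0xFF can ever be error bytes, so only
     * 0xDC80..0xDCFF are treated as escapes on the way back. */
    #include <stdint.h>

    static uint16_t escape_error_byte(uint8_t b)
    {
        return (uint16_t)(0xDC00 | b);       /* 0xDC80..0xDCFF in practice */
    }

    static int unescape_error_unit(uint16_t u, uint8_t *b)
    {
        if (u >= 0xDC80 && u <= 0xDCFF) {
            *b = (uint8_t)(u & 0xFF);
            return 1;                        /* was an escaped error byte */
        }
        return 0;                            /* an ordinary UTF-16 code unit */
    }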
If text is only to be displayed, another possibility is to display each
error byte (and translate it to UTF-16) by looking it up in the CP1252
character set. This will allow the vast majority of existing 8-bit
encoded text to display correctly, and thus removes most of the need to
know whether text is not in UTF-8. It is a little risky, however, if
further processing assigns any important meaning to ISO-8859-1
characters. (Using CP1252, besides making Windows text display
correctly, also hides the dangerous NEL and CSI characters.)
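
A sketch of that lookup (the table values are the standard CP1252
assignments for 0x80-0x9F; mapping the five unassigned slots to U+FFFD
is my own choice, not part of the proposal):

    /* CP1252 display fallback for error bytes: 0xA0-0xFF map straight to
     * U+00A0-U+00FF, so only 0x80-0x9F need a table.  Note 0x85 and 0x9B
     * become an ellipsis and an angle quote rather than NEL and CSI. */
    #include <stdint.h>

    static const uint16_t cp1252_80_9f[32] = {
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178,
    };

    static uint16_t cp1252_display(uint8_t error_byte)
    {
        if (error_byte >= 0x80 && error_byte <= 0x9F)
            return cp1252_80_9f[error_byte - 0x80];
        return error_byte;                   /* 0xA0-0xFF match Latin-1 */
    }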
You will have to transform both the text and the offsets you receive in
the input method events to UTF-16 and UTF-16 offsets. However, at least
both transforms are done in the same place, so even if you don't agree
with the transformation scheme proposed above, it will at least work.
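
For the offsets, a sketch of the byte-offset to code-unit-offset
conversion, reusing decode_utf8() from the sketch above (again untested,
and the names are mine):

    /* Convert a byte offset into UTF-8 text (as the protocol events use)
     * into a UTF-16 code-unit offset.  Each error byte costs one UTF-16
     * unit (its 0xDCxx escape), each code point above 0xFFFF costs two
     * (a surrogate pair), everything else costs one. */
    #include <stddef.h>
    #include <stdint.h>

    static size_t utf8_offset_to_utf16(const uint8_t *s, size_t n,
                                       size_t byte_offset)
    {
        size_t i = 0, units = 0, len;

        while (i < n && i < byte_offset) {
            int32_t cp = decode_utf8(s + i, n - i, &len);
            units += (cp >= 0x10000) ? 2 : 1;
            i += len;
        }
        return units;
    }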