[PATCH 0/5] Improve text protocol
Bill Spitzak
spitzak at gmail.com
Tue Apr 16 09:06:25 PDT 2013
On 04/16/2013 01:16 AM, Jan Arne Petersen wrote:
> But we still need to think about how to handle invalid byte sequences
> anyways. What do we expect a toolkit to do when text with invalid byte
> sequences is inserted with commit_string? How to handle
> delete_surrounding_text with the byte offsets not matching code points?
> Should the toolkit ignore such requests or should we leave that as
> undefined behavior?
You seem to be under the impression that it is impossible to edit text
unless it is converted from UTF-8 to some other form? You do know that
there can be encoding errors in UTF-16, right?
My recommendation is that the editor store UTF-8 and preserve error
bytes. Handling of errors is a *DISPLAY* problem, not a storage problem.
Errors should show a single error glyph for each byte in the error. For
instance the sequence 0xE0,0xC0,0x20 is two error bytes followed by a
space (not a single error followed by a space, and not one error
covering all three bytes, as some systems do). The reason for this rule
is that it allows bi-directional parsing of text with errors in it
without looking ahead more than 4 bytes, and it matches the UTF-16
encoding I describe below.
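
A rough sketch of a decoder that follows this one-error-per-byte rule
(untested; the function name and signature are my own invention, not
anything from the protocol):

    /* Rough sketch: decode the next UTF-8 sequence so that every byte that
     * is not part of a well-formed sequence counts as exactly one error,
     * e.g. 0xE0,0xC0,0x20 decodes to error, error, U+0020.  Returns the
     * code point, or -1 for an error byte; *len is the number of bytes
     * consumed (always 1 for an error).  Caller guarantees n >= 1. */
    #include <stdint.h>
    #include <stddef.h>

    static int32_t decode_utf8(const uint8_t *s, size_t n, size_t *len)
    {
        uint8_t b = s[0];
        size_t need, i;
        int32_t cp;

        *len = 1;
        if (b < 0x80)
            return b;                        /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) {   /* 2-byte lead (0xC0/0xC1 invalid) */
            need = 1; cp = b & 0x1F;
        } else if (b >= 0xE0 && b <= 0xEF) { /* 3-byte lead */
            need = 2; cp = b & 0x0F;
        } else if (b >= 0xF0 && b <= 0xF4) { /* 4-byte lead */
            need = 3; cp = b & 0x07;
        } else {
            return -1;                       /* stray continuation or bad lead */
        }

        if (need >= n)
            return -1;                       /* truncated: lead byte is one error */

        for (i = 1; i <= need; i++) {
            if ((s[i] & 0xC0) != 0x80)
                return -1;                   /* bad continuation: flag only the
                                                lead byte, re-scan the rest */
            cp = (cp << 6) | (s[i] & 0x3F);
        }

        /* overlong forms, surrogates and out-of-range values are errors too */
        if ((need == 2 && cp < 0x800) || (need == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return -1;

        *len = need + 1;
        return cp;
    }

Because a bad continuation only consumes the lead byte, the following
bytes get re-scanned on their own and each produces its own error, which
is what gives the two-errors-plus-space result for 0xE0,0xC0,0x20.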
If you have old code that cannot handle Unicode unless it is translated
to UTF-16 or UTF-32, then I recommend each error byte be turned into
0xDCxx, where xx is the error byte. This is the scheme used by Python
(its "surrogateescape" error handler). The nice thing is that the
transformation is somewhat invertible, and the result is invalid UTF-16
as well. It cannot be made fully invertible unless you disallow the
UTF-8 encoding of these code points, but that would mean you could no
longer store invalid UTF-16 in UTF-8, which is a much more serious
problem, as Windows allows filenames with invalid UTF-16 in them.
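
In code the mapping in each direction is roughly this (just a sketch;
the names are mine):

    /* Sketch of the 0xDCxx escape (the same idea as Python's
     * "surrogateescape"): each error byte b becomes the lone low surrogate
     * 0xDC00 | b, which is itself invalid UTF-16 and can be turned back into
     * the original byte.  Only 0x80..0xFF can ever be error bytes, so only
     * 0xDC80..0xDCFF are treated as escapes on the way back. */
    #include <stdint.h>

    static uint16_t escape_error_byte(uint8_t b)
    {
        return (uint16_t)(0xDC00 | b);       /* 0xDC80..0xDCFF in practice */
    }

    static int unescape_error_unit(uint16_t u, uint8_t *b)
    {
        if (u >= 0xDC80 && u <= 0xDCFF) {
            *b = (uint8_t)(u & 0xFF);
            return 1;                        /* was an escaped error byte */
        }
        return 0;                            /* an ordinary UTF-16 code unit */
    }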
If text is only to be displayed, another possibility is to display each
error byte (and translate it to UTF-16) by looking it up in the CP1252
character set. This will allow the vast majority of existing 8-bit
encoded text to display correctly, and thus removes most of the need to
know whether text is not in UTF-8. It is a little risky, however, if
further processing assigns any important meaning to ISO-8859-1
characters. (Using CP1252, besides making Windows text display
correctly, also hides the dangerous NEL and CSI characters.)
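
A sketch of that lookup (the table values are the standard CP1252
assignments for 0x80-0x9F; mapping the five unassigned slots to U+FFFD
is my own choice, not part of the proposal):

    /* CP1252 display fallback for error bytes: 0xA0-0xFF map straight to
     * U+00A0-U+00FF, so only 0x80-0x9F need a table.  Note 0x85 and 0x9B
     * become an ellipsis and an angle quote rather than NEL and CSI. */
    #include <stdint.h>

    static const uint16_t cp1252_80_9f[32] = {
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178,
    };

    static uint16_t cp1252_display(uint8_t error_byte)
    {
        if (error_byte >= 0x80 && error_byte <= 0x9F)
            return cp1252_80_9f[error_byte - 0x80];
        return error_byte;                   /* 0xA0-0xFF match Latin-1 */
    }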
You will have to transform both the text and the offsets you receive in
the input method events to UTF-16 and UTF-16 offsets. However, at least
both transforms are done in the same place, so even if you don't agree
with the transformation scheme proposed above, it will at least work.
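
For the offsets, a sketch of the byte-offset to code-unit-offset
conversion, reusing decode_utf8() from the sketch above (again untested,
and the names are mine):

    /* Convert a byte offset into UTF-8 text (as the protocol events use)
     * into a UTF-16 code-unit offset.  Each error byte costs one UTF-16
     * unit (its 0xDCxx escape), each code point above 0xFFFF costs two
     * (a surrogate pair), everything else costs one. */
    #include <stddef.h>
    #include <stdint.h>

    static size_t utf8_offset_to_utf16(const uint8_t *s, size_t n,
                                       size_t byte_offset)
    {
        size_t i = 0, units = 0, len;

        while (i < n && i < byte_offset) {
            int32_t cp = decode_utf8(s + i, n - i, &len);
            units += (cp >= 0x10000) ? 2 : 1;
            i += len;
        }
        return units;
    }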