[PATCH 0/5] Improve text protocol

Tue Apr 16 01:16:53 PDT 2013

Hi,

On 04/15/2013 09:14 PM, Bill Spitzak wrote:
> Jan Arne Petersen wrote:
> 
>> * Changes offsets to be Unicode character instead of byte based
> 
> No, PLEASE DON'T DO THIS!!!
>
> You think you are making things "easier" but you are making it much much
> harder.

My main reason was that EFL, IBus and partly GTK+ were using Unicode
characters as offsets and I did not want to have to specify how to
handle 'invalid' byte offsets.

> You may not believe it, but "how many characters are in this
> UTF-8" will generate dozens of different answers and should never be
> used as part of a communication api.

"Unicode characters" is indeed not precise enough for a protocol
specification. I should have written "Unicode code points" instead. But
even then we still have the problem of invalid byte sequences, so I do
not really mind using byte offsets.
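
To illustrate why the unit has to be spelled out, here is a small Python
sketch (not part of the patch series; the helper name offsets() is made
up for this mail). It expresses the same cursor position in code points,
UTF-8 bytes and UTF-16 code units:

```python
# Illustrative only: one cursor position, three different offset values.
text = "a\u00e9\u20ac\U0001F600b"  # a, e-acute, euro sign, emoji, b


def offsets(text: str, cp_index: int) -> dict:
    """Return the offset of code point index `cp_index` in three units."""
    prefix = text[:cp_index]
    return {
        "code_points": len(prefix),
        "utf8_bytes": len(prefix.encode("utf-8")),
        # UTF-16 uses two code units for code points above U+FFFF.
        "utf16_units": len(prefix.encode("utf-16-le")) // 2,
    }


# Cursor placed just before the final "b".
print(offsets(text, 4))
# {'code_points': 4, 'utf8_bytes': 10, 'utf16_units': 5}
```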

But we still need to think about how to handle invalid byte sequences
anyway. What do we expect a toolkit to do when text containing invalid
byte sequences is inserted with commit_string? How should it handle
delete_surrounding_text when the byte offsets do not fall on code point
boundaries? Should the toolkit ignore such requests, or should we leave
that as undefined behavior?
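
As a sketch of the "ignore such requests" option (one possible policy
only, not what the protocol specifies): in well-formed UTF-8 a byte
offset is on a code point boundary exactly when the byte at that offset
is not a continuation byte (10xxxxxx). The function names below are
hypothetical, and the buffer is assumed to hold valid UTF-8:

```python
def is_code_point_boundary(buf: bytes, offset: int) -> bool:
    """True if `offset` does not point into the middle of a UTF-8 sequence."""
    if offset < 0 or offset > len(buf):
        return False
    if offset == len(buf):
        return True
    # Continuation bytes have the bit pattern 10xxxxxx.
    return (buf[offset] & 0xC0) != 0x80


def delete_range(buf: bytes, start: int, length: int) -> bytes:
    """Delete `length` bytes at `start`; ignore the request if it is misaligned."""
    end = start + length
    if length < 0 or not (
        is_code_point_boundary(buf, start) and is_code_point_boundary(buf, end)
    ):
        return buf  # assumed policy: silently ignore the request
    return buf[:start] + buf[end:]


buf = "a\u20acb".encode("utf-8")  # b'a\xe2\x82\xacb', the euro sign is 3 bytes
print(delete_range(buf, 1, 3))  # b'ab'
print(delete_range(buf, 1, 2) == buf)  # True: the end offset splits the euro sign
```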

> 1. A lot of things really count UTF-16 code units, not Unicode code
> points, due to being designed for Windows.
> 
> 2. Handling of invalid byte sequences. Some consider one byte a
> character, some consider up to 4 bytes stopping at the first byte that
> fails the UTF-8 parsing, some consider all trailing bytes no matter how
> long, some consider the N bytes determined by the lead byte no matter
> what they are (the first is the most common and the first two are the
> only ones recommended, but the others exist, sometimes multiple rules in
> the same decoder!). And don't you dare spout the nonsense that somehow
> invalid byte sequences won't happen, or that if they are there it is
> "not UTF-8" and thus somehow saying this means it will magically not
> ever go through the API.
> 
> 3. Disagreement about whether the encoding of UTF-16 surrogate halves,
> the characters 0xNNFFFE and 0xNNFFFF, the C0 and C1 control characters,
> code points greater than 0x10FFFF, etc, are "characters" or "errors". If
> errors, many decoders count them as 3 or 4 characters rather than one.
> 
> 4. How to count combining characters.
> 
> 5. How to count double-width characters, tabs, various whitespace.
> 
> 6. Normalization. Almost anything that actually wants to decode Unicode
> (other than to translate it to UTF-16 for Windows filenames) wants to do
> extra analysis and will do normalization. This is hundreds of pages of
> documentation from Unicode and certainly should not be part of a
> low-level api.
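
Points 4 and 6 are easy to reproduce. The following Python sketch (for
illustration only; counts() is a made-up helper) shows that the same
visible character yields different counts depending on the
normalization form, whichever unit is used:

```python
import unicodedata


def counts(s: str) -> tuple:
    """Return (code points, UTF-8 bytes, UTF-16 code units) for `s`."""
    return (len(s), len(s.encode("utf-8")), len(s.encode("utf-16-le")) // 2)


# "e" followed by U+0301 COMBINING ACUTE ACCENT, in both normalization forms.
nfc = unicodedata.normalize("NFC", "e\u0301")  # precomposed U+00E9
nfd = unicodedata.normalize("NFD", nfc)        # "e" + U+0301

print(counts(nfc))  # (1, 2, 1)
print(counts(nfd))  # (2, 3, 2)
```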

-- 
Jan Arne Petersen
Openismus GmbH
http://www.openismus.com