[PATCH 0/5] Improve text protocol

Mon Apr 15 12:14:17 PDT 2013

Jan Arne Petersen wrote:

> * Changes offsets to be Unicode character instead of byte based

No, PLEASE DON'T DO THIS!!!

You think you are making things "easier" but you are making them much, much 
harder. You may not believe it, but "how many characters are in this 
UTF-8 string" will generate dozens of different answers and should never 
be used as part of a communication API. Possible differences:

1. A lot of things really count UTF-16 code units, not Unicode code 
points, due to being designed for Windows.

2. Handling of invalid byte sequences. Some consider one byte a 
character, some consider up to 4 bytes stopping at the first byte that 
fails the UTF-8 parsing, some consider all trailing bytes no matter how 
long, some consider the N bytes determined by the lead byte no matter 
what they are (the first is the most common and the first two are the 
only ones recommended, but the others exist, sometimes multiple rules in 
the same decoder!). And don't you dare spout the nonsense that somehow 
invalid byte sequences won't happen, or that if they are there it is 
"not UTF-8" and thus somehow saying this means it will magically not 
ever go through the API.

3. Disagreement about whether the encodings of UTF-16 surrogate halves, 
the characters 0xNNFFFE and 0xNNFFFF, the C0 and C1 control characters, 
code points greater than 0x10FFFF, etc., are "characters" or "errors". If 
they are errors, many decoders count them as 3 or 4 characters rather 
than one.

4. How to count combining characters.

5. How to count double-width characters, tabs, various whitespace.

6. Normalization. Almost anything that actually wants to decode Unicode 
(other than to translate it to UTF-16 for Windows filenames) wants to do 
extra analysis and will do normalization. Normalization alone is hundreds 
of pages of documentation from Unicode and certainly should not be part 
of a low-level API.
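
To make the list concrete, here is a minimal Python sketch. It is not part 
of the proposed protocol, and the function names and sample inputs are made 
up for illustration. It measures one valid string five ways, then counts the 
"characters" in one invalid byte string under two of the rules from item 2, 
plus whatever Python's own decoder does with errors="replace".

```python
import unicodedata


def lengths(text):
    """Return several defensible 'lengths' of one valid string."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),
        "nfc_code_points": len(unicodedata.normalize("NFC", text)),
        # Skip code points with a non-zero canonical combining class.
        "non_combining": sum(1 for ch in text if unicodedata.combining(ch) == 0),
    }


def count_error_is_one_byte(data):
    """Rule: a byte that does not start a valid sequence is one character."""
    count, i = 0, 0
    while i < len(data):
        step = 1
        for n in (1, 2, 3, 4):
            try:
                data[i:i + n].decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            step = n  # shortest valid prefix is exactly one code point
            break
        count += 1
        i += step
    return count


def count_lead_byte_decides(data):
    """Rule: the lead byte alone fixes the length, whatever follows it."""
    count, i = 0, 0
    while i < len(data):
        lead = data[i]
        i += 4 if lead >= 0xF0 else 3 if lead >= 0xE0 else 2 if lead >= 0xC0 else 1
        count += 1
    return count


valid = "e\u0301\U0001F600"   # 'e', COMBINING ACUTE ACCENT, U+1F600
broken = b"\xe2\x82ab"        # truncated 3-byte sequence, then "ab"

print(lengths(valid))
print(count_error_is_one_byte(broken))
print(count_lead_byte_decides(broken))
print(len(broken.decode("utf-8", errors="replace")))  # Python's own rule
```

For the sample string the five measurements are 7, 4, 3, 2 and 2, and the 
two hand-written error rules give 4 and 2 for the same four bytes. The byte 
length is the only one of these numbers that needs no decoding rules at all.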

PS: You will notice that Windows and everything else working with UTF-16 
counts a surrogate pair as 2 units. For reasons that totally baffle 
me, the very same people who say "oh you must measure your UTF-8 in 
'characters'" see nothing wrong with this! Why don't you think a little: 
go change all your UTF-16 code to measure "characters" and realize what 
a STUPID idea it is.