[PATCH 0/5] Improve text protocol
Bill Spitzak
spitzak at gmail.com
Mon Apr 15 12:14:17 PDT 2013
Jan Arne Petersen wrote:
> * Changes offsets to be Unicode character instead of byte based
No, PLEASE DON'T DO THIS!!!
You think you are making things "easier" but you are making them much,
much harder. You may not believe it, but "how many characters are in
this UTF-8 string" will generate dozens of different answers and should
never be used as part of a communication API. Possible differences:
1. A lot of code really counts UTF-16 code units, not Unicode code
points, because it was designed for Windows.
2. Handling of invalid byte sequences. Some decoders count each bad byte
as one character; some count up to 4 bytes as one, stopping at the first
byte that fails the UTF-8 parsing; some swallow all trailing bytes no
matter how many there are; some take the N bytes determined by the lead
byte no matter what they are. (The first is the most common and the
first two are the only ones recommended, but the others exist, sometimes
with multiple rules in the same decoder!) And don't you dare spout the
nonsense that invalid byte sequences won't happen, or that if they are
there the data is "not UTF-8" and that saying so means it will magically
never go through the API.
3. Disagreement about whether the encodings of UTF-16 surrogate halves,
the code points 0xNNFFFE and 0xNNFFFF, the C0 and C1 control characters,
code points greater than 0x10FFFF, etc., are "characters" or "errors".
If they are errors, many decoders count them as 3 or 4 characters rather
than one.
4. How to count combining characters.
5. How to count double-width characters, tabs, various whitespace.
6. Normalization. Almost anything that actually wants to decode Unicode
(other than to translate it to UTF-16 for Windows filenames) wants to do
extra analysis and will normalize the text. That is hundreds of pages of
documentation from Unicode and certainly should not be part of a
low-level API.
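To make points 2, 4 and 6 concrete, here is a small Python sketch. It is
an illustration only, not code from the patch series; the helper name
counts() and the sample strings are made up, and the latin-1 decode is
just a stand-in for a decoder that counts every bad byte separately.

```python
import unicodedata

# The same visible text, spelled two ways: a precomposed letter versus
# a base letter followed by a combining accent.
composed = "\u00e9"
decomposed = "e\u0301"

def counts(text):
    """Return (UTF-8 bytes, code points, code points after NFC)."""
    return (
        len(text.encode("utf-8")),
        len(text),
        len(unicodedata.normalize("NFC", text)),
    )

composed_counts = counts(composed)      # (2, 1, 1)
decomposed_counts = counts(decomposed)  # (3, 2, 1)

# Invalid input: a 3-byte lead (0xE2) with only one trailing byte, then "A".
broken = b"\xe2\x82A"

# CPython's "replace" handler turns the truncated sequence into one U+FFFD.
replaced = broken.decode("utf-8", errors="replace")

# Stand-in (assumption) for a decoder that counts each bad byte on its own.
per_byte = broken.decode("latin-1")

print(composed_counts, decomposed_counts, len(replaced), len(per_byte))
```

Two decoders that disagree like this will disagree about every
"character" offset after the bad bytes, while the byte offsets stay the
same for both.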
PS: You will notice that Windows and everything else working with UTF-16
counts a surrogate pair as 2 units. For reasons that totally baffle
me, the very same people who say "oh you must measure your UTF-8 in
'characters'" see nothing wrong with this! Why don't you think a little:
go change all your UTF-16 code to measure "characters" and realize what
a STUPID idea it is.
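A short Python sketch of the UTF-16 point, again only an illustration
with made-up variable names: one code point outside the BMP takes two
UTF-16 code units, so "length in units" and "length in characters"
already differ there, and nobody objects.

```python
# "a" followed by U+1F600, a code point outside the BMP.
text = "a\U0001F600"

code_points = len(text)                           # 2
utf16_units = len(text.encode("utf-16-le")) // 2  # 3, surrogate pair = 2 units
utf8_bytes = len(text.encode("utf-8"))            # 5

# A byte offset into the UTF-8 buffer means the same thing to both
# sides without any decoding rules being agreed on first.
encoded = text.encode("utf-8")
offset_after_a = len("a".encode("utf-8"))
tail = encoded[offset_after_a:].decode("utf-8")

print(code_points, utf16_units, utf8_bytes, offset_after_a)
```

An offset counted in code units (bytes for UTF-8, 16-bit units for
UTF-16) can be applied directly to the buffer; an offset counted in
"characters" first needs both sides to agree on how to count.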