[PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

Silvan Jegen s.jegen at gmail.com
Tue May 8 07:07:24 UTC 2018


On Mon, May 7, 2018 at 5:11 AM Joshua Watt <jpewhacker at gmail.com> wrote:
> IMHO, if you are doing UTF-8 (which you should), you should *always*
> specify any offset in the string as a byte offset. I have a few
> reasons for this justification:

I agree with this as well. I thought some more about how to spell out my
gut feeling on this matter in more technical terms.

UTF-8 is a byte (sequence) representation of Unicode code points. This
indicates to me that an offset within a UTF-8-encoded string should also
be given in bytes. Specifying the offset in Unicode code points mixes the
abstraction of the Unicode code point with (one of) its representations as
a byte sequence. This is reflected in the fact that an offset in Unicode
code points cannot be applied to the UTF-8 string without first scanning
the string from its start to find the matching byte position.
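
To illustrate the extra step a code point offset requires, here is a
minimal Python sketch. The helper name code_point_offset_to_byte_offset
and the sample string are made up for this example and are not part of
the protocol or any library. A byte offset can index the buffer directly,
while a code point offset has to be converted by walking the bytes.

```python
def code_point_offset_to_byte_offset(data: bytes, cp_offset: int) -> int:
    """Convert a code point offset into a byte offset for UTF-8 data.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    """
    seen = 0
    for i, b in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts
        # a new code point.
        if (b & 0xC0) != 0x80:
            if seen == cp_offset:
                return i
            seen += 1
    if seen == cp_offset:
        # Offset pointing just past the last code point.
        return len(data)
    raise ValueError("code point offset is past the end of the string")


text = "naïve café".encode("utf-8")

# A byte offset is usable as-is: slice the buffer directly.
tail_from_byte_offset = text[7:].decode("utf-8")

# A code point offset must be converted first, which costs a scan
# over everything before it.
byte_offset = code_point_offset_to_byte_offset(text, 6)
tail_from_cp_offset = text[byte_offset:].decode("utf-8")
```

Both slices yield the same tail of the string, but only the code point
variant needs the linear scan.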

Unicode code points do not give us that much either since what we most
likely want are grapheme clusters anyway (which, like any more advanced
Unicode processing, should be handled by a specialised library):
http://utf8everywhere.org/#myth.strlen
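
A small Python sketch of why code point counts do not match what a user
perceives as one character. The sample strings and the helper name
counts are my own picks for illustration. Note that this only counts code
points and bytes; it does not do grapheme cluster segmentation, which
would need a dedicated Unicode library.

```python
import unicodedata


def counts(s: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for s."""
    return len(s), len(s.encode("utf-8"))


# "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one
# character, but made of two code points.
decomposed = "e\u0301"
decomposed_counts = counts(decomposed)

# NFC normalization happens to merge this pair into U+00E9, but
# normalization is not a general substitute for segmentation.
composed_counts = counts(unicodedata.normalize("NFC", decomposed))

# Two regional indicator symbols (C + H) that are displayed as a single
# flag and have no precomposed form.
flag = "\U0001F1E8\U0001F1ED"
flag_counts = counts(flag)
```

In each case the text is a single user-perceived character, yet neither
the code point count nor the byte count equals one in general.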


Cheers,

Silvan

