[PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

Mon May 7 20:09:57 UTC 2018

Hi Joshua

On Sun, May 06, 2018 at 10:11:32PM -0500, Joshua Watt wrote:
> On Sun, May 6, 2018 at 3:37 PM, Dorota Czaplejewicz
> <dorota.czaplejewicz at puri.sm> wrote:
> > Unless someone chimes in with more arguments, I'm going to keep
> > using code points in following revisions.
> 
> I don't mean to do a drive by or bikeshed, I do actually have a vested
> interest in this protocol (I've implemented the previous IM protocols
> on Webkit For Wayland). I've really been meaning to try it out, but
> haven't yet had time. I also have quite a bit of experience with
> Unicode (and specifically UTF-8) due to my day job, so I wanted to
> chime in...
> 
> IMHO, if you are doing UTF-8 (which you should), you should *always*
> specify any offset in the string as a byte offset. I have a few
> reasons for this justification:
>  1. Unicode is *hard*, and it has a lot of terms that people aren't
> always familiar with (code points, glyphs, encodings, and the worst
> overloaded term "characters"). "a byte offset in UTF-8" should be
> universally and unambiguously understood.
>  2. Even if you specified the cursor offset as an index into a UTF-32
> array of codepoints, you *still* could end up with the cursor "in
> between" a printed glyph due to combining diacriticals.

This case should be covered by the following paragraph in the protocol
spec:

+      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
+      grapheme is made up of multiple code points, an index pointing to any of
+      them should be interpreted as pointing to the first one.
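To illustrate the rule in that paragraph, here is a minimal sketch in Python.
It is not part of the patch. The helper name is made up for this mail, and it
assumes a grapheme is just a base code point followed by combining marks
(non-zero canonical combining class), which is much cruder than real grapheme
cluster segmentation (UAX #29).

```python
import unicodedata


def snap_to_grapheme_start(text: str, index: int) -> int:
    """Move a code point index back to the first code point of its grapheme.

    ASSUMPTION: a grapheme is approximated as a base code point plus any
    following combining marks. Real segmentation has more rules than this.
    """
    # Python str indices are code point indices, matching the spec wording.
    while 0 < index < len(text) and unicodedata.combining(text[index]) != 0:
        index -= 1
    return index


# "e" + COMBINING ACUTE ACCENT + "x": two graphemes, three code points.
sample = "e\u0301x"
snapped = [snap_to_grapheme_start(sample, i) for i in range(len(sample) + 1)]
```

With that input, an index pointing at the combining accent is treated as
pointing at the "e" before it, and all other indices are left alone.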

>  3. Due to UTF-8's self-synchronizing encoding, it is actually very
> easy to determine if a given byte is the start of a code point, or in
> the middle (and even determine *which* byte in the sequence it is).
> Consequently, if you do find the offset is in the middle of a
> codepoint, it is pretty trivial to either move to the next code point,
> or move back to the beginning of the current code point. As such, I
> have always found byte a more useful offset, because it can more
> easily be converted to a code point than the other way around.

This property of UTF-8 only makes it easier to recover from an issue
you won't have to deal with at all if you specify the offsets in Unicode
code points...
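For reference, both directions can be sketched in a few lines of Python. The
function names are invented for this mail and none of this comes from the
protocol or any Wayland library. It assumes the buffer is valid UTF-8, as the
spec requires.

```python
def is_continuation_byte(byte: int) -> bool:
    # UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def snap_to_code_point_start(data: bytes, offset: int) -> int:
    """Recover from a byte offset that landed inside a code point."""
    while 0 < offset < len(data) and is_continuation_byte(data[offset]):
        offset -= 1
    return offset


def byte_offset_to_code_point_index(data: bytes, offset: int) -> int:
    """Convert a byte offset into a code point index."""
    offset = snap_to_code_point_start(data, offset)
    return len(data[:offset].decode("utf-8"))


def code_point_index_to_byte_offset(data: bytes, index: int) -> int:
    """Convert a code point index into a byte offset."""
    return len(data.decode("utf-8")[:index].encode("utf-8"))


# "a" is 1 byte, "é" is 2 bytes, "€" is 3 bytes: 6 bytes, 3 code points.
data = "a\u00e9\u20ac".encode("utf-8")
```

A byte offset can point into the middle of a sequence and needs the snapping
step first. A code point index has no such invalid positions, although it can
still fall inside a grapheme.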

>  4. As more of a "gut feel" sort of thing.... A Wayland protocol is a
> pretty well defined binary API (like a networking API...), and
> specifying in bytes feels more "stable"... Sorry I really don't have
> solid data to back that up, but I would need a lot of convincing that
> codepoints were better if someone was proposing throwing this data in
> a UDP packet and blasting it across a network :)

I am afraid gut feels don't count. And I am with you on this :P

Cheers,

Silvan