[PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

Thu May 10 09:43:12 UTC 2018

On Tue, 08 May 2018 07:07:24 +0000
Silvan Jegen <s.jegen at gmail.com> wrote:

> On Mon, May 7, 2018 at 5:11 AM Joshua Watt <jpewhacker at gmail.com> wrote:
> > IMHO, if you are doing UTF-8 (which you should), you should *always*
> > specify any offset in the string as a byte offset. I have a few
> > reasons for this justification:  
> 
> I agree with this as well. I thought some more about how to spell out my
> gut feeling on this matter in more technical terms.
> 
> UTF-8 is a byte (sequence) representation of Unicode code points. This
> indicates to me that an offset within an UTF-8-encoded string should also
> be given in bytes. Specifying the offset in Unicode points mixes the
> abstraction of the Unicode code point with (one of) its representations as
> a byte sequence. This is reflected in the fact that an offset in Unicode
> code points is not applicable to the UTF-8 string without first processing
> the string.
> 
> Unicode code points do not give us that much either since what we most
> likely want are grapheme clusters anyway (which, like any more advanced
> Unicode processing, should be handled by a specialised library):
> http://utf8everywhere.org/#myth.strlen
> 
> 
> Cheers,
> 
> Silvan

This message made me feel obliged to turn my own gut feeling into words. This is not to be construed as an argument, but more of an explanation.

I view wayland protocols as rather high level: their responsibility is to specify the type and the purpose of the data they are transporting. In this case, the data is a Unicode string, and the purpose is display. Or, the data is a number and the purpose is indexing.

I think that when a protocol starts to specify the underlying representation of that data, it can no longer be thought of as high level. In this view, indexing a Unicode string in terms of bytes would be akin to indexing any other vector of Foo in bytes. (I didn't actually check whether any other vector or bytes type is available in Wayland.)

As you noted, there is some mixing between abstraction levels in the protocol. Hardcoding that it's not *just* Unicode, but also the particular encoding (UTF-8), eliminates problems with byte indexing we would have encountered if we had decided to use things like Punycode (München => Mnchen-3ya). Knowing that it's always UTF-8 allows the protocol to use a tailored indexing scheme. While I consider this a layer-breaking hack, this property nevertheless partially counters the above reasoning.
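To make the encoding dependence concrete, here is a rough sketch in Python (my choice of language; `byte_offset` is a hypothetical helper, not anything from the protocol). It assumes the string is in composed form, so that 'ü' is the single code point U+00FC. It shows that a code point index has to be converted by walking the string before it can address UTF-8 bytes, and that the same index means nothing in the Punycode form:

```python
def byte_offset(text: str, code_point_index: int) -> int:
    """Return the UTF-8 byte offset of the code point at code_point_index.

    Hypothetical helper for illustration: it encodes the prefix, so the
    conversion costs a pass over the string.
    """
    return len(text[:code_point_index].encode("utf-8"))


word = "M\u00fcnchen"             # "München", with 'ü' as one code point
utf8 = word.encode("utf-8")       # b'M\xc3\xbcnchen', 8 bytes for 7 code points
puny = word.encode("punycode")    # b'Mnchen-3ya', 'ü' is moved into the suffix

# The first 'n' is code point 2, but it sits at byte 3 in UTF-8,
# because 'ü' occupies two bytes.
n_offset = byte_offset(word, 2)

print(len(word), len(utf8), n_offset, puny)
```

In the Punycode form the non-ASCII character is encoded after the hyphen, so neither a code point index nor a UTF-8 byte offset points at the right place in those bytes.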

* * *

To be honest, neither Unicode code points nor graphemes nor clusters are what we're truly looking for here. To understand what I mean, I recommend playing with this word:

नमस्ते

According to the Rust book [0], it's composed of 6 code points: ['न', 'म', 'स', '्', 'त', 'े'], but moving the cursor around leads me to believe it's only 4 "pieces" long.
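For anyone who wants to poke at the numbers, a small sketch in Python (my choice, standard library only) counts the different units for the word above. The standard library does not segment grapheme clusters or cursor positions, so this only shows code points, UTF-8 bytes, and which code points are combining marks:

```python
import unicodedata

# The word "नमस्ते", spelled out as escapes so the code points are unambiguous.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

# One entry per code point, e.g. "U+0928".
code_points = [f"U+{ord(c):04X}" for c in word]

# Code points whose general category is a mark (Mn/Mc/Me); these attach
# to a preceding character rather than standing on their own.
marks = [c for c in word if unicodedata.category(c).startswith("M")]

# Every code point in this word takes three bytes in UTF-8.
utf8_len = len(word.encode("utf-8"))

print(len(code_points), "code points:", code_points)
print(len(marks), "combining marks")
print(utf8_len, "UTF-8 bytes")
```

None of the three counts matches the 4 "pieces" the cursor seems to step through, which is the point: the unit a user perceives is yet another thing.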

Cheers,
Dorota

[0] https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html