[PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

Silvan Jegen s.jegen at gmail.com
Thu May 10 12:29:48 UTC 2018


On Thu, May 10, 2018 at 11:46:32AM +0200, Dorota Czaplejewicz wrote:
> On Thu, 10 May 2018 11:43:12 +0200
> Dorota Czaplejewicz <dorota.czaplejewicz at puri.sm> wrote:
> 
> > On Tue, 08 May 2018 07:07:24 +0000
> > Silvan Jegen <s.jegen at gmail.com> wrote:
> > 
> > > On Mon, May 7, 2018 at 5:11 AM Joshua Watt <jpewhacker at gmail.com> wrote:  
> > > > IMHO, if you are doing UTF-8 (which you should), you should *always*
> > > > specify any offset in the string as a byte offset. I have a few
> > > > reasons for this justification:    
> > > 
> > > I agree with this as well. I thought some more about how to spell out my
> > > gut feeling on this matter in more technical terms.
> > > 
> > > UTF-8 is a byte (sequence) representation of Unicode code points. This
> > > > indicates to me that an offset within a UTF-8-encoded string should also
> > > > be given in bytes. Specifying the offset in Unicode code points mixes the
> > > abstraction of the Unicode code point with (one of) its representations as
> > > a byte sequence. This is reflected in the fact that an offset in Unicode
> > > code points is not applicable to the UTF-8 string without first processing
> > > the string.
> > > 
> > > Unicode code points do not give us that much either since what we most
> > > likely want are grapheme clusters anyway (which, like any more advanced
> > > Unicode processing, should be handled by a specialised library):
> > > http://utf8everywhere.org/#myth.strlen
> > > 
> > > 
> > > Cheers,
> > > 
> > > Silvan  
> > 
> > This message made me feel obliged to turn my own gut feeling into
> > words. This is not to be construed as an argument, but more of an
> > explanation.
> > 
> > I view wayland protocols as rather high level: their responsibility
> > is to specify the type and the purpose of the data they are
> > transporting. In this case, the data is a Unicode string, and the
> > purpose is display. Or, the data is a number and the purpose is
> > indexing.
> > 
> > I think that when a protocol starts to specify the encoding, it can
> > no longer be thought of as high level. In this view, indexing a
> > Unicode string in terms of bytes would be akin to indexing any other
> > vector of Foo in bytes. (I didn't actually check if there is any
> > other vector, or bytes type available in wayland).
> > 
> > As you noted, there is some mixing between abstraction levels in
> > the protocol. Hardcoding that it's not *just* Unicode, but also the
> > particular encoding (UTF-8) eliminates problems with byte indexing
> > we would have encountered if we decided to use things like Punycode
> > (München => Mnchen-3ya). Knowing that it's always UTF-8 allows the
> > protocol to use a tailored indexing scheme. While I consider this a
> > layer-breaking hack, nevertheless, this property partially counters
> > the above reasoning.
> > 
> > * * *
> > 
> > To be honest, neither Unicode code points nor graphemes nor clusters
> > are what we're truly looking for here. To understand what I mean, I
> > recommend playing with this word:
> > 
> > नमस्ते
> > 
> > According to the Rust book [0], it's composed of 6 code points:
> > ['न', 'म', 'स', '्', 'त', 'े'], but moving the cursor
> > around, I would be led to believe it's 4 "pieces" long only.
> > 
> > Cheers,
> > Dorota
> > 
> > [0] https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html
> 
> On a second thought, perhaps graphemes are actually the relevant thing here...

Yes, that's also mentioned in the Rust book:

https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html#bytes-and-scalar-values-and-grapheme-clusters-oh-my

and what I mentioned in my mail.

I agree with what is mentioned in http://utf8everywhere.org/#myth.strlen
which is that Unicode code points are almost never what people making
use of the protocol would want:

"Yet, the number of code points in it is irrelevant to almost any software
engineering task, with perhaps the only exception of converting the
string to UTF-32"

So just specifying a byte offset (thus not mixing layers of abstraction)
and leaving more advanced Unicode handling (if desired by the client) to
specialized libraries seems like the best way to go.
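
To make the difference concrete, here is a minimal Rust sketch using only
the standard library. The helper names code_point_to_byte_offset and
split_at_byte_offset are made up for illustration; they are not part of
the protocol or of any library. It assumes the string is valid UTF-8,
which Rust's &str guarantees.

```rust
/// Convert an offset counted in Unicode code points into a byte offset.
/// This has to walk the string from the start, so it costs O(n).
/// Returns None if the offset is past the end of the string.
fn code_point_to_byte_offset(s: &str, cp_offset: usize) -> Option<usize> {
    s.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        // Also accept the one-past-the-end position (cursor after the last char).
        .chain(std::iter::once(s.len()))
        .nth(cp_offset)
}

/// Split a string at a byte offset. The offset can be applied directly;
/// the only check needed is that it does not land inside a multi-byte
/// UTF-8 sequence (or past the end of the string).
fn split_at_byte_offset(s: &str, byte_offset: usize) -> Option<(&str, &str)> {
    if s.is_char_boundary(byte_offset) {
        Some(s.split_at(byte_offset))
    } else {
        None
    }
}

fn main() {
    let text = "नमस्ते";

    // The same string has two different "lengths".
    println!("bytes:       {}", text.len()); // 18
    println!("code points: {}", text.chars().count()); // 6

    // A code point offset needs a scan before it can be used on the bytes.
    println!("code point 2 -> byte {:?}", code_point_to_byte_offset(text, 2));

    // A byte offset is usable as-is, after a cheap boundary check.
    println!("split at byte 6: {:?}", split_at_byte_offset(text, 6));
    println!("split at byte 7: {:?}", split_at_byte_offset(text, 7));
}
```

Note that byte offset 9 passes the boundary check even though it separates
'स' from the following '्', i.e. it cuts through what the cursor treats as
one "piece". Neither a byte offset nor a code point offset protects against
that; finding grapheme cluster boundaries still needs a dedicated library.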


Cheers,

Silvan