[PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol
s.jegen at gmail.com
Thu May 10 12:29:48 UTC 2018
On Thu, May 10, 2018 at 11:46:32AM +0200, Dorota Czaplejewicz wrote:
> On Thu, 10 May 2018 11:43:12 +0200
> Dorota Czaplejewicz <dorota.czaplejewicz at puri.sm> wrote:
> > On Tue, 08 May 2018 07:07:24 +0000
> > Silvan Jegen <s.jegen at gmail.com> wrote:
> > > On Mon, May 7, 2018 at 5:11 AM Joshua Watt <jpewhacker at gmail.com> wrote:
> > > > IMHO, if you are doing UTF-8 (which you should), you should *always*
> > > > specify any offset in the string as a byte offset. I have a few
> > > > reasons for this justification:
> > >
> > > I agree with this as well. I thought some more about how to spell out my
> > > gut feeling on this matter in more technical terms.
> > >
> > > UTF-8 is a byte (sequence) representation of Unicode code points. This
> > > indicates to me that an offset within a UTF-8-encoded string should also
> > > be given in bytes. Specifying the offset in Unicode code points mixes the
> > > abstraction of the Unicode code point with (one of) its representations as
> > > a byte sequence. This is reflected in the fact that an offset in Unicode
> > > code points is not applicable to the UTF-8 string without first processing
> > > the string.
> > >
> > > Unicode code points do not give us that much either since what we most
> > > likely want are grapheme clusters anyway (which, like any more advanced
> > > Unicode processing, should be handled by a specialised library):
> > > http://utf8everywhere.org/#myth.strlen
> > >
> > >
> > > Cheers,
> > >
> > > Silvan
> > This message made me feel obliged to turn my own gut feeling into
> > words. This is not to be construed as an argument, but more of an
> > explanation.
> > I view wayland protocols as rather high level: their responsibility
> > is to specify the type and the purpose of the data they are
> > transporting. In this case, the data is a Unicode string, and the
> > purpose is display. Or, the data is a number and the purpose is
> > indexing.
> > I think that when a protocol starts to specify the encoding,
> > it can no longer be thought of as high level. In this view, indexing a
> > Unicode string in terms of bytes would be akin to indexing any other
> > vector of Foo in bytes. (I didn't actually check if there is any
> > other vector, or bytes type available in wayland).
> > As you noted, there is some mixing between abstraction levels in
> > the protocol. Hardcoding that it's not *just* Unicode, but also the
> > particular encoding (UTF-8) eliminates problems with byte indexing
> > we would have encountered if we decided to use things like Punycode
> > (München => Mnchen-3ya). Knowing that it's always UTF-8 allows the
> > protocol to use a tailored indexing scheme. While I consider this a
> > layer-breaking hack, nevertheless, this property partially counters
> > the above reasoning.
> > * * *
> > To be honest, neither Unicode code points nor graphemes nor clusters
> > are what we're truly looking for here. To understand what I mean, I
> > recommend playing with this grapheme cluster:
> > नमस्ते
> > According to the Rust book, it's composed of 6 code points:
> > ['न', 'म', 'स', '्', 'त', 'े'], but moving the cursor
> > around, I would be led to believe it's 4 "pieces" long only.
> > Cheers,
> > Dorota
> >  https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html
> On second thought, perhaps graphemes are actually the relevant thing here...
Yes, that's also mentioned in the rust book:
and what I mentioned in my mail.
I agree with what is mentioned in http://utf8everywhere.org/#myth.strlen
which is that Unicode code points are almost never what people making
use of the protocol would want:
"Yet, the number of code points in it is irrelevant to almost any software
engineering task, with perhaps the only exception of converting the
string to UTF-32"
So instead just specifying a byte offset (thus not mixing layers of
abstraction) and leaving more specialized Unicode handling (if desired by
the client) to specialized libraries seems like the best way to go.
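
To make the difference concrete, here is a minimal Python sketch. It is
not taken from the protocol or from any Wayland library, and the helper
names codepoint_to_byte_offset and is_utf8_boundary are made up for
illustration. It shows that a code point offset can only be applied to
a UTF-8 buffer after processing the string, while a byte offset can be
used on the buffer directly and cheaply checked for landing on a
character boundary.

```python
def codepoint_to_byte_offset(text: str, cp_offset: int) -> int:
    """Convert a code point offset into a byte offset in the UTF-8 encoding.

    Hypothetical helper: the prefix has to be encoded (i.e. the string has
    to be processed) before the offset is usable on the byte buffer.
    """
    if not 0 <= cp_offset <= len(text):
        raise ValueError("code point offset out of range")
    return len(text[:cp_offset].encode("utf-8"))


def is_utf8_boundary(data: bytes, byte_offset: int) -> bool:
    """Return True if byte_offset does not fall inside a multi-byte sequence.

    Hypothetical helper: UTF-8 continuation bytes have the bit pattern
    10xxxxxx, so a single byte is enough to decide.
    """
    if not 0 <= byte_offset <= len(data):
        return False
    if byte_offset == len(data):
        return True
    return (data[byte_offset] & 0xC0) != 0x80


# The Devanagari example from this thread, written with escapes so the
# code points are unambiguous.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"
encoded = word.encode("utf-8")

print(len(word))      # number of code points
print(len(encoded))   # number of bytes in UTF-8

# A code point offset needs a conversion step before it can index the buffer.
byte_off = codepoint_to_byte_offset(word, 2)
print(encoded[:byte_off].decode("utf-8"))

# A byte offset indexes the buffer directly; validity is a one-byte check.
print(is_utf8_boundary(encoded, byte_off))
print(is_utf8_boundary(encoded, byte_off + 1))
```

Neither count matches the four "pieces" the cursor stops at. Grapheme
cluster segmentation is not in the Python standard library, so a client
that wants it would have to bring in a specialised Unicode library,
which is the division of labour I am arguing for.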