[PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol
Joshua Watt
jpewhacker at gmail.com
Mon May 7 03:11:32 UTC 2018
On Sun, May 6, 2018 at 3:37 PM, Dorota Czaplejewicz
<dorota.czaplejewicz at puri.sm> wrote:
> On Sat, 5 May 2018 13:37:44 +0200
> Silvan Jegen <s.jegen at gmail.com> wrote:
>
>> On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
>> > On Fri, 4 May 2018 22:32:15 +0200
>> > Silvan Jegen <s.jegen at gmail.com> wrote:
>> >
>> > > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
>> > > > On Thu, 3 May 2018 21:55:40 +0200
>> > > > Silvan Jegen <s.jegen at gmail.com> wrote:
>> > > >
>> > > > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
>> > > > > > On Thu, 3 May 2018 20:47:27 +0200
>> > > > > > Silvan Jegen <s.jegen at gmail.com> wrote:
>> > > > > >
>> > > > > > > Hi Dorota
>> > > > > > >
>> > > > > > > Some comments and typo fixes below.
>> > > > > > >
>> > > > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
>> > > > > > > > + Text is valid UTF-8 encoded, indices and lengths are in code points. If a
>> > > > > > > > + grapheme is made up of multiple code points, an index pointing to any of
>> > > > > > > > + them should be interpreted as pointing to the first one.
>> > > > > > >
>> > > > > > > That way we make sure we don't put the cursor/anchor between bytes that
>> > > > > > > belong to the same UTF-8 encoded Unicode code point which is nice. It
>> > > > > > > also means that the client has to parse all the UTF-8 encoded strings
>> > > > > > > into Unicode code points up to the desired cursor/anchor position
>> > > > > > > on each "preedit_string" event. For each "delete_surrounding_text" event
>> > > > > > > the client has to parse the UTF-8 sequences before and after the cursor
>> > > > > > > position up to the requested Unicode code point.
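>> > > > > > >
>> > > > > > > To make it concrete, here is a rough sketch (my own code, nothing
>> > > > > > > from the protocol or an existing client) of the helper every client
>> > > > > > > would need to turn a code point index into a byte offset before it
>> > > > > > > can do anything else with the string:
>> > > > > > >
>> > > > > > > #include <stddef.h>
>> > > > > > >
>> > > > > > > /* Walk a NUL-terminated, valid UTF-8 string (the protocol
>> > > > > > >  * guarantees validity) and return the byte offset of the code
>> > > > > > >  * point with index cp_index. */
>> > > > > > > static size_t codepoint_to_byte_offset(const char *s, size_t cp_index)
>> > > > > > > {
>> > > > > > >     size_t byte = 0;
>> > > > > > >
>> > > > > > >     while (cp_index > 0 && s[byte] != '\0') {
>> > > > > > >         byte++;                              /* skip the lead byte */
>> > > > > > >         while (((unsigned char)s[byte] & 0xC0) == 0x80)
>> > > > > > >             byte++;                          /* skip continuation bytes */
>> > > > > > >         cp_index--;
>> > > > > > >     }
>> > > > > > >     return byte;
>> > > > > > > }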
>> > > > > > >
>> > > > > > > I feel like we are processing the UTF-8 string already in the
>> > > > > > > input-method. So I am not sure that we should parse it again on the
>> > > > > > > client side. Parsing it again would also mean that the client would need
>> > > > > > > to know about UTF-8, which would be nice to avoid.
>> > > > > > >
>> > > > > > > Thoughts?
>> > > > > >
>> > > > > > The client needs to know about Unicode, but not necessarily about
>> > > > > > UTF-8. Specifying code points is actually an advantage here, because
>> > > > > > byte offsets are inherently expressed relative to UTF-8. By counting
>> > > > > > with code points, the client's internal representation can be UTF-16 or
>> > > > > > maybe even something else.
>> > > > >
>> > > > > Maybe I am misunderstanding something but the protocol specifies that
>> > > > > the strings are valid UTF-8 encoded and the cursor/anchor offsets into
>> > > > > the strings are specified in Unicode points. To me that indicates that
>> > > > > the application *has to parse* the UTF-8 string into Unicode points
>> > > > > when receiving the event; otherwise it doesn't know after which Unicode
>> > > > > point to draw the cursor. Of course the application can then decide to
>> > > > > convert the UTF-8 string into another encoding like UTF-16 for internal
>> > > > > processing (for whatever reason) but that doesn't change the fact that
>> > > > > it still would have to parse the incoming UTF-8 (and thus know about
>> > > > > UTF-8).
>> > > > >
>> > > > Can you see any way to avoid parsing UTF-8 in order to draw the
>> > > > cursor? I tried to come up with a way to do that, but even with
>> > > > specifying byte strings, I believe that calculating the position of
>> > > > the cursor - either in pixels or in glyphs - requires full parsing of
>> > > > the input string.
>> > >
>> > > Yes, I don't think it's avoidable either. You just don't have to do
>> > > it twice if your text rendering library consumes UTF-8 strings with
>> > > byte-offsets though. See my response below.
>> > >
>> > >
>> > > > > > There's no avoiding the parsing either. What the application cares
>> > > > > > about is that the cursor falls between glyphs. The application cannot
>> > > > > > know that in all cases. Unicode allows the same sequence to be
>> > > > > > displayed in multiple ways (fallback):
>> > > > > >
>> > > > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
>> > > > > >
>> > > > > > One could make an argument that byte offsets should never be close
>> > > > > > to ZWJ characters, but I think this decision is better left to the
>> > > > > > application, which knows what exactly it is presenting to the user.
>> > > > >
>> > > > > The idea of the previous version of the protocol (from my understanding)
>> > > > > was to make sure that only valid UTF-8 and valid byte-offsets (== not
>> > > > > falling between bytes of a Unicode code point) into the string will be
>> > > > > sent to the client. If you just get a byte-offset into a UTF-8 encoded
>> > > > > string, you trust the sender to honor the protocol and thus you can just
>> > > > > pass the UTF-8 encoded string unprocessed to your text rendering library
>> > > > > (provided that the library supports UTF-8 strings, which is what I am
>> > > > > assuming) without having to parse the UTF-8 string into Unicode code
>> > > > > points.
>> > > > >
>> > > > > Of course the Unicode code points will have to be parsed at some point
>> > > > > if you want to render them. Using byte-offsets just lets you do that at
>> > > > > a later stage if your libraries support UTF-8.
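>> > > > >
>> > > > > To illustrate (just a sketch of mine, assuming the event carried
>> > > > > byte counts the way the previous protocol version did, not the
>> > > > > actual event signature), handling delete_surrounding_text on the
>> > > > > client would then be a plain byte splice with no UTF-8 parsing at
>> > > > > all:
>> > > > >
>> > > > > #include <string.h>
>> > > > >
>> > > > > /* text: NUL-terminated UTF-8 buffer owned by the client,
>> > > > >  * cursor: byte offset of the cursor,
>> > > > >  * before/after: byte counts to delete around the cursor.
>> > > > >  * Assumes before <= cursor and cursor + after <= strlen(text),
>> > > > >  * i.e. that the input method honored the protocol. */
>> > > > > static void delete_surrounding_bytes(char *text, size_t cursor,
>> > > > >                                      size_t before, size_t after)
>> > > > > {
>> > > > >     size_t len = strlen(text);
>> > > > >
>> > > > >     memmove(text + cursor - before, text + cursor + after,
>> > > > >             len - (cursor + after) + 1);    /* +1 moves the NUL too */
>> > > > > }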
>> > > > >
>> > > > >
>> > > > Doesn't that chiefly depend on the kind of text rendering library,
>> > > > though? As far as I understand, passing text to rendering is necessary
>> > > > to calculate the cursor position. At the same time, it doesn't matter
>> > > > much for the calculations whether the cursor offset is in bytes or
>> > > > code points - the library does the parsing in the last step anyway.
>> > > >
>> > > > I think you mean that if the rendering library accepts byte offsets
>> > > > as the only format, the application would have to parse the UTF-8
>> > > > unnecessarily. I agree with this, but I'm not sure we should optimize
>> > > > for this case. Other libraries may support only code points instead.
>> > > >
>> > > > Did I understand you correctly?
>> > >
>> > > Yes, that's what I meant. I also assumed that no text rendering library
>> > > expects you to pass the string length in Unicode points. I had a look
>> > > and the ones I managed to find expected their lengths in bytes:
>> > >
>> > > * Pango: https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text
>> > > * Harfbuzz: https://harfbuzz.github.io/hello-harfbuzz.html
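>> > >
>> > > For what it's worth, here is a minimal sketch (my own code, untested
>> > > against any compositor) of the Pango side of this: both the text
>> > > length it takes and the cursor index it wants are expressed in bytes.
>> > >
>> > > /* Build with: cc sketch.c $(pkg-config --cflags --libs pangocairo) */
>> > > #include <pango/pangocairo.h>
>> > > #include <stdio.h>
>> > >
>> > > int main(void)
>> > > {
>> > >     PangoFontMap *fontmap = pango_cairo_font_map_get_default();
>> > >     PangoContext *context = pango_font_map_create_context(fontmap);
>> > >     PangoLayout *layout = pango_layout_new(context);
>> > >     const char *text = "æþ";                 /* 2 code points, 4 bytes */
>> > >     PangoRectangle strong, weak;
>> > >
>> > >     /* The length argument is in bytes; -1 means NUL-terminated. */
>> > >     pango_layout_set_text(layout, text, -1);
>> > >
>> > >     /* The cursor index is a *byte* index into the text: 2 is the
>> > >      * position between 'æ' and 'þ'. */
>> > >     pango_layout_get_cursor_pos(layout, 2, &strong, &weak);
>> > >     printf("cursor x: %d (in Pango units)\n", strong.x);
>> > >
>> > >     g_object_unref(layout);
>> > >     g_object_unref(context);
>> > >     return 0;
>> > > }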
>> >
>> > I looked a bit deeper and found hb_buffer_add_utf8:
>> >
>> > https://cgit.freedesktop.org/harfbuzz/tree/src/hb-buffer.cc#n1576
>> >
>> > It seems to require both (either?) the number of bytes (for buffer
>> > size) and the number of code points in the same call. In this case, it
>> > doesn't matter how the position information is expressed.
>>
>> Haha, as an API I think that's horrible...
>>
>>
>> > > For those you would need to parse the UTF-8 string yourself first in
>> > > order to find the byte position where the protocol wants you to draw
>> > > the cursor (if the protocol sends Unicode code point offsets).
>> > >
>> > > I feel like it would make sense to optimize for the more common case. I
>> > > assume that is the one where you need to pass a length in bytes to the
>> > > text rendering library, not in Unicode points.
>> > >
>> > > Admittedly, I haven't used a lot of text rendering libraries so I would
>> > > very much like to hear more opinions on the issue.
>> > >
>> >
>> > Even if some libraries expect to work with bytes, I see three
>> > reasons not to provide them. Most importantly, I believe that we
>> > should avoid letting people shoot themselves in the foot whenever
>> > possible. Specifying bytes leaves a lot of wiggle room to communicate
>> > invalid state. The supporting reason is that protocols shouldn't be
>> > tied to implementation details.
>>
>> I agree that this is an advantage of using offsets measured in Unicode
>> code points.
>>
>> Still, it worries me to think that for the next 10-20 years people
>> using these protocols will have to parse their UTF-8 strings into
>> Unicode code points twice for no good reason...
>>
>>
>> > The least important reason is that handling Unicode is getting better
>> > than it used to be. Taking Python as an example:
>> >
>>
>> That's true to some extent (personally I like Go's string and Unicode handling)
>> but Python is a bad example IMO. Python 3 handles strings this way while
>> Python 2 handles them in a completely different way:
>>
>> Python 2.7.15 (default, May 1 2018, 20:16:04)
>> [GCC 7.3.1 20180406] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> 'æþ'
>> '\xc3\xa6\xc3\xbe'
>> >>> 'æþ'[1]
>> '\xa6'
>>
>> and I am not sure either of them is easy and efficient to work with.
>>
>>
>> > >>> 'æþ'[1]
>> > 'þ'
>> > >>> len('æþ'.encode('utf-8'))
>> > 4
>> >
>> > Strings are natively indexed with code points. This matches at least
>> > my intuition when I'm asked to place a cursor somewhere inside a
>> > string and give its index.
>>
>> Go expects all strings to be UTF-8 encoded and they are indexed by
>> byte. You can iterate over strings to get Unicode code points (called
>> 'runes' there) should you need them:
>>
>> for offset, r := range "æþ" {
>> 	fmt.Printf("start byte pos: %d, code point: %c\n", offset, r)
>> }
>>
>> start byte pos: 0, code point: æ
>> start byte pos: 2, code point: þ
>>
>> Using Go's approach you can treat strings as UTF-8 bytes if that's all
>> you want to care about, while still having an easy way to parse them into
>> Unicode points if you need them.
>>
>>
>> > In the end, I'm not an expert in that area either - perhaps treating
>> > client side strings as UTF-8 buffers makes sense, but at the moment
>> > I'm still leaning towards the code point abstraction.
>>
>> Someone (™) should probably implement a client making use of the protocol
>> to see what the real-world impact of this protocol change would be.
>>
>> The editor in the weston project uses pango for its text layout:
>>
>> https://cgit.freedesktop.org/wayland/weston/tree/clients/editor.c#n824
>>
>> so it would have to parse the UTF-8 string twice. The same is most likely
>> true for all programs using GTK...
>>
>>
>
> I made an attempt to dig deeper, and while I stopped short of becoming this Someone for now, I gathered what I think are some important results.
>
> First, the state of the libraries. There's a lot of data I gathered, so I'll keep this section rather dense. To start with, another contender for the title of text layout library, and this one uses code points exclusively:
>
> https://github.com/silnrsi/graphite/blob/master/include/graphite2/Segment.h `gr_make_seg`
>
> https://github.com/silnrsi/graphite/blob/master/tests/examples/simple.c
>
> Afterwards, I focused on GTK and Qt. As an input method plugin developer, I looked at the IM interfaces and internal data structures they expose. The results were not that clear - no mention of "code points", some references to "bytes", many to "characters" (not "chars"). What is certain is that there's a lot of converting going on behind the scenes anyway. First off, GTK seems to be moving away from bytes, judging by the comments:
>
> gtk 3.22 (`gtkimcontext.c`)
>
> `gtk_im_context_delete_surrounding`
>
>> * Asks the widget that the input context is attached to to delete
>> * characters around the cursor position by emitting the
>> * GtkIMContext::delete_surrounding signal. Note that @offset and @n_chars
>> * are in characters not in bytes which differs from the usage other
>> * places in #GtkIMContext.
>
> `gtk_im_context_get_preedit_string`
>
>> * @cursor_pos: (out): location to store position of cursor (in characters)
>> * within the preedit string.
>
> `gtk_im_context_get_surrounding`
>
>> * @cursor_index: (out): location to store byte index of the insertion
>> * cursor within @text.
>
> GtkEntry seems to store things internally as characters.
>
> While GTK using code points internally is not proof of anything, it suggests that there is a reason not to use bytes.
>
> Then, Qt, from https://doc.qt.io/qt-5/qinputmethodevent.html#setCommitString
>
>> replaceLength specifies the number of characters to be replaced
>
> A confirmation that "characters" means "code points" comes from https://doc.qt.io/qt-5/qlineedit.html#cursorPosition-prop . The value reported when "æþ|" is displayed is 2.
>
> I also spent more time than I should have writing a demo implementation of an input method and a client connecting to it, to check out the proposed interfaces. Predictably, it gave me a lot of trouble on the edges between bytes and code points, but I blame it on Rust's scarcity of UTF handling functions. The hack is available at https://code.puri.sm/dorota.czaplejewicz/impoc
>
> My impression at the moment is that it doesn't matter much how offsets within UTF strings are encoded, but that code points slightly better reflect what's going on in the GUI toolkits, apart from the benefits mentioned in my other emails. There seems to be so much going on behind the scenes, and the parsing is so cheap, that it doesn't make sense to worry about the computational aspect; we should just try to make things easier to get right.
>
> Unless someone chimes in with more arguments, I'm going to keep using code points in the following revisions.
I don't mean to do a drive-by or bikeshed; I do actually have a vested
interest in this protocol (I've implemented the previous IM protocols
on Webkit For Wayland). I've really been meaning to try it out, but
haven't yet had time. I also have quite a bit of experience with
Unicode (and specifically UTF-8) due to my day job, so I wanted to
chime in...
IMHO, if you are doing UTF-8 (which you should), you should *always*
specify any offset in the string as a byte offset. I have a few
reasons for this:
1. Unicode is *hard*, and it has a lot of terms that people aren't
always familiar with (code points, glyphs, encodings, and, worst of
all, the overloaded term "characters"). "A byte offset in UTF-8" should
be universally and unambiguously understood.
2. Even if you specified the cursor offset as an index into a UTF-32
array of code points, you *still* could end up with the cursor "in the
middle of" a printed glyph due to combining diacriticals.
3. Due to UTF-8's self-synchronizing encoding, it is actually very
easy to determine if a given byte is the start of a code point or in
the middle of one (and even determine *which* byte in the sequence it
is). Consequently, if you do find that the offset is in the middle of
a code point, it is pretty trivial to either move to the next code
point or move back to the beginning of the current one (see the sketch
after this list). As such, I have always found a byte offset more
useful, because it can more easily be converted to a code point than
the other way around.
4. As more of a "gut feel" sort of thing... A Wayland protocol is a
pretty well-defined binary API (like a networking API...), and
specifying in bytes feels more "stable"... Sorry, I really don't have
solid data to back that up, but I would need a lot of convincing that
code points were better if someone was proposing throwing this data in
a UDP packet and blasting it across a network :)
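
To make point 3 concrete, here is a rough sketch (my own code, not
from any spec or library) of snapping an arbitrary byte offset back to
the start of the code point it falls inside, simply by skipping UTF-8
continuation bytes (bytes of the form 10xxxxxx):

#include <stddef.h>

/* s must be valid UTF-8, which the protocol already requires. */
static size_t snap_to_codepoint_start(const char *s, size_t offset)
{
    /* Continuation bytes have their two high bits set to 10;
     * lead bytes and plain ASCII bytes do not. */
    while (offset > 0 && ((unsigned char)s[offset] & 0xC0) == 0x80)
        offset--;
    return offset;
}

Going the other way - from a code point count to a byte offset -
always means scanning the string from the beginning, which is why I
find byte offsets the more useful primitive.
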
Thanks,
Joshua Watt
>
> Cheers,
> Dorota
>