[PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol
Dorota Czaplejewicz
dorota.czaplejewicz at puri.sm
Sat May 5 09:09:10 UTC 2018
On Fri, 4 May 2018 22:32:15 +0200
Silvan Jegen <s.jegen at gmail.com> wrote:
> On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
> > On Thu, 3 May 2018 21:55:40 +0200
> > Silvan Jegen <s.jegen at gmail.com> wrote:
> >
> > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
> > > > On Thu, 3 May 2018 20:47:27 +0200
> > > > Silvan Jegen <s.jegen at gmail.com> wrote:
> > > >
> > > > > Hi Dorota
> > > > >
> > > > > Some comments and typo fixes below.
> > > > >
> > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
> > > > > > + Text is valid UTF-8 encoded; indices and lengths are in code points. If a
> > > > > > + grapheme is made up of multiple code points, an index pointing to any of
> > > > > > + them should be interpreted as pointing to the first one.
> > > > >
> > > > > That way we make sure we don't put the cursor/anchor between bytes that
> > > > > belong to the same UTF-8 encoded Unicode code point, which is nice. It
> > > > > also means that the client has to parse all the UTF-8 encoded strings
> > > > > into Unicode code points up to the desired cursor/anchor position
> > > > > on each "preedit_string" event. For each "delete_surrounding_text" event
> > > > > the client has to parse the UTF-8 sequences before and after the cursor
> > > > > position up to the requested Unicode code point.
> > > > >
> > > > > I feel like we are processing the UTF-8 string already in the
> > > > > input-method. So I am not sure that we should parse it again on the
> > > > > client side. Parsing it again would also mean that the client would need
> > > > > to know about UTF-8, which would be nice to avoid.
> > > > >
> > > > > Thoughts?
> > > >
> > > > The client needs to know about Unicode, but not necessarily about
> > > > UTF-8. Specifying code points is actually an advantage here, because
> > > > byte offsets are inherently expressed relative to UTF-8. By counting
> > > > with code points, the client's internal representation can be UTF-16
> > > > or maybe even something else.
> > >
> > > Maybe I am misunderstanding something, but the protocol specifies that
> > > the strings are encoded as valid UTF-8 and the cursor/anchor offsets
> > > into the strings are specified in Unicode code points. To me that
> > > indicates that the application *has to parse* the UTF-8 string into
> > > Unicode code points when receiving the event, since otherwise it
> > > doesn't know after which code point to draw the cursor. Of course the
> > > application can then decide to convert the UTF-8 string into another
> > > encoding like UTF-16 for internal processing (for whatever reason), but
> > > that doesn't change the fact that it still would have to parse the
> > > incoming UTF-8 (and thus know about UTF-8).
> > >
> > Can you see any way to avoid parsing UTF-8 in order to draw the
> > cursor? I tried to come up with a way to do that, but even with
> > specifying byte strings, I believe that calculating the position of
> > the cursor - either in pixels or in glyphs - requires full parsing of
> > the input string.
>
> Yes, I don't think it's avoidable either. You just don't have to do
> it twice if your text rendering library consumes UTF-8 strings with
> byte-offsets though. See my response below.
>
>
> > > > There's no avoiding the parsing either. What the application cares
> > > > about is that the cursor falls between glyphs. The application cannot
> > > > know that in all cases. Unicode allows the same sequence to be
> > > > displayed in multiple ways (fallback):
> > > >
> > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > > >
> > > > One could make an argument that byte offsets should never be close
> > > > to ZWJ characters, but I think this decision is better left to the
> > > > application, which knows what exactly it is presenting to the user.
> > >
> > > The idea of the previous version of the protocol (from my understanding)
> > > was to make sure that only valid UTF-8 and valid byte offsets (i.e. not
> > > falling between bytes of a Unicode code point) into the string would be
> > > sent to the client. If you just get a byte offset into a UTF-8 encoded
> > > string, you trust the sender to honor the protocol and thus you can just
> > > pass the UTF-8 encoded string unprocessed to your text rendering library
> > > (provided that the library supports UTF-8 strings, which is what I am
> > > assuming) without having to parse the UTF-8 string into Unicode code
> > > points.
> > >
> > > Of course the Unicode code points will have to be parsed at some point
> > > if you want to render them. Using byte-offsets just lets you do that at
> > > a later stage if your libraries support UTF-8.
> > >
> > >
> > Doesn't that chiefly depend on the kind of text rendering library,
> > though? As far as I understand, passing the text to the renderer is
> > necessary to calculate the cursor position. At the same time, it doesn't
> > matter much for the calculations whether the cursor offset is in bytes
> > or code points - the library does the parsing in the last step anyway.
> >
> > I think you mean that if the rendering library accepts byte offsets
> > as the only format, the application would have to parse the UTF-8
> > unnecessarily. I agree with this, but I'm not sure we should optimize
> > for this case. Other libraries may support only code points instead.
> >
> > Did I understand you correctly?
>
> Yes, that's what I meant. I also assumed that no text rendering library
> expects you to pass the string length in Unicode code points. I had a
> look and the ones I managed to find expected their lengths in bytes:
>
> * Pango: https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text
> * Harfbuzz: https://harfbuzz.github.io/hello-harfbuzz.html
I looked a bit deeper and found hb_buffer_add_utf8:
https://cgit.freedesktop.org/harfbuzz/tree/src/hb-buffer.cc#n1576
It seems to take both (or either?) the length in bytes (for the buffer
size) and the length in code points in the same call. In that case, it
doesn't matter how the position information is expressed.
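
Either way, a client holding the string can derive whichever count the
library wants. Just to illustrate the two units (plain Python, not
HarfBuzz's actual API):

>>> text = 'üñïçødé'
>>> len(text.encode('utf-8'))  # length in bytes (buffer size)
13
>>> len(text)  # length in code points
7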
>
> For those you would need to parse the UTF-8 string yourself first in
> order to find out at which byte position the code point ends where the
> protocol wants you to draw the cursor (if the protocol sends code point
> offsets).
>
> I feel like it would make sense to optimize for the more common case. I
> assume that is the one where you need to pass a length in bytes to the
> text rendering library, not in Unicode code points.
>
> Admittedly, I haven't used a lot of text rendering libraries so I would
> very much like to hear more opinions on the issue.
>
Even if some libraries expect to work with bytes, I see three reasons not
to provide byte offsets. Most importantly, I believe that we should avoid
letting people shoot themselves in the foot whenever possible, and
specifying offsets in bytes leaves a lot of wiggle room to communicate
invalid state. A supporting reason is that protocols shouldn't be tied to
implementation details.
The least important reason is that language support for handling Unicode
is better than it used to be. Taking Python as an example:
>>> 'æþ'[1]
'þ'
>>> len('æþ'.encode('utf-8'))
4
Strings are natively indexed by code points, even though their UTF-8
encoding is longer. This matches at least my intuition about what the
index should be when I'm asked to place a cursor somewhere inside a
string and name its position.
In the end, I'm not an expert in this area either - perhaps treating
client-side strings as UTF-8 buffers makes sense, but at the moment I'm
still leaning towards the code point abstraction.
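
For what it's worth, when a client's rendering library does want byte
offsets, converting from a code point index is a one-liner; a minimal
sketch in Python (the helper name is made up):

>>> def code_point_index_to_byte_offset(text, index):
...     # The UTF-8 length of the prefix is the byte offset of that code point.
...     return len(text[:index].encode('utf-8'))
...
>>> code_point_index_to_byte_offset('æþ', 1)
2

Going the other way - from a byte offset back to code points - would first
require validating that the offset doesn't fall inside a multi-byte
sequence, which is exactly the kind of invalid state mentioned above.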
Cheers,
Dorota