[PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

Thu May 3 20:46:47 UTC 2018

On Thu, 3 May 2018 21:55:40 +0200
Silvan Jegen <s.jegen at gmail.com> wrote:

> On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
> > On Thu, 3 May 2018 20:47:27 +0200
> > Silvan Jegen <s.jegen at gmail.com> wrote:
> >   
> > > Hi Dorota
> > > 
> > > Some comments and typo fixes below.
> > > 
> > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:  
> > > > +      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> > > > +      grapheme is made up of multiple code points, an index pointing to any of
> > > > +      them should be interpreted as pointing to the first one.    
> > > 
> > > That way we make sure we don't put the cursor/anchor between bytes that
> > > belong to the same UTF-8 encoded Unicode code point which is nice. It
> > > also means that the client has to parse all the UTF-8 encoded strings
> > > into Unicode code points up to the desired cursor/anchor position
> > > on each "preedit_string" event. For each "delete_surrounding_text" event
> > > the client has to parse the UTF-8 sequences before and after the cursor
> > > position up to the requested Unicode code point.
> > > 
> > > I feel like we are processing the UTF-8 string already in the
> > > input-method. So I am not sure that we should parse it again on the
> > > client side. Parsing it again would also mean that the client would need
> > > to know about UTF-8 which would be nice to avoid.
> > > 
> > > Thoughts?  
> > 
> > The client needs to know about Unicode, but not necessarily about
> > UTF-8. Specifying code points is actually an advantage here, because
> > byte offsets are inherently expressed relative to UTF-8. By counting
> > with code points, client's internal representation can be UTF-16 or
> > maybe even something else.  
> 
> Maybe I am misunderstanding something but the protocol specifies that
> the strings are valid UTF-8 encoded and the cursor/anchor offsets into
> the strings are specified in Unicode points. To me that indicates that
> the application *has to parse* the UTF-8 string into Unicode points
> when receiving the event otherwise it doesn't know after which Unicode
> point to draw the cursor. Of course the application can then decide to
> convert the UTF-8 string into another encoding like UTF-16 for internal
> processing (for whatever reason) but that doesn't change the fact that
> it still would have to parse the incoming UTF-8 (and thus know about
> UTF-8).
> 
Can you see any way to avoid parsing UTF-8 in order to draw the cursor? I tried to come up with one, but even with byte offsets, I believe that calculating the position of the cursor, either in pixels or in glyphs, requires fully parsing the input string.
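
For reference, here is a rough sketch of the conversion a client would need when the offsets arrive as code points but its text library wants bytes. It assumes the string is valid, NUL-terminated UTF-8, as the protocol text requires; the function name is made up for illustration and is not part of any Wayland API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Wayland API): return the byte offset of the
 * code point with index cp_index in a valid, NUL-terminated UTF-8 string.
 * An index at or past the end yields the byte length of the string. */
size_t
utf8_cp_index_to_byte_offset(const char *s, size_t cp_index)
{
	size_t byte = 0;

	assert(s != NULL);
	while (s[byte] != '\0' && cp_index > 0) {
		/* Step over the lead byte of the current code point... */
		byte++;
		/* ...and over its continuation bytes, which look like 10xxxxxx. */
		while (((unsigned char)s[byte] & 0xC0) == 0x80)
			byte++;
		cp_index--;
	}
	return byte;
}
```

With the three code points "a", "é" (two bytes) and "z", index 2 maps to byte offset 3. The scan only has to tell lead bytes from continuation bytes, but it still walks the string from the start up to the cursor.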

> 
> > There's no avoiding the parsing either. What the application cares
> > about is that the cursor falls between glyphs. The application cannot
> > know that in all cases. Unicode allows the same sequence to be
> > displayed in multiple ways (fallback):
> > 
> > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > 
> > One could make an argument that byte offsets should never be close
> > to ZWJ characters, but I think this decision is better left to the
> > application, which knows what exactly it is presenting to the user.  
> 
> The idea of the previous version of the protocol (from my understanding)
> was to make sure that only valid UTF-8 and valid byte-offsets (== not
> falling between bytes of a Unicode code point) into the string will be
> sent to the client. If you just get a byte-offset into a UTF-8 encoded
> string you trust the sender to honor the protocol and thus you can just
> pass the UTF-8 encoded string unprocessed to your text rendering library
> (provided that the library supports UTF-8 strings which is what I am
> assuming) without having to parse the UTF-8 string into Unicode code
> points.
> 
> Of course the Unicode code points will have to be parsed at some point
> if you want to render them. Using byte-offsets just lets you do that at
> a later stage if your libraries support UTF-8.
> 
> 
Doesn't that chiefly depend on the kind of text rendering library, though? As far as I understand, the text has to be passed to the rendering library anyway to calculate the cursor position. At the same time, it doesn't matter much for the calculations whether the cursor offset is in bytes or code points: the library does the parsing in the last step either way.

I think you mean that if the rendering library accepts only byte offsets, the application would have to parse the UTF-8 unnecessarily. I agree with this, but I'm not sure we should optimize for this case. Other libraries may support only code points instead.
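
To illustrate that other case: if the protocol sent byte offsets and the library took only code points, the application would need roughly the following. This is only a sketch; it assumes valid, NUL-terminated UTF-8 and a byte offset that sits on a code point boundary, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Wayland API): count the code points that
 * start before byte_offset in a valid, NUL-terminated UTF-8 string.
 * Assumes byte_offset falls on a code point boundary. */
size_t
utf8_byte_offset_to_cp_index(const char *s, size_t byte_offset)
{
	size_t cp = 0;

	assert(s != NULL);
	for (size_t i = 0; i < byte_offset && s[i] != '\0'; i++) {
		/* Every byte that is not a continuation byte (10xxxxxx)
		 * starts a new code point. */
		if (((unsigned char)s[i] & 0xC0) != 0x80)
			cp++;
	}
	return cp;
}
```

Either direction costs one pass over the text before the cursor, so neither unit is free for every library.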

Did I understand you correctly?

Cheers,
Dorota