[PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol
s.jegen at gmail.com
Fri May 4 20:32:15 UTC 2018
On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
> On Thu, 3 May 2018 21:55:40 +0200
> Silvan Jegen <s.jegen at gmail.com> wrote:
> > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
> > > On Thu, 3 May 2018 20:47:27 +0200
> > > Silvan Jegen <s.jegen at gmail.com> wrote:
> > >
> > > > Hi Dorota
> > > >
> > > > Some comments and typo fixes below.
> > > >
> > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
> > > > > + Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> > > > > + grapheme is made up of multiple code points, an index pointing to any of
> > > > > + them should be interpreted as pointing to the first one.
> > > >
> > > > That way we make sure we don't put the cursor/anchor between bytes that
> > > > belong to the same UTF-8 encoded Unicode code point which is nice. It
> > > > also means that the client has to parse all the UTF-8 encoded strings
> > > > into Unicode code points up to the desired cursor/anchor position
> > > > on each "preedit_string" event. For each "delete_surrounding_text" event
> > > > the client has to parse the UTF-8 sequences before and after the cursor
> > > > position up to the requested Unicode code point.
> > > >
> > > > I feel like we are processing the UTF-8 string already in the
> > > > input-method. So I am not sure that we should parse it again on the
> > > > client side. Parsing it again would also mean that the client would need
> > > > to know about UTF-8 which would be nice to avoid.
> > > >
> > > > Thoughts?
> > >
> > > The client needs to know about Unicode, but not necessarily about
> > > UTF-8. Specifying code points is actually an advantage here, because
> > > byte offsets are inherently expressed relative to UTF-8. By counting
> > > with code points, client's internal representation can be UTF-16 or
> > > maybe even something else.
> >
> > Maybe I am misunderstanding something but the protocol specifies that
> > the strings are valid UTF-8 encoded and the cursor/anchor offsets into
> > the strings are specified in Unicode points. To me that indicates that
> > the application *has to parse* the UTF-8 string into Unicode points
> > when receiving the event otherwise it doesn't know after which Unicode
> > point to draw the cursor. Of course the application can then decide to
> > convert the UTF-8 string into another encoding like UTF-16 for internal
> > processing (for whatever reason) but that doesn't change the fact that
> > it still would have to parse the incoming UTF-8 (and thus know about
> > UTF-8).
>
> Can you see any way to avoid parsing UTF-8 in order to draw the
> cursor? I tried to come up with a way to do that, but even with
> specifying byte strings, I believe that calculating the position of
> the cursor - either in pixels or in glyphs - requires full parsing of
> the input string.

Yes, I don't think it's avoidable either. You just don't have to do
it twice if your text rendering library consumes UTF-8 strings with
byte-offsets though. See my response below.

> > > There's no avoiding the parsing either. What the application cares
> > > about is that the cursor falls between glyphs. The application cannot
> > > know that in all cases. Unicode allows the same sequence to be
> > > displayed in multiple ways (fallback):
> > >
> > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > >
> > > One could make an argument that byte offsets should never be close
> > > to ZWJ characters, but I think this decision is better left to the
> > > application, which knows what exactly it is presenting to the user.
> >
> > The idea of the previous version of the protocol (from my understanding)
> > was to make sure that only valid UTF-8 and valid byte-offsets (== not
> > falling between bytes of a Unicode code point) into the string will be
> > sent to the client. If you just get a byte-offset into a UTF-8 encoded
> > string you trust the sender to honor the protocol and thus you can just
> > pass the UTF-8 encoded string unprocessed to your text rendering library
> > (provided that the library supports UTF-8 strings which is what I am
> > assuming) without having to parse the UTF-8 string into Unicode code
> > points.
> >
> > Of course the Unicode code points will have to be parsed at some point
> > if you want to render them. Using byte-offsets just lets you do that at
> > a later stage if your libraries support UTF-8.
>
> Doesn't that chiefly depend on the kind of text rendering library
> though? As far as I understand, passing text to rendering is necessary
> to calculate the cursor position. At the same time, it doesn't matter
> much for the calculations whether the cursor offset is in bytes or
> code points - the library does the parsing in the last step anyway.
>
> I think you mean that if the rendering library accepts byte offsets
> as the only format, the application would have to parse the UTF-8
> unnecessarily. I agree with this, but I'm not sure we should optimize
> for this case. Other libraries may support only code points instead.
> Did I understand you correctly?

Yes, that's what I meant. I also assumed that no text rendering library
expects you to pass the string length in Unicode code points. I had a look
and the ones I managed to find expected their lengths in bytes:

* Pango: https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text
* HarfBuzz: https://harfbuzz.github.io/hello-harfbuzz.html

For those you would need to parse the UTF-8 string yourself first in
order to find out at which byte position the Unicode code point ends where
the protocol wants you to draw the cursor (if the protocol sends Unicode
code point offsets).

I feel like it would make sense to optimize for the more common case. I
assume that is the one where you need to pass a length in bytes to the
text rendering library, not in Unicode code points.

Admittedly, I haven't used a lot of text rendering libraries so I would
very much like to hear more opinions on the issue.
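
To make the extra step concrete, here is a minimal sketch in C of what a
client would have to do with a code point index before handing the text to
a library that takes byte lengths. The function names are hypothetical and
not part of the protocol, Pango or HarfBuzz, and the code assumes the text
is valid, NUL-terminated UTF-8 as the protocol text requires. The second
function shows the much cheaper check a client could do if the protocol
sent byte offsets instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the byte offset at which code point number
 * `index` starts in `text`, or the length of the string in bytes if
 * `index` is at or past the end. Assumes valid, NUL-terminated UTF-8. */
size_t
utf8_offset_from_code_point(const char *text, size_t index)
{
	size_t offset = 0;

	assert(text != NULL);

	while (text[offset] != '\0' && index > 0) {
		/* Step over the lead byte, then over any continuation
		 * bytes, which all have the bit pattern 10xxxxxx. */
		offset++;
		while (((unsigned char)text[offset] & 0xC0) == 0x80)
			offset++;
		index--;
	}

	return offset;
}

/* Hypothetical helper: returns 1 if the byte offset `offset` lies within
 * the `length` bytes of `text` and does not point into the middle of an
 * encoded code point, 0 otherwise. This only inspects a single byte. */
int
utf8_offset_is_boundary(const char *text, size_t length, size_t offset)
{
	assert(text != NULL);

	if (offset > length)
		return 0;
	if (offset == length)
		return 1;

	return ((unsigned char)text[offset] & 0xC0) != 0x80;
}
```

With code point offsets the client walks the string up to the cursor once
per event and then passes the resulting byte count on to the rendering
library. With byte offsets it could pass the value through unchanged and,
if it does not trust the sender, validate it with the single-byte check.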