[PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

Mon May 7 19:55:41 UTC 2018

On Sun, May 06, 2018 at 10:37:57PM +0200, Dorota Czaplejewicz wrote:
> On Sat, 5 May 2018 13:37:44 +0200
> Silvan Jegen <s.jegen at gmail.com> wrote:
> 
> > On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
> > > On Fri, 4 May 2018 22:32:15 +0200
> > > Silvan Jegen <s.jegen at gmail.com> wrote:
> > >   
> > > > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:  
> > > > > On Thu, 3 May 2018 21:55:40 +0200
> > > > > Silvan Jegen <s.jegen at gmail.com> wrote:
> > >
> > > [...]
> > >
> > > In the end, I'm not an expert in that area either - perhaps treating
> > > client side strings as UTF-8 buffers makes sense, but at the moment
> > > I'm still leaning towards the code point abstraction.  
> > 
> > Someone (™) should probably implement a client making use of the protocol
> > to see what the real world impact of this protocol change would be.
> > 
> > The editor in the weston project uses pango for its text layout:
> > 
> > https://cgit.freedesktop.org/wayland/weston/tree/clients/editor.c#n824
> > 
> > so it would have to parse the UTF-8 string twice. The same is most likely
> > true for all programs using GTK...
> > 
> > 
> 
> I made an attempt to dig deeper, and while I stopped short of becoming
> this Someone for now, I gathered what I think are some important
> results.
> 
> First, the state of the libraries. There's a lot of data I gathered,
> so I'll keep this section rather dense. First, another contender
> for the title of text layout library, and that one uses code points
> exclusively:
> 
> https://github.com/silnrsi/graphite/blob/master/include/graphite2/Segment.h `gr_make_seg`
> 
> https://github.com/silnrsi/graphite/blob/master/tests/examples/simple.c
> 
> Afterwards, I focused on GTK and Qt. As an input method plugin
> developer, I looked at the IM interfaces and internal data structures
> they expose. The results were not that clear - no mention of "code
> points", some references to "bytes", many to "characters" (not
> "chars"). What is certain is that there's a lot of converting going on

Yes, it's very unfortunate that a lot of developers do not strive for
more clarity and precision in terminology when processing text.

> behind the scenes anyway. First off, GTK seems to be moving away from
> bytes, judging by the comments:
> 
> gtk 3.22 (`gtkimcontext.c`)
> 
> `gtk_im_context_delete_surrounding`
> 
> > * Asks the widget that the input context is attached to to delete
> > * characters around the cursor position by emitting the
> > * GtkIMContext::delete_surrounding signal. Note that @offset and @n_chars
> > * are in characters not in bytes which differs from the usage other
> > * places in #GtkIMContext.
> 
> `gtk_im_context_get_preedit_string`
> 
> > * @cursor_pos: (out): location to store position of cursor (in characters)
> > *              within the preedit string.  
> 
> `gtk_im_context_get_surrounding`
> 
> > * @cursor_index: (out): location to store byte index of the insertion
> > *        cursor within @text.
> 
> gtkEntry seems to store things internally as characters.

They mention "characters" but what they most likely mean are Unicode
code points.

One would think they would try to keep their APIs consistent but that
doesn't seem to be the case.

> While GTK using code points internally is not a proof of anything,
> it's a suggestion that there is a reason not to use bytes.
> 
> Then, Qt, from https://doc.qt.io/qt-5/qinputmethodevent.html#setCommitString
> 
> > replaceLength specifies the number of characters to be replaced
> 
> a confirmation that "characters" means "code points" comes from
> https://doc.qt.io/qt-5/qlineedit.html#cursorPosition-prop . The value
> reported when "æþ|" is displayed is 2.

https://doc.qt.io/qt-5/qstring.html

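Qt uses UTF-16 internally, so they *could* also be counting "QChars",
which are 16-bit UTF-16 code units (assuming the position is 0-indexed).
For "æþ" that gives the same result, because after the two-byte BOM
(ff fe) each character takes exactly one 16-bit unit: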

Python 3.6.5 (default, Apr 14 2018, 13:17:30)
[GCC 7.3.1 20180406] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "æþ"
'æþ'
>>> "æþ".encode("utf-16")
b'\xff\xfe\xe6\x00\xfe\x00'

If they are really doing that, you would only notice it with characters
outside of the BMP, because the QString documentation linked above says:

"(Unicode characters with code values above 65535 are stored using
surrogate pairs, i.e., two consecutive QChars.)"
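
To make the difference concrete, here is a small sketch in plain Python
(nothing from Qt; text_lengths is a made-up helper name). It counts
the same string in code points, UTF-16 code units and UTF-8 bytes. The
first two counts agree for "æþ" but not for a character outside the BMP
such as U+1D11E:

```python
def text_lengths(text):
    """Return (code points, UTF-16 code units, UTF-8 bytes) for a str."""
    code_points = len(text)  # Python 3 str is indexed by code point
    # "utf-16-le" emits no BOM, so every 2 bytes are one 16-bit code unit
    utf16_units = len(text.encode("utf-16-le")) // 2
    utf8_bytes = len(text.encode("utf-8"))
    return code_points, utf16_units, utf8_bytes


print(text_lengths("æþ"))          # (2, 2, 4)
print(text_lengths("\U0001D11E"))  # (1, 2, 4)
```

So a cursor position of 2 after "æþ" cannot tell us whether Qt counts
code points or QChars; only a test with a non-BMP character could.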

I think everybody agrees that (Unicode) text handling is a mess in
general...

> I also spent more time than I should writing a demo implementation
> of an input method and a client connecting to it to check out the
> proposed interfaces. Predictably, it gave me a lot of trouble
> on the edges between bytes and code points, but I blame it on
> Rust's scarcity of UTF handling functions. The hack is available at
> https://code.puri.sm/dorota.czaplejewicz/impoc

Thanks for taking the time! I compiled and ran it but my Rust is weak...

Rust has an interesting String type:

https://doc.rust-lang.org/std/string/struct.String.html#utf-8

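It's UTF-8 encoded, but you are not allowed to index into it with an
integer. You can only slice it by byte ranges that fall on character
boundaries, or iterate over its chars or bytes.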

> My impression at the moment is that it doesn't matter much how offsets
> within UTF strings are encoded, but that code points slightly better
> reflect what's going on in the GUI toolkits, apart from the benefits
> mentioned in my other emails. There seems to be so much going on
> behind the scenes and the parsing is so cheap that it doesn't make
> sense to worry about the computational aspect, just try to make things
> easier to get right.
> 
> Unless someone chimes in with more arguments, I'm going to keep using
> code points in following revisions.

The only argument I have for using byte offsets instead of Unicode code
points is that you will have to parse the UTF-8 string twice if your
text rendering library only accepts byte lengths: once to turn the code
point offsets into byte offsets, and once more in the library itself.
That seems to be the case for pango, which I assume is commonly used.
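
As a rough sketch of that extra pass (plain Python, not pango or
protocol code; the function name is hypothetical), a client receiving a
code point offset would have to scan the UTF-8 buffer like this before
it can hand a byte offset to a byte-based library:

```python
def code_point_to_byte_offset(utf8, cp_offset):
    """Return the byte offset in `utf8` where code point `cp_offset` starts.

    `utf8` is assumed to be valid UTF-8 (bytes). An offset equal to the
    number of code points maps to len(utf8), i.e. the end of the text.
    """
    seen = 0
    for byte_index, byte in enumerate(utf8):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if seen == cp_offset:
                return byte_index
            seen += 1
    if seen == cp_offset:
        return len(utf8)
    raise IndexError("code point offset past end of text")


text = "æþa".encode("utf-8")
print(code_point_to_byte_offset(text, 2))  # 4
```

The scan is linear in the length of the text, so it is cheap, but it is
work that would not be needed if the protocol used byte offsets.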

If I come up with more arguments I will send another mail...

Cheers,

Silvan