[PATCH 0/5] Improve text protocol

Thu May 2 12:56:25 PDT 2013

On Tue, Apr 16, 2013 at 06:19:47PM -0700, Bill Spitzak wrote:
> Jan Arne Petersen wrote:
> 
> >I completely agree that editing UTF-8 text as UTF-8 is fine.
> >
> >I am just wondering if we should have offsets in "Unicode code points"
> >(with the addition that for invalid byte sequences each byte counts as
> >one code point) or offsets in bytes.
> 
> The reason for the offset in bytes is that it is unambiguous about
> what position it means. Though I think errors should count as one
> code point, this avoids the need to define it at all, because the
> client does not have to agree with the input method about how to
> count them.
> 
> >And when we use offsets in bytes, how should the toolkit and input
> >method handle offsets in bytes which do not match code points?
> >
> >For example we have a surrounding text of "€–" (cursor is at offset 0):
> >0xe2 0x82 0xac 0xe2 0x80 0x93
> >
> >What should the toolkit do with such requests like the following?
> >* delete_surrounding_text(index: 1, length: 3)
> 
> I would delete the bytes indicated and show the resulting string,
> with error boxes for the now-bad bytes.
> 
> >* cursor_position(index: 2)
> 
> I would place the cursor at the position of the glyph produced by
> that byte, which could include some bytes on either side of it. Note
> that for combining characters this is a problem that needs to be
> solved even for valid UTF-8 (ie what does it mean if you point
> between the letter and the accent?).
> 
> >or
> >* preedit_style(index: 2, length: 3, style: underline) with above text
> >as preedit.
> 
> I would remember the positions of styling as bytes. However the
> renderer can render as though they are moved left to the first break
> between glyphs (ie it will preedit-highlight the character the first
> byte is in, and if the preedit region ends in the middle of a glyph
> then that glyph will not be preedited). Again this problem needs to
> be solved for combining characters anyway so this is not any more
> difficult.
> 
> In all cases the client can potentially detect that the input method
> is screwing up, and perhaps report this as a warning message.

I think consensus is that we leave the offsets as bytes.  I agree with
that, considering that: 1) it shouldn't happen, 2) when it does, the
toolkit will have to deal with it.
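
As a rough illustration of "deal with it", here is a minimal sketch of
how a toolkit might snap a byte offset that lands inside a UTF-8
sequence back to the start of that sequence. The helper name
utf8_snap_left is hypothetical and not part of any Wayland protocol or
library. It only looks at code point boundaries, so combining
characters and invalid sequences would still need separate handling.

```c
#include <stddef.h>

/*
 * Hypothetical helper (an assumption for illustration, not an existing
 * API): move a byte offset left until it no longer points at a UTF-8
 * continuation byte (bit pattern 10xxxxxx). Offsets past the end of
 * the text are clamped to len.
 */
static size_t
utf8_snap_left(const char *text, size_t len, size_t offset)
{
	if (offset > len)
		offset = len;

	while (offset > 0 && offset < len &&
	       ((unsigned char)text[offset] & 0xc0) == 0x80)
		offset--;

	return offset;
}
```

With the surrounding text from the example above (0xe2 0x82 0xac 0xe2
0x80 0x93), cursor_position(index: 2) would snap to offset 0. For
preedit_style(index: 2, length: 3), snapping both the start (2) and
the end (5) left gives the range 0..3, so only the first character is
styled, which matches the rendering Bill describes.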

Kristian