Unicode validation range

Fri Feb 19 10:39:04 PST 2010

On Friday, 19 February 2010, at 18:35:14, Colin Walters wrote:
> On Sat, Feb 6, 2010 at 5:28 PM, Thiago Macieira <thiago at kde.org> wrote:
> > I'm trying to understand why we reject the FDD0-FDEF range.
> 
> I can't find offhand much information about this range - it looks like
> it's in Arabic Presentation Forms-A block?

This range is a "non-character" range. Unicode reserves those 32 codepoints, 
plus the last two codepoints of each of the 17 planes (U+FFFE, U+FFFF, 
U+1FFFE, U+1FFFF, and so on), as non-characters: 66 in total. They will 
never be assigned to anything.

This is not to be confused with "unassigned" codepoints. Those are valid 
Unicode codepoints that aren't assigned yet, but may be assigned in the future.
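
To make the definition concrete, here is a minimal Python sketch of the 
non-character test. The function name is made up for illustration; it is not 
taken from libdbus or Qt.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode non-characters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    # The contiguous block U+FDD0..U+FDEF (32 codepoints).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two codepoints of each plane: U+nFFFE and U+nFFFF.
    return (cp & 0xFFFE) == 0xFFFE
```

Unassigned codepoints are deliberately not covered: telling them apart 
requires the Unicode character database, and the answer changes between 
Unicode versions.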

Unicode says that "non-characters" cannot be used for text interchange. 
However, applications are free to use those codepoints internally for their 
own use. In fact, Qt's text framework uses U+FDD0 and U+FDD1 to delineate 
frames (QTextFrame).

> > This has been
> > causing problems in some applications leading to even remote-crashable
> > (app receives UTF-8 string from network, app sends such string via
> > D-Bus, D-Bus disconnects unexpectedly, crash).
> 
> Can you give a little more background on the concrete case?  Is this
> where e.g. Qt is being more liberal in accepting into UTF16 than dbus
> is?  What's the application in question?

Konversation.

What happened was that someone sent the UTF-8 encoding of U+FDD0 on 
IRC. Upon receiving such a message, Konversation converted it to UTF-16 using 
normal QString means and passed that message to the notification system (i.e., 
KNotify, via D-Bus).

The message got reencoded to UTF-8 and passed to libdbus-1. However, 
libdbus-1's UTF-8 validation routines concluded that it was invalid and 
dropped the connection.
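
The mismatch can be reproduced in a few lines of Python. This is only a 
sketch of the scenario: Python's built-in codec stands in for the lenient 
converter, and strict_is_valid() is a hypothetical stand-in for a validator 
that rejects non-characters, not the actual libdbus routine.

```python
def lenient_roundtrip(data: bytes) -> bytes:
    # Python's UTF-8 codec accepts non-characters, so the bytes survive
    # a decode/re-encode cycle unchanged.
    return data.decode("utf-8").encode("utf-8")


def strict_is_valid(data: bytes) -> bool:
    # Hypothetical strict validator: well-formed UTF-8 that also contains
    # no non-character codepoints.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    for ch in text:
        cp = ord(ch)
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return False
    return True


# U+FDD0 encodes to the three bytes EF B7 90.
message = b"hello \xef\xb7\x90"
print(lenient_roundtrip(message) == message)  # the lenient side passes it on
print(strict_is_valid(message))               # the strict side rejects it
```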

I'm willing to agree that QString conversion to/from UTF-8 should have caught 
those (it already blocks UTF-16 surrogate codepoints when encoded in UTF-8, as 
well as U+FFFE and U+FFFF).
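
One way a sender could "catch" these before handing text to a strict peer is 
to substitute U+FFFD (REPLACEMENT CHARACTER) for each non-character. The 
Python sketch below shows the idea; the function name is hypothetical and the 
choice of substitution is an assumption, not a description of what QString 
does.

```python
REPLACEMENT = "\ufffd"


def scrub_noncharacters(text: str) -> str:
    """Replace every Unicode non-character in text with U+FFFD."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            out.append(REPLACEMENT)
        else:
            out.append(ch)
    return "".join(out)
```

The trade-off is that the original codepoint is lost, which matters if both 
ends really do use it as a private marker.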

However, one can also argue that those two processes communicating over D-Bus 
constitute "one application" and should be allowed to use those codepoints.

> > I'm proposing we either:
> > 
> > 1) remove the unnecessary checks and allow those characters in
> > 
> > or
> > 2) update the list, to include FFFE, 1FFFE, 1FFFF, 2FFFE, 2FFFF, etc.
> 
> Another major change we could make here is to have libdbus (or
> possibly just the bus) synthesize an error on "invalid" UTF-8 rather
> than disconnecting.  This would be nontrivial, but one that I think
> developers at least would like.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Senior Product Manager - Nokia, Qt Development Frameworks
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358