Unicode validation range
Thiago Macieira
thiago at kde.org
Fri Feb 19 10:39:04 PST 2010
On Friday 19 February 2010, at 18:35:14, Colin Walters wrote:
> On Sat, Feb 6, 2010 at 5:28 PM, Thiago Macieira <thiago at kde.org> wrote:
> > I'm trying to understand why we reject the FDD0-FDEF range.
>
> I can't find offhand much information about this range - it looks like
> it's in Arabic Presentation Forms-A block?
This range is a "non-character" range. Unicode reserves those 32 codepoints,
plus the last two codepoints in each plane (U+FFFE and U+FFFF, U+1FFFE and
U+1FFFF, and so on), as non-characters. It means they will never be assigned
to anything.
This is not to be confused with "unassigned": unassigned codepoints are valid
Unicode codepoints that aren't assigned yet, but may be assigned in the future.
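To make the definition concrete, here is a minimal sketch of a non-character
test in Python. The function name is_noncharacter is made up for illustration;
it is not taken from libdbus or Qt. It only encodes the rule described above:
the block U+FDD0..U+FDEF plus the last two codepoints of each of the 17 planes.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is a Unicode non-character (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # The contiguous block of 32 non-characters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    return (cp & 0xFFFE) == 0xFFFE


# Enumerate all of them: 32 in the block plus 2 for each of the 17 planes.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
```

The mask test works because both U+nFFFE and U+nFFFF have all of bits 1 to 15
set, regardless of the plane number in the upper bits.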
Unicode says that "non-characters" cannot be used for text interchange.
However, applications are free to use those codepoints internally for their
own use. In fact, Qt's text framework uses U+FDD0 and U+FDD1 to delineate
frames (QTextFrame).
> > This has been
> > causing problems in some applications leading to even remote-crashable
> > (app receives UTF-8 string from network, app sends such string via
> > D-Bus, D-Bus disconnects unexpectedly, crash).
>
> Can you give a little more background on the concrete case? Is this
> where e.g. Qt is being more liberal in accepting into UTF16 than dbus
> is? What's the application in question?
Konversation.
What happened was that someone wrote the UTF-8 code corresponding to U+FDD0 on
IRC. Upon receiving such a message, Konversation converted it to UTF-16 using
normal QString means and passed that message to the notification system (i.e.,
KNotify, via D-Bus).
The message got reencoded to UTF-8 and passed to libdbus-1. However,
libdbus-1's UTF-8 validation routines concluded that it was invalid and
dropped the connection.
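The mismatch can be reproduced in a few lines. This is only a sketch of the
situation: Python's built-in UTF-8 codec stands in for the lenient QString
conversion, and strict_validate is a hypothetical stand-in for a validator that
rejects the FDD0-FDEF range. It is not libdbus's actual code.

```python
# U+FDD0 encoded in UTF-8 is the three bytes EF B7 90.
raw = b"hi \xef\xb7\x90"

# A lenient decoder accepts the non-character and round-trips it unchanged.
text = raw.decode("utf-8")
reencoded = text.encode("utf-8")


def strict_validate(data: bytes) -> bool:
    """Hypothetical strict check: well-formed UTF-8 with no U+FDD0..U+FDEF."""
    try:
        decoded = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return not any(0xFDD0 <= ord(ch) <= 0xFDEF for ch in decoded)


# The sender considers the string fine; the strict receiver does not.
accepted_by_sender = reencoded == raw
accepted_by_receiver = strict_validate(reencoded)
```

The point is that the two sides disagree about what "valid UTF-8" means, so a
string that one layer produced without complaint is fatal to the next.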
I'm willing to agree that QString conversion to/from UTF-8 should have caught
those (it already blocks UTF-16 surrogate codepoints when encoded in UTF-8, as
well as U+FFFE and U+FFFF).
However, one can also argue that those two processes communicating over D-Bus
constitute "one application" and should be allowed to use those codepoints.
> > I'm proposing we either:
> >
> > 1) remove the unnecessary checks and allow those characters in
> >
> > or
> > 2) update the list, to include FFFE, 1FFFE, 1FFFF, 2FFFE, 2FFFF, etc.
>
> Another major change we could make here is to have libdbus (or
> possibly just the bus) synthesize an error on "invalid" UTF-8 rather
> than disconnecting. This would be nontrivial, but one that I think
> developers at least would like.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Senior Product Manager - Nokia, Qt Development Frameworks
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358