DBus API problems & UTF-8
Joerg.Barfurth at Sun.COM
Tue Jun 13 02:56:21 PDT 2006
Olivier Galibert wrote:
> On Mon, Jun 12, 2006 at 04:15:20PM +0100, Daniel P. Berrange wrote:
>> On Mon, Jun 12, 2006 at 04:46:54PM +0200, Olivier Galibert wrote:
>>> On Mon, Jun 12, 2006 at 01:47:24PM +0100, Daniel P. Berrange wrote:
>>>> If you're not already using UTF-8 in your C program, then it's at most one
>>>> single method call to convert.
>>> So you're going to pretend all the world is iso-8859-1, which happens
>>> to be encoded as utf-8. I'm not sure if it's that much of a step up
>>> from "all the world is ascii".
>> Urm, where in my mail did I say you should "pretend all the world is
>> iso-8859-1" ? I said you just need to run a conversion from the program's
>> native character set (whatever that may be) to UTF-8 & vice versa. No
>> assumption need be made about the application's native charset - only the
>> DBus charset is fixed.
> Most applications don't have a native charset. Most applications
> don't have the concept of charset at all. So converting to real utf-8
> is way way more than "one single method call".
Huh? We are talking about applications that deal with text strings
interoperating with other applications that also deal with text strings.
If an application deals with strings it needs to have a representation
of them that necessarily uses some encoding (or "character set").
Sequences of bytes (even without embedded null bytes) don't represent a
text string unless you also know (or can know) the encoding. To do
anything with it as a text string requires that there is an implicit or
explicit encoding. If you have an explicit encoding, then converting to
UTF-8 really is only a single function call. If the encoding is implicit
in some context, then either that encoding should be fixed and specified
in documentation or there should be a function or algorithm to find out.
That is a problem of your context and its character set handling
conventions. If there is no way to know the encoding, then there is
nothing you can do with those bytes as a string.
So, if you can find out the encoding, then converting to utf-8 is a
single function call plus whatever is needed to find out the applicable
encoding. If you can't find out, then you don't have a string but an
array of bytes and should treat it as such.
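
To make that distinction concrete, here is a minimal sketch in Python rather than C, just to keep it short. The function name to_utf8 and the sample bytes are my own illustrative assumptions, not part of any DBus API. With a known encoding the conversion is one line; without one, the only honest answer is to refuse to treat the data as text.

```python
from typing import Optional


def to_utf8(raw: bytes, encoding: Optional[str]) -> bytes:
    """Re-encode `raw` as UTF-8, given the encoding it is currently in.

    `to_utf8` is a hypothetical helper for illustration only.
    """
    if encoding is None:
        # No known encoding: this is an array of bytes, not a string.
        raise ValueError("encoding unknown: treat the data as bytes, not text")
    # The "single function call": decode from the known encoding,
    # then encode the resulting text as UTF-8.
    return raw.decode(encoding).encode("utf-8")


# Sample data (assumed for the example): "café" encoded as ISO-8859-1.
latin1_bytes = b"caf\xe9"
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")
```

Finding out which encoding applies (locale, file metadata, documented convention) is the part that depends on your context; the conversion itself does not.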
In DBus the character set for all strings is implied to be UTF-8 (that
is the 'implied and documented' option). That has the added benefit that
none of the interoperating parties has to be able to handle all possible
encodings and/or rely on encoding conversion libraries to interpret
strings. But those that require a particular encoding need to be able to
interpret it and convert it to/from UTF-8.
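
As a rough sketch of what that looks like at the boundary of an application that insists on its own encoding (again in Python for brevity; APP_ENCODING, to_bus and from_bus are hypothetical names I made up for this example, not DBus functions):

```python
# Assumption: this hypothetical application keeps its strings in ISO-8859-15.
APP_ENCODING = "iso-8859-15"


def to_bus(app_bytes: bytes) -> bytes:
    """Convert a string in the application's encoding to UTF-8 for sending."""
    return app_bytes.decode(APP_ENCODING).encode("utf-8")


def from_bus(utf8_bytes: bytes) -> bytes:
    """Convert a UTF-8 string received from a peer to the application's encoding.

    The strict decode raises UnicodeDecodeError if the peer sent bytes that
    are not valid UTF-8; the strict encode raises UnicodeEncodeError if a
    character cannot be represented in APP_ENCODING.
    """
    return utf8_bytes.decode("utf-8").encode(APP_ENCODING)


# Sample data (assumed): "10 €" in ISO-8859-15, where 0xA4 is the euro sign.
outgoing = to_bus(b"10 \xa4")
incoming = from_bus(outgoing)
```

Only this application needs to know about ISO-8859-15. Its peers see UTF-8 and nothing else, and what to do when a received character has no representation in the local encoding is a policy decision for this application alone.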
Joerg Barfurth phone: +49 40 23646662 / x66662
Software Engineer mailto:joerg.barfurth at sun.com
Desktop Technology http://reserv.ireland/twiki/bin/view/Argus/
Thin Client Software http://www.sun.com/software/sunray/
Sun Microsystems GmbH http://www.sun.com/software/javadesktopsystem/