DBus API problems & UTF-8

Joerg Barfurth Joerg.Barfurth at Sun.COM
Tue Jun 13 02:56:21 PDT 2006

Olivier Galibert wrote:
> On Mon, Jun 12, 2006 at 04:15:20PM +0100, Daniel P. Berrange wrote:
>> On Mon, Jun 12, 2006 at 04:46:54PM +0200, Olivier Galibert wrote:
>>> On Mon, Jun 12, 2006 at 01:47:24PM +0100, Daniel P. Berrange wrote:

>>>> If you're not already using UTF-8 in your C program, then it's at most one
>>>> single method call to convert.

>>> So you're going to pretend all the world is iso-8859-1, which happens
>>> to be encoded as utf-8.  I'm not sure if it's that much of a step up
>>> from "all the world is ascii".

>> Urm, where in my mail did I say you should "pretend all the world is
>> iso-8859-1"? I said you just need to run a conversion from the program's
>> native character set (whatever that may be) to UTF-8 & vice versa. No
>> assumption need be made about the application's native charset - only the
>> DBus charset is fixed.
> Most applications don't have a native charset.  Most applications
> don't have the concept of charset at all.  So converting to real utf-8
> is way way more than "one single method call".

Huh? We are talking about applications that deal with text strings 
interoperating with other applications that also deal with text strings. 
If an application deals with strings it needs to have a representation 
of them that necessarily uses some encoding (or "character set"). 
Sequences of bytes (even without embedded null bytes) don't represent a 
text string unless you also know (or can know) the encoding. To do 
anything with it as a text string requires that there is an implicit or 
explicit encoding. If you have an explicit encoding, then converting to 
UTF-8 really is only a single function call. If the encoding is implicit 
in some context, then either that encoding should be fixed and specified 
in documentation or there should be a function or algorithm to find out. 
That is a problem of your context and its character set handling 
conventions. If there is no way to know the encoding, then there is 
nothing you can do with those bytes as a string.

So, if you can find out the encoding, then converting to UTF-8 is a 
single function call plus whatever is needed to find out the applicable 
encoding. If you can't find out, then you don't have a string but an 
array of bytes and should treat it as such.
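
A minimal sketch of that distinction, written in Python for brevity (the
thread concerns C programs, where iconv(3) or a similar library call
would play the same role). The function name to_utf8 and the sample data
are hypothetical, chosen only for illustration.

```python
from typing import Optional


def to_utf8(raw: bytes, encoding: Optional[str]) -> bytes:
    """Re-encode raw as UTF-8, provided its current encoding is known."""
    if encoding is None:
        # Without an encoding these bytes are not a text string at all.
        raise ValueError("encoding unknown: treat the data as a byte array")
    # The conversion itself: decode from the known encoding, encode as UTF-8.
    return raw.decode(encoding).encode("utf-8")


# Hypothetical sample: the name "Jörg" as ISO-8859-1 bytes.
latin1_name = b"J\xf6rg"
utf8_name = to_utf8(latin1_name, "iso-8859-1")
```

The conversion is the single decode/encode line; everything else is the
work of finding out (or failing to find out) which encoding applies.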

In DBus the character set for all strings is implied to be UTF-8 (that 
is the 'implied and documented' option). That has the added benefit that 
none of the interoperating parties has to be able to handle all possible 
encodings and/or rely on encoding conversion libraries to interpret 
strings. But those that require a particular encoding need to be able to 
interpret it and convert it to/from UTF-8.
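
As a rough sketch of that last point, again in Python for brevity: an
application that insists on its own encoding converts only at the bus
boundary. The names NATIVE, to_bus and from_bus are hypothetical, and
ISO-8859-15 is just an assumed example of a "particular encoding"; none
of this is DBus API.

```python
# Assumption for illustration: the application works in ISO-8859-15.
NATIVE = "iso-8859-15"


def to_bus(native_bytes: bytes) -> bytes:
    """Convert a string from the application's encoding to UTF-8."""
    return native_bytes.decode(NATIVE).encode("utf-8")


def from_bus(bus_bytes: bytes) -> bytes:
    """Convert a UTF-8 string from the bus to the application's encoding."""
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    text = bus_bytes.decode("utf-8")
    # Raises UnicodeEncodeError if a character has no native representation.
    return text.encode(NATIVE)


# Hypothetical sample: the euro sign is byte 0xA4 in ISO-8859-15.
euro_on_bus = to_bus(b"\xa4")
euro_native = from_bus(euro_on_bus)
```

Only this one application needs to know about ISO-8859-15; every other
party on the bus sees UTF-8 and nothing else.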

- Jörg

Joerg Barfurth           phone: +49 40 23646662 / x66662
Software Engineer        mailto:joerg.barfurth at sun.com
Desktop Technology       http://reserv.ireland/twiki/bin/view/Argus/
Thin Client Software     http://www.sun.com/software/sunray/
Sun Microsystems GmbH    http://www.sun.com/software/javadesktopsystem/
