DBus API problems & UTF-8

Tue Jun 13 02:56:21 PDT 2006

Olivier Galibert wrote:
> On Mon, Jun 12, 2006 at 04:15:20PM +0100, Daniel P. Berrange wrote:
>> On Mon, Jun 12, 2006 at 04:46:54PM +0200, Olivier Galibert wrote:
>>> On Mon, Jun 12, 2006 at 01:47:24PM +0100, Daniel P. Berrange wrote:

>>>> If you're not already using UTF-8 in your C program, then it's at most one
>>>> single method call to convert.

>>> So you're going to pretend all the world is iso-8859-1, which happens
>>> to be encoded as utf-8.  I'm not sure if it's that much of a step up
>>> from "all the world is ascii".

>> Urm, where in my mail did I say you should "pretend all the world is
>> iso-8859-1"? I said you just need to run a conversion from the program's
>> native character set (whatever that may be) to UTF-8 & vice versa. No
>> assumption need be made about the application's native charset - only the
>> DBus charset is fixed.
> 
> Most applications don't have a native charset.  Most applications
> don't have the concept of charset at all.  So converting to real utf-8
> is way way more than "one single method call".
> 

Huh? We are talking about applications that deal with text strings 
interoperating with other applications that also deal with text strings. 
If an application deals with strings it needs to have a representation 
of them that necessarily uses some encoding (or "character set"). 
A sequence of bytes (even one without embedded NUL bytes) doesn't
represent a text string unless you also know (or can find out) the
encoding. Doing anything with it as a text string requires an implicit
or explicit encoding. If you have an explicit encoding, then converting
to UTF-8 really is only a single function call. If the encoding is
implicit in some context, then either that encoding should be fixed and
specified in documentation, or there should be a function or algorithm
to find it out. That is a matter of your context and its character set
handling conventions. If there is no way to know the encoding, then
there is nothing you can do with those bytes as a string.
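
To illustrate why bytes alone are not text, here is a small sketch in
Python (chosen only for brevity; the byte value and the helper name
is_valid_utf8 are my own examples, nothing from DBus). The same byte
decodes to different characters under two single-byte encodings, and
is not valid UTF-8 at all:

```python
# Illustrative only: the byte value below is an arbitrary example.
raw = b"\xe9"

# The same byte means different things under different encodings.
as_latin1 = raw.decode("iso-8859-1")  # U+00E9, LATIN SMALL LETTER E WITH ACUTE
as_koi8r = raw.decode("koi8_r")       # a Cyrillic letter, not U+00E9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Without knowing which of these interpretations was intended, there is
no way to say which text the byte stands for.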

So, if you can find out the encoding, then converting to utf-8 is a 
single function call plus whatever is needed to find out the applicable 
encoding. If you can't find out, then you don't have a string but an 
array of bytes and should treat it as such.
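
A minimal sketch of that rule, again in Python and under stated
assumptions: to_utf8 and context_encoding are hypothetical names of my
own, and using the locale as the "context" is just one possible
convention, not something DBus prescribes.

```python
import locale
from typing import Optional


def context_encoding() -> str:
    """One possible way to 'find out' an implicit encoding: ask the locale."""
    return locale.getpreferredencoding(False)


def to_utf8(data: bytes, encoding: Optional[str]) -> Optional[bytes]:
    """Re-encode data as UTF-8, or return None if the encoding is unknown."""
    if encoding is None:
        # Not a string: the caller should keep treating this as raw bytes.
        return None
    # The actual conversion is a single decode/encode step.
    return data.decode(encoding).encode("utf-8")
```

For example, to_utf8(b"J\xf6rg", "iso-8859-1") yields the UTF-8 bytes
for "Jörg", while to_utf8(data, None) refuses to guess.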

In DBus the character set for all strings is implied to be UTF-8 (that 
is the 'implied and documented' option). That has the added benefit that 
none of the interoperating parties has to be able to handle all possible 
encodings and/or rely on encoding conversion libraries to interpret 
strings. But a party that requires some other encoding internally needs
to be able to interpret it and convert it to/from UTF-8.
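
As a rough sketch of what that means for one such party (Python again;
APP_ENCODING, to_bus and from_bus are hypothetical names, and
ISO-8859-15 is only an assumed example of an application's internal
encoding), all conversion happens at the boundary and knows about
exactly two encodings, its own and UTF-8:

```python
# Assumption for illustration: this application stores text as ISO-8859-15.
APP_ENCODING = "iso-8859-15"


def to_bus(local: bytes) -> bytes:
    """Convert the application's internal bytes to UTF-8 for sending."""
    return local.decode(APP_ENCODING).encode("utf-8")


def from_bus(wire: bytes) -> bytes:
    """Convert received UTF-8 bytes to the application's internal encoding.

    Raises UnicodeDecodeError if wire is not valid UTF-8, and
    UnicodeEncodeError if a character has no internal representation.
    """
    return wire.decode("utf-8").encode(APP_ENCODING)
```

The peer on the other side never needs to know that ISO-8859-15 was
involved. What to do when from_bus hits a character the internal
encoding cannot represent is a policy decision left to the application.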

- Jörg

-- 
Joerg Barfurth           phone: +49 40 23646662 / x66662
Software Engineer        mailto:joerg.barfurth at sun.com
Desktop Technology       http://reserv.ireland/twiki/bin/view/Argus/
Thin Client Software     http://www.sun.com/software/sunray/
Sun Microsystems GmbH    http://www.sun.com/software/javadesktopsystem/