DBus API problems & UTF-8
Joerg Barfurth
Joerg.Barfurth at Sun.COM
Tue Jun 13 02:56:21 PDT 2006
Olivier Galibert wrote:
> On Mon, Jun 12, 2006 at 04:15:20PM +0100, Daniel P. Berrange wrote:
>> On Mon, Jun 12, 2006 at 04:46:54PM +0200, Olivier Galibert wrote:
>>> On Mon, Jun 12, 2006 at 01:47:24PM +0100, Daniel P. Berrange wrote:
>>>> If you're not already using UTF-8 in your C program, then it's at most one
>>>> single method call to convert.
>>> So you're going to pretend all the world is iso-8859-1, which happens
>>> to be encoded as utf-8. I'm not sure if it's that much of a step up
>>> from "all the world is ascii".
>> Urm, where in my mail did I say you should "pretend all the world is
>> iso-8859-1" ? I said you just need to run a conversion from the program's
>> native character set (whatever that may be) to UTF-8 & vice versa. No
>> assumption need be made about the application's native charset - only the
>> DBus charset is fixed.
>
> Most applications don't have a native charset. Most applications
> don't have the concept of charset at all. So converting to real utf-8
> is way way more than "one single method call".
>
Huh? We are talking about applications that deal with text strings
interoperating with other applications that also deal with text strings.
If an application deals with strings it needs to have a representation
of them that necessarily uses some encoding (or "character set").
Sequences of bytes (even without embedded null bytes) don't represent a
text string unless you also know (or can know) the encoding. To do
anything with it as a text string requires that there is an implicit or
explicit encoding. If you have an explicit encoding, then converting to
utf-8 really is only a single function call. If the encoding is implicit
in some context, then either that encoding should be fixed and specified
in documentation or there should be a function or algorithm to find out.
That is a problem of your context and its character set handling
conventions. If there is no way to know the encoding, then there is
nothing you can do with those bytes as a string.
So, if you can find out the encoding, then converting to utf-8 is a
single function call plus whatever is needed to find out the applicable
encoding. If you can't find out, then you don't have a string but an
array of bytes and should treat it as such.
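As a rough illustration of the "single function call" case (a sketch in
Python rather than C; the helper name to_utf8 and the sample bytes are
made up for this example and are not part of any DBus binding): once the
encoding is known, the conversion is a decode followed by an encode. The
same bytes read under a different encoding give a different string, which
is why the bytes alone are not text.

```python
def to_utf8(raw: bytes, encoding: str) -> bytes:
    """Reinterpret raw bytes, known to be in `encoding`, as UTF-8 bytes.

    Hypothetical helper for illustration only.
    """
    return raw.decode(encoding).encode("utf-8")


# Assumed sample: the name "Jörg" stored as ISO-8859-1 bytes.
latin1_bytes = b"J\xf6rg"

# Known encoding: one call converts to UTF-8.
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")

# Same bytes, two candidate encodings, two different strings.
as_latin1 = latin1_bytes.decode("iso-8859-1")
as_koi8r = latin1_bytes.decode("koi8_r")
```

If neither encoding can be established from context, the honest type for
latin1_bytes is a byte array, not a string.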
In DBus the character set for all strings is implied to be UTF-8 (that
is the 'implied and documented' option). That has the added benefit that
none of the interoperating parties has to be able to handle all possible
encodings and/or rely on encoding conversion libraries to interpret
strings. But those that require a particular encoding need to be able to
interpret it and convert it to/from UTF-8.
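To make the last point concrete, here is a minimal sketch of what a party
with a non-UTF-8 native encoding has to do at the bus boundary. It is
written in Python for brevity; the names to_bus, from_bus and fits_native
are hypothetical and do not come from any DBus library. Note that the
reverse direction can fail when the native encoding cannot represent a
character that arrived as UTF-8.

```python
def to_bus(native: bytes, native_encoding: str) -> bytes:
    """Convert application bytes in `native_encoding` to UTF-8 for the bus."""
    return native.decode(native_encoding).encode("utf-8")


def from_bus(wire: bytes, native_encoding: str) -> bytes:
    """Convert UTF-8 bytes received from the bus to `native_encoding`.

    Raises UnicodeDecodeError if `wire` is not valid UTF-8, and
    UnicodeEncodeError if the text does not fit the native encoding.
    """
    return wire.decode("utf-8").encode(native_encoding)


def fits_native(wire: bytes, native_encoding: str) -> bool:
    """Report whether a bus string can be represented natively."""
    try:
        from_bus(wire, native_encoding)
    except UnicodeError:  # covers both decode and encode failures
        return False
    return True


# Assumed sample: an ISO-8859-1 application sending and receiving "Jörg".
wire = to_bus(b"J\xf6rg", "iso-8859-1")
back = from_bus(wire, "iso-8859-1")
```

Only the application that insists on ISO-8859-1 carries this conversion
code; every other party on the bus just sees UTF-8.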
- Jörg
--
Joerg Barfurth phone: +49 40 23646662 / x66662
Software Engineer mailto:joerg.barfurth at sun.com
Desktop Technology http://reserv.ireland/twiki/bin/view/Argus/
Thin Client Software http://www.sun.com/software/sunray/
Sun Microsystems GmbH http://www.sun.com/software/javadesktopsystem/