DBus API problems & UTF-8

Mon Jun 12 02:40:01 PDT 2006

On Mon, 2006-06-12 at 12:06, ext Thiago Macieira wrote:
> Kimmo Hämäläinen wrote:
> >For example, if dbus_message_iter_append_basic() returns FALSE, the
> >caller cannot know whether 1) an invalid argument was provided, or 2)
> >out-of-memory happened. However, the caller might want to handle
> >situation 1 differently from 2.
> [snip]
> >Is there any will to fix these API problems? I propose fixing them by
> >providing additional API that would be as close to the old as possible
> >and slowly deprecating the old API.
> 
> The solution would probably be to have a DBusError that is attached to 
> that message and is set by any failing functions. This way, there would 
> be no API changes, but it would allow you to obtain the error condition.

You mean that the caller could check if the message is valid (no errors
happened) before sending it?
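
If so, I imagine something along these lines. This is only a rough
sketch with made-up names (SketchMessage, sketch_append_string,
sketch_message_get_error and the SKETCH_ERR_* values are all
hypothetical, none of this is libdbus API), and it uses a small fixed
buffer so that the "out of memory" case can be shown without a real
allocator:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical error kinds; libdbus does not define these. */
typedef enum {
    SKETCH_ERR_NONE = 0,
    SKETCH_ERR_INVALID_ARG,
    SKETCH_ERR_NO_MEMORY
} SketchErrorKind;

/* Hypothetical stand-in for a message with a fixed-size body. */
typedef struct {
    char body[64];
    size_t used;
    SketchErrorKind error;   /* first failure seen, kept for the caller */
} SketchMessage;

void sketch_message_init(SketchMessage *msg)
{
    assert(msg != NULL);
    msg->used = 0;
    msg->error = SKETCH_ERR_NONE;
}

/* Record only the first failure so later ones do not hide its cause. */
static void sketch_set_error(SketchMessage *msg, SketchErrorKind kind)
{
    if (msg->error == SKETCH_ERR_NONE)
        msg->error = kind;
}

/* Returns false on failure, as the current API does, but also says why. */
bool sketch_append_string(SketchMessage *msg, const char *s)
{
    assert(msg != NULL);
    if (s == NULL) {
        sketch_set_error(msg, SKETCH_ERR_INVALID_ARG);
        return false;
    }
    size_t len = strlen(s) + 1;              /* include the NUL */
    if (len > sizeof msg->body - msg->used) {
        sketch_set_error(msg, SKETCH_ERR_NO_MEMORY);
        return false;
    }
    memcpy(msg->body + msg->used, s, len);
    msg->used += len;
    return true;
}

/* The check a caller could make before sending the message. */
SketchErrorKind sketch_message_get_error(const SketchMessage *msg)
{
    assert(msg != NULL);
    return msg->error;
}
```

The existing boolean return value stays as it is, and a caller that
cares can ask the message afterwards whether the failure was an invalid
argument or a lack of memory.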

> It would require, internally, that the validation checks 
> (_dbus_return_if_val and similar functions) be changed to take an error 
> condition as well.

Yes, some code needs to be changed, but I think the client API should be
as reliable and predictable as possible. If it's not, it will do a lot of
damage when DBus usage spreads to a great number of applications.

> >Btw. why on earth does DBus have to limit valid string data to UTF-8? I
> >see no reason why the string data should even be validated in the
> >server (as it now is). Seems like another unnecessary limitation -- or
> >perhaps some kind of political statement (think of some very widely
> >used Asian encodings).
> 
> Because we don't want to hear about encodings. If you do that, then client 
> and server have to negotiate an encoding before they can start receiving 
> strings from one another. There's also the potential that one of the two 
> doesn't have the necessary codec installed.

I mean DBUS_TYPE_STRING. I think the specification should just say that
it's a NUL-terminated sequence of bytes. That way we don't have to care
about encodings, and we don't have to verify UTF-8 in the server (there
is already enough unnecessary O(n) stuff happening in the code...).

My point is that the DBus specification does not seem to give any reason
for specifying UTF-8 as the encoding. A NUL-terminated array of non-zero
bytes would allow for more efficient communication when some other
encoding is used between applications, and the validity check for the
string data would be left entirely to the applications (where it belongs
-- DBus is just a message bus, it should not inspect the content).
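
To make the cost concrete, here is roughly what a UTF-8 check has to do.
This is my own illustration, not the code in the DBus server, and the
name sketch_utf8_is_valid is made up. It takes an explicit length and
checks only UTF-8 well-formedness:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative UTF-8 validator (not the libdbus implementation).
 * Rejects bad lead/continuation bytes, truncated sequences, overlong
 * forms, surrogates and code points above U+10FFFF.
 * It does not treat an embedded zero byte specially. */
bool sketch_utf8_is_valid(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra = 0;          /* continuation bytes expected */
        unsigned long cp = 0;      /* decoded code point */
        unsigned long min = 0;     /* smallest value for this length */

        if (c < 0x80) {            /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;          /* stray continuation or bad lead */
        }

        if (len - i - 1 < extra)
            return false;          /* sequence cut off at the end */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;      /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min)
            return false;          /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;          /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return false;          /* beyond Unicode range */
        i += extra + 1;
    }
    return true;
}
```

Every byte of every string is examined, and a string in another
encoding is rejected outright: the Latin-1 bytes 'H', 0xE4, 'm' fail
because 0xE4 looks like the start of a three-byte sequence.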

> So we avoid the problem by saying that all strings are Unicode (just like 
> in Java and Qt). The wire format just says that the encoding is UTF-8 for 
> strings. Internally, the applications can use whatever encoding they 
> prefer (in Qt, we expand to UTF-16 before passing to the user).

Java and Qt are different, because they need to process the string data.
DBus is a message bus and should not care about the actual content as
long as the message format is correct.

> If you need to send something in another encoding, send as ARRAY of BYTE, 
> with the encoding identification (e.g., the MIB) as a separate parameter.

Yes, a byte array is an alternative, but isn't it just a workaround for
bad design? Applications would benefit from a generic string type,
because otherwise they need code for distinguishing byte array (binary
data) from a string (NUL-terminated, of some encoding). DBus is
supposedly meant to serve applications, not the other way around.
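
To show the kind of extra code I mean, here is a sketch of what the
receiving side ends up with under the "ARRAY of BYTE plus encoding id"
approach. All names are hypothetical (SketchEncodedText,
sketch_is_decodable_text), and using 0 to mean "plain binary data" is
just a convention of this sketch. The MIB numbers are the IANA MIBenum
values for those character sets:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* IANA MIBenum values for a few character sets. */
enum { MIB_SHIFT_JIS = 17, MIB_EUC_JP = 18, MIB_UTF8 = 106 };

/* Sketch-only convention: 0 marks opaque binary data, not text. */
enum { SKETCH_MIB_BINARY = 0 };

/* Hypothetical pair of arguments: the raw bytes plus an encoding id. */
typedef struct {
    const unsigned char *bytes;
    size_t len;
    int mib;
} SketchEncodedText;

/* Receiver side: is this payload text that the application can decode
 * with one of the codecs it has available? */
bool sketch_is_decodable_text(const SketchEncodedText *t,
                              const int *supported, size_t n_supported)
{
    assert(t != NULL);
    assert(supported != NULL || n_supported == 0);
    if (t->mib == SKETCH_MIB_BINARY)
        return false;              /* binary blob, not a string */
    for (size_t i = 0; i < n_supported; i++) {
        if (supported[i] == t->mib)
            return true;           /* text in a known encoding */
    }
    return false;                  /* text, but no codec for it here */
}
```

Every application that wants non-UTF-8 text has to carry a check like
this, where a generic string type would need none of it.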

Dropping UTF-8 would probably be simple -- just removing the validation
from the server (that seems to be the only place interested in the
content) and updating the specification. However, it would still be a
dangerous change at this point because some applications could be
counting on it. (I'm just bringing this up for some future version of
the specification.)

BR; Kimmo