python dbus: dealing with strings

Wed Sep 7 02:41:07 PDT 2011

On Wed, 07 Sep 2011 at 10:04:05 +0200, Neal H. Walfield wrote:
> I've found that dbus.UTF8String sometimes fails if its argument is a
> unicode string that contains non-ascii characters (also: dbus.String
> fails if its argument is a normal string that contains non-ascii
> characters).

dbus.UTF8String is a subtype of str; dbus.String is a subtype of unicode.
That should hopefully tell you all you need to know about the behaviour
of their constructors, and the misfeatures they inherit from those Python 2
types.

You might be misinterpreting them as distinct D-Bus types: they're not,
they're distinct Python representations of one D-Bus type ("s"). You can
use whichever you prefer and you'll get a D-Bus string sent to dbus-daemon -
D-Bus strings are always UTF-8, but dbus-python knows how to turn a Python
unicode object (or equivalently, a dbus.String) into UTF-8 for sending.

In particular:

>   >>> dbus.UTF8String(u'ä')

This is equivalent to dbus.UTF8String(str(u"ä")), which encodes the unicode
object using the global default codec; the "default default" is ASCII, so it
fails. dbus.String(u"ä") would work.
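
As a rough sketch of the same failure without dbus-python involved (standard
library only; the variable names are just for illustration): encoding U+00E4
with the ASCII codec, which is what the implicit conversion amounts to under
the default codec, raises UnicodeEncodeError, while an explicit UTF-8 encode
succeeds.

```python
text = u"\u00e4"  # U+00E4 as a unicode object

# What the implicit str() conversion amounts to when the default codec is ASCII.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly does not depend on the default codec.
utf8_bytes = text.encode("utf-8")

print(ascii_ok)
print(repr(utf8_bytes))
```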

>   >>> dbus.UTF8String('ä')
>   dbus.UTF8String('\xc3\xa4')

U+00E4 LATIN SMALL LETTER A WITH DIAERESIS is 0xC3 0xA4 in UTF-8, so this is
working correctly (in this particular case that's because you're using
the interactive Python prompt in a UTF-8 locale - I'd guess you're on modern
Linux or Mac OS).

This is closer to what UTF8String is designed for - it's very convenient
when interfacing with libraries like Gtk, where string-typed properties etc.
are 'str' objects which are guaranteed to contain valid UTF-8.

The meaning of the UTF8String constructor is essentially "You see this 8-bit
string? I assert that it is valid UTF-8. Assume that it is." (I forget whether
dbus-python checks this assumption; it should.)
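
If you would rather check that assertion yourself before handing bytes over,
something along these lines should do. This is only a sketch: is_valid_utf8 is
a hypothetical helper, not part of dbus-python, and it relies on Python's own
UTF-8 codec, which may not be exactly as strict as D-Bus is.

```python
def is_valid_utf8(data):
    """Return True if the 8-bit string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_ok = is_valid_utf8(b"\xc3\xa4")  # UTF-8 encoding of U+00E4
latin1_ok = is_valid_utf8(b"\xe4")    # Latin-1 encoding of U+00E4
```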

> Further, the other dbus conversion functions
> appear to do their best to convert the argument to the right type.
> Consider:
> 
>   >>> dbus.Int32("3")
>   dbus.Int32(3)

This is inherited from the 'int' type in the same way (Int32 is a subtype of
int).
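
To illustrate the inheritance point without dbus-python: any subclass of int
picks up int's string-parsing constructor for free. Int32Like here is a
made-up stand-in, not the real dbus.Int32.

```python
class Int32Like(int):
    """Hypothetical stand-in for a type that subclasses int."""


# The constructor is inherited from int, so it parses strings like int does.
value = Int32Like("3")
print(value, isinstance(value, int))
```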

> In my particular case, I want to avoid burdening the client with
> having to do casts itself by, say, only accepting UTF8 strings: the
> data is coming from the web via feedparser.

If the client gives you unicode objects, you can pass them straight through and
dbus-python will do the right thing (i.e. convert from whatever Python's
Unicode type is on your platform - typically UCS-4 on Linux, UTF-16 on
Windows - to UTF-8).

If the client gives you str objects, there is no right thing for dbus-python
to do - you're giving it a blob of bytes whose encoding is unknown,
dbus-python needs UTF-8, so it can never do better than a wild guess without
more information. The client has to tell you what encoding the data is in,
otherwise you (and dbus-python) have no way to know what the right thing to
do is.
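
A small sketch of why guessing is hopeless: the same two bytes decode without
error under both UTF-8 and Latin-1, but to different text, and nothing in the
bytes themselves says which reading was intended.

```python
blob = b"\xc3\xa4"  # two bytes of unknown encoding

as_utf8 = blob.decode("utf-8")      # one character: U+00E4
as_latin1 = blob.decode("latin-1")  # two characters: U+00C3 U+00A4

print(repr(as_utf8), repr(as_latin1))
```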

To be honest, the easiest way is probably to require a unicode object,
and fail on non-unicode input, requiring the caller to decode bytes of
unspecified encoding in advance: explicit is better than implicit.

    # given this input:
    a1 = u"\u00e4"     # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
    a2 = "\xe4"        # 0xE4, the Latin-1 encoding of U+00E4
    a3 = "\xc3\xa4"    # 0xC3 0xA4, the UTF-8 encoding of U+00E4

    # these three are equivalent
    foo(a1, b, c)
    foo(a2.decode('latin-1'), b, c)
    foo(a3.decode('utf-8'), b, c)

With hindsight, dbus-python should only have accepted unicode objects.
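
If you want to enforce that in your own wrapper, a minimal sketch might look
like this. require_text and text_type are names I've made up for illustration;
text_type is 'unicode' on Python 2 and 'str' on Python 3.

```python
import sys

# The text type: unicode on Python 2, str on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def require_text(value):
    """Return value unchanged if it is text; refuse 8-bit strings."""
    if not isinstance(value, text_type):
        raise TypeError("expected a unicode string, got %s; decode it first"
                        % type(value).__name__)
    return value


# Callers decode explicitly, stating the encoding they know the data is in.
a = require_text(b"\xc3\xa4".decode("utf-8"))
```

The result can then be passed straight to the D-Bus method, as in the three
equivalent calls above.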

>   def foo(a, b, c):
>     return iface.foo(a if isinstance(a, unicode) else dbus.UTF8String(a),
>                      dbus.Int32(b), dbus.Boolean(c))

This will fail if a is a str object containing non-UTF8 data (in practice,
this typically means a str object containing latin-1 or windows-1252 data,
but it could be something more exotic).

> That's ugly.

Welcome to Python 2 unicode handling... (Python 3 fixes most of it.)

    S