recursive types, struct, custom, dict, etc.

Tue Jun 1 13:16:11 PDT 2004

Hi,

This is, I think, the most difficult remaining implementation task
before we're ready for 1.0, and the one remaining protocol change. I
could be wrong. See previous threads, e.g.
http://freedesktop.org/pipermail/dbus/2004-March/000840.html
http://freedesktop.org/pipermail/dbus/2004-March/000919.html
and I'm pretty sure it's come up a few other times.

I'll append some preliminary notes on the subject proposing how we
address this stuff. In essence: either add STRUCT and fully recursify
the type system, or back down again to a limited set of primitive
types.

Havoc

Current Situation
===

Wire protocol looks like:

  typecode, data, typecode, data, typecode, data

Only in the case of arrays, it's really:

  array signature, data, array signature, data

where "array signature" is a series of typecodes such as "array,
array, int" for array of array of int.

We have type CUSTOM which is:

  typecode = CUSTOM, name of type, array of bytes

where array of bytes is an unknown opaque blob interpreted by the
message recipient based on "name of type" only if the recipient has
existing knowledge of how to interpret blobs with this type name.

And type DICT which is:

  typecode = DICT, string, typecode, data, string, typecode, data,
  string, typecode, data

Here "typecode" can be "array signature" also, so effectively a DICT
is a map<string, variant>.

Proposed Situation
===

Wire protocol looks like:

  message body signature, message body data

In other words we do the whole thing as we do the array.

At the same time, we can replace CUSTOM with a STRUCT type essentially
for free; the reason is that a message body with a series of types is
the same thing as a struct, we just make it recursive and we have
structs.

For the type signature format, right now "array of array of string" is
"aas" and "int" is "i" and so forth; so e.g. foo (int, string)
has signature "is".

For structs we basically introduce grouping, which we could represent
with parens. Say we have foo (int, struct { double, double }); that
could have type signature "i(dd)". If we have
foo (int, array of struct { double, double }), that is "ia(dd)", and
so forth.

In this case structs are almost the same as CUSTOM but there's no name
for the struct. If we wanted we could name structs, maybe just insert
that into the type signature in some conventional way:
 "ia('MyStruct'dd)"

The problem with this is that it puts one bit of introspection
annotation in the protocol, while most introspection annotation is in
the Introspect() return value. More discussion under "Struct names"
below.
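
A minimal sketch of how the unnamed signature format above could be
walked, assuming the one-letter codes used in the examples (i, d, s, b),
"a" as an array prefix and parens for structs. The function names are
hypothetical and nothing here is existing D-BUS API; rejecting an empty
struct "()" is also just an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical one-letter codes taken from the examples in these notes:
 * i = int, d = double, s = string, b = boolean. */
int sig_is_basic(char c)
{
    return c != '\0' && strchr("idsb", c) != NULL;
}

/* Skip one complete type at the start of sig. Returns a pointer just
 * past it, or NULL if the signature is malformed. */
const char *sig_skip_one(const char *sig)
{
    if (sig_is_basic(*sig))
        return sig + 1;
    if (*sig == 'a')                 /* array: 'a' then the element type */
        return sig_skip_one(sig + 1);
    if (*sig == '(') {               /* struct: '(' fields ')' */
        sig++;
        if (*sig == ')')
            return NULL;             /* empty struct rejected (an assumption) */
        while (*sig != ')') {
            sig = sig_skip_one(sig);
            if (sig == NULL)
                return NULL;         /* also catches a missing ')' */
        }
        return sig + 1;
    }
    return NULL;                     /* unknown code, stray ')' or end of string */
}

/* Count the top-level arguments in a signature, or return -1 if malformed. */
int sig_count_args(const char *sig)
{
    int count = 0;

    while (*sig != '\0') {
        sig = sig_skip_one(sig);
        if (sig == NULL)
            return -1;
        count++;
    }
    return count;
}
```

Under these assumptions sig_count_args("ia(dd)") counts two arguments,
and an unterminated signature such as "i(dd" is rejected with -1.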

If we introduced a variant type (pretend its code is "v") we could
replace DICT with something like:
 "('StringVariantMap'asav)"

i.e. struct StringVariantMap { array<string>; array<variant>; }

We could even standardize some map types.

A variant type of course uses the old (current) way of doing things,
with the typecode alongside the value instead of part of the
signature.

API implications
===

get_args, append_args, etc. would allow a series of type codes for
each arg:

  DBUS_TYPE_ARRAY, DBUS_TYPE_ARRAY, DBUS_TYPE_INT =
    array of array of int

  same as "aai"

  DBUS_TYPE_STRUCT_START, DBUS_TYPE_INT, DBUS_TYPE_INT,
  DBUS_TYPE_STRUCT_END = 
    struct { int; int; }

  same as "(ii)"
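
To make the correspondence between a series of type codes and a
signature string concrete, here is a small sketch. The TC_TYPE_*
constants are stand-ins for the DBUS_TYPE_* names above, and giving each
code the value of its signature character is purely an assumption of
this example; tc_codes_to_signature() is a hypothetical helper.

```c
#include <stddef.h>

/* Stand-ins for the DBUS_TYPE_* constants named above. Giving each code
 * the value of its signature character is an assumption of this sketch. */
enum {
    TC_TYPE_INT          = 'i',
    TC_TYPE_ARRAY        = 'a',
    TC_TYPE_STRUCT_START = '(',
    TC_TYPE_STRUCT_END   = ')'
};

/* Write the signature string for a series of type codes into out.
 * Returns 0 on success, -1 if out is too small. */
int tc_codes_to_signature(const int *codes, size_t n_codes,
                          char *out, size_t out_size)
{
    size_t i;

    if (out_size < n_codes + 1)      /* room for the codes plus NUL */
        return -1;
    for (i = 0; i < n_codes; i++)
        out[i] = (char) codes[i];
    out[n_codes] = '\0';
    return 0;
}
```

With this mapping the two series above come out as "aai" and "(ii)".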

Alternatively you could change it to something like:

  struct { double d1; double d2; } my_struct;
  get_args (message, "ib(dd)", &my_int, &my_bool, &my_struct);

get_args() should not necessarily support structs though, since once
you have structs containing arrays and other such nonsense you rapidly
get into a huge API to manage them, which isn't worth it. The usual
D-BUS approach would be that you have to use message iterators to
recurse into the struct, and suck each element into a binding-specific
data type. In fact that's what I would like to do, keeping D-BUS as
only a wire protocol and not an in-process type system framework.

I would propose that whenever D-BUS implements a get_int() (or
equivalent via get_args()), the wire protocol may contain a VARIANT in
that position, which would be automatically converted to int if the
variant indeed contains an int. This would allow language bindings such
as Python to always return variant types over the wire (often the real
type is unknown), and still interoperate with other bindings. For
example, if in Python I return an empty list, that would go back as a
method reply with argument ARRAY of VARIANT, and then if some C code
asks for an ARRAY of INT it would successfully get an empty ARRAY of
INT.
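
The automatic conversion could look roughly like the following. This is
only a sketch over a hypothetical in-memory value type (VxValue); it
says nothing about how a variant would be laid out on the wire, and
vx_get_int() is not an existing D-BUS function.

```c
#include <stddef.h>

typedef enum { VX_INT, VX_STRING, VX_VARIANT } VxKind;

/* Hypothetical demarshaled value; a VARIANT points at the value it wraps. */
typedef struct VxValue {
    VxKind kind;
    int as_int;                      /* valid when kind == VX_INT */
    const char *as_string;           /* valid when kind == VX_STRING */
    const struct VxValue *inner;     /* valid when kind == VX_VARIANT */
} VxValue;

/* Fetch an int whether it was sent as a plain INT or wrapped in a
 * VARIANT. Returns 0 on success, -1 on a type mismatch. */
int vx_get_int(const VxValue *value, int *out)
{
    if (value->kind == VX_VARIANT && value->inner != NULL)
        value = value->inner;        /* unwrap one level of variant */
    if (value->kind != VX_INT)
        return -1;
    *out = value->as_int;
    return 0;
}
```

The caller asks for an int in both cases; only a variant holding some
other type (or a non-int value) is reported as a mismatch.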

Discussion
===

Reasons to make this change:

 - it's all elegant and stuff

 - it should clean up the code a bit, the code is currently doing
   things both ways (for arrays and for everything else), though
   keeping a variant type preserves the two ways to some extent

 - we can typecheck incoming messages with a single strcmp();
   also overloaded methods could be more quickly routed

 - maps more naturally to statically typed languages

 - makes custom types more interoperable, since there's a 
   standard marshaling format available. e.g. QPixmap 
   could be a struct with width, height, and byte array of pixels,
   instead of a QDataStream-based format used with type CUSTOM.

 - allows a mode of operation where the type signature is simply
   omitted (we could make it an optional header field), in this case
   data validation happens late when the app unpacks the message
   rather than immediately when the library loads the message off the
   wire. May or may not be useful to allow this or make it the
   normal mode. This mode precludes method overloading.
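
As an illustration of the strcmp() point above, a sketch of routing an
incoming call to one of several overloads by comparing the whole body
signature at once. RtEntry and rt_route() are hypothetical names, and
handler_id merely stands in for whatever callback a binding would
register.

```c
#include <stddef.h>
#include <string.h>

/* One registered handler: a method name plus the exact signature it accepts. */
typedef struct {
    const char *method;
    const char *signature;
    int handler_id;                  /* stands in for a real callback */
} RtEntry;

/* Return the handler_id whose method and signature both match, or -1. */
int rt_route(const RtEntry *table, size_t n_entries,
             const char *method, const char *signature)
{
    size_t i;

    for (i = 0; i < n_entries; i++) {
        if (strcmp(table[i].method, method) == 0 &&
            strcmp(table[i].signature, signature) == 0)
            return table[i].handler_id;
    }
    return -1;
}
```

A table with entries for foo "is" and foo "i(dd)" then picks the right
overload with one string comparison per candidate, and a message with
any other signature is rejected before its body is touched.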

Reasons not to make this change:

 - structs are probably sort of annoying to deal with in language
   bindings

 - overall complexity of a recursive type system; however, "array of
   array" has already caused us to eat this complexity, for the most 
   part. To avoid it we really need to go back to ARRAY_OF_INT etc. 
   as primitives, with no recursive arrays.

 - it's a change, and will break things and be a fair bit of work

 - building the message via append, append, append will be appending
   to both message body and type signature, which means two realloc
   rather than one

 - demarshaling will have to track locations in both signature and
   data (already true for arrays, though)

 - in some cases where we currently return an integer type code, we
   may need to return a type signature string
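
The "two realloc rather than one" cost can be seen in a small sketch of
an append that grows a signature buffer and a body buffer side by side.
All names here (BbBuf, BbMessage, bb_append_int32) are hypothetical, and
the body encoding (raw host-endian int32, no alignment padding) is an
assumption made only to keep the example short.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A simple growable byte buffer. */
typedef struct {
    char *data;
    size_t len;
    size_t cap;
} BbBuf;

/* Append n bytes, growing the buffer when needed. Returns 0 or -1. */
int bb_append(BbBuf *buf, const void *src, size_t n)
{
    if (buf->len + n > buf->cap) {
        size_t cap = buf->cap ? buf->cap : 16;
        char *grown;

        while (cap < buf->len + n)
            cap *= 2;
        grown = realloc(buf->data, cap);
        if (grown == NULL)
            return -1;
        buf->data = grown;
        buf->cap = cap;
    }
    memcpy(buf->data + buf->len, src, n);
    buf->len += n;
    return 0;
}

/* A message under construction: signature and body grow separately. */
typedef struct {
    BbBuf signature;                 /* type codes, e.g. "ii" (not NUL-terminated) */
    BbBuf body;                      /* marshaled values */
} BbMessage;

/* Append one int32: one code to the signature, four bytes to the body.
 * A real implementation would also need to undo the signature append
 * if the body append fails; this sketch does not. */
int bb_append_int32(BbMessage *msg, int32_t value)
{
    if (bb_append(&msg->signature, "i", 1) != 0)
        return -1;
    return bb_append(&msg->body, &value, sizeof value);
}
```

Each append touches both buffers, so each may trigger its own
reallocation; a doubling growth policy like the one assumed here keeps
that cost amortized.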

So the summary, I would say, is that we should either drop array of
array and go back to a straightforward hardcoded type list, plus an
escape hatch of CUSTOM, or go all the way and get the benefits of
adding STRUCT and breaking type signatures apart from type codes.

Odds and Ends
===

The NIL type:

  NIL doesn't make a hell of a lot of sense as a *type*; really it's a
  value that's allowed in *some* languages to replace a value of any
  type. I think we need to get rid of DBUS_TYPE_NIL since I can't make
  any sense out of it.

Struct names:

  I think there's a good argument to be made that struct names should
  not be in the type signature or protocol, but instead be in the 
  introspection data (where we also have arg names already, and could 
  add struct field names in addition to the name of the struct
  itself).

  So in this case taking the StringVariantMap example, you could 
  bind that as a hash table if you had the introspection data and thus 
  knew it was StringVariantMap, vs. some other struct with an array of
  string and an array of variant. But without the introspection data 
  you'd bind it as a generic struct with two arrays in it.

  I'm in favor of not putting the struct name in the type signature 
  for this reason.