[PATCH] do not call _dbus_warn_check_failed on checks

Mon Nov 13 16:52:03 PST 2006

Hi,

OK, I'll write the treatise on this and hopefully put it to bed.

1. The checks are only a helpful debugging feature, and feel free to
    turn them off
===

If you just want to experience what libdbus would be like without these 
pesky checks for your bugs, build with --disable-checks. Now your bugs 
will be "ignored" - most libraries work like this.

2. The warnings are not "harmless" if you are in nonfatal warnings mode
===

Let's consider the libdbus-never-aborts case. Your best case is one 
warning to the console. Another likely case is tons and tons of warning 
spam in .xsession-errors or syslog since something that happens a lot is 
buggy. Another likely case is that libdbus returns NULL you weren't 
expecting, and your app crashes. Another likely case is that libdbus 
itself gets confused somehow - maybe down the line, since e.g. a header 
field that should have been set wasn't, etc. The backtrace from the 
later confusion will not reflect the root cause. And then a worst case, 
short of crashing right away, is that you were e.g. relying on having 
sent a message, and now you haven't; and either you send another message 
that semantically relied on the first, or you block waiting for the 
first message, or whatever, and your app hangs or grows more and more 
confused.

These cases assume libdbus even *has* a check for your bug. If you pass 
an invalid pointer instead of NULL, or just do something that's 
expensive to check for, then libdbus won't have a check and you'll 
probably get an immediate crash.

Anyway, if you use the API in an undocumented and undefined way, whether 
the checks are fatal or just warn, or even exist, the results are *in 
practice* *not* *in theory* often catastrophic and quite bad. And often 
include crashing or hanging sooner or later.

This seems to confuse people though; they think they can "recover" from 
these undocumented, undefined situations. Those people probably also 
catch SEGV and try to continue.

3. Whether the behavior of passing NULL should be documented/defined
    is an independent issue
===

If the docs say "passing in NULL results in XYZ" then we *never* use 
_dbus_return_if_fail; there will be unconditional, always-compiled-in 
code to handle that NULL which will neither abort nor warn. (Unless the 
docs are "defined to abort" or "defined to warn")

Feel free to argue that specific API dbus_blahblah should accept NULL 
for a particular argument. If we document that, then the results will no 
longer be undefined (and this discussion is not relevant).

4. Defining NULL-passed-in for every function in the API would
    significantly suck up the API
===

The real issue is that a bunch of people want passing NULL pointers to 
have defined and documented results. I'm sorry, but I think this is 
silly. We'd be adding DBusError* returns to any function that took a 
pointer arg, just so it could return a "you passed in NULL" error. If 
you want to know if you're passing in NULL, then look at the pointer 
before you pass it in and see if it's NULL.

IOW the code if we allow NULL is something like:

init_error(&error);
dbus_frobate(myptr, &error);
if (null_arg_error_set(&error)) {
    // well, what are you going to put in here? I sure don't know
    // and that's why the error isn't returned
}

If you are masochistic enough to want to write that, you are welcome to 
write this instead:

if (!myptr)
   // whatever
else
   dbus_frobate(myptr);

If you don't know what's in your pointers you have bigger problems. They 
could just as easily be bad pointers instead of NULL, and no check is 
going to keep those from crashing.

Even if you disagree with this point and think all API functions that 
take a pointer should also return an error code, then it's too late; 
it's a huge API change and our API is frozen.

5. Comparison to "wrapper libraries" like Xlib, GNU libc
===

These "wrapper libraries" take some args in the process space of your 
app, and pass them to another process or the kernel, and may return a 
runtime error either generated in-process or as returned from the other 
process/kernel.

The wrappers generally do one of three things with NULL or otherwise 
invalid arguments:
  a) crash (or behave in other undefined way)
  b) pass the invalid arg to X or the kernel and let an error be set
  c) set an "invalid arg" error themselves in-process

dbus will also do one of these three, as appropriate to the API at hand.

However, for most NULL pointers dbus will do a) because of points 3) and 
4) I've already covered. Also, there's no way to pass NULL over the bus, 
and many of the pointers you're passing in have no protocol equivalent, 
so b) doesn't even make sense.

glibc and Xlib I'm pretty sure have a number of examples of a), also.
I always program as if they do; what happens if you pass NULL to 
XDisplayName() or fread()? I don't know, but I doubt it means my program 
will work as intended.

Remember that return_if_fail() is a subset of a). When libdbus would 
have done return_if_fail with a handy warning, glibc and Xlib would 
typically just crash or something else bad (like --disable-checks).

6. Comparison to Windows API
===

Some people are used to the Windows API, which *does* return an HRESULT 
from every single call, and at the start of every single call checks for 
NULL pointers and returns a failed HRESULT if they exist.

I think this leads to a giant pain-in-the-ass API as described in 4).

Also, it's completely unhelpful for debugging, because often you pass in 
a big struct full of info, and you get an error like "there was 
something invalid in that struct somwhere" - thanks for the help, guys...

And of course if you pass in an *invalid* non-NULL pointer Windows can't 
do anything, it will often crash.

I'd rather have a nice warning and orderly exit, than an unhelpful error 
code or silently trying to continue or crashing.

7. There is a difference between a programming error and a recoverable
    error you might handle.
===

See one explanation:
http://developer.gnome.org/doc/API/2.0/glib/glib-Error-Reporting.html

When an error outside of your app's control could happen, D-Bus will 
have documented behavior where it either transparently solves the 
problem, or returns a DBusError, or otherwise reports the problem to you.

When you just pass junk in to the dbus API in a way that the docs say is 
undefined, however, then the results are undefined. This is a 
programming error in your app.

Now point-by-point on Daniel's mail:

Daniel Stone wrote:
> I was discussing this over dinner tonight, and someone asked me why I
> was mixing up the critical (X server) and non-critical (D-Bus) paths.
> My reply was that I decide what's critical and non-critical.  If it's
> truly broken, then I'll call FatalError().  If not, I'll just log a
> message about it.

What we're talking about here is what happens if you use libdbus in an 
undefined, undocumented, and therefore *unintentional* way.

You won't decide whether to call FatalError() because you aren't going 
to call dbus this way on purpose. You will only call it this way if your 
program has a bug. There is no way to write:

  message = dbus_message_new(...);
  if (check failed)
    FatalError()

Because there is no way to know that a check failed. DBus does not 
return an error code from *every function that takes a pointer that 
could be NULL*, thankfully.

> I'm somewhat bemused (and rather depressed) that the upstream attitude
> is that D-Bus is a core component on exactly the same level as the
> kernel (this was one of the justifications for never being able to
> restart D-Bus), yet it does things like throw assert()s.

You're confusing three cases.

  - internal bugs in the module (dbus or kernel). _dbus_assert() in dbus,
    and an Oops for the kernel.

  - another process sends junk to the module. This means libdbus received
    bad data/args from another app, or the kernel gets bad system call
    args. This leads to dbus returning a message of type ERROR or
    disconnecting the remote app, but libdbus will not abort the current
    app. The kernel in these cases returns a syscall error.

  - the same process uses the module in an undocumented/undefined way.
    The kernel analogy might be a device driver misusing core kernel
    APIs, or something like that.
    In dbus, *often* but by no means guaranteed there is a "check"
    (dbus_return_if_fail) to help you debug this. In the kernel, you
    are likely to get crashes, deadlocks, or even data loss.

>  Given the
> situation today, your app doesn't even have to link to D-Bus for libdbus
> to be killing your app hard with asserts.

This is flat out wrong. If libdbus crashes due to the actions of 
*another process* than the one libdbus is inside, then it's a bug in 
libdbus. (OK, unless the other process is root running "kill" etc.)

The other process is untrusted and all data from it gets validated, and 
if there's bad data then libdbus will send an error or disconnect from 
the bad process.

If *you* in your app *trust* another app not to crash and it's buggy and 
crashes, then yes you are hosed. But that is your own fault and libdbus 
gives you the tools to recover gracefully. No return_if_fail checks will 
be involved.

> Seriously, the behaviour doesn't even vaguely make sense at all.  For
> development, sure.  But there's a reason why the kernel, all POSIX APIs,
> in fact every API ever, X, HTTP, etc, return you errors when you ask for
> something invalid (a file that doesn't exist, an invalid URL, a
> malformed message), rather than ungracefully slaughter you.

The dbus *protocol* behaves just like the syscall interface or X 
protocol or HTTP; it doesn't trust the other side. What we're talking 
about here is the libdbus *library* which is in your address space and 
all your usage of it is trusted and known to you.

The GLib and GTK+ policies are exactly the same as libdbus, except they 
only warn by default. Recent devel releases of GNOME have gone to fatal 
by default, however.

I prefer fatal by default because it keeps people from being confused 
about what "undefined" means. i.e. they seem to think it means 
"magically read my mind and figure out how to work around the bugs in my 
app" instead of "what you requested makes no sense, so nothing sensible 
has occurred, and your program still thinks whatever you intended *did* 
occur, so will continue on doing more things that make no sense, and 
it's impossible for either the library or your app to know what the 
results will be"

If you still don't want to treat these check failures as serious bugs 
and possible crashes, you are welcome to putenv("DBUS_FATAL_WARNINGS=0") 
at the start of main(), and take your chances.

Havoc