[PATCH] do not call _dbus_warn_check_failed on checks
Havoc Pennington
hp at redhat.com
Mon Nov 13 16:52:03 PST 2006
Hi,
OK, I'll write the treatise on this and hopefully put it to bed.
1. The checks are only a helpful debugging feature, and feel free to
turn them off
===
If you just want to experience what libdbus would be like without these
pesky checks for your bugs, build with --disable-checks. Now your bugs
will be "ignored" - most libraries work like this.
2. The warnings are not "harmless" if you are in nonfatal warnings mode
===
Let's consider the libdbus-never-aborts case. Your best case is one
warning to the console. Another likely case is tons and tons of warning
spam in .xsession-errors or syslog since something that happens a lot is
buggy. Another likely case is that libdbus returns NULL you weren't
expecting, and your app crashes. Another likely case is that libdbus
itself gets confused somehow - maybe down the line, since e.g. a header
field that should have been set wasn't, etc. The backtrace from the
later confusion will not reflect the root cause. And then a worst case,
short of crashing right away, is that you were e.g. relying on having
sent a message, and now you haven't; and either you send another message
that semantically relied on the first, or you block waiting for the
first message, or whatever, and your app hangs or grows more and more
confused.
These cases assume libdbus even *has* a check for your bug. If you pass
an invalid pointer instead of NULL, or just do something that's
expensive to check for, then libdbus won't have a check and you'll
probably get an immediate crash.
Anyway, if you use the API in an undocumented and undefined way, whether
the checks are fatal or just warn, or even exist, the results are *in
practice* *not* *in theory* often catastrophic and quite bad. And often
include crashing or hanging sooner or later.
This seems to confuse people though; they think they can "recover" from
these undocumented, undefined situations. Those people probably also
catch SEGV and try to continue.
3. Whether the behavior of passing NULL should be documented/defined
is an independent issue
===
If the docs say "passing in NULL results in XYZ" then we *never* use
_dbus_return_if_fail; there will be unconditional, always-compiled-in
code to handle that NULL which will neither abort nor warn. (Unless the
docs are "defined to abort" or "defined to warn")
Feel free to argue that specific API dbus_blahblah should accept NULL
for a particular argument. If we document that, then the results will no
longer be undefined (and this discussion is not relevant).
4. Defining NULL-passed-in for every function in the API would
significantly suck up the API
===
The real issue is that a bunch of people want passing NULL pointers to
have defined and documented results. I'm sorry, but I think this is
silly. We'd be adding DBusError* returns to any function that took a
pointer arg, just so it could return a "you passed in NULL" error. If
you want to know if you're passing in NULL, then look at the pointer
before you pass it in and see if it's NULL.
IOW the code if we allow NULL is something like:
init_error(&error);
dbus_frobate(myptr, &error);
if (null_arg_error_set(&error)) {
// well, what are you going to put in here? I sure don't know
// and that's why the error isn't returned
}
If you are masochistic enough to want to write that, you are welcome to
write this instead:
if (!myptr)
// whatever
else
dbus_frobate(myptr);
If you don't know what's in your pointers you have bigger problems. They
could just as easily be bad pointers instead of NULL, and no check is
going to keep those from crashing.
Even if you disagree with this point and think all API functions that
take a pointer should also return an error code, then it's too late;
it's a huge API change and our API is frozen.
5. Comparison to "wrapper libraries" like Xlib, GNU libc
===
These "wrapper libraries" take some args in the process space of your
app, and pass them to another process or the kernel, and may return a
runtime error either generated in-process or as returned from the other
process/kernel.
The wrappers generally do one of three things with NULL or otherwise
invalid arguments:
a) crash (or behave in other undefined way)
b) pass the invalid arg to X or the kernel and let an error be set
c) set an "invalid arg" error themselves in-process
dbus will also do one of these three, as appropriate to the API at hand.
However, for most NULL pointers dbus will do a) because of points 3) and
4) I've already covered. Also, there's no way to pass NULL over the bus,
and many of the pointers you're passing in have no protocol equivalent,
so b) doesn't even make sense.
glibc and Xlib I'm pretty sure have a number of examples of a), also.
I always program as if they do; what happens if you pass NULL to
XDisplayName() or fread()? I don't know, but I doubt it means my program
will work as intended.
Remember that return_if_fail() is a subset of a). When libdbus would
have done return_if_fail with a handy warning, glibc and Xlib would
typically just crash or something else bad (like --disable-checks).
6. Comparison to Windows API
===
Some people are used to the Windows API, which *does* return an HRESULT
from every single call, and at the start of every single call checks for
NULL pointers and returns a failed HRESULT if they exist.
I think this leads to a giant pain-in-the-ass API as described in 4).
Also, it's completely unhelpful for debugging, because often you pass in
a big struct full of info, and you get an error like "there was
something invalid in that struct somwhere" - thanks for the help, guys...
And of course if you pass in an *invalid* non-NULL pointer Windows can't
do anything, it will often crash.
I'd rather have a nice warning and orderly exit, than an unhelpful error
code or silently trying to continue or crashing.
7. There is a difference between a programming error and a recoverable
error you might handle.
===
See one explanation:
http://developer.gnome.org/doc/API/2.0/glib/glib-Error-Reporting.html
When an error outside of your app's control could happen, D-Bus will
have documented behavior where it either transparently solves the
problem, or returns a DBusError, or otherwise reports the problem to you.
When you just pass junk in to the dbus API in a way that the docs say is
undefined, however, then the results are undefined. This is a
programming error in your app.
Now point-by-point on Daniel's mail:
Daniel Stone wrote:
> I was discussing this over dinner tonight, and someone asked me why I
> was mixing up the critical (X server) and non-critical (D-Bus) paths.
> My reply was that I decide what's critical and non-critical. If it's
> truly broken, then I'll call FatalError(). If not, I'll just log a
> message about it.
What we're talking about here is what happens if you use libdbus in an
undefined, undocumented, and therefore *unintentional* way.
You won't decide whether to call FatalError() because you aren't going
to call dbus this way on purpose. You will only call it this way if your
program has a bug. There is no way to write:
message = dbus_message_new(...);
if (check failed)
FatalError()
Because there is no way to know that a check failed. DBus does not
return an error code from *every function that takes a pointer that
could be NULL*, thankfully.
> I'm somewhat bemused (and rather depressed) that the upstream attitude
> is that D-Bus is a core component on exactly the same level as the
> kernel (this was one of the justifications for never being able to
> restart D-Bus), yet it does things like throw assert()s.
You're confusing three cases.
- internal bugs in the module (dbus or kernel). _dbus_assert() in dbus,
and an Oops for the kernel.
- another process sends junk to the module. This means libdbus received
bad data/args from another app, or the kernel gets bad system call
args. This leads to dbus returning a message of type ERROR or
disconnecting the remote app, but libdbus will not abort the current
app. The kernel in these cases returns a syscall error.
- the same process uses the module in an undocumented/undefined way.
The kernel analogy might be a device driver misusing core kernel
APIs, or something like that.
In dbus, *often* but by no means guaranteed there is a "check"
(dbus_return_if_fail) to help you debug this. In the kernel, you
are likely to get crashes, deadlocks, or even data loss.
> Given the
> situation today, your app doesn't even have to link to D-Bus for libdbus
> to be killing your app hard with asserts.
This is flat out wrong. If libdbus crashes due to the actions of
*another process* than the one libdbus is inside, then it's a bug in
libdbus. (OK, unless the other process is root running "kill" etc.)
The other process is untrusted and all data from it gets validated, and
if there's bad data then libdbus will send an error or disconnect from
the bad process.
If *you* in your app *trust* another app not to crash and it's buggy and
crashes, then yes you are hosed. But that is your own fault and libdbus
gives you the tools to recover gracefully. No return_if_fail checks will
be involved.
> Seriously, the behaviour doesn't even vaguely make sense at all. For
> development, sure. But there's a reason why the kernel, all POSIX APIs,
> in fact every API ever, X, HTTP, etc, return you errors when you ask for
> something invalid (a file that doesn't exist, an invalid URL, a
> malformed message), rather than ungracefully slaughter you.
The dbus *protocol* behaves just like the syscall interface or X
protocol or HTTP; it doesn't trust the other side. What we're talking
about here is the libdbus *library* which is in your address space and
all your usage of it is trusted and known to you.
The GLib and GTK+ policies are exactly the same as libdbus, except they
only warn by default. Recent devel releases of GNOME have gone to fatal
by default, however.
I prefer fatal by default because it keeps people from being confused
about what "undefined" means. i.e. they seem to think it means
"magically read my mind and figure out how to work around the bugs in my
app" instead of "what you requested makes no sense, so nothing sensible
has occurred, and your program still thinks whatever you intended *did*
occur, so will continue on doing more things that make no sense, and
it's impossible for either the library or your app to know what the
results will be"
If you still don't want to treat these check failures as serious bugs
and possible crashes, you are welcome to putenv("DBUS_FATAL_WARNINGS=0")
at the start of main(), and take your chances.
Havoc
More information about the dbus
mailing list