[PATCH] do not call _dbus_warn_check_failed on checks

Tue Nov 14 02:07:45 PST 2006

On Mon, Nov 13, 2006 at 07:52:03PM -0500, Havoc Pennington wrote:
> OK, I'll write the treatise on this and hopefully put it to bed.

If only. :)

> 1. The checks are only a helpful debugging feature, and feel free to
>    turn them off
> ===
> 
> If you just want to experience what libdbus would be like without these 
> pesky checks for your bugs, build with --disable-checks. Now your bugs 
> will be "ignored" - most libraries work like this.

Yes, except the distributions all build

And my bugs won't be 'ignored'.  You're missing the point entirely; I
don't want libdbus to continue if it knows it's in an impossible
situation.  I don't want you to remove all the check lines.  I want you
to change them to return NULL, return -1, *errval = YOU_HAVE_FAILED &&
return, or whatever helps you sleep best.

libdbus is not, it must be said, a simple library.  What you're talking
about is not even checking against NULL pointers being passed in (which
I'm careful to guard against), but what if I pass in a slightly invalid
message due to a bug on my side?  What if I'm dealing with an incoming
message and only read half the args before I decide I don't want to know
any more about it -- because it's a treacherous message trying to poison
the server -- and throw it away?  It's entirely conceivable that someone
could assert() on unref because all the args weren't inspected.

Note the 'critical paths' thing here: if I fail to deal with input
hotplug properly, I take all the reasonable precautions to ensure I
don't segfault in my _own_ code.  If libdbus takes precautions to ensure
it doesn't segfault in _its_ code, then that's fantastic.  I know it
can't extend to all situations, such as an invalid pointer.  But I still
want it to have a defined 'you screwed up' return value, for non-obvious
situations like 'the message is invalid for this bus', 'one of your
arguments is NULL, and this was fine in previous versions but not
anymore', etc.

So, say I somehow fail at adding an input device.  Why should that bring
down my server in such an unclean way that the next attempt to switch
VTs or start X again, will result in a completely wedged system (no
SysRq for you, hope everything was saved and your disk synced)?  Why
should it bring down a word processor so hard it doesn't have a chance
to save?

You're optimising for shit coders.  And that's great, if you want to do
that, then you can, but I like to think that X and most open-source
projects can do better than that, and will.

> 2. The warnings are not "harmless" if you are in nonfatal warnings mode
> ===
> 
> Let's consider the libdbus-never-aborts case. Your best case is one 
> warning to the console. Another likely case is tons and tons of warning 
> spam in .xsession-errors or syslog since something that happens a lot is 
> buggy. Another likely case is that libdbus returns NULL you weren't 
> expecting, and your app crashes. Another likely case is that libdbus 
> itself gets confused somehow - maybe down the line, since e.g. a header 
> field that should have been set wasn't, etc. The backtrace from the 
> later confusion will not reflect the root cause. And then a worst case, 
> short of crashing right away, is that you were e.g. relying on having 
> sent a message, and now you haven't; and either you send another message 
> that semantically relied on the first, or you block waiting for the 
> first message, or whatever, and your app hangs or grows more and more 
> confused.

Logs filling up happens, people deal.  If libdbus returns NULL you
weren't expecting, and you crash, then it's provably your bug.  If
libdbus returns NULL and you catch it, then you can decide exactly what
to do with this error.  Kind of like open returning a negative value,
rather than assert()ing because the programmer was too stupid to call
access().

I'm not arguing for these checks to be removed entirely -- not even
remotely.  What I'm suggesting, is that the checks turn from 'your
entire machine suffers a fiery assert() death', to 'define an error
return value in the API, use it'.

Bad coders here will still get punished.  Generally rigid coders who
happen to slip up somewhere will get the opportunity to do the right
thing, instead of forcing the vision of purity on all the world and
removing all bugs ever, or something.

(Think of corner cases where you can't reproduce it normally.  If you
 can log something useful, maybe even throw an exception and get a
 backtrace, then the good coder will be able to find out what's wrong
 despite having no way to reproduce it, and maybe fix it.  assert()
 gives you no useful information at all, ever, and enforces its own
 worldview on the app.)

> These cases assume libdbus even *has* a check for your bug. If you pass 
> an invalid pointer instead of NULL, or just do something that's 
> expensive to check for, then libdbus won't have a check and you'll 
> probably get an immediate crash.

If libdbus does not have a check for your bug, then assert() does not
help, so this is irrelevant.

> Anyway, if you use the API in an undocumented and undefined way, whether 
> the checks are fatal or just warn, or even exist, the results are *in 
> practice* *not* *in theory* often catastrophic and quite bad. And often 
> include crashing or hanging sooner or later.

Sure.  If I pass in message=0xfffffffffffffffe, then I would expect bad
things to be happening.  Nothing can prevent that.  But you're going off
on an irrelevant tangent again: when libdbus _detects_ an error it
_knows_ about, it should tell you, rather than blowing up your house,
and shooting your dog.

I can guarantee you that this will lead to _loss_ of information.  When
the X server goes down that hard, nothing will be usefully logged, you
will just end up losing very badly.  So all people know is that the X
server came down hard, gdm restarted and then hey, IERR.  As opposed to
useful logged information about what happened and possibly why.  Again,
this penalises good coders, by optimising for rent-a-coders.  If that's
the direction you want to take, then I'm glad for you, but not
personally interested.

> This seems to confuse people though; they think they can "recover" from 
> these undocumented, undefined situations. Those people probably also 
> catch SEGV and try to continue.

Awesome, a strawman.  You can check Xorg and KDrive's segfault handlers
if you like, and note that this is provably untrue.

> 3. Whether the behavior of passing NULL should be documented/defined
>    is an independent issue
> ===
> 
> If the docs say "passing in NULL results in XYZ" then we *never* use 
> _dbus_return_if_fail; there will be unconditional, always-compiled-in 
> code to handle that NULL which will neither abort nor warn. (Unless the 
> docs are "defined to abort" or "defined to warn")
> 
> Feel free to argue that specific API dbus_blahblah should accept NULL 
> for a particular argument. If we document that, then the results will no 
> longer be undefined (and this discussion is not relevant).

Make your undefined behaviour less obnoxious.  At every single point
where you have an assert(), wonder what a useful way to signal an error
back to the _application_ consumer of your _library_ is, such that they
can signal the user, who should obviously be forced to care about D-Bus
errors.  Then change your asserts, to returning errors.

If the app is buggy, then the app is buggy, and it'll crash.  If the app
is careful, with a subtle typo or race somewhere, but the people who
wrote it were careful enough to check for errors, then they'll decide
what's appropriate.

> 4. Defining NULL-passed-in for every function in the API would
>    significantly suck up the API
> ===
> 
> The real issue is that a bunch of people want passing NULL pointers to 
> have defined and documented results. I'm sorry, but I think this is 
> silly. We'd be adding DBusError* returns to any function that took a 
> pointer arg, just so it could return a "you passed in NULL" error. If 
> you want to know if you're passing in NULL, then look at the pointer 
> before you pass it in and see if it's NULL.
> 
> IOW the code if we allow NULL is something like:
> 
> init_error(&error);
> dbus_frobate(myptr, &error);
> if (null_arg_error_set(&error)) {
>    // well, what are you going to put in here? I sure don't know
>    // and that's why the error isn't returned
> }

No, you write this:
if (!myptr)
    return;
dbus_frobate(myptr, &error);
if (dbus_error_is_set(&error)) {
    /* some incredibly obscure check triggered, no-one cares */
    g_error("loss");
}

> If you are masochistic enough to want to write that, you are welcome to 
> write this instead:
> 
> if (!myptr)
>   // whatever
> else
>   dbus_frobate(myptr);

Yes, and people do.

> If you don't know what's in your pointers you have bigger problems. They 
> could just as easily be bad pointers instead of NULL, and no check is 
> going to keep those from crashing.

No, they're not, but your behaviour when they're obviously bad
(explicitly initialising all your pointers rather than using
uninitialised memory), is obnoxious.  But this is not just about NULL
pointers, surely.

> Even if you disagree with this point and think all API functions that 
> take a pointer should also return an error code, then it's too late; 
> it's a huge API change and our API is frozen.

Guess I have to bribe Alp into rewriting libdbus in C, too.

> 5. Comparison to "wrapper libraries" like Xlib, GNU libc
> ===

Before we start here, let me say that Xlib is a) provably not even close
to the API you want, b) provably close to the API you have.

If you don't believe me, look at X's error handling, and the I/O error
handling (a grossly overloaded term) in particular.

> These "wrapper libraries" take some args in the process space of your 
> app, and pass them to another process or the kernel, and may return a 
> runtime error either generated in-process or as returned from the other 
> process/kernel.

Yes.

> The wrappers generally do one of three things with NULL or otherwise 
> invalid arguments:
>  a) crash (or behave in other undefined way)
>  b) pass the invalid arg to X or the kernel and let an error be set
>  c) set an "invalid arg" error themselves in-process

Yes.

> dbus will also do one of these three, as appropriate to the API at hand.

assert() is never, ever, appropriate.  There's no excuse.  This to me,
is axiomatic, coming from my attempted (foolish) use of D-Bus somewhere
where assert() leads to IERR.

> However, for most NULL pointers dbus will do a) because of points 3) and 
> 4) I've already covered. Also, there's no way to pass NULL over the bus, 
> and many of the pointers you're passing in have no protocol equivalent, 
> so b) doesn't even make sense.

Right, so we're down to 'crash', or 'gracefully return an error to the
callee'.  Honestly, I know which one I'd pick (D-Bus is not code that
desperately needs to be performant as it's on the critical path, so you
can't use glibc as a parallel).

> glibc and Xlib I'm pretty sure have a number of examples of a), also.
> I always program as if they do; what happens if you pass NULL to 
> XDisplayName() or fread()? I don't know, but I doubt it means my program 
> will work as intended.

Yes, glibc has a number of examples of a) on performance-critical paths,
but, shockingly, lots more of c).  Xlib has a lot of b), and still quite
a bit of c).

> Remember that return_if_fail() is a subset of a). When libdbus would 
> have done return_if_fail with a handy warning, glibc and Xlib would 
> typically just crash or something else bad (like --disable-checks).

Um, this is not actually the case.  But maybe I just have strange ideas
about what APIs I use should do.

> 6. Comparison to Windows API
> ===
> 
> Some people are used to the Windows API, which *does* return an HRESULT 
> from every single call, and at the start of every single call checks for 
> NULL pointers and returns a failed HRESULT if they exist.

I haven't used the Windows API since I was tooling around with ISAPI
filters when I was 10.  But this seems like a decent idea to me.  If you
can check, and it won't provably cost the application in terms of
performance for you to do so, why not?

> Also, it's completely unhelpful for debugging, because often you pass in 
> a big struct full of info, and you get an error like "there was 
> something invalid in that struct somwhere" - thanks for the help, guys...

Um.  assert() brings your entire program down roughly because you had
something invalid in that struct somewhere.  return NULL brings your
entire program down gracefully (or allows you to continue, at the app's
discretion: again, poor apps will be poor apps regardless, whereas good
apps can work out the criticality and isolation, and continue on if they
feel it's safe) because you had something invalid in that struct
somewhere.

If you're arguing for fine-grained errors to be returned to the app,
then sure, go ahead.  But assert() ain't it.

> And of course if you pass in an *invalid* non-NULL pointer Windows can't 
> do anything, it will often crash.

Of course it can't.  But that's no excuse to assert() on a NULL pointer.

> I'd rather have a nice warning and orderly exit, than an unhelpful error 
> code or silently trying to continue or crashing.

You aren't even making any sense.  First, you're arguing for libraries
to randomly assert(), bringing down whatever was foolish enough to link
them in.  Now you're arguing for nice warnings and orderly exits, rather
than either unhelpful error codes ('_dst == NULL in
dbus-message.c:97 oh my god the sky is falling') that mean nothing to
the user and are thoroughly useless to the developer, or continuing and
crashing?

If your mail consisted solely of this paragraph, I'd construe it as
violent agreement.

> Now point-by-point on Daniel's mail:
> 
> Daniel Stone wrote:
> >I was discussing this over dinner tonight, and someone asked me why I
> >was mixing up the critical (X server) and non-critical (D-Bus) paths.
> >My reply was that I decide what's critical and non-critical.  If it's
> >truly broken, then I'll call FatalError().  If not, I'll just log a
> >message about it.
> 
> What we're talking about here is what happens if you use libdbus in an 
> undefined, undocumented, and therefore *unintentional* way.
> 
> You won't decide whether to call FatalError() because you aren't going 
> to call dbus this way on purpose. You will only call it this way if your 
> program has a bug. There is no way to write:
> 
>  message = dbus_message_new(...);
>  if (check failed)
>    FatalError()

Your assertion here is not even vaguely correct, because you seem to
miss the point of APIs that have defined error returns.

        if (!dbus_message_iter_init(message, &iter)) {
            ErrorF("[config] failed to init iterator\n");
            dbus_error_free(&error);
            return DBUS_HANDLER_RESULT_NEED_MEMORY; /* ?? */
        }

I find it odd that all the functions you're claiming have no way to ever
possibly return an error under any circumstances known to mankind,
already have defined returns for having run out of memory.  Instead of,
y'know, assert(foo = malloc(...));

Since you're pretty screwed if malloc() is returning NULL, surely.  (I
think even the X server, which is pretty careful about these sorts of
things, inadvertently does five or ten malloc()s in its fatal error
path.)

> Because there is no way to know that a check failed. DBus does not 
> return an error code from *every function that takes a pointer that 
> could be NULL*, thankfully.

You say thankful, I say flawed, shortsighted, and irritatingly lazy API
design.

> >I'm somewhat bemused (and rather depressed) that the upstream attitude
> >is that D-Bus is a core component on exactly the same level as the
> >kernel (this was one of the justifications for never being able to
> >restart D-Bus), yet it does things like throw assert()s.
> 
> You're confusing three cases.
> 
>  - internal bugs in the module (dbus or kernel). _dbus_assert() in dbus,
>    and an Oops for the kernel.

No.  Internal bugs in the kernel are generally BUG_ON(), not OOPS().  An
oops is 'if I allow you to continue, then data loss will occur'.  A bug
is just that, a bug.

D-Bus conflates the two cases here into assert().

Go read kernel code some day, and then come back and tell me that no-one
bothers performing error checking, and that every single error ever
produced by the kernel results in an oops.

>  - another process sends junk to the module. This means libdbus received
>    bad data/args from another app, or the kernel gets bad system call
>    args. This leads to dbus returning a message of type ERROR or
>    disconnecting the remote app, but libdbus will not abort the current
>    app. The kernel in these cases returns a syscall error.
> 
>  - the same process uses the module in an undocumented/undefined way.
>    The kernel analogy might be a device driver misusing core kernel
>    APIs, or something like that.
>    In dbus, *often* but by no means guaranteed there is a "check"
>    (dbus_return_if_fail) to help you debug this. In the kernel, you
>    are likely to get crashes, deadlocks, or even data loss.

You're engaging in excess semantics here, by using the value of the
context to differentiate.  That gettimeofday() or open() makes a syscall
is basically irrelevant to me.  Point is, my app makes a function call,
and I'm going to be pretty annoyed if it results in an assert().
Context is irrelevant, the usage is the same.

If I call dbus_message_init() or something, and it asserts on errors,
rather than returning NULL using the already-defined API to signal an
error to the calling application, I will be incredibly irritated.

> > Given the
> >situation today, your app doesn't even have to link to D-Bus for libdbus
> >to be killing your app hard with asserts.
> 
> This is flat out wrong. If libdbus crashes due to the actions of 
> *another process* than the one libdbus is inside, then it's a bug in 
> libdbus. (OK, unless the other process is root running "kill" etc.)
> 
> The other process is untrusted and all data from it gets validated, and 
> if there's bad data then libdbus will send an error or disconnect from 
> the bad process.
> 
> If *you* in your app *trust* another app not to crash and it's buggy and 
> crashes, then yes you are hosed. But that is your own fault and libdbus 
> gives you the tools to recover gracefully. No return_if_fail checks will 
> be involved.

That is all true, but it does not change my statement.  But, 
s/link/directly &/ for clarity, and maybe now you see what I'm getting
at.

> >Seriously, the behaviour doesn't even vaguely make sense at all.  For
> >development, sure.  But there's a reason why the kernel, all POSIX APIs,
> >in fact every API ever, X, HTTP, etc, return you errors when you ask for
> >something invalid (a file that doesn't exist, an invalid URL, a
> >malformed message), rather than ungracefully slaughter you.
> 
> The dbus *protocol* behaves just like the syscall interface or X 
> protocol or HTTP; it doesn't trust the other side. What we're talking 
> about here is the libdbus *library* which is in your address space and 
> all your usage of it is trusted and known to you.

Right, so we just need someone now to rewrite libdbus, with more sound
design principles.

> The GLib and GTK+ policies are exactly the same as libdbus, except they 
> only warn by default. Recent devel releases of GNOME have gone to fatal 
> by default, however.
> 
> I prefer fatal by default because it keeps people from being confused 
> about what "undefined" means. i.e. they seem to think it means 
> "magically read my mind and figure out how to work around the bugs in my 
> app" instead of "what you requested makes no sense, so nothing sensible 
> has occurred, and your program still thinks whatever you intended *did* 
> occur, so will continue on doing more things that make no sense, and 
> it's impossible for either the library or your app to know what the 
> results will be"

Again, I don't want libdbus to continue.  Continuing with
dbus_message_send on a NULL message obviously makes no sense.  Or
whatever convoluted checks libdbus has.  But the way to deal with this,
if you're allegedly a low-level library, is to tell the app what it's
doing makes no sense.

strcpy() gets away with this because it's a brutally simple API, and
nothing could possibly go wrong.  It's also generally implemented in
asm, which should also tell you something.

libdbus is many things, but 'brutally simple' is not among them.

> If you still don't want to treat these check failures as serious bugs 
> and possible crashes, you are welcome to putenv("DBUS_FATAL_WARNINGS=0") 
> at the start of main(), and take your chances.

That is not what I want at all.  I want to know about any errors.
D_F_W=0 hides all the errors from me (the fact you use 'warnings' and
'fatal' in the same sentence should set off alarm bells, btw), and so
does assert(), but with the added bonus of an IERR later down the track.

I honestly don't understand your aversion to signalling errors to the
app.  You're labouring under some bizzare, misguided, notion that this
will fix all the app bugs in the world.  People who want to write crap
apps will write them, full stop.  They'll do them despite your best
efforts, and there's nothing you can do to stop them, except warn people
against using these horrible apps that cause data loss.

By optimising for this situation, you're penalising otherwise honest
developers who are capable of writing good apps, but make typos every
now and then, or manage to get caught in horrible race conditions.  I'd
rather my APIs were optimised for people who can get out of bed without
injuring themselves, and people who have no idea about development can
just suffer anyway, because either they'll avoid D-Bus, or find new and
creative ways to make mistakes.

Cheers,
Daniel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://lists.freedesktop.org/archives/dbus/attachments/20061114/818adb75/attachment.pgp