What is the policy on out-of-memory?

Ryan Saunders jediry01 at yahoo.com
Mon Jan 12 19:27:05 PST 2009


Doh! Yahoo strikes again, and I sent from the wrong address. Let's try that again.



----- Original Message ----

> Exactly, the out-of-memory should be propagated to the app. Most apps
> just ignore it, of course. But if there's a retry it happens at the
> app level.

Interesting. Not exactly straightforward, though, since developers often fail to think thru all the possible ways a thing could fail. The auth hang, for example happens because the function _dbus_transport_get_is_authenticated calls a whole bunch of auth code, which may fail, but does not propagate OOM. Thus, the caller (do_authentication) interprets the result FALSE as "not authenticated", and then proceeds to call "exchange_credentials". But in an OOM situation, it's really easy for the auth stuff to internally be in the wrong state at that point, because it's difficult to ensure (and validate!) that the auth is actually rolled back to a consistent state when OOM happens. I suspect that the transport layer code may have similar issues lurking (e.g., socket opened successfully, some data is read from it and processed, then OOM...how do we guarantee that the transport code is in a consistent state? Do we "un-read" the data back into the buffer? Do we
shut down the transport entirely?).

If we're to have any hope of doing OOM correctly, we really need all potentially failing work to be done "off to the side" (including auth state transitions, opening connections, deleting data from I/O buffers, etc.), and then to be committed into the "real" data structures via atomic, no-fail "commit" operations, or else rolled back, in the case of OOM. In exception-handling parlance, we need the "strong exception guarantee". Anything less, and we're essentially just crossing our fingers and hoping to get lucky, and we're probably as likely as not to hang the app, the bus, or both, or to crash because of incompletely modified internal state..

These sorts of cleanup guarantees are the sort of thing for which C++ destructors are really ideal: the RAII paradigm (or at least RRID), and stack-based transaction objects can make it realistic to write this sort of code. Have you considered using C++, or perhaps some subset thereof under the hood? You could remain compatible with the current no-throw public API by catching the OOM exception just inside of the DBus API boundary, returning an OOM error code instead. IMO, this would make the code much easier to write, and much, much more verifiable. Of course, strictly speaking, it's possibleto perform all the on-error cleanup correctly in the face of an OOM "exception" using only plain ol' C, but the thought of it makes my head hurt. Having the compiler create all that cleanup code automatically is soooooo much more attractive, since the compiler is much less likely to mess up than you or I are.

If DBus really must be C only, then it really needs some sort of brute-force OOM simulator unit test. The way I envision this working is that the unit test would hook dbus_malloc to simulate memory allocation failures, and then would try repeatedly to execute the high-level operation whose OOM robustness we want to validate. The first time the operation is executed, dbus_malloc will fail on the first alloc, and then the DBusConnection and friends are sanity checked for consistent state. Next, the operation will be attempted a second time, and dbus_malloc would fail on the second call. Then the third, the fourth, etc., until the operation succeeds, or until internal state corruption is detected. I'll bet you'll find a lot of issues that way. I'm not really signing myself up to do this work, but I want to be sure that y'all are aware of the issue. IMO, the current code is only pretending to handle OOM, rather than robustly, truly doing so. At best, DBus is
probably unreliable in OOM cases...at worst, it's a security breach waiting to happen.

At any rate, I think I will leave the auth hang alone for now, and just put in some comments describing the nature of the problem. Somebody can come along and try to fix the OOM propagation problem in some other checkin...for the time being, I'll just try to avoid introducing any new "inconsistent state" issues with my libsasl patch (almost done).

R


      


More information about the dbus mailing list