What is the policy on out-of-memory?

Mon Jan 12 19:43:58 PST 2009

Hi,

On Mon, Jan 12, 2009 at 9:31 PM, Ryan Saunders
<saunders at aggienetwork.com> wrote:
> The auth hang, for example happens because the function _dbus_transport_get_is_authenticated
> calls a whole bunch of auth code, which may fail, but does not propagate OOM. Thus, the caller (do_authentication)

The OOM is propagated through _dbus_auth_do_work which can return an
OOM state. exchange_credentials() can also return FALSE on OOM. So it
may just be a matter of bumping the OOM through get_is_authenticated
and up to avoid the exchange_credentials(). (Run the "exchange
credentials returned FALSE" codepath if get_is_authenticated gets OOM,
in essence.) Something along those lines. Maybe just add a
get_is_authenticated variant and keep using the current one elsewhere,
if that works.

I'm not sure that's the best solution, it sort of depends on why the
new auth backend ends up fighting the way things work right now, there
may be some cleaner way to rearrange the code.

> how do we guarantee that the transport code is in a consistent state? Do we "un-read" the data back into the buffer? Do we
>  shut down the transport entirely?).

The approach in the bus daemon for oom in most cases is to report an
error for the current message and roll back all handling of that
message. There are few allocations that aren't in the context of
processing a particular message.

For auth failure, I think it's a fine OOM response to simply fail to
complete the connection and drop the connection.

In apps, handling OOM is rarely done and not really useful anyway. In
the bus daemon, just refusing to establish the app connection should
be fine. The main reason the daemon handles OOM is that the daemon
isn't allowed to crash and clients could maliciously try to make it
OOM to cause it to crash.

I should point out, the OOM handling in libdbus is primarily for the
benefit of dbus-daemon. I don't know of a single app that's handling
OOM properly even in theory; there may be some, but I'm skeptical they
work in practice, at least for GUI apps. Most apps use dbus-glib and
dbus-glib just applies glib's policy of exiting with error on most OOM
situations.

> If we're to have any hope of doing OOM correctly, we really need all potentially failing work to be done "off to the side" (including auth state
> transitions, opening connections, deleting data from I/O buffers, etc.), and then to be committed into the "real" data structures via atomic, no-fail
> "commit" operations, or else rolled back, in the case of OOM.

Correct, this is what dbus-daemon does for message handling for
example. In the bus/ subdir, look for the Transaction object. After
OOM failure things have to be left in a consistent state. It isn't
always possible to leave them in the same state as before, and
depending on the situation, that can be fine. But a consistent state,
yes.

> These sorts of cleanup guarantees are the sort of thing for which C++ destructors are really ideal: the RAII paradigm (or at least RRID), and
> stack-based transaction objects can make it realistic to write this sort of code. Have you considered using C++, or perhaps some subset thereof
> under the hood?

I don't personally find C++ helpful, ultimately the only way to make
OOM work is full test coverage. Aside from that though, dbus is
committed to avoiding "controversial" dependencies because it's such a
low-level component and has a lot of people using it with different
needs.

> If DBus really must be C only, then it really needs some sort of brute-force OOM simulator unit test.
>... I'll bet you'll find a lot of issues that way.

In fact the dbus "make check" does exactly as you describe - great
minds? It did find a lot of issues, but also gives some confidence
that things mostly work as they are now.
http://log.ometer.com/2008-02.html#4.2

New code should have unit tests using the same mechanism, if possible.
It may be challenging to unit test the auth stuff if it depends on
Kerberos being configured or something, but perhaps there's some way
to encapsulate and then mock the external dependencies, I don't know.
At the least, you could probably add a test-only bogus auth mechanism
that had possible OOM failures in new places.

You may well find that if you make a function now able to fail due to
OOM, that could not before, some significant refactoring can suddenly
become necessary as you clean up all the calling code.

To work on refactoring the current code, it may be sufficient to just
put a gratuitous dbus_malloc in one of the existing auth mechanisms,
which should cause make check to then fail that malloc at some point
during the test suite. I'm pretty sure the auth code is covered by the
OOM torture test.

> IMO, the current code is only pretending to handle OOM, rather than robustly, truly doing so. At best, DBus is
>  probably unreliable in OOM cases...at worst, it's a security breach waiting to happen.

While there are doubtless bugs, there has been as high as 70% of basic
blocks test coverage, including most of the allocation failure points.
(I haven't checked coverage lately, it's probably down a bit, but
still good.) 70% is very high ... most code coverage percentages one
sees are for functions or something, not basic blocks.

So, I do have a lot of confidence that most of the code is fine on the
OOM front. But I'm sure there are bugs, and when found, they just need
fixing.

Havoc