[systemd-devel] Tracing source of "Message has the same reply serial as a currently-outstanding existing method call" with sd-bus based client

John Ernberg john.ernberg at actia.se
Wed Aug 11 07:04:56 UTC 2021


Hi,

(systemd v243.2 - but did not see any changes to sd-bus between 243 and 
HEAD that could affect this)

I have developed a D-Bus service based on sd-bus that is a client to 2 
services. When service A sends a signal, service B writes a journald 
config file and then restarts journald using the systemd1 Manager D-Bus 
API RestartUnit method.

Both services share bus connection as they both are on the system bus.

On rare occasions this results in the callback getting the following 
D-Bus error back instead of restarting journald:
Message has the same reply serial as a currently-outstanding existing 
method call

The service doing all this is not threaded.

I'm struggling to determine the source of this problem, which may very 
well be in my code in the end, but I have no idea of what to look for in 
terms of possibly incorrect usage of sd-bus or in case of a bug in 
sd-bus, what values to check (and when) to find the root cause, and I'm 
seeking advice with this part.

Below follows a description of the client implementation, how it's 
called, and some things I found in a coredump that I thought was curious.

Please advice on how I shall move forward with my investigation of this 
problem. If more information is necessary, please let me know which 
information and I will do my best to provide it.

Thank you very much in advance.

Best regards // John Ernberg




The client code for restarting journald looks like this:

void org_freedesktop_systemd1_Manager_RestartUnit_callback(const char 
*error, const char* job) {
     if (error) {
         fprintf(stderr, "systemd1 could not restart the unit: %s\n", 
error);
     } else {
         printf("systemd1 has queued restarting the unit with this job 
path: %s\n", job);
     }
}

static int RestartUnit_internal_cbk(sd_bus_message *msg, void *userdata, 
                                    sd_bus_error *error) {
     int ret = 1;
     const sd_bus_error *e = sd_bus_message_get_error(msg);
     const char* job = NULL;

     if (sd_bus_error_is_set(e)) {

org_freedesktop_systemd1_Manager_RestartUnit_callback(e->message, job);
         goto cleanup;
     }

     ret = sd_bus_message_read(msg, "o", &job);
     if (ret < 0) {
         goto cleanup;
     }

     org_freedesktop_systemd1_Manager_RestartUnit_callback(NULL, job);

cleanup:

     return ret;
}

int org_freedesktop_systemd1_Manager_RestartUnit(const char* name, const 
char* mode)
{
     int ret = 0;
     sd_bus_message *msg = NULL;

     ret = sd_bus_message_new_method_call(bus,
                                          &msg,
                                          "org.freedesktop.systemd1",
                                          "/org/freedesktop/systemd1",

"org.freedesktop.systemd1.Manager",
                                          "RestartUnit");
     if (ret < 0) {
         return ret;
     }


     ret = sd_bus_message_append(msg, "s", name);
     if (ret < 0) {
         goto cleanup;
     }

     ret = sd_bus_message_append(msg, "s", mode);
     if (ret < 0) {
         goto cleanup;
     }


     ret = sd_bus_call_async(bus, NULL, msg, &RestartUnit_internal_cbk,
                             NULL, 0);
     if (ret < 0) {
         goto cleanup;
     }

cleanup:
     sd_bus_message_unref(msg);
     return ret;
}

The org_freedesktop_systemd1_Manager_RestartUnit function is called from 
the callback of the signal from service A.

I added an abort() to the RestartUnit callback if there was an error, 
and managed to catch a core of that state.

The only thing I managed to conclude of that so far is that there was 0 
entries in the reply_callbacks list at the time of invoking the 
RestartUnit callback, meaning there was only 1 entry in the hashmap and 
there were 2 slots allocated to RestartUnit_internal_cbk.

 From gdb in this coredump:

(gdb) print *bus->slots
$48 = {n_ref = 1, type = BUS_REPLY_CALLBACK, floating = true, 
match_added = false, bus = 0xaaab1fc647d0, userdata = 0x0, 
destroy_callback = 0x0, description = 0x0, slots_next = 0xaaab1fc6bcf0, 
slots_prev = 0x0, {reply_callback = {
       callback = 0xaaaae64d8320 <RestartUnit_internal_cbk>, 
timeout_usec = 52303930, cookie = 6, prioq_idx = 0}, filter_callback = 
{callback = 0xaaaae64d8320 <RestartUnit_internal_cbk>, last_iteration = 
52303930, callbacks_next = 0x6,
       callbacks_prev = 0x0}, match_callback = {callback = 
0xaaaae64d8320 <RestartUnit_internal_cbk>, install_callback = 0x31e183a, 
install_slot = 0x6, last_iteration = 0, after = 4673, match_string = 
0xffffa0000080 "\360\066",
       match_node = 0xffffa0000080}, node_callback = {node = 
0xaaaae64d8320 <RestartUnit_internal_cbk>, is_fallback = false, 
last_iteration = 0, callback = 0x6, callbacks_next = 0x0, callbacks_prev 
= 0x1241}, node_enumerator = {
       node = 0xaaaae64d8320 <RestartUnit_internal_cbk>, callback = 
0x31e183a, last_iteration = 6, enumerators_next = 0x0, enumerators_prev 
= 0x1241}, node_object_manager = {node = 0xaaaae64d8320 
<RestartUnit_internal_cbk>,
       object_managers_next = 0x31e183a, object_managers_prev = 0x6}, 
node_vtable = {node = 0xaaaae64d8320 <RestartUnit_internal_cbk>, 
is_fallback = false, last_iteration = 0, interface = 0x6 <error: Cannot 
access memory at address 0x6>,
       vtable = 0x0, find = 0x1241, vtables_next = 0xffffa0000080, 
vtables_prev = 0xffffa0000080}}}
(gdb) print *bus->slots->slots_next
$49 = {n_ref = 2, type = BUS_REPLY_CALLBACK, floating = true, 
match_added = false, bus = 0xaaab1fc647d0, userdata = 0x0, 
destroy_callback = 0x0, description = 0x0, slots_next = 0xaaab1fc68aa0, 
slots_prev = 0xffffa00021f0, {
     reply_callback = {callback = 0xaaaae64d8320 
<RestartUnit_internal_cbk>, timeout_usec = 0, cookie = 0, prioq_idx = 
0}, filter_callback = {callback = 0xaaaae64d8320 
<RestartUnit_internal_cbk>, last_iteration = 0, callbacks_next = 0x0,
       callbacks_prev = 0x0}, match_callback = {callback = 
0xaaaae64d8320 <RestartUnit_internal_cbk>, install_callback = 0x0, 
install_slot = 0x0, last_iteration = 0, after = 881,
       match_string = 0x1 <error: Cannot access memory at address 0x1>, 
match_node = 0xaaab1fc647d0}, node_callback = {node = 0xaaaae64d8320 
<RestartUnit_internal_cbk>, is_fallback = false, last_iteration = 0, 
callback = 0x0,
       callbacks_next = 0x0, callbacks_prev = 0x371}, node_enumerator = 
{node = 0xaaaae64d8320 <RestartUnit_internal_cbk>, callback = 0x0, 
last_iteration = 0, enumerators_next = 0x0, enumerators_prev = 0x371}, 
node_object_manager = {
       node = 0xaaaae64d8320 <RestartUnit_internal_cbk>, 
object_managers_next = 0x0, object_managers_prev = 0x0}, node_vtable = 
{node = 0xaaaae64d8320 <RestartUnit_internal_cbk>, is_fallback = false, 
last_iteration = 0, interface = 0x0,
       vtable = 0x0, find = 0x371, vtables_next = 0x1, vtables_prev = 
0xaaab1fc647d0}}}

Where-as if I stop service B in the callback using a breakpoint under 
normal circumstances there is only 1 entry in the slots for that 
callback. Which looks like the second entry.


More information about the systemd-devel mailing list