[systemd-devel] Tracing source of "Message has the same reply serial as a currently-outstanding existing method call" with sd-bus based client
John Ernberg
john.ernberg at actia.se
Wed Aug 11 07:04:56 UTC 2021
Hi,
(systemd v243.2 - but did not see any changes to sd-bus between 243 and
HEAD that could affect this)
I have developed a D-Bus service based on sd-bus that is a client to 2
services. When service A sends a signal, service B writes a journald
config file and then restarts journald using the systemd1 Manager D-Bus
API RestartUnit method.
Both services share bus connection as they both are on the system bus.
On rare occasions this results in the callback getting the following
D-Bus error back instead of restarting journald:
Message has the same reply serial as a currently-outstanding existing
method call
The service doing all this is not threaded.
I'm struggling to determine the source of this problem, which may very
well be in my code in the end, but I have no idea of what to look for in
terms of possibly incorrect usage of sd-bus or in case of a bug in
sd-bus, what values to check (and when) to find the root cause, and I'm
seeking advice with this part.
Below follows a description of the client implementation, how it's
called, and some things I found in a coredump that I thought was curious.
Please advice on how I shall move forward with my investigation of this
problem. If more information is necessary, please let me know which
information and I will do my best to provide it.
Thank you very much in advance.
Best regards // John Ernberg
The client code for restarting journald looks like this:
void org_freedesktop_systemd1_Manager_RestartUnit_callback(const char
*error, const char* job) {
if (error) {
fprintf(stderr, "systemd1 could not restart the unit: %s\n",
error);
} else {
printf("systemd1 has queued restarting the unit with this job
path: %s\n", job);
}
}
static int RestartUnit_internal_cbk(sd_bus_message *msg, void *userdata,
sd_bus_error *error) {
int ret = 1;
const sd_bus_error *e = sd_bus_message_get_error(msg);
const char* job = NULL;
if (sd_bus_error_is_set(e)) {
org_freedesktop_systemd1_Manager_RestartUnit_callback(e->message, job);
goto cleanup;
}
ret = sd_bus_message_read(msg, "o", &job);
if (ret < 0) {
goto cleanup;
}
org_freedesktop_systemd1_Manager_RestartUnit_callback(NULL, job);
cleanup:
return ret;
}
int org_freedesktop_systemd1_Manager_RestartUnit(const char* name, const
char* mode)
{
int ret = 0;
sd_bus_message *msg = NULL;
ret = sd_bus_message_new_method_call(bus,
&msg,
"org.freedesktop.systemd1",
"/org/freedesktop/systemd1",
"org.freedesktop.systemd1.Manager",
"RestartUnit");
if (ret < 0) {
return ret;
}
ret = sd_bus_message_append(msg, "s", name);
if (ret < 0) {
goto cleanup;
}
ret = sd_bus_message_append(msg, "s", mode);
if (ret < 0) {
goto cleanup;
}
ret = sd_bus_call_async(bus, NULL, msg, &RestartUnit_internal_cbk,
NULL, 0);
if (ret < 0) {
goto cleanup;
}
cleanup:
sd_bus_message_unref(msg);
return ret;
}
The org_freedesktop_systemd1_Manager_RestartUnit function is called from
the callback of the signal from service A.
I added an abort() to the RestartUnit callback if there was an error,
and managed to catch a core of that state.
The only thing I managed to conclude of that so far is that there was 0
entries in the reply_callbacks list at the time of invoking the
RestartUnit callback, meaning there was only 1 entry in the hashmap and
there were 2 slots allocated to RestartUnit_internal_cbk.
From gdb in this coredump:
(gdb) print *bus->slots
$48 = {n_ref = 1, type = BUS_REPLY_CALLBACK, floating = true,
match_added = false, bus = 0xaaab1fc647d0, userdata = 0x0,
destroy_callback = 0x0, description = 0x0, slots_next = 0xaaab1fc6bcf0,
slots_prev = 0x0, {reply_callback = {
callback = 0xaaaae64d8320 <RestartUnit_internal_cbk>,
timeout_usec = 52303930, cookie = 6, prioq_idx = 0}, filter_callback =
{callback = 0xaaaae64d8320 <RestartUnit_internal_cbk>, last_iteration =
52303930, callbacks_next = 0x6,
callbacks_prev = 0x0}, match_callback = {callback =
0xaaaae64d8320 <RestartUnit_internal_cbk>, install_callback = 0x31e183a,
install_slot = 0x6, last_iteration = 0, after = 4673, match_string =
0xffffa0000080 "\360\066",
match_node = 0xffffa0000080}, node_callback = {node =
0xaaaae64d8320 <RestartUnit_internal_cbk>, is_fallback = false,
last_iteration = 0, callback = 0x6, callbacks_next = 0x0, callbacks_prev
= 0x1241}, node_enumerator = {
node = 0xaaaae64d8320 <RestartUnit_internal_cbk>, callback =
0x31e183a, last_iteration = 6, enumerators_next = 0x0, enumerators_prev
= 0x1241}, node_object_manager = {node = 0xaaaae64d8320
<RestartUnit_internal_cbk>,
object_managers_next = 0x31e183a, object_managers_prev = 0x6},
node_vtable = {node = 0xaaaae64d8320 <RestartUnit_internal_cbk>,
is_fallback = false, last_iteration = 0, interface = 0x6 <error: Cannot
access memory at address 0x6>,
vtable = 0x0, find = 0x1241, vtables_next = 0xffffa0000080,
vtables_prev = 0xffffa0000080}}}
(gdb) print *bus->slots->slots_next
$49 = {n_ref = 2, type = BUS_REPLY_CALLBACK, floating = true,
match_added = false, bus = 0xaaab1fc647d0, userdata = 0x0,
destroy_callback = 0x0, description = 0x0, slots_next = 0xaaab1fc68aa0,
slots_prev = 0xffffa00021f0, {
reply_callback = {callback = 0xaaaae64d8320
<RestartUnit_internal_cbk>, timeout_usec = 0, cookie = 0, prioq_idx =
0}, filter_callback = {callback = 0xaaaae64d8320
<RestartUnit_internal_cbk>, last_iteration = 0, callbacks_next = 0x0,
callbacks_prev = 0x0}, match_callback = {callback =
0xaaaae64d8320 <RestartUnit_internal_cbk>, install_callback = 0x0,
install_slot = 0x0, last_iteration = 0, after = 881,
match_string = 0x1 <error: Cannot access memory at address 0x1>,
match_node = 0xaaab1fc647d0}, node_callback = {node = 0xaaaae64d8320
<RestartUnit_internal_cbk>, is_fallback = false, last_iteration = 0,
callback = 0x0,
callbacks_next = 0x0, callbacks_prev = 0x371}, node_enumerator =
{node = 0xaaaae64d8320 <RestartUnit_internal_cbk>, callback = 0x0,
last_iteration = 0, enumerators_next = 0x0, enumerators_prev = 0x371},
node_object_manager = {
node = 0xaaaae64d8320 <RestartUnit_internal_cbk>,
object_managers_next = 0x0, object_managers_prev = 0x0}, node_vtable =
{node = 0xaaaae64d8320 <RestartUnit_internal_cbk>, is_fallback = false,
last_iteration = 0, interface = 0x0,
vtable = 0x0, find = 0x371, vtables_next = 0x1, vtables_prev =
0xaaab1fc647d0}}}
Where-as if I stop service B in the callback using a breakpoint under
normal circumstances there is only 1 entry in the slots for that
callback. Which looks like the second entry.
More information about the systemd-devel
mailing list