[systemd-devel] Solution proposal for bug 56109

Thu May 16 09:23:15 PDT 2013

Hello systemd experts and developers,

I recently stumbled over the bug in the watchdog mechanism that has already been reported in the freedesktop.org Bugzilla (bug 56109).

I analyzed the bug and came up with a simple fix for it.

First, what I think is going on:
-        a watchdog timeout is detected in service_handle_watchdog(), and service_enter_dead(…) is called
-        service_enter_dead() sets the service state to auto_restart
-        triggered by a timer, service_enter_restart() is called
-        service_enter_restart() schedules a restart job
-        systemd splits the restart job into a stop job and a start job and schedules both
-        the stop job leads to a call of service_stop()
-        here it gets interesting:
-        based on the AUTO_RESTART state, service_stop() decides to go directly into the dead state, and none of the normal stop procedure is carried out. This is probably because, in most cases that cause a restart to be scheduled, the service has already been stopped (for instance, when it was killed or exited normally). But this is not true for a watchdog timeout: none of the stop procedure is executed in that case. So the process that failed to send the watchdog event stays alive (in whatever state it is in), nobody cleans it up, and a second instance of the service is started.
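
To illustrate the sequence above, here is a small toy model in C. It is not systemd code: all names (ToyService, toy_stop() and so on) are made up for this sketch, and the only assumption it encodes is my reading of the flow described in the list.

```c
#include <assert.h>

/* Toy model only: these names mimic, but are not, systemd's types. */
typedef enum { TOY_RUNNING, TOY_AUTO_RESTART, TOY_DEAD } ToyState;

typedef struct {
        ToyState state;
        int live_processes;   /* processes belonging to the service */
} ToyService;

/* Watchdog timeout: the state changes, but the hung process is not killed. */
static void toy_watchdog_timeout(ToyService *s) {
        assert(s);
        s->state = TOY_AUTO_RESTART;
}

/* Stop job as described above: AUTO_RESTART goes straight to dead. */
static void toy_stop(ToyService *s) {
        assert(s);
        if (s->state == TOY_AUTO_RESTART) {
                s->state = TOY_DEAD;      /* shortcut: nothing is killed */
                return;
        }
        s->live_processes = 0;            /* normal stop procedure */
        s->state = TOY_DEAD;
}

/* Start job: spawns a new main process. */
static void toy_start(ToyService *s) {
        assert(s);
        s->live_processes++;
        s->state = TOY_RUNNING;
}

/* Runs the whole sequence and returns how many processes exist afterwards. */
static int toy_restart_after_watchdog(void) {
        ToyService s = { TOY_RUNNING, 1 };
        toy_watchdog_timeout(&s);
        toy_stop(&s);
        toy_start(&s);
        return s.live_processes;
}
```

In this model, toy_restart_after_watchdog() ends with two processes, the hung one and the new one, which is the symptom I see.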

My suggestion to solve this:

Changes are needed in service.c in service_stop(…).

change:
        /* A restart will be scheduled or is in progress. */
        if (s->state == SERVICE_AUTO_RESTART) {
                service_set_state(s, SERVICE_DEAD);
                return 0;
        }

to:
        /* A restart will be scheduled or is in progress.
         * In all cases except a watchdog timeout, the service has
         * already been stopped at this point. */
        if (s->state == SERVICE_AUTO_RESTART && s->result != SERVICE_FAILURE_WATCHDOG) {
                service_set_state(s, SERVICE_DEAD);
                return 0;
        }

and change:

        assert(s->state == SERVICE_RUNNING ||
               s->state == SERVICE_EXITED);

to:
        assert(s->state == SERVICE_RUNNING ||
               s->state == SERVICE_AUTO_RESTART ||
               s->state == SERVICE_EXITED);
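
The effect of the two changes can be sketched as a standalone decision function. Again, this is only an illustration under my own assumptions: the Sk* types and sk_stop_runs_full_procedure() are hypothetical stand-ins, not the real definitions from service.c, and the function only covers the three states named in the assertion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the service state and result enums. */
typedef enum { SK_RUNNING, SK_EXITED, SK_AUTO_RESTART } SkState;
typedef enum { SK_SUCCESS, SK_FAILURE_SIGNAL, SK_FAILURE_WATCHDOG } SkResult;

/* Returns true if a stop request would run the full stop procedure
 * with the proposed condition, false if it takes the shortcut to dead. */
static bool sk_stop_runs_full_procedure(SkState state, SkResult result) {
        /* Proposed shortcut: skipped when the restart was caused
         * by a watchdog timeout. */
        if (state == SK_AUTO_RESTART && result != SK_FAILURE_WATCHDOG)
                return false;

        /* Widened assertion: AUTO_RESTART can now reach this point. */
        assert(state == SK_RUNNING ||
               state == SK_AUTO_RESTART ||
               state == SK_EXITED);

        return true;
}
```

With this condition, an AUTO_RESTART service whose result is the watchdog failure goes through the normal stop path, while an AUTO_RESTART service with any other result still takes the shortcut as before.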

I tested the following:
-        the watchdog mechanism now actually stops / kills the service if it does not send the watchdog event in time
-        a restart triggered by a killed service works as before

Hopefully, I didn't miss any side effects caused by my changes.

Any opinions on my proposed changes?

Kind regards,

Marko Hoyer
