[systemd-devel] Solution proposal for bug 56109
Michael Olbrich
m.olbrich at pengutronix.de
Sun May 19 03:30:56 PDT 2013
Hi,
On Thu, May 16, 2013 at 06:23:15PM +0200, markohoyer at freenet.de wrote:
> I recently stumbled over the bug with the watchdog mechanism that has already been reported to free desktop bugzilla (56109).
>
> I analyzed the bug and came to a simple solution for solving it.
>
> First, what I think is going on:
> - watchdog timeout is detected in service_handle_watchdog(), service_enter_dead(…) is called
> - service_enter_dead() sets the service state to auto_restart
> - triggered by a timer, service_enter_restart is called
> - service_enter_restart schedules a restart job
> - systemd splits up the jobs into a stop and a start job and schedules both
> - the stop job lasts to a call of service_stop()
> - here it begins to get interesting:
> - based on the AUTO_RESTART state, this function decides to go directly into dead state, nothing of the normal stopping procedure is done. This is probably because in most cases that cause a restart to be scheduled the stop proceeding is done automatically (for instance in case of a killed or normally exiting service.). But this is not true for a watchdog timeout. Nothing of the stop proceeding is executed in case of such a timeout. So the process that missed to send the watchdog event is going on to life (in which state ever). No one is cleaning up. A second instance of the service is started.
I'm quite sure this worked correctly at some point, when I first
implemented it. But I can reproduce the Problem here.
> My suggestion to solve this:
>
> Changes are needed in service.c in service_stop(…).
>
> change:
> /* A restart will be scheduled or is in progress. */
> if (s->state == SERVICE_AUTO_RESTART) {
> service_set_state(s, SERVICE_DEAD);
> return 0;
> }
>
> to:
> /* A restart will be scheduled or is in progress.
> In all cases but the watchdog timeout, stop is already progressed by systemd automatically*/
> if (s->state == SERVICE_AUTO_RESTART && s->result != SERVICE_FAILURE_WATCHDOG) {
> service_set_state(s, SERVICE_DEAD);
> return 0;
> }
>
> and change:
>
> assert(s->state == SERVICE_RUNNING ||
> s->state == SERVICE_EXITED);
>
>
> to:
> assert(s->state == SERVICE_RUNNING ||
> s->state == SERVICE_AUTO_RESTART ||
> s->state == SERVICE_EXITED);
>
> I tested the following:
> - the watchdog mechanism is now actually stopping / killing the service in case it is not sending the watchdog event right in time
> - a restart triggered by a killed service works like before
>
> Hopefully, I didn’t miss some side effects caused by my changes.
>
>
> Any opinions on my proposed changes?
No, this is wrong. The change should happen in service_handle_watchdog().
Calling service_enter_dead() is wrong. Calling service_enter_stop() there
would probably create more or less what your change does.
But I think this is still wrong. When this code is executed then the
service is in an undefined state. Calling the normal shutdown mechanism can
imho actually be harmful. I think just killing the process here is correct.
I've sent a patch, that does this.
Btw, I've tested both versions with a worst-case failure (a process that
cannot be killed): With your code it waits 4x TimeoutStopSec before the
service is started again. With my patch it's 1x TimeoutStopSec.
Regards,
Michael
--
Pengutronix e.K. | |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
More information about the systemd-devel
mailing list