[systemd-devel] Solution proposal for bug 56109

Sun May 19 03:30:56 PDT 2013

Hi,

On Thu, May 16, 2013 at 06:23:15PM +0200, markohoyer at freenet.de wrote:
> I recently stumbled over the bug with the watchdog mechanism that has already been reported to free desktop bugzilla (56109).
>  
> I analyzed the bug and came to a simple solution for solving it.
>  
> First, what I think is going on:
> -        watchdog timeout is detected in service_handle_watchdog(), service_enter_dead(…) is called
> -        service_enter_dead() sets the service state to auto_restart
> -        triggered by a timer, service_enter_restart is called
> -        service_enter_restart  schedules a restart job
> -        systemd splits up the jobs into a stop and a start job and schedules both
> -        the stop job lasts to a call of service_stop()
> -        here it begins to get interesting:
> -        based on the AUTO_RESTART state, this function decides to go directly into dead state, nothing of the normal stopping procedure is done. This is probably because in most cases that cause a restart to be scheduled the stop proceeding is done automatically (for instance in case of a killed or normally exiting service.). But this is not true for a watchdog timeout. Nothing of the stop proceeding is executed in case of such a timeout. So the process that missed to send the watchdog event is going on to life (in which state ever). No one is cleaning up. A second instance of the service is started.

I'm quite sure this worked correctly at some point, when I first
implemented it. But I can reproduce the Problem here.

> My suggestion to solve this:
>  
> Changes are needed in service.c in service_stop(…).
>  
> change:
> /* A restart will be scheduled or is in progress. */
>         if (s->state == SERVICE_AUTO_RESTART) {
>                 service_set_state(s, SERVICE_DEAD);
>                 return 0;
>         }
>  
> to:
> /* A restart will be scheduled or is in progress. 
>            In all cases but the watchdog timeout, stop is already progressed by systemd automatically*/
>         if (s->state == SERVICE_AUTO_RESTART && s->result != SERVICE_FAILURE_WATCHDOG) {
>                 service_set_state(s, SERVICE_DEAD);
>                 return 0;
>         }
>  
> and change:
>  
> assert(s->state == SERVICE_RUNNING ||
>              s->state == SERVICE_EXITED);
>  
>  
> to:
> assert(s->state == SERVICE_RUNNING ||
>                s->state == SERVICE_AUTO_RESTART ||
>                s->state == SERVICE_EXITED);
>  
> I tested the following:
> -        the watchdog mechanism is now actually stopping / killing the service in case it is not sending the watchdog event right in time
> -        a restart triggered by a killed service works like before
>  
> Hopefully, I didn’t miss some side effects caused by my changes.
>  
>  
> Any opinions on my proposed changes?

No, this is wrong. The change should happen in service_handle_watchdog().
Calling service_enter_dead() is wrong. Calling service_enter_stop() there
would probably create more or less what your change does.
But I think this is still wrong. When this code is executed then the
service is in an undefined state. Calling the normal shutdown mechanism can
imho actually be harmful. I think just killing the process here is correct.
I've sent a patch, that does this.

Btw, I've tested both versions with a worst-case failure (a process that
cannot be killed): With your code it waits 4x TimeoutStopSec before the
service is started again. With my patch it's 1x TimeoutStopSec.

Regards,
Michael

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |