[systemd-devel] systemd "hangs" for several minutes on shutdown if it gets "service force-reload" request

Fri Jan 28 23:33:52 PST 2011

On Thu, Jan 27, 2011 at 11:50 PM, Andrey Borzenkov <arvidjaar at gmail.com> wrote:
> On Fri, Jan 7, 2011 at 5:33 AM, Lennart Poettering
> <lennart at poettering.net> wrote:
>> On Wed, 01.12.10 22:55, Andrey Borzenkov (arvidjaar at gmail.com) wrote:
>>
>>> In Mandriva we are using {ifup,ifdown}.d scripts for callouts. One
>>> package installs a script that - in both cases - does "service vnstat
>>> force-reload". During shutdown this causes an interesting effect: the
>>> request hangs, which causes the network service (which indirectly
>>> calls it) to hang as well and finally be killed:
>>
>> [...]
>>
>>> What I wonder in this case, why vnstat.service/reload appears to hang
>>> in this case? The job itself fails during startup (initscript is
>>> disabled, so it gets indirect request from ifup and does nothing):
>>
>> Does this problem still exist?
>>
>
> Well, I just had shutdown stuck for 3 minutes using v17. It is hard to
> tell whether this is the same case - either systemd does not report
> timeout without log_level-debug or it scrolls up too fast to notice.
>
> The question still remains - why systemd hangs in this case?
>
>> Most likely this is simply an ordering deadlock: systemd executes
>> something that asks systemd to execute something else, which however
>> is ordered after the first unit.
>
> vnstat itself does not have any explicit dependencies. So in this case
> this was caused by implicit dependencies added by systemd ....
>

Yes. As in https://bugs.freedesktop.org/show_bug.cgi?id=33421, the cause
is the implicit dependencies that are added to every unit.

Every service depends on basic.target, and basic.target is stopped as
part of the shutdown sequence. Under current systemd rules, when two
units have any after/before dependency, a start request always waits
for a pending stop request. So in this case the start request waits
until basic.target is stopped ... which effectively means the system is
shut down and no start is needed anymore :)
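
To make the rule above concrete, here is a toy model in Python. It is
only a sketch of the behaviour as described in this mail, not systemd's
actual job code; the Job class, the ORDERING table and blocked_by() are
hypothetical names invented for the illustration.

```python
from dataclasses import dataclass

# Toy model only; this is NOT systemd code. All names are hypothetical.


@dataclass(frozen=True)
class Job:
    unit: str
    kind: str  # "start" or "stop"


# ORDERING[x] is the set of units that x is ordered after.
# Assumption: vnstat.service only has the implicit ordering on basic.target.
ORDERING = {"vnstat.service": {"basic.target"}}


def ordered(a, b):
    """True if a and b have an after/before relation in either direction."""
    return b in ORDERING.get(a, set()) or a in ORDERING.get(b, set())


def blocked_by(job, queue):
    """Return the pending stop jobs that a new start job has to wait for."""
    if job.kind != "start":
        return []
    return [other for other in queue
            if other.kind == "stop" and ordered(job.unit, other.unit)]


# Shutdown has already queued the stop of basic.target ...
queue = [Job("basic.target", "stop")]
# ... and now a callout script asks for vnstat.service to be started.
new_job = Job("vnstat.service", "start")
waiting_on = blocked_by(new_job, queue)
print(waiting_on)
```

In this model the new start job is held back by the stop job of
basic.target, which only completes once shutdown is nearly done.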

>> That is not really fixable. At least I
>> have no idea what we could do about this.
>>

I do not have an answer to suggest; I am just trying to put together
what I have seen so far.

1. Real-life systemd deployments desperately need adequate diagnostic
means. Today no indication of a deadlock is given even in debug output,
it is not possible to see the relative job order, and it is not possible
to simulate the shutdown sequence. All of this makes debugging such
cases harder than it could be.

2. What is the reason for "start foo" to wait for "stop bar" when foo
is ordered after bar? It appears to be a nice programming trick to
auto-order jobs in the restart case, but it seems to cause issues
otherwise. Maybe this condition can be relaxed.

3. Extend the transaction definition to include the "after state",
i.e. if a transaction results in unit foo being stopped (implicitly or
explicitly), reject any attempt to start foo until the transaction has
successfully finished.

4. Apply 3 as a special case, for shutdown processing only.

5. Full-featured deadlock detection. I am not sure this is always
possible - can we always determine who initiated a transaction?
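
As a rough illustration of what point 5 could mean, the sketch below
looks for a cycle in a "wait-for" graph of jobs. It is a generic
depth-first search, not anything systemd implements. The edges are an
assumption reconstructed from this thread: the network stop job blocks
on the reload request issued from its ifdown callout, that request
waits for basic.target to stop, and basic.target cannot stop before the
network service does. The first edge is exactly the "who initiated the
transaction" information that may not always be available.

```python
# Generic cycle detection on a wait-for graph; NOT systemd code.
# find_cycle() and the job names in waits_for are hypothetical.


def find_cycle(waits_for):
    """Return one cycle as a list of nodes (first == last), or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {}
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for nxt in waits_for.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: that closes a cycle.
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for node in list(waits_for):
        if colour.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Assumed wait-for edges for the shutdown case discussed in this thread.
waits_for = {
    "network.service/stop": ["vnstat.service/reload"],
    "vnstat.service/reload": ["basic.target/stop"],
    "basic.target/stop": ["network.service/stop"],
}
cycle = find_cycle(waits_for)
print(" -> ".join(cycle) if cycle else "no deadlock")
```

If the initiator edge cannot be determined, the graph has no cycle and
a check like this reports nothing, which is the limitation raised in
point 5.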