[systemd-devel] systemd "hangs" for several minutes on shutdown if it gets "service force-reload" request

Sat Jan 29 03:47:49 PST 2011

On 01/29/2011 05:33 AM, Andrey Borzenkov wrote:
> On Thu, Jan 27, 2011 at 11:50 PM, Andrey Borzenkov <arvidjaar at gmail.com> wrote:
>> On Fri, Jan 7, 2011 at 5:33 AM, Lennart Poettering
>> <lennart at poettering.net> wrote:
>>> On Wed, 01.12.10 22:55, Andrey Borzenkov (arvidjaar at gmail.com) wrote:
>>>
>>>> In Mandriva we use {ifup,ifdown}.d scripts for callouts. One
>>>> package installs a script that - in both cases - runs "service vnstat
>>>> force-reload". During shutdown this has an interesting effect: the
>>>> request hangs, which causes the network service (which indirectly
>>>> calls it) to hang as well and finally be killed:
>>>
>>> [...]
>>>
>>>> What I wonder is why vnstat.service/reload appears to hang in this
>>>> case. The job itself fails during startup (the initscript is
>>>> disabled, so it gets an indirect request from ifup and does nothing):
>>>
>>> Does this problem still exist?
>>>
>>
>> Well, I just had shutdown stuck for 3 minutes using v17. It is hard to
>> tell whether this is the same case - either systemd does not report
>> the timeout without log_level=debug or it scrolls by too fast to notice.
>>
>> The question still remains - why does systemd hang in this case?
>>
>>> Most likely this is simply an ordering deadlock: systemd executes
>>> something that asks systemd to execute something else which, however,
>>> is ordered after the first unit.
>>
>> vnstat itself does not have any explicit dependencies. So in this case
>> it was caused by implicit dependencies added by systemd ...
>>
> 
> Yes. Like in https://bugs.freedesktop.org/show_bug.cgi?id=33421 it is
> implicit dependencies that are added to every unit.
> 
> Every service depends on basic.target. basic.target is stopped as part
> of the shutdown sequence. According to current systemd rules, when two
> units have any after/before dependency, a start request always waits
> for a stop request. So in this case the start request waits until
> basic.target is stopped ... which effectively means the system is shut
> down and no start is needed anymore :)
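
To convince myself of the above I wrote down a toy model of it. This is
only a sketch under my own assumptions, not how systemd's job engine is
implemented: the function names and the "script_waits" table are made up,
and the only rules it encodes are the one quoted above (a start waits for
the stop of a unit it is ordered after) plus stops running in reverse
order. The unit names are the ones from this thread.

```python
# Toy model only; build_wait_graph/find_cycle are hypothetical helpers,
# not systemd functions.

def build_wait_graph(jobs, after, script_waits):
    """Return {unit: set of units whose job it is waiting for}."""
    graph = {unit: set() for unit in jobs}
    for unit, kind in jobs.items():
        for earlier in after.get(unit, ()):
            if jobs.get(earlier) != "stop":
                continue
            if kind == "start":
                # Rule from the thread: start waits for the stop.
                graph[unit].add(earlier)
            elif kind == "stop":
                # Reverse order on stop: the earlier unit stops last.
                graph[earlier].add(unit)
    for unit, target in script_waits.items():
        # A unit whose script blocks on a request for another unit.
        graph.setdefault(unit, set()).add(target)
    return graph


def find_cycle(graph):
    """Return one wait cycle as a list of units, or None."""
    state = {}  # unit -> "visiting" or "done"
    path = []

    def visit(unit):
        state[unit] = "visiting"
        path.append(unit)
        for nxt in sorted(graph.get(unit, ())):
            if state.get(nxt) == "visiting":
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[unit] = "done"
        return None

    for unit in sorted(graph):
        if unit not in state:
            found = visit(unit)
            if found:
                return found
    return None


# Shutdown as described in this thread.
jobs = {"basic.target": "stop",
        "network.service": "stop",
        "vnstat.service": "start"}
after = {"network.service": {"basic.target"},
         "vnstat.service": {"basic.target"}}
# ifdown.d callout: network's stop script blocks on the vnstat request.
script_waits = {"network.service": "vnstat.service"}

print(find_cycle(build_wait_graph(jobs, after, script_waits)))
```

In this model basic.target waits for network.service, which waits for
vnstat.service, which waits for basic.target again. Without the callout
entry in script_waits there is no cycle.
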
> 
>>> That is not really fixable. At least I have no idea what we could
>>> do about this.
>>>
> 
> This does not suggest any answer; I am just trying to put together
> what I have seen so far.
> 
> 1. Real-life systemd deployments desperately need adequate diagnostic
> means. Today no indication of a deadlock is given even in debug output,
> it is not possible to see the relative job order, and it is not possible
> to simulate the shutdown sequence. All of this makes debugging such
> cases harder than it could be.
> 
> 2. What is the reason for "start foo" to wait for "stop bar" when foo
> is ordered after bar? It appears to be a nice programming trick to
> auto-order in the restart case, but it seems to cause issues otherwise.
> Maybe this condition can be relaxed.
> 
> 3. Extend the transaction definition to include an "after state". I.e.
> if a transaction results in unit foo being stopped (implicitly or
> explicitly), reject any attempt to start foo until the transaction
> has successfully finished.
> 
> 4. Special-case item 3 for shutdown processing only.
> 
> 5. Full-featured deadlock detection. I am not sure if this is always
> possible - can we always determine who initiated a transaction?

I don't know if it is possible, but the following would be an almost
ideal debugging aid for the cases I am having problems with: something
like a signal which systemd would catch and then dump everything it is
doing right now, or dump its last 5/10/20 performed actions to the console.

The 'mostly impossible' part of it is that when such deadlocks happen,
the system is either starting, or shutting down, so you don't have any
way to send a signal, and magic sysrq key won't help much either.

Maybe add some debug_timer, which would dump this information
automatically if some process/task/job is stuck for more than
debug_timer=N seconds?
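
To show what I mean by such a timer, here is a rough sketch. It is purely
illustrative and not based on any existing systemd code: the JobWatch
class and its methods are hypothetical, and the job names are just
labels. The idea is to remember when each job began and to report those
that have been running longer than debug_timer seconds.

```python
import time


class JobWatch:
    """Hypothetical tracker that remembers when each job began."""

    def __init__(self, debug_timer, clock=time.monotonic):
        self.debug_timer = debug_timer  # threshold in seconds
        self.clock = clock              # injectable for testing
        self.started = {}               # job name -> start time

    def begin(self, job):
        self.started[job] = self.clock()

    def finish(self, job):
        self.started.pop(job, None)

    def stuck(self):
        """Return sorted (job, age) pairs older than debug_timer."""
        now = self.clock()
        return sorted((job, now - t0)
                      for job, t0 in self.started.items()
                      if now - t0 > self.debug_timer)

    def dump(self, out=print):
        """Write one line per stuck job, e.g. to the console."""
        for job, age in self.stuck():
            out(f"job {job} still running after {age:.0f}s")
```

Something would have to call dump() periodically from the main loop, so
that no signal or sysrq is needed while the system is shutting down.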

-- 
Eugeni Dodonov