[systemd-devel] [RFC v2 3/6] kthread: warn on kill signal if not OOM

Thu Sep 11 14:43:28 PDT 2014

On Wed, Sep 10, 2014 at 11:10 PM, Luis R. Rodriguez
<mcgrof at do-not-panic.com> wrote:
>>> More than two years
>>> have gone by on growing design and assumptions on top of that original
>>> commit. I'm not sure if *systemd folks* yet believe its was a design
>>> regression?
>>
>> I don't think so. udev should not allow its workers to run for an
>> unbounded length of time. Whether the upper bound should be 30, 60,
>> 180 seconds or something else is up for debate (currently it is 60,
>> but if that is too short for some drivers we could certainly revisit
>> that).
>
> That's the thing -- the timeout was put in place under the assumption
> probe was asyncronous and its not, the driver core issues both module
> init *and* probe together, the loader has to wait. That alone makes
> the timeout a design flaw, and then systemd carried on top of that
> design over two years after that. Its not systemd's fault, its just
> that we never spoke about this as a design thing broadly and we should
> have, and I will mention that even when the first issues creeped up,
> the issue was still tossed back a driver problems. It was only until
> recently that we realized that both init and probe run together that
> we've been thinking about this problem differently. Systemd was trying
> to ensure init on driver don't take long but its not init that is
> taking long, its probe, and probe gets then penalized as the driver
> core batches both init and probe synchronously before finishing module
> loading.

Just to clarify: udev/systemd is not trying to get into the business
of what the kernel does on finit_module(), we just need to make sure
that none of our workers stay around forever, which is why we have a
global timeout. If necessary we can bump this higher (as mentioned in
another thread I just bumped it to 180 secs), but we cannot abolish it
entirely.

> Furthermore as clarified by Tejun random userland is known to
> exist that will wait indefinitely for module loading under the simple
> assumption things *are done synchronously*, and its precisely why we
> can't just blindly enable async probe upstream through a new driver
> boolean as it can be unfair to this old userland. What is being
> evaluated is to enable aync probe for *all* drivers through a new
> general system-wide option. We cannot regress old userspace and
> assumptions but we can create a new shiny universe.

How about simply introducing a new flag to finit_module() to indicate
that the caller does not care about asynchronicity. We could then pass
this from udev, but existing scripts calling modprobe/insmod will not
be affected.

>> Moreover, it seems from this discussion that the aim is (still)
>> that insmod should be near-instantaneous (i.e., not wait for probe),
>
> The only reason that is being discussed is that systemd has not
> accepted the timeout as a system design despite me pointing out the
> original design flaw recently and at this point even if was accepted
> as a design flaw it seems its too late. The approach taken to help
> make all drivers async is simply an afterthought to give systemd what
> it *thought* was in place, and it by no measure should be considered
> the proper fix to the regression introduced by systemd, it may perhaps
> be the right step long term for systemd systems given it goes with
> what it assumed was there, but the timeout was flawed. Its not clear
> if systemd can help with old kernels, it seems the ship has sailed and
> there seems no options but for folks to work around that -- unless of
> course some reasonable solution is found which doesn't break current
> systemd design?

If I read the git logs correctly the hard timeout was introduced in
April 2011, so reverting it now seems indeed not to help much with all
the running systems out there.

> As part of this series I addressed hunting for the  "misbehaving
> drivers" in-kernel as I saw no progress on the systemd side of things
> to non-fatally detect "misbehaving drivers" despite my original RFCs
> and request for advice. I quote  "misbehaving drivers" as its a flawed
> view to consider them misbehaving now in light of clarifications of
> how the driver core works in that it batches both init and probe
> together always and we can't be penalizing long probes due to the fact
> long probes are simply fine. My patch to pick up "misbehaving drivers"
> drivers on the kernel front by picking up systemd's signal was
> reactive but it was also simply braindead given the same exact reasons
> why systemd's original timeout was flawed. We want a general solution
> and we don't want to work around the root cause, in this case it was
> systemd's assumption on how drivers work.

Would your ongoing work to make probing asynchronous solve this
problem in the long-term? In the short-term I guess bumping the udev
timeout should be sufficient.

> Keep in mind that the above just addresses kmod built-in cmd on
> systemd, its where the timeout was introduced but as has been
> clarified here assuming the same timeout on *all* other built-in
> likely is likely pretty flawed as well and this does concern me. Its
> why I mentioned that more than two years have gone by now on growing
> design and assumptions on top of that original commit and its why its
> hard for systemd to consider an alternative.

All built-ins should be near-instantaneous. If they are not, that
needs to be fixed, or they should not be udev built-ins at all. I have
now added a warning to udev if any builtin-in takes more than a third
of the timeout, so hopefully any problems should be spotted early.

>>>>>  I'm afraid distributions that want to avoid this
>>>>> sigkill at least on the kernel front will have to work around this
>>>>> issue either on systemd by increasing the default timeout which is now
>>>>> possible thanks to Hannes' changes or by some other means such as the
>>>>> combination of a modified non-chatty version of this patch + a check
>>>>> at the end of load_module() as mentioned earlier on these threads.
>>>>
>>>> Increasing the default timeout in systemd seems like the obvious bug fix
>>>> to me.  If the patch exists already, having distros that want it use it
>>>> looks to be correct ... not every bug is a kernel bug, after all.
>>>
>>> Its merged upstream on systemd now, along with a few fixes on top of
>>> it. I also see Kay merged a change to the default timeout to 60 second
>>> on August 30. Its unclear if these discussions had any impact on that
>>> decision or if that was just because udev firmware loading got now
>>> ripped out. I'll note that the new 60 second timeout wouldn't suffice
>>> for cxgb4 even if it didn't do firmware loading, its probe takes over
>>> one full minute.
>>>
>>>> Negotiating a probe vs init split for drivers is fine too, but it's a
>>>> longer term thing rather than a bug fix.
>>>
>>> Indeed. What I proposed with a multiplier for the timeout for the
>>> different types of built in commands was deemed complex but saw no
>>> alternatives proposed despite my interest to work on one and
>>> clarifications noted that this was a design regression. Not quite sure
>>> what else I could have done here. I'm interested in learning what the
>>> better approach is for the future as if we want to marry init + kernel
>>> we need a smooth way for us to discuss design without getting worked
>>> up about it, or taking it personal. I really want this to work as I
>>> personally like systemd so far.
>>
>> How about this: keep the timeout global, but also introduce a
>> (relatively short, say 10 or 15 seconds) timeout after which a warning
>> is printed.
>
> That is something that I originally was looking forward to on systemd,
> but here's the thing once that warning comes up  -- what would we do
> with it?

Short term: bump the timeout further. Long-term, hopefully the driver
(core) can be changed to avoid the problem.

> This patch addresses this warning in-kernel and the idea was
> that we'd then peg an async_probe bool as true on the driver as a fix,
> that was decided to be silly given all the above. These drivers are
> actually not misbehaving and it would be even more incorrect to try to
> "fix" them by making them run asynchronously. In fact for some old
> storage drivers it may even be the worst thing to do given the
> possible slew of userland deployment and scripts which assume things
> *are* synchronous.

As mentioned above, it probably makes sense to switch on the
asynchronous behaviour only for a given call to finit_module(), and
not globally to avoid problems with userland assumptions.

>> Even if nothing is actually killed, having workers (be it
>> insmod or something else) take longer than a couple of seconds is
>> likely a sign that something is seriously off somewhere.
>
> Probe can take a long time and that's fine,

But isn't finit_module() taking a long time a serious problem given
that it means no other module can be loaded in parallel? Even if you
have some storage device which legitimately needs to take a couple of
minutes to probe, you probably still want your computer to boot and
get on with its other tasks whilst you wait... Or worse still, some
insignificant driver is broken and simply hangs in probe, but surely
you still want the rest of the system to boot?

> so for kmod the current
> assumption is flawed. If we had an option to async probe all drivers
> then perhaps this kmod timeout *might be reasonable*, and even then I
> do recommend for a clear warning that can be collected on logs on its
> first iteration rather than sigkilling, only after a whlie should
> sigkilling be done really. If systemd can deal with module loading in
> the background for drivers that take a long time and warning on that
> intsead of sigkiling it may be good start prior to enabling a default
> sigkill on drivers. This is perhaps also true for other workers but
> its not clear if this is a reasonable strategy for systemd.