[systemd-devel] [PATCH] udev: warn instead of killing kmod loading

Thu Sep 11 14:02:25 PDT 2014

On Thu, Sep 11, 2014 at 1:07 PM, Tom Gundersen <teg at jklm.no> wrote:
> On Mon, Aug 11, 2014 at 7:19 PM, Luis R. Rodriguez <mcgrof at suse.com> wrote:
>> On Mon, Aug 11, 2014 at 12:57 PM, Lennart Poettering
>> <lennart at poettering.net> wrote:
>>> On Mon, 11.08.14 18:39, Luis R. Rodriguez (mcgrof at suse.com) wrote:
>>>
>>>> > This looks really wrong. We shouldn't permit worker processes to be
>>>> > blocked indefinitely without any timeout applied. Designing a worker
>>>> > process system like that is simply wrong. It's one thing to allow
>>>> > changing the specific timeout applied, it's a very different thing to
>>>> > allow broken drivers to completely stall the worker process logic.
>>>>
>>>> OK what if we enable customizations then on the timeout by the built-in
>>>> cmd type and we use a high multiplier for now for kmod ? A multiplier
>>>> for kmod of 10 would set the kmod timeout to 5 minutes then, as we
>>>> sweep up and clean drivers we can reduce this over time. This would also
>>>> enable us to keep the default timeout for the other type of workers.
>>>
>>> Why this complexity?
>>>
>>> I mean, it sounds much simpler to simply increase the default timeout a
>>> bit, if it turns out to be too low for the current cases...
>>
>> True, there's two things here and one of which this v2 patch didn't address:
>>
>> 1) It'd be good for defaults on systemd to work on most systems based
>> on upstream kernels today, right now folks deploying systemd would
>> need to modify the default timeout. Are we up to bump the default up
>> considerably? If its high, would that be unfair for classes of workers
>> we know shouldn't take that long, or wouldn't that allow folks
>> developing new workers to take longer?
>>
>> 2) We want chatty logs to allow us to keep track of drivers that need
>> attention. Ideally right now we should strive for 30 seconds init and
>> work on asynching most work, we want to do this in a non fatal way.
>> Overriding the timeout won't let us to keep track of buggy drivers
>> that need love from systemd's perspective to stay within the 30 second
>> bound time. We can have chatty logs from the kernel but using a
>> timeout on the driver core seems a bit overkill specially if systemd
>> is already keeping track of driver's init time, so it'd be better if
>> we could collect offending drivers from systemd. I could have
>> implemented support for this in this v2 patch by simply adding another
>> check using the default timeout.
>
> Hi Luis,
>
> Just following up on this "old" thread: I have now bumped the default
> kill timeout (for all workers) to 180 seconds,

Thanks Tom, this should help quite a lot while we try to get in the
kernel what systemd had assumed was there which is async probe for all
drivers. I'll poke back once we have that option available on the
kernel and that should help mitigate other issues I suppose.

> and added a warning
> which is triggered after a third of the timeout.

This is great! What commit merged this? I just looked at the latest
commits and couldn't find one associated with this. I also checked the
mailing list for pending patches. Also what would we use to hunt for
these? Would they show up on:

journalctl -b -0 -u systemd-udevd

I ask as when I tried to add something similar I was unable to capture
anything with the above. We'd use this to then monitor for these prior
to assuming other issues then.

> Let me know if this covers what you need, or if there is anything else
> we should do on the systemd side.

Thanks! I realize that at this point perhaps removing the sigkill may
not be possible at all given that this is part of an original design
consideration but if we can at least address all other possible
concerns this would be great. I know for instance James has concerns
with any arbitrary timeout in general, but perhaps he can elaborate
more on that. Also Jiri made this one point about reboot and the
timeout:

>>> This aproach is actually introducing a user-visible regressions. Look, for
>>> example, exec() never times out. Therefore if your system is on its knees,
>>> heavily overloaded (or completely broken), you are likely to be able to
>>> `reboot' it, because exec("/sbin/reboot") ultimately succeeds.
>>>
>>> But with all the timeouts, dbus, "Failed to issue method call: Did
>>> not receive a reply" messages, this is getting close to impossible.

Thanks again.

  Luis