[systemd-devel] [PATCH] udev: warn instead of killing kmod loading

Tom Gundersen teg at jklm.no
Thu Sep 11 14:50:19 PDT 2014


On Thu, Sep 11, 2014 at 11:02 PM, Luis R. Rodriguez <mcgrof at suse.com> wrote:
> On Thu, Sep 11, 2014 at 1:07 PM, Tom Gundersen <teg at jklm.no> wrote:
>> On Mon, Aug 11, 2014 at 7:19 PM, Luis R. Rodriguez <mcgrof at suse.com> wrote:
>>> On Mon, Aug 11, 2014 at 12:57 PM, Lennart Poettering
>>> <lennart at poettering.net> wrote:
>>>> On Mon, 11.08.14 18:39, Luis R. Rodriguez (mcgrof at suse.com) wrote:
>>>>
>>>>> > This looks really wrong. We shouldn't permit worker processes to be
>>>>> > blocked indefinitely without any timeout applied. Designing a worker
>>>>> > process system like that is simply wrong. It's one thing to allow
>>>>> > changing the specific timeout applied, it's a very different thing to
>>>>> > allow broken drivers to completely stall the worker process logic.
>>>>>
>>>>> OK what if we enable customizations then on the timeout by the built-in
>>>>> cmd type and we use a high multiplier for now for kmod ? A multiplier
>>>>> for kmod of 10 would set the kmod timeout to 5 minutes then, as we
>>>>> sweep up and clean drivers we can reduce this over time. This would also
>>>>> enable us to keep the default timeout for the other type of workers.
>>>>
>>>> Why this complexity?
>>>>
>>>> I mean, it sounds much simpler to simply increase the default timeout a
>>>> bit, if it turns out to be too low for the current cases...
>>>
>>> True, there's two things here and one of which this v2 patch didn't address:
>>>
>>> 1) It'd be good for defaults on systemd to work on most systems based
>>> on upstream kernels today, right now folks deploying systemd would
>>> need to modify the default timeout. Are we up to bump the default up
>>> considerably? If its high, would that be unfair for classes of workers
>>> we know shouldn't take that long, or wouldn't that allow folks
>>> developing new workers to take longer?
>>>
>>> 2) We want chatty logs to allow us to keep track of drivers that need
>>> attention. Ideally right now we should strive for 30 seconds init and
>>> work on asynching most work, we want to do this in a non fatal way.
>>> Overriding the timeout won't let us to keep track of buggy drivers
>>> that need love from systemd's perspective to stay within the 30 second
>>> bound time. We can have chatty logs from the kernel but using a
>>> timeout on the driver core seems a bit overkill specially if systemd
>>> is already keeping track of driver's init time, so it'd be better if
>>> we could collect offending drivers from systemd. I could have
>>> implemented support for this in this v2 patch by simply adding another
>>> check using the default timeout.
>>
>> Hi Luis,
>>
>> Just following up on this "old" thread: I have now bumped the default
>> kill timeout (for all workers) to 180 seconds,
>
> Thanks Tom, this should help quite a lot while we try to get in the
> kernel what systemd had assumed was there which is async probe for all
> drivers. I'll poke back once we have that option available on the
> kernel and that should help mitigate other issues I suppose.

Great!

>> and added a warning
>> which is triggered after a third of the timeout.
>
> This is great! What commit merged this?I just looked at the latest
> commits and couldn't find one associated with this. I also checked the
> mailing list for pending patches.

Oops, my push failed. Just pushed now:
<http://cgit.freedesktop.org/systemd/systemd/commit/?id=671174136525ddf208cdbe75d6d6bd159afa961f>.

> Also what would we use to hunt for
> these? Would they show up on:
>
> journalctl -b -0 -u systemd-udevd

Precisely. Generated by making processing hang for all network devices:

Sep 11 21:26:58 tomegun-x240 systemd-udevd[404]: worker [406]
/devices/virtual/net/lo is taking a long time
Sep 11 21:26:58 tomegun-x240 systemd-udevd[404]: worker [407]
/devices/pci0000:00/0000:00:1c.1/0000:03:00.0/net/wlan0 is taking a
long time
Sep 11 21:26:58 tomegun-x240 systemd-udevd[404]: worker [411]
/devices/pci0000:00/0000:00:19.0/net/enp0s25 is taking a long time
Sep 11 21:26:58 tomegun-x240 systemd-udevd[404]: worker [412]
/devices/pci0000:00/0000:00:14.0/usb2/2-4/2-4:1.6/net/wwan0 is taking
a long time
Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [406]
/devices/virtual/net/lo timeout; kill it
Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: seq 2214
'/devices/virtual/net/lo' killed
Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [407]
/devices/pci0000:00/0000:00:1c.1/0000:03:00.0/net/wlan0 timeout; kill
it
Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: seq 2502
'/devices/pci0000:00/0000:00:1c.1/0000:03:00.0/net/wlan0' killed
Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [411]
/devices/pci0000:00/0000:00:19.0/net/enp0s25 timeout; kill it
Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: seq 2015
'/devices/pci0000:00/0000:00:19.0/net/enp0s25' killed
Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [412]
/devices/pci0000:00/0000:00:14.0/usb2/2-4/2-4:1.6/net/wwan0 timeout;
kill it
Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: seq 2455
'/devices/pci0000:00/0000:00:14.0/usb2/2-4/2-4:1.6/net/wwan0' killed
Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [406]
terminated by signal 9 (Killed)
Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [407]
terminated by signal 9 (Killed)
Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [411]
terminated by signal 9 (Killed)
Sep 11 21:28:58 tomegun-x240 systemd-udevd[404]: worker [412]
terminated by signal 9 (Killed)

> I ask as when I tried to add something similar I was unable to capture
> anything with the above. We'd use this to then monitor for these prior
> to assuming other issues then.
>
>> Let me know if this covers what you need, or if there is anything else
>> we should do on the systemd side.
>
> Thanks! I realize that at this point perhaps removing the sigkill may
> not be possible at all given that this is part of an original design
> consideration but if we can at least address all other possible
> concerns this would be great. I know for instance James has concerns
> with any arbitrary timeout in general, but perhaps he can elaborate
> more on that. Also Jiri made this one point about reboot and the
> timeout:
>
>>>> This aproach is actually introducing a user-visible regressions. Look, for
>>>> example, exec() never times out. Therefore if your system is on its knees,
>>>> heavily overloaded (or completely broken), you are likely to be able to
>>>> `reboot' it, because exec("/sbin/reboot") ultimately succeeds.
>>>>
>>>> But with all the timeouts, dbus, "Failed to issue method call: Did
>>>> not receive a reply" messages, this is getting close to impossible.


More information about the systemd-devel mailing list