[systemd-devel] is the watchdog useful?

Vito Caputo vcaputo at pengaru.com
Fri Oct 25 18:16:07 UTC 2019


On Fri, Oct 25, 2019 at 10:26:44AM +0000, Zbigniew Jędrzejewski-Szmek wrote:
> On Thu, Oct 24, 2019 at 02:56:55PM -0700, Vito Caputo wrote:
> > On Thu, Oct 24, 2019 at 10:45:32AM +0000, Zbigniew Jędrzejewski-Szmek wrote:
> > > On Tue, Oct 22, 2019 at 04:35:13AM -0700, Vito Caputo wrote:
> > > > On Tue, Oct 22, 2019 at 10:51:49AM +0000, Zbigniew Jędrzejewski-Szmek wrote:
> > > > > On Tue, Oct 22, 2019 at 12:34:45PM +0200, Umut Tezduyar Lindskog wrote:
> > > > > > I am curious Zbigniew of how you find out if the coredump was on a starved
> > > > > > process?
> > > > > 
> > > > > A very common case is systemd-journald which gets SIGABRT when in a
> > > > > read() or write() or similar syscall. Another case is when
> > > > > systemd-udevd workers get ABRT when doing open() on a device.
> > > > > 
> > > > 
> > > > In the case of journald, is it really in read()/write() syscalls you're
> > > > seeing the SIGABRTs?
> > > 
> > > I was sloppy here — it's not read/write, but various other syscalls.
> > > In particular clone(), which makes sense, because it involves memory
> > > allocation.
> > > 
> > 
> > That's interesting, it's not like journald calls clone() a lot. 
> 
> Hm, maybe it was udevd that was calling clone(), not journald.
> All the reports are available here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1300212
> 

Yeah, nearly all the 2019 reports you marked as duplicates were
udev/logind, at least of the ones I could access.  Surprisingly few were
journald-related at all; are those the ones marked private?

> I opened a pull request to make the watchdog setting configurable
> for our own internal services: https://github.com/systemd/systemd/pull/13843.
> 

It will be interesting to see how many aborts continue to come in with
the timeout at 1h.

Seeing aborts in epoll_wait() suggests these processes either weren't
getting scheduled to run for 3 minutes despite having watchdog events to
service on an fd, or were paged out and took longer than 3 minutes to
page in and actually run.
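
For readers less familiar with the mechanism, here is a minimal sketch of
the service side of the watchdog protocol, assuming the documented
WATCHDOG_USEC and NOTIFY_SOCKET environment variables.  The function names
are made up for illustration, this is not how the systemd daemons
themselves are written (they use sd-event in C), and the WATCHDOG_PID
check is skipped for brevity.

```python
import os
import socket


def watchdog_interval(environ=os.environ):
    """Return the keepalive period in seconds, or None if no watchdog is set.

    The manager exports the timeout as WATCHDOG_USEC (microseconds);
    sd_watchdog_enabled(3) recommends pinging at half that interval.
    """
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2


def notify(state, environ=os.environ):
    """Send a state string such as "WATCHDOG=1" to the notify socket.

    Returns False when NOTIFY_SOCKET is unset, True after sending.
    """
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        addr = b"\0" + path[1:].encode()
    else:
        addr = path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), addr)
    return True
```

With a 3 minute timeout (WATCHDOG_USEC=180000000) this yields a 90 second
keepalive period.  A process that is not scheduled, or is stuck paging in,
never gets to call notify("WATCHDOG=1"), so the manager eventually aborts
it while it is still sitting in epoll_wait().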

If it's the latter, maybe we can lock some of these processes in memory,
gated by another build-time configuration option.  Desktop distros can
spare the memory to keep logind from getting paged out, for instance.  I
know I wouldn't object to having the drm arbiter process always resident
on my machine; it's pretty close to being a kernel component in that role.
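
As a rough illustration of what "locking in memory" could look like, here
is a sketch that calls mlockall(2) through ctypes.  It is not a proposed
patch: the helper name is hypothetical, and the MCL_* values are an
assumption taken from <sys/mman.h> on x86 and arm64 Linux (some
architectures use different constants).

```python
import ctypes

# Assumption: these match <sys/mman.h> on x86 and arm64 Linux only.
MCL_CURRENT = 1  # lock pages that are mapped right now
MCL_FUTURE = 2   # also lock pages mapped later


def lock_all_memory():
    """Try to pin the process's pages in RAM; return (ok, errno)."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0:
        return True, 0
    return False, ctypes.get_errno()
```

The call needs CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK (LimitMEMLOCK=
in a unit file); otherwise it typically fails with EPERM or ENOMEM, which
the helper reports through the returned errno rather than raising.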

Regards,
Vito Caputo

