[systemd-devel] is the watchdog useful?

Pekka Paalanen ppaalanen at gmail.com
Tue Oct 22 09:41:02 UTC 2019


On Tue, 22 Oct 2019 09:17:42 +0000
Zbigniew Jędrzejewski-Szmek <zbyszek at in.waw.pl> wrote:

> On Tue, Oct 22, 2019 at 09:54:31AM +0300, Pekka Paalanen wrote:
> > On Mon, 21 Oct 2019 17:50:44 +0000
> > Zbigniew Jędrzejewski-Szmek <zbyszek at in.waw.pl> wrote:
> >   
> > > In principle, the watchdog for services is nice. But in practice it seems
> > > to bring only grief. The Fedora bugtracker is full of automated reports of ABRTs,
> > > and of those that were fired by the watchdog, pretty much 100% are bogus, in
> > > the sense that the machine was resource starved and the watchdog fired.
> > 
> > Hi,
> > 
> > just curious, is that resource starvation caused by something big,
> > e.g. a browser, using too much memory? That leads the kernel to
> > reclaim pages of program text sections as well, because they can be
> > reloaded from disk at any time. However, those pages are needed again
> > as soon as some CPU core switches process context. The result looks
> > like a hard freeze to the user while the kernel is furiously loading
> > pages from disk just to drop them again, and it can take from minutes
> > to hours before any progress is visible.
> 
> I don't really know. Unfortunately, abrt in Fedora does not collect log
> messages. In the old syslog days, a snippet of /var/log/messages for the
> last 20 minutes or something like that before a crash would be copied
> into the bug report, and this would include kernel messages about disk
> errors, or kernel stalls, or other interesting hints. Unfortunately
> nowadays, because of privacy concerns (?) and an effort to make things
> more efficient (?), just some heavily-filtered journalctl output is
> attached. In practice, this is usually at most a few lines and
> completely useless. In particular, it does not give any hints about
> the overall state of the system.
> 
> I have spoken to abrt maintainers about this, but it seems that this
> problem is specific to systemd, and for most other applications it is
> OK to get a backtrace without any system-wide context. So I don't see
> this changing any time soon ;(
> 
> Sometimes I ask people for logs, and sometimes I get them, and in those
> cases it seems that both hardware issues (e.g. a failing disk) and memory
> exhaustion are often involved. In some cases there is no clear reason.
> And since in the great majority of cases we don't have any logs, it is
> hard to say anything.
> 
> > It has happened to me on Fedora in the past. I could probably dig up
> > discussions about the problem in general if you want, they explain it
> > better than I ever could.
> > 
> > Does Fedora prevent that situation by tuning some kernel knobs nowadays
> > for desktops?  
> 
> I don't think so.

For the case I described, I don't think any error messages are really
produced, unless the OOM killer finally manages to kill something.
That may take half an hour to kick in, or even longer. The kernel is
running just fine but userspace makes little to no progress.

Personally I was just lucky to reproduce the issue while iotop was
running, and I saw disk I/O at 100%, purely from reads. Soon after,
everything just ground to a halt.

Maybe that could explain some of the reports.
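
The situation described above (text pages being evicted and immediately
read back from disk) should show up as a high rate of major page faults.
Below is a minimal sketch of how one might watch for it, assuming a Linux
system where /proc/vmstat exposes the pgmajfault counter. The function
names, the 5 second interval and the threshold of 1000 faults/s are
arbitrary choices for illustration, not anything systemd or Fedora ships.
Note that on a machine that is already thrashing, a script like this may
itself be starved and not get to run.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def major_fault_rate(before, after, seconds):
    """Major page faults per second between two counter samples."""
    return (after["pgmajfault"] - before["pgmajfault"]) / seconds


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the kernel's VM counters."""
    with open(path) as f:
        return parse_vmstat(f.read())


def watch(interval=5.0, threshold=1000.0):
    """Print a warning whenever the major fault rate exceeds threshold.

    Both defaults are illustrative guesses, not tuned values.
    """
    prev = read_vmstat()
    while True:
        time.sleep(interval)
        cur = read_vmstat()
        rate = major_fault_rate(prev, cur, interval)
        if rate > threshold:
            print(f"major faults: {rate:.0f}/s, possible thrashing")
        prev = cur
```

Calling watch() in a terminal before reproducing the problem would give
a rough log of the fault rate leading up to the stall.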


Thanks,
pq