[systemd-devel] Troubleshooting Failed Nspawn Starts
Lennart Poettering
mzerqung at 0pointer.de
Fri Aug 14 10:30:36 PDT 2015
On Mon, 10.08.15 08:03, Rich Freeman (r-systemd at thefreemanclan.net) wrote:
> Occasionally I'll have nspawn containers that freeze up when they're
> loading. What is the best way to troubleshoot these and get useful
> info to devs?
Well, not sure what "freeze" means here... I'd always start by getting
a stack trace of the processes that hang. Try the "pstack" tool on the
processes to get a backtrace.
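If pstack is not packaged on a given distribution, gdb in batch mode can produce a similar backtrace. The following is only a minimal sketch, assuming gdb is installed and that the caller has permission to attach to the target process; the helper names backtrace_command and dump_backtrace are hypothetical.

```python
import subprocess
from typing import List


def backtrace_command(pid: int) -> List[str]:
    """Build a gdb command line that attaches to pid, prints a
    backtrace of every thread, and then exits (batch mode)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtrace(pid: int) -> str:
    """Run gdb against pid and return whatever it printed on stdout.

    Assumption: gdb is on PATH and attaching is permitted (usually
    this means running as root).
    """
    result = subprocess.run(
        backtrace_command(pid),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling dump_backtrace() once per hung PID and attaching the output to a bug report is one way to collect the traces asked for here.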
> This is on systemd-218, on Gentoo.
Upstream we try to focus on very recent systemd only...
> Also, is there any way to detect these freezes, perhaps getting the
> service launching it to at least fail? Short of installing nagios/etc
> something like this is hard to spot right now.
We have watchdog (see WatchdogSec= documentation in
systemd.service(5)) support in all our long-running daemons, and PID 1
will kill the service and generate a backtrace for them if they don't
send a watchdog message often enough. So actually we should be pretty
good here...
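For reference, this is roughly what a watchdog keep-alive looks like from the service side. It is a sketch of the notification protocol described in sd_notify(3), not the code the systemd daemons use (those are written in C); the function names are made up for illustration. It assumes the service manager exported NOTIFY_SOCKET and WATCHDOG_USEC into the service's environment, which is what WatchdogSec= arranges.

```python
import os
import socket
from typing import Optional


def sd_notify(message: str) -> bool:
    """Send one notification datagram to the service manager.

    Returns False when NOTIFY_SOCKET is unset, i.e. the process was
    not started by a manager that expects notifications.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract-namespace socket.
    if path.startswith("@"):
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True


def watchdog_interval_seconds() -> Optional[float]:
    """Return half of the configured watchdog timeout in seconds,
    or None if no watchdog is configured for this service."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2
```

A service would call sd_notify("WATCHDOG=1") from its main loop about every watchdog_interval_seconds(). If the loop hangs, the keep-alives stop and the manager can act on the missed deadline.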
> Example of a frozen container:
>
> systemctl status mariadb-contain
> ● mariadb-contain.service - mariadb container
> Loaded: loaded (/etc/systemd/system/mariadb-contain.service;
> enabled; vendor preset: enabled)
> Active: active (running) since Mon 2015-08-10 07:21:48 EDT; 37min ago
> Docs: man:systemd-nspawn(1)
> Main PID: 1033 (systemd-nspawn)
> Status: "Container running."
> CGroup: /system.slice/mariadb-contain.service
> ├─1033 /usr/bin/systemd-nspawn --quiet --keep-unit --boot
> --link-journal=guest --directory=/sstorage3/cont...
> ├─1044 /usr/lib/systemd/systemd
> └─system.slice
> ├─systemd-journald.service
> │ └─1407 /usr/lib/systemd/systemd-journald
> └─systemd-journal-flush.service
> └─1340 /usr/bin/journalctl --flush
Hmm, this is really weird... Would be good to get a backtrace of both
journald and journalctl here. Note that journald has a much higher PID
than journalctl though, which indicates that it might have gotten
restarted by systemd already...
journalctl --flush actually pretty much only sends SIGUSR1 to
journald, but does this through PID1's bus APIs... It then waits for a
file in /run/systemd/journal/flushed to appear... For some reason that
doesn't work here... Weird...
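To check from inside the container whether the flush marker ever shows up, a simple polling loop with a timeout is enough. This is a diagnostic sketch only: it is not how journalctl itself waits, and wait_for_file is a hypothetical helper name.

```python
import os
import time


def wait_for_file(path: str, timeout: float, poll: float = 0.1) -> bool:
    """Poll until path exists or timeout seconds have elapsed.

    Returns True if the file appeared, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

For example, wait_for_file("/run/systemd/journal/flushed", 30.0) returning False after a flush request would confirm that journald never created the marker, as opposed to journalctl missing it.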
Anyway, before tracking this down further, could you update to a more
recent systemd version?
Lennart
--
Lennart Poettering, Red Hat