[systemd-devel] Troubleshooting Failed Nspawn Starts

Fri Aug 14 10:30:36 PDT 2015

On Mon, 10.08.15 08:03, Rich Freeman (r-systemd at thefreemanclan.net) wrote:

> Occasionally I'll have nspawn containers that freeze up when they're
> loading.  What is the best way to troubleshoot these and get useful
> info to devs?

Well, not sure what "freeze" means here... I'd always start by getting
a stack trace of the processes that hang. Try the "pstack" tool on the
processes to get a backtrace.
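
Where pstack is not installed, gdb in batch mode gives roughly the same
information. The following is only a sketch under that assumption: the
helper names are made up for illustration, gdb has to be in PATH, and
attaching to another process normally requires root. Without debug
symbols the traces will be of limited use.

```python
import shutil
import subprocess


def backtrace_command(pid):
    """Build a gdb invocation that prints a backtrace for every thread of pid."""
    return ["gdb", "-p", str(int(pid)), "-batch", "-ex", "thread apply all bt"]


def collect_backtraces(pids, timeout=30):
    """Return {pid: gdb output} for each PID; gdb attaches and detaches again."""
    if shutil.which("gdb") is None:
        raise RuntimeError("gdb not found in PATH")
    results = {}
    for pid in pids:
        proc = subprocess.run(
            backtrace_command(pid),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        # Keep stderr too: gdb reports permission problems there.
        results[pid] = proc.stdout + proc.stderr
    return results
```

Calling collect_backtraces() with the PIDs of the hanging processes, as
seen from the host, returns text that can be pasted into a bug report.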

> This is on systemd-218, on Gentoo.

Upstream we try to focus on very recent systemd only...

> Also, is there any way to detect these freezes, perhaps getting the
> service launching it to at least fail?  Short of installing nagios/etc
> something like this is hard to spot right now.

We have watchdog (see WatchdogSec= documentation in
systemd.service(5)) support in all our long-running daemons, and PID 1
will kill the service and generate a backtrace for them if they don't
send a watchdog message often enough. So actually we should be pretty
good here...
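
For a service of your own, the watchdog message is a small datagram
sent to the socket named in $NOTIFY_SOCKET. A minimal sketch of that
protocol follows, assuming the unit sets WatchdogSec= and allows
notifications from the process (see NotifyAccess= in
systemd.service(5)). The function names mirror sd_notify(3) but are
otherwise illustrative, not an official binding.

```python
import os
import socket


def sd_notify(state):
    """Send one notification datagram to the socket in $NOTIFY_SOCKET.

    Returns False when the variable is unset, i.e. when the process is
    not running under a service manager that wants notifications.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a socket in the abstract namespace.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


def watchdog_interval():
    """Seconds between pings (half the timeout), or None if no watchdog is set."""
    usec = os.environ.get("WATCHDOG_USEC")
    return int(usec) / 2_000_000 if usec else None
```

A daemon would call sd_notify("WATCHDOG=1") from its main loop every
watchdog_interval() seconds. If the loop stalls, the pings stop and the
service manager acts on the missed deadline.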

> Example of a frozen container:
> 
> systemctl status mariadb-contain
> ● mariadb-contain.service - mariadb container
>    Loaded: loaded (/etc/systemd/system/mariadb-contain.service;
> enabled; vendor preset: enabled)
>    Active: active (running) since Mon 2015-08-10 07:21:48 EDT; 37min ago
>      Docs: man:systemd-nspawn(1)
>  Main PID: 1033 (systemd-nspawn)
>    Status: "Container running."
>    CGroup: /system.slice/mariadb-contain.service
>            ├─1033 /usr/bin/systemd-nspawn --quiet --keep-unit --boot
> --link-journal=guest --directory=/sstorage3/cont...
>            ├─1044 /usr/lib/systemd/systemd
>            └─system.slice
>              ├─systemd-journald.service
>              │ └─1407 /usr/lib/systemd/systemd-journald
>              └─systemd-journal-flush.service
>                └─1340 /usr/bin/journalctl --flush

Hmm, this is really weird... Would be good to get a backtrace of both
journald and journalctl here. Note that journald has a much higher PID
than journalctl though, which indicates that it might have gotten
restarted by systemd already...

journalctl --flush actually pretty much only sends SIGUSR1 to
journald, but does this through PID1's bus APIs... It then waits for a
file in /run/systemd/journal/flushed to appear... For some reason that
doesn't work here... Weird...
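
To see which half of that handshake is stuck, one can check whether the
flag file ever shows up. The sketch below simply polls for it with a
timeout, which is an assumption made for simplicity and not necessarily
how journalctl itself waits. The helper name and defaults are
illustrative. The path is resolved as seen by the calling process, so
run it inside the container or pass the path below the container's root
directory.

```python
import os
import time

# Flag file that journalctl --flush waits for, per the description above.
FLUSHED_FLAG = "/run/systemd/journal/flushed"


def wait_for_flag(path=FLUSHED_FLAG, timeout=10.0, poll=0.2):
    """Return True once path exists, or False if timeout seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

If this returns False while journald is running, journald most likely
never completed the flush. If it returns True while journalctl --flush
is still hanging, the problem is more likely on the waiting side.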

Anyway, before tracking this down further, could you update to a more
recent systemd version?

Lennart

-- 
Lennart Poettering, Red Hat