[systemd-devel] Diagnosing hang on reboot

Daniel Drake dsd at laptop.org
Wed Jul 4 10:46:23 PDT 2012


Hi,

We're encountering a systemd hang on reboot which is proving hard to
debug, on the OLPC XO platform (systemd-44 on Fedora 17). It doesn't
happen every time, but it is frequent: on a system that reboots once
every 2-3 minutes, it reproduces within an hour (usually much
quicker). Can anyone suggest debugging techniques for the
following situation, or are there similar-sounding bug reports already
that might provide clues?

- /sbin/reboot is run, and exits with code 0, without producing any
output on stderr or stdout.

- the reboot process is definitely initiated, because plymouth's
shutdown screen comes up, and the serial console getty is stopped

- the hang happens with the plymouth shutdown splash on-screen, and
the system continues responding to keypresses (showing/hiding the
plymouth splash)

- disabling the plymouth shutdown splash doesn't solve the hang, and
no interesting messages appear on the console either

- the system no longer responds to sysrq over serial (even when the
kernel sysrq_always_enabled parameter is used)

- the shutdown scripts in /usr/lib/systemd/system-shutdown are not called

- enabling systemd debugging via the kernel parameters
"systemd.log_level=debug systemd.log_target=kmsg" makes the hang go
away (I left a system reboot-looping with this configuration for 24
hours without hitting the issue)

Any tips appreciated.


This may well not be a systemd bug, because we don't hit it when
rebooting from a "normal" session, but I am hoping systemd can help us
find the problem. We hit this issue when rebooting
after running our manufacturing tests, which aim to hammer the system
very hard and activate as many components as possible (microphone,
camera, screen, disk, RAM check, ...). These tests are activated as
follows:

 1. During boot, runin-check.service (runs early) notes that the
laptop's manufacturing data says that the system should run
manufacturing tests rather than starting a real session. The
runin-check program then calls "systemctl isolate runin.target"
 2. runin.target starts the runin-main program which opens an X
session and kicks off all kinds of tests
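
For readers unfamiliar with this flow, step 1 can be sketched roughly as
follows. This is only an illustration, not the real /runin/runin-check:
how the manufacturing data is read is not shown in this mail, so the
run_tests flag and both function names are hypothetical stand-ins.

```python
import subprocess


def isolate_command(run_tests):
    """Return the systemctl command for step 1, or None for a normal boot.

    run_tests is a hypothetical flag standing in for whatever the
    manufacturing data says; the real lookup is not shown here.
    """
    if run_tests:
        return ["systemctl", "isolate", "runin.target"]
    return None


def runin_check(run_tests, execute=False):
    """Decide what to do and, only if execute is True, actually run it."""
    cmd = isolate_command(run_tests)
    if cmd is not None and execute:
        # Switches the running system over to runin.target.
        subprocess.check_call(cmd)
    return cmd
```

With execute left at False the sketch only reports the command it would
run, so it can be tried on a machine that has no runin.target.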

Here are the debug logs from a successful boot-to-reboot cycle (when
things work OK):
http://dev.laptop.org/~dsd/20120704/runin-verbose.txt
At 15.991475, runin-check runs "systemctl isolate runin.target"
At 18.969571, runin tests start
At 30.505676, runin tests fail and the reboot process is initiated. (I
deliberately triggered the fail so that I don't have to wait a long
time for the reboot to happen)
At 36.818280, "/sbin/reboot" is called by runin
At 46.956082 the scripts in /usr/lib/systemd/system-shutdown are called
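
To make the gaps between these steps easier to see, here is a small
sketch that computes the time elapsed between consecutive events. The
timestamps and labels are copied from the list above; the helper name
deltas is just for illustration.

```python
# Timestamps (seconds, as printed in the log) copied from the list above.
events = [
    (15.991475, 'runin-check runs "systemctl isolate runin.target"'),
    (18.969571, "runin tests start"),
    (30.505676, "runin tests fail, reboot process initiated"),
    (36.818280, '"/sbin/reboot" called by runin'),
    (46.956082, "system-shutdown scripts called"),
]


def deltas(events):
    """Return (seconds since previous event, label) for each later event."""
    return [
        (round(t - prev_t, 6), label)
        for (prev_t, _), (t, label) in zip(events, events[1:])
    ]


for dt, label in deltas(events):
    print(f"+{dt:.6f}s  {label}")
```

In this successful run, roughly 10.1 seconds pass between /sbin/reboot
being called and the system-shutdown scripts running.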


Here are the relevant service/target files:

runin-check.service:

[Unit]
Description=Check whether to run OLPC run-in tests
DefaultDependencies=no
Requires=olpc-configure.service
After=olpc-configure.service
Before=basic.target

[Service]
Type=oneshot
ExecStart=/runin/runin-check

[Install]
WantedBy=basic.target




runin.target:

[Unit]
Description=OLPC run-in tests
AllowIsolate=true
DefaultDependencies=no
Requires=runin.service
After=olpc-configure.service
Wants=plymouth-quit.service plymouth-quit-wait.service




runin.service:


[Unit]
Description=OLPC run-in tests
DefaultDependencies=no
Wants=udev-settle.service
After=udev-settle.service plymouth-quit.service plymouth-quit-wait.service

[Service]
ExecStart=/runin/runin-main


Any help appreciated; this is currently the last blocking bug we have
preventing our latest software image (our first systemd-based
release!) from entering mass-production in the factory.

Thanks!
Daniel

