[systemd-devel] some services always being killed when stress tests running
Han Pingtian
hanpt at linux.vnet.ibm.com
Tue Mar 22 02:02:55 UTC 2016
Hi,
We are running some stress tests as a systemd service by using STAF:
% sudo systemctl status staf.service
● staf.service - Start staf after boot
   Loaded: loaded (/etc/systemd/system/staf.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2016-03-21 20:54:38 CDT; 12s ago
  Process: 16846 ExecStart=/usr/local/staf/tools/start.staf (code=exited, status=0/SUCCESS)
 Main PID: 16890 (STAFProc)
    Tasks: 83 (limit: 512)
   CGroup: /system.slice/staf.service
           ├─16890 /usr/local/staf/bin/STAFProc
           ├─16946 sh -c "java" -Xmx1024m com.ibm.staf.service.STAFServiceHelper STAFJVM1
           └─16947 java -Xmx1024m com.ibm.staf.service.STAFServiceHelper STAFJVM1
Mar 21 20:54:38 pinelp3 start.staf[16846]: + '[' 2 -gt 19 ']'
Mar 21 20:54:38 pinelp3 start.staf[16846]: + '[' 0 = 0 ']'
Mar 21 20:54:38 pinelp3 start.staf[16846]: + STAF local STAX list jobs
Mar 21 20:54:38 pinelp3 start.staf[16846]: + grep Response
Mar 21 20:54:38 pinelp3 start.staf[16846]: + '[' 0 = 1 ']'
Mar 21 20:54:38 pinelp3 start.staf[16846]: + started=1
Mar 21 20:54:38 pinelp3 start.staf[16846]: + '[' 1 = 0 ']'
Mar 21 20:54:38 pinelp3 start.staf[16846]: + echo STAFProc running
Mar 21 20:54:38 pinelp3 start.staf[16846]: STAFProc running
Mar 21 20:54:38 pinelp3 systemd[1]: Started Start staf after boot.
But after only about 30 minutes, a lot of systemd services failed
and were restarted, like this:
... ...
[26885.910036] systemd[1]: systemd-journald.service: Failed with result 'signal'.
[26885.910218] systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=9/KILL
[26885.910614] systemd[1]: systemd-udevd.service: Unit entered failed state.
[26885.910639] systemd[1]: systemd-udevd.service: Failed with result 'signal'.
[26885.931954] systemd[1]: lvm2-lvmetad.service: Main process exited, code=killed, status=9/KILL
[26885.932365] systemd[1]: lvm2-lvmetad.service: Unit entered failed state.
[26885.932385] systemd[1]: lvm2-lvmetad.service: Failed with result 'signal'.
[26885.932484] systemd[1]: accounts-daemon.service: Main process exited, code=killed, status=9/KILL
[26885.932898] systemd[1]: accounts-daemon.service: Unit entered failed state.
[26885.932914] systemd[1]: accounts-daemon.service: Failed with result 'signal'.
[26885.933004] systemd[1]: cron.service: Main process exited, code=killed, status=9/KILL
[26885.933310] systemd[1]: cron.service: Unit entered failed state.
[26885.933325] systemd[1]: cron.service: Failed with result 'signal'.
[26885.933407] systemd[1]: rsyslog.service: Main process exited, code=killed, status=9/KILL
[26885.933817] systemd[1]: rsyslog.service: Unit entered failed state.
[26885.933836] systemd[1]: rsyslog.service: Failed with result 'signal'.
[26885.933924] systemd[1]: systemd-logind.service: Main process exited, code=killed, status=9/KILL
[26885.934339] systemd[1]: systemd-logind.service: Unit entered failed state.
[26885.934355] systemd[1]: systemd-logind.service: Failed with result 'signal'.
[26885.934462] systemd[1]: dbus.service: Main process exited, code=killed, status=9/KILL
[26885.979836] systemd[1]: iprdump.service: Main process exited, code=killed, status=9/KILL
[26885.980310] systemd[1]: iprdump.service: Unit entered failed state.
[26885.980361] systemd[1]: iprdump.service: Failed with result 'signal'.
[26885.980536] systemd[1]: iprinit.service: Main process exited, code=killed, status=9/KILL
[26885.980878] systemd[1]: iprinit.service: Unit entered failed state.
[26885.980888] systemd[1]: iprinit.service: Failed with result 'signal'.
[26885.980953] systemd[1]: iprupdate.service: Main process exited, code=killed, status=9/KILL
[26885.981227] systemd[1]: iprupdate.service: Unit entered failed state.
[26885.981237] systemd[1]: iprupdate.service: Failed with result 'signal'.
[26885.981293] systemd[1]: rtas_errd.service: Main process exited, code=killed, status=9/KILL
[26885.981566] systemd[1]: rtas_errd.service: Unit entered failed state.
[26885.981576] systemd[1]: rtas_errd.service: Failed with result 'signal'.
[26885.981635] systemd[1]: ssh.service: Main process exited, code=killed, status=9/KILL
[26885.981852] systemd[1]: ssh.service: Unit entered failed state.
[26885.981861] systemd[1]: ssh.service: Failed with result 'signal'.
[26885.981925] systemd[1]: polkitd.service: Main process exited, code=killed, status=9/KILL
[26885.982199] systemd[1]: polkitd.service: Unit entered failed state.
[26885.982208] systemd[1]: polkitd.service: Failed with result 'signal'.
[26885.982306] systemd[1]: postgresql@9.5-main.service: Main process exited, code=killed, status=9/KILL
[26886.031683] systemd[1]: staf.service: Main process exited, code=killed, status=9/KILL
[26886.031976] systemd[1]: staf.service: Unit entered failed state.
[26886.031987] systemd[1]: staf.service: Failed with result 'signal'.
[26886.052504] systemd[1]: dbus.service: Unit entered failed state.
[26886.052533] systemd[1]: dbus.service: Failed with result 'signal'.
[26886.055613] systemd[1]: rsyslog.service: Service hold-off time over, scheduling restart.
[26886.056711] systemd[1]: getty@tty1.service: Service has no hold-off time, scheduling restart.
[26886.057801] systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
[26886.058871] systemd[1]: systemd-udevd.service: Service has no hold-off time, scheduling restart.
[26886.058949] systemd[1]: lvm2-lvmetad.service: Service hold-off time over, scheduling restart.
[26886.058979] systemd[1]: systemd-journald.service: Service has no hold-off time, scheduling restart.
[26886.096963] systemd[1]: ssh.service: Service hold-off time over, scheduling restart.
[26886.145441] systemd[1]: serial-getty@hvc0.service: Service hold-off time over, scheduling restart.
[26886.187508] systemd[1]: Stopped Serial Getty on hvc0.
[26886.211419] systemd[1]: Started Serial Getty on hvc0.
[26886.211642] systemd[1]: Stopped OpenBSD Secure Shell server.
[26886.243382] systemd[1]: Starting OpenBSD Secure Shell server...
[26886.243800] systemd[1]: Stopped Flush Journal to Persistent Storage.
[26886.243825] systemd[1]: Stopping Flush Journal to Persistent Storage...
[26886.243943] systemd[1]: Stopped Journal Service.
[26886.244540] systemd[1]: Starting Journal Service...
[26886.244763] systemd[1]: Stopped LVM2 metadata daemon.
[26886.245455] systemd[1]: Started LVM2 metadata daemon.
[26886.245686] systemd[1]: Stopped udev Kernel Device Manager.
[26886.246240] systemd[1]: Starting udev Kernel Device Manager...
... ...
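To get an overview of which units were killed and by which signal, the
log can be summarized with a small script. This is only a rough sketch:
it assumes the log lines have the same "Main process exited, code=killed"
format as shown above, and the function names killed_units and
count_by_signal are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# [26885.933004] systemd[1]: cron.service: Main process exited, code=killed, status=9/KILL
# Assumption: the unit name contains no whitespace.
KILLED_RE = re.compile(
    r"systemd\[1\]: (?P<unit>\S+): Main process exited, "
    r"code=killed, status=(?P<status>\d+/\w+)"
)


def killed_units(lines):
    """Return (unit, status) pairs for every 'code=killed' line, in log order."""
    found = []
    for line in lines:
        match = KILLED_RE.search(line)
        if match:
            found.append((match.group("unit"), match.group("status")))
    return found


def count_by_signal(lines):
    """Count how many main processes were killed per status, e.g. '9/KILL'."""
    return Counter(status for _unit, status in killed_units(lines))
```

Passing the lines of a saved copy of the log to killed_units() lists the
affected units in order, and count_by_signal() shows whether all of them
were terminated by the same signal.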
The systemd version is 229 and the kernel version is 4.5+. We use the
default TasksMax=512. There are no other useful messages about
what happened. How can we figure out the cause and fix this problem?
Thanks in advance!