How are Jenkins builds killed exactly?
Stephan Bergmann
sbergman at redhat.com
Sun Dec 29 14:17:17 UTC 2019
I'm still trying to track down why zombie processes sometimes survive on
the (Linux) Jenkins build machines. They then make later, unrelated
Jenkins builds on those machines fail, because the leftover soffice.bin
processes still hold onto named pipes that tests from the new builds want
to create too.
One such recent case on tb79 was the aborted
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/>. It
left behind a zombie python.bin -> oosplash -> soffice.bin process tree
executing UITest_calc_tests3. (Presumably the soffice.bin process had
deadlocked, which then caused Jenkins to react with
> Build timed out (after 15 minutes). Marking the build as aborted.
> Build was aborted
> Finished: ABORTED
But by the time I noticed, the images of the involved processes had
already been overwritten by later builds, so I couldn't use gdb to get
backtraces.)
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/consoleFull>
shows that some entity runs lode's tb_slave_wrapper as (the main) part
of the build, see
> [linux_clang_dbgutil_64] $ /bin/sh -xe /tmp/jenkins3389683698813990355.sh
> + /home/tdf/lode/bin/tb_slave_wrapper --real --mode=config --clean
That tb_slave_wrapper script contains
> trap cleanup 1 2 3 6 15
>
> cleanup()
> {
> echo "Caught Signal ... killing everything...."
> # kill everything in same process group (pseudo-pid 0)
> kill -9 0
> }
intended to kill all processes if the script itself receives any of
SIGHUP/-INT/-QUIT/-ABRT/-TERM.
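The number-to-name mapping can be double-checked from a shell. A minimal sketch, assuming a Linux sh whose kill built-in supports the POSIX "kill -l <number>" form (the variable name is just for illustration):

```shell
#!/bin/sh
# Translate the signal numbers from the trap line into signal names.
names=""
for n in 1 2 3 6 15; do
    names="$names $(kill -l "$n")"
done
names=${names# }    # drop the leading space
echo "$names"
```

On Linux this should list HUP, INT, QUIT, ABRT and TERM. Note that 9 (KILL) is not among the trapped signals, and listing it would not help: SIGKILL cannot be caught.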
But how does the tb_slave_wrapper script get terminated by whatever
entity starts it and prints out the
> Build timed out (after 15 minutes). Marking the build as aborted.
> Build was aborted
> Finished: ABORTED
mentioned above? Could it be that the script itself gets killed with
SIGKILL, so its cleanup() trap doesn't fire, and processes (indirectly)
spawned from the script may stay alive?
Interestingly, the output from the above
> echo "Caught Signal ... killing everything...."
doesn't show up anywhere in
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/consoleFull>
(supporting the theory that cleanup() doesn't run), while other output
that apparently stems from similar echo/printf commands in that script
does show up there, see
> OS:
> pwd:/home/tdf/lode/jenkins/workspace/lo_gerrit/Config/linux_clang_dbgutil_64
> config mode : linux_clang_dbgutil_64
> Taking configuration values from ./distro-configs/Jenkins/linux_clang_dbgutil_64
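The SIGKILL theory can at least be checked in isolation. The following is only a sketch under assumptions: wrapper.sh is a hypothetical stand-in for tb_slave_wrapper (not the real script), "sleep 300" stands in for soffice.bin, and the trap kills just the one known child instead of doing "kill -9 0", so that the demo cannot take down the shell that runs it. It says nothing about which signal Jenkins actually sends.

```shell
#!/bin/sh
# Sketch: does a "trap cleanup 1 2 3 6 15" run on SIGTERM vs. SIGKILL,
# and does the spawned child survive?
tmp=$(mktemp -d)

# Hypothetical stand-in for tb_slave_wrapper.
cat > "$tmp/wrapper.sh" <<'EOF'
#!/bin/sh
cleanup()
{
    echo "Caught Signal ... killing everything...."
    # Simplification: kill only the known child, not the process group.
    kill -9 "$child"
    wait "$child" 2>/dev/null
    exit 1
}
trap cleanup 1 2 3 6 15

sleep 300 &              # stands in for the spawned soffice.bin
child=$!
echo "$child" > "$1"     # publish the child's PID for the harness
wait "$child"
EOF

run_case()
{
    sig=$1
    rm -f "$tmp/child.pid"
    sh "$tmp/wrapper.sh" "$tmp/child.pid" > "$tmp/out" 2>&1 &
    wrapper=$!
    # Wait until the wrapper has installed its trap and started the child.
    while [ ! -s "$tmp/child.pid" ]; do sleep 1; done
    child=$(cat "$tmp/child.pid")

    kill -s "$sig" "$wrapper"
    wait "$wrapper" 2>/dev/null

    if grep -q "Caught Signal" "$tmp/out"; then trap_ran=yes; else trap_ran=no; fi
    if kill -0 "$child" 2>/dev/null; then child_alive=yes; else child_alive=no; fi
    kill -9 "$child" 2>/dev/null    # do not leave the demo child behind
    echo "trap_ran=$trap_ran child_alive=$child_alive"
}

term_result=$(run_case TERM)
kill_result=$(run_case KILL)
echo "SIGTERM: $term_result"
echo "SIGKILL: $kill_result"
rm -rf "$tmp"
```

With SIGTERM the echo from cleanup() should show up and the child should be gone. With SIGKILL there should be no echo and the child should still be running, which would match what the console log and the leftover process tree look like.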