How are Jenkins builds killed exactly?

Stephan Bergmann sbergman at redhat.com
Sun Dec 29 14:17:17 UTC 2019


I am still trying to track down why zombie processes sometimes survive
on the (Linux) Jenkins build machines.  They then make later, unrelated
Jenkins builds on those machines fail, because the zombie soffice.bin
processes still hold onto named pipes that tests from the new builds
want to create too.

One such recent case on tb79 was the aborted 
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/>.  It 
left behind a zombie python.bin -> oosplash -> soffice.bin process tree 
executing UITest_calc_tests3.  Presumably the soffice.bin process had
deadlocked, which then caused Jenkins to react with

> Build timed out (after 15 minutes). Marking the build as aborted.
> Build was aborted
> Finished: ABORTED

By the time I noticed, though, the images of the involved processes had
already been overwritten by later builds, so I couldn't use gdb to get
backtraces.

<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/consoleFull> 
shows that some entity runs lode's tb_slave_wrapper as (the main) part 
of the build, see

> [linux_clang_dbgutil_64] $ /bin/sh -xe /tmp/jenkins3389683698813990355.sh
> + /home/tdf/lode/bin/tb_slave_wrapper --real --mode=config --clean

That tb_slave_wrapper script contains

> trap cleanup 1 2 3 6 15
> 
> cleanup()
> {
>   echo "Caught Signal ... killing everything...."
>   # kill everything in same process group (pseudo-pid 0)
>   kill -9 0
> }

which is intended to kill all processes in the script's process group
if the script itself receives any of SIGHUP, SIGINT, SIGQUIT, SIGABRT,
or SIGTERM.
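
As a quick cross-check of that number-to-name mapping, here is a small
sketch that asks the shell itself.  The helper name signal_names is made
up for illustration and is not part of tb_slave_wrapper; it relies only
on the standard "kill -l <number>" form.

```shell
#!/bin/sh
# Hypothetical helper (not from tb_slave_wrapper): translate signal
# numbers into their names using the shell's own "kill -l <number>".
signal_names() {
  names=
  for n in "$@"; do
    # Append the name, separated by a space once the list is non-empty.
    names="$names${names:+ }$(kill -l "$n")"
  done
  echo "$names"
}

# The numbers passed to "trap cleanup" in the quoted script.
signal_names 1 2 3 6 15
```

On a typical Linux shell this should print "HUP INT QUIT ABRT TERM".
Signal 9 (KILL) is not in that list, and it could not be trapped even if
it were.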

But how does the tb_slave_wrapper script get terminated by whatever 
entity starts it and prints out the

> Build timed out (after 15 minutes). Marking the build as aborted.
> Build was aborted
> Finished: ABORTED

mentioned above?  Could it be that the script itself gets killed with 
SIGKILL, so its cleanup() trap doesn't fire, and processes (indirectly) 
spawned from the script may stay alive?
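
To illustrate the general mechanism behind that suspicion (not what
Jenkins actually does, which is exactly the open question), here is a
minimal sketch.  It starts a stand-in script with a SIGTERM trap, sends
it one signal, and reports whether the trap ran.  The function name
demo_trap and the marker file are made up for this demo.

```shell
#!/bin/sh
# Hypothetical demo, not taken from lode: a stand-in "wrapper" installs a
# TERM trap that records its firing in a marker file, then idles the way
# a long-running build would.
demo_trap() {
  sig=$1
  dir=$(mktemp -d)
  sh -c 'trap "echo caught > \"$0/marker\"; exit 1" 15
         while :; do sleep 1; done' "$dir" >/dev/null 2>&1 &
  pid=$!
  sleep 1                      # give the stand-in time to install its trap
  kill -s "$sig" "$pid"
  wait "$pid" 2>/dev/null || true
  if [ -e "$dir/marker" ]; then
    echo "$sig: trap ran"
  else
    echo "$sig: trap did not run"
  fi
  rm -rf "$dir"
}

demo_trap TERM
demo_trap KILL
```

This should report that the trap ran for TERM but not for KILL.  In the
KILL case the stand-in's child sleep also outlives it for a moment,
which is a small-scale version of the leftover process tree described
above.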

Interestingly, the output from the above

>   echo "Caught Signal ... killing everything...."

doesn't show up anywhere in 
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/consoleFull> 
(supporting the theory that cleanup() doesn't run), while other output 
that apparently stems from similar echo/printf commands in that script 
does show up there, see

> OS:
> pwd:/home/tdf/lode/jenkins/workspace/lo_gerrit/Config/linux_clang_dbgutil_64
> config mode : linux_clang_dbgutil_64
> Taking configuration values from ./distro-configs/Jenkins/linux_clang_dbgutil_64


