How are Jenkins builds killed exactly?

Stephan Bergmann sbergman at redhat.com
Wed Jan 8 15:23:21 UTC 2020


On 29/12/2019 15:17, Stephan Bergmann wrote:
> Still trying to track down why sometimes zombie processes survive on the 
> (Linux) Jenkins build machines (and then make later, unrelated Jenkins 
> builds on those machines fail when zombie soffice.bin processes still 
> hold onto named pipes that tests from the new builds want to create too).
> 
> One such recent case on tb79 was the aborted 
> <https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/49895/>.  It 
> left behind a zombie python.bin -> oosplash -> soffice.bin process tree 
> executing UITest_calc_tests3.  (Where presumably the soffice.bin process 
> had deadlocked, which then caused the Jenkins
> 
>> Build timed out (after 15 minutes). Marking the build as aborted.
>> Build was aborted
>> Finished: ABORTED
> 
> reaction.  But once I noticed, the images of the involved processes had 
> already been overwritten by later builds, so I couldn't use gdb to get 
> backtraces.)

I think I now understand what's going on:  Assume some UITest hangs with 
a deadlock in soffice.bin.  <https://plugins.jenkins.io/build-timeout> 
will kick in after the specified timeout of 900s to kill the build.

The relevant process tree is

> java───sh───tb-slave-wrapper─┬─make───make───sh───sh───python.bin───oosplash───soffice.bin
>                              └─tee

where:

(1) java is running Jenkins' remoting.jar.

(2) The following sh is Jenkins' way of running

> ${LODE_HOME}/bin/tb_slave_wrapper --real --mode=config --clean

(as specified in "Build - Execute shell - Command" at 
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/configure>) 
via some `/bin/sh -xe /tmp/jenkins*.sh` intermediary.

(3) tb-slave-wrapper is running the 
<https://gerrit.libreoffice.org/plugins/gitiles/lode/+/af02c3a9564062b4d04e457275624a7a30ba2ba2/bin/tb_slave_wrapper> 
script, which at line 325 does

>         make ${keep_going} $target 2>&1 | tee -a ${build_log}

which explains...

(4) ...the first make and...

(5) the tee.

(6) The second make is due to gbuild calling make recursively.

(7) The following sh is running the UITest target's (heavily redacted) 
recipe line

> /bin/sh -c 'S=... && I=... && W=.. && rm -rf $W/UITest/... && mkdir -p $W/UITest/... && ... && ( TDOC=... /bin/sh $I/program/python $S/uitest/test_main.py ... || ( RET=$?; $S/solenv/bin/gdb-core-bt.sh ...))'

and...

(8) ...the following sh is running the

> TDOC=... /bin/sh $I/program/python $S/uitest/test_main.py ...

part in a subshell.

(9) python.bin is running LO's uitest/test_main.py, which...

(10) ...spawns the soffice script which then execs oosplash, which...

(11) ...spawns soffice.bin.
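
For reference, such a tree can be listed on a build machine with pstree 
from psmisc; a minimal sketch, assuming a single agent java process 
identifiable by its remoting.jar argument:

> # show the agent's process tree with PIDs and command-line arguments
> # (pgrep -o picks the oldest match, in case there are several)
> pstree -pa $(pgrep -o -f remoting.jar)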

<https://ci.libreoffice.org/> says at the bottom "Jenkins ver. 2.212", 
so let's assume that 
<https://github.com/jenkinsci/jenkins/commit/13ab7b9909927b7afa31797097b6114bf22a2a41> 
"[maven-release-plugin] prepare release jenkins-2.212" contains the 
source code of the remoting.jar running on the slave.

I assume the relevant starting point for the killing of the Jenkins 
job's processes is 
<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/Launcher.java#L953>

>         @Override
>         public void kill(Map<String, String> modelEnvVars) throws InterruptedException {
>             ProcessTree.get().killAll(modelEnvVars);
>         }

which will try to kill all processes it finds that have inherited the 
BUILD_ID env var identifying the given Jenkins job.

<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L731>

>         public void killAll(@Nonnull Map<String, String> modelEnvVars) throws InterruptedException {
>             for (OSProcess p : this)
>                 if(p.hasMatchingEnvVars(modelEnvVars))
>                     p.killRecursively();
>         }

first reads the /proc file system tree to get a list of process IDs (in 
the ProcfsUnix constructor at 
<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L741>), 
then iterates over that list to find processes with a matching BUILD_ID 
and tries to kill them.
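
That env var matching can be mimicked from a shell; a rough sketch of 
what scanning /proc/<pid>/environ amounts to, with a made-up BUILD_ID 
value:

> # list the PIDs whose environment contains the given BUILD_ID
> # (the entries in /proc/<pid>/environ are NUL-separated, hence the tr)
> BUILD_ID=some-build-id   # hypothetical value
> for d in /proc/[0-9]*; do
>   tr '\0' '\n' < "$d/environ" 2>/dev/null \
>     | grep -qx "BUILD_ID=$BUILD_ID" && echo "${d#/proc/}"
> done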

This looks racy, in that a to-be-killed build can spawn further 
processes after /proc has been read to produce the list.  But that 
should not be relevant in our scenario of killing a hung build after a 
timeout, as that build will no longer be spawning new processes 
(everything but the deadlocked soffice.bin and its parent chain has 
already terminated).
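
A trivial shell illustration of that race, for what it's worth:

> # a PID list snapshot taken up front misses anything spawned afterwards
> pids=$(pgrep -P $$)      # snapshot of this shell's current children
> sleep 300 &              # spawned after the snapshot
> kill $pids 2>/dev/null   # the new sleep is not in $pids and survives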

However, what appears to be relevant here is that processes are killed 
in a random order (see below).

The actual process killing appears to take place at 
<https://github.com/jenkinsci/jenkins/blob/13ab7b9909927b7afa31797097b6114bf22a2a41/core/src/main/java/hudson/util/ProcessTree.java#L909>

>         public static void destroy(int pid) throws IllegalAccessException,
>                 InvocationTargetException {
>             if (JAVA8_DESTROY_PROCESS != null) {
>                 JAVA8_DESTROY_PROCESS.invoke(null, pid, false);
>             } else {
>                 final Optional handle = (Optional)JAVA_9_PROCESSHANDLE_OF.invoke(null, pid);
>                 if (handle.isPresent()) {
>                     JAVA_9_PROCESSHANDLE_DESTROY.invoke(handle.get());
>                 }
>             }
>         }

For Java <= 8 it will call the internal 
java.lang.UNIXProcess.destroyProcess with force=false at 
<http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/5b5973c3db08/src/solaris/native/java/lang/UNIXProcess_md.c#l717>

> JNIEXPORT void JNICALL
> Java_java_lang_UNIXProcess_destroyProcess(JNIEnv *env,
>                                           jobject junk,
>                                           jint pid,
>                                           jboolean force)
> {
>     int sig = (force == JNI_TRUE) ? SIGKILL : SIGTERM;
>     kill(pid, sig);
> }

which thus sends a SIGTERM.  Similarly for Java >= 9 it will call the 
internal java.lang.ProcessHandle.destroy at 
<http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/share/classes/java/lang/ProcessHandle.java#l324> 
(rather than the companion java.lang.ProcessHandle.destroyForcibly), 
which ends up at 
<http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/65464a307408/src/java.base/unix/native/libjava/ProcessHandleImpl_unix.c#l312>

> JNIEXPORT jboolean JNICALL
> Java_java_lang_ProcessHandleImpl_destroy0(JNIEnv *env,
>                                           jobject obj,
>                                           jlong jpid,
>                                           jlong startTime,
>                                           jboolean force) {
>     pid_t pid = (pid_t) jpid;
>     int sig = (force == JNI_TRUE) ? SIGKILL : SIGTERM;
>     jlong start = Java_java_lang_ProcessHandleImpl_isAlive0(env, obj, jpid);
> 
>     if (start == startTime || start == 0 || startTime == 0) {
>         return (kill(pid, sig) < 0) ? JNI_FALSE : JNI_TRUE;
>     } else {
>         return JNI_FALSE;
>     }
> }

with force=false, which thus also sends a SIGTERM.

To summarize (and if my browsing of the Jenkins source code is correct), 
Jenkins will send each process, in effectively random order, a SIGTERM, 
and will send no process a SIGKILL.
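
The practical difference between the two signals is easy to demonstrate 
in a shell:

> # a shell that ignores SIGTERM survives it; SIGKILL cannot be caught
> # or ignored
> sh -c 'trap "" TERM; while :; do sleep 1; done' &
> kill $!      # SIGTERM: ignored, the loop keeps running
> kill -9 $!   # SIGKILL: terminates the shell for sure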

<https://gerrit.libreoffice.org/plugins/gitiles/lode/+/bea0738dbadfe8784e5d3c00f533acf101db4e7e%5E!/> 
"tb_slave_wrapper: trap signal and kill -9 everything"

> trap cleanup 1 2 3 6 15
> 
> cleanup()
> {
>   echo "Caught Signal ... killing everything...."
>   # kill everything in same process group (pseudo-pid 0)
>   kill -9 0
> }

ensures that, if tb_slave_wrapper receives a signal asking it to quit 
(incl. SIGTERM=15), it will eventually send a SIGKILL to all processes 
in its process group.  ("Eventually" because Bash defers a trap handler 
until the currently executing foreground command has completed and the 
script would proceed to its next line.)
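
That deferral is easy to see in isolation:

> # Bash runs a trap handler only once the currently executing
> # foreground command has returned
> trap 'echo "handler runs now"' TERM
> sleep 60   # a SIGTERM delivered here is acted upon only when sleep exits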

The processes (9)--(11) (python.bin, oosplash, soffice.bin) will not 
respond to their SIGTERM, as the former two are stuck waiting on their 
respective child, while the latter is otherwise deadlocked.  They need 
to rely on the SIGKILL that might get sent from (3) tb-slave-wrapper.

But, as processes get the SIGTERM in a random order, if e.g. (8) (the 
innermost sh) gets a SIGTERM first, it will terminate and cause its 
parent chain (7), (6), (4), (3) to terminate too.  That means that (3) 
tb-slave-wrapper can terminate before it would receive a SIGTERM, so it 
will not call cleanup.
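
That unwinding can be reproduced directly:

> # signalling only the innermost process makes the chain of waiting
> # parents run to completion without any of them being signalled
> sh -c 'sh -c "sleep 300"; echo "outer sh exited normally"' &
> sleep 1
> pkill -xf "sleep 300"   # only the leaf process gets a signal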

That should explain those occasional

> python.bin───oosplash───soffice.bin

zombie chains (hanging off process 1) from timed-out Jenkins builds 
that we observe on tb75, tb76, and tb79.
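
Such leftover chains can be spotted by looking for soffice.bin processes 
that have been reparented to PID 1; a hedged sketch:

> # list soffice.bin processes whose parent is process 1, i.e. orphans
> # left behind by an earlier build
> ps -o pid=,ppid=,cmd= -C soffice.bin | awk '$2 == 1'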

Which in turn means that

On 29/12/2019 15:35, Noel Grandin wrote:
> If we can't fix this, I suggest we add:
> 
> kill $(ps -o pid= --ppid $$)
> 
> to the end of the Jenkins build script

would not help, as the problematic processes do not have that $$ as 
their (immediate) parent.
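
To reach them, a cleanup along those lines would have to recurse through 
the whole descendant tree; an untested sketch of such a variant:

> # kill the entire subtree below a given PID, depth-first
> killtree() {
>   for c in $(ps -o pid= --ppid "$1"); do
>     killtree "$c"
>   done
>   kill -9 "$1" 2>/dev/null
> }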

(And what would happen if the cleanup function in tb_slave_wrapper did 
kick in?  I think it would wreak havoc, as the outer Jenkins 
remoting.jar process, as well as any parallel tb_slave_wrapper 
instances, are in the same process group, and would thus all get 
forcibly killed.)


