More information about hung Jenkins builds

Stephan Bergmann sbergman at redhat.com
Thu May 28 20:19:42 UTC 2020


Following up on the results of the email thread starting at 
<https://lists.freedesktop.org/archives/libreoffice/2019-December/084084.html> 
"How are Jenkins builds killed exactly?", 
<https://git.libreoffice.org/lode/+/bded43937c6efc82efc5924820a281c8a1ead5ba%5E%21> 
"kill-wrapper: pstree of hung processes" tried to improve the 
information provided for a hung and aborted Jenkins build.  Typically, 
such a build is aborted because one or more tests hang, and it would be 
useful to learn at least which tests hung.  To that end, that 
commit tried to print pstree output of any leftover processes, but 
failed; see the comment at 
<https://gerrit.libreoffice.org/c/lode/+/91496/2#message-8e52d669f48a9edb5f183d1221164784059e8959> 
"kill-wrapper: pstree of hung processes" for details.

Now, 
<https://git.libreoffice.org/lode/+/92c9372417f883781471bade5e703518bd1cd5c6%5E%21> 
"Incorporate timeout-on-idle into kill-wrapper, renaming to 
timeout-kill-wrapper" and its follow-up 
<https://git.libreoffice.org/lode/+/4d6d63299fea804ed7cdf63dde46922ed81b4e8a%5E%21> 
"Simplify transition from old kill-wrapper to new timeout kill-wrapper" 
fix that by moving the timeout handling from Jenkins into lode's 
bin/kill-wrapper.  The script now accepts an optional second argument 
specifying a stdout/stderr inactivity timeout in seconds, after which the 
pstree output is generated and the process tree is killed.  Leaving 
the argument out or specifying it as zero disables that timeout logic.
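
To illustrate the idea of an output-inactivity timeout, here is a rough 
Python sketch.  It is not the actual lode script: the function name 
run_with_idle_timeout and its idle_timeout parameter are made up for 
this example, and it assumes a POSIX system with psmisc's pstree on the 
PATH.  It runs a command in its own process group, treats any stdout or 
stderr output as activity, and on inactivity prints pstree output before 
killing the whole group.

```python
import os
import selectors
import signal
import subprocess
import sys


def run_with_idle_timeout(cmd, idle_timeout=0):
    """Run cmd and kill its process group after idle_timeout silent seconds.

    Hypothetical helper for illustration only.  An idle_timeout of 0
    disables the check.  Returns (exit_status, timed_out).
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # either stream counts as activity
        start_new_session=True,    # own process group, so the tree can be signalled
    )
    sel = selectors.DefaultSelector()
    sel.register(proc.stdout, selectors.EVENT_READ)
    timed_out = False
    while True:
        # Wait for output; a timeout of None means "wait forever".
        ready = sel.select(timeout=idle_timeout or None)
        if not ready:
            timed_out = True
            break
        chunk = os.read(proc.stdout.fileno(), 65536)
        if not chunk:  # EOF: every writer has closed the pipe
            break
        sys.stdout.write(chunk.decode(errors="replace"))
        sys.stdout.flush()
    sel.close()
    if timed_out:
        # Show what is still running before killing it.
        try:
            subprocess.run(["pstree", "-a", "-p", str(proc.pid)], check=False)
        except FileNotFoundError:
            print("pstree not available", file=sys.stderr)
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group is already gone
    proc.stdout.close()
    return proc.wait(), timed_out
```

Calling run_with_idle_timeout(["make", "check"], idle_timeout=900) would, 
under these assumptions, abort a build that has printed nothing for 15 
minutes.  Processes that move themselves into a different session or 
process group would escape the killpg call in this sketch.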

For now, I have updated 
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/> to use the 
new kill-wrapper timeout feature instead of Jenkins' "Abort the build if 
it's stuck" option.  (I am planning to roll it out to other Linux 
Jenkins jobs that could benefit from it, once it has proven sufficiently 
stable.)

<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/60539/> is a 
live example of such an aborted Gerrit Jenkins job.  One noticeable 
difference is that such a job is now marked as failed (red dot) rather 
than as aborted (gray dot).  But a new "kill-wrapper" (i.e., 
<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/failure-cause-management/48ce9c26-9d0a-43a8-83d8-c44f54920d59/>) 
failure cause label should make the actual reason for the failure 
obvious.  And the pstree output 
(<https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/60539/consoleFull#147661240548ce9c26-9d0a-43a8-83d8-c44f54920d59>), 
while probably a bit overwhelming, should show that apparently all of 
UITest_calc_tests, UITest_calc_tests4, UITest_calc_tests7, UITest_chart, 
and UITest_demo_ui hung in this case.  That should at least give a hint 
where to start local debugging.


