[Bug 111747] [CI][DRMTIP] igt@ - incomplete - Jenkins gives up

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Tue Oct 1 09:24:48 UTC 2019


https://bugs.freedesktop.org/show_bug.cgi?id=111747

Petri Latvala <petri.latvala at intel.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|IGT                         |DRM/Intel
           Assignee|dri-devel at lists.freedesktop |intel-gfx-bugs at lists.freede
                   |.org                        |sktop.org
           Priority|not set                     |medium
           Severity|not set                     |normal
         QA Contact|                            |intel-gfx-bugs at lists.freede
                   |                            |sktop.org
      i915 features|GEM/Other                   |CI Infra

--- Comment #15 from Petri Latvala <petri.latvala at intel.com> ---
This happens on TGL in 5 of 16 runs (31.25%); it was last seen in the previous build.

(I mention TGL since this bug seems to be for the TGL occurrences, but it can
happen to any machine.)

User impact for this issue in particular is N/A, since it is a CI issue. However,
incompletes reduce coverage for every test that does not get run because of
them, so the impact on CI is potentially severe. It does not happen on every
run, though, and it hits arbitrary tests, so the actual coverage loss stays
below that worst case.

What happens here is:

1) Jenkins connects to DUT through ssh and launches tests
2) Jenkins loses ssh connection
3) The Jenkins job for executing the test finishes, because the ssh command
completed
4) At the end of finishing a test, a reboot-and-collect job is executed
5) The reboot-and-collect job connects through ssh and reboots the machine
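
To illustrate steps 2 and 3: one way a job could tell a dropped connection
apart from a finished test round is the ssh exit status. ssh exits with the
status of the remote command, or with 255 if an error occurred in ssh itself.
The sketch below is only an assumption about how such a check might look, not
the actual Jenkins job; the function names and outcome labels are made up for
illustration.

```python
import subprocess

SSH_ERROR_STATUS = 255  # ssh(1) exits with 255 when ssh itself hit an error


def classify_ssh_exit(returncode):
    """Map the exit status of a local ssh process to a coarse outcome label."""
    if returncode == 0:
        return "completed"
    # A negative value means the local ssh process was killed by a signal;
    # 255 means ssh reported an error such as a lost connection.
    if returncode < 0 or returncode == SSH_ERROR_STATUS:
        return "connection-lost"
    return "remote-failure"


def run_remote(host, command):
    """Run `command` on `host` over ssh and classify how the session ended."""
    proc = subprocess.run(["ssh", host, command])
    return classify_ssh_exit(proc.returncode)
```

Note that a remote command can itself exit with 255, so this check alone is
ambiguous; it only shows why "the ssh command completed" is not the same as
"the tests completed".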

The remote reboot job got a logging step added: tests that die because the
reboot command was invoked prematurely get a log entry in dmesg stating that
power.sh is taking this machine down. From that we can determine that the
network didn't die completely, just the ssh connection.
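
A check for that log entry could be sketched as follows. The exact wording of
the dmesg line is not given here, so matching on the substring "power.sh" is an
assumption, and the function name is hypothetical.

```python
def reboot_was_requested(dmesg_text, marker="power.sh"):
    """Return True if any dmesg line mentions the reboot script.

    `marker` is an assumed substring; adjust it to the real log message.
    """
    return any(marker in line for line in dmesg_text.splitlines())
```

If the marker shows up in the dmesg of an incomplete test, the machine was
taken down on request over the network, which is what points at the ssh
session rather than the network as a whole.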

There is a plan to solve this. igt_runner will be changed to expose an AF_LOCAL
socket for outside control, and the Jenkins job for executing tests will then
no longer be required to keep an ssh connection active for the duration of
the whole test round. Instead, tests will be launched in the background (with
screen or tmux or just nohup), and the Jenkins job will reconnect over ssh
if the connection fails and check through igt_runner's control channel whether
a test is still running.
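
As a rough sketch of what such a control channel could look like: the code
below answers a status query over a local socket (AF_UNIX in Python, which is
the same address family as AF_LOCAL). The one-line STATUS / RUNNING / IDLE
protocol and all function names are invented for illustration; igt_runner's
actual interface is not defined here.

```python
import os
import socket
import threading


def open_control_socket(sock_path):
    """Create a listening AF_UNIX stream socket bound to sock_path."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(sock_path)
    server.listen(1)
    server.settimeout(0.1)  # lets the serving loop notice stop_event
    return server


def serve_status(server, get_current_test, stop_event):
    """Answer STATUS queries until stop_event is set, then clean up."""
    sock_path = server.getsockname()
    try:
        while not stop_event.is_set():
            try:
                conn, _ = server.accept()
            except socket.timeout:
                continue
            with conn:
                request = conn.recv(64).decode().strip()
                if request == "STATUS":
                    test = get_current_test()
                    reply = f"RUNNING {test}" if test else "IDLE"
                else:
                    reply = "ERROR unknown command"
                conn.sendall(reply.encode() + b"\n")
    finally:
        server.close()
        os.unlink(sock_path)


def query_status(sock_path):
    """Client side: ask the runner whether a test is still running."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(sock_path)
        client.sendall(b"STATUS\n")
        return client.recv(128).decode().strip()
```

With something like this in place, a job that lost its ssh session could log
in again, call the equivalent of query_status(), and only trigger the
reboot-and-collect step once the answer is no longer RUNNING.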

Moving this bug to CI infra.

-- 
You are receiving this mail because:
You are the assignee for the bug.