[Bug 108456] userptr deadlock in Vulkan CTS: mmu_notifier vs workers vs struct_mutex

Wed Dec 5 01:04:02 UTC 2018

https://bugs.freedesktop.org/show_bug.cgi?id=108456

--- Comment #28 from Mark Janes <mark.a.janes at intel.com> ---
(In reply to Chris Wilson from comment #25)
> Also note that mesa kbl hangs were quite capable of locking up the system
> before v4.18... To fully protect ourselves we have to prevent any of the
> incremental mesa batches from continuing to execute after a hang, as the
> context contains none of the register settings or accrued GPU state that the
> batches depend upon.

This comment and my own statements in this bug have generated some confusion as
to how well the kernel recovers from a GPU hang.  I'd like to provide some
background data on the effectiveness of GPU hang recovery, from the i965 CI
runs.

Mesa CI runs production kernels 4.14 - 4.18.  Occasionally (<1% of runs), a
random GPU hang may occur on older systems (esp. SNB) or on newer systems due to
a bug in a Mesa developer's patch.  In the vast majority of cases, the system
recovers from the GPU hang.

In rare instances, Mesa developers test patches that are broken enough to cause
unrecoverable gpu hangs.  This happens a few times per month, requiring
manual/physical reboot on many systems (holding down power button or cutting
power).

drm-tip kernel currently generates a gpu hang on 80% of vulkancts runs with our
stable Mesa release.  Before the CI_DRM_5255 tag, 100% of the gpu hangs were
unrecoverable.  After the tag, the gpu hangs recover in the same way that the
production kernels do.

The drm-tip gpu hangs are separate from *this* bug, which is about deadlock. 
Unlike gpu hangs, we have no way to recover from the kernel deadlock documented
here.  The reboot command does work in this situation, so we can get the
machines back online in an automated way -- but the system will not
continue/finish the test workload.

I can't verify Chris's patches (which are on top of drm-tip) until the gpu hang
is fixed.  Chris, if you backport the patches to 4.18, I can test them in CI.
I imagine they will need to be backported anyway, since the deadlock has been
seen as far back as 4.14.
