[Intel-gfx] [PATCH] Revert "drm/i915: Propagate errors on awaiting already signaled fences"
Jason Ekstrand
jason at jlekstrand.net
Wed May 19 15:06:40 UTC 2021
Once we no longer rely on error propagation, I think there's a lot we
can rip out.
--Jason
On Wed, May 19, 2021 at 5:15 AM Daniel Vetter <daniel.vetter at ffwll.ch> wrote:
>
> From: Jason Ekstrand <jason at jlekstrand.net>
>
> This reverts commit 9e31c1fe45d555a948ff66f1f0e3fe1f83ca63f7. Ever
> since that commit, we've been having issues where a hang in one client
> can propagate to another. In particular, a hang in an app can propagate
> to the X server which causes the whole desktop to lock up.
>
> Error propagation along fences sounds like a good idea, but as your bug
> shows, it has surprising consequences, since propagating errors across
> security boundaries is not a good thing.
>
> What we do have is tracking of hangs on the ctx, and reporting of that
> information to userspace using RESET_STATS. That's how arb_robustness
> works. Also, if my understanding is still correct, the EIO from execbuf
> is what you get when your context is banned (because it is not
> recoverable or has had too many hangs). And in all these cases it's up
> to userspace to figure out what all is impacted and should be reported
> to the application; that's not on the kernel to guess and automatically
> propagate.
>
> What's more, we're also building more features on top of ctx error
> reporting with the RESET_STATS ioctl: encrypted buffers use the same
> mechanism, and the userspace fence wait also relies on it. So it is the
> path going forward for reporting gpu hangs and resets to userspace.
>
> So all together that's why I think we should just bury this idea again as
> not quite the direction we want to go in, hence why I think the revert is
> the right option here.
>
> Signed-off-by: Jason Ekstrand <jason.ekstrand at intel.com>
>
> v2: Augment commit message. Also restore Jason's sob that I
> accidentally lost.
>
> Signed-off-by: Jason Ekstrand <jason.ekstrand at intel.com> (v1)
> Reported-by: Marcin Slusarz <marcin.slusarz at intel.com>
> Cc: <stable at vger.kernel.org> # v5.6+
> Cc: Jason Ekstrand <jason.ekstrand at intel.com>
> Cc: Marcin Slusarz <marcin.slusarz at intel.com>
> Cc: Jon Bloomfield <jon.bloomfield at intel.com>
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3080
> Fixes: 9e31c1fe45d5 ("drm/i915: Propagate errors on awaiting already signaled fences")
> Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
> ---
> drivers/gpu/drm/i915/i915_request.c | 8 ++------
> 1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 970d8f4986bb..b796197c0772 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1426,10 +1426,8 @@ i915_request_await_execution(struct i915_request *rq,
>
>  	do {
>  		fence = *child++;
> -		if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
> -			i915_sw_fence_set_error_once(&rq->submit, fence->error);
> +		if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>  			continue;
> -		}
>
>  		if (fence->context == rq->fence.context)
>  			continue;
> @@ -1527,10 +1525,8 @@ i915_request_await_dma_fence(struct i915_request *rq, struct dma_fence *fence)
>
>  	do {
>  		fence = *child++;
> -		if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
> -			i915_sw_fence_set_error_once(&rq->submit, fence->error);
> +		if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>  			continue;
> -		}
>
>  		/*
>  		 * Requests on the same timeline are explicitly ordered, along
> --
> 2.31.0
>
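
For anyone skimming the diff, here is a rough userspace sketch of the
behaviour change. It is a toy model, not kernel code: toy_fence,
toy_request, toy_set_error_once and toy_await are made-up names that only
mimic the shape of the await loops in the hunks above, and the "first
non-zero error wins" rule is my assumption about what a set-error-once
helper does. With propagate=true the loop behaves like the code being
reverted, with propagate=false like the code after the revert.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins, NOT the kernel's struct dma_fence / struct i915_request. */
struct toy_fence {
	bool signaled;	/* models DMA_FENCE_FLAG_SIGNALED_BIT */
	int error;	/* 0 or a negative errno, like fence->error */
};

struct toy_request {
	int submit_error;	/* models the error carried by rq->submit */
	size_t waits;		/* unsignaled fences we would have to wait on */
};

/* Assumed "set once" semantics: the first non-zero error sticks. */
static void toy_set_error_once(struct toy_request *rq, int error)
{
	if (error && rq->submit_error == 0)
		rq->submit_error = error;
}

/*
 * Walk the fences a request depends on. Signaled fences are always
 * skipped; the only question is whether their error is copied into the
 * waiting request first (pre-revert) or ignored (post-revert).
 */
static void toy_await(struct toy_request *rq, const struct toy_fence *fences,
		      size_t count, bool propagate)
{
	for (size_t i = 0; i < count; i++) {
		if (fences[i].signaled) {
			if (propagate)
				toy_set_error_once(rq, fences[i].error);
			continue;
		}
		rq->waits++;
	}
}
```

In this model, a request that awaits an already-signaled fence carrying
-EIO from another client inherits -EIO when propagate is true and stays
clean when it is false. That inherited error is the cross-client leak the
commit message describes.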