[Intel-gfx] [PATCH] Revert "drm/i915: Propagate errors on awaiting already signaled fences"

Jason Ekstrand jason at jlekstrand.net
Wed May 19 19:01:07 UTC 2021


On May 19, 2021 12:16:15 Daniel Vetter <daniel.vetter at ffwll.ch> wrote:

> On Wed, May 19, 2021 at 5:06 PM Jason Ekstrand <jason at jlekstrand.net> wrote:
>>
>> Once we no longer rely on error propagation, I think there's a lot we
>> can rip out.
>
> I honestly did not find that much ... what did you uncover?

When I was digging through this earlier today, I think I convinced myself 
that the cmdparser is the only user of the entire fence error architecture. 
I may have missed something, though.

--Jason

>
> -Daniel
>
>>
>> --Jason
>>
>> On Wed, May 19, 2021 at 5:15 AM Daniel Vetter <daniel.vetter at ffwll.ch> wrote:
>>>
>>> From: Jason Ekstrand <jason at jlekstrand.net>
>>>
>>> This reverts commit 9e31c1fe45d555a948ff66f1f0e3fe1f83ca63f7.  Ever
>>> since that commit, we've been having issues where a hang in one client
>>> can propagate to another.  In particular, a hang in an app can propagate
>>> to the X server which causes the whole desktop to lock up.
>>>
>>> Error propagation along fences sound like a good idea, but as your bug
>>> shows, surprising consequences, since propagating errors across security
>>> boundaries is not a good thing.
>>>
>>> What we do have is track the hangs on the ctx, and report information to
>>> userspace using RESET_STATS. That's how arb_robustness works. Also, if my
>>> understanding is still correct, the EIO from execbuf is when your context
>>> is banned (because not recoverable or too many hangs). And in all these
>>> cases it's up to userspace to figure out what is all impacted and should
>>> be reported to the application, that's not on the kernel to guess and
>>> automatically propagate.
>>>
>>> What's more, we're also building more features on top of ctx error
>>> reporting with RESET_STATS ioctl: Encrypted buffers use the same, and the
>>> userspace fence wait also relies on that mechanism. So it is the path
>>> going forward for reporting gpu hangs and resets to userspace.
>>>
>>> So all together that's why I think we should just bury this idea again as
>>> not quite the direction we want to go to, hence why I think the revert is
>>> the right option here.Signed-off-by: Jason Ekstrand <jason.ekstrand at intel.com>
>>>
>>> v2: Augment commit message. Also restore Jason's sob that I
>>> accidentally lost.
>>>
>>> Signed-off-by: Jason Ekstrand <jason.ekstrand at intel.com> (v1)
>>> Reported-by: Marcin Slusarz <marcin.slusarz at intel.com>
>>> Cc: <stable at vger.kernel.org> # v5.6+
>>> Cc: Jason Ekstrand <jason.ekstrand at intel.com>
>>> Cc: Marcin Slusarz <marcin.slusarz at intel.com>
>>> Cc: Jon Bloomfield <jon.bloomfield at intel.com>
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3080
>>> Fixes: 9e31c1fe45d5 ("drm/i915: Propagate errors on awaiting already 
>>> signaled fences")
>>> Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
>>> ---
>>> drivers/gpu/drm/i915/i915_request.c | 8 ++------
>>> 1 file changed, 2 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_request.c 
>>> b/drivers/gpu/drm/i915/i915_request.c
>>> index 970d8f4986bb..b796197c0772 100644
>>> --- a/drivers/gpu/drm/i915/i915_request.c
>>> +++ b/drivers/gpu/drm/i915/i915_request.c
>>> @@ -1426,10 +1426,8 @@ i915_request_await_execution(struct i915_request *rq,
>>>
>>>  do {
>>>          fence = *child++;
>>> -               if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
>>> -                       i915_sw_fence_set_error_once(&rq->submit, 
>>> fence->error);
>>> +               if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>>                  continue;
>>> -               }
>>>
>>>          if (fence->context == rq->fence.context)
>>>                  continue;
>>> @@ -1527,10 +1525,8 @@ i915_request_await_dma_fence(struct i915_request 
>>> *rq, struct dma_fence *fence)
>>>
>>>  do {
>>>          fence = *child++;
>>> -               if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
>>> -                       i915_sw_fence_set_error_once(&rq->submit, 
>>> fence->error);
>>> +               if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>>                  continue;
>>> -               }
>>>
>>>          /*
>>>           * Requests on the same timeline are explicitly ordered, along
>>> --
>>> 2.31.0
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/intel-gfx/attachments/20210519/38005642/attachment.htm>


More information about the Intel-gfx mailing list