[Intel-gfx] [PATCH] drm/i915: Stop propagating fence errors by default

Tue May 11 09:05:27 UTC 2021

On 10/05/2021 16:55, Daniel Vetter wrote:
> On Fri, May 07, 2021 at 09:35:21AM +0100, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>
>> This is an alternative proposed fix for the below referenced bug report
>> where dma fence error propagation is causing an undesirable change in
>> behaviour post GPU hang/reset.
>>
>> The approach in this patch is to simply stop propagating all dma fence
>> errors by default since that seems to be the upstream ask.
>>
>> To handle the case where i915 needs error propagation for security, I add
>> a new dma fence flag DMA_FENCE_FLAG_PROPAGATE_ERROR and make use of it in
>> the command parsing chain only.
>>
>> It sounds like a plausible argument that fence propagation could be
>> useful, in which case a core flag to enable opt-in should be universally
>> useful.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>> Reported-by: Marcin Slusarz <marcin.slusarz at intel.com>
>> Reported-by: Miroslav Bendik
>> References: 9e31c1fe45d5 ("drm/i915: Propagate errors on awaiting already signaled fences")
>> References: https://gitlab.freedesktop.org/drm/intel/-/issues/3080
>> Cc: Jason Ekstrand <jason.ekstrand at intel.com>
>> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
>> ---
>>   drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 2 ++
>>   drivers/gpu/drm/i915/i915_sw_fence.c           | 8 ++++----
>>   drivers/gpu/drm/i915/i915_sw_fence.h           | 8 ++++++++
>>   include/linux/dma-fence.h                      | 1 +
> 
> I still don't like this, not least because we still introduce the concept
> of error propagation to dma-fence (but hey only in i915 code, which is
> exactly the kind of not-really-upstream approach we got a major chiding
> for).
> 
> The only thing this does is make it explicitly opt-in instead of opt-out,
> like the first fix. The right approach is imo still to just throw it out,
> and instead make the one error propagation we really need very, very
> explicit. Instead of hiding it behind lots of magic.
> 
> The one error propagation we need is when the cmd parser work fails, it
> must cancel its corresponding request to make sure the batchbuffer
> doesn't run. This should require about 2 lines in total:
> 
> - one line to store the request so that the cmd parser work can access it.
>    No refcounting needed, because the request cannot even start (much
>    less get freed) before the cmd parser has signalled its fence
> 
> - one line to kill the request if the parsing fails. Maybe 2 if you
>    include the if condition. I have no idea how that's done since I'm
>    honestly lost how the i915 scheduler decides whether to run a batch or
>    not. I'm guessing we have a version of this for the ringbuffer and the
>    execlist backend (if not maybe gen7 cmdparser is broken?)
> 
> I don't see any need for magic behind-the-scenes propagation of such a
> security critical error. Especially when that error propagation thing
> caused security bugs of its own, is an i915-only feature, and not
> motivated by any userspace/uapi requirements at all.
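
For reference, the explicit alternative described above could be modelled
roughly as follows. This is only a user-space sketch under my own
assumptions: struct demo_request, struct demo_parser_work and the demo_*
helpers are hypothetical names and do not correspond to real i915
functions, and how the scheduler actually skips a cancelled request is
left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request; not the i915 request structure. */
struct demo_request {
	int error;      /* 0, or a negative errno once the request is killed */
	bool submitted; /* true once the batch was allowed to run */
};

/* Hypothetical stand-in for the cmd parser work item. */
struct demo_parser_work {
	/* Plain pointer: the request cannot start before the parser finishes. */
	struct demo_request *rq;
};

/* "One line to store the request" so the parser work can reach it. */
static void demo_parser_work_init(struct demo_parser_work *pw,
				  struct demo_request *rq)
{
	assert(pw != NULL && rq != NULL);
	pw->rq = rq;
}

/* "One line to kill the request" if parsing failed, plus the condition. */
static void demo_parser_work_finish(struct demo_parser_work *pw,
				    int parse_result)
{
	assert(parse_result <= 0);
	if (parse_result < 0)
		pw->rq->error = parse_result;
}

/* The submission path refuses to run a request that was killed. */
static bool demo_request_submit(struct demo_request *rq)
{
	if (rq->error)
		return false;
	rq->submitted = true;
	return true;
}
```

The point of the model is that the only error path is the direct one from
the parser work to its own request; nothing is inherited through fences.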

I took this approach because to me propagating errors sounds more 
logical than ignoring them and I was arguing in the commit message that 
the infrastructure to enable that could be put in place as opt-in.
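
To make the opt-in idea concrete, here is a minimal user-space sketch of
it. It is a toy model and not the kernel code: struct toy_fence, struct
toy_waiter and the toy_* helpers are made-up names, and
TOY_FENCE_FLAG_PROPAGATE_ERROR merely stands in for the
DMA_FENCE_FLAG_PROPAGATE_ERROR flag proposed in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy flags; they only model the idea, not the real dma-fence flag bits. */
enum toy_fence_flag {
	TOY_FENCE_FLAG_SIGNALED        = 1u << 0,
	TOY_FENCE_FLAG_PROPAGATE_ERROR = 1u << 1, /* opt-in set by the signaler */
};

struct toy_fence {
	unsigned int flags;
	int error; /* 0 or a negative errno */
};

struct toy_waiter {
	int error; /* error inherited from a dependency, if any */
};

/* Signal the fence, recording an optional error. */
static void toy_fence_signal(struct toy_fence *f, int error)
{
	assert(f != NULL && error <= 0);
	f->error = error;
	f->flags |= TOY_FENCE_FLAG_SIGNALED;
}

/*
 * Called when a dependency of the waiter has signaled. The error is
 * inherited only if the fence opted in; by default it is dropped.
 */
static void toy_waiter_complete(struct toy_waiter *w,
				const struct toy_fence *f)
{
	assert(f->flags & TOY_FENCE_FLAG_SIGNALED);
	if (f->error && (f->flags & TOY_FENCE_FLAG_PROPAGATE_ERROR) &&
	    !w->error)
		w->error = f->error;
}
```

In this model a fence signaled with -EIO after a GPU reset leaves its
waiters untouched, while a fence carrying the flag (the command parser
case) passes its error on.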

I also do not see a lot of magic in this patch. The only thing is that
the logic should potentially be inverted, so that the waiter marks itself
as interested in receiving errors. That would probably make even more
sense as a core concept.
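
A rough user-space sketch of that inverted variant, again purely
illustrative: struct sketch_fence, struct sketch_waiter, the wants_errors
field and sketch_waiter_on_signal() are hypothetical names, not existing
dma-fence or i915 interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fence: it only records whether and how it signaled. */
struct sketch_fence {
	bool signaled;
	int error; /* 0 or a negative errno */
};

/* Hypothetical waiter: the opt-in lives here instead of on the fence. */
struct sketch_waiter {
	bool wants_errors; /* set by a waiter interested in fence errors */
	int error;         /* first error received, if any */
};

/* Deliver a signaled fence to the waiter, honouring the waiter's choice. */
static void sketch_waiter_on_signal(struct sketch_waiter *w,
				    const struct sketch_fence *f)
{
	assert(w != NULL && f != NULL && f->signaled);
	if (w->wants_errors && f->error && !w->error)
		w->error = f->error;
}
```

Here the same failed fence can be seen as an error by one waiter and
ignored by another, which is the difference from putting the flag on the
fence itself.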

Has there been a wider discussion on this topic in the past? I am
curious to know whether other drivers could be interested, even if
propagation is currently i915 only.

Note that it adds almost nothing to the dma-buf common code: just a
single flag and, at some point, the (currently missing) documentation
for that flag.

Regards,

Tvrtko