[PATCH] drm/i915: Stop propagating fence errors by default

Mon May 17 15:33:45 UTC 2021

On 17/05/2021 16:12, Daniel Vetter wrote:
> On Tue, May 11, 2021 at 10:05:27AM +0100, Tvrtko Ursulin wrote:
>>
>> On 10/05/2021 16:55, Daniel Vetter wrote:
>>> On Fri, May 07, 2021 at 09:35:21AM +0100, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>
>>>> This is an alternative proposed fix for the below references bug report
>>>> where dma fence error propagation is causing undesirable change in
>>>> behaviour post GPU hang/reset.
>>>>
>>>> Approach in this patch is to simply stop propagating all dma fence errors
>>>> by default since that seems to be the upstream ask.
>>>>
>>>> To handle the case where i915 needs error propagation for security, I add
>>>> a new dma fence flag DMA_FENCE_FLAG_PROPAGATE_ERROR and make use of it in
>>>> the command parsing chain only.
>>>>
>>>> It sounds a plausible argument that fence propagation could be useful in
>>>> which case a core flag to enable opt-in should be universally useful.
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>> Reported-by: Marcin Slusarz <marcin.slusarz at intel.com>
>>>> Reported-by: Miroslav Bendik
>>>> References: 9e31c1fe45d5 ("drm/i915: Propagate errors on awaiting already signaled fences")
>>>> References: https://gitlab.freedesktop.org/drm/intel/-/issues/3080
>>>> Cc: Jason Ekstrand <jason.ekstrand at intel.com>
>>>> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
>>>> ---
>>>>    drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 2 ++
>>>>    drivers/gpu/drm/i915/i915_sw_fence.c           | 8 ++++----
>>>>    drivers/gpu/drm/i915/i915_sw_fence.h           | 8 ++++++++
>>>>    include/linux/dma-fence.h                      | 1 +
>>>
>>> I still don't like this, least because we still introduce the concept of
>>> error propagation to dma-fence (but hey only in i915 code, which is
>>> exactly the kind of not-really-upstream approach we got a major chiding
>>> for).
>>>
>>> The only thing this does is make it explicitly opt-in instead opt-out,
>>> like the first fix. The right approach is imo still to just throw it out,
>>> and instead make the one error propagation we really need very, very
>>> explicit. Instead of hiding it behind lots of magic.
>>>
>>> The one error propagation we need is when the cmd parser work fails, it
>>> must cancel it's corresponding request to make sure the batchbuffer
>>> doesn't run. This should require about 2 lines in total:
>>>
>>> - one line to store the request so that the cmd parser work can access it.
>>>     No refcounting needed, because the the request cannot even start (much
>>>     less get freed) before the cmd parser has singalled its fence
>>>
>>> - one line to kill the request if the parsing fails. Maybe 2 if you
>>>     include the if condition. I have no idea how that's done since I'm
>>>     honestly lost how the i915 scheduler decides whether to run a batch or
>>>     not. I'm guessing we have a version of this for the ringbuffer and the
>>>     execlist backend (if not maybe gen7 cmdparser is broken?)
>>>
>>> I don't see any need for magic behind-the-scenes propagation of such a
>>> security critical error. Especially when that error propagation thing
>>> caused security bugs of its own, is an i915-only feature, and not
>>> motivated by any userspace/uapi requirements at all.
>>
>> I took this approach because to me propagating errors sounds more logical
>> than ignoring them and I was arguing in the commit message that the
>> infrastructure to enable that could be put in place as opt-in.
>>
>> I also do not see a lot of magic in this patch. Only thing, potentially the
>> logic should be inverted so that the waiter marks itself as interested in
>> receiving errors. That would probably make even more sense as a core
>> concept.
>>
>> Has there been a wider discussion on this topic in the past? I am curious to
>> know, even if propagation currently is i915 only, could other drivers be
>> interested.
> 
> There hasn't been. i915-gem team decided "this is a cool concept", which
> resulted in a security bug. Now we're a few months in arguing whether a
> cool-looking concept that leads to a security bug is maybe a good idea,
> and whether we should sneak it in as a core concept to dma-buf.h without
> any wider discussion on the concept.
> 
>> Note that it adds almost nothing to the dma-buf common code about a single
>> flag, and at some point (currently missing) documentation on the very flag.
> 
> This is really not how upstream collaboration works, and it needs to stop.
> 
> If you want this, start another thread arguing why this is a good idea,
> fully decoupled from the security fix here.

When I asked you whether you know there were past discussions on this 
topic, clearly the point of that was to figure out whether a new 
discussion needs to be started, or I need to go and read an existing one 
to get up to speed.

I don't know how you interpreted that as an attempt to sneak anything 
in. And I don't know how I could have reliably figured out the answer to 
that question without asking. So colour me confused.

To clarify on the security issue part - are you talking about 
https://gitlab.freedesktop.org/drm/intel/-/issues/3080, or the other 
security issue, the one which would be caused by simply reverting the 
error propagation in i915?

Regards,

Tvrtko