[PATCH] nouveau: offload fence uevents work to workqueue

Danilo Krummrich dakr at redhat.com
Mon Feb 19 10:14:11 UTC 2024


On 2/16/24 17:41, Daniel Vetter wrote:
> On Tue, Feb 13, 2024 at 06:39:20PM +0100, Danilo Krummrich wrote:
>> On 2/9/24 19:52, Daniel Vetter wrote:
>>> On Fri, Feb 09, 2024 at 06:41:32PM +0100, Danilo Krummrich wrote:
>>>> On 2/6/24 15:03, Daniel Vetter wrote:
>>>>> On Mon, Feb 05, 2024 at 11:00:04PM +0100, Danilo Krummrich wrote:
>>>>>> On 2/5/24 22:08, Dave Airlie wrote:
>>>>>>> On Tue, 6 Feb 2024 at 02:22, Danilo Krummrich <dakr at redhat.com> wrote:
>>>>>>>>
>>>>>>>> On 1/29/24 02:50, Dave Airlie wrote:
>>>>>>>>> From: Dave Airlie <airlied at redhat.com>
>>>>>>>>>
>>>>>>>>> This should break the deadlock between the fctx lock and the irq lock.
>>>>>>>>>
>>>>>>>>> This offloads the processing of the work from the irq into a workqueue.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Dave Airlie <airlied at redhat.com>
>>>>>>>>
>>>>>>>> Nouveau's scheduler uses a dedicated wq, hence from this perspective it's
>>>>>>>> safe deferring fence signalling to the kernel global wq. However, I wonder
>>>>>>>> if we could create deadlocks by building dependency chains into other
>>>>>>>> drivers / kernel code that, by chance, makes use of the kernel global wq as
>>>>>>>> well.
>>>>>>>>
>>>>>>>> Admittedly, even so, it's gonna be extremely unlikely given that
>>>>>>>> WQ_MAX_ACTIVE == 512. But maybe it'd be safer to use a dedicated wq.
>>>>>>>>
>>>>>>>> Also, do we need to CC stable?
>>>>>>>
>>>>>>> I pushed this to Linus at the end of last week, since the hangs in 6.7
>>>>>>> take out the complete system and I wanted it in stable.
>>>>>>>
>>>>>>> It might be safer to use a dedicated wq, is the concern someone is
>>>>>>> waiting on a fence in a workqueue somewhere else so we will never
>>>>>>> signal it?
>>>>>>
>>>>>> Yes, if some other work is waiting for this fence (or something else in the same
>>>>>> dependency chain) to signal it can prevent executing the work signaling this fence,
>>>>>> in case both are scheduled on the same wq. As mentioned, with the kernel global wq
>>>>>> this would be rather unlikely to lead to an actual stall with WQ_MAX_ACTIVE == 512,
>>>>>> but formally the race condition exists. I guess a malicious attacker could try to
>>>>>> intentionally push jobs directly or indirectly depending on this fence to a driver
>>>>>> which queues them up on a scheduler using the kernel global wq.
>>>>>
>>>>> I think if you add dma_fence_signalling annotations (aside, there's some
>>>>> patch from iirc Thomas Hellstrom to improve them and cut down on some
>>>>> false positives, but I lost track) then I think you won't get any splats
>>>>> because the wq subsystem assumes that WQ_MAX_ACTIVE is close enough to
>>>>> infinity to not matter.
>>>>
>>>> As mentioned, for the kernel global wq it's 512. (Intentionally) feeding the kernel
>>>> with enough jobs to provoke a deadlock doesn't seem impossible to me.
>>>>
>>>> I think it'd be safer to just establish not to use the kernel global wq for executing
>>>> work in the fence signalling critical path.
>>>>
>>>> We could also run into similar problems with a dedicated wq, e.g. when drivers share
>>>> a wq between drm_gpu_scheduler instances (see [1]), however, I'm not sure we can catch
>>>> that with lockdep.
>>>
>>> I think if you want to fix it perfectly you'd need to set the wq's max
>>> number of work items to the number of engines (or for dynamic/fw scheduled
>>> engines to the number of contexts) you have. Or whatever limit to the
>>> number of parallel timelines there is.
>>>
>>> I guess this would need a new wq function to update? drm/sched code could
>>> then set that for drivers, so drivers cannot get this wrong.
>>
>> Not sure I can follow. The scheduler instance might be per context and bind
>> queue. In this case it gets the shared wq passed, but doesn't know how many
>> other scheduler instances share the same one.
> 
> Yeah that's why maybe more of that logic should be in the drm/sched code
> instead of drivers just cleverly using what's there ...

Agree, that's gonna be a huge design change though.
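For illustration, a dedicated queue along the lines discussed above could look
roughly like this (a sketch only, not compile-tested; the wq name and the
WQ_HIGHPRI | WQ_MEM_RECLAIM flags are my guesses, not taken from the patch):

```c
/* Sketch: give fence signalling its own workqueue instead of relying on
 * the kernel global one, so cross-driver dependency chains through the
 * system wq cannot stall it. */
struct workqueue_struct *fence_wq;

/* driver init */
fence_wq = alloc_workqueue("nouveau_fence", WQ_HIGHPRI | WQ_MEM_RECLAIM, 0);
if (!fence_wq)
	return -ENOMEM;

/* irq path: queue on the dedicated wq rather than the system wq */
queue_work(fence_wq, &fence->uevent_work);

/* driver teardown */
destroy_workqueue(fence_wq);
```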

> 
>> Additionally, there might be drivers not using the DRM scheduler for bind
>> queues at all (I think Xe does not).
> 
> Uh ... maybe we should do this the same across all drivers? But I also
> thought that Xe was flat-out synchronous and only had an out-fence since
> you need a userspace thread in mesa for vk semantics anyway.
> 
> If xe hand-rolled a scheduler I'm not going to be very amused.

I really don't know the details, but there are similarities at least.

There is the rebind work, which seems to be called in some VM_BIND cases and
in the context of an EXEC ioctl, and which seems to signal a fence. It seems
valid to not stuff this into the scheduler.

There are also cases like this one, where we have fence signalling critical code
in wqs outside the context of a scheduler instance.
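To make the hazard concrete, here's a small userspace model (plain pthreads,
nothing nouveau-specific; all names are made up for illustration): a
single-threaded "workqueue" stands in for a wq whose max_active is already
exhausted, and the toy fence only signals if the signalling job runs somewhere
other than behind the waiter.

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy fence: a flag protected by a mutex/condvar. */
struct fence {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool signalled;
};

static struct fence fence = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false
};

static void fence_wait(void)
{
	pthread_mutex_lock(&fence.lock);
	while (!fence.signalled)
		pthread_cond_wait(&fence.cond, &fence.lock);
	pthread_mutex_unlock(&fence.lock);
}

static void fence_signal(void)
{
	pthread_mutex_lock(&fence.lock);
	fence.signalled = true;
	pthread_cond_broadcast(&fence.cond);
	pthread_mutex_unlock(&fence.lock);
}

/* Toy single-threaded "workqueue": runs its jobs strictly in order,
 * like a wq whose max_active is already used up by the first job. */
typedef void (*job_fn)(void);

static void *wq_thread(void *arg)
{
	job_fn *jobs = arg;
	for (int i = 0; jobs[i]; i++)
		jobs[i]();
	return NULL;
}

/* Returns 0 once the fence has signalled. */
int run_demo(void)
{
	pthread_t shared, dedicated;
	/* The waiter sits on the shared queue ... */
	job_fn shared_jobs[] = { fence_wait, NULL };
	/* ... while the signaller runs on a dedicated queue. Had it been
	 * queued *behind* fence_wait on the shared queue instead, it would
	 * never run and the wait above would stall forever. */
	job_fn dedicated_jobs[] = { fence_signal, NULL };

	pthread_create(&shared, NULL, wq_thread, shared_jobs);
	pthread_create(&dedicated, NULL, wq_thread, dedicated_jobs);
	pthread_join(shared, NULL);
	pthread_join(dedicated, NULL);
	return fence.signalled ? 0 : 1;
}
```

With a real wq the queue isn't capacity one, of course, but the same ordering
argument applies once all max_active slots are occupied by waiters.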

> 
>>> If we don't do something like that then I'm not sure there's really much
>>> benefit - instead of carefully timing 512 dma_fence on the global wq you
>>> just need to create a pile of contexts (at least on intel's guc that's
>>> absolutely no issue) and then carefully time them.
>>
>> Well, that's true. I'd still argue that there is a slight difference. From a
>> driver's isolated perspective using the global kernel wq might be entirely fine,
>> as in this patch. However, in combination with another driver doing the same
>> thing, things can blow up. For the case you illustrated it's at least possible
>> to spot it from a driver's perspective.
>>
>>>
>>> I still feel like we have bigger fish to fry ... But might be worth the
>>> effort to at least make the parallel timeline limit a lot more explicit?
>>
>> I agree, and it'd be great if we could find a solution such that these bugs
>> are detected systematically (e.g. through lockdep), but maybe we can start to at least
>> document that we should never use the kernel global wq and where we need to be
>> careful in sharing driver wqs.
> 
> Yeah I guess the above two are other reasons why maybe we need a bit more
> structure in scheduler apis instead of just allowing drivers to hand in
> shared wq pointers. Something like a struct drm_sched_domain, which
> contains the wq + a list of drm_sched for it. Would also make stuff like
> reliably stopping the right amount of schedulers in tdr much more robust.

Yeah, that sounds like a good starting point. I can also add a corresponding entry
to the DRM TODO list explaining the issue.
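For reference, this is roughly how I'd read the drm_sched_domain suggestion
(a sketch to capture the idea only; the field names are invented, not a
proposal of the actual API):

```c
/* Sketch: tie a shared workqueue to the schedulers using it, so the
 * drm/sched core can reason about (and cap) the number of parallel
 * timelines behind a single wq, and stop the right set of schedulers
 * in tdr. */
struct drm_sched_domain {
	struct workqueue_struct	*wq;
	struct list_head	 schedulers;	/* of struct drm_gpu_scheduler */
	struct mutex		 lock;
	unsigned int		 max_timelines;
};
```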

- Danilo

> -Sima
> 
>>
>> - Danilo
>>
>>>
>>> Cheers, Sima
>>>
>>>>
>>>> [1] https://elixir.bootlin.com/linux/v6.8-rc3/source/drivers/gpu/drm/nouveau/nouveau_drm.c#L313
>>>>
>>>>>
>>>>> I'm not sure we should care differently, but I guess it'd be good to
>>>>> annotate it all in case the wq subsystem's idea of how real such
>>>>> deadlocks are changes.
>>>>>
>>>>> Also Tejun is on a mission to get rid of all the global wq flushes, so there
>>>>> should also be no source of deadlocks from that kind of cross-driver
>>>>> dependency. Or at least shouldn't be in the future, I'm not sure it all
>>>>> landed.
>>>>> -Sima
>>>>
>>>
>>
> 
