[PATCH v5 02/16] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr

Tue Jun 29 11:24:53 UTC 2021

Am 29.06.21 um 13:18 schrieb Boris Brezillon:
> Hi Christian,
>
> On Tue, 29 Jun 2021 13:03:58 +0200
> Christian König <christian.koenig at amd.com> wrote:
>
>> Am 29.06.21 um 09:34 schrieb Boris Brezillon:
>>> Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
>>> reset. This leads to extra complexity when we need to synchronize timeout
>>> works with the reset work. One solution to address that is to have an
>>> ordered workqueue at the driver level that will be used by the different
>>> schedulers to queue their timeout work. Thanks to the serialization
>>> provided by the ordered workqueue we are guaranteed that timeout
>>> handlers are executed sequentially, and can thus easily reset the GPU
>>> from the timeout handler without extra synchronization.
>> Well, we had already tried this and it didn't worked the way it is expected.
>>
>> The major problem is that you not only want to serialize the queue, but
>> rather have a single reset for all queues.
>>
>> Otherwise you schedule multiple resets for each hardware queue. E.g. for
>> your 3 hardware queues you would reset the GPU 3 times if all of them
>> time out at the same time (which is rather likely).
>>
>> Using a single delayed work item doesn't work either because you then
>> only have one timeout.
>>
>> What could be done is to cancel all delayed work items from all stopped
>> schedulers.
> drm_sched_stop() does that already, and since we call drm_sched_stop()
> on all queues in the timeout handler, we end up with only one global
> reset happening even if several queues report a timeout at the same
> time.

Ah, nice. Yeah, in this case it should indeed work as expected.

Feel free to add an Acked-by: Christian König <christian.koenig at amd.com> 
to it.

Regards,
Christian.

>
> Regards,
>
> Boris