[PATCH v5 02/16] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr

Tue Sep 7 18:53:58 UTC 2021

On 2021-06-29 7:24 a.m., Christian König wrote:

> Am 29.06.21 um 13:18 schrieb Boris Brezillon:
>> Hi Christian,
>>
>> On Tue, 29 Jun 2021 13:03:58 +0200
>> Christian König <christian.koenig at amd.com> wrote:
>>
>>> Am 29.06.21 um 09:34 schrieb Boris Brezillon:
>>>> Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
>>>> reset. This leads to extra complexity when we need to synchronize timeout
>>>> works with the reset work. One solution to address that is to have an
>>>> ordered workqueue at the driver level that will be used by the different
>>>> schedulers to queue their timeout work. Thanks to the serialization
>>>> provided by the ordered workqueue we are guaranteed that timeout
>>>> handlers are executed sequentially, and can thus easily reset the GPU
>>>> from the timeout handler without extra synchronization.
>>> Well, we already tried this and it didn't work as expected.
>>>
>>> The major problem is that you not only want to serialize the queue, but
>>> rather have a single reset for all queues.
>>>
>>> Otherwise you schedule a separate reset for each hardware queue. E.g. for
>>> your 3 hardware queues you would reset the GPU 3 times if all of them
>>> time out at the same time (which is rather likely).
>>>
>>> Using a single delayed work item doesn't work either because you then
>>> only have one timeout.
>>>
>>> What could be done is to cancel all delayed work items from all stopped
>>> schedulers.
>> drm_sched_stop() does that already, and since we call drm_sched_stop()
>> on all queues in the timeout handler, we end up with only one global
>> reset happening even if several queues report a timeout at the same
>> time.
>
> Ah, nice. Yeah, in this case it should indeed work as expected.
>
> Feel free to add an Acked-by: Christian König 
> <christian.koenig at amd.com> to it.
>
> Regards,
> Christian.
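
To make the point in the exchange above concrete, here is a small userspace
model, not kernel code. All names in it (model_queue, model_timeout_handler,
model_run_ordered) are hypothetical. It assumes that an ordered workqueue runs
pending items one at a time, and that stopping all schedulers from the handler
cancels the timeout work still pending on the other queues, as Boris describes
for drm_sched_stop().

```c
#include <stdbool.h>

#define NUM_QUEUES 3

/* Hypothetical stand-in for one scheduler's delayed timeout work. */
struct model_queue {
	bool timeout_pending;
};

static struct model_queue queues[NUM_QUEUES];
static int gpu_resets;

/*
 * Models the timeout handler. When cancel_others is true it clears the
 * pending timeout of every queue first, mirroring the effect described
 * for calling drm_sched_stop() on all queues, then does one global reset.
 */
static void model_timeout_handler(bool cancel_others)
{
	int i;

	if (cancel_others)
		for (i = 0; i < NUM_QUEUES; i++)
			queues[i].timeout_pending = false;

	gpu_resets++;
}

/*
 * Models an ordered workqueue: all queues time out at once, and the
 * pending items are executed strictly one after another.
 * Returns the number of GPU resets that were performed.
 */
static int model_run_ordered(bool cancel_others)
{
	int i;

	gpu_resets = 0;
	for (i = 0; i < NUM_QUEUES; i++)
		queues[i].timeout_pending = true;

	for (i = 0; i < NUM_QUEUES; i++) {
		if (!queues[i].timeout_pending)
			continue;	/* this item was cancelled earlier */
		queues[i].timeout_pending = false;	/* item dequeued */
		model_timeout_handler(cancel_others);
	}

	return gpu_resets;
}
```

In this model, model_run_ordered(true) performs a single reset, while
model_run_ordered(false) performs one reset per queue, which is the case
Christian was worried about.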

It seems to me that for this to work we need to change cancel_delayed_work()
to cancel_delayed_work_sync(), so that not only are pending timeout handlers
cancelled, but any handler already in progress is waited for and rearming is
prevented. The call should also move to right after kthread_park(), before we
start touching the pending list.
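
The ordering I have in mind can be sketched with a userspace mock. This is
only an illustration of the call order, not the actual drm_sched_stop() body:
the stub_* functions and model_sched_stop() are hypothetical stand-ins that
just record which step ran, and the sketch does not model the case where the
caller is itself running inside that scheduler's timeout work.

```c
#include <string.h>

/* Records the order in which the stubbed steps are called. */
static char call_log[128];

static void log_call(const char *name)
{
	strncat(call_log, name, sizeof(call_log) - strlen(call_log) - 1);
	strncat(call_log, ";", sizeof(call_log) - strlen(call_log) - 1);
}

/* Hypothetical stubs standing in for the real kernel calls. */
static void stub_kthread_park(void)
{
	log_call("kthread_park");
}

static void stub_cancel_delayed_work_sync(void)
{
	log_call("cancel_delayed_work_sync");
}

static void stub_walk_pending_list(void)
{
	log_call("pending_list");
}

/*
 * Model of the proposed ordering: park the scheduler thread, then cancel
 * the timeout work and wait for a running handler, and only afterwards
 * touch the pending list.
 */
static const char *model_sched_stop(void)
{
	call_log[0] = '\0';
	stub_kthread_park();
	stub_cancel_delayed_work_sync();
	stub_walk_pending_list();
	return call_log;
}
```

The point of the ordering is that no timeout handler can still be running, or
be rearmed, by the time the pending list is walked.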

Andrey

>
>>
>> Regards,
>>
>> Boris
>