[PATCH v5 02/16] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr

Tue Jun 29 11:18:58 UTC 2021

Hi Christian,

On Tue, 29 Jun 2021 13:03:58 +0200
Christian König <christian.koenig at amd.com> wrote:

> Am 29.06.21 um 09:34 schrieb Boris Brezillon:
> > Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
> > reset. This leads to extra complexity when we need to synchronize timeout
> > works with the reset work. One solution to address that is to have an
> > ordered workqueue at the driver level that will be used by the different
> > schedulers to queue their timeout work. Thanks to the serialization
> > provided by the ordered workqueue we are guaranteed that timeout
> > handlers are executed sequentially, and can thus easily reset the GPU
> > from the timeout handler without extra synchronization.  
> 
> Well, we had already tried this and it didn't work the way it was expected to.
> 
> The major problem is that you don't just want to serialize the queues, 
> you want a single reset for all of them.
> 
> Otherwise you schedule multiple resets for each hardware queue. E.g. for 
> your 3 hardware queues you would reset the GPU 3 times if all of them 
> time out at the same time (which is rather likely).
> 
> Using a single delayed work item doesn't work either because you then 
> only have one timeout.
> 
> What could be done is to cancel all delayed work items from all stopped 
> schedulers.

drm_sched_stop() does that already, and since we call drm_sched_stop()
on all queues in the timeout handler, we end up with only one global
reset happening even if several queues report a timeout at the same
time.
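
To make the argument concrete, here is a rough userspace model of that
behaviour. It is only a sketch, not the actual drm_sched or driver code:
every fake_* name is made up for illustration, a boolean flag stands in
for the pending delayed timeout work, and a plain loop stands in for the
ordered workqueue.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_QUEUES 3

/* Hypothetical stand-in for one scheduler (one per hardware queue). */
struct fake_sched {
	bool timeout_pending;	/* models a pending delayed timeout work */
};

/* Hypothetical stand-in for the device: 3 queues, one global reset. */
struct fake_gpu {
	struct fake_sched queues[NUM_QUEUES];
	unsigned int reset_count;
};

/* Models only the "cancel the pending timeout work" part of a stop. */
static void fake_sched_stop(struct fake_sched *sched)
{
	sched->timeout_pending = false;
}

/* Models a timeout handler: stop every queue, then reset the GPU once. */
static void fake_timeout_handler(struct fake_gpu *gpu)
{
	size_t i;

	for (i = 0; i < NUM_QUEUES; i++)
		fake_sched_stop(&gpu->queues[i]);

	gpu->reset_count++;
}

/*
 * Models the ordered workqueue: pending timeout works run strictly one
 * after the other. A work cancelled by an earlier handler never runs.
 */
static void fake_run_ordered_wq(struct fake_gpu *gpu)
{
	size_t i;

	for (i = 0; i < NUM_QUEUES; i++) {
		if (!gpu->queues[i].timeout_pending)
			continue;

		/* The work is dequeued and starts executing. */
		gpu->queues[i].timeout_pending = false;
		fake_timeout_handler(gpu);
	}
}

/* All queues time out at once; returns the number of resets performed. */
unsigned int fake_simultaneous_timeouts(void)
{
	struct fake_gpu gpu = { .reset_count = 0 };
	size_t i;

	for (i = 0; i < NUM_QUEUES; i++)
		gpu.queues[i].timeout_pending = true;

	fake_run_ordered_wq(&gpu);

	return gpu.reset_count;
}
```

In this model fake_simultaneous_timeouts() reports a single reset: the
first handler to run cancels the two other pending timeouts before
resetting. Without the serialization the handlers could race with each
other, which is the extra synchronization the ordered workqueue is meant
to avoid.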

Regards,

Boris