[PATCH] drm/sched: Drain all entities in DRM sched run job worker
Luben Tuikov
ltuikov89 at gmail.com
Mon Jan 29 17:10:52 UTC 2024
On 2024-01-29 02:44, Christian König wrote:
> Am 26.01.24 um 17:29 schrieb Matthew Brost:
>> On Fri, Jan 26, 2024 at 11:32:57AM +0100, Christian König wrote:
>>> Am 25.01.24 um 18:30 schrieb Matthew Brost:
>>>> On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
>>>>> Am 24.01.24 um 22:08 schrieb Matthew Brost:
>>>>>> All entities must be drained in the DRM scheduler run job worker to
>>>>>> avoid the following case: an entity is found that is ready, no job is
>>>>>> found ready on that entity, and the run job worker goes idle while other
>>>>>> entities + jobs are ready. Draining all ready entities (i.e. looping over
>>>>>> all ready entities) in the run job worker ensures all jobs that are ready
>>>>>> will be scheduled.
>>>>> That doesn't make sense. drm_sched_select_entity() only returns entities
>>>>> which are "ready", i.e. have a job to run.
>>>>>
>>>> That is what I thought too, hence my original design, but it is not
>>>> exactly true. Let me explain.
>>>>
>>>> drm_sched_select_entity() returns an entity with a non-empty spsc queue
>>>> (job in queue) and no *current* waiting dependencies [1]. Dependencies for
>>>> an entity can be added when drm_sched_entity_pop_job() is called [2][3],
>>>> returning a NULL job. Thus we can get into a scenario where two entities
>>>> A and B both have jobs and no current dependencies. A's job is waiting on
>>>> B's job, entity A gets selected first, a dependency gets installed in
>>>> drm_sched_entity_pop_job(), the run work goes idle, and now we deadlock.
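>>>>
>>>> Abridged, the failing flow looks roughly like this (a sketch against the
>>>> current worker, not the exact code; the path that actually submits a
>>>> popped job is elided):
>>>>
>>>>     /* drm_sched_run_job_work(), simplified */
>>>>     entity = drm_sched_select_entity(sched);  /* picks A; B also looks ready */
>>>>     if (!entity)
>>>>             return;
>>>>
>>>>     sched_job = drm_sched_entity_pop_job(entity);
>>>>     if (!sched_job) {
>>>>             /* A's job now depends on B's unsubmitted job, so NULL is
>>>>              * returned and a dependency gets installed. The worker
>>>>              * returns here without re-queuing itself, B is never
>>>>              * selected, the dependency never signals -> deadlock.
>>>>              */
>>>>             complete_all(&entity->entity_idle);
>>>>             return;
>>>>     }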
>>> And here is the real problem: the run work doesn't go idle at that moment.
>>>
>>> drm_sched_run_job_work() should restart itself until there is either no
>>> more space in the ring buffer or it can't find a ready entity any more.
>>>
>>> At least that was the original design when that was all still driven by a
>>> kthread.
>>>
>>> It may well be that we messed this up when switching from a kthread to a
>>> work item.
>>>
>> Right, that is what this patch does - the run worker does not go idle until
>> no ready entities are found. That was incorrect in the original patch and is
>> fixed here. Do you have any issues with this fix? It has been tested three
>> times and clearly fixes the issue.
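>>
>> Abridged, the fix turns the one-shot selection into a drain loop, along
>> these lines (a sketch, not the exact diff; the body that actually runs a
>> popped job is elided):
>>
>>     /* drm_sched_run_job_work(), with the drain loop */
>>     while ((entity = drm_sched_select_entity(sched))) {
>>             sched_job = drm_sched_entity_pop_job(entity);
>>             if (!sched_job) {
>>                     complete_all(&entity->entity_idle);
>>                     continue;       /* try the next ready entity */
>>             }
>>
>>             /* ... run sched_job as before ... */
>>     }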
>
> Ah! Yes, in this case the patch here is a little bit ugly as well.
>
> The original idea was that run_job restarts itself, so that we are able to
> pause the submission without it searching for yet another entity to submit.
>
> I strongly suggest replacing the while loop with a call to
> drm_sched_run_job_queue(), so that when the entity can't provide a job we
> just re-queue the run-job work.
I agree with Christian. This more closely preserves the original design
of the GPU schedulers, so we should go with that.
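
For reference, the re-queue approach would look roughly like this (a sketch,
not the final patch; the path that actually submits a popped job is elided):

    /* drm_sched_run_job_work(), re-queuing instead of looping */
    static void drm_sched_run_job_work(struct work_struct *w)
    {
            struct drm_gpu_scheduler *sched =
                    container_of(w, struct drm_gpu_scheduler, work_run_job);
            struct drm_sched_entity *entity;
            struct drm_sched_job *sched_job;

            if (READ_ONCE(sched->pause_submit))
                    return;

            entity = drm_sched_select_entity(sched);
            if (!entity)
                    return;         /* genuinely no more work */

            sched_job = drm_sched_entity_pop_job(entity);
            if (!sched_job) {
                    complete_all(&entity->entity_idle);
                    drm_sched_run_job_queue(sched); /* let the work re-run */
                    return;
            }

            /* ... run sched_job, then re-queue as before ... */
    }

This keeps each invocation of the work item cheap and preserves the original
pause/restart behaviour, while still guaranteeing that other ready entities
get another chance to be selected.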
--
Regards,
Luben