[PATCH] drm/sched: Drain all entities in DRM sched run job worker
Christian König
christian.koenig at amd.com
Mon Jan 29 07:44:59 UTC 2024
On 26.01.24 at 17:29, Matthew Brost wrote:
> On Fri, Jan 26, 2024 at 11:32:57AM +0100, Christian König wrote:
>> On 25.01.24 at 18:30, Matthew Brost wrote:
>>> On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
>>>> On 24.01.24 at 22:08, Matthew Brost wrote:
>>>>> All entities must be drained in the DRM scheduler run job worker to
>>>>> avoid the following case: an entity is found that is ready, but no job
>>>>> on that entity turns out to be ready, and the run job worker goes idle
>>>>> while other entities with ready jobs remain. Draining all ready entities
>>>>> (i.e. looping over all ready entities) in the run job worker ensures
>>>>> that all jobs which are ready will be scheduled.
>>>> That doesn't make sense. drm_sched_select_entity() only returns entities
>>>> which are "ready", i.e. have a job to run.
>>>>
>>> That is what I thought too, hence my original design, but it is not
>>> exactly true. Let me explain.
>>>
>>> drm_sched_select_entity() returns an entity with a non-empty spsc queue
>>> (job in queue) and no *current* waiting dependencies [1]. Dependencies for
>>> an entity can be added when drm_sched_entity_pop_job() is called [2][3],
>>> returning a NULL job. Thus we can get into a scenario where 2 entities
>>> A and B both have jobs and no current dependencies. A's job is waiting on
>>> B's job, entity A gets selected first, a dependency gets installed in
>>> drm_sched_entity_pop_job(), run work goes idle, and now we deadlock.
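(For concreteness, a simplified sketch of the pre-patch
drm_sched_run_job_work() flow described above, reconstructed from the lines
the patch below removes; the comments mark where the A/B scenario goes wrong:)

static void drm_sched_run_job_work(struct work_struct *w)
{
	struct drm_gpu_scheduler *sched =
		container_of(w, struct drm_gpu_scheduler, work_run_job);
	struct drm_sched_entity *entity;
	struct drm_sched_job *sched_job;

	if (READ_ONCE(sched->pause_submit))
		return;

	entity = drm_sched_select_entity(sched);	/* picks entity A */
	if (!entity)
		return;

	sched_job = drm_sched_entity_pop_job(entity);	/* installs A's dependency, returns NULL */
	if (!sched_job) {
		complete_all(&entity->entity_idle);
		return;	/* worker goes idle although entity B still has a ready job */
	}

	/* ... submit sched_job ... */
}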
>> And here is the real problem: run work doesn't go idle at that moment.
>>
>> drm_sched_run_job_work() should restart itself until there is either no
>> more space in the ring buffer or it can't find a ready entity any more.
>>
>> At least that was the original design when that was all still driven by a
>> kthread.
>>
>> It may well be that we messed this up when switching from a kthread to a
>> work item.
>>
> Right, that's what this patch does - the run worker does not go idle until
> no ready entities are found. That was incorrect in the original patch
> and is fixed here. Do you have any issues with this fix? It has been
> tested three times and clearly fixes the issue.
Ah! Yes, in this case the patch here is a little bit ugly as well.
The original idea was that run_job re-queues itself, so that submission can
be paused without first searching for another entity to submit.
I strongly suggest replacing the while loop with a call to
drm_sched_run_job_queue(), so that when the entity can't provide a job we
just restart the queuing work.
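Roughly something like this, as an untested sketch (the container_of() /
work_run_job boilerplate is written from memory, the rest stays as in your
patch):

static void drm_sched_run_job_work(struct work_struct *w)
{
	struct drm_gpu_scheduler *sched =
		container_of(w, struct drm_gpu_scheduler, work_run_job);
	struct drm_sched_entity *entity;
	struct drm_sched_job *sched_job;

	if (READ_ONCE(sched->pause_submit))
		return;

	entity = drm_sched_select_entity(sched);
	if (!entity)
		return;

	sched_job = drm_sched_entity_pop_job(entity);
	if (!sched_job) {
		complete_all(&entity->entity_idle);
		/* Can't submit from this entity right now: re-queue the run
		 * job work so the next invocation re-evaluates all ready
		 * entities, instead of looping here. */
		drm_sched_run_job_queue(sched);
		return;
	}

	/* ... rest as before: submit sched_job to the hardware ... */
}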
Regards,
Christian.
>
>
> Matt
>
>> Regards,
>> Christian.
>>
>>> The proper solution is to loop over all ready entities until one with a
>>> job is found via drm_sched_entity_pop_job() and then requeue the run
>>> job worker. Or loop over all entities until drm_sched_select_entity()
>>> returns NULL and then let the run job worker go idle. This is what the
>>> old threaded design did too [4]. Hope this clears everything up.
>>>
>>> Matt
>>>
>>> [1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L144
>>> [2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L464
>>> [3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L397
>>> [4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_main.c#L1011
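(For reference, a heavily simplified sketch of the kthread-era main loop that
[4] points at; field and helper names are approximate and the job-cleanup and
error handling are left out:)

static int drm_sched_main(void *param)
{
	struct drm_gpu_scheduler *sched = param;

	while (!kthread_should_stop()) {
		struct drm_sched_entity *entity = NULL;
		struct drm_sched_job *sched_job;

		/* The wait condition re-runs drm_sched_select_entity() on
		 * every wake-up, so all ready entities get re-evaluated. */
		wait_event_interruptible(sched->wake_up_worker,
					 (entity = drm_sched_select_entity(sched)) ||
					 kthread_should_stop());
		if (!entity)
			continue;

		sched_job = drm_sched_entity_pop_job(entity);
		if (!sched_job) {
			complete_all(&entity->entity_idle);
			/* Loop again instead of going idle - this is the
			 * draining behaviour the work-item version lost. */
			continue;
		}

		/* ... submit sched_job ... */
	}
	return 0;
}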
>>>
>>>> If that's not the case any more then you have broken something else.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Cc: Thorsten Leemhuis <regressions at leemhuis.info>
>>>>> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov at gmail.com>
>>>>> Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@mail.gmail.com/
>>>>> Reported-and-tested-by: Mario Limonciello <mario.limonciello at amd.com>
>>>>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
>>>>> Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@amd.com/
>>>>> Reported-by: Vlastimil Babka <vbabka at suse.cz>
>>>>> Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
>>>>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>>>>> ---
>>>>> drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
>>>>> 1 file changed, 7 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 550492a7a031..85f082396d42 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
>>>>> struct drm_sched_entity *entity;
>>>>> struct dma_fence *fence;
>>>>> struct drm_sched_fence *s_fence;
>>>>> - struct drm_sched_job *sched_job;
>>>>> + struct drm_sched_job *sched_job = NULL;
>>>>> int r;
>>>>> if (READ_ONCE(sched->pause_submit))
>>>>> return;
>>>>> - entity = drm_sched_select_entity(sched);
>>>>> + /* Find entity with a ready job */
>>>>> + while (!sched_job && (entity = drm_sched_select_entity(sched))) {
>>>>> + sched_job = drm_sched_entity_pop_job(entity);
>>>>> + if (!sched_job)
>>>>> + complete_all(&entity->entity_idle);
>>>>> + }
>>>>> if (!entity)
>>>>> - return;
>>>>> -
>>>>> - sched_job = drm_sched_entity_pop_job(entity);
>>>>> - if (!sched_job) {
>>>>> - complete_all(&entity->entity_idle);
>>>>> return; /* No more work */
>>>>> - }
>>>>> s_fence = sched_job->s_fence;