[RFC PATCH 06/10] drm/sched: Submit job before starting TDR
Luben Tuikov
luben.tuikov at amd.com
Thu Aug 31 19:48:23 UTC 2023
On 2023-07-31 03:26, Boris Brezillon wrote:
> +the PVR devs
>
> On Mon, 31 Jul 2023 01:00:59 +0000
> Matthew Brost <matthew.brost at intel.com> wrote:
>
>> On Thu, May 04, 2023 at 01:23:05AM -0400, Luben Tuikov wrote:
>>> On 2023-04-03 20:22, Matthew Brost wrote:
>>>> If the TDR is set to a value, it can fire before a job is submitted in
>>>> drm_sched_main. The job should always be submitted before the TDR
>>>> fires; fix this ordering.
>>>>
>>>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>>>> ---
>>>> drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 6ae710017024..4eac02d212c1 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1150,10 +1150,10 @@ static void drm_sched_main(struct work_struct *w)
>>>> s_fence = sched_job->s_fence;
>>>>
>>>> atomic_inc(&sched->hw_rq_count);
>>>> - drm_sched_job_begin(sched_job);
>>>>
>>>> trace_drm_run_job(sched_job, entity);
>>>> fence = sched->ops->run_job(sched_job);
>>>> + drm_sched_job_begin(sched_job);
>>>> complete_all(&entity->entity_idle);
>>>> drm_sched_fence_scheduled(s_fence);
>>>>
>>>
>>> Not sure if this is correct. In drm_sched_job_begin() we add the job to the "pending_list"
>>> (meaning it is pending execution in the hardware) and we also start a timeout timer. Both
>>> of those should be started before the job is given to the hardware.
>>>
>>
>> The correct solution is probably to add the job to the pending list
>> before run_job() and kick the TDR after run_job().
>
> This would make the PVR driver simpler too. Right now, the driver
> iterates over the pending job list to signal the jobs' done_fences, but
> there's a race between the interrupt handler (that's iterating over
> this list to signal fences) and the drm_sched logic (that's inserting
> the job in the pending_list after run_job() returns). The race is taken
> care of with an additional field that's pointing to the last submitted
> job [1], but if we can get rid of that logic, that's for the best.
>
> [1]https://gitlab.freedesktop.org/frankbinns/powervr/-/blob/powervr-next/drivers/gpu/drm/imagination/pvr_queue.h#L119
(Catching up, chronologically, after vacation...)
I agree with both emails above. I'm aware of this race in the DRM scheduler,
but I'm wary that fixing it could open a can of worms.
But yes, the classic way (which would avoid races) is
to add the job to the "pending list" before run_job(), as we cannot guarantee
the state of the job after run_job() returns. Also, ideally we want to stop all
submissions, then run TDR (recover/reset/etc.), and then resume incoming submissions.
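To make the ordering concrete, here is a minimal userspace sketch in C. It is
not the drm_sched code: every toy_* name, the struct layout and the probe
callback are made up for illustration, and it ignores locking entirely. It only
shows the sequence discussed above: add the job to the pending list, call
run_job, and arm the timeout last.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins (hypothetical); none of these are the real drm_sched types. */
struct toy_job {
    struct toy_job *next;          /* link in the pending list */
    int id;
};

struct toy_sched {
    struct toy_job *pending_head;  /* jobs handed, or about to be handed, to HW */
    struct toy_job *pending_tail;
    bool timeout_armed;            /* models the TDR timer being started */
    bool seen_pending_in_run;      /* probe: was the job listed during run_job? */
    bool timeout_armed_in_run;     /* probe: was the timer armed during run_job? */
    void (*run_job)(struct toy_sched *sched, struct toy_job *job);
};

/* Append the job to the tail of the pending list. */
static void toy_pending_add(struct toy_sched *sched, struct toy_job *job)
{
    job->next = NULL;
    if (sched->pending_tail)
        sched->pending_tail->next = job;
    else
        sched->pending_head = job;
    sched->pending_tail = job;
}

/* Walk the pending list, as a completion handler would. */
static bool toy_is_pending(const struct toy_sched *sched,
                           const struct toy_job *job)
{
    for (const struct toy_job *j = sched->pending_head; j; j = j->next)
        if (j == job)
            return true;
    return false;
}

/* Example run_job hook: records what the "hardware side" can observe. */
static void toy_probe_run_job(struct toy_sched *sched, struct toy_job *job)
{
    sched->seen_pending_in_run = toy_is_pending(sched, job);
    sched->timeout_armed_in_run = sched->timeout_armed;
}

/* The ordering under discussion: list first, submit, timer last. */
static void toy_submit(struct toy_sched *sched, struct toy_job *job)
{
    assert(sched && job && sched->run_job);

    toy_pending_add(sched, job);   /* job is visible before the HW can finish it */
    sched->run_job(sched, job);    /* hand the job to the "hardware" */
    sched->timeout_armed = true;   /* timer cannot fire before submission */
}
```

With toy_probe_run_job installed as run_job, a call to toy_submit() leaves
seen_pending_in_run set and timeout_armed_in_run clear, which is the property
both emails above are asking for.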
--
Regards,
Luben