[PATCH] drm/sched: fix the bug of time out calculation
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Tue Aug 24 15:01:44 UTC 2021
On 2021-08-24 10:46 a.m., Andrey Grodzovsky wrote:
>
> On 2021-08-24 5:51 a.m., Monk Liu wrote:
>> the original logic is wrong in that the timeout is not re-triggered
>> after the previous job signals, which leads to a scenario where all
>> jobs in the same scheduler share the timeout timer of the very first
>> job in that scheduler, which is wrong.
>>
>> we should modify the timer every time a previous job signals.
>>
>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 12 ++++++++++++
>> 1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index a2a9536..fb27025 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -235,6 +235,13 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>> schedule_delayed_work(&sched->work_tdr, sched->timeout);
>> }
>> +static void drm_sched_restart_timeout(struct drm_gpu_scheduler *sched)
>> +{
>> + if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>> + !list_empty(&sched->pending_list))
>> + mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
3rd point - if the list is empty you need to cancel the timer instead, and
let the next job that comes in restart it.
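Something like this untested sketch of what I mean - it only uses the
fields this patch already touches, so take it as a rough illustration
rather than a tested change:

static void drm_sched_restart_timeout(struct drm_gpu_scheduler *sched)
{
	if (sched->timeout == MAX_SCHEDULE_TIMEOUT)
		return;

	if (list_empty(&sched->pending_list))
		/* no pending job left - stop the timer, the next
		 * drm_sched_start_timeout() re-arms it when a new
		 * job is pushed
		 */
		cancel_delayed_work(&sched->work_tdr);
	else
		/* re-arm a full timeout period for the next pending job */
		mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
}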
Andrey
>> +}
>> +
>> /**
>> * drm_sched_fault - immediately start timeout handler
>> *
>> @@ -693,6 +700,11 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>> if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>> /* remove job from pending_list */
>> list_del_init(&job->list);
>> +
>> + /* once the job is deleted from the pending list we should restart
>> + * the timeout calculation for the next job.
>> + */
>> + drm_sched_restart_timeout(sched);
>
>
> I think this should work, but 2 points -
>
> 1st - you should probably remove this now:
> https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L797
>
> 2nd - if you have two adjacent jobs that started very close together,
> you effectively let the second job hang for up to twice as long without
> TDR, because you reset its TDR timer when the timer is almost expired.
> If we had a TTL (time to live) counter for each job and then did
> mod_delayed_work with the TTL of the following job instead of a full
> timer reset, this would be more precise. But this is more of a
> recommendation for improvement.
>
> Andrey
>
>
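To illustrate the TTL idea from the 2nd point above - a very rough,
untested sketch. It assumes a per-job jiffies stamp (call it submit_ts)
is added to struct drm_sched_job and recorded when the job's timeout
period starts; no such field exists in the current code:

static void drm_sched_restart_timeout(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *next;
	unsigned long elapsed, ttl;

	if (sched->timeout == MAX_SCHEDULE_TIMEOUT)
		return;

	next = list_first_entry_or_null(&sched->pending_list,
					typeof(*next), list);
	if (!next) {
		/* nothing pending - stop the timer, a new job re-arms it */
		cancel_delayed_work(&sched->work_tdr);
		return;
	}

	/*
	 * Re-arm only for the remaining time-to-live of the next job
	 * instead of a full timeout period, so a job that has already
	 * run for a while does not get almost 2x the timeout.
	 */
	elapsed = jiffies - next->submit_ts;	/* hypothetical per-job stamp */
	ttl = elapsed < sched->timeout ? sched->timeout - elapsed : 0;

	mod_delayed_work(system_wq, &sched->work_tdr, ttl);
}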
>> /* make the scheduled timestamp more accurate */
>> next = list_first_entry_or_null(&sched->pending_list,
>> typeof(*next), list);