[PATCH] drm/sched: fix the bug of time out calculation
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Tue Aug 24 15:01:44 UTC 2021
On 2021-08-24 10:46 a.m., Andrey Grodzovsky wrote:
>
> On 2021-08-24 5:51 a.m., Monk Liu wrote:
>> the original logic is wrong in that the timeout is not re-triggered
>> after the previous job signals, which leads to a scenario where all
>> jobs in the same scheduler share the timeout timer of the very first
>> job in that scheduler, which is wrong.
>>
>> we should modify the timer every time a previous job signals.
>>
>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 12 ++++++++++++
>> 1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index a2a9536..fb27025 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -235,6 +235,13 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>> schedule_delayed_work(&sched->work_tdr, sched->timeout);
>> }
>> +static void drm_sched_restart_timeout(struct drm_gpu_scheduler *sched)
>> +{
>> + if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>> + !list_empty(&sched->pending_list))
>> + mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
3rd point - if the list is empty you need to cancel the timer instead, and
let the next job that comes in restart it.
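Something like this untested sketch of what I mean - it only uses the
fields this patch already touches, so take it as a rough illustration
rather than a tested change:

static void drm_sched_restart_timeout(struct drm_gpu_scheduler *sched)
{
	if (sched->timeout == MAX_SCHEDULE_TIMEOUT)
		return;

	if (list_empty(&sched->pending_list))
		/* no pending job left - stop the timer, the next
		 * drm_sched_start_timeout() re-arms it when a new
		 * job is pushed
		 */
		cancel_delayed_work(&sched->work_tdr);
	else
		/* re-arm a full timeout period for the next pending job */
		mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
}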
Andrey
>> +}
>> +
>> /**
>> * drm_sched_fault - immediately start timeout handler
>> *
>> @@ -693,6 +700,11 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>> if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>> /* remove job from pending_list */
>> list_del_init(&job->list);
>> +
>> + /* once the job is deleted from the pending list we should restart
>> + * the timeout calculation for the next job.
>> + */
>> + drm_sched_restart_timeout(sched);
>
>
> I think this should work, but 2 points -
>
> 1st - you should probably remove this now:
> https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L797
>
> 2nd - if you have two adjacent jobs that started very close together,
> you effectively let the second job hang for up to twice as long without
> TDR, because you reset its TDR timer when the timer is almost expired.
> If we had a TTL (time to live) counter for each job and then did
> mod_delayed_work with the TTL of the following job instead of a full
> timer reset, this would be more precise. But this is more of a
> recommendation for improvement.
>
> Andrey
>
>
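To illustrate the TTL idea from the 2nd point above - a very rough,
untested sketch. It assumes a per-job jiffies stamp (call it submit_ts)
is added to struct drm_sched_job and recorded when the job's timeout
period starts; no such field exists in the current code:

static void drm_sched_restart_timeout(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *next;
	unsigned long elapsed, ttl;

	if (sched->timeout == MAX_SCHEDULE_TIMEOUT)
		return;

	next = list_first_entry_or_null(&sched->pending_list,
					typeof(*next), list);
	if (!next) {
		/* nothing pending - stop the timer, a new job re-arms it */
		cancel_delayed_work(&sched->work_tdr);
		return;
	}

	/*
	 * Re-arm only for the remaining time-to-live of the next job
	 * instead of a full timeout period, so a job that has already
	 * run for a while does not get almost 2x the timeout.
	 */
	elapsed = jiffies - next->submit_ts;	/* hypothetical per-job stamp */
	ttl = elapsed < sched->timeout ? sched->timeout - elapsed : 0;

	mod_delayed_work(system_wq, &sched->work_tdr, ttl);
}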
>> /* make the scheduled timestamp more accurate */
>> next = list_first_entry_or_null(&sched->pending_list,
>> typeof(*next), list);