[PATCH] drm/sched: fix the bug of time out calculation(v2)

Andrey Grodzovsky andrey.grodzovsky at amd.com
Wed Aug 25 18:20:22 UTC 2021


On 2021-08-25 8:11 a.m., Christian König wrote:
> No, this would break that logic here.
>
> See, drm_sched_start_timeout() can be called multiple times; this is 
> intentional and very important!
>
> The logic in queue_delayed_work() makes sure that the timer is only 
> started once and then never again.
>
> All we need to take care of is to call cancel_delayed_work() when we know 
> that the job is completed.
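For reference, the pairing being described boils down to roughly the
following. This is a paraphrase using the field names visible in the diff
below, not a verbatim quote of the scheduler code:

/* Safe to call any number of times: queue_delayed_work() does nothing and
 * returns false while &sched->work_tdr is still pending, so only the first
 * call after the timer has been disarmed actually arms it. */
static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
{
        if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
            !list_empty(&sched->pending_list))
                queue_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
}

/* The other half of the pairing sits in drm_sched_get_cleanup_job(): it
 * disarms the timer with cancel_delayed_work(&sched->work_tdr) and, if the
 * head job is not finished yet, re-arms it via drm_sched_start_timeout().
 * mod_delayed_work() by contrast re-arms a timer that is already pending,
 * which is what breaks the "started once and then never again" behaviour. */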


Seems to me you can only do that for an empty pending list, otherwise you risk 
cancelling a legitimate new timer that was started by the next job, or not 
restarting the timer at all because your timer was still pending when the next 
job tried to start it again (the common case).
For a non-empty pending list you have to adjust the currently active TDR timer 
from your job's TTL to the TTL of the next job after you, or just restart it 
as Monk does here, which prolongs the timeout more than required but is still 
OK I guess.
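
Just to illustrate the "adjust to the next job's TTL" option, it would need
something like the following, where the per-job submit_jiffies stamp is
hypothetical and does not exist in the scheduler today:

/* Hypothetical sketch: re-arm the shared TDR timer so that it fires when
 * the *next* pending job exceeds its own TTL, instead of granting it a
 * fresh full sched->timeout.  Assumes a (non-existent) per-job
 * submit_jiffies timestamp recorded when the job was pushed to hardware. */
static void drm_sched_adjust_timeout(struct drm_gpu_scheduler *sched,
                                     struct drm_sched_job *next)
{
        unsigned long deadline = next->submit_jiffies + sched->timeout;
        unsigned long remaining = time_after(deadline, jiffies) ?
                                  deadline - jiffies : 0;

        mod_delayed_work(system_wq, &sched->work_tdr, remaining);
}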

What about returning to the old scheme of a timer/sched work item per job, so 
each job has its own timer, nothing is shared, and the timeout is precise for 
each job? With the locking scheme we already have today the actual TDR handler 
would execute only once, while all the others arising from the same guilty 
job's hang would be rejected (for amdgpu; other drivers would probably need 
the same locking, or we could move this into the scheduler layer).
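
A rough sketch of what that per-job scheme could look like; all of the my_*
names below are hypothetical, this is not existing scheduler code:

#include <linux/printk.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

/* Hypothetical per-job timeout: every job carries its own delayed_work,
 * armed at submission and cancelled on completion, so no timer is ever
 * shared between jobs and no re-arming of a shared timer is needed. */
struct my_sched_job {
        struct drm_sched_job base;
        struct delayed_work to_work;    /* this job's private TO timer */
};

static void my_job_timedout(struct work_struct *work)
{
        struct my_sched_job *job =
                container_of(to_delayed_work(work), struct my_sched_job, to_work);

        /* Per-job TDR: with the driver side locking mentioned above only
         * the handler of the first (guilty) job actually runs recovery,
         * the ones raised by the same hang are rejected. */
        pr_err("%s: job timed out\n", job->base.sched->name);
}

static void my_job_begin(struct my_sched_job *job)
{
        INIT_DELAYED_WORK(&job->to_work, my_job_timedout);
        queue_delayed_work(system_wq, &job->to_work, job->base.sched->timeout);
}

static void my_job_finish(struct my_sched_job *job)
{
        cancel_delayed_work(&job->to_work);
}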

Andrey


>
> This works as intended as far as I can see, and if you start to 
> use mod_delayed_work() you actually break it.
>
> Regards,
> Christian.
>
> On 25.08.21 at 14:01, Liu, Monk wrote:
>> [AMD Official Use Only]
>>
>> I think we should remove the cancel_delayed_work() at the beginning 
>> of cleanup_job().
>>
>> Because with my patch the mod_delayed_work() in cleanup_job() is already 
>> doing its duty to retrigger the TO timer accordingly.
>>
>> Thanks
>>
>> ------------------------------------------
>> Monk Liu | Cloud-GPU Core team
>> ------------------------------------------
>>
>> -----Original Message-----
>> From: Liu, Monk
>> Sent: Wednesday, August 25, 2021 7:55 PM
>> To: 'Christian König' <ckoenig.leichtzumerken at gmail.com>; 
>> amd-gfx at lists.freedesktop.org
>> Subject: RE: [PATCH] drm/sched: fix the bug of time out calculation(v2)
>>
>> [AMD Official Use Only]
>>
>>>> The timeout started by queue_delayed_work() in 
>>>> drm_sched_start_timeout() is paired with the cancel_delayed_work() 
>>>> in drm_sched_get_cleanup_job().
>> No, that's wrong. Consider what happens when we are in cleanup_job() and 
>> there is no timeout on this sched (we just keep submitting new jobs to 
>> it): the work_tdr is cancelled, then we look at the head job, and assuming 
>> that job is not signaled yet we fall into the "queue timeout for next job" 
>> path, so drm_sched_start_timeout() is called and this head job's TO timer 
>> is actually retriggered ... which is totally wrong.
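
The path in question, condensed (a paraphrase of the pre-patch
drm_sched_get_cleanup_job() flow, not a verbatim quote):

        cancel_delayed_work(&sched->work_tdr);  /* TO timer disarmed */

        job = list_first_entry_or_null(&sched->pending_list,
                                       struct drm_sched_job, list);
        if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
                /* head job really finished, hand it over to free_job() */
                list_del_init(&job->list);
        } else {
                job = NULL;
                /* "queue timeout for next job": Monk's point is that this
                 * grants the still running head job a fresh full
                 * sched->timeout; Christian's point is that this re-arm is
                 * exactly the intended pairing for the cancel above. */
                drm_sched_start_timeout(sched);
        }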
>>
>> With my patch the timer is retriggered only after the previous job has 
>> really signaled.
>>
>> Can you be more specific about which part is incorrect?
>>
>> Thanks
>> ------------------------------------------
>> Monk Liu | Cloud-GPU Core team
>> ------------------------------------------
>>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>> Sent: Wednesday, August 25, 2021 2:32 PM
>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/sched: fix the bug of time out calculation(v2)
>>
>> Well, NAK to that approach. First of all, your bug analysis is incorrect.
>>
>> The timeout started by queue_delayed_work() in 
>> drm_sched_start_timeout() is paired with the cancel_delayed_work() in 
>> drm_sched_get_cleanup_job().
>>
>> So you must have something else going on here.
>>
>> Then please don't use mod_delayed_work(); instead, always cancel it 
>> and restart it.
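
That is, roughly the following pattern instead of a single mod_delayed_work()
call, which keeps the "only the first start wins" behaviour of
queue_delayed_work() intact everywhere else:

        /* disarm the shared timer, then arm it again for a full period */
        cancel_delayed_work(&sched->work_tdr);
        queue_delayed_work(system_wq, &sched->work_tdr, sched->timeout);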
>>
>> Regards,
>> Christian.
>>
>> On 25.08.21 at 06:14, Monk Liu wrote:
>>> the original logic is wrong in that the timeout is not retriggered
>>> after the previous job signaled, and that leads to the scenario where
>>> all jobs in the same scheduler share the same timeout timer from the
>>> very first job in this scheduler, which is wrong.
>>>
>>> we should modify the timer every time a previous job signals.
>>>
>>> v2:
>>> further clean up the logic, and cancel the TDR timer if the
>>> signaled job is the last one in its scheduler.
>>>
>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_main.c | 29 ++++++++++++++++++++---------
>>>    1 file changed, 20 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index a2a9536..8c102ac 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -305,8 +305,17 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>>>        struct drm_gpu_scheduler *sched = s_job->sched;
>>>           spin_lock(&sched->job_list_lock);
>>> -    list_add_tail(&s_job->list, &sched->pending_list);
>>> -    drm_sched_start_timeout(sched);
>>> +    if (list_empty(&sched->pending_list)) {
>>> +        list_add_tail(&s_job->list, &sched->pending_list);
>>> +        drm_sched_start_timeout(sched);
>>> +    } else {
>>> +        /* the old jobs in pending list are not finished yet
>>> +         * no need to restart TDR timer here, it is already
>>> +         * handled by drm_sched_get_cleanup_job
>>> +         */
>>> +        list_add_tail(&s_job->list, &sched->pending_list);
>>> +    }
>>> +
>>>        spin_unlock(&sched->job_list_lock);
>>>    }
>>> @@ -693,17 +702,22 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>>>        if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>>>            /* remove job from pending_list */
>>>            list_del_init(&job->list);
>>> +
>>>            /* make the scheduled timestamp more accurate */
>>>            next = list_first_entry_or_null(&sched->pending_list,
>>>                            typeof(*next), list);
>>> -        if (next)
>>> +        if (next) {
>>> +            /* if we still have job in pending list we need modify the TDR timer */
>>> +            mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
>>>                next->s_fence->scheduled.timestamp =
>>>                    job->s_fence->finished.timestamp;
>>> +        } else {
>>> +            /* cancel the TDR timer if no job in pending list */
>>> +            cancel_delayed_work(&sched->work_tdr);
>>> +        }
>>>           } else {
>>>            job = NULL;
>>> -        /* queue timeout for next job */
>>> -        drm_sched_start_timeout(sched);
>>>        }
>>>           spin_unlock(&sched->job_list_lock);
>>> @@ -791,11 +805,8 @@ static int drm_sched_main(void *param)
>>>                          (entity = drm_sched_select_entity(sched))) ||
>>>                         kthread_should_stop());
>>>    -        if (cleanup_job) {
>>> +        if (cleanup_job)
>>>                sched->ops->free_job(cleanup_job);
>>> -            /* queue timeout for next job */
>>> -            drm_sched_start_timeout(sched);
>>> -        }
>>>               if (!entity)
>>>                continue;
>

