[PATCH] drm/sched: fix the bug of time out calculation(v3)
Christian König
christian.koenig at amd.com
Fri Aug 27 14:26:24 UTC 2021
Yes, I don't see any good reason for that either.
Christian.
Am 27.08.21 um 15:45 schrieb Andrey Grodzovsky:
> So we agree if (kthread_should_park()) return NULL should go away ?
>
> Andrey
>
>
> On 2021-08-27 3:46 a.m., Liu, Monk wrote:
>> [AMD Official Use Only]
>>
>> Yeah, that "kthread_should_park" also looks irrelevant to me, and it
>> delays the signaled job's cleanup/free
>>
>> Thanks
>>
>> ------------------------------------------
>> Monk Liu | Cloud-GPU Core team
>> ------------------------------------------
>>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>> Sent: Friday, August 27, 2021 2:12 PM
>> To: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Liu, Monk
>> <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org; Koenig, Christian
>> <Christian.Koenig at amd.com>
>> Cc: dri-devel at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)
>>
>> I don't think that this will be necessary nor desired.
>>
>> See, the job should be cleaned up as soon as possible after it is
>> finished, or otherwise we won't cancel the timeout quickly enough either.
>>
>> Christian.
>>
>> Am 26.08.21 um 22:14 schrieb Andrey Grodzovsky:
>>> Attached is a quick patch for per-job TTL calculation to make the
>>> next timer expiration more precise. It's on top of the patch in this
>>> thread. Let me know if this makes sense.
>>>
>>> Andrey
>>>
>>> On 2021-08-26 10:03 a.m., Andrey Grodzovsky wrote:
>>>> On 2021-08-26 12:55 a.m., Monk Liu wrote:
>>>>> issue:
>>>>> in cleanup_job, cancel_delayed_work will cancel a TO timer even if
>>>>> its corresponding job is still running.
>>>>>
>>>>> fix:
>>>>> do not cancel the timer in cleanup_job; instead, cancel it only
>>>>> when the head job is signaled, and if there is a "next" job we
>>>>> start_timeout again.
>>>>>
>>>>> v2:
>>>>> further clean up the logic, and do the TDR timer cancelling if the
>>>>> signaled job is the last one in its scheduler.
>>>>>
>>>>> v3:
>>>>> change the issue description
>>>>> remove the cancel_delayed_work at the beginning of cleanup_job
>>>>> restore the implementation of drm_sched_job_begin.
>>>>>
>>>>> TODO:
>>>>> 1) introduce pause/resume scheduler in job_timeout to serialize the
>>>>> handling of the scheduler and job_timeout.
>>>>> 2) drop the bad job's del and insert in the scheduler due to the
>>>>> above serialization (no race issue anymore with the serialization)
>>>>>
>>>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/scheduler/sched_main.c | 25 ++++++++++---------------
>>>>> 1 file changed, 10 insertions(+), 15 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index a2a9536..ecf8140 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -676,13 +676,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>>>>> {
>>>>> struct drm_sched_job *job, *next;
>>>>> - /*
>>>>> - * Don't destroy jobs while the timeout worker is running OR thread
>>>>> - * is being parked and hence assumed to not touch pending_list
>>>>> - */
>>>>> - if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>> - !cancel_delayed_work(&sched->work_tdr)) ||
>>>>> - kthread_should_park())
>>>>> + if (kthread_should_park())
>>>>> return NULL;
>>>>
>>>> I actually don't see why we need to keep the above; on the other side
>>>> (in drm_sched_stop) we won't touch the pending list anyway until the
>>>> sched thread has come to a full stop (kthread_park). If you do see a
>>>> reason why this is needed, then a comment should be here, I think.
>>>>
>>>> Andrey
>>>>
>>>>
>>>>> spin_lock(&sched->job_list_lock);
>>>>> @@ -693,17 +687,21 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>>>>> if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>>>>> /* remove job from pending_list */
>>>>> list_del_init(&job->list);
>>>>> +
>>>>> + /* cancel this job's TO timer */
>>>>> + cancel_delayed_work(&sched->work_tdr);
>>>>> /* make the scheduled timestamp more accurate */
>>>>> next = list_first_entry_or_null(&sched->pending_list,
>>>>> typeof(*next), list);
>>>>> - if (next)
>>>>> +
>>>>> + if (next) {
>>>>> next->s_fence->scheduled.timestamp =
>>>>> job->s_fence->finished.timestamp;
>>>>> -
>>>>> + /* start TO timer for next job */
>>>>> + drm_sched_start_timeout(sched);
>>>>> + }
>>>>> } else {
>>>>> job = NULL;
>>>>> - /* queue timeout for next job */
>>>>> - drm_sched_start_timeout(sched);
>>>>> }
>>>>> spin_unlock(&sched->job_list_lock);
>>>>> @@ -791,11 +789,8 @@ static int drm_sched_main(void *param)
>>>>> (entity = drm_sched_select_entity(sched))) ||
>>>>> kthread_should_stop());
>>>>> - if (cleanup_job) {
>>>>> + if (cleanup_job)
>>>>> sched->ops->free_job(cleanup_job);
>>>>> - /* queue timeout for next job */
>>>>> - drm_sched_start_timeout(sched);
>>>>> - }
>>>>> if (!entity)
>>>>> continue;