[PATCH] drm/sched: fix the bug of time out calculation(v3)
Christian König
ckoenig.leichtzumerken at gmail.com
Fri Aug 27 14:29:34 UTC 2021
I don't think that makes sense.
See, we don't want to start the timer when the job is inserted into the
ring buffer, but rather when it starts processing.
When processing starts is somewhat fuzzily defined, but just starting the
timer when the previous job completes should be good enough.
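In drm_sched_get_cleanup_job() that policy boils down to something like
this (a minimal sketch against the existing helpers; locking elided,
untested):

    /* The head job finished: cancel its timer and, if another job is
     * already on the hardware, give that one a fresh timeout. */
    if (dma_fence_is_signaled(&job->s_fence->finished)) {
            cancel_delayed_work(&sched->work_tdr);
            if (!list_empty(&sched->pending_list))
                    drm_sched_start_timeout(sched);
    }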
Christian.
Am 27.08.21 um 15:57 schrieb Andrey Grodzovsky:
> The TS represents the point in time when the job was inserted into the
> pending list.
> I don't think it matters when it actually starts to be processed; what
> matters is when this job was inserted into the pending list, because
> right at that point you arm the TO timer (when no other is running
> already). So when the previous job completes and you cancel and rearm
> again, you can use the TS from the next job in the pending list to
> calculate how much time it actually has left to run before TDR must be
> initiated, instead of just giving it the full TO value again even though
> it has already been running for a while.
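>
> In code, the arithmetic would look roughly like this (a sketch only, not
> the actual patch I attached; it assumes the scheduled fence's timestamp
> was set at pending-list insertion as described above):
>
>     /* Rearm the TDR timer with only the head job's remaining budget,
>      * instead of a fresh sched->timeout (sketch, untested). */
>     ktime_t elapsed = ktime_sub(ktime_get(),
>                                 next->s_fence->scheduled.timestamp);
>     unsigned long used = nsecs_to_jiffies(ktime_to_ns(elapsed));
>
>     if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
>             mod_delayed_work(system_wq, &sched->work_tdr,
>                              used < sched->timeout ?
>                              sched->timeout - used : 0);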
>
> Also, I am not sure about the assumption that what we measure is
> processing by the HW; what we measure is from the moment the job was
> scheduled to the ring to the moment it completed (the EOP event). At
> least that is what our TDR timer measures, and so it makes sense to set
> the TS at this point.
>
> Andrey
>
> On 2021-08-27 3:20 a.m., Liu, Monk wrote:
>> [AMD Official Use Only]
>>
>> What is that 'ts' representing? It looks to me like the jiffies at
>> which the job gets scheduled to the ring, but a job scheduled to the
>> ring doesn't mean it's being processed by the HW.
>>
>> Thanks
>>
>> ------------------------------------------
>> Monk Liu | Cloud-GPU Core team
>> ------------------------------------------
>>
>> -----Original Message-----
>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>> Sent: Friday, August 27, 2021 4:14 AM
>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org;
>> Koenig, Christian <Christian.Koenig at amd.com>
>> Cc: dri-devel at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)
>>
>> Attached is a quick patch for per-job TTL calculation to make the next
>> timer expiration more precise. It's on top of the patch in this thread.
>> Let me know if this makes sense.
>>
>> Andrey
>>
>> On 2021-08-26 10:03 a.m., Andrey Grodzovsky wrote:
>>> On 2021-08-26 12:55 a.m., Monk Liu wrote:
>>>> issue:
>>>> in cleanup_job, cancel_delayed_work will cancel a TO timer even
>>>> though its corresponding job is still running.
>>>>
>>>> fix:
>>>> do not cancel the timer in cleanup_job; instead do the cancelling
>>>> only when the head job is signaled, and if there is a "next" job,
>>>> start_timeout again.
>>>>
>>>> v2:
>>>> further clean up the logic, and do the TDR timer cancelling if the
>>>> signaled job is the last one in its scheduler.
>>>>
>>>> v3:
>>>> change the issue description
>>>> remove the cancel_delayed_work at the beginning of cleanup_job
>>>> restore the implementation of drm_sched_job_begin.
>>>>
>>>> TODO:
>>>> 1) introduce pause/resume of the scheduler in job_timeout to serialize
>>>> the handling of the scheduler and job_timeout.
>>>> 2) drop the bad job's del and re-insert in the scheduler, which the
>>>> above serialization makes unnecessary (no race issue anymore).
>>>>
>>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>>> ---
>>>> drivers/gpu/drm/scheduler/sched_main.c | 25 ++++++++++---------------
>>>> 1 file changed, 10 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index a2a9536..ecf8140 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -676,13 +676,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>>>> {
>>>> struct drm_sched_job *job, *next;
>>>> - /*
>>>> - * Don't destroy jobs while the timeout worker is running OR thread
>>>> - * is being parked and hence assumed to not touch pending_list
>>>> - */
>>>> - if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>> - !cancel_delayed_work(&sched->work_tdr)) ||
>>>> - kthread_should_park())
>>>> + if (kthread_should_park())
>>>> return NULL;
>>>
>>> I actually don't see why we need to keep the above; on the other side
>>> (in drm_sched_stop) we won't touch the pending list anyway until the
>>> sched thread has come to a full stop (kthread_park). If you do see a
>>> reason why this is needed, then a comment should be here, I think.
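>>>
>>> For reference, drm_sched_stop() parks the thread before it touches the
>>> list; condensed, the ordering is (simplified from sched_main.c):
>>>
>>>     void drm_sched_stop(struct drm_gpu_scheduler *sched,
>>>                         struct drm_sched_job *bad)
>>>     {
>>>             kthread_park(sched->thread); /* blocks until the sched
>>>                                           * thread is parked */
>>>             /* ... only then walk/modify sched->pending_list ... */
>>>     }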
>>>
>>> Andrey
>>>
>>>
>>>> spin_lock(&sched->job_list_lock);
>>>> @@ -693,17 +687,21 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>>>> if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>>>> /* remove job from pending_list */
>>>> list_del_init(&job->list);
>>>> +
>>>> + /* cancel this job's TO timer */
>>>> + cancel_delayed_work(&sched->work_tdr);
>>>> /* make the scheduled timestamp more accurate */
>>>> next = list_first_entry_or_null(&sched->pending_list,
>>>> typeof(*next), list);
>>>> - if (next)
>>>> +
>>>> + if (next) {
>>>> next->s_fence->scheduled.timestamp =
>>>> job->s_fence->finished.timestamp;
>>>> -
>>>> + /* start TO timer for next job */
>>>> + drm_sched_start_timeout(sched);
>>>> + }
>>>> } else {
>>>> job = NULL;
>>>> - /* queue timeout for next job */
>>>> - drm_sched_start_timeout(sched);
>>>> }
>>>> spin_unlock(&sched->job_list_lock);
>>>> @@ -791,11 +789,8 @@ static int drm_sched_main(void *param)
>>>> (entity = drm_sched_select_entity(sched))) ||
>>>> kthread_should_stop());
>>>> - if (cleanup_job) {
>>>> + if (cleanup_job)
>>>> sched->ops->free_job(cleanup_job);
>>>> - /* queue timeout for next job */
>>>> - drm_sched_start_timeout(sched);
>>>> - }
>>>> if (!entity)
>>>> continue;