[PATCH] drm/sched: fix the bug of time out calculation
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Tue Aug 24 14:46:47 UTC 2021
On 2021-08-24 5:51 a.m., Monk Liu wrote:
> The original logic is wrong: the timeout is not re-triggered
> after the previous job signals, so every job in the same scheduler
> shares the same timeout timer, armed when the very first job in
> that scheduler started.
>
> We should modify the timer every time a previous job signals.
>
> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> ---
> drivers/gpu/drm/scheduler/sched_main.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index a2a9536..fb27025 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -235,6 +235,13 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
> schedule_delayed_work(&sched->work_tdr, sched->timeout);
> }
>
> +static void drm_sched_restart_timeout(struct drm_gpu_scheduler *sched)
> +{
> + if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> + !list_empty(&sched->pending_list))
> + mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
> +}
> +
> /**
> * drm_sched_fault - immediately start timeout handler
> *
> @@ -693,6 +700,11 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
> if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
> /* remove job from pending_list */
> list_del_init(&job->list);
> +
> + /* once the job is deleted from the pending list we should
> + * restart the timeout calculation for the next job.
> + */
> + drm_sched_restart_timeout(sched);
I think this should work, but two points -
1st - you should probably now remove the existing timeout restart here:
https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L797
2nd - if two adjacent jobs start very close together, you effectively
let the second job hang for up to twice the timeout without TDR,
because you reset its TDR timer when that timer is almost expired.
If we had a TTL (time-to-live) counter for each job and did
mod_delayed_work to the TTL of the following job instead of a full
timer reset, this would be more precise. But this is more of a
recommendation for improvement.
Andrey
> /* make the scheduled timestamp more accurate */
> next = list_first_entry_or_null(&sched->pending_list,
> typeof(*next), list);
More information about the amd-gfx mailing list