[PATCH v3 10/13] drm/sched: Add helper to set TDR timeout
Matthew Brost
matthew.brost at intel.com
Thu Sep 14 17:36:27 UTC 2023
On Wed, Sep 13, 2023 at 10:38:24PM -0400, Luben Tuikov wrote:
> On 2023-09-11 22:16, Matthew Brost wrote:
> > Add helper to set TDR timeout and restart the TDR with new timeout
> > value. This will be used in XE, new Intel GPU driver, to trigger the TDR
> > to cleanup drm_sched_entity that encounter errors.
>
> Do you just want to trigger the cleanup or do you really want to
> modify the timeout and requeue TDR delayed work (to be triggered
> later at a different time)?
>
> If the former, then might as well just add a function to queue that
> right away. If the latter, then this would do, albeit with a few
Let me change the function to queue it immediately as that is what is
needed.
Matt
> notes as mentioned below.
>
> >
> > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > ---
> > drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
> > include/drm/gpu_scheduler.h | 1 +
> > 2 files changed, 19 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 9dbfab7be2c6..689fb6686e01 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -426,6 +426,24 @@ static void drm_sched_start_timeout_unlocked(struct drm_gpu_scheduler *sched)
> > spin_unlock(&sched->job_list_lock);
> > }
> >
> > +/**
> > + * drm_sched_set_timeout - set timeout for reset worker
> > + *
> > + * @sched: scheduler instance to set and (re)-start the worker for
> > + * @timeout: timeout period
> > + *
> > + * Set and (re)-start the timeout for the given scheduler.
> > + */
> > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
> > +{
>
> Well, I'd perhaps call this "drm_sched_set_tdr_timeout()", or something
> to that effect, as "drm_sched_set_timeout()" isn't clear that it is indeed
> a cleanup timeout. However, it's totally up to you. :-)
>
> It appears that "long timeout" is the new job timeout, so it is possible
> that a stuck job might be given old timeout + new timeout recovery time,
> after this function is called.
>
> > + spin_lock(&sched->job_list_lock);
> > + sched->timeout = timeout;
> > + cancel_delayed_work(&sched->work_tdr);
> > + drm_sched_start_timeout(sched);
> > + spin_unlock(&sched->job_list_lock);
> > +}
> > +EXPORT_SYMBOL(drm_sched_set_timeout);
>
> Well, drm_sched_start_timeout() (which also has a name lacking description, perhaps
> it should be "drm_sched_start_tdr_timeout()" or "...start_cleanup_timeout()"), anyway,
> so that function compares to MAX_SCHEDULE_TIMEOUT and pending list not being empty
> before it requeues delayed TDR work item. So, while a remote possibility, this new
> function may have the unintended consequence of canceling TDR, and never restarting it.
> I see it grabs the lock, however. Maybe wrap it in "if (sched->timeout != MAX_SCHEDULE_TIMEOUT)"?
> How about using mod_delayed_work()?
> --
> Regards,
> Luben
>
> > +
> > /**
> > * drm_sched_fault - immediately start timeout handler
> > *
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 5d753ecb5d71..b7b818cd81b6 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -596,6 +596,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> > struct drm_gpu_scheduler **sched_list,
> > unsigned int num_sched_list);
> >
> > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
> > void drm_sched_job_cleanup(struct drm_sched_job *job);
> > void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> > void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
>
More information about the dri-devel
mailing list