[PATCH 2/2] drm/sched: fix timeout handling
Nayan Deshmukh
nayan26deshmukh at gmail.com
Mon Oct 8 18:01:43 UTC 2018
Thanks for all the explanations. It looks like this part of the scheduler
has quite a few bugs.
On Tue, Oct 9, 2018 at 2:55 AM Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
>
> Am 08.10.2018 um 19:40 schrieb Nayan Deshmukh:
> > Hi Christian,
> >
> > I am a bit confused by this patch. It would help if you could
> > explain what these 3 functions are supposed to do. From my
> > understanding, this is what they do:
> >
> > 1. drm_sched_job_timedout: This function is called when a job has been
> > executing for more than "timeout" seconds. It needs to identify the
> > offending job and call the timedout_job callback, which notifies the
> > driver so that the GPU can be recovered to its original state.
> >
> > 2. drm_sched_hw_job_reset: stop the scheduler if it contains the bad
> > job. ATM this function first removes all the callbacks and only
> > afterwards goes through the ring_mirror_list to find the bad job. I
> > think it should do it in the opposite order, or my understanding of
> > this function is not fully correct.
>
> That function actually has a couple of bugs itself, but it is irrelevant
> for the current patch.
>
> > 3. drm_sched_job_recovery: recover jobs after a reset. It resubmits
> > the job to the hardware queue avoiding the guilty job.
>
> That function is completely unrelated and only called after recovering
> from a hard reset.
>
> >
> > Some other questions and suggestions:
> >
> > 1. We could avoid the race condition altogether if we shifted the
> > canceling and rescheduling of the timeout work item into
> > drm_sched_process_job().
>
> That won't work. drm_sched_process_job() is called from interrupt
> context and can't sync with a timeout worker.
>
> > 2. How does removing and adding the callback help with the race
> > condition? Moreover, the hardware might execute some jobs while we are
> > in this function leading to more races.
>
> We need to make sure that the underlying hardware fence doesn't signal
> and trigger new processing while we are about to call the driver's
> timeout function to reset the hardware.
>
> This is done by removing the callbacks in reverse order (e.g. newest to
> oldest).
>
> If we find that we can't remove a callback because the hardware has
> actually continued and the fence has already signaled, we add the
> callbacks back in normal order (e.g. oldest to newest), starting from
> the job whose fence already signaled.
Why do we need to call drm_sched_process_job() again for the last signaled job?
Regards,
Nayan
>
> >
> > 3. In what order should the above 3 functions be executed? I think
> > the answer to this question might clear up a lot of my doubts.
>
> If we can quickly recover from a problem, only drm_sched_job_timedout()
> should be executed.
>
> The other two are for hard resets where we need to stop multiple
> scheduler instances and get them running again.
>
> That the karma handling is mixed into that is rather unfortunate and
> actually quite buggy as well.
>
> I should probably also clean that up.
>
> Regards,
> Christian.
>
> >
> > Regards,
> > Nayan
> >
> > On Mon, Oct 8, 2018 at 8:36 PM Christian König
> > <ckoenig.leichtzumerken at gmail.com> wrote:
> >> We need to make sure that we don't race between job completion and
> >> timeout.
> >>
> >> Signed-off-by: Christian König <christian.koenig at amd.com>
> >> ---
> >> drivers/gpu/drm/scheduler/sched_main.c | 28 +++++++++++++++++++++++++++-
> >> 1 file changed, 27 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >> index bd7d11c47202..ad3c57c9fd21 100644
> >> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >> @@ -248,14 +248,40 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
> >> static void drm_sched_job_timedout(struct work_struct *work)
> >> {
> >> struct drm_gpu_scheduler *sched;
> >> + struct drm_sched_fence *fence;
> >> struct drm_sched_job *job;
> >> + int r;
> >>
> >> sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> >> +
> >> + spin_lock(&sched->job_list_lock);
> >> + list_for_each_entry_reverse(job, &sched->ring_mirror_list, node) {
> >> + fence = job->s_fence;
> >> + if (!dma_fence_remove_callback(fence->parent, &fence->cb))
> >> + goto already_signaled;
> >> + }
> >> +
> >> job = list_first_entry_or_null(&sched->ring_mirror_list,
> >> struct drm_sched_job, node);
> >> + spin_unlock(&sched->job_list_lock);
> >>
> >> if (job)
> >> - job->sched->ops->timedout_job(job);
> >> + sched->ops->timedout_job(job);
> >> +
> >> + spin_lock(&sched->job_list_lock);
> >> + list_for_each_entry(job, &sched->ring_mirror_list, node) {
> >> + fence = job->s_fence;
> >> + if (!fence->parent || !list_empty(&fence->cb.node))
> >> + continue;
> >> +
> >> + r = dma_fence_add_callback(fence->parent, &fence->cb,
> >> + drm_sched_process_job);
> >> + if (r)
> >> +already_signaled:
> >> + drm_sched_process_job(fence->parent, &fence->cb);
> >> +
> >> + }
> >> + spin_unlock(&sched->job_list_lock);
> >> }
> >>
> >> /**
> >> --
> >> 2.14.1
> >>
> >> _______________________________________________
> >> dri-devel mailing list
> >> dri-devel at lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>