[Intel-gfx] [PATCH v3 01/20] drm/sched: entity->rq selection cannot fail

Fri Jul 9 07:14:25 UTC 2021

On Fri, Jul 9, 2021 at 8:53 AM Christian König <christian.koenig at amd.com> wrote:
> Am 08.07.21 um 19:37 schrieb Daniel Vetter:
> > If it does, someone managed to set up a sched_entity without
> > schedulers, which is just a driver bug.
>
> NAK, it is perfectly valid for rq selection to fail.

There isn't a better way to explain stuff to someone who's new to the
code and tries to improve it with docs than to NAK stuff with
incomplete explanations?

> See drm_sched_pick_best():
>
>                  if (!sched->ready) {
>                          DRM_WARN("scheduler %s is not ready, skipping",
>                                   sched->name);
>                          continue;
>                  }
>
> This can happen when a device reset fails for some engine.

Well yeah, I didn't expect amdgpu to just change sched->ready directly,
so I didn't find it. Getting an ENOENT on a hw failure instead of an EIO
is somewhat surprising semantics, I guess. Also, what happens with the
jobs which raced against the scheduler not being ready? I'm not seeing
any checks for ready in the main scheduler logic, so this at least looks
somewhat accidental, as a side effect. Also, no driver other than amdgpu
communicates back to drm/sched like this that a reset failed. They seem
to just not, and I guess a timeout on the next request will get us into
an endless reset loop?
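
To make the failure path concrete, here is a minimal userspace sketch of
how a non-ready scheduler ends up as -ENOENT at job init. It is a model,
not the kernel code: the model_* names and structs are made up for
illustration, and the "lowest score wins" rule and the unconditional
re-selection are simplifying assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the drm/sched structures, heavily simplified. */
struct model_sched {
	const char *name;
	bool ready;         /* cleared by the driver, e.g. after a failed reset */
	unsigned int score; /* assumed load metric, lower is better */
};

struct model_entity {
	struct model_sched **sched_list;
	size_t num_sched_list;
	struct model_sched *rq_sched; /* stands in for entity->rq */
};

/* Mirrors the quoted loop: schedulers that are not ready are skipped. */
struct model_sched *model_pick_best(struct model_sched **list, size_t n)
{
	struct model_sched *picked = NULL;

	assert(list != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct model_sched *s = list[i];

		if (!s->ready)
			continue;
		if (!picked || s->score < picked->score)
			picked = s;
	}
	return picked; /* NULL if the list is empty or nothing is ready */
}

/* Simplified: always re-selects, unlike the real entity code. */
void model_select_rq(struct model_entity *e)
{
	e->rq_sched = model_pick_best(e->sched_list, e->num_sched_list);
}

/* Pre-patch behaviour: a failed selection is reported to the caller. */
int model_job_init(struct model_entity *e)
{
	model_select_rq(e);
	if (!e->rq_sched)
		return -ENOENT;
	return 0;
}
```

In this model an entity whose schedulers have all been marked not ready
gets -ENOENT from model_job_init(), the same as an entity with an empty
sched_list. With a BUG_ON() in place of the check, both cases would
crash instead.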
-Daniel

>
> Regards,
> Christian.
>
> >
> > We BUG_ON() here because in the next patch drm_sched_job_init() will
> > be split up, with drm_sched_job_arm() never failing. And that's the
> > part where the rq selection will end up in.
> >
> > Note that if having an empty sched_list set on an entity is indeed a
> > valid use-case, we can keep that check in job_init even after the split
> > into job_init/arm.
> >
> > Signed-off-by: Daniel Vetter <daniel.vetter at intel.com>
> > Cc: "Christian König" <christian.koenig at amd.com>
> > Cc: Luben Tuikov <luben.tuikov at amd.com>
> > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> > Cc: Steven Price <steven.price at arm.com>
> > Cc: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> > Cc: Boris Brezillon <boris.brezillon at collabora.com>
> > Cc: Jack Zhang <Jack.Zhang1 at amd.com>
> > ---
> >   drivers/gpu/drm/scheduler/sched_entity.c | 2 +-
> >   drivers/gpu/drm/scheduler/sched_main.c   | 3 +--
> >   2 files changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index 79554aa4dbb1..6fc116ee7302 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -45,7 +45,7 @@
> >    * @guilty: atomic_t set to 1 when a job on this queue
> >    *          is found to be guilty causing a timeout
> >    *
> > - * Note: the sched_list should have at least one element to schedule
> > + * Note: the sched_list must have at least one element to schedule
> >    *       the entity
> >    *
> >    * Returns 0 on success or a negative error code on failure.
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 33c414d55fab..01dd47154181 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -586,8 +586,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
> >       struct drm_gpu_scheduler *sched;
> >
> >       drm_sched_entity_select_rq(entity);
> > -     if (!entity->rq)
> > -             return -ENOENT;
> > +     BUG_ON(!entity->rq);
> >
> >       sched = entity->rq->sched;
> >
>

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch