[PATCH 12/13] drm/scheduler: rework entity flush, kill and fini

Rob Clark robdclark at gmail.com
Wed Dec 28 16:52:13 UTC 2022


On Wed, Dec 28, 2022 at 8:27 AM Rob Clark <robdclark at gmail.com> wrote:
>
> On Thu, Nov 17, 2022 at 7:12 AM Dmitry Osipenko
> <dmitry.osipenko at collabora.com> wrote:
> >
> > On 11/17/22 18:09, Christian König wrote:
> > > On 11/17/22 15:41, Dmitry Osipenko wrote:
> > >> [SNIP]
> > >>> drm_sched_entity_flush() should be called from the flush callback from
> > >>> the file_operations structure of panfrost. See amdgpu_flush() and
> > >>> amdgpu_ctx_mgr_entity_flush(). This makes sure that we wait for all
> > >>> entities of the process/file descriptor to be flushed out.
> > >>>
> > >>> drm_sched_entity_fini() must be called before you free the memory of
> > >>> the entity structure, or otherwise we would run into a use-after-free.
> > >> Right, drm_sched_entity_destroy() invokes these two functions and
> > >> Panfrost uses drm_sched_entity_destroy().
> > >
> > > Then I have no idea what's going wrong here.
> > >
> > > The scheduler should trivially finish with the entity and call
> > > complete(&entity->entity_idle) in its main loop. No idea why this
> > > doesn't happen. Can you investigate?
> >
> > I'll take a closer look. Hoped you may have a quick idea of what's wrong :)
> >
>
> As Jonathan mentioned, the same thing is happening on msm.  I can
> reproduce this by adding an assert in mesa (in this case, triggered
> after 100 draws) and running an app under gdb.  After the assert is
> hit, if I try to exit mesa, it hangs.
>
> The problem is that we somehow call drm_sched_entity_kill() twice.
> The first call completes, but it consumes the entity_idle completion,
> so the second call hangs forever.
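
Fwiw, here's a simplified sketch (paraphrasing the reworked
sched_entity.c, not the exact code) of why the second call never
returns: entity_idle is signalled from the scheduler's main loop, and
wait_for_completion() consumes that signal, so once the entity has
been removed from its runqueue no further complete() ever arrives:

static void drm_sched_entity_kill(struct drm_sched_entity *entity)
{
        if (!entity->rq)
                return;

        spin_lock(&entity->rq_lock);
        entity->stopped = true;
        drm_sched_rq_remove_entity(entity->rq, entity);
        spin_unlock(&entity->rq_lock);

        /*
         * First caller: the scheduler main loop complete()s
         * entity_idle and this returns.  Second caller: the entity is
         * already off the runqueue, the scheduler never picks it up
         * again, so this waits forever.
         */
        wait_for_completion(&entity->entity_idle);

        /* ... then tear down the jobs still queued on the entity ... */
}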

Maybe we should:

------
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index fe09e5be79bd..3d7c671d05e3 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -222,7 +226,6 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
 long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
 {
        struct drm_gpu_scheduler *sched;
-       struct task_struct *last_user;
        long ret = timeout;

        if (!entity->rq)
@@ -244,12 +247,6 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
                                    drm_sched_entity_is_idle(entity));
        }

-       /* For killed process disable any more IBs enqueue right now */
-       last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
-       if ((!last_user || last_user == current->group_leader) &&
-           (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
-               drm_sched_entity_kill(entity);
-
        return ret;
 }
 EXPORT_SYMBOL(drm_sched_entity_flush);
----
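
Note that the double kill happens because the hunk removed above kills
the entity from drm_sched_entity_flush() for a SIGKILLed process, and
then drm_sched_entity_fini() (which in this series also starts by
killing the entity) does it a second time on the close path, since the
destroy helper is just flush + fini (quoting from memory, modulo
details):

void drm_sched_entity_destroy(struct drm_sched_entity *entity)
{
        drm_sched_entity_flush(entity, MAX_WAIT_SCHED_ENTITY_Q_EMPTY);
        drm_sched_entity_fini(entity);
}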

Maybe there is a better fix, but special handling for SIGKILL seems
dubious to me (vs just relying on the drm device fd close path).  I
wonder if that code path was tested at all?
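
To be clear, the fd close path I'd rely on is the driver's normal
postclose teardown; on panfrost (paraphrasing from memory, names may
differ slightly) that boils down to:

void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
{
        int i;

        for (i = 0; i < NUM_JOB_SLOTS; i++)
                drm_sched_entity_destroy(&panfrost_priv->sched_entity[i]);
}

which flushes and finishes each entity exactly once, with no SIGKILL
special case needed.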

BR,
-R

