[PATCH 1/1] drm/lima: implement the file flush callback

Christian König christian.koenig at amd.com
Thu Apr 10 14:04:18 UTC 2025


Am 10.04.25 um 15:56 schrieb Qiang Yu:
> On Thu, Apr 10, 2025 at 8:35 PM Christian König
> <christian.koenig at amd.com> wrote:
>> Am 10.04.25 um 11:33 schrieb Qiang Yu:
>>> On Tue, Apr 8, 2025 at 11:48 PM Erico Nunes <nunes.erico at gmail.com> wrote:
>>>> With this callback implemented, a terminating application will wait for
>>>> the sched entity to be flushed out to the hardware and cancel all other
>>>> pending jobs before destroying its context.
>>> We already flush at file release time, in lima_ctx_mgr_fini. Why do we wait
>>> here in flush? What's the difference?
>> Waiting for submissions when you release the file descriptor is actually a bad idea, since that can prevent SIGKILL from acting immediately. For example, the OOM killer absolutely doesn't like that and eventually calls panic().
>>
>> Flush is called either manually, on process termination, or when you send a SIGTERM. It should then wait for any pending I/O to complete.
>>
>> The idea is that you can then still send a SIGKILL to abort the wait for the I/O as well, so that pending GPU operations are never submitted to the HW.
>>
>> The DRM scheduler helps with that by providing the drm_sched_entity_flush() and drm_sched_entity_fini() functions.
>>
>> When there is still pending work by the time drm_sched_entity_fini() is called, a callback to kill it is installed and the jobs are simply freed instead of executed.
>>
> So the difference is that we can always wait in flush, but better not wait in
> release in case of SIGKILL?

Exactly that, yes.
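
To make the split concrete, here is a rough sketch of that pattern in generic driver terms. This is not the lima code; example_file_priv and its entity member are made up for illustration. The idea is to wait interruptibly in the file_operations .flush callback and, in the postclose/release path, only tear the entity down, so the scheduler kills anything still queued instead of running it:

static int example_drm_flush(struct file *filp, fl_owner_t id)
{
	struct drm_file *file = filp->private_data;
	struct example_file_priv *priv = file->driver_priv;	/* hypothetical */
	long timeout = MAX_WAIT_SCHED_ENTITY_Q_EMPTY;

	/* Can return early when the waiting process receives a SIGKILL. */
	timeout = drm_sched_entity_flush(&priv->entity, timeout);

	return timeout >= 0 ? 0 : timeout;
}

static void example_drm_postclose(struct drm_device *dev, struct drm_file *file)
{
	struct example_file_priv *priv = file->driver_priv;	/* hypothetical */

	/*
	 * No waiting here: jobs still queued at this point are freed by the
	 * scheduler instead of being submitted to the hardware.
	 */
	drm_sched_entity_fini(&priv->entity);
	kfree(priv);
}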

>
>>>> This prevents applications with multiple contexts from running into a
>>>> race condition between running tasks and context destruction when
>>>> terminating.
>>>>
> If implementing the flush callback fixes the problem, it must not be the SIGKILL case.

SIGKILL also ends up calling flush eventually, but we can detect that in the scheduler code.

See the code and comment in drm_sched_entity_flush(). We could potentially improve the comment because it's not 100% correct, but it should work in all cases.
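
Roughly, and simplified from memory (the real function does a bit more, e.g. tracking the last user of the entity), the relevant part looks like this:

long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
{
	struct drm_gpu_scheduler *sched = entity->rq->sched;
	long ret = timeout;

	if (current->flags & PF_EXITING) {
		/*
		 * The process is already exiting (e.g. after SIGKILL): wait
		 * at most `timeout` so we don't block its teardown forever.
		 */
		if (timeout)
			ret = wait_event_timeout(sched->job_scheduled,
						 drm_sched_entity_is_idle(entity),
						 timeout);
	} else {
		/*
		 * Regular flush: wait for the entity to drain, but let a
		 * fatal signal (SIGKILL) interrupt the wait.
		 */
		wait_event_killable(sched->job_scheduled,
				    drm_sched_entity_is_idle(entity));
	}

	return ret;
}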

> Could you describe the problem and how this fix solves it? I don't understand
> how the above difference could fix a race bug.

That was the point I wasn't sure about either. It should *not* fix any race; it just gives a nicer SIGKILL experience.

Regards,
Christian.

>
>>>> Signed-off-by: Erico Nunes <nunes.erico at gmail.com>
>>>> ---
>>>>  drivers/gpu/drm/lima/lima_ctx.c | 18 ++++++++++++++++++
>>>>  drivers/gpu/drm/lima/lima_ctx.h |  1 +
>>>>  drivers/gpu/drm/lima/lima_drv.c | 17 ++++++++++++++++-
>>>>  3 files changed, 35 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/lima/lima_ctx.c b/drivers/gpu/drm/lima/lima_ctx.c
>>>> index 0e668fc1e0f9..e8fb5788ca69 100644
>>>> --- a/drivers/gpu/drm/lima/lima_ctx.c
>>>> +++ b/drivers/gpu/drm/lima/lima_ctx.c
>>>> @@ -100,3 +100,21 @@ void lima_ctx_mgr_fini(struct lima_ctx_mgr *mgr)
>>>>         xa_destroy(&mgr->handles);
>>>>         mutex_destroy(&mgr->lock);
>>>>  }
>>>> +
>>>> +long lima_ctx_mgr_flush(struct lima_ctx_mgr *mgr, long timeout)
>>>> +{
>>>> +       struct lima_ctx *ctx;
>>>> +       unsigned long id;
>>>> +
>>>> +       mutex_lock(&mgr->lock);
>>>> +       xa_for_each(&mgr->handles, id, ctx) {
>>>> +               for (int i = 0; i < lima_pipe_num; i++) {
>>>> +                       struct lima_sched_context *context = &ctx->context[i];
>>>> +                       struct drm_sched_entity *entity = &context->base;
>>>> +
>>>> +                       timeout = drm_sched_entity_flush(entity, timeout);
>>>> +               }
>>>> +       }
>>>> +       mutex_unlock(&mgr->lock);
>>>> +       return timeout;
>>>> +}
>>>> diff --git a/drivers/gpu/drm/lima/lima_ctx.h b/drivers/gpu/drm/lima/lima_ctx.h
>>>> index 5b1063ce968b..ff133db6ae4c 100644
>>>> --- a/drivers/gpu/drm/lima/lima_ctx.h
>>>> +++ b/drivers/gpu/drm/lima/lima_ctx.h
>>>> @@ -30,5 +30,6 @@ struct lima_ctx *lima_ctx_get(struct lima_ctx_mgr *mgr, u32 id);
>>>>  void lima_ctx_put(struct lima_ctx *ctx);
>>>>  void lima_ctx_mgr_init(struct lima_ctx_mgr *mgr);
>>>>  void lima_ctx_mgr_fini(struct lima_ctx_mgr *mgr);
>>>> +long lima_ctx_mgr_flush(struct lima_ctx_mgr *mgr, long timeout);
>>>>
>>>>  #endif
>>>> diff --git a/drivers/gpu/drm/lima/lima_drv.c b/drivers/gpu/drm/lima/lima_drv.c
>>>> index 11ace5cebf4c..08169b0d9c28 100644
>>>> --- a/drivers/gpu/drm/lima/lima_drv.c
>>>> +++ b/drivers/gpu/drm/lima/lima_drv.c
>>>> @@ -254,7 +254,22 @@ static const struct drm_ioctl_desc lima_drm_driver_ioctls[] = {
>>>>         DRM_IOCTL_DEF_DRV(LIMA_CTX_FREE, lima_ioctl_ctx_free, DRM_RENDER_ALLOW),
>>>>  };
>>>>
>>>> -DEFINE_DRM_GEM_FOPS(lima_drm_driver_fops);
>>>> +static int lima_drm_driver_flush(struct file *filp, fl_owner_t id)
>>>> +{
>>>> +       struct drm_file *file = filp->private_data;
>>>> +       struct lima_drm_priv *priv = file->driver_priv;
>>>> +       long timeout = MAX_WAIT_SCHED_ENTITY_Q_EMPTY;
>>>> +
>>>> +       timeout = lima_ctx_mgr_flush(&priv->ctx_mgr, timeout);
>>>> +
>>>> +       return timeout >= 0 ? 0 : timeout;
>>>> +}
>>>> +
>>>> +static const struct file_operations lima_drm_driver_fops = {
>>>> +       .owner = THIS_MODULE,
>>>> +       .flush = lima_drm_driver_flush,
>>>> +       DRM_GEM_FOPS,
>>>> +};
>>>>
>>>>  /*
>>>>   * Changelog:
>>>> --
>>>> 2.49.0
>>>>


