lima_bo memory leak after drm_sched job destruction rework

Erico Nunes nunes.erico at gmail.com
Fri May 17 21:42:48 UTC 2019


On Fri, May 17, 2019 at 10:43 PM Grodzovsky, Andrey
<Andrey.Grodzovsky at amd.com> wrote:
> On 5/17/19 3:35 PM, Erico Nunes wrote:
> > Lima currently defaults to an "infinite" timeout. Setting a 500ms
> > default timeout like most other drm_sched users do fixed the leak for
> > me.
>
> I am not quite clear on the problem - so you basically never allow a
> timeout handler to run? And then when a job hangs forever you get this
> memory leak? How did it work for you before this refactoring? As far
> as I remember, before this change sched->ops->free_job was called from
> drm_sched_job_finish, which is the work scheduled from the HW fence
> signaled callback - drm_sched_process_job. So if your job hangs forever
> anyway and that work is never scheduled, when was your free_job
> callback called?

In this particular case, the jobs run successfully, nothing hangs.
Lima currently specifies an "infinite" timeout to the drm scheduler,
so if a job did hang, I suppose it would hang forever. But that is not
the issue here.
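
For reference, the timeout is just the value handed to drm_sched_init()
at init time. A rough sketch of what I mean (the wrapper name, the 500ms
value and the hang_limit of 0 are illustrative, not the actual lima code):

#include <drm/gpu_scheduler.h>
#include <linux/jiffies.h>
#include <linux/sched.h>

/*
 * Sketch: pass either an "infinite" timeout (MAX_SCHEDULE_TIMEOUT, what
 * lima does today) or a finite one (e.g. 500ms) to the scheduler.
 */
static int example_pipe_init(struct drm_gpu_scheduler *sched,
                             const struct drm_sched_backend_ops *ops,
                             const char *name, int timeout_ms)
{
        /* timeout_ms <= 0 means "infinite", i.e. never arm the timeout */
        long timeout = timeout_ms <= 0 ? MAX_SCHEDULE_TIMEOUT
                                       : msecs_to_jiffies(timeout_ms);

        return drm_sched_init(sched, ops,
                              1,        /* hw_submission */
                              0,        /* hang_limit */
                              timeout,  /* job timeout */
                              name);
}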

If I understand correctly, it worked well before the rework because the
cleanup was triggered at the end of drm_sched_process_job, independently
of the timeout.
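
Roughly, this is the pre-rework flow as I read the description above
(simplified sketch with made-up function names, not verbatim kernel code):

/*
 * Before the rework: when the HW fence signaled, a per-job finish work
 * was scheduled, and that work called free_job() - independently of the
 * timeout setting.
 */
static void old_hw_fence_signaled(struct drm_sched_job *s_job)
{
        /* ... signal the scheduler fence ... */
        schedule_work(&s_job->finish_work);     /* runs old_job_finish() */
}

static void old_job_finish(struct work_struct *work)
{
        struct drm_sched_job *s_job =
                container_of(work, struct drm_sched_job, finish_work);

        /* ... cancel the timeout work, unlink the job ... */
        s_job->sched->ops->free_job(s_job);     /* job memory released here */
}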

What I'm observing now is that even when jobs run successfully, they are
never cleaned up by the drm scheduler, because drm_sched_cleanup_jobs
seems to give up based on the status of the timeout worker.
I would expect the timeout value to only be relevant in error/hung-job cases.
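
To illustrate, this is my reading of that interaction (a reconstruction
of the logic with made-up function names, not a verbatim copy of the
code; field names may be slightly off):

/*
 * With an "infinite" timeout the timeout work is never queued at all, so
 * cancel_delayed_work() below always returns false and the cleanup path
 * returns before free_job() is reached - even for jobs that completed
 * successfully.
 */
static void sketch_start_timeout(struct drm_gpu_scheduler *sched)
{
        if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
            !list_empty(&sched->ring_mirror_list))
                schedule_delayed_work(&sched->work_tdr, sched->timeout);
}

static void sketch_cleanup_jobs(struct drm_gpu_scheduler *sched)
{
        /*
         * "Don't destroy jobs while the timeout worker is running" - but
         * cancel_delayed_work() also returns false when the work was never
         * queued, which is always the case with an infinite timeout.
         */
        if (!cancel_delayed_work(&sched->work_tdr))
                return;

        /* ... pop finished jobs and call sched->ops->free_job() on them ... */
}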

I will probably set the timeout to a reasonable value anyway; I just
posted here to report that this may be a bug in the drm scheduler
introduced by that rework.
