lima_bo memory leak after drm_sched job destruction rework

Fri May 17 23:50:32 UTC 2019

Don't have the code in front of me now but as far as I remember it will only prematurely terminate in drm_sched_cleanup_jobs if there is timeout work in progress which would not be the case if nothing hangs.

Andrey

________________________________________
From: Erico Nunes <nunes.erico at gmail.com>
Sent: 17 May 2019 17:42:48
To: Grodzovsky, Andrey
Cc: Deucher, Alexander; Koenig, Christian; Zhou, David(ChunMing); David Airlie; Daniel Vetter; Lucas Stach; Russell King; Christian Gmeiner; Qiang Yu; Rob Herring; Tomeu Vizoso; Eric Anholt; Rex Zhu; Huang, Ray; Deng, Emily; Nayan Deshmukh; Sharat Masetty; amd-gfx at lists.freedesktop.org; dri-devel at lists.freedesktop.org; lima at lists.freedesktop.org
Subject: Re: lima_bo memory leak after drm_sched job destruction rework

[CAUTION: External Email]

On Fri, May 17, 2019 at 10:43 PM Grodzovsky, Andrey
<Andrey.Grodzovsky at amd.com> wrote:
> On 5/17/19 3:35 PM, Erico Nunes wrote:
> > Lima currently defaults to an "infinite" timeout. Setting a 500ms
> > default timeout like most other drm_sched users do fixed the leak for
> > me.
>
> I am not very clear about the problem - so you basically never allow a
> time out handler to run ? And then when the job hangs for ever you get
> this memory leak ? How it worked for you before this refactoring ? As
> far as I remember  sched->ops->free_job before this change was called
> from drm_sched_job_finish which is the work scheduled from HW fence
> signaled callback - drm_sched_process_job so if your job hangs for ever
> anyway and this work is never scheduled  when your free_job callback was
> called ?

In this particular case, the jobs run successfully, nothing hangs.
Lima currently specifies an "infinite" timeout to the drm scheduler,
so if a job did did hang, it would hang forever, I suppose. But this
is not the issue.

If I understand correctly it worked well before the rework because the
cleanup was triggered at the end of drm_sched_process_job
independently on the timeout.

What I'm observing now is that even when jobs run successfully, they
are not cleaned by the drm scheduler because drm_sched_cleanup_jobs
seems to give up based on the status of a timeout worker.
I would expect the timeout value to only be relevant in error/hung job cases.

I will probably set the timeout to a reasonable value anyway, I just
posted here to report that this can possibly be a bug in the drm
scheduler after that rework.