lima_bo memory leak after drm_sched job destruction rework

Mon May 20 06:31:55 UTC 2019

The problem is simply that we only delete the jobs when we are able to
cancel the timeout handler.

What happens now is that, with an infinite timeout, the timeout handler
was never started in the first place, so there is nothing to cancel.
Just adding a simple "if
(sched->timeout != MAX_SCHEDULE_TIMEOUT &&...." should do the trick.
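
To illustrate the idea, here is a minimal user-space sketch of that check. It is not the scheduler code: struct model_sched, model_cancel_timeout() and both model_cleanup_jobs_*() functions are hypothetical stand-ins, and the rest of the real condition (elided above) is left out. It only assumes that cancelling the timeout work reports success when the work was actually pending.

```c
#include <limits.h>
#include <stdbool.h>

/* Stand-in for the kernel's MAX_SCHEDULE_TIMEOUT ("never time out"). */
#define MODEL_MAX_SCHEDULE_TIMEOUT LONG_MAX

/* Hypothetical, simplified state; not the real struct drm_gpu_scheduler. */
struct model_sched {
	long timeout;          /* job timeout; MODEL_MAX_SCHEDULE_TIMEOUT = infinite */
	bool timeout_pending;  /* timeout work is queued and has not started running */
	int finished_jobs;     /* jobs that completed and are waiting to be freed */
	int freed_jobs;        /* jobs that cleanup has already released */
};

/* Assumed cancel semantics: true only if pending work was cancelled. */
static bool model_cancel_timeout(struct model_sched *s)
{
	bool was_pending = s->timeout_pending;

	s->timeout_pending = false;
	return was_pending;
}

/* Behaviour as described: jobs are freed only if the cancel succeeded. */
static void model_cleanup_jobs_old(struct model_sched *s)
{
	if (!model_cancel_timeout(s))
		return; /* with an infinite timeout this is always taken */

	s->freed_jobs += s->finished_jobs;
	s->finished_jobs = 0;
}

/* Suggested check: only insist on the cancel when a timeout is in use. */
static void model_cleanup_jobs_fixed(struct model_sched *s)
{
	if (s->timeout != MODEL_MAX_SCHEDULE_TIMEOUT &&
	    !model_cancel_timeout(s))
		return; /* timeout handler may be running, leave the jobs alone */

	s->freed_jobs += s->finished_jobs;
	s->finished_jobs = 0;
}
```

With timeout set to MODEL_MAX_SCHEDULE_TIMEOUT and nothing pending, model_cleanup_jobs_old() frees nothing, while model_cleanup_jobs_fixed() releases all finished jobs. With a finite timeout whose work is no longer pending, both variants still back off.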

Regards,
Christian.

Am 18.05.19 um 01:50 schrieb Grodzovsky, Andrey:
> I don't have the code in front of me now, but as far as I remember it will only terminate prematurely in drm_sched_cleanup_jobs if there is timeout work in progress, which would not be the case if nothing hangs.
>
> Andrey
>
> ________________________________________
> From: Erico Nunes <nunes.erico at gmail.com>
> Sent: 17 May 2019 17:42:48
> To: Grodzovsky, Andrey
> Cc: Deucher, Alexander; Koenig, Christian; Zhou, David(ChunMing); David Airlie; Daniel Vetter; Lucas Stach; Russell King; Christian Gmeiner; Qiang Yu; Rob Herring; Tomeu Vizoso; Eric Anholt; Rex Zhu; Huang, Ray; Deng, Emily; Nayan Deshmukh; Sharat Masetty; amd-gfx at lists.freedesktop.org; dri-devel at lists.freedesktop.org; lima at lists.freedesktop.org
> Subject: Re: lima_bo memory leak after drm_sched job destruction rework
>
> On Fri, May 17, 2019 at 10:43 PM Grodzovsky, Andrey
> <Andrey.Grodzovsky at amd.com> wrote:
>> On 5/17/19 3:35 PM, Erico Nunes wrote:
>>> Lima currently defaults to an "infinite" timeout. Setting a 500ms
>>> default timeout like most other drm_sched users do fixed the leak for
>>> me.
>> I am not very clear about the problem - so you basically never allow a
>> timeout handler to run? And then when the job hangs forever you get
>> this memory leak? How did it work for you before this refactoring? As
>> far as I remember, sched->ops->free_job before this change was called
>> from drm_sched_job_finish, which is the work scheduled from the HW fence
>> signaled callback, drm_sched_process_job, so if your job hangs forever
>> anyway and this work is never scheduled, when was your free_job callback
>> called?
> In this particular case, the jobs run successfully, nothing hangs.
> Lima currently specifies an "infinite" timeout to the drm scheduler,
> so if a job did hang, it would hang forever, I suppose. But this
> is not the issue.
>
> If I understand correctly it worked well before the rework because the
> cleanup was triggered at the end of drm_sched_process_job
> independently of the timeout.
>
> What I'm observing now is that even when jobs run successfully, they
> are not cleaned up by the drm scheduler because drm_sched_cleanup_jobs
> seems to give up based on the status of a timeout worker.
> I would expect the timeout value to only be relevant in error/hung job cases.
>
> I will probably set the timeout to a reasonable value anyway, I just
> posted here to report that this can possibly be a bug in the drm
> scheduler after that rework.