lima_bo memory leak after drm_sched job destruction rework

Fri May 17 20:43:15 UTC 2019

On 5/17/19 3:35 PM, Erico Nunes wrote:
> Hello,
>
> I have recently observed a memory leak issue with lima using
> drm-misc-next, which I initially reported here:
> https://gitlab.freedesktop.org/lima/linux/issues/24
> It is an easily reproducible memory leak which I was able to bisect to commit:
>
> 5918045c4ed4 drm/scheduler: rework job destruction
>
> After some investigation, it seems that after the refactor,
> sched->ops->free_job (in lima: lima_sched_free_job) is no longer
> called.
> With some more debugging I found that it is not being called because
> the job freeing is now in drm_sched_cleanup_jobs, which for lima
> always aborts in the initial "Don't destroy jobs while the timeout
> worker is running" condition.
>
> Lima currently defaults to an "infinite" timeout. Setting a 500ms
> default timeout like most other drm_sched users do fixed the leak for
> me.
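
To restate the report in concrete terms, here is a minimal user-space
sketch of the behaviour described above. It is not the drm_sched code.
It assumes that the cleanup path returns early whenever cancelling the
timeout work fails, and that with an "infinite" timeout the timeout
work is never armed, so the cancel always fails. All names (toy_sched,
toy_start_timeout, toy_cancel_timeout_work, toy_cleanup_jobs, toy_run,
TOY_TIMEOUT_INFINITE) are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <limits.h>

/* Hypothetical stand-in for an "infinite" timeout value. */
#define TOY_TIMEOUT_INFINITE LONG_MAX

struct toy_sched {
	long timeout;         /* timeout in ms, or TOY_TIMEOUT_INFINITE */
	bool timeout_pending; /* true while the timeout work is armed */
	int finished_jobs;    /* jobs done on the hardware, not yet freed */
	int freed_jobs;       /* jobs released through the free path */
};

/* Assumption: the timeout work is only armed for a finite timeout. */
static void toy_start_timeout(struct toy_sched *s)
{
	if (s->timeout != TOY_TIMEOUT_INFINITE)
		s->timeout_pending = true;
}

/* Returns true only if there was pending work to cancel. */
static bool toy_cancel_timeout_work(struct toy_sched *s)
{
	bool was_pending = s->timeout_pending;

	s->timeout_pending = false;
	return was_pending;
}

/* Assumption: cleanup gives up whenever the cancel above fails. */
static void toy_cleanup_jobs(struct toy_sched *s)
{
	if (!toy_cancel_timeout_work(s))
		return;

	s->freed_jobs += s->finished_jobs;
	s->finished_jobs = 0;
}

/* Run n jobs to completion and return how many were never freed. */
static int toy_run(long timeout, int n)
{
	struct toy_sched s = { .timeout = timeout };

	for (int i = 0; i < n; i++) {
		toy_start_timeout(&s);
		s.finished_jobs++;      /* job completes successfully */
		toy_cleanup_jobs(&s);
	}
	return s.finished_jobs;
}
```

Under these assumptions toy_run(500, 3) returns 0, while
toy_run(TOY_TIMEOUT_INFINITE, 3) returns 3: every job completes but
none is ever freed, which would match the leak seen with lima.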

I am not very clear about the problem. So you basically never allow the
timeout handler to run? And then, when a job hangs forever, you get
this memory leak? How did it work for you before this refactoring? As
far as I remember, before this change sched->ops->free_job was called
from drm_sched_job_finish, which is the work scheduled from the HW fence
signaled callback, drm_sched_process_job. So if your job hangs forever
anyway and this work is never scheduled, when was your free_job callback
called?

>
> I can send a patch to set a 500ms timeout and have it probably working
> again, but I am wondering now if this is expected behaviour for
> drm_sched after the refactor.
> In particular I also noticed that drm_sched_suspend_timeout is not
> called anywhere. Is it expected that we now rely on a timeout
> parameter to cleanup jobs that ran successfully?

AFAIK drm_sched_suspend_timeout is used by a driver in a staging
branch; Christian can give more detail.

Andrey

>
> Erico