lima_bo memory leak after drm_sched job destruction rework
Andrey.Grodzovsky at amd.com
Fri May 17 20:43:15 UTC 2019
On 5/17/19 3:35 PM, Erico Nunes wrote:
> I have recently observed a memory leak issue with lima using
> drm-misc-next, which I initially reported here:
> It is an easily reproducible memory leak which I was able to bisect to commit:
> 5918045c4ed4 drm/scheduler: rework job destruction
> After some investigation, it seems that after the refactor,
> sched->ops->free_job (in lima: lima_sched_free_job) is no longer called.
> With some more debugging I found that it is not being called because
> the job freeing is now in drm_sched_cleanup_jobs, which for lima
> always aborts in the initial "Don't destroy jobs while the timeout
> worker is running" condition.
> Lima currently defaults to an "infinite" timeout. Setting a 500ms
> default timeout like most other drm_sched users do fixed the leak for me.
I am not clear about the problem. So you basically never allow the
timeout handler to run? And then, when a job hangs forever, you get
this memory leak? How did it work for you before this refactoring? As
far as I remember, before this change sched->ops->free_job was called
from drm_sched_job_finish, which is the work scheduled from the HW fence
signaled callback, drm_sched_process_job. So if your job hangs forever
anyway and this work is never scheduled, when was your free_job callback
called?
> I can send a patch to set a 500ms timeout and have it probably working
> again, but I am wondering now if this is expected behaviour for
> drm_sched after the refactor.
> In particular I also noticed that drm_sched_suspend_timeout is not
> called anywhere. Is it expected that we now rely on a timeout
> parameter to clean up jobs that ran successfully?
AFAIK drm_sched_suspend_timeout is used by a driver in a staging
branch; Christian can give more detail.
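
For reference, the behaviour Erico describes (cleanup always bailing out at
the "Don't destroy jobs while the timeout worker is running" check when the
timeout is infinite) can be sketched as a small userspace model. This is not
the scheduler code. It assumes the check is a cancel-style call that only
succeeds when timeout work was actually queued, and that no timeout work is
queued at all for an infinite timeout. Every model_* name, MODEL_MAX_TIMEOUT
and the job counters are hypothetical stand-ins for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an "infinite" scheduler timeout. */
#define MODEL_MAX_TIMEOUT LONG_MAX

struct model_sched {
	long timeout;       /* job timeout; MODEL_MAX_TIMEOUT means none */
	bool tdr_pending;   /* is the timeout work currently queued? */
	int finished_jobs;  /* jobs that completed and await freeing */
	int freed_jobs;     /* jobs handed to the free_job callback */
};

/* Assumption: timeout work is only queued for a finite timeout. */
static void model_start_timeout(struct model_sched *s)
{
	s->tdr_pending = (s->timeout != MODEL_MAX_TIMEOUT);
}

/* Cancel-style helper: true only if queued work was cancelled. */
static bool model_cancel_timeout(struct model_sched *s)
{
	bool was_pending = s->tdr_pending;

	s->tdr_pending = false;
	return was_pending;
}

/* Model of the cleanup step described in the report. */
static void model_cleanup_jobs(struct model_sched *s)
{
	/* "Don't destroy jobs while the timeout worker is running" */
	if (!model_cancel_timeout(s))
		return;

	s->freed_jobs += s->finished_jobs;
	s->finished_jobs = 0;
}

/* Run njobs jobs to completion; return how many were never freed. */
static int model_leaked_after(long timeout, int njobs)
{
	struct model_sched s = { .timeout = timeout };
	int i;

	assert(njobs >= 0);
	for (i = 0; i < njobs; i++) {
		model_start_timeout(&s);  /* job is pushed to the hardware */
		s.finished_jobs++;        /* job completes successfully */
		model_cleanup_jobs(&s);   /* scheduler tries to free it */
	}
	return s.finished_jobs;
}
```

Under these assumptions, model_leaked_after(MODEL_MAX_TIMEOUT, n) leaves all n
jobs unfreed, while a finite value such as 500 leaves none, which would match
what Erico reports for lima with and without the 500ms default.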