[Intel-gfx] [PATCH v1] drm/i915: Fix a false alert of memory leak when free LRC

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Nov 24 02:46:21 PST 2015


On 23/11/15 22:30, Yu Dai wrote:
> On 11/23/2015 02:34 AM, Tvrtko Ursulin wrote:
>> On 20/11/15 08:31, Daniel Vetter wrote:
>> > On Thu, Nov 19, 2015 at 04:10:26PM -0800, yu.dai at intel.com wrote:
>> >> From: Alex Dai <yu.dai at intel.com>
>> >>
>> >> There is a memory leak warning message from i915_gem_context_clean
>> >> when GuC submission is enabled. The reason is that when the LRC is
>> >> released, its ppgtt could still be referenced. The assumption that
>> >> all VMAs are unbound during release of the LRC is not true.
>> >>
>> >> v1: Move the code inside i915_gem_context_clean() to where the ppgtt is
>> >> released, because it is not really cleaning the context anyway but the ppgtt.
>> >>
>> >> Signed-off-by: Alex Dai <yu.dai at intel.com>
>> >
>> > retire__read drops the ctx (and hence ppgtt) reference too early,
>> > resulting in us hitting the WARNING. See the giant thread with Tvrtko,
>> > Chris and me:
>> >
>> > http://www.spinics.net/lists/intel-gfx/msg78918.html
>> >
>> > Would be great if someone could test the diff I posted in there.
>>
>> It doesn't work - I have posted my IGT snippet which I thought
>> explained it.
>
> I thought moving the VMA list cleanup into i915_ppgtt_release() should
> work. However, it creates a chicken & egg problem: ppgtt_release() relies
> on vma_unbound() being called first to decrease its refcount, so calling
> vma_unbound() inside ppgtt_release() is not right.
>> The problem is the request unreference in the obj->active case. When it hits
>> that path it will not move the VMA to inactive, and
>> intel_execlists_retire_requests will be the last unreference from the
>> retire worker, which will trigger the WARN.
>
> I still think the problem comes from the assumption that when the lrc is
> released, all of its VMAs should be unbound. Specifically, I mean the comment
> made for i915_gem_context_clean() - "This context is going away and we
> need to remove all VMAs still around." Really the lrc life cycle is
> different from that of the ppgtt / VMAs. Check the line after
> i915_gem_context_clean(): it is ppgtt_put(). In the case where the lrc is
> freed early, it won't release the ppgtt anyway because the ppgtt is still
> referenced by VMAs. And it will be freed when there are no more references
> from GEM objects.

Yes, but that last point is just a consequence of how the code was laid out, 
not a hard requirement, as far as I understand it.
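
For reference, the destructor ordering Alex refers to is roughly this 
(paraphrased from memory, not the exact upstream code, so treat the details 
as approximate):

/*
 * Rough sketch of the context destructor: i915_gem_context_clean() runs
 * first, then the ppgtt reference is dropped. If bound VMAs still hold
 * references on the ppgtt, i915_ppgtt_put() will not actually free it here.
 */
void i915_gem_context_free(struct kref *ctx_ref)
{
	struct intel_context *ctx = container_of(ctx_ref, typeof(*ctx), ref);

	if (i915.enable_execlists)
		intel_lr_context_free(ctx);

	/* "This context is going away and we need to remove all VMAs
	 * still around." */
	i915_gem_context_clean(ctx);

	i915_ppgtt_put(ctx->ppgtt);

	/* ...rest of the teardown... */
	kfree(ctx);
}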

When the context destructor runs, that means two things:

1. The GPU is done with all VMAs belonging to the VM.
2. Userspace cannot reach them any more either.

So VMAs have no reason to exist beyond that point. If they do, they can 
be left dangling under certain circumstances. See below.

>> I posted an IGT which hits that ->
>> http://patchwork.freedesktop.org/patch/65369/
>>
>> And posted one that gives up on the active VMA mem leak ->
>> http://patchwork.freedesktop.org/patch/65529/
>
> This patch will silence the warning. But I think i915_gem_context_clean()
> itself is unnecessary. I don't see any issue with deleting it. The check of the
> VMA list is inside ppgtt_release(), and the unbinding should be aligned with
> the GEM object's life cycle, not the lrc life cycle.

i915_gem_context_clean solves a memory leak which is explained in the 
commit.

If there is an extra reference on the GEM obj, like in the flink case, 
VMAs will not get unbound.
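
To recap, the cleanup does roughly the following (again paraphrased from 
memory, not verbatim): walk the VM's inactive VMA list and unbind whatever 
is left, so that flinked objects do not keep their VMAs, and hence the 
ppgtt, around forever:

static void i915_gem_context_clean(struct intel_context *ctx)
{
	struct i915_hw_ppgtt *ppgtt = ctx->ppgtt;
	struct i915_vma *vma, *next;

	if (!ppgtt)
		return;

	/* The WARN discussed below: it assumes nothing in this VM can
	 * still be on the active list at this point. */
	WARN_ON(!list_empty(&ppgtt->base.active_list));

	/* Unbind leftover inactive VMAs, e.g. from flinked objects whose
	 * handle close did not unbind them. */
	list_for_each_entry_safe(vma, next, &ppgtt->base.inactive_list,
				 mm_list) {
		if (WARN_ON(__i915_vma_unbind_no_wait(vma)))
			break;
	}
}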

Apart from various IGTs, you can also demonstrate this leak with fbcon 
and Xorg. Every time you re-start Xorg one VMA will be leaked since it 
imports the fbcon fb via flink.

The second part of the story is that the WARN in i915_gem_context_clean is 
wrong because of how retirement works. So even if we removed the WARN, the 
cleanup would still have value in avoiding the memory leak in the Xorg case 
described above, or in any case where flink is in the picture and the VMAs 
have been retired to inactive.

The bug left standing would be that if the VMA happens to be flinked and on 
the active list, because of, say, being shared between rings and contexts, 
it would still be leaked. But that leaks less than without the cleanup.

I proposed another way of fixing that: 
http://patchwork.freedesktop.org/patch/65861/

>> I have no idea yet of GuC implications, I just spotted this parallel
>> thread.
>>
>> And Mika has proposed something interesting - that we could just clean
>> up the active VMA in context cleanup since we know it is a false one.
>>
>> However, again I don't know how that interacts with the GuC. Surely it
>> cannot be freeing the context with stuff genuinely still active in the
>> GuC?
>>
>
> There is no interaction with the GuC though. It is just very easy to see the
> warning when GuC is enabled, say when running gem_close_race. The reason is
> that the GuC does not use the execlist_queue (execlist_retired_req_list),
> which is deferred to the retire worker. Just as in ring submission mode, when
> GuC is enabled, whenever the driver submits a new batch it will try to release
> the previous request. I don't know why intel_execlists_retire_requests is
> not called for this case; probably because of the unpin. Deferring the
> retirement may just hide the issue. I bet you will see the warning more
> often if you change i915_gem_retire_requests_ring() to
> i915_gem_retire_requests() in i915_gem_execbuffer_reserve().

Hmmm.. gem_close_race uses flink, but why would it trigger context 
destruction with active VMAs? What does the backtrace from the context 
cleanup WARN look like?

Regards,

Tvrtko
