[Intel-gfx] [PATCH v4] drm/i915: Clean up associated VMAs on context destruction

Tue Oct 6 02:48:28 PDT 2015

On 06/10/15 10:34, Chris Wilson wrote:
> On Tue, Oct 06, 2015 at 11:28:49AM +0200, Daniel Vetter wrote:
>> On Tue, Oct 06, 2015 at 10:04:31AM +0100, Chris Wilson wrote:
>>> On Tue, Oct 06, 2015 at 10:58:01AM +0200, Daniel Vetter wrote:
>>>> On Mon, Oct 05, 2015 at 01:26:36PM +0100, Tvrtko Ursulin wrote:
>>>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>>
>>>>> Prevent leaking VMAs and PPGTT VMs when objects are imported
>>>>> via flink.
>>>>>
>>>>> Scenario is that any VMAs created by the importer will be left
>>>>> dangling after the importer exits, or destroys the PPGTT context
>>>>> with which they are associated.
>>>>>
>>>>> This is caused by object destruction not running when the
>>>>> importer closes the buffer object handle due the reference held
>>>>> by the exporter. This also leaks the VM since the VMA has a
>>>>> reference on it.
>>>>>
>>>>> In practice these leaks can be observed by stopping and starting
>>>>> the X server on a kernel with fbcon compiled in. Every time
>>>>> X server exits another VMA will be leaked against the fbcon's
>>>>> frame buffer object.
>>>>>
>>>>> Also on systems where flink buffer sharing is used extensively,
>>>>> like Android, this leak has even more serious consequences.
>>>>>
>>>>> This version is takes a general approach from the  earlier work
>>>>> by Rafael Barbalho (drm/i915: Clean-up PPGTT on context
>>>>> destruction) and tries to incorporate the subsequent discussion
>>>>> between Chris Wilson and Daniel Vetter.
>>>>>
>>>>> v2:
>>>>>
>>>>> Removed immediate cleanup on object retire - it was causing a
>>>>> recursive VMA unbind via i915_gem_object_wait_rendering. And
>>>>> it is in fact not even needed since by definition context
>>>>> cleanup worker runs only after the last context reference has
>>>>> been dropped, hence all VMAs against the VM belonging to the
>>>>> context are already on the inactive list.
>>>>>
>>>>> v3:
>>>>>
>>>>> Previous version could deadlock since VMA unbind waits on any
>>>>> rendering on an object to complete. Objects can be busy in a
>>>>> different VM which would mean that the cleanup loop would do
>>>>> the wait with the struct mutex held.
>>>>>
>>>>> This is an even simpler approach where we just unbind VMAs
>>>>> without waiting since we know all VMAs belonging to this VM
>>>>> are idle, and there is nothing in flight, at the point
>>>>> context destructor runs.
>>>>>
>>>>> v4:
>>>>>
>>>>> Double underscore prefix for __915_vma_unbind_no_wait and a
>>>>> commit message typo fix. (Michel Thierry)
>>>>>
>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>> Testcase: igt/gem_ppgtt.c/flink-and-exit-vma-leak
>>>>> Reviewed-by: Michel Thierry <michel.thierry at intel.com>
>>>>> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
>>>>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> Cc: Rafael Barbalho <rafael.barbalho at intel.com>
>>>>> Cc: Michel Thierry <michel.thierry at intel.com>
>>>>
>>>> Queued for -next, thanks for the patch.
>>>
>>> Please no, it's an awful patch and does not even fix the root cause of
>>> the leak (that the vma are not closed when the handle is).
>>
>> It's a lose-lose situation for me as maintainer (and this holds in general
>> really, not just for this patch):
>> - Either I wield my considerable maintainer powers and force proper
>>    solution, which will piss of a lot of people.
>> - Or I just merge intermediate stuff and piss of another set of people
>>    (including likely our all future selves because we're slowly digging a
>>    tech debt grave).
>>
>> What I can't do with maintainer fu is force collaboration, I can only try
>> to piss off everyone equally. And today the die rolled a "merge".
>
> Pity it didn't roll "what's the impact and where is the bugzilla?" :)

There is this one:

https://bugs.freedesktop.org/show_bug.cgi?id=87729

And I could raise another one for leaking VMAs on X.org restart? :)

Tvrtko