[Intel-gfx] [PATCH v3] drm/i915: Ensure associated VMAs are inactive when contexts are destroyed

Thu Nov 19 01:42:17 PST 2015

On 19/11/15 09:17, Daniel Vetter wrote:
> On Wed, Nov 18, 2015 at 05:18:30PM +0000, Tvrtko Ursulin wrote:
>>
>> On 17/11/15 17:56, Daniel Vetter wrote:
>>> On Tue, Nov 17, 2015 at 05:24:01PM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 17/11/15 17:08, Daniel Vetter wrote:
>>>>> On Tue, Nov 17, 2015 at 04:54:50PM +0000, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 17/11/15 16:39, Daniel Vetter wrote:
>>>>>>> On Tue, Nov 17, 2015 at 04:27:12PM +0000, Tvrtko Ursulin wrote:
>>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>>>>>
>>>>>>>> In the following commit:
>>>>>>>>
>>>>>>>>       commit e9f24d5fb7cf3628b195b18ff3ac4e37937ceeae
>>>>>>>>       Author: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>>>>>       Date:   Mon Oct 5 13:26:36 2015 +0100
>>>>>>>>
>>>>>>>>           drm/i915: Clean up associated VMAs on context destruction
>>>>>>>>
>>>>>>>> I added a WARN_ON assertion that VM's active list must be empty
>>>>>>>> at the time of owning context is getting freed, but that turned
>>>>>>>> out to be a wrong assumption.
>>>>>>>>
>>>>>>>> Due ordering of operations in i915_gem_object_retire__read, where
>>>>>>>> contexts are unreferenced before VMAs are moved to the inactive
>>>>>>>> list, the described situation can in fact happen.
>>>>>>>>
>>>>>>>> It feels wrong to do things in such order so this fix makes sure
>>>>>>>> a reference to context is held until the move to inactive list
>>>>>>>> is completed.
>>>>>>>>
>>>>>>>> v2: Rather than hold a temporary context reference move the
>>>>>>>>       request unreference to be the last operation. (Daniel Vetter)
>>>>>>>>
>>>>>>>> v3: Fix use after free. (Chris Wilson)
>>>>>>>>
>>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>>>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92638
>>>>>>>> Cc: Michel Thierry <michel.thierry at intel.com>
>>>>>>>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>>>>>>>> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
>>>>>>>> ---
>>>>>>>>    drivers/gpu/drm/i915/i915_gem.c | 33 ++++++++++++++++++---------------
>>>>>>>>    1 file changed, 18 insertions(+), 15 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>>>>>> index 98c83286ab68..094ac17a712d 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>>>>> @@ -2404,29 +2404,32 @@ i915_gem_object_retire__read(struct drm_i915_gem_object *obj, int ring)
>>>>>>>>    	RQ_BUG_ON(!(obj->active & (1 << ring)));
>>>>>>>>
>>>>>>>>    	list_del_init(&obj->ring_list[ring]);
>>>>>>>> -	i915_gem_request_assign(&obj->last_read_req[ring], NULL);
>>>>>>>>
>>>>>>>>    	if (obj->last_write_req && obj->last_write_req->ring->id == ring)
>>>>>>>>    		i915_gem_object_retire__write(obj);
>>>>>>>>
>>>>>>>>    	obj->active &= ~(1 << ring);
>>>>>>>> -	if (obj->active)
>>>>>>>> -		return;
>>>>>>>
>>>>>>> 	if (obj->active) {
>>>>>>> 		i915_gem_request_assign(&obj->last_read_req[ring], NULL);
>>>>>>> 		return;
>>>>>>> 	}
>>>>>>>
>>>>>>> Would result in less churn in the code and drop the unecessary indent
>>>>>>> level. Also comment is missing as to why we need to do things in a
>>>>>>> specific order.
>>>>>>
>>>>>> Actually I think I changed my mind and that v1 is the way to go.
>>>>>>
>>>>>> Just re-ordering the code here still makes it possible for the context
>>>>>> destructor to run with VMAs on the active list I think.
>>>>>>
>>>>>> If we hold the context then it is 100% clear it is not possible.
>>>>>
>>>>> request_assign _is_ the function which adjust the refcounts for us, which
>>>>> means if we drop that reference too early then grabbing a temp reference
>>>>> is just papering over the real bug.
>>>>>
>>>>> Written out your patch looks something like
>>>>>
>>>>> 	a_reference(a);
>>>>> 	a_unreference(a);
>>>>>
>>>>> 	/* more cleanup code that should get run before a_unreference but isn't */
>>>>>
>>>>> 	a_unrefernce(a); /* for real this time */
>>>>>
>>>>> Unfortunately foo_assign is a new pattern and not well-established, so
>>>>> that connection isn't clear. Maybe we should rename it to
>>>>> foo_reference_assign to make it more obvious. Or just drop the pretense
>>>>> and open-code it since we unconditionally assign NULL as the new pointer
>>>>> value, and we know the current value of the pointer is non-NULL. So
>>>>> there's really no benefit to the helper here, it only obfuscates. And
>>>>> since that obfuscation tripped you up it's time to remove it ;-)
>>>>
>>>> Then foo_reference_unreference_assign. :)
>>>>
>>>> But seriously, I think it is more complicated that..
>>>>
>>>> The thing it trips over is that moving VMAs to inactive does not correspond
>>>> in time to request retirement. But in fact VMAs are moved to inactive only
>>>> when all requests associated with an object are done.
>>>>
>>>> This is the unintuitive thing I was working around. To make sure when
>>>> context destructor runs there are not active VMAs for that VM.
>>>>
>>>> I don't know how to guarantee that with what you propose. Perhaps I am
>>>> missing something?
>>>
>>> Ok, my example was slightly off, since we have 2 objects:
>>>
>>> 	b_reference(a->b);
>>> 	a_unreference(a); /* might unref a->b if it's the last reference */
>>>
>>> 	/* more cleanup code that should get run before a_unreference but isn't */
>>>
>>> 	b_unrefernce(a->b); /* for real this time */
>>>
>>> Holding the ref to a makes sure that b doesn't disappear. We rely on that
>>> in a fundamental way (a request really needs the ctx to stick around), and
>>> the bug really is that we drop the ref to a too early. That it's the
>>> releasing of a->b which is eventually blowing things up doesn't really
>>> matter.
>>>
>>> Btw would it be possible to have an igt for this? I should be possible to
>>> hit this with some varian of gem_unref_active_buffers.
>>
>> I was trying to do that today and it is proving to be a bit tricky.
>>
>> I need a blitter workload which will run for long enough for the retire
>> worker to run. So I'll try and build a bit bb tomorrow which will do that.
>>
>> Alternatively I did this:
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
>> index 6ed7d63a0688..db51e4b42a20 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
>> @@ -699,6 +699,8 @@ i915_gem_execbuffer_reserve(struct intel_engine_cs *ring,
>>          int retry;
>>
>>          i915_gem_retire_requests_ring(ring);
>> +       if (i915.enable_execlists)
>> +               intel_execlists_retire_requests(ring);
>>
>>          vm = list_first_entry(vmas, struct i915_vma, exec_list)->vm;
>>
>> And that enables me to trigger the WARN from my igt even with the current
>> shorter blitter workload (copy 64Mb).
>>
>> Going back to the original problem, how about something like this hunk for a fix?
>>
>> @@ -2413,19 +2416,36 @@ static void
>>   i915_gem_object_retire__read(struct drm_i915_gem_object *obj, int ring)
>>   {
>>          struct i915_vma *vma;
>> +       struct i915_hw_ppgtt *ppgtt;
>>
>>          RQ_BUG_ON(obj->last_read_req[ring] == NULL);
>>          RQ_BUG_ON(!(obj->active & (1 << ring)));
>>
>>          list_del_init(&obj->ring_list[ring]);
>> +
>> +       ppgtt = obj->last_read_req[ring]->ctx->ppgtt;
>> +       if (ppgtt) {
>> +               list_for_each_entry(vma, &obj->vma_list, vma_link) {
>> +                       if (vma->vm == &ppgtt->base &&
>> +                           !list_empty(&vma->mm_list)) {
>> +                               list_move_tail(&vma->mm_list,
>> +                                              &vma->vm->inactive_list);
>> +                       }
>> +               }
>> +       }
>> +
>>          i915_gem_request_assign(&obj->last_read_req[ring], NULL);
>>
>> This moves VMAs immediately to inactive as requests are retired and avoids
>> the problem with them staying on active for undefined amount of time.
>
> You can't put active objects onto the inactive list, i.e. the obj->active

Hm it is inactive in this VM so who would care?

Perhaps then the fix is simply to remove the 
WARN_ON(!list_empty(&ppgtt->base.active_list)) from the context destructor.

If there are active VMAs at that point, they'll get cleaned up when they 
are retired and there is no leak.

> check is non-optional. And the if (ppgtt) case is abstraction violation.
> I really don't get why we can't just move the unref to the right place ...

I don't see where.

Regards,

Tvrtko