[Intel-gfx] [RFC] drm/i915: reference count batch object on requests
Daniel Vetter
daniel at ffwll.ch
Wed Dec 4 13:08:56 CET 2013
On Wed, Dec 4, 2013 at 12:24 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> On Tue, Dec 03, 2013 at 06:10:05PM +0100, Daniel Vetter wrote:
>> On Mon, Dec 02, 2013 at 05:31:53PM +0200, Mika Kuoppala wrote:
>> > We used to lean on active_list to handle the references
>> > to batch objects. But there are useful cases when same,
>> > albeit simple, batch can be executing on multiple rings
>> > concurrently. For this case the active_list reference count
>> > handling is just not enough as batch could be freed by
>> > ring A request retirement as it is still running on ring B.
>> >
>> > Fix this by doing proper batch_obj reference counting.
>> >
>> > Signed-off-by: Mika Kuoppala <mika.kuoppala at intel.com>
>> >
>> > Notes:
>> > This is a patch which ameliorates the
>> > [PATCH] tests/gem_reset_stats: add close-pending-fork
>> >
>> > Chris wasn't happy about the refcounting as it might hide
>> > the true problem. But I haven't been able to find the real culprit,
>> > thus the RFC.
>>
>> I think I understand the bug now, and your patch looks like the correct
>> fix. But we need to pimp the commit message.
>>
>> In i915_gem_reset_ring_lists we reset requests and move objects to the
>> inactive list. Which means if the active list is the last one to hold a
>> reference, the object will disappear.
>>
>> Now the problem is that we do this per-ring, and not in the order that the
>> objects would have been retired if the gpu wouldn't have hung. E.g. if a
>> batch is active on both ring 1&2 but was last active on ring 1, then we'd
>> free it before we go ahead with cleaning up the requests for ring 2.
>>
>> So reference-counting the batch_obj pointers looks like the real fix.
>
> No. The bug only exists in i915_gem_reset() and should not impact
> anywhere else.
Well fixing the bug in i915_gem_reset would be lots more work and
rather fragile - we'd need to retire requests in the correct order.
That level of fanciness is generally not something I like to see in
less-tested code like the reset paths.
Also simply holding a reference for each pointer is established best
practice - atm we have a big comment explaining why it should work by
holding an implicit reference through the active list. With the
refcounting we'll simply trade that complexity (and the code comment
explaining things) in with another tricky case.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
More information about the Intel-gfx
mailing list