[Intel-gfx] [PATCH v4 4/4] drm/i915: Fix premature LRC unpin in GuC mode

Thu Jan 21 04:32:10 PST 2016

On Thu, Jan 21, 2016 at 12:14:10PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> 
> In GuC mode LRC pinning lifetime depends exclusively on the
> request lifetime. Since that is terminated by the seqno update,
> it opens up a race condition between the GPU finishing writing
> out the context image and the driver unpinning the LRC.
> 
> To extend the LRC lifetime we will employ a similar approach
> to what legacy ringbuffer submission does.
> 
> We will start tracking the last submitted context per engine
> and keep it pinned until it is replaced by another one.
> 
> Note that the driver unload path is a bit fragile and could
> benefit greatly from efforts to unify the legacy and exec
> list submission code paths.
> 
> At the moment i915_gem_context_fini has special casing for the
> two which is potentially not needed, and it also depends on
> i915_gem_cleanup_ringbuffer running before itself.
> 
> v2:
>  * Move pinning into engine->emit_request and actually fix
>    the reference/unreference logic. (Chris Wilson)
> 
>  * ring->dev can be NULL on driver unload so use a different
>    route towards it.
> 
> v3:
>  * Rebase.
>  * Handle the reset path. (Chris Wilson)
>  * Exclude default context from the pinning - it is impossible
>    to get it right before default context special casing in
>    general is eliminated.
> 
> v4:
>  * Rebased & moved context tracking to
>    intel_logical_ring_advance_and_submit.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Issue: VIZ-4277
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Nick Hoath <nicholas.hoath at intel.com>
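
For readers following the thread, the tracking scheme in the quoted commit
message (keep the last submitted context pinned per engine until another
one replaces it) might look roughly like the sketch below. All names here
(struct ctx, struct engine, engine_track_submit, engine_track_release) are
hypothetical stand-ins for illustration, not the i915 structures or the
code in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the i915 structures. */
struct ctx {
	int pin_count;	/* > 0 means the context image must stay resident */
};

struct engine {
	struct ctx *last_context;	/* most recently submitted, kept pinned */
};

static void ctx_pin(struct ctx *c)
{
	c->pin_count++;
}

static void ctx_unpin(struct ctx *c)
{
	assert(c->pin_count > 0);
	c->pin_count--;
}

/*
 * At submission: take an extra pin on the new context first, then drop
 * the pin held on the previously tracked one. Resubmitting the same
 * context leaves the pin count unchanged.
 */
static void engine_track_submit(struct engine *e, struct ctx *c)
{
	if (e->last_context == c)
		return;

	ctx_pin(c);
	if (e->last_context)
		ctx_unpin(e->last_context);
	e->last_context = c;
}

/* On reset or unload: release whatever is still being tracked. */
static void engine_track_release(struct engine *e)
{
	if (e->last_context) {
		ctx_unpin(e->last_context);
		e->last_context = NULL;
	}
}
```

The point of the extra pin is that the seqno completing no longer decides
when the context image may be unpinned; only a later submission, or an
explicit release, does.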

Whilst it saddens me to see yet another (impossible) special case added
that will just have to be deleted again, the series is
Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>

I wonder if it is possible to poison the context objects before and
after, then do a deferred check for stray writes, and use that mode for
igt/gem_ctx_* (with some tests targeting active->idle vs
context-close). It would still be susceptible to timing, as we need to
hit the interval between the seqno being complete and the delayed context
save, but that seems like the most reliable way to detect the error?
-Chris
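
One possible reading of the poison-and-check idea, as a minimal userspace
sketch: fill a buffer standing in for a released context image with a known
pattern, then scan it later for bytes that changed. The pattern value and
the helper names (POISON_BYTE, poison_fill, poison_first_stray) are
assumptions made for illustration; nothing here is an existing i915 or igt
interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POISON_BYTE 0xc5	/* arbitrary pattern, chosen for this sketch */

/* Overwrite the whole buffer with the poison pattern. */
static void poison_fill(uint8_t *buf, size_t len)
{
	memset(buf, POISON_BYTE, len);
}

/*
 * Deferred check: return the offset of the first byte that no longer
 * holds the poison pattern, or -1 if the buffer is intact.
 */
static long poison_first_stray(const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (buf[i] != POISON_BYTE)
			return (long)i;
	}
	return -1;
}
```

A late context save landing in the buffer would show up as a non-negative
offset, unless the written bytes happen to equal the pattern, so such a
check can miss writes but should not report false positives.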

-- 
Chris Wilson, Intel Open Source Technology Centre