[Intel-gfx] [PATCH 4/4] drm/i915: Late request cancellations are harmful

Chris Wilson chris at chris-wilson.co.uk
Mon Apr 11 14:19:52 UTC 2016


On Mon, Apr 11, 2016 at 02:50:17PM +0100, Tvrtko Ursulin wrote:
> 
> On 09/04/16 10:27, Chris Wilson wrote:
> >Conceptually, each request is a record of a hardware transaction - we
> >build up a list of pending commands and then either commit them to
> >hardware, or cancel them. However, whilst building up the list of
> >pending commands, we may modify state outside of the request and make
> >references to the pending request. If we do so and then cancel that
> >request, external objects then point to the deleted request leading to
> >both graphical and memory corruption.
> >
> >The easiest example is to consider object/VMA tracking. When we mark an
> >object as active in a request, we store a pointer to this, the most
> >recent request, in the object. Then we want to free that object, we wait
> >for the most recent request to be idle before proceeding (otherwise the
> >hardware will write to pages now owned by the system, or we will attempt
> >to read from those pages before the hardware is finished writing). If
> >the request was cancelled instead, that wait completes immediately. As a
> >result, all requests must be committed and not cancelled if the external
> >state is unknown.
> 
> This was a bit hard to figure out.
> 
> So we cannot unwind because once we set last_read_req we lose the
> data on what was the previous one, before this transaction started?
> 
> Intuitively I don't like the idea of sending unfinished stuff to the
> GPU, when it failed at some random point in ring buffer preparation.

Yes, but it is not unfinished though. The mm-switch is made and
flagged as completed, etc. The request does contain instructions for the
state changes it has made so far.
 
> So I am struggling with reviewing this as I have in the previous round.
> 
> >All that remains of i915_gem_request_cancel() users are just a couple of
> >extremely unlikely allocation failures, so remove the API entirely.
> 
> This parts feels extra weird because in the non-execbuf cases we
> actually can cancel the transaction without any issues, correct?

Nope. Same problem arises in that they do not know what happens
underneath calls to e.g. object_sync. Before cancellation you have to
inspect every path now and in the future to be sure that no state was
modified inside the request.
 
> Would middle-ground be to keep the cancellations for in-kernel
> submits, and for execbuf rewind the ringbuf so only request
> post-amble is sent to the GPU?

The only place that knows whether an external observer is the request
code. I am thinking about doing the cancellation there, I just need to
check that the lockless idling will be ok with that.

> >@@ -3410,12 +3404,9 @@ int i915_gpu_idle(struct drm_device *dev)
> >  				return PTR_ERR(req);
> >
> >  			ret = i915_switch_context(req);
> >-			if (ret) {
> >-				i915_gem_request_cancel(req);
> >-				return ret;
> >-			}
> >-
> >  			i915_add_request_no_flush(req);
> >+			if (ret)
> >+				return ret;
> 
> Looks like with this it could execute the context switch on the GPU
> but not update the engine->last_context in do_switch().

Hmm, that problem is present in the current code - requests are not
unwound, just deleted. We need to rearrange switch_context after the
mi_set_context() to ensure that on the subsequent call, the context is
reloaded.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


More information about the Intel-gfx mailing list