[Intel-gfx] [PATCH 4/4] drm/i915: Late request cancellations are harmful
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Mon Apr 11 13:50:17 UTC 2016
On 09/04/16 10:27, Chris Wilson wrote:
> Conceptually, each request is a record of a hardware transaction - we
> build up a list of pending commands and then either commit them to
> hardware, or cancel them. However, whilst building up the list of
> pending commands, we may modify state outside of the request and make
> references to the pending request. If we do so and then cancel that
> request, external objects then point to the deleted request leading to
> both graphical and memory corruption.
>
> The easiest example is to consider object/VMA tracking. When we mark an
> object as active in a request, we store a pointer to this, the most
> recent request, in the object. Then we want to free that object, we wait
> for the most recent request to be idle before proceeding (otherwise the
> hardware will write to pages now owned by the system, or we will attempt
> to read from those pages before the hardware is finished writing). If
> the request was cancelled instead, that wait completes immediately. As a
> result, all requests must be committed and not cancelled if the external
> state is unknown.
This was a bit hard to figure out.
So we cannot unwind because once we set last_read_req we lose track of
what the previous request was, before this transaction started?
Intuitively I don't like the idea of sending unfinished work to the
GPU when preparation failed at some arbitrary point in ring buffer
emission.
So I am struggling with reviewing this, as I did in the previous round.
> All that remains of i915_gem_request_cancel() users are just a couple of
> extremely unlikely allocation failures, so remove the API entirely.
This part feels extra weird because in the non-execbuf cases we
actually can cancel the transaction without any issues, correct?
Would a middle ground be to keep the cancellations for in-kernel
submits, and for execbuf rewind the ringbuf so that only the request
post-amble is sent to the GPU?
> A consequence of committing all incomplete requests is that we generate
> excess breadcrumbs and fill the ring much more often with dummy work. We
> have completely undone the outstanding_last_seqno optimisation.
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93907
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
> Cc: stable at vger.kernel.org
> ---
> drivers/gpu/drm/i915/i915_drv.h | 2 --
> drivers/gpu/drm/i915/i915_gem.c | 50 ++++++++++++------------------
> drivers/gpu/drm/i915/i915_gem_context.c | 21 ++++++-------
> drivers/gpu/drm/i915/i915_gem_execbuffer.c | 15 +++------
> drivers/gpu/drm/i915/intel_display.c | 2 +-
> drivers/gpu/drm/i915/intel_lrc.c | 4 +--
> drivers/gpu/drm/i915/intel_overlay.c | 8 ++---
> 7 files changed, 39 insertions(+), 63 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index a93e5dd4fa9a..f374db8de673 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2320,7 +2320,6 @@ struct drm_i915_gem_request {
> struct drm_i915_gem_request * __must_check
> i915_gem_request_alloc(struct intel_engine_cs *engine,
> struct intel_context *ctx);
> -void i915_gem_request_cancel(struct drm_i915_gem_request *req);
> void i915_gem_request_free(struct kref *req_ref);
> int i915_gem_request_add_to_client(struct drm_i915_gem_request *req,
> struct drm_file *file);
> @@ -2872,7 +2871,6 @@ int i915_gem_sw_finish_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file_priv);
> void i915_gem_execbuffer_move_to_active(struct list_head *vmas,
> struct drm_i915_gem_request *req);
> -void i915_gem_execbuffer_retire_commands(struct i915_execbuffer_params *params);
> int i915_gem_ringbuffer_submission(struct i915_execbuffer_params *params,
> struct drm_i915_gem_execbuffer2 *args,
> struct list_head *vmas);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 1c3ff56594d6..42227495803f 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2753,7 +2753,8 @@ __i915_gem_request_alloc(struct intel_engine_cs *engine,
> * fully prepared. Thus it can be cleaned up using the proper
> * free code.
> */
> - i915_gem_request_cancel(req);
> + intel_ring_reserved_space_cancel(req->ringbuf);
> + i915_gem_request_unreference(req);
> return ret;
> }
>
> @@ -2790,13 +2791,6 @@ i915_gem_request_alloc(struct intel_engine_cs *engine,
> return err ? ERR_PTR(err) : req;
> }
>
> -void i915_gem_request_cancel(struct drm_i915_gem_request *req)
> -{
> - intel_ring_reserved_space_cancel(req->ringbuf);
> -
> - i915_gem_request_unreference(req);
> -}
> -
> struct drm_i915_gem_request *
> i915_gem_find_active_request(struct intel_engine_cs *engine)
> {
> @@ -3410,12 +3404,9 @@ int i915_gpu_idle(struct drm_device *dev)
> return PTR_ERR(req);
>
> ret = i915_switch_context(req);
> - if (ret) {
> - i915_gem_request_cancel(req);
> - return ret;
> - }
> -
> i915_add_request_no_flush(req);
> + if (ret)
> + return ret;
Looks like with this it could execute the context switch on the GPU but
not update engine->last_context in do_switch().
Regards,
Tvrtko