[Intel-gfx] [PATCH 4/4] drm/i915: Late request cancellations are harmful
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Mon Apr 11 13:50:17 UTC 2016
On 09/04/16 10:27, Chris Wilson wrote:
> Conceptually, each request is a record of a hardware transaction - we
> build up a list of pending commands and then either commit them to
> hardware, or cancel them. However, whilst building up the list of
> pending commands, we may modify state outside of the request and make
> references to the pending request. If we do so and then cancel that
> request, external objects then point to the deleted request leading to
> both graphical and memory corruption.
>
> The easiest example is to consider object/VMA tracking. When we mark an
> object as active in a request, we store a pointer to this, the most
> recent request, in the object. Then we want to free that object, we wait
> for the most recent request to be idle before proceeding (otherwise the
> hardware will write to pages now owned by the system, or we will attempt
> to read from those pages before the hardware is finished writing). If
> the request was cancelled instead, that wait completes immediately. As a
> result, all requests must be committed and not cancelled if the external
> state is unknown.
This was a bit hard to figure out.
So we cannot unwind because once we set last_read_req we lose track of
what the previous request was, before this transaction started?
Intuitively I don't like the idea of sending unfinished work to the
GPU when preparation failed at some arbitrary point in ring buffer
emission.
So I am struggling with reviewing this, as I did in the previous round.
> All that remains of i915_gem_request_cancel() users are just a couple of
> extremely unlikely allocation failures, so remove the API entirely.
This part feels extra weird because in the non-execbuf cases we
actually can cancel the transaction without any issues, correct?
Would a middle ground be to keep the cancellations for in-kernel
submits, and for execbuf rewind the ringbuf so that only the request
post-amble is sent to the GPU?
> A consequence of committing all incomplete requests is that we generate
> excess breadcrumbs and fill the ring much more often with dummy work. We
> have completely undone the outstanding_last_seqno optimisation.
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93907
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
> Cc: stable at vger.kernel.org
> ---
> drivers/gpu/drm/i915/i915_drv.h | 2 --
> drivers/gpu/drm/i915/i915_gem.c | 50 ++++++++++++------------------
> drivers/gpu/drm/i915/i915_gem_context.c | 21 ++++++-------
> drivers/gpu/drm/i915/i915_gem_execbuffer.c | 15 +++------
> drivers/gpu/drm/i915/intel_display.c | 2 +-
> drivers/gpu/drm/i915/intel_lrc.c | 4 +--
> drivers/gpu/drm/i915/intel_overlay.c | 8 ++---
> 7 files changed, 39 insertions(+), 63 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index a93e5dd4fa9a..f374db8de673 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2320,7 +2320,6 @@ struct drm_i915_gem_request {
> struct drm_i915_gem_request * __must_check
> i915_gem_request_alloc(struct intel_engine_cs *engine,
> struct intel_context *ctx);
> -void i915_gem_request_cancel(struct drm_i915_gem_request *req);
> void i915_gem_request_free(struct kref *req_ref);
> int i915_gem_request_add_to_client(struct drm_i915_gem_request *req,
> struct drm_file *file);
> @@ -2872,7 +2871,6 @@ int i915_gem_sw_finish_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file_priv);
> void i915_gem_execbuffer_move_to_active(struct list_head *vmas,
> struct drm_i915_gem_request *req);
> -void i915_gem_execbuffer_retire_commands(struct i915_execbuffer_params *params);
> int i915_gem_ringbuffer_submission(struct i915_execbuffer_params *params,
> struct drm_i915_gem_execbuffer2 *args,
> struct list_head *vmas);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 1c3ff56594d6..42227495803f 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2753,7 +2753,8 @@ __i915_gem_request_alloc(struct intel_engine_cs *engine,
> * fully prepared. Thus it can be cleaned up using the proper
> * free code.
> */
> - i915_gem_request_cancel(req);
> + intel_ring_reserved_space_cancel(req->ringbuf);
> + i915_gem_request_unreference(req);
> return ret;
> }
>
> @@ -2790,13 +2791,6 @@ i915_gem_request_alloc(struct intel_engine_cs *engine,
> return err ? ERR_PTR(err) : req;
> }
>
> -void i915_gem_request_cancel(struct drm_i915_gem_request *req)
> -{
> - intel_ring_reserved_space_cancel(req->ringbuf);
> -
> - i915_gem_request_unreference(req);
> -}
> -
> struct drm_i915_gem_request *
> i915_gem_find_active_request(struct intel_engine_cs *engine)
> {
> @@ -3410,12 +3404,9 @@ int i915_gpu_idle(struct drm_device *dev)
> return PTR_ERR(req);
>
> ret = i915_switch_context(req);
> - if (ret) {
> - i915_gem_request_cancel(req);
> - return ret;
> - }
> -
> i915_add_request_no_flush(req);
> + if (ret)
> + return ret;
Looks like with this it could execute the context switch on the GPU but
not update engine->last_context in do_switch().
Regards,
Tvrtko