[Intel-gfx] [PATCH v2 13/22] drm/i915: Update reset path to fix incomplete requests

Mon Oct 3 12:44:14 UTC 2016

On 07/09/2016 15:45, Chris Wilson wrote:
> Update reset path in preparation for engine reset which requires
> identification of incomplete requests and associated context and fixing
> their state so that engine can resume correctly after reset.
>
> The request that caused the hang will be skipped and head is reset to the
> start of breadcrumb. This allows us to resume from where we left-off.
> Since this request didn't complete normally we also need to cleanup elsp
> queue manually. This is vital if we employ nonblocking request
> submission where we may have a web of dependencies upon the hung request
> and so advancing the seqno manually is no longer trivial.
>
> ABI: gem_reset_stats / DRM_IOCTL_I915_GET_RESET_STATS
>
> We change the way we count pending batches. Only the active context
> involved in the reset is marked as either innocent or guilty, and not
> mark the entire world as pending. By inspection this only affects
> igt/gem_reset_stats (which assumes implementation details) and not
> piglit.
>
> ARB_robustness gives this guide on how we expect the user of this
> interface to behave:
>
>   * Provide a mechanism for an OpenGL application to learn about
>     graphics resets that affect the context.  When a graphics reset
>     occurs, the OpenGL context becomes unusable and the application
>     must create a new context to continue operation. Detecting a
>     graphics reset happens through an inexpensive query.
>
> And with regards to the actual meaning of the reset values:
>
>     Certain events can result in a reset of the GL context. Such a reset
>     causes all context state to be lost. Recovery from such events
>     requires recreation of all objects in the affected context. The
>     current status of the graphics reset state is returned by
>
> 	enum GetGraphicsResetStatusARB();
>
>     The symbolic constant returned indicates if the GL context has been
>     in a reset state at any point since the last call to
>     GetGraphicsResetStatusARB. NO_ERROR indicates that the GL context
>     has not been in a reset state since the last call.
>     GUILTY_CONTEXT_RESET_ARB indicates that a reset has been detected
>     that is attributable to the current GL context.
>     INNOCENT_CONTEXT_RESET_ARB indicates a reset has been detected that
>     is not attributable to the current GL context.
>     UNKNOWN_CONTEXT_RESET_ARB indicates a detected graphics reset whose
>     cause is unknown.
>
> The language here is explicit in that we must mark up the guilty batch,
> but is loose enough for us to relax the innocent (i.e. pending)
> accounting as only the active batches are involved with the reset.
>
> In the future, we are looking towards single engine resetting (with
> minimal locking), where it seems inappropriate to mark the entire world
> as innocent since the reset occurred on a different engine. Reducing the
> information available means we only have to encounter the pain once, and
> also reduces the information leaking from one context to another.
>
> v2: Legacy ringbuffer submission required a reset following hibernation,
> or else we restore stale values to the RING_HEAD and walked over
> stolen garbage.
>
> v3: GuC requires replaying the requests after a reset.
>
> v4: Restore engine IRQ after reset (so waiters will be woken!)
>      Rearm hangcheck if resetting with a waiter.
>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> Cc: Arun Siluvery <arun.siluvery at linux.intel.com>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/i915_drv.c            |   8 +-
>   drivers/gpu/drm/i915/i915_drv.h            |   5 +-
>   drivers/gpu/drm/i915/i915_gem.c            | 123 +++++++++++++++++------------
>   drivers/gpu/drm/i915/i915_gem_context.c    |  16 ----
>   drivers/gpu/drm/i915/i915_guc_submission.c |   8 +-
>   drivers/gpu/drm/i915/intel_engine_cs.c     |  15 +++-
>   drivers/gpu/drm/i915/intel_lrc.c           |  49 ++++++++++--
>   drivers/gpu/drm/i915/intel_lrc.h           |   3 +-
>   drivers/gpu/drm/i915/intel_ringbuffer.c    |  47 +++++++----
>   drivers/gpu/drm/i915/intel_ringbuffer.h    |   7 +-
>   10 files changed, 183 insertions(+), 98 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index c1b890dbd6cc..2b0727d1467d 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -559,7 +559,6 @@ static void i915_gem_fini(struct drm_device *dev)
>   	}
>   
>   	mutex_lock(&dev->struct_mutex);
> -	i915_gem_reset(dev);
>   	i915_gem_cleanup_engines(dev);
>   	i915_gem_context_fini(dev);
>   	mutex_unlock(&dev->struct_mutex);
> @@ -1579,7 +1578,7 @@ static int i915_drm_resume(struct drm_device *dev)
>   	mutex_lock(&dev->struct_mutex);
>   	if (i915_gem_init_hw(dev)) {
>   		DRM_ERROR("failed to re-initialize GPU, declaring wedged!\n");
> -		set_bit(I915_WEDGED, &dev_priv->gpu_error.flags);
> +		i915_gem_set_wedged(dev_priv);
>   	}
>   	mutex_unlock(&dev->struct_mutex);
>   
> @@ -1756,8 +1755,6 @@ int i915_reset(struct drm_i915_private *dev_priv)
>   
>   	pr_notice("drm/i915: Resetting chip after gpu hang\n");
>   
> -	i915_gem_reset(dev);
> -
>   	ret = intel_gpu_reset(dev_priv, ALL_ENGINES);
>   	if (ret) {
>   		if (ret != -ENODEV)
> @@ -1767,6 +1764,7 @@ int i915_reset(struct drm_i915_private *dev_priv)
>   		goto error;
>   	}
>   
> +	i915_gem_reset(dev_priv);
>   	intel_overlay_reset(dev_priv);
>   
>   	/* Ok, now get things going again... */
> @@ -1803,7 +1801,7 @@ out:
>   	return ret;
>   
>   error:
> -	set_bit(I915_WEDGED, &error->flags);
> +	i915_gem_set_wedged(dev_priv);
>   	goto out;
>   }
>   
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 2e2fd8a77233..a63bf820aa8f 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2029,6 +2029,7 @@ struct drm_i915_private {
>   
>   	/* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
>   	struct {
> +		void (*resume)(struct drm_i915_private *);
>   		void (*cleanup_engine)(struct intel_engine_cs *engine);
>   
>   		/**
> @@ -3262,7 +3263,8 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
>   	return READ_ONCE(error->reset_count);
>   }
>   
> -void i915_gem_reset(struct drm_device *dev);
> +void i915_gem_reset(struct drm_i915_private *dev_priv);
> +void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
>   bool i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
>   int __must_check i915_gem_init(struct drm_device *dev);
>   int __must_check i915_gem_init_hw(struct drm_device *dev);
> @@ -3391,7 +3393,6 @@ void i915_gem_object_save_bit_17_swizzle(struct drm_i915_gem_object *obj);
>   int __must_check i915_gem_context_init(struct drm_device *dev);
>   void i915_gem_context_lost(struct drm_i915_private *dev_priv);
>   void i915_gem_context_fini(struct drm_device *dev);
> -void i915_gem_context_reset(struct drm_device *dev);
>   int i915_gem_context_open(struct drm_device *dev, struct drm_file *file);
>   void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
>   int i915_switch_context(struct drm_i915_gem_request *req);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 23069a2d2850..65a69bbe021d 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2555,29 +2555,83 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
>   	return NULL;
>   }
>   
> -static void i915_gem_reset_engine_status(struct intel_engine_cs *engine)
> +static void reset_request(struct drm_i915_gem_request *request)
> +{
> +	void *vaddr = request->ring->vaddr;
> +	u32 head;
> +
> +	/* As this request likely depends on state from the lost
> +	 * context, clear out all the user operations leaving the
> +	 * breadcrumb at the end (so we get the fence notifications).
> +	 */
> +	head = request->head;
> +	if (request->postfix < head) {
> +		memset(vaddr + head, 0, request->ring->size - head);
> +		head = 0;
> +	}
> +	memset(vaddr + head, 0, request->postfix - head);
> +}
> +
> +static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>   {
>   	struct drm_i915_gem_request *request;
> +	struct i915_gem_context *incomplete_ctx;
>   	bool ring_hung;
>   
> +	/* Ensure irq handler finishes, and not run again. */
> +	tasklet_kill(&engine->irq_tasklet);
> +
>   	request = i915_gem_find_active_request(engine);
> -	if (request == NULL)
> +	if (!request)
>   		return;
>   
>   	ring_hung = engine->hangcheck.score >= HANGCHECK_SCORE_RING_HUNG;
> -
>   	i915_set_reset_status(request->ctx, ring_hung);
> +	if (!ring_hung)
> +		return;
> +
> +	DRM_DEBUG_DRIVER("reseting %s to start from tail of request 0x%x\n",
> +			 engine->name, request->fence.seqno);
> +
> +	/* Setup the CS to resume from the breadcrumb of the hung request */
> +	engine->reset_hw(engine, request);
> +
> +	/* Users of the default context do not rely on logical state
> +	 * preserved between batches. They have to emit full state on
> +	 * every batch and so it is safe to execute queued requests following
> +	 * the hang.
> +	 *
> +	 * Other contexts preserve state, now corrupt. We want to skip all
> +	 * queued requests that reference the corrupt context.
> +	 */
> +	incomplete_ctx = request->ctx;
> +	if (i915_gem_context_is_default(incomplete_ctx))
> +		return;
> +
>   	list_for_each_entry_continue(request, &engine->request_list, link)
> -		i915_set_reset_status(request->ctx, false);
> +		if (request->ctx == incomplete_ctx)
> +			reset_request(request);
>   }
>   
> -static void i915_gem_reset_engine_cleanup(struct intel_engine_cs *engine)
> +void i915_gem_reset(struct drm_i915_private *dev_priv)
>   {
> -	struct drm_i915_gem_request *request;
> -	struct intel_ring *ring;
> +	struct intel_engine_cs *engine;
>   
> -	/* Ensure irq handler finishes, and not run again. */
> -	tasklet_kill(&engine->irq_tasklet);
> +	i915_gem_retire_requests(dev_priv);
> +
> +	for_each_engine(engine, dev_priv)
> +		i915_gem_reset_engine(engine);
> +
> +	i915_gem_restore_fences(&dev_priv->drm);
> +}
> +
> +static void nop_submit_request(struct drm_i915_gem_request *request)
> +{
> +}
> +
> +static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)

I am aware this is a late comment, but wanted to say that the name above 
is not ideal since we have both i915_gem_cleanup_engines and 
dev_priv->gt.cleanup_engine which do completely different type of thing 
than this.

Regards,

Tvrtko