[Intel-gfx] [PATCH v2 13/22] drm/i915: Update reset path to fix incomplete requests
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Mon Oct 3 12:44:14 UTC 2016
On 07/09/2016 15:45, Chris Wilson wrote:
> Update reset path in preparation for engine reset which requires
> identification of incomplete requests and associated context and fixing
> their state so that engine can resume correctly after reset.
>
> The request that caused the hang will be skipped and head is reset to the
> start of breadcrumb. This allows us to resume from where we left-off.
> Since this request didn't complete normally we also need to cleanup elsp
> queue manually. This is vital if we employ nonblocking request
> submission where we may have a web of dependencies upon the hung request
> and so advancing the seqno manually is no longer trivial.
>
> ABI: gem_reset_stats / DRM_IOCTL_I915_GET_RESET_STATS
>
> We change the way we count pending batches. Only the active context
> involved in the reset is marked as either innocent or guilty, and not
> mark the entire world as pending. By inspection this only affects
> igt/gem_reset_stats (which assumes implementation details) and not
> piglit.
>
> ARB_robustness gives this guide on how we expect the user of this
> interface to behave:
>
> * Provide a mechanism for an OpenGL application to learn about
> graphics resets that affect the context. When a graphics reset
> occurs, the OpenGL context becomes unusable and the application
> must create a new context to continue operation. Detecting a
> graphics reset happens through an inexpensive query.
>
> And with regards to the actual meaning of the reset values:
>
> Certain events can result in a reset of the GL context. Such a reset
> causes all context state to be lost. Recovery from such events
> requires recreation of all objects in the affected context. The
> current status of the graphics reset state is returned by
>
> enum GetGraphicsResetStatusARB();
>
> The symbolic constant returned indicates if the GL context has been
> in a reset state at any point since the last call to
> GetGraphicsResetStatusARB. NO_ERROR indicates that the GL context
> has not been in a reset state since the last call.
> GUILTY_CONTEXT_RESET_ARB indicates that a reset has been detected
> that is attributable to the current GL context.
> INNOCENT_CONTEXT_RESET_ARB indicates a reset has been detected that
> is not attributable to the current GL context.
> UNKNOWN_CONTEXT_RESET_ARB indicates a detected graphics reset whose
> cause is unknown.
>
> The language here is explicit in that we must mark up the guilty batch,
> but is loose enough for us to relax the innocent (i.e. pending)
> accounting as only the active batches are involved with the reset.
>
> In the future, we are looking towards single engine resetting (with
> minimal locking), where it seems inappropriate to mark the entire world
> as innocent since the reset occurred on a different engine. Reducing the
> information available means we only have to encounter the pain once, and
> also reduces the information leaking from one context to another.
>
> v2: Legacy ringbuffer submission required a reset following hibernation,
> or else we restore stale values to the RING_HEAD and walked over
> stolen garbage.
>
> v3: GuC requires replaying the requests after a reset.
>
> v4: Restore engine IRQ after reset (so waiters will be woken!)
> Rearm hangcheck if resetting with a waiter.
>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> Cc: Arun Siluvery <arun.siluvery at linux.intel.com>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> ---
> drivers/gpu/drm/i915/i915_drv.c | 8 +-
> drivers/gpu/drm/i915/i915_drv.h | 5 +-
> drivers/gpu/drm/i915/i915_gem.c | 123 +++++++++++++++++------------
> drivers/gpu/drm/i915/i915_gem_context.c | 16 ----
> drivers/gpu/drm/i915/i915_guc_submission.c | 8 +-
> drivers/gpu/drm/i915/intel_engine_cs.c | 15 +++-
> drivers/gpu/drm/i915/intel_lrc.c | 49 ++++++++++--
> drivers/gpu/drm/i915/intel_lrc.h | 3 +-
> drivers/gpu/drm/i915/intel_ringbuffer.c | 47 +++++++----
> drivers/gpu/drm/i915/intel_ringbuffer.h | 7 +-
> 10 files changed, 183 insertions(+), 98 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index c1b890dbd6cc..2b0727d1467d 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -559,7 +559,6 @@ static void i915_gem_fini(struct drm_device *dev)
> }
>
> mutex_lock(&dev->struct_mutex);
> - i915_gem_reset(dev);
> i915_gem_cleanup_engines(dev);
> i915_gem_context_fini(dev);
> mutex_unlock(&dev->struct_mutex);
> @@ -1579,7 +1578,7 @@ static int i915_drm_resume(struct drm_device *dev)
> mutex_lock(&dev->struct_mutex);
> if (i915_gem_init_hw(dev)) {
> DRM_ERROR("failed to re-initialize GPU, declaring wedged!\n");
> - set_bit(I915_WEDGED, &dev_priv->gpu_error.flags);
> + i915_gem_set_wedged(dev_priv);
> }
> mutex_unlock(&dev->struct_mutex);
>
> @@ -1756,8 +1755,6 @@ int i915_reset(struct drm_i915_private *dev_priv)
>
> pr_notice("drm/i915: Resetting chip after gpu hang\n");
>
> - i915_gem_reset(dev);
> -
> ret = intel_gpu_reset(dev_priv, ALL_ENGINES);
> if (ret) {
> if (ret != -ENODEV)
> @@ -1767,6 +1764,7 @@ int i915_reset(struct drm_i915_private *dev_priv)
> goto error;
> }
>
> + i915_gem_reset(dev_priv);
> intel_overlay_reset(dev_priv);
>
> /* Ok, now get things going again... */
> @@ -1803,7 +1801,7 @@ out:
> return ret;
>
> error:
> - set_bit(I915_WEDGED, &error->flags);
> + i915_gem_set_wedged(dev_priv);
> goto out;
> }
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 2e2fd8a77233..a63bf820aa8f 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2029,6 +2029,7 @@ struct drm_i915_private {
>
> /* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
> struct {
> + void (*resume)(struct drm_i915_private *);
> void (*cleanup_engine)(struct intel_engine_cs *engine);
>
> /**
> @@ -3262,7 +3263,8 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
> return READ_ONCE(error->reset_count);
> }
>
> -void i915_gem_reset(struct drm_device *dev);
> +void i915_gem_reset(struct drm_i915_private *dev_priv);
> +void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
> bool i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
> int __must_check i915_gem_init(struct drm_device *dev);
> int __must_check i915_gem_init_hw(struct drm_device *dev);
> @@ -3391,7 +3393,6 @@ void i915_gem_object_save_bit_17_swizzle(struct drm_i915_gem_object *obj);
> int __must_check i915_gem_context_init(struct drm_device *dev);
> void i915_gem_context_lost(struct drm_i915_private *dev_priv);
> void i915_gem_context_fini(struct drm_device *dev);
> -void i915_gem_context_reset(struct drm_device *dev);
> int i915_gem_context_open(struct drm_device *dev, struct drm_file *file);
> void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
> int i915_switch_context(struct drm_i915_gem_request *req);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 23069a2d2850..65a69bbe021d 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2555,29 +2555,83 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
> return NULL;
> }
>
> -static void i915_gem_reset_engine_status(struct intel_engine_cs *engine)
> +static void reset_request(struct drm_i915_gem_request *request)
> +{
> + void *vaddr = request->ring->vaddr;
> + u32 head;
> +
> + /* As this request likely depends on state from the lost
> + * context, clear out all the user operations leaving the
> + * breadcrumb at the end (so we get the fence notifications).
> + */
> + head = request->head;
> + if (request->postfix < head) {
> + memset(vaddr + head, 0, request->ring->size - head);
> + head = 0;
> + }
> + memset(vaddr + head, 0, request->postfix - head);
> +}
> +
> +static void i915_gem_reset_engine(struct intel_engine_cs *engine)
> {
> struct drm_i915_gem_request *request;
> + struct i915_gem_context *incomplete_ctx;
> bool ring_hung;
>
> + /* Ensure irq handler finishes, and not run again. */
> + tasklet_kill(&engine->irq_tasklet);
> +
> request = i915_gem_find_active_request(engine);
> - if (request == NULL)
> + if (!request)
> return;
>
> ring_hung = engine->hangcheck.score >= HANGCHECK_SCORE_RING_HUNG;
> -
> i915_set_reset_status(request->ctx, ring_hung);
> + if (!ring_hung)
> + return;
> +
> + DRM_DEBUG_DRIVER("reseting %s to start from tail of request 0x%x\n",
> + engine->name, request->fence.seqno);
> +
> + /* Setup the CS to resume from the breadcrumb of the hung request */
> + engine->reset_hw(engine, request);
> +
> + /* Users of the default context do not rely on logical state
> + * preserved between batches. They have to emit full state on
> + * every batch and so it is safe to execute queued requests following
> + * the hang.
> + *
> + * Other contexts preserve state, now corrupt. We want to skip all
> + * queued requests that reference the corrupt context.
> + */
> + incomplete_ctx = request->ctx;
> + if (i915_gem_context_is_default(incomplete_ctx))
> + return;
> +
> list_for_each_entry_continue(request, &engine->request_list, link)
> - i915_set_reset_status(request->ctx, false);
> + if (request->ctx == incomplete_ctx)
> + reset_request(request);
> }
>
> -static void i915_gem_reset_engine_cleanup(struct intel_engine_cs *engine)
> +void i915_gem_reset(struct drm_i915_private *dev_priv)
> {
> - struct drm_i915_gem_request *request;
> - struct intel_ring *ring;
> + struct intel_engine_cs *engine;
>
> - /* Ensure irq handler finishes, and not run again. */
> - tasklet_kill(&engine->irq_tasklet);
> + i915_gem_retire_requests(dev_priv);
> +
> + for_each_engine(engine, dev_priv)
> + i915_gem_reset_engine(engine);
> +
> + i915_gem_restore_fences(&dev_priv->drm);
> +}
> +
> +static void nop_submit_request(struct drm_i915_gem_request *request)
> +{
> +}
> +
> +static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
I am aware this is a late comment, but wanted to say that the name above
is not ideal since we have both i915_gem_cleanup_engines and
dev_priv->gt.cleanup_engine which do completely different type of thing
than this.
Regards,
Tvrtko
More information about the Intel-gfx
mailing list