[Intel-gfx] [PATCH v2 16/16] drm/i915: add i915_gem_context_get_reset_status_ioctl

Tue Mar 19 20:21:40 CET 2013

On Tue, Mar 19, 2013 at 12:02:48PM -0700, Ian Romanick wrote:
> On 03/19/2013 05:58 AM, Mika Kuoppala wrote:
> >Ian Romanick <idr at freedesktop.org> writes:
> >
> >>On 03/14/2013 08:52 AM, Mika Kuoppala wrote:
> >>>This ioctl returns context reset status for specified context.
> >>>
> >>>Signed-off-by: Mika Kuoppala <mika.kuoppala at intel.com>
> >>>CC: idr at freedesktop.org
> >>>---
> >>>   drivers/gpu/drm/i915/i915_dma.c |    1 +
> >>>   drivers/gpu/drm/i915/i915_drv.c |   61 +++++++++++++++++++++++++++++++++++++++
> >>>   drivers/gpu/drm/i915/i915_drv.h |    2 ++
> >>>   include/uapi/drm/i915_drm.h     |   28 ++++++++++++++++++
> >>>   4 files changed, 92 insertions(+)
> >>>
> >>>diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> >>>index 7902d97..c919832 100644
> >>>--- a/drivers/gpu/drm/i915/i915_dma.c
> >>>+++ b/drivers/gpu/drm/i915/i915_dma.c
> >>>@@ -1903,6 +1903,7 @@ struct drm_ioctl_desc i915_ioctls[] = {
> >>>   	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_CREATE, i915_gem_context_create_ioctl, DRM_UNLOCKED),
> >>>   	DRM_IOCTL_DEF_DRV(I915_GEM_CONTEXT_DESTROY, i915_gem_context_destroy_ioctl, DRM_UNLOCKED),
> >>>   	DRM_IOCTL_DEF_DRV(I915_REG_READ, i915_reg_read_ioctl, DRM_UNLOCKED),
> >>>+	DRM_IOCTL_DEF_DRV(I915_GET_RESET_STATUS, i915_gem_context_get_reset_status_ioctl, DRM_UNLOCKED),
> >>>   };
> >>>
> >>>   int i915_max_ioctl = DRM_ARRAY_SIZE(i915_ioctls);
> >>>diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> >>>index 69c9856..a4d06f2 100644
> >>>--- a/drivers/gpu/drm/i915/i915_drv.c
> >>>+++ b/drivers/gpu/drm/i915/i915_drv.c
> >>>@@ -1267,3 +1267,64 @@ int i915_reg_read_ioctl(struct drm_device *dev,
> >>>
> >>>   	return 0;
> >>>   }
> >>>+
> >>>+int i915_gem_context_get_reset_status_ioctl(struct drm_device *dev,
> >>>+					    void *data, struct drm_file *file)
> >>>+{
> >>>+	struct drm_i915_private *dev_priv = dev->dev_private;
> >>>+	struct intel_ring_buffer *ring;
> >>>+	struct drm_i915_reset_status *args = data;
> >>>+	struct ctx_reset_state *rs = NULL;
> >>>+	unsigned long reset_cnt;
> >>>+	u32 reset_status = I915_RESET_UNKNOWN;
> >>>+	int ret;
> >>>+
> >>>+	ret = mutex_lock_interruptible(&dev->struct_mutex);
> >>>+	if (ret)
> >>>+		return ret;
> >>>+
> >>>+	ring = &dev_priv->ring[RCS];
> >>>+
> >>>+	ret = i915_gem_context_get_reset_state(ring,
> >>>+					       file,
> >>>+					       args->ctx_id,
> >>>+					       &rs);
> >>>+	if (ret)
> >>>+		goto out;
> >>>+
> >>>+	BUG_ON(!rs);
> >>>+
> >>>+	reset_cnt = atomic_read(&dev_priv->gpu_error.reset_counter);
> >>>+
> >>>+	if (reset_cnt & I915_RESET_IN_PROGRESS_FLAG ||
> >>
> >>In this case, I believe we're supposed to return the reset state to the
> >>application.  The ARB_robustness spec says:
> >>
> >>      "If a reset status other than NO_ERROR is returned and subsequent
> >>      calls return NO_ERROR, the context reset was encountered and
> >>      completed. If a reset status is repeatedly returned, the context may
> >>      be in the process of resetting."
> >>
> >>If the reset takes a long time, it seems that even a well-behaved app
> >>could run afoul of the 'banned' logic.
> >
> >As there reset status is initialized to I915_RESET_UNKNOWN,
> >we return it if the reset is in progress or gpu is wedged.
> 
> Hmm... so user space will see I915_RESET_UNKNOWN until the reset is
> done, then it will (usually) see either I915_RESET_BATCH_ACTIVE or
> I915_RESET_BATCH_PENDING.  I think that should be okay.
> 
> >>>+	    reset_cnt == I915_WEDGED) {
> >>>+		goto out;
> >>>+	}
> >>>+
> >>>+	/* Set guilty/innocent status if only one reset was
> >>>+	 * observed and if only one guilty was found
> >>>+	 */
> >>>+	if ((rs->reset_cnt + 2) == reset_cnt &&
> >>>+	    (rs->guilty_cnt + 1) == dev_priv->gpu_error.guilty_cnt) {
> >>
> >>This logic seems... wrong, or at least weird.  "rs->reset_cnt + 2" is
> >>confusing next to "if only one reset was observed".
> >>
> >>dev_priv->gpu_error.reset_counter is the global GPU reset count since
> >>start-up, and rs->reset_cnt is the global GPU count since start-up when
> >>the context was created.  Right?
> >
> >Right. The confusing part in here is the
> >dev_priv->gpu_error.reset_counter. If it is odd, reset is in progress,
> >if it is even, the reset has been handled and all is well. That is why +2
> 
> That's a clever hack, I'm assuming, to use atomic operations instead
> of locks.   Dear God that's awful to understand... it's a tiny bit
> more clear looking back at the 'reset_cnt &
> I915_RESET_IN_PROGRESS_FLAG'. Perhaps we could get some wrapper
> macros RESET_IN_PROGRESS() and RESET_ACTUAL_COUNT() or something?

Those exist and are called i915_reset_in_progress and
i915_terminally_wedged (the later for the case where the gpu reset failed
and things are terminally hosed). Since the kernel thus far only cared
whether the reset state changed (either from good -> reset_in_progress,
back or to the terminal state) without ever missing a state transition, it
only compares the counter and doesn't care about the actual reset count
one bit. Hence why I didn't add another helper.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch