[Intel-gfx] [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout

Tue Mar 28 08:31:16 UTC 2017

On Mon, Mar 27, 2017 at 06:03:37PM -0700, Michel Thierry wrote:
> On 25/03/17 02:38, Chris Wilson wrote:
> >On Fri, Mar 24, 2017 at 06:30:09PM -0700, Michel Thierry wrote:
> >>diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> >>index b43c37a911bb..1741584d858f 100644
> >>--- a/drivers/gpu/drm/i915/i915_drv.h
> >>+++ b/drivers/gpu/drm/i915/i915_drv.h
> >>@@ -1039,6 +1039,7 @@ struct i915_gpu_state {
> >> 			int ban_score;
> >> 			int active;
> >> 			int guilty;
> >>+			int watchdog_threshold;
> >
> >Shouldn't this be added in the patch adding it to the context for real?
> >
> 
> I wanted to wait until I could print it in the error_state using
> watchdog_to_us (which is added until this patch).
> I can also move all the i915_gpu_error.c changes to a new patch.

Probably the best compromise.

> >>+/* Return the timer count threshold in microseconds. */
> >>+int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
> >>+				  struct drm_i915_gem_context_param *args)
> >>+{
> >>+	struct drm_i915_private *dev_priv = ctx->i915;
> >>+	struct intel_engine_cs *engine;
> >>+	enum intel_engine_id id;
> >>+	u32 threshold_in_us[I915_NUM_ENGINES];
> >
> >Reading is a 2 step process. First CONTEXT_GET_PARAM passes in size==0
> >and gets told by the kernel what size array to allocate. Second pass
> >supplies a buffer with that size. (Or userspace can preallocate a large
> >enough buffer, declare it's full size and let the kernel fill as much as
> >it wants.)
> >
> >if (args->size == 0)
> >	goto out;
> >
> >if (args->size < sizeof(threshold_in_us))
> >	return -EFAULT;
> EFAULT or EINVAL?

EFAULT. The user buffer is not an absolute fixed requirement, so too
small a buffer is an overrun -> fault.

> >>+
> >>+	if (!dev_priv->engine[VCS]->emit_start_watchdog)
> >>+		return -ENODEV;
> >>+
> >>+	for_each_engine(engine, dev_priv, id) {
> >>+		struct intel_context *ce = &ctx->engine[id];
> >>+
> >>+		threshold_in_us[id] = watchdog_to_us(ce->watchdog_threshold);
> >>+	}
> >>+
> >>+	mutex_unlock(&dev_priv->drm.struct_mutex);
> >
> >Grr. We should look at why we have this lock here in the first place.
> >IIRC, it was there to make TRTT easier, but we can always move the
> >burden of work again.
> >
> 
> It helps TRTT (that code also takes an extra reference of the
> ctx)... but it all started in i915_gem_context_lookup (499f2697da1d)

It is just laziness that we kept it around the switch, and part of the
reason for that laziness to remain was trtt. Note, we don't need the
mutex around the context lookup (or unref) with a bit of jiggery-pokery.

> >>+	mutex_unlock(&dev_priv->drm.struct_mutex);
> >>+	if (copy_from_user(&threshold_in_us,
> >>+			   u64_to_user_ptr(args->value),
> >>+			   sizeof(threshold_in_us))) {
> >>+		mutex_lock(&dev_priv->drm.struct_mutex);
> >>+		return -EFAULT;
> >>+	}
> >>+	mutex_lock(&dev_priv->drm.struct_mutex);
> >>+
> >>+	/* not supported in blitter engine */
> >>+	if (threshold_in_us[BCS] != 0)
> >>+		return -EINVAL;
> >>+
> >>+	for_each_engine(engine, dev_priv, id) {
> >>+		struct intel_context *ce = &ctx->engine[id];
> >>+
> >>+		ce->watchdog_threshold =
> >>+			watchdog_to_clock_counts((u64)threshold_in_us[id]);
> >
> >Cast is superfluous.
> >
> 
> without it, the operation in watchdog_to_clock_counts was always
> casted to u32.

static inline u32 watchdog_to_clock_counts(u64 value_in_us) ? 

> >>+static inline u32 watchdog_to_clock_counts(u64 value_in_us)
> >>+{
> >>+	u64 threshold = (value_in_us) * (TIMESTAMP_CNTS_PER_USEC);
> >>+
> >>+	if (GEM_WARN_ON(overflows_type(threshold, u32)))
> >
> >Nice idea, and yes we should do the range check. But this is a userspace
> >interface, so always do it and don't warn, just -EINVAL.
> 
> What if it fails/overflows after some engines' thresholds have been
> set, should it set them back to 0's or leave them enable?

Yes. Validate userspace first, then apply, so the set of changes is
atomic and the ret is either success or EINVAL.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre