[Intel-gfx] [RFC 1/6] drm/i915: Individual request cancellation

Tue Mar 16 10:02:09 UTC 2021

On Mon, Mar 15, 2021 at 05:37:27PM +0000, Tvrtko Ursulin wrote:
> 
> On 12/03/2021 15:46, Tvrtko Ursulin wrote:
> > From: Chris Wilson <chris at chris-wilson.co.uk>
> > 
> > Currently, we cancel outstanding requests within a context when the
> > context is closed. We may also want to cancel individual requests using
> > the same graceful preemption mechanism.
> > 
> > v2 (Tvrtko):
> >   * Cancel waiters carefully considering no timeline lock and RCU.
> >   * Fixed selftests.
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> 
> [snip]
> 
> > +void i915_request_cancel(struct i915_request *rq, int error)
> > +{
> > +	if (!i915_request_set_error_once(rq, error))
> > +		return;
> > +
> > +	set_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags);
> > +
> > +	if (i915_sw_fence_signaled(&rq->submit)) {
> > +		struct i915_dependency *p;
> > +
> > +restart:
> > +		rcu_read_lock();
> > +		for_each_waiter(p, rq) {
> > +			struct i915_request *w =
> > +				container_of(p->waiter, typeof(*w), sched);
> > +
> > +			if (__i915_request_is_complete(w) ||
> > +			    fatal_error(w->fence.error))
> > +				continue;
> > +
> > +			w = i915_request_get(w);
> > +			rcu_read_unlock();
> > +			/* Recursion bound by the number of engines */
> > +			i915_request_cancel(w, error);
> > +			i915_request_put(w);
> > +
> > +			/* Restart after having to drop rcu lock. */
> > +			goto restart;
> > +		}
> 
> So I need to fix this error propagation to waiters in order to avoid
> potential stack overflow caught in shards (gem_ctx_ringsize).
> 
> Or alternatively we decide not to propagate fence errors. Not sure that
> consequences either way are particularly better or worse. Things will break
> anyway since what userspace inspects for unexpected fence errors?!

fence error propagation is one of these "sounds like a good idea" things
that turned into a can of worms. See the recent revert Jason submitted, I
replied with a  more in-depth discussion.

So I'd say if we don't need this internally somehow for scheduler state,
remove it. Maybe even the entire scaffolding we have for the forwarding.

Maybe best if you sync with Jason here, we need to stuff Jason's patch
into -fixes since there's a pretty bad regression going on. I think Jason
also said there's a pile of igts to remove once we give up on fence error
propagation.

> So rendering corruption more or less. Can it cause a further stream of GPU
> hangs I am not sure. Only if there is a inter-engine data dependency
> involving data more complex than images/textures.

Yup. Also at least on modern-ish hw our userspace goes with
non-recoverable contexts anyway, because everything needs to be
reconstructed. vk is even more brutal, it just hands you back a
vk_device_lost and everything is gone (textures, data, all api objects,
really everything afaiui). Trying to continue is something only old
userspace is doing, because the fully emit the entire ctx state at the
start of each batch anyway.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch