[Intel-gfx] [PATCH 2/3] drm/i915: Complete requests in nop_submit_request

Fri Nov 18 13:33:28 UTC 2016

On Fri, Nov 18, 2016 at 03:03:21PM +0200, Mika Kuoppala wrote:
> Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com> writes:
> 
> > On 18/11/2016 09:37, Chris Wilson wrote:
> >> Since the submit/execute split in commit d55ac5bf97c6 ("drm/i915: Defer
> >> transfer onto execution timeline to actual hw submission") the
> >> global seqno advance was deferred until the submit_request callback.
> >> After wedging the GPU, we were installing a nop_submit_request handler
> >> (to avoid waking up the dead hw) but I had missed converting this over
> >> to the new scheme. Under the new scheme, we have to explicitly call
> >> i915_gem_submit_request() from the submit_request handler to mark the
> >> request as on the hardware. If we don't the request is always pending,
> >> and any waiter will continue to wait indefinitely and hangcheck will not
> >> be able to resolve the lockup.
> >>
> >> References: https://bugs.freedesktop.org/show_bug.cgi?id=98748
> >> Testcase: igt/gem_eio/in-flight
> >> Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
> >> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >> ---
> >>  drivers/gpu/drm/i915/i915_gem.c | 2 ++
> >>  1 file changed, 2 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >> index 7b9f5b99b0f3..7037a8b26903 100644
> >> --- a/drivers/gpu/drm/i915/i915_gem.c
> >> +++ b/drivers/gpu/drm/i915/i915_gem.c
> >> @@ -2762,6 +2762,8 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
> >>
> >>  static void nop_submit_request(struct drm_i915_gem_request *request)
> >>  {
> >> +	i915_gem_request_submit(request);
> >> +	intel_engine_init_global_seqno(request->engine, request->global_seqno);
> >
> > Slight deja vu but not sure - we don't have a way of marking these as 
> > failed so what happens in practice here? This as at the point of no 
> > return, no replay, or allowing the context to recover or something?

This is past the point of no return. We failed to reset the hardware, so
we need to catch all the inflight requests and signal them. Treating
them in flight is nasty as they are a part of a giant web of
dependencies, so I wanted to have them just complete quietly as they
became ready (to be careful that we don't start dependent third party
work before other third party work finishes).

> I have another question. If we do this, why not
> do it (also) when we prepare to start in the postfix of failed
> request.

Different situation entirely, There we are fixing up the hw in such a
way that we want to avoid locks, specifically the struct_mutex, in the
future. (Probably moot now as we probably have the tools to avoid
struct_mutex when handling active requests.) As the hardware is working
we don't want to manually send the interrupt as that just causes a race
with the hardware (i.e. we will treat the request as free whilst it is
still in use by the hw). A very small but possible explosion.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre