[Intel-gfx] [PATCH 2/2] drm/i915/selftests: Improve error detection of reset failure
Chris Wilson
chris at chris-wilson.co.uk
Tue Mar 12 12:43:53 UTC 2019
Quoting Mika Kuoppala (2019-03-12 12:33:37)
> Chris Wilson <chris at chris-wilson.co.uk> writes:
>
> > Use a timedwait to promptly detect if the recovery after reset fails and
> > provide a meaningful debug dump.
> >
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > ---
> > .../gpu/drm/i915/selftests/intel_hangcheck.c | 18 +++++++++++++++++-
> > 1 file changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > index 10658ad05305..b5e35b2a925f 100644
> > --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > @@ -983,7 +983,23 @@ static int __igt_reset_engines(struct drm_i915_private *i915,
> > count++;
> >
> > if (rq) {
> > - i915_request_wait(rq, 0, MAX_SCHEDULE_TIMEOUT);
> > + if (i915_request_wait(rq, 0, HZ / 5) < 0) {
> > + struct drm_printer p =
> > + drm_info_printer(i915->drm.dev);
> > +
> > + pr_err("i915_reset_engine(%s:%s):"
> > + " failed to complete request after reset\n",
> > + engine->name, test_name);
> > + intel_engine_dump(engine, &p,
> > + "%s\n", engine->name);
> > + i915_request_put(rq);
> > +
> > + GEM_TRACE_DUMP();
>
> Would this be useful also in above, where we wait until request
> is running?
Hard to tell until we hit it. The big problem here is the trace is very
noisy; the important bits are in the engine dump which provide the
needle. And the trace may or may not have the information you are
looking for.
Generally, it's the post reset recovery that fails; once we are happy
the reset is ok, the next request is also likely to work. Or it may not,
hence the test!
In the case I was debugging, it wasn't so much a failure in reset
handling, but this test had the precise timing required to fool
unwind_incomplete_requests.
> > + i915_gem_set_wedged(i915);
> > + err = -EIO;
>
> Moving the *rq out from loop and then doing if(rq) i915_request_put(rq);
> out loop after exit, might be easier on error handling, shrug.
Or more convoluted. Didn't feel particularly happy.
-Chris
More information about the Intel-gfx
mailing list