[Intel-gfx] [PATCH] drm/i915: Stop asserting on set-wedged vs nop_submit_request ordering

Chris Wilson chris at chris-wilson.co.uk
Fri Oct 13 14:40:14 UTC 2017


Quoting Mika Kuoppala (2017-10-13 15:23:41)
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > Since the removal of the stop_machine(), it is allowed and expected for
> > the nop_submit_request() and nop_complete_submit_request() to run in
> > parallel to the i915_gem_set_wedged() processing. As such we can no
> > longer assert that i915_gem_set_wedged() has completed inside the
> > stop_machine prior to the individual nop_submit_request execution.
> >
> > Fixes: af7a8ffad9c5 ("drm/i915: Use rcu instead of stop_machine in set_wedged")
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Daniel Vetter <daniel.vetter at intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> 
> from irc:
> 17:12 < danvet> r-b: me
> 
> also,
> 
> Reviewed-by: Mika Kuoppala <mika.kuoppala at linux.intel.com>

Just a note to say that we could move the set_bit(WEDGED) first (followed by
an smp_mb__after_atomic()) so that a concurrent nop_submit_request
would see the right bit. However, moving that bit requires a bit more
thought with respect to all the users and what it means for
i915_gem_set_wedged(). It is simpler just to remove the incorrect BUG_ON
for now and address this again in the future.
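
To illustrate the ordering, here is a rough userspace analogue using C11
atomics rather than the kernel primitives. It is only a sketch and not the
driver code: reset_flags, WEDGED_BIT, submit_request and the helper names are
hypothetical stand-ins, and the sequentially consistent atomic_fetch_or()
plays the role that set_bit() followed by smp_mb__after_atomic() would play
in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the wedged bit in the driver's reset flags. */
#define WEDGED_BIT (1u << 0)

struct request {
	bool completed_with_error;
};

typedef void (*submit_fn)(struct request *rq);

static void normal_submit(struct request *rq)
{
	rq->completed_with_error = false;
}

/* Hypothetical stand-ins for the flags word and the engine's submit hook. */
static atomic_uint reset_flags;
static _Atomic(submit_fn) submit_request = normal_submit;

static void nop_submit(struct request *rq)
{
	/*
	 * Because the bit is published before the hook is swapped, any
	 * caller that observed nop_submit through submit_request also
	 * observes the bit, so this check holds in this sketch.
	 */
	assert(atomic_load(&reset_flags) & WEDGED_BIT);
	rq->completed_with_error = true;
}

static void set_wedged(void)
{
	/* Step 1: publish the bit (seq_cst read-modify-write, full ordering). */
	atomic_fetch_or(&reset_flags, WEDGED_BIT);
	/* Step 2: only then redirect submission to the nop handler. */
	atomic_store(&submit_request, nop_submit);
}

static void submit(struct request *rq)
{
	submit_fn fn = atomic_load(&submit_request);

	fn(rq);
}
```

If the two steps in set_wedged() were swapped, a concurrent submit() could
run nop_submit() before the bit became visible, which is the situation the
removed BUG_ON tripped over.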

The other challenge is hitting this race in testing. We have to
coordinate the requests becoming ready in parallel with the failed
reset. Something like queueing 100,000 requests and signaling them at
intervals of a couple of microseconds would do the trick. That in itself
is not too difficult; the remaining challenge will be in coordinating
that with the reset, so using hangcheck is out the window and we must
trigger the EIO directly. Sounds easy!

Thanks for the review,
-Chris

